Downloading and managing data with MADAS

For our tutorial, we will use data from NOMAD. NOMAD is a free and FAIR online database of materials-science data, including results from both theory and experiments. As such, it is a rich source of data for analytics and machine learning.

NOMAD follows a user-centric approach to data management: users upload raw data, which is then transformed and archived on the NOMAD platform. It therefore supports many different ways of representing data, including user-defined schemas. The resulting rich metadata is valuable for data analytics, because it makes it possible to track the full provenance of the data, to find and understand outliers, and to produce trustworthy results. However, the verbosity of the schemas adds significant complexity, which makes it hard to find the information relevant to a given application. Furthermore, because the NOMAD data schema is flexible, the central database can contain different versions of the same schema, depending on when the data was processed. Data provenance is preserved in those cases, but the data may need to be processed before an application can use it.

MADAS is a framework that connects to the NOMAD API, downloads data, stores it in a local database, applies transformations to the data, and extracts the transformed data for downstream applications.

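In this tutorial you are going to learn how to:

- use an API to download data
- store materials data in a local database
- derive properties in the database
- retrieve properties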

Let’s get started!

Use an API to download data

The first step in downloading data is to select a suitable dataset. Within MADAS, we can query the NOMAD database for suitable entries by connecting to the NOMAD API, submitting a query, and downloading the matching data.
If you want to learn more about how NOMAD makes this possible, you can find more information in the NOMAD documentation.

First we import the API interface for NOMAD:

[1]:
from madas.apis import NOMAD_API

You can find the full documentation of the NOMAD API class here.

The most important functions are get_calculations_by_search and get_calculation. The former queries the NOMAD database and downloads all matching entries as Material objects (see also the documentation here). Each Material is uniquely identified by its entry id.
If you already know the entry ids of the NOMAD entries that you are interested in, you can use get_calculation to retrieve a single entry.
To demonstrate how this works with MADAS, we have selected a high-throughput dataset [1] from NOMAD that was computed with density functional theory using the hybrid HSE06 exchange-correlation functional. To limit the amount of retrieved data, we restrict our search to cubic structures.
To simplify writing NOMAD queries, you can create them in the GUI (see also this section of the NOMAD documentation) and copy the query by clicking on the <> symbol below the search bar.
[2]:
# Create an API object
api = NOMAD_API()
[3]:
# define the NOMAD query
query = {
    "results.material.symmetry.crystal_system:any": [
      "cubic"
    ],
    "results.method.simulation.dft.xc_functional_type:any": [
      "hybrid"
    ],
    "datasets.dataset_name:any": [
      "Materials Database from All-electron Hybrid Functional DFT Calculations"
    ],
    "results.properties.available_properties:all": [
      "dos_electronic"
    ]
  }
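
The query is an ordinary Python dictionary, so it can also be assembled programmatically. The following is a minimal sketch and not part of MADAS or NOMAD: the helper build_query is a hypothetical name introduced here for illustration. It assumes, following the pattern of the query above, that each key is a quantity path with an :any or :all suffix and each value is a list of accepted values.

```python
def build_query(any_of=None, all_of=None):
    """Assemble a NOMAD-style query dictionary.

    Hypothetical helper (not part of MADAS or NOMAD).
    any_of / all_of map quantity paths to a single value or a list of values.
    """
    query = {}
    for suffix, conditions in (("any", any_of or {}), ("all", all_of or {})):
        for quantity, values in conditions.items():
            # Accept a single string as shorthand for a one-element list
            if isinstance(values, str):
                values = [values]
            query[f"{quantity}:{suffix}"] = list(values)
    return query


query_sketch = build_query(
    any_of={
        "results.material.symmetry.crystal_system": "cubic",
        "results.method.simulation.dft.xc_functional_type": "hybrid",
    },
    all_of={"results.properties.available_properties": ["dos_electronic"]},
)
print(query_sketch)
```

Keeping the suffix handling in one place makes it harder to mistype a key when a query is varied in a loop, for example over several crystal systems.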

We can test if our query worked by downloading only a single calculation at first:

[4]:
materials = api.get_calculations_by_search(query, max_entries=1)
Found 10 entries
Possibly not all entries discovered due to max_entries limit
Downloading 1 entries
Finished download.

Let’s investigate what we got back from the API:

[5]:
example_material = materials[0]
print(example_material)
Material(mid = 0LFocy2yAB3pqdEz41liXtLWNGvR, formula = Ca, data = {'archive'}, properties = set())
Printing the Material reveals some of its attributes:
- the mid uniquely identifies the material. It is obtained from the NOMAD entry id.
- the formula is the reduced formula and can be used as a human-readable identifier when working with the data.
- the data attribute contains all data that was downloaded from NOMAD.
- the properties attribute is still empty; it will hold properties that are derived from the downloaded data.

We can verify that we received the right data by creating a link to the entry on the NOMAD website from the entry id:

[6]:
print(f"https://www.nomad-lab.eu/entry/id/{example_material.mid}")
https://www.nomad-lab.eu/entry/id/0LFocy2yAB3pqdEz41liXtLWNGvR

We can visualize the unit cell of the material using a utility function:

[7]:
from madas.plotting import plot_material
[8]:
plot_material(example_material,
              repeat=[1,1,1], # repetitions of the unit cell
              show_unit_cell=2) # show the whole unit cell in the plot
[Figure: unit cell of the example material]

MADAS uses the Atomic Simulation Environment (ASE) for representing atomic structures (and much more). Each Material exposes its structure as an ase.Atoms object through the atoms attribute, which gives direct access to the many convenient functions of the ASE framework.

[9]:
print(type(example_material.atoms))
<class 'ase.atoms.Atoms'>

Let’s inspect the data attribute. From the API, we have received the archive as a Python dictionary:

[10]:
print(example_material.data.keys())
dict_keys(['archive'])

Within the archive, we find the information NOMAD has stored about this entry:

[11]:
print(example_material.data['archive'].keys())
dict_keys(['processing_logs', 'run', 'workflow2', 'metadata', 'results', 'm_ref_archives'])

The material properties are currently empty:

[12]:
print(example_material.properties)
{}
MADAS provides a convenient way of accessing the information stored in the data and properties attributes of a Material. As an example, we can find the reduced formula in the NOMAD archive.
This information can be extracted from the Material object by specifying the path as follows:
[13]:
example_material.get_data_by_path('archive/run/0/system/0/chemical_composition_reduced')
[13]:
'Ca'

You can use a different path to extract any information from the data of a Material. Similarly, for properties, Material.get_property_by_path() can be used.
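
To illustrate what such a path means, here is a minimal sketch of path resolution in a nested structure of dictionaries and lists. It is not the MADAS implementation; the function follow_path and the toy data dictionary are assumptions made for illustration. Each path segment is used as a dictionary key, or as an integer index if the current level is a list.

```python
def follow_path(data, path, sep="/"):
    """Walk a nested dict/list structure along a separator-delimited path."""
    current = data
    for segment in path.strip(sep).split(sep):
        if isinstance(current, list):
            current = current[int(segment)]  # list levels are addressed by index
        else:
            current = current[segment]       # dict levels are addressed by key
    return current


# Toy stand-in for a downloaded archive (illustrative content only)
data = {"archive": {"run": [{"system": [{"chemical_composition_reduced": "Ca"}]}]}}

print(follow_path(data, "archive/run/0/system/0/chemical_composition_reduced"))
```

A missing key or an out-of-range index raises KeyError or IndexError in this sketch, which is often the desired behavior when checking whether an entry follows the expected schema.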

More information on how to work with the NOMAD API most efficiently can be found in the tutorial about using the NOMAD API.

Now, running get_calculations_by_search will return a list of Material objects. To store these on our machine, we will make use of a database.

Store materials data in a local database

First, import the MaterialsDatabase class:

[14]:
from madas import MaterialsDatabase

and create a MaterialsDatabase object. Here we specify the filename, which tells the database where to store the information on our local machine. Furthermore, we pass it the NOMAD_API object. Note that the NOMAD API is also the default API of the MaterialsDatabase.

[15]:
db = MaterialsDatabase(filename='materials_database.db', api = api, log_mode='silent')

Initially, the database is empty:

[16]:
print(len(db))
0
The function fill_database calls the get_calculations_by_search function of the NOMAD_API. We therefore pass it the query, and it automatically downloads all entries matching this query and stores them in a local file.
This may take a few minutes, depending on your internet connection.
[17]:
db.fill_database(query)
The actual file handling is done by a Backend class, which can wrap any type of storage.
For more information, see the Backend documentation page and the tutorial on creating a custom Backend.

During the processing of the query, the MaterialsDatabase writes log entries. These can be very useful for debugging and for finding failing entries when large or long-running queries are processed. You can set where these messages are written (to the terminal and/or to a file) with the log_mode argument when creating the MaterialsDatabase object. The location of the log file is available through the MaterialsDatabase.log_file_path attribute.

[18]:
with open(db.log_file_path, 'r') as f_: # open the log file for reading
    logfile = f_.readlines() # read the log file line by line

print(''.join(logfile[:5])) # print the first 5 log entries
print('...')
print(''.join(logfile[-5:])) # print the last 5 log entries
2026-04-09 18:32:34,716 - materials_database_log - INFO - Retrieving data...
2026-04-09 18:32:40,698 - materials_database_api - INFO - Found 191 entries
2026-04-09 18:32:40,698 - materials_database_api - INFO - Download data for 191 entries
2026-04-09 18:33:03,804 - materials_database_api - INFO - Finished download.
2026-04-09 18:33:03,805 - materials_database_log - INFO - Got data for 191 entries.

...
2026-04-09 18:33:12,349 - materials_database_api - INFO - Wrote material with id qo3iMLZM-TX3FXLMTOJoW4HCJojV.
2026-04-09 18:33:12,381 - materials_database_api - INFO - Wrote material with id xyQyh8qKd5KUJx75KYtLup9sAvP6.
2026-04-09 18:33:12,414 - materials_database_api - INFO - Wrote material with id AL_NDle5ybphhGeeterPG9tKxshp.
2026-04-09 18:33:12,454 - materials_database_api - INFO - Wrote material with id V4sjmkEC0kBNBsSTpFvzMCHuaMsS.
2026-04-09 18:33:12,488 - materials_database_api - INFO - Wrote material with id UUO7gDxGEe2jLENT_yxR1Ygy8c7A.
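
Because every log line shown here follows the same "timestamp - logger - level - message" pattern, the log can also be filtered programmatically, for example to list warnings or errors after a long download. The following is a sketch under the assumption that all lines follow the format of the excerpt; the function name parse_log_line is hypothetical, and the sample lines are copied from the log output.

```python
import re

# timestamp - logger name - level - message, as in the log excerpt
LOG_PATTERN = re.compile(
    r"^(?P<time>\S+ \S+) - (?P<logger>\S+) - (?P<level>[A-Z]+) - (?P<message>.*)$"
)


def parse_log_line(line):
    """Split one log line into its fields; return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


sample_lines = [
    "2026-04-09 18:32:40,698 - materials_database_api - INFO - Found 191 entries",
    "2026-04-09 18:33:12,349 - materials_database_api - INFO - Wrote material with id qo3iMLZM-TX3FXLMTOJoW4HCJojV.",
    "",  # blank lines are skipped
]

records = [r for r in map(parse_log_line, sample_lines) if r is not None]
# Keep only entries that are not plain INFO messages
problems = [r for r in records if r["level"] != "INFO"]
print(len(records), len(problems))
```

With a real log file, the lines read via readlines() can be passed in place of sample_lines.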

After downloading, the database contains all entries that were found:

[19]:
len(db)
[19]:
191

We can access the entries of the database in the order in which they were added by using an integer index:

[20]:
db[0]
[20]:
Material(mid = lYczghNfQInQhaVu7F4TcyaBFPkg, formula = Cd2In4O8, data = {'archive'}, properties = set())

Alternatively, we can use the mid of the Material object we want to retrieve:

[21]:
db[db[0].mid]
[21]:
Material(mid = lYczghNfQInQhaVu7F4TcyaBFPkg, formula = Cd2In4O8, data = {'archive'}, properties = set())
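
Supporting lookup both by position and by identifier is a common Python pattern. The sketch below shows one way to implement it with __getitem__; it is an illustration only and does not reflect how MaterialsDatabase is implemented. The class name TinyStore, the ids, and the entries are hypothetical placeholders.

```python
class TinyStore:
    """Minimal container with lookup by insertion index or by string id."""

    def __init__(self):
        self._entries = {}  # dicts preserve insertion order (Python 3.7+)

    def add(self, mid, entry):
        self._entries[mid] = entry

    def __len__(self):
        return len(self._entries)

    def __getitem__(self, key):
        if isinstance(key, int):
            # Integer: position in insertion order
            return list(self._entries.values())[key]
        # Anything else is treated as an id
        return self._entries[key]


store = TinyStore()
store.add("id-a", {"formula": "Ca"})    # placeholder ids and entries
store.add("id-b", {"formula": "Cu3N"})

print(store[0] is store["id-a"])
```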

When we inspect the materials in our database, we see that they contain the full archive. However, we want to make the data more easily accessible. To do so, we will see next how to manipulate data in the database.

Derive properties in the database

To access properties in the database, we can simply iterate over its elements. Because this can take some time, especially for larger databases, we will use tqdm to show a progress bar. MADAS provides a wrapper for tqdm that automatically selects the correct layout of the progress bar, both in a notebook and on the command line.

[22]:
from madas.utils import tqdm

Then, we can access the information in each Material. As an example, we will compute the volume of the unit cell using the ase.Atoms objects. Note that this information is also contained in the data attribute, as NOMAD also computes the unit cell volume.

[23]:
volumes = [] # create an empty list for our volumes
for entry in tqdm(db): # for every entry in the database
    volume = entry.atoms.get_volume() # compute the volume
    volumes.append(volume) # append it to the volume list
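
For reference, the volume of a unit cell spanned by the lattice vectors a, b, and c is the absolute value of the scalar triple product, V = |a · (b × c)|. The following pure-Python sketch computes this quantity; the function name cell_volume and the lattice constant of 4.0 Å are illustrative assumptions, not values from the dataset.

```python
def cell_volume(a, b, c):
    """Volume of the parallelepiped spanned by three lattice vectors."""
    # Cross product b x c
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    # Scalar triple product a . (b x c); abs() removes the handedness sign
    return abs(sum(a_i * x_i for a_i, x_i in zip(a, cross)))


# Simple cubic cell with an assumed lattice constant of 4.0 Å
print(cell_volume((4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 4.0)))
```

The same function works for non-orthogonal cells, such as the primitive cell of a face-centered cubic lattice, whose volume is a quarter of the conventional cubic cell.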

With this information alone, we can already create a histogram that describes our data. We use matplotlib for the visualization:

[24]:
import matplotlib.pyplot as plt

plt.figure(figsize=(7,4))
plt.hist(volumes, bins=50)
plt.xlabel('Volume [ų]')
plt.ylabel('Count')
plt.show()
[Figure: histogram of the unit cell volumes (x-axis: Volume [ų], y-axis: Count)]

We can see a distribution of volumes with a peak around 100 ų and few entries with higher volumes.

Let’s find some more useful information. For this, we will explore what is contained in the NOMAD archives that we have downloaded.
To do so, MADAS provides some useful utilities:
[25]:
from madas.utils import print_key_paths, resolve_nested_dict
The first function, print_key_paths, prints all paths in a nested dictionary that lead to a given key. This helps to discover where a specific piece of information is located in the data.
We pick a specific example for illustration purposes:
[26]:
example_material = db["2KE2kVq3XhM1eU67AsxN890ohcop"]

Then, we search for the band gap in the data of the example material:

[27]:
print_key_paths('band_gap', example_material.data)
/archive/run/0/calculation/0/band_gap
/archive/run/0/calculation/0/dos_electronic/0/band_gap
/archive/run/0/calculation/0/band_structure_electronic/0/band_gap
/archive/results/properties/electronic/band_gap
/archive/results/properties/electronic/dos_electronic_new/0/data/0/band_gap
/archive/results/properties/electronic/band_structure_electronic/0/band_gap
We see several paths:
- The first three paths point to the run section of the archive, which is where data that the NOMAD parsers extract from calculations is stored.
- The last three paths point to the results section, which is used to aggregate information for the NOMAD GUI.
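
A search like the one performed by print_key_paths can be sketched as a recursive walk over the nested structure. This is an illustration, not the MADAS implementation; the function find_key_paths and the toy dictionary are hypothetical, and the values in it are placeholders.

```python
def find_key_paths(key, data, prefix=""):
    """Return all paths in a nested dict/list structure that end in `key`."""
    paths = []
    if isinstance(data, dict):
        for name, value in data.items():
            path = f"{prefix}/{name}"
            if name == key:
                paths.append(path)
            # Keep descending: the key may occur again at a deeper level
            paths.extend(find_key_paths(key, value, path))
    elif isinstance(data, list):
        for index, value in enumerate(data):
            paths.extend(find_key_paths(key, value, f"{prefix}/{index}"))
    return paths


# Toy structure mimicking the layout of an archive (values are placeholders)
toy = {
    "archive": {
        "run": [{"calculation": [{"band_gap": [{"value": 1.0}]}]}],
        "results": {"properties": {"electronic": {"band_gap": [{"value": 1.0}]}}},
    }
}

for path in find_key_paths("band_gap", toy):
    print(path)
```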

We can use that aggregated data. To do so, we first inspect its contents with resolve_nested_dict, which follows the specified path and returns the data at the end:

[28]:
resolve_nested_dict(example_material.data, 'archive/results/properties/electronic/band_gap')
[28]:
[{'index': 0,
  'value': 2.307134352960001e-19,
  'energy_highest_occupied': -1.07345834478e-18,
  'energy_lowest_unoccupied': -8.427449094839999e-19,
  'provenance': {'label': 'dos',
   'dos': '#/run/0/calculation/0/dos_electronic/0/total/0'}},
 {'index': 0,
  'value': 2.3181573282019203e-19,
  'type': 'direct',
  'energy_highest_occupied': -1.0737309038646226e-18,
  'energy_lowest_unoccupied': -8.419151710444306e-19,
  'provenance': {'label': 'band_structure',
   'band_structure': '#/run/0/calculation/0/band_structure_electronic/0/segment/0'}}]
We can see that NOMAD stores the band gap value together with the energies of the highest occupied and the lowest unoccupied electronic states. Furthermore, there are two values of the gap: one extracted from the DOS and one from the band structure. We choose the value from the DOS, because we also use the DOS in later tutorials; this keeps the data consistent. Finally, we notice that the values are given in joules, so we convert them to a more convenient unit.
First, we extract the gap data, a list of dictionaries:
[29]:
gap_data = resolve_nested_dict(example_material.data, 'archive/results/properties/electronic/band_gap')

Then we compute the gap and transform the unit to eV:

[30]:
from scipy.constants import electron_volt

gap = None
for gap_info in gap_data:
    if gap_info['provenance']['label']=='dos':
        gap = gap_info['value']/ electron_volt
print(gap)
1.4400000000000006

We can write a small function to apply the same transformation to all entries:

[31]:
def get_gap(material) -> float:
    gap_data = resolve_nested_dict(material.data, 'archive/results/properties/electronic/band_gap')
    gap = None
    for gap_info in gap_data:
        if gap_info['provenance']['label']=='dos':
            gap = gap_info['value']/ electron_volt
    if gap is None:
        raise ValueError('Could not find gap from DOS')
    return gap

And extract these values:

[32]:
band_gaps = []
for entry in tqdm(db):
    band_gaps.append(get_gap(entry))

We can plot the data:

[33]:
plt.figure(figsize=(7,4))
plt.hist(band_gaps, bins=50)
plt.xlabel('Band gap [eV]')
plt.ylabel('Count')
plt.show()
[Figure: histogram of the band gaps (x-axis: Band gap [eV], y-axis: Count)]

Here we see a large peak at small or zero band gap, and a long, flat tail of band gaps up to about 10 eV.

Now that we have extracted both volumes and band gaps, we can plot them together:

[34]:
plt.figure(figsize=(7,7))
plt.scatter(volumes, band_gaps)
plt.xlabel('Volume [ų]')
plt.ylabel('Band gap [eV]')
plt.show()
[Figure: scatter plot of band gap [eV] against volume [ų]]

There does not seem to be much correlation between these quantities.
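
A visual impression like this can be backed up with a number, for example the Pearson correlation coefficient. The sketch below implements it with the standard library only; the function name pearson is introduced here for illustration, and the input lists are made-up toy values, not the tutorial data. With the real data, the volumes and band_gaps lists would be passed instead.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy values for illustration only
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 4.0, 6.0, 8.0]  # perfectly linear in toy_x
print(pearson(toy_x, toy_y))
```

Values close to +1 or -1 indicate a strong linear relationship, values close to 0 indicate little linear correlation. Note that this sketch fails with a division by zero if one of the sequences is constant.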

We can derive many useful quantities from the data in our database. In some cases, however, generating these quantities takes a long time, or we want to create a quantity first and process it further later on without having to carry the necessary code along. For these cases, the data in the database can be updated using functions. Here, we will add the two quantities computed above to our database. At the moment, the entries only contain the data we downloaded:

[35]:
print(db[0])
Material(mid = lYczghNfQInQhaVu7F4TcyaBFPkg, formula = Cd2In4O8, data = {'archive'}, properties = set())

We can use the function update_derived_properties of the database to update our entries. It takes a list of property names and a list of functions that are used to derive these properties as inputs.

[36]:
db.update_derived_properties(
    [
        'band_gap',
        'volume'
    ],
    [
        get_gap, # we defined this function above
        lambda entry: entry.atoms.get_volume() # this is a lambda expression
    ]
)

For the band gap, we use the function defined above. For the volume, which requires very few instructions, we use a Python lambda expression, i.e., an unnamed function that is defined right where it is used.
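
Conceptually, such an update pairs each property name with the function that computes it and applies that function to every entry. The following sketch shows the idea on plain dictionaries; it is not how MADAS implements update_derived_properties, and the function apply_derived_properties and the toy entries are hypothetical.

```python
def apply_derived_properties(entries, names, functions):
    """Store functions[i](entry) under names[i] for every entry."""
    if len(names) != len(functions):
        raise ValueError("Need exactly one function per property name")
    for entry in entries:
        for name, function in zip(names, functions):
            entry["properties"][name] = function(entry)


# Toy entries: cubic cells described only by an edge length (placeholder values)
entries = [
    {"data": {"edge": 2.0}, "properties": {}},
    {"data": {"edge": 3.0}, "properties": {}},
]

apply_derived_properties(
    entries,
    ["volume", "edge_doubled"],
    [
        lambda entry: entry["data"]["edge"] ** 3,  # lambda for a one-liner
        lambda entry: 2 * entry["data"]["edge"],
    ],
)
print(entries[0]["properties"])
```

Passing names and functions as two parallel lists keeps the call compact, at the cost of having to keep both lists in the same order.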

Now, the properties of our entries have been updated:

[37]:
db[0]
[37]:
Material(mid = lYczghNfQInQhaVu7F4TcyaBFPkg, formula = Cd2In4O8, data = {'archive'}, properties = {'volume', 'band_gap'})

For many applications, the data should be available as lists, arrays, or Pandas dataframes. We can extract these from a MADAS MaterialsDatabase.

Retrieve properties

Properties can be retrieved by name:

[38]:
band_gaps = db.get_properties('band_gap')

For dataframes, we can use another function:

[39]:
df = db.get_property_dataframe(['volume', 'band_gap'])
[40]:
print(type(df))
<class 'pandas.core.frame.DataFrame'>
[41]:
df.describe()
[41]:
            volume    band_gap
count   191.000000  191.000000
mean    217.414811    1.255602
std     246.428808    2.051329
min       7.709944    0.000000
25%      64.001745    0.000000
50%     114.783390    0.000000
75%     269.220632    2.195000
max    1240.477268    9.540000

We can also pass a property or data path to the function to obtain the property directly:

[42]:
db.get_property_dataframe(['volume', 'band_gap', 'archive/results/material/chemical_formula_reduced'])
[42]:
                                  volume  band_gap  archive/results/material/chemical_formula_reduced
lYczghNfQInQhaVu7F4TcyaBFPkg  195.532456      2.41  CdIn2O4
rZF4gjJ48EGz2BBJuuCLJFzkPWVD   53.789343      1.62  Cu3N
Og3YctYdzQznelLm078NKakyvkNO   44.070003       0.0  CeS
luGPZXx92gJIG_ZdF-kw7bY00knV  178.910746      1.61  Cs3Sb
J5MoOxPpWnOl42x2aQJ2Qn_TBVeX   92.121642       0.0  Ca2H6Ir
...                                  ...       ...  ...
qo3iMLZM-TX3FXLMTOJoW4HCJojV   95.321992      6.86  BaCl2
xyQyh8qKd5KUJx75KYtLup9sAvP6  263.534744       0.0  Be13Ca
AL_NDle5ybphhGeeterPG9tKxshp   29.800045       0.0  CeO
V4sjmkEC0kBNBsSTpFvzMCHuaMsS   87.985549       0.0  Cd3In
UUO7gDxGEe2jLENT_yxR1Ygy8c7A   77.888396       0.0  Ca2Ir

191 rows × 3 columns

References

[1] Akhil S. Nair, Lucas Foppa, and Matthias Scheffler
Materials Database from All-electron Hybrid Functional DFT Calculations.