Some new MP entries have poor ICSD provenance


Some of the new Materials Project entries that were generated from previously missing ICSD entries are missing light atoms.

For example, consider NaN, mp-1080032, which was built from sodium amide, NaNH2, from ICSD #25633. However, the ICSD entry presumably does not have hydrogen positions in the CIF.

An important data provenance check is that the composition of the ICSD compound should match the composition of the generated POSCAR. If the compositions don’t match (for example, ICSD_NaNH2 and MP_NaN), then these entries should definitely not be included on the MP database.


Thanks for the tip, we’ll check out this issue.


An important data provenance check is that the composition of the ICSD compound should match the composition of the generated POSCAR.

Agreed, we could do this. In general we trust the ICSD, and there’s a limit to what we can do if we get bad CIFs, but it would be good to avoid calculating obviously bad materials if possible.

If the compositions don’t match (for example, ICSD_NaNH2 and MP_NaN), then these entries should definitely not be included on the MP database.

This is actually ok. The Materials Project is designed to be resilient against ‘bad’ structures. There are many unstable or metastable structures that are calculated, and the reported ‘energy above hull’ gives a measure of this stability. Ideally, we wouldn’t calculate obviously unphysical materials but sometimes we simply don’t know before performing the calculation whether the material is reasonable or not; if it’s not reasonable, it should have a large energy above hull reported. If you do a query on MP and order by energy above hull you’ll find a few of these very unstable materials.

We still retain these materials in our database because it can save other people from having to perform these calculations if we already know it results in an unstable structure, so this information is useful to report.

In this case, our hypothetical NaN mp-1080032 has an energy above hull of 0.174 eV/atom, which (though high) is actually not too bad, but if you look at the structure this is because there’s been significant structural relaxation compared to the structure reported from the ICSD.

We will have to discuss internally how to address this, however, because the reported tag “sodium amide” is incorrect and could be misleading. Trying to identify these faulty tags would be a good idea. Thanks for the report!


Also note this material did have a warning shown:

Warnings like this are included to signal possibly bad inputs, since a large lattice parameter change during relaxation usually indicates there were issues with the input structure.


Dear Matthew,

I have to disagree with you on your last comment. If you look at the NaN structure, you can see why it is unreasonable to include in the MP. The initial structure is based on an ionic lattice of (Na)+(NH2)-. With the hydrogens removed, the structure relaxes into a diazenide with ionic (N=N)- ions, which is an unphysical representation of the ICSD entry. Furthermore, in my paper on nitrogen-rich nitrides, I show in the supplemental information that GGA systematically overstabilizes pernitrides/diazenides. Because there is no +U correction for nitrides on the Materials Project, structures like this have artificially low formation energies and small E_abv_hulls. So we should really get rid of them at the provenance level rather than through an ‘energy’ screening. After all, in my Metastability paper (Sci. Adv. 2016), one of the major takeaways is that energy above the hull alone is a poor metric for synthesizability.

There is also another problem: this material technically has ICSD provenance, but it relaxed unphysically. Suppose someone were to analyze bond lengths across all ICSD compounds on the Materials Project. They would get spurious data suggesting that there are compounds with short N-N or short O-O bonds, when in reality this is just bad data. In my opinion, this kind of data really detracts from the brand of the MP, and we should be careful to keep it out.
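To make the bond-length concern concrete, here is a minimal sketch of how one might screen a relaxed cell for suspiciously short N-N (or O-O) contacts. It is plain Python, not a pymatgen routine: the function name `min_pair_distance`, the toy cubic cell, and the 1.5 Å cutoff are all illustrative assumptions, and searching only the 27 nearest periodic images is adequate only for cells that are not strongly skewed.

```python
import itertools
import math


def min_pair_distance(lattice, frac_coords, species, el_a, el_b):
    """Shortest el_a-el_b distance in a periodic cell (hypothetical helper).

    lattice: three lattice vectors (rows) in angstroms.
    frac_coords: fractional coordinates, one triple per site.
    species: element symbol per site.
    Only the 27 neighbouring periodic images are searched (an assumption).
    """
    shifts = list(itertools.product((-1, 0, 1), repeat=3))
    wanted = {el_a, el_b}
    best = math.inf
    pairs = itertools.combinations_with_replacement(range(len(species)), 2)
    for i, j in pairs:
        if {species[i], species[j]} != wanted:
            continue
        for shift in shifts:
            if i == j and shift == (0, 0, 0):
                continue  # skip the zero distance of a site to itself
            # Fractional separation, including the periodic image shift
            d_frac = [frac_coords[j][k] + shift[k] - frac_coords[i][k]
                      for k in range(3)]
            # Convert to Cartesian coordinates using the lattice rows
            d_cart = [sum(d_frac[r] * lattice[r][k] for r in range(3))
                      for k in range(3)]
            best = min(best, math.sqrt(sum(c * c for c in d_cart)))
    return best


# Toy cell, purely illustrative (not the real mp-1080032 geometry)
lattice = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
coords = [[0.0, 0.0, 0.0], [0.24, 0.0, 0.0], [0.5, 0.5, 0.5]]
species = ["N", "N", "Na"]

CUTOFF = 1.5  # angstroms; an assumed threshold for flagging N-N contacts
d_nn = min_pair_distance(lattice, coords, species, "N", "N")
flagged = d_nn < CUTOFF
```

A screen like this would only flag entries for manual inspection. It cannot by itself tell a genuine pernitride from an artifact of missing hydrogens.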

Here are a few other egregious examples:
Li2N (instead of Li2NH), mp-1062345
AlO2 (instead of AlOOH), mp-1062345

I think the example of AlO2 (which even has negative formation energy) is really the kind of stuff we need to diligently curate.


I think we agree that ideally the ICSD provenance should be removed (or at least a functional warning about some failing element of that provenance should be supplied), but I’m unconvinced that the structure and corresponding data should be removed from MP entirely. Presumably, other structures with these chemical moieties exist within the database, and if we’re not removing them all, it seems odd to remove one simply because the ICSD provenance is bad. A different way to put this is that an MP user could submit that structure through the crystal toolkit, and it wouldn’t have ICSD provenance, and it would still be integrated into the database.


I agree with Joey. The tagging of provenance is what we need to fix. Otherwise, this is what DFT defines as the structure-property relationship for this set of atoms and we have to respect that.

@wenhaosun, can you submit a PR to pymatgen to do some extra checking when parsing CIFs? Since you’ve looked into this, I imagine you have some ideas on how to perform in-place validation for them.


I think removing the ICSD provenance is a great starting point. @shyamd, it should be relatively straightforward to implement a simple composition check in the CifParser to see whether the nominal composition of the CIF file is the same as the structural composition. I can do this, although probably not anytime soon.
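As a rough illustration of what such a composition check could look like, here is a minimal sketch in plain Python. It is not the pymatgen implementation: the helper names (`parse_formula`, `composition_mismatch`) are hypothetical, the nominal formula is assumed to come from something like the CIF `_chemical_formula_sum` field, and the parser only handles flat formulas without parentheses.

```python
import re


def parse_formula(formula):
    """Parse a flat formula such as 'Na1 N1 H2' or 'NaNH2' into {element: amount}.

    Hypothetical helper: parentheses and charges are not supported.
    """
    counts = {}
    for element, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[element] = counts.get(element, 0.0) + (float(amount) if amount else 1.0)
    return counts


def atomic_fractions(counts):
    """Normalize element amounts so they sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty composition")
    return {el: n / total for el, n in counts.items()}


def composition_mismatch(nominal_formula, site_elements, tol=0.01):
    """Compare a nominal formula against the elements found on the atomic sites.

    nominal_formula: formula string stated in the file (assumed input).
    site_elements: one element symbol per atomic site in the parsed structure.
    tol: allowed difference in atomic fraction (an assumed default).
    """
    nominal = atomic_fractions(parse_formula(nominal_formula))
    site_counts = {}
    for el in site_elements:
        site_counts[el] = site_counts.get(el, 0.0) + 1.0
    structural = atomic_fractions(site_counts)

    missing = sorted(set(nominal) - set(structural))  # in formula, not on sites
    extra = sorted(set(structural) - set(nominal))    # on sites, not in formula
    differs = sorted(el for el in set(nominal) & set(structural)
                     if abs(nominal[el] - structural[el]) > tol)
    return {"missing": missing, "extra": extra, "differs": differs,
            "ok": not (missing or extra or differs)}


# The sodium amide case from this thread: formula NaNH2, but only Na and N sites
report = composition_mismatch("Na1 N1 H2", ["Na", "N"])
```

For the NaNH2 example, the report lists H as missing, so the entry could be flagged before any calculation is run. Comparing atomic fractions rather than raw counts means the check does not depend on how many formula units the cell contains.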

However, I think the question of whether structures such as NaN or AlO2 ought to be on the MP at all should remain open for discussion. Unlike a structure predicted by a DMSP ionic substitution algorithm, which at least has some chemically motivated provenance, a structure like NaN or AlO2 comes from truly unphysical origins. If we include these structures, you could in principle pollute the MP with millions of structures of arbitrary compositions and geometries, all of which would have true DFT structure-property-energy relationships (and some might even be energetically competitive), but this would likely confuse naive users of the MP. I think the MP needs to take responsibility as a gatekeeper and make sure that the structures on the MP have at least a reasonable physical origin.

(As a side note, from my conversations with users of the MP, I find that many people are confused by the hypothetical MP structures. For example, consider TiO2, where <10 of the 34 entries have actually been observed, yet this fact is not immediately apparent from the MaterialsExplorer window. I think it would be highly beneficial if an ‘In ICSD’ column were added to the default results display. For the hypothetical structures, it would also be valuable if the provenance were specified more explicitly: what structure was this phase substituted from, what was the substitution probability, etc.)