Serious problem: incomplete calculations labeled as final

It is well known that the final energy of a VASP relaxation run is unreliable when the cell changes, because the plane-wave basis is set up for the initial cell and is not updated as the cell shape and volume change during relaxation. The Materials Project has been so popular and relied upon precisely because all of its data was supposed to strictly follow best practices, including multi-task runs (at minimum a relaxation + static sequence) as well as strict error-checking procedures. It looks like there are now entries in MP that do NOT have a static run, yet are labeled as properly calculated materials and used for further analysis. An example is mp-1079094, which has a single GGA relaxation task (involving a HUGE cell shape change!) yet is selected as a ground state.

The issue appears to be very serious. In particular, mp-1079094 shows up as the ground state for Sb2Te3, likely because of a ~1 eV/atom (!!!) error in the final energy. Specifically, mp-1079094 is a rare high-pressure phase (a single ICSD entry), which, however, is calculated to be >1 eV/atom lower in energy than the common phase (mp-1201, a few dozen ICSD entries). Plotting a phase diagram along GeTe - Sb2Te3 makes it very clear that mp-1079094 is a problematic entry: there is a clear line connecting mp-938 (GeTe) and mp-1201 (common Sb2Te3), and the energies of most Ge-Sb-Te polymorphs lie within 0.2 eV/atom of this line (with a few higher-energy structures, forming a very typical distribution), yet there is a single entry ~1 eV/atom below that line, and that entry is mp-1079094, which we know must be wrong because it uses the energy from a relaxation run…
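(For anyone who wants to reproduce this check, here is roughly how I looked at it; a minimal sketch using pymatgen’s MPRester and PhaseDiagram, with a placeholder API key:)

```python
from pymatgen.ext.matproj import MPRester
from pymatgen.analysis.phase_diagram import PhaseDiagram

# Placeholder API key; substitute your own Materials Project key.
with MPRester("YOUR_API_KEY") as mpr:
    # All computed entries in the Ge-Sb-Te chemical system.
    entries = mpr.get_entries_in_chemsys(["Ge", "Sb", "Te"])

pd = PhaseDiagram(entries)

# With the suspect mp-1079094 entry included, it defines the hull at the
# Sb2Te3 composition, so the common phase (mp-1201) appears far above it.
for e in entries:
    if e.composition.reduced_formula in ("GeTe", "Sb2Te3"):
        print(e.entry_id, e.composition.reduced_formula,
              pd.get_e_above_hull(e), "eV/atom above hull")
```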

It is very concerning, as it looks like basic data-integrity principles may have been compromised…

Hi @SVB,

This is not a response to the overall issue raised here, since we’re still investigating. However, I wanted to clarify the following point:

An example is mp-1079094, which has a single GGA relaxation task (involving a HUGE cell shape change!)

Every “optimization” task on Materials Project is actually performed as a double optimization run, precisely to account for cases such as this where there is a huge cell-shape or volume change. So in the case of mp-1079094, although there was a dramatic change from the initial to the final volume overall, the change over the course of the second optimization step was actually much smaller (~0.05%). This alone shouldn’t account for the discrepancy you’ve flagged.

I or another member of the MP team will update this thread as we investigate this issue further. Thanks for bringing this to our attention.

Best,

Matt


Thanks Matt, this is a relief. I wasn’t able to tell from looking at the task that there were two relaxation runs (BTW, is there a way I can tell?), but that should indeed have taken care of most of the basis-set error. I would guess the issue is more subtle, then (a problem with a VASP run? maybe k-mesh symmetrization failing?). Thanks a lot for investigating this!


Yes, each task has its own task details page, which for this particular task is available here: https://materialsproject.org/tasks/mp-1079094/

You can see there are two graphs underneath, “iterative steps in first relaxation” and “iterative steps in second relaxation”.

However, regarding the problem you reported: unfortunately there definitely is an issue here, mostly related to a subset of the new materials added in the most recent database release (v2019.02). We’re aware of the cause, hope to fix it in the next database release, and will include an explanation of what happened along with a list of the affected tasks.

Thanks Matt, I hadn’t realized the two plots implied two relaxation runs. Pretty obvious in retrospect, though.

Is there a way to mask those “new” materials from the queried data when computing e_above_hull and doing other phase-diagram-related analysis? The only solution I have come up with so far is to look for the specific known-to-be-problematic materials (e.g. mp-1079094) in PhaseDiagram.qhull_data, exclude them from there, and then manually pass the result to qhull and redo the entire phase-diagram analysis… Inconvenient, error-prone, and requiring major reworking of all existing scripts and workflows… Are there any simpler solutions, maybe?
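(The least painful workaround I can think of is to filter the entry list before constructing the PhaseDiagram, rather than editing qhull_data; a rough sketch, assuming pymatgen’s MPRester and PhaseDiagram, with the bad-ID list limited to what I currently know about:)

```python
from pymatgen.ext.matproj import MPRester
from pymatgen.analysis.phase_diagram import PhaseDiagram

# Entries I currently know to be problematic; obviously incomplete.
KNOWN_BAD = {"mp-1079094"}

with MPRester("YOUR_API_KEY") as mpr:
    entries = mpr.get_entries_in_chemsys(["Ge", "Sb", "Te"])

# Drop the known-bad entries before building the phase diagram, so that
# e_above_hull and decomposition products are computed without them.
clean_entries = [e for e in entries if e.entry_id not in KNOWN_BAD]
pd = PhaseDiagram(clean_entries)
```

But this still has to be repeated in every script, and the bad-ID list has to be maintained by hand.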

Thanks!

Just wanted to check whether there are any updates / a solution to the issue with the wrong entries / compromised data. Phase diagrams involving Sb2Te3 are still all totally wrong… I just downloaded the MP files and reran the calculations for a few Sb2Te3 polymorphs: for mp-1201, I reproduce the MP energy (and the downloaded geometry is indeed fully relaxed), whereas for mp-1079094, the downloaded (presumably fully relaxed?) structure undergoes a huge relaxation, after which it has an energy ~0.2 eV/atom higher than mp-1201, rather than 1 eV/atom lower as MP shows.
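(For reference, roughly how I set up the re-check; a sketch assuming pymatgen’s MPRelaxSet approximates the MP input settings closely enough, with placeholder output directories:)

```python
from pymatgen.ext.matproj import MPRester
from pymatgen.io.vasp.sets import MPRelaxSet

with MPRester("YOUR_API_KEY") as mpr:
    # The "final" structures as currently stored in MP.
    s_common = mpr.get_structure_by_material_id("mp-1201")
    s_highp = mpr.get_structure_by_material_id("mp-1079094")

# Write VASP inputs with MP-style relaxation settings and rerun locally.
MPRelaxSet(s_common).write_input("Sb2Te3_mp-1201")
MPRelaxSet(s_highp).write_input("Sb2Te3_mp-1079094")
```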

I don’t quite understand why such entries are not taken offline, rolling back to the older database version if necessary. The data presented on the MP site (and via the API) are just terribly wrong, and each individual incorrect compound potentially spoils ALL phase diagrams, reactions, etc., in the chemical systems that involve it as a subsystem! So far mp-1079094 is the only entry where I have seen a ~1 eV/atom error, but clearly there are other cases: e.g. plotting the Mg-Al phase diagram suggests a problem with mp-1185596, and I take it from your earlier reply that there are other cases you’ve encountered. Why keep compromised data online? It makes the entire database unreliable (read “useless”)…

Thanks,
–Sergey

Hi @SVB,

We’re still discussing this internally. Rolling back to a previous database version until this issue is resolved is definitely an option, but would also mean that certain mp-ids would go temporarily “missing”, which is problematic for other reasons. This might however be the best option for the time being.

Beyond this specific issue, we’re also discussing ways to better share historical versions of the database, and to allow users to opt-in to preview new versions of the database, so that we can better handle this in future if any similar issues occur.

Resolving this is absolutely a priority for us. Thanks for your patience while we get it fixed.

Matt


Hi @SVB,

First off, thank you for pointing this out and then doing some extra work to show the energy is off.

We figured out the issue: the volume change was large enough that the original k-point settings were no longer valid for the final structure. The result is a calculation that is under-converged with respect to k-points. Our previous workflow code handled this better, but it appears to be an issue with structure optimizations in atomate, our current workflow code.
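(To illustrate the mechanism, a rough sketch with pymatgen; the grid density kppa below is just an illustrative value, not our production setting:)

```python
from pymatgen.ext.matproj import MPRester
from pymatgen.io.vasp.inputs import Kpoints

with MPRester("YOUR_API_KEY") as mpr:
    initial = mpr.get_structure_by_material_id("mp-1079094", final=False)
    final = mpr.get_structure_by_material_id("mp-1079094", final=True)

# A density-based mesh comes out different for the initial and final cells;
# reusing the mesh chosen for one cell on the other can leave the calculation
# under-converged with respect to k-points.
kppa = 1000  # illustrative grid density (k-points per reciprocal atom)
print("initial cell mesh:", Kpoints.automatic_density(initial, kppa).kpts)
print("final cell mesh:  ", Kpoints.automatic_density(final, kppa).kpts)
```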

Because a number of calculations have this issue, including some older calculations, we’ve been developing a method for officially deprecating calculations and possibly materials. This is to try to ensure that we don’t take away access to data, but still have a way of noting that a materials detail page may have been constructed by mistake.

The plan is to have this out in the April release in a few days, when you should see an updated materials set and updated e_above_hull values.

Thanks again for pointing this out and please continue to post any issues you notice.

Thank you @shyamd,

When can we expect to see this new release? It doesn’t look like anything has changed yet (at least judging by the website results).

Thanks for looking into this.

P.S.
Frankly, I don’t quite see why you have to keep the database in a useless state for so long if the only problem is a conflict between an old (correct but incomplete) and a new (more complete but wrong) version. Can’t you just expose both of them? I may be naive, but it looks to me like you could run the older database on a second web server instance. Say, use the same https address but a different port number, e.g. 9000: then the default pymatgen installation would query the latest database version, returning all mp-ids (but some with totally wrong entries), whereas simply replacing “materialsproject.org/rest” with “materialsproject.org:9000/rest” throughout the pymatgen installation would query the older version with more reliable numbers (but some IDs missing). As far as I understand, only Python files need to be changed in the pymatgen installation; if so, interested users could be advised to do that manually, and you wouldn’t even need to release any API updates/fixes! I may be missing something, but at least at first sight this looks like a 1-2-3 solution: run a second database process, run a second web server process, done! :slight_smile:
(This, of course, would not change the website results, so looking for a more thorough solution would still be appropriate; but it would at least make MP data usable again, whereas for the last couple of months it has effectively been random noise, or more accurately a gamble on getting either data or random noise…)
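Actually, if the older snapshot were exposed at a different endpoint, I believe MPRester already accepts an endpoint argument, so users might not even need to edit any pymatgen files. Roughly, with a purely hypothetical URL/port:

```python
from pymatgen.ext.matproj import MPRester

# Hypothetical endpoint for an older database snapshot; the port is made up.
OLD_DB = "https://materialsproject.org:9000/rest/v2"

with MPRester("YOUR_API_KEY", endpoint=OLD_DB) as mpr:
    entries = mpr.get_entries_in_chemsys(["Ge", "Sb", "Te"])
```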