Description
The emergence of big-data-driven techniques as a fundamental paradigm of science has forced a re-evaluation of the way that researchers manage, document, and share their data [1]. As a result, a variety of domain-specific projects have developed innovative tools for ensuring FAIR (Findable, Accessible, Interoperable, Reusable) or TRUE (Transparent, Reproducible, Usable by others, Extensible) data management [2]. For example, the Novel Materials Discovery (NOMAD) Laboratory is a user-driven platform for sharing and exploiting computational materials science data, with a focus on data from ab initio calculations [3]. In practice, four main challenges, known as the four V's of big data, arise when developing data management procedures: volume (the amount of data), variety (the heterogeneity of form and meaning of data), velocity (the rate at which data may change or new data arrive), and veracity (the uncertainty of data quality). Compared with ab initio data, the data generated by soft matter simulations (e.g., atomistic molecular dynamics simulations and multiscale modeling techniques) pose a particular challenge, due primarily to issues associated with the first two V's, volume and variety.
To address reproducibility of soft matter simulations, Cummings, McCabe, and coworkers have developed the Molecular Simulation Design Framework (MoSDeF), an open-source Python software stack that enables facile use of multiple open-source molecular simulation engines, while at the same time ensuring maximum reproducibility [4,5]. This suite provides support for constructing topologies and configurations, implementing and saving force fields, and generating simulation input files for popular molecular simulation software. In this way, researchers can implement complex simulation workflows in a fully scriptable fashion that is maximally reproducible [6].
In the context of accessibility and data sharing, various communities have developed niche repositories and management tools. Recently, FAIRmat, a consortium of the German National Research Data Infrastructure (NFDI), was formed to continue to raise awareness and acceptance of FAIR data practices [7]. One of the primary tasks of FAIRmat is to extend the NOMAD infrastructure to a wide variety of materials science data, including data from soft matter simulations. Additionally, FAIRmat aims to assist the community in advancing metadata schemas and ontologies, enabling efficient exchange of FAIR research data and big-data analyses that aim to revolutionize the development of novel materials.
Proper interoperability requires some standardization of the simulation data and workflows. To achieve this, it is essential that involved parties, including leading software developers outside the data management sphere, come together to set appropriate standards. Communication between individual projects will facilitate efficient development of tools and avoid duplication of work and "reinventing the wheel".
The FAIRmat and MoSDeF consortia have come together to organize a CECAM flagship workshop to address data management challenges in soft matter simulations. In particular, the workshop will highlight ongoing projects aiming to (1) develop and standardize simulation metadata (e.g., force field and run-time parameters), and (2) develop tools and infrastructure for storage and interoperability of simulations. Hosted by the Max Planck Institute for Polymer Research in Mainz, Germany, this in-person meeting will take place in September 2023.
To identify the overarching challenges in this budding discipline, and to prepare effectively for discussing the standardization of simulation data and workflows at the in-person workshop, a virtual pre-meeting will be held Feb. 27-28 from 15:00 to 19:30 CET. In this pre-meeting, representatives from leading data management projects for soft matter simulations will provide overviews of their current capabilities as well as outlooks for the coming year. Represented projects include FAIRmat/NOMAD, MoSDeF, OPTIMADE, OpenKIM, and ColabFit, among others.
We look forward to seeing you there!
References
[1] C. Draxl, M. Scheffler, MRS Bull., 43, 676-682 (2018)
[2] M. Wilkinson, M. Dumontier, I. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J. Boiten, L. da Silva Santos, P. Bourne, J. Bouwman, A. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. Evelo, R. Finkers, A. Gonzalez-Beltran, A. Gray, P. Groth, C. Goble, J. Grethe, J. Heringa, P. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. Lusher, M. Martone, A. Mons, A. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, B. Mons, Sci. Data, 3, 160018 (2016)
[3] C. Draxl, M. Scheffler, J. Phys. Mater., 2, 036001 (2019)
[4] A. Summers, J. Gilmer, C. Iacovella, P. Cummings, C. McCabe, J. Chem. Theory Comput., 16, 1779-1793 (2020)
[5] P. Cummings, C. McCabe, C. Iacovella, A. Ledeczi, E. Jankowski, A. Jayaraman, J. Palmer, E. Maginn, S. Glotzer, J. Anderson, J. Ilja Siepmann, J. Potoff, R. Matsumoto, J. Gilmer, R. DeFever, R. Singh, B. Crawford, AIChE J., 67 (2021)
[6] M. Thompson, J. Gilmer, R. Matsumoto, C. Quach, P. Shamaprasad, A. Yang, C. Iacovella, C. McCabe, P. Cummings, Mol. Phys., 118, e1742938 (2020)
[7] M. Scheffler, M. Aeschlimann, M. Albrecht, T. Bereau, H. Bungartz, C. Felser, M. Greiner, A. Groß, C. Koch, K. Kremer, W. Nagel, M. Scheidgen, C. Wöll, C. Draxl, Nature, 604, 635-642 (2022)
Schedule
The workshop takes place on Feb. 27-28, from 15:00 to 19:30 CET.
Exact times TBA
Speaker | Affiliation | Topic |
Luca Ghiringhelli | FAIRmat/Humboldt-Universität zu Berlin | Molecular dynamics support in NOMAD |
Johannes Bergsma | École Polytechnique Fédérale de Lausanne | Trajectory support for OPTIMADE |
TBD | TBD | MoSDeF |
Ilia Nikiforov | University of Minnesota | OpenKIM |
Eric Fuemmeler | University of Minnesota | ColabFit |