MM: Metal and Material Physics Division
MM 70: Topical Session (Symposium MM): Big Data in Materials Science - Managing and exploiting the raw material of the 21st century
MM 70.4: Talk
Friday, 16 March 2018, 10:30–10:45, H 0107
Cluster analysis of chemical libraries based on molecular fingerprinting — •Annika Stuke1, Lei Xie2, Milica Todorović1, and Patrick Rinke1 — 1Department of Applied Physics, Aalto University, Finland — 2Department of Computer Science, Hunter College, the City University of New York, USA
Machine learning models promise to greatly accelerate the discovery of new and better materials. However, learning models struggle to achieve robust, high prediction performance on imbalanced chemical datasets, in which certain classes of chemical structures are overrepresented. Learning algorithms are easily influenced by the larger classes, which leads to biased results. We present an efficient method to generate diverse subsets from large chemical databases with cluster analysis. Databases are split into clusters with an extended exclusion sphere algorithm based on the pairwise Tanimoto similarity calculated from Morgan fingerprints [1]. A diverse subset is then generated by picking molecules with different substructures from each cluster. The method has been successfully employed to select structurally diverse subsets of a dataset of 64k organic molecules from the Cambridge Structural Database [2]. We demonstrate the effect of this method on the prediction performance of machine learning models based on kernel ridge regression and neural networks for spectral properties of molecules.

[1] D. Butina, J. Chem. Inf. Comput. Sci. 39, 747 (1999)
[2] C. Schober et al., J. Phys. Chem. Lett. 7, 3973 (2016)
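The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' extended algorithm: it implements only a basic exclusion sphere (Butina-style) clustering on the Tanimoto similarity T(A, B) = |A ∩ B| / |A ∪ B|, and it picks just the cluster centre as the representative of each cluster. Fingerprints are assumed to be given as sets of "on" bit indices (for example, taken from Morgan fingerprints computed elsewhere). The function names, the toy fingerprints, and the threshold of 0.5 are all hypothetical choices made for illustration.

```python
from typing import FrozenSet, List

# Assumption: a fingerprint is the set of indices of its "on" bits.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity |a & b| / |a | b| of two bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def sphere_exclusion_clusters(
    fps: List[Fingerprint], threshold: float
) -> List[List[int]]:
    """Basic exclusion sphere clustering; each cluster lists its centre first."""
    n = len(fps)
    # neighbours[i] holds every j != i with similarity >= threshold.
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)

    # Visit molecules with the most neighbours first; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))

    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        # The centre claims all of its still unassigned neighbours.
        members = [i] + sorted(neighbours[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


def diverse_subset(clusters: List[List[int]]) -> List[int]:
    """Simplification: take only the centre of each cluster."""
    return [cluster[0] for cluster in clusters]


# Toy fingerprints (hypothetical data, not taken from any real database).
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({20, 21}),
]
clusters = sphere_exclusion_clusters(fps, threshold=0.5)
subset = diverse_subset(clusters)
print(clusters)  # [[0, 1, 2], [3, 4], [5]]
print(subset)    # [0, 3, 5]
```

The pairwise loop scales quadratically with the number of molecules, so a database of the size mentioned in the abstract would likely need a library implementation or a batched similarity computation rather than this plain Python version.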