Computational Chemistry, Invited Lecture
CC-011

Learning how to do chemical reactions from data

A. C. Vaucher¹, A. Cardinale¹, J. Geluykens¹, V. H. Nair¹, O. Schilter¹, P. Schwaller¹, A. Logallo¹, F. Zipoli¹, T. Laino¹*
¹IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland

The synthesis of novel compounds is a complex task requiring 1) a sound retrosynthetic analysis, 2) the formulation of adequate reaction conditions for each reaction step, and 3) the actual synthesis in the laboratory. The knowledge and experience that chemists accumulate over decades of practice are key to a successful design across steps (1) to (3).
With more than two hundred years of reported experiments in the chemical literature, as well as recent advances in automation and machine learning, data-driven approaches offer multiple opportunities to assist chemists and accelerate the material design process. In recent years, groups worldwide have reported several algorithms to support chemists in the retrosynthetic problem [1-3]. Such computational approaches help chemists identify optimal synthetic pathways to target molecules starting from suitable building blocks. In particular, approaches relying on machine learning do not require the manual encoding of reaction rules and can learn directly from corpora of reaction data. Retrosynthesis algorithms, however, provide only a design of the synthetic route to a given target; they do not specify the laboratory details needed to carry out each reaction step.
Here, we present a machine-learning model that predicts the sequence of operations needed to perform any reaction step in the laboratory. To train this model, we used a dataset of reaction equations paired with their associated laboratory operations, processed from experimental procedures reported in patents. To build this dataset, we designed a separate machine-learning model trained to extract the operations from experimental procedures in patents [4] and to convert them into a sequence of structured, computer-friendly operations. We then used this dataset to design an architecture that predicts the exact sequence of laboratory operations given a reaction target.
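To make the notion of "structured, computer-friendly operations" concrete, the following is a minimal sketch of how one such sequence might be represented and serialized. The `Operation` class, the action names (`ADD`, `STIR`, `FILTER`), the parameter keys, and the text format are all assumptions made for illustration; the abstract does not specify the actual schema, and the example sequence is written by hand, not produced by any model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Operation:
    """One laboratory step: an action name plus its parameters (hypothetical schema)."""
    name: str
    params: Dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        # Sort parameters by key so the rendered string is deterministic.
        args = ", ".join(f"{k}={v}" for k, v in sorted(self.params.items()))
        return f"{self.name}({args})"


def render_sequence(ops: List[Operation]) -> str:
    """Serialize a sequence of operations into one line, steps separated by '; '."""
    return "; ".join(op.render() for op in ops)


# Hand-written illustration of a procedure; not the output of any model.
example = [
    Operation("ADD", {"material": "compound A", "quantity": "1.0 g"}),
    Operation("ADD", {"material": "THF", "quantity": "10 mL"}),
    Operation("STIR", {"duration": "2 h", "temperature": "25 C"}),
    Operation("FILTER"),
]

print(render_sequence(example))
```

A flat text serialization of this kind is one plausible way to pair a reaction equation with its operations as an input and output of a sequence model, although other encodings would serve equally well.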
Finally, we show how the models for retrosynthetic analysis and for the prediction of laboratory operations can be coupled to commercial chemical robots, yielding a platform able to synthesize molecules autonomously.
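The coupling described here can be pictured as a simple three-stage loop: plan a route, predict the operations for each step, and hand each operation to the hardware. The sketch below only illustrates that control flow. The function `run_synthesis`, its three callable arguments, and the toy stand-ins are hypothetical and do not correspond to any actual model or robot interface.

```python
from typing import Callable, List

# Hypothetical type aliases: a reaction step is a reaction string,
# and an operation is a plain-text instruction.
ReactionStep = str
Operation = str


def run_synthesis(
    target: str,
    plan_route: Callable[[str], List[ReactionStep]],
    predict_operations: Callable[[ReactionStep], List[Operation]],
    execute: Callable[[Operation], None],
) -> int:
    """Plan a route, predict operations for each step, and execute them in order.

    Returns the total number of operations passed to the executor.
    """
    count = 0
    for step in plan_route(target):
        for op in predict_operations(step):
            execute(op)
            count += 1
    return count


# Toy stand-ins so the sketch runs; real models and a robot driver
# would replace these.
def toy_plan_route(target: str) -> List[ReactionStep]:
    # Steps are listed in forward order, from building blocks to the target.
    return ["A + B -> I", f"I + C -> {target}"]


def toy_predict_operations(step: ReactionStep) -> List[Operation]:
    return [f"ADD reagents for {step}", f"STIR for {step}"]


log: List[Operation] = []
n_ops = run_synthesis("T", toy_plan_route, toy_predict_operations, log.append)
print(n_ops, "operations dispatched")
```

Passing the three stages as callables keeps the loop independent of any particular model or instrument, so each component could be swapped without touching the others.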

[1] Law, J.; Zsoldos, Z.; Simon, A.; Reid, D.; Liu, Y.; Khew, S. Y.; Johnson, A. P.; Major, S.; Wade, R. A.; Ando, H. Y., J. Chem. Inf. Model. 2009, 49, 593–602.
[2] Segler, M. H. S.; Preuss, M.; Waller, M. P., Nature 2018, 555, 604–610.
[3] Schwaller, P.; Petraglia, R.; Zullo, V.; Nair, V. H.; Haeuselmann, R. A.; Pisoni, R.; Bekas, C.; Iuliano, A.; Laino, T., Chem. Sci. 2020, 11, 3316–3325.
[4] Vaucher, A. C.; Zipoli, F.; Geluykens, J.; Nair, V. H.; Schwaller, P.; Laino, T., under review, 2020.