Chemical Reaction Prediction using Machine Learning
Adnan R. Ahmad
College of Veterinary Medicine, University of Duhok, Zakho Street 38, 1006 AJ Duhok, Duhok,
Kurdistan Region, Iraq.
*Corresponding Author E-mail: ara80@uod.ac
ABSTRACT:
Artificial intelligence is driving a significant change in organic chemistry. Through a number of platforms, including applications for synthesis planning and reaction prediction, machine learning has integrated itself into the daily work of organic chemists, helping with synthetic problems in specific domains. In contrast to retrosynthesis and reaction prediction models, reaction yield prediction has received less attention despite its considerable potential. Reaction yield models describe the proportion of the reactants that is converted to the desired products. Accurately forecasting these conversion rates would allow chemists to choose high-yielding reactions and to score synthesis routes, reducing the number of attempts. For high-throughput experiments, yield predictions have so far been made mainly with a fixed encoding of reactants, molecular fingerprints, or computed chemical descriptors.
KEYWORDS: Chemistry, Reaction prediction, Machine learning, Yield prediction, Artificial intelligence.
INTRODUCTION:
A chemical reaction is usually described by stating the structural formulae of the reactants and products separated by an arrow, which represents the chemical transformation of atoms between the reactant molecules. Many efforts have been made to construct models that predict reactivity, for example for the oxidative dehydrogenation of ethylbenzene1,2, reactions of vanadium selenites3, and Suzuki coupling reactions4-10. In this paper, we design a model for predicting reactivity that is not restricted to a specific reaction class and that predicts whether the reaction outcome is above or below a threshold value.
Depending on the reaction conditions (temperature, concentrations) and the particular substrates, certain chemical reaction classes are typically characterized by lower or higher yields.
Artificial intelligence (AI) techniques are useful for speeding up and simplifying the discovery of new drugs and materials11.
Over the last decade, the use of data science techniques in different fields of materials science has risen significantly12-17. For example, data science is being used to assist density functional calculations in relating the interactions of atoms to the properties of the materials themselves using quantum mechanics18-20. Machine learning (ML) is also used to establish structure-property relationships for modeling the mechanics of materials. AI is being used to design novel materials with desired properties and to optimize the production processes of existing materials. ML is also very useful for predictions on drug complexes, especially those with nonlinear behavior.
For synthesis planning, knowing the outcome of a reaction can be a game-changer. It gives scientists the ability to assess the overall yield of complicated chemical pathways and to resolve potential flaws before devoting time and resources to wet-lab investigations. Synthetic chemists may find computational models that anticipate reaction yields helpful for selecting the best synthesis path from among the many suggested by data-driven algorithms. Additionally, reaction outcome prediction models might be used as scoring functions to complement forward prediction models4,6, in-scope filters2, and computer-aided retrosynthesis route planning tools3-6.
Here, we continue to treat organic chemistry as a language, as we have done in the past, and present a new model that predicts reaction yields from reaction SMILES16. In more detail, we adapt the rxnfp models, which are based on bidirectional encoder representations from transformers (BERT), by adding a regression layer in order to forecast reaction yields. BERT encoders belong to the transformer model family, which has revolutionized natural language processing17,18. These models take sequences of tokens as input and compute contextualized representations of every token in the input, and they can be applied to reactions stored in the SMILES format19. Here, we provide the first demonstration that these natural language architectures can be extremely helpful not only for working with language tokens but also for predicting reaction properties such as reaction yields.
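To make the idea of treating a reaction as a sequence of tokens concrete, the following minimal sketch splits a reaction SMILES string into tokens with a regular expression. The pattern is an assumption chosen for illustration (an atom-wise SMILES tokenization scheme) and is not necessarily the tokenizer used by rxnfp; the function name and the example reaction are hypothetical.

```python
import re

# Atom-wise SMILES tokenization pattern (an assumption for illustration).
# Bracket atoms such as [NH4+] stay whole, two-letter halogens (Br, Cl) are
# matched before single letters, and ">" separates reactants from products.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_reaction_smiles(rxn_smiles):
    """Split a reaction SMILES string into a list of tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(rxn_smiles)
    # Refuse to drop characters silently if the pattern misses something.
    if "".join(tokens) != rxn_smiles:
        raise ValueError(f"Could not tokenize: {rxn_smiles!r}")
    return tokens


# Illustrative reaction only: ethanol to acetaldehyde.
print(tokenize_reaction_smiles("CCO>>CC=O"))
# ['C', 'C', 'O', '>', '>', 'C', 'C', '=', 'O']
```

A transformer encoder would then map each token to an integer id and compute a contextualized vector per token; that part is omitted here.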
The rest of this paper is organized as follows: Section II covers the models and experiments and Section III covers high-capacity prediction, followed by patent prediction and finally the conclusion.
MATERIALS AND METHODS:
We adapt the reaction fingerprint (rxnfp) models, using an encoder of fixed size and tuning only the dropout rate and the learning rate9. This lets us avoid the common problems that arise when neural networks have many hyperparameters. The initial learning rate is the most crucial hyperparameter to tune, and we observed good results for a wide range of dropout rates (from 0.1 to 0.8) during our trials. Hyperparameter optimization graphs are shown in Figures S26 through S30. In this method we employ Simple Transformers14, Hugging Face Transformers15, and the PyTorch framework16 to aid training. Figure 1 depicts the general layout of the pipeline.
Figure 1: General layout of the evaluation pipeline.
RESULT:
High capacity prediction: Pd-catalyzed Buchwald-Hartwig C-N cross-coupling reactions were the subject of high-throughput investigations that measured the yield of each reaction. Three plates with combinations of three bases and a number of isoxazole additives were employed in the tests, yielding around 4000 reactions. Spartan was used to compute 120 molecular, atomic, and vibrational descriptors with density functional theory for each combination of halides, ligands, bases, and additives9.
Perera et al. applied high-throughput experimentation (HTE) techniques to the class of Suzuki-Miyaura reactions. The authors considered 15 pairs of electrophiles and nucleophiles, each of which gave a distinct product, and varied the ligands for every pair.
As shown in Figure 2, training on just 5% of the reactions already allows a scientist to choose some of the highest-yielding reactions for the upcoming round of tests. With a training set of 10%, the yields of the chosen reactions are nearly optimal, where the optimum is indicated in the figure with the word "ideal". Using a model trained on 11% of the data set, the 10 reactions from the remaining unseen data that were predicted to have the highest yields for the Buchwald-Hartwig reaction have an average yield of 90%, compared with 98.7% for the optimal selection.
Figure 2: Statistics of multiple reaction predictions.
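The selection experiment described above can be sketched as follows: rank the unseen reactions by predicted yield, take the top k, and compare their mean measured yield with the mean of the k best measured yields (the "ideal" selection). This is a minimal sketch under stated assumptions; the function name and the toy numbers are hypothetical and are not taken from the data sets discussed here.

```python
def top_k_selection_quality(predicted, actual, k=10):
    """Compare a model-guided selection with the ideal selection.

    Returns a pair (selected_mean, ideal_mean):
      selected_mean: mean measured yield of the k reactions with the
                     highest predicted yields
      ideal_mean:    mean of the k highest measured yields
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not 0 < k <= len(actual):
        raise ValueError("k must be between 1 and the number of reactions")

    # Indices of the reactions, best predicted yield first.
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    chosen = ranked[:k]

    selected_mean = sum(actual[i] for i in chosen) / k
    ideal_mean = sum(sorted(actual, reverse=True)[:k]) / k
    return selected_mean, ideal_mean


# Hypothetical predicted and measured yields (in %) for five unseen reactions.
predicted = [80, 10, 60, 95, 30]
actual = [70, 20, 90, 85, 40]
print(top_k_selection_quality(predicted, actual, k=2))  # (77.5, 87.5)
```

The closer the first value is to the second, the closer the model-guided choice is to the best possible choice of k reactions.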
DISCUSSION:
In this part, we examine yields in the USPTO data set. Using the same set of reactions, we kept only those reactions for which yields and product mass were provided. In contrast to HTE, where reactions are often performed at sub-gram scale, the patent data comprises reactions across a wider range of scales, from sub-gram to gram.
The Gram and sub-gram comparison in Table 1 shows an experiment that was motivated by the aforementioned observations. Because some of the yield values in the data set are likely inaccurate, we smoothed the yields by averaging the yields of the three nearest neighbours together with twice the reaction's own yield. The faiss18 library and rxnfp ft8 were used to calculate the distances to the nearest neighbours.
Table 1: The Gram and sub-gram comparison
Scale | Gram | Sub-gram
Random split | 0.117 | 0.195
Time split | 0.095 | 0.142
Random split (smoothed) | 0.277 | 0.388
Randomized yields | 0 | 0
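The yield smoothing step described above can be illustrated with a short sketch. It assumes that "averaging" means a weighted mean in which the reaction's own yield counts twice and each of the three nearest neighbours counts once, so the sum is divided by five. A brute-force Euclidean search stands in for faiss here, and the one-dimensional "fingerprints" and yields are hypothetical toy values, not rxnfp vectors.

```python
import math


def smooth_yields(fingerprints, yields, n_neighbors=3, self_weight=2.0):
    """Smooth each yield with the yields of its nearest neighbours.

    smoothed_i = (self_weight * y_i + sum of neighbour yields)
                 / (self_weight + number of neighbours)
    """
    smoothed = []
    for i, fp_i in enumerate(fingerprints):
        # Brute-force nearest-neighbour search (stand-in for faiss).
        distances = sorted(
            (math.dist(fp_i, fp_j), j)
            for j, fp_j in enumerate(fingerprints)
            if j != i
        )
        neighbor_ids = [j for _, j in distances[:n_neighbors]]
        total = self_weight * yields[i] + sum(yields[j] for j in neighbor_ids)
        smoothed.append(total / (self_weight + len(neighbor_ids)))
    return smoothed


# Toy data: the last reaction is far from the others and has an outlying yield.
fingerprints = [[0.0], [1.0], [2.0], [3.0], [10.0]]
yields = [10, 20, 30, 40, 100]
print(smooth_yields(fingerprints, yields))  # [22.0, 24.0, 26.0, 28.0, 58.0]
```

In the toy example the outlying yield of 100 is pulled towards the yields of its neighbours, which is the intended effect when individual reported yields are unreliable.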
The proposed method (AdaBoost): AdaBoost, or adaptive boosting, is an ensemble technique that combines several weak learners into a strong classifier. In a chemical context, AdaBoost can help to improve prediction accuracy for some critical chemical problems21. AdaBoost trains its weak learners, here decision trees, iteratively, and an existing library such as scikit-learn can be used to tune hyperparameters such as the number of estimators22. The prediction performance of AdaBoost is presented in the following table:
Table 2: Comparison of AdaBoost with other models such as rxnfp and GNNs
Model | Accuracy | Precision | Recall | F1-score
Rxnfp | 87% | 86% | 85% | 85.5%
GNNs | 88% | 87% | 86% | 86.6%
AdaBoost | 91% | 87% | 82% | 85%
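For readers less familiar with the metrics in Table 2, the sketch below shows how accuracy, precision, recall, and F1-score are conventionally computed from the counts of a binary confusion matrix. The function names and the example counts are hypothetical; the underlying counts behind Table 2 are not reported here.

```python
def f1_from_precision_recall(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fp: true/false positives, fn/tn: false/true negatives.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from_precision_recall(precision, recall),
    }


# Hypothetical counts, e.g. "yield above threshold" as the positive class.
print(classification_metrics(tp=45, fp=5, fn=10, tn=40))

# A precision of 86% and a recall of 85% give an F1-score of about 85.5%.
print(round(f1_from_precision_recall(0.86, 0.85), 3))  # 0.855
```

Because the F1-score is a harmonic mean, it lies between precision and recall and sits closer to the smaller of the two.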
While AdaBoost does not outperform the other models on every metric, it provides a resilient alternative in difficult settings or in cases where other models overfit or underperform. One caveat is that AdaBoost is sensitive to noisy data and outliers, which may mislead the classifier. However, it is effective at combining weak classifiers, especially with respect to overfitting, and it can handle diverse data distributions. This property can help to build a stronger feature engineering combination with other techniques such as XGBoost to improve the quality of the feature data23.
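The boosting idea discussed above can be sketched from scratch in a few lines. This is an illustrative implementation of the classic AdaBoost scheme with one-feature threshold rules (decision stumps) as weak learners, written in plain Python rather than with scikit-learn; it is not the configuration used to produce Table 2. Labels are +1 and -1 (for instance, yield above or below a threshold), and the toy data are hypothetical.

```python
import math


def train_stump(X, y, weights):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[f] <= threshold else -polarity for row in X]
                error = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or error < best[0]:
                    best = (error, f, threshold, polarity)
    return best


def stump_predict(stump, row):
    _, f, threshold, polarity = stump
    return polarity if row[f] <= threshold else -polarity


def adaboost_fit(X, y, n_estimators=10):
    """Fit an ensemble of weighted stumps; y must contain +1 / -1 labels."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_estimators):
        stump = train_stump(X, y, weights)
        # Clamp the error to avoid division by zero or log(0).
        error = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the others.
        weights = [
            w * math.exp(-alpha * t * stump_predict(stump, row))
            for w, row, t in zip(weights, X, y)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy problem that no single threshold can solve: the negative class
# sits in the middle of the feature range.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_fit(X, y, n_estimators=3)
print([adaboost_predict(ensemble, row) for row in X])
```

On this toy problem a single stump misclassifies two of the six points, while three boosted stumps reproduce all training labels. Because misclassified points are up-weighted in every round, a mislabeled point would also attract increasing weight, which is the sensitivity to noise and outliers noted above.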
The application of machine learning (ML) in predicting chemical reactions holds significant promise, yet it remains constrained by several critical limitations that impede its effectiveness and reliability. One fundamental issue is the quality and quantity of data used for training ML models; often, datasets may be incomplete, biased, or too small to capture the complex nuances of chemical behavior accurately. Unlike traditional quantum mechanical approaches that are grounded in well-established theoretical principles, ML models can sometimes produce predictions based on correlations rather than causations, leading to potential inaccuracies when applied to novel scenarios outside their training scope. Additionally, these models struggle with interpretability—understanding why a particular reaction outcome was predicted can be opaque compared to classical methods where mechanistic insights are clearer. The "black box" nature of many ML algorithms further complicates this issue, making it challenging for chemists to trust or validate the results fully. Moreover, the generalizability of ML predictions across different reaction types and conditions remains questionable; factors such as solvent effects or temperature variations might not be adequately accounted for within the model. Finally, despite advancements in computational power and algorithm efficiency, scaling these predictions for very large molecular systems or highly diverse reaction spaces continues to pose significant computational challenges. These limitations underscore the necessity for continued integration of domain expertise with advanced computational techniques to achieve more robust and reliable predictive models in chemistry.
CONCLUSION:
In this paper, we examined the reaction outcomes in the publicly available patent data and demonstrated how the distribution of stated yields varies significantly depending on the scale of the reaction. Our suggested strategy is unable to effectively forecast the patent reaction yields due to the patent data's inherent inconsistency and poor quality. We point out the necessity for a more reliable and higher-quality public data set for the creation of reaction outcome prediction models, even though we cannot completely rule out the possibility that another design would perform better than the one described in this work.
CONFLICT OF INTEREST:
The authors have no conflicts of interest regarding this investigation.
REFERENCES:
1. Dhananjaneyulu BV. Kumaraswamy K. Kinetic and thermodynamic studies on adsorption of malachite green from aqueous solution using mixed adsorbents (rice husk and egg shell). Research Journal of Pharmacy and Technology. 2016; 9(10): 1671-6. https://doi.org/10.5958/0974-360X.2016.00337.1.
2. Schwaller P. Vaucher AC. Laino T. Reymond JL. Prediction of chemical reaction yields using deep learning. Machine Learning: Science and Technology. 2021; 2(1): 015016. https://doi.org/10.1088/2632-2153/abc81d.
3. Tippabathani J. Nellore J. Suresh X. Computational Identification of microRNAs binding to the Transcription factors related to Dopamine Neurons. Research Journal of Pharmacy and Technology. 2018; 11(12): 5520-8. https://doi.org/10.5958/0974-360X.2018.01005.3.
4. Reddy AR. Kumar RB. Kumar VR. Deepthi M. Lohita TN. Sriharsha M. et al. Experimental Studies on effect of Vermicompost and NPK on Essential oil yield of Ocimum tenuiflorum var. CIM-Ayu. Research Journal of Pharmacy and Technology. 2015; 8(11): 1519-25. https://doi.org/10.5958/0974-360X.2015.00271.1.
5. Susmi MS. Kumar RS. Sreelakshmi V. Menon SV. Mohan S. Suja ST. et al. A Computational approach for identification of Phytochemicals for targeting and optimizing the inhibitors of Heat shock proteins. Research Journal of Pharmacy and Technology. 2015; 8(9): 1199-204. https://doi.org/10.5958/0974-360X.2015.00219.X.
6. Nisha H. Karavadi B. Computational analysis to identify the drug targets and their lead molecules in pancreatic cancer. Research Journal of Pharmacy and Technology. 2017; 10(6): 1708-16. https://doi.org/10.5958/0974-360X.2017.00302.X.
7. Coley CW. Thomas III DA. Lummiss JA. Jaworski JN. Breen CP. Schultz V. et al. A robotic platform for flow synthesis of organic compounds informed by AI planning. Science. 2019; 365(6453): eaax1566. https://doi.org/10.1126/science.aax1566.
8. Schwaller P. Hoover B. Reymond JL. Strobelt H. Laino T. Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. Science Advances. 2021; 7(15): eabe4166. https://doi.org/10.1126/sciadv.abe4166.
9. Epps RW. Bowen MS. Volk AA. Abdel‐Latif K. Han S. Reyes KG. et al. Artificial chemist: an autonomous quantum dot synthesis bot. Advanced Materials. 2020; 32(30): 2001626. https://doi.org/10.1002/adma.202001626.
10. Toyao T. Maeno Z. Takakusagi S. Kamachi T. Takigawa I. Shimizu KI. Machine learning for catalysis informatics: recent applications and prospects. ACS Catalysis. 2019; 10(3): 2260-97. https://doi.org/10.1021/acscatal.9b04186.
11. Epps RW. Bowen MS. Volk AA. Abdel‐Latif K. Han S. Reyes KG. et al. Artificial chemist: an autonomous quantum dot synthesis bot. Advanced Materials. 2020; 32(30): 2001626. https://doi.org/10.1002/adma.202001626.
12. Farah FH. The Thermodynamic parameters of Chlorpromazine hydrochloride partitioning into Dimyrstoylphosphatidylcholine liposomes. Research Journal of Pharmacy and Technology. 2020; 13(12): 5716-20. https://doi.org/10.5958/0974-360X.2020.00995.6.
13. Choromanski K. Likhosherstov V. Dohan D. Song X. Gane A. Sarlos T. et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794. 2020. https://doi.org/10.48550/arXiv.2009.14794.
14. Dash S. Studies on inclusion complexes of 2-p-anisilidienyl 3-(benzothiazolyl-2’) hydrazono-5-p-anisilidiene-4 thiazolidinone with β-cyclodextrin. Research Journal of Pharmacy and Technology. 2020; 13(8): 3843-8. https://doi.org/10.5958/0974-360X.2020.00680.0.
15. Hoover B. Strobelt H. Gehrmann S. exbert: A visual analysis tool to explore learned representations in transformers models. arXiv preprint arXiv:1910.05276. 2019 Oct 11. https://doi.org/10.48550/arXiv.1910.05276.
16. Lee-Thorp J. Ainslie J. Eckstein I. Ontanon S. Fnet: Mixing tokens with fourier transforms. arXiv preprint arXiv:2105.03824. 2021 May 9. https://doi.org/10.48550/arXiv.2105.03824.
17. Yun C. Bhojanapalli S. Rawat AS. Reddi SJ. Kumar S. Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077. 2019 Dec 20. https://doi.org/10.48550/arXiv.1912.10077.
18. Grambow CA. Pattanaik L. Green WH. Reactants, products, and transition states of elementary chemical reactions based on quantum chemistry. Scientific Data. 2020 May 8; 7(1): 137. https://doi.org/10.1038/s41597-020-0460-4.
19. Schwaller P. Vaucher AC. Laino T. Reymond JL. Prediction of chemical reaction yields using deep learning. Machine Learning: Science and Technology. 2021 Mar 31; 2(1): 015016. https://doi.org/10.1088/2632-2153/abc81d.
20. Huang B. Von Lilienfeld OA. Ab initio machine learning in chemical compound space. Chemical Reviews. 2021 Aug 13; 121(16): 10001-36. https://doi.org/10.1021/acs.chemrev.0c01303.
21. Balajee RM. Venkatesh K. A Survey on Machine Learning Algorithms and finding the best out there for the considered seven Medical Data Sets Scenario. Research Journal of Pharmacy and Technology. 2019; 12(6): 3059-62. https://doi.org/10.5958/0974-360X.2019.00518.3.
22. Mithra AS. Duddukuru VC. Manu KS. How artificial intelligence is revolutionizing the banking sector: The applications and challenges. Asian Journal of Management. 2023; 14(3): 166-70. https://doi.org/10.52711/2321-5763.2023.00028.
23. Kumar PJ. Sivannarayana P. Saikishore V. Hariteja S. Sharif S. Bhaskar M. et al. An overview on Combinatorial Chemistry. Research Journal of Pharmacy and Technology. 2012; 5(5): 570-9.
Received on 24.06.2024 Modified on 03.09.2024
Accepted on 21.10.2024 © RJPT All right reserved
Research J. Pharm. and Tech. 2024; 17(11):5435-5438.