Neural Networks for the Prediction of Organic Chemistry Reactions
pubs.acs.org
One of the authors here. Ask us anything!
This paper is just a first step - what we'd really like to use this for is designing recipes for synthesizing new molecules.
I would also be remiss if I didn't link to a closely-related paper from another group that came out at the same time: http://pubs.acs.org/doi/abs/10.1021/ci5006614
What's worrisome is that in the graphical abstract you have nucleophilic substitution at the neopentyl position, a reaction every chemist knows won't proceed rapidly. And it takes place in dimethyl ether solvent, which every chemist knows is a gas.
It looks a bit like what you see in bad teaching materials: chemistry that is almost correct, but won't work well for some reason we are not telling the kids about. Please alleviate my concerns! :-)
This is Jennifer, another of the authors. You're absolutely right about the graphical abstract figure. For this first paper, we used very simple rules to generate our data set of reactions, so you end up with reactions that don't fully capture a human chemist's intuition.
We hope in the future to have access to real experimental data sets for real chemistry, complete with accurate temperatures, pressures, solvents, and reaction yields. Then our algorithm would be able to use all the information and predict reactions accurately. Right now, there just aren't many well-curated data sets with this kind of detail that would work for this kind of training. Happy to receive feedback from any experimental chemists out there with data from their research that they'd like to train on.
That diagram is just meant to show what type the inputs and outputs of the neural network are. I'm not a chemist myself, but the other two authors are.
Heh, not a chemist either, but the parent probably has the same feeling I get when I watch "hackers" in crime dramas break into systems and see a bunch of HTML and CSS on their terminals.
I am interested in why you chose to focus on reaction prediction. As you acknowledge in the introduction, the acquisition of this skill is a routine part of graduate education in synthetic chemistry.
On the other hand, the key difficulty in synthetic chemistry, and the one that occupies the majority of a chemist's time, is the identification of the correct reagent(s), the correct solvent, and the correct time, temperature, and concentration, such that the desired reaction proceeds in a convenient amount of time and with the correct chemo- and regioselectivity, the reaction conditions are tolerated by the rest of the molecule, and the product can be easily isolated from the reaction byproducts.
In my opinion, as long as these problems remain, being able to turn retrosynthetic analysis over to a machine provides little benefit.
Good points. However, even if chemists already know how to predict reactions, giving this knowledge to a machine will allow a much faster search over possible synthetic routes.
I agree that reaction outcomes depend on many other factors besides the reagents. In the future, I'm sure we'll create reaction prediction frameworks that also take these other factors as inputs. The problem right now is that there aren't many datasets that include these extra factors.
We're not advocating turning retrosynthetic analysis over to machines yet. These are just baby steps.
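To make the "faster search over possible synthetic routes" point concrete, here is a minimal sketch of how a reaction predictor could drive a route search. It is not the method from the paper. The function `find_route`, the `TOY_PREDICTIONS` table, and the molecule labels "A" to "D" and reagent labels "r1" to "r4" are all hypothetical placeholders, and a real predictor would be a trained model rather than a lookup table.

```python
from collections import deque

# Hypothetical stand-in for a learned predictor: maps a molecule label to
# (reagent, predicted product) pairs. The labels are not real chemistry.
TOY_PREDICTIONS = {
    "A": [("r1", "B"), ("r2", "C")],
    "B": [("r3", "D")],
    "C": [("r4", "D")],
}


def predict(molecule):
    """Return the (reagent, product) pairs predicted for one molecule."""
    return TOY_PREDICTIONS.get(molecule, [])


def find_route(start, target, predict_fn, max_steps=4):
    """Breadth-first search for a shortest route from start to target.

    Each step in the returned route is (molecule, reagent, product).
    Returns None if no route is found within max_steps reactions.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        molecule, route = queue.popleft()
        if molecule == target:
            return route
        if len(route) >= max_steps:
            continue
        for reagent, product in predict_fn(molecule):
            if product not in seen:
                seen.add(product)
                queue.append((product, route + [(molecule, reagent, product)]))
    return None


if __name__ == "__main__":
    print(find_route("A", "D", predict))
```

The search itself is cheap; the cost is dominated by how many times the predictor is called, which is why a fast learned predictor matters for this use case.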
Did you consider working with a chemistry CRO (say WuXi) to train models using their proprietary datasets? Is there some solution that's a win-win for everyone?
We are definitely considering it. In fact the next iteration of this project will be in collaboration with Wiley ChemPlanner and their database of reactions.
I took three semesters of orgo in undergrad, and if it taught me anything, it's that there are exceptions to nearly every rule. There are so many complicated molecular orbital interactions, requiring years of study. And even then there are always things that break these rules in unexpected ways or produce several products.
How do you overcome this? Can you predict yield percentages of each product? What about chirality?
Can your system design synthesis pathways? Can it optimize for final product yield? How does it handle the thermodynamics and kinetics of reactions?
In any case, cool project. It's a very difficult domain.
One of the things that I didn't quite get when I started taking organic chemistry was that I really couldn't figure out (didn't have enough background knowledge) why many reactions happened the way they did and that I just had to memorize things.
(Accounting is the same way though for very different reasons. You could justify recording some transactions in about 5 different ways--but FASB says only a particular one is OK.)
At the grad level students get really good at rationalizing and predicting reactions. We did a bimonthly exercise called mechanism club where someone would pick some chemical reactions from the literature and basically volunteers would come up to the chalkboard and push electrons till the reaction was complete.
Thanks for the kind words! I would agree that the function we're trying to learn is very complicated, with many exceptions. But that only improves our comparative advantage, at least compared to novice chemists. We might also be able to help our system take advantage of the knowledge of physics that expert chemists use to predict reaction outcomes by giving our network access to the output of a physical reaction simulator.
Extending the system to (try to) predict yield percentages or chirality is straightforward. The hard part, in my mind, is that there isn't a fixed number of reaction types. As molecules get bigger, we'll have to move away from predicting one of a fixed set of reaction types, to directly predicting products - but this is a much harder problem.
Our system probably isn't good enough to design synthesis pathways yet, but that is the eventual goal. A system that also predicted yield would of course help with that, and that would be another straightforward extension.
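As an illustration of what "predicting one of a fixed set of reaction types" means in practice, here is a minimal sketch of a single softmax output layer over a fingerprint vector. It is not the network from the paper: the labels in `LABELS`, the function names, and the untrained random weights are assumptions made for the example.

```python
import math
import random

# Hypothetical, fixed label set. The size of the output layer is tied to it,
# which is the limitation discussed above.
LABELS = ["type_A", "type_B", "null_reaction"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_reaction_type(fingerprint, weights, biases, labels):
    """One linear layer plus softmax: one score per reaction type."""
    scores = [
        sum(w * x for w, x in zip(row, fingerprint)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


if __name__ == "__main__":
    random.seed(0)
    fingerprint = [1, 0, 1, 1, 0, 0, 1, 0]  # illustrative input bits
    # Untrained random weights, so the prediction itself is meaningless.
    weights = [[random.uniform(-1, 1) for _ in fingerprint] for _ in LABELS]
    biases = [0.0 for _ in LABELS]
    print(predict_reaction_type(fingerprint, weights, biases, LABELS))
```

Adding a new reaction type means adding a row of weights and retraining, whereas predicting products directly would need an output that can describe an arbitrary molecule.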
First of all, interesting and fascinating work! It is nice to see some chemistry over here.
I'm a grad student in computational chemistry. I am fascinated by the idea that our imagination, or limits of our chemical intuition, is the limiting factor for all kinds of cool advances. Through that I have recently been studying machine learning and I am interested in using it for catalyst optimization and design.
What is your opinion on the state of computationally assisted inverse design of molecules and the role of machine learning in it? The problem is a bit more open-ended compared to reaction optimization, but I could imagine that after proper formulation of the design guidelines the computer could help a lot.
Thank you! I guess all you need to do to get chemistry on hacker news is add neural nets to it.
I honestly don't know much about synthesis or the state of computationally assisted inverse design of molecules, except that it seems like a great idea, and that it is still early days. As you say, proper formulation of the guidelines is still necessary - as far as I understand, right now the synthesis people still use a lot of judgement at each step, and all these little choices will have to be recorded to build a useful dataset.
What is the state of the art here? What models do we currently have to predict chemical/organic inputs/outcomes?
Us chemists use retrosynthesis to disconnect the target cpd into accessible fragments of appropriate polarity that can be conveniently joined (that's three issues in one), and consider electron flow (i.e. arrow pushing) to think about what reactions would proceed smoothly.
Manipulation of 2D connectivity, as seen in the paper, is new and not for humans.
For predicting reactions using computers, there are several methods out there, some of which use physical (quantum mechanics) calculations. Each reaction can take a long time to run, which is why we've turned to machine learning.
We chose to use fingerprints, i.e. a vector representation of the graphical features of a molecule, for the inputs of our reactions. Fingerprints are often used when machine learning properties of molecules, or for classifying the entire reaction, as David mentioned before. There's one paper from the Baldi group that uses inputs that are more like orbitals, and tries to predict the mechanisms of reactions directly: http://pubs.acs.org/doi/abs/10.1021/ci3003039
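For readers who haven't seen a fingerprint before, here is a toy sketch of the idea of turning a molecule into a fixed-length vector. It is deliberately simplified and is not the fingerprinting method from the paper: it hashes short substrings of a SMILES string, whereas real fingerprints hash atom neighborhoods of the molecular graph. The function names, the 64-bit length, and the choice to concatenate reactant and reagent vectors are assumptions made for the example.

```python
import hashlib


def toy_fingerprint(smiles, n_bits=64, max_len=3):
    """Hash every substring of length 1..max_len into a fixed-length bit vector."""
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # sha256 keeps the mapping stable across runs and machines.
            digest = hashlib.sha256(fragment.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % n_bits
            bits[index] = 1
    return bits


def toy_reaction_fingerprint(reactants, reagent, n_bits=64):
    """Concatenate per-molecule vectors into one input vector for a reaction."""
    vector = []
    for smiles in list(reactants) + [reagent]:
        vector.extend(toy_fingerprint(smiles, n_bits))
    return vector


if __name__ == "__main__":
    # Illustrative inputs: bromoethane as reactant, hydroxide as reagent.
    vector = toy_reaction_fingerprint(["CCBr"], "[OH-]")
    print(len(vector), sum(vector))
```

Because the vector has a fixed length regardless of molecule size, it can be fed straight into a standard neural network; the price is that different fragments can collide on the same bit.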
Is RPA coming back to Harvard?
I skimmed through the paper (through my university's subscription) and found out that the source code & data are (or will be) available on GitHub (yay): https://github.com/jnwei/neural_reaction_fingerprint
I'm just an old geezer programmer, but my daughter is studying e-tox at UC Davis. I have discussed with her how important computer code is becoming in the ability to recreate results in studies.
This is a good example of providing that kind of transparency.
Somewhat off-topic, but there's a professor at UC Davis whose blog I've followed for several years who advocates open, reproducible science. Your daughter might be interested in the workshops his lab runs: https://dib-training.readthedocs.io/en/pub/. The next one is Nov 17.
Thanks, passing it on.
Abstract:
"Reaction prediction remains one of the major challenges for organic chemistry and is a prerequisite for efficient synthetic planning. It is desirable to develop algorithms that, like humans, “learn” from being exposed to examples of the application of the rules of organic chemistry. We explore the use of neural networks for predicting reaction types, using a new reaction fingerprinting method. We combine this predictor with SMARTS transformations to build a system which, given a set of reagents and reactants, predicts the likely products. We test this method on problems from a popular organic chemistry textbook."
Very interesting, great concept! The paper is on my to-read list!
I am only afraid that the datasets you have used might not be of sufficient quality for a neural network application. There are old recipes from when the state of the art in chemistry was at an earlier stage, e.g. before the discovery of specific mechanisms, molecule classes, analytics and general concepts. Also, as mentioned in this thread, there are aspects of the synthetic chemist's work and experience that might not be taken into consideration in this approach.
Yes, the datasets were the real limiting factor for this project, and as you point out, reactions depend on many more things than their reagents. Hopefully this work will inspire someone to build a better dataset of reaction setups and outcomes!
I'm not a chemist, and didn't read the paper, but would it be helpful if the neural network had additional inputs coming from e.g. a (simplified) Schrodinger equation solver?
Orbital energies, or other solutions from the Schrodinger equation, would probably help the prediction if they were included as inputs. If you were to do this, you'd have to be a little careful about the cost of doing the quantum mechanics calculation on a whole data set of reactants in your reaction database, but it could be feasible with a cheap method.
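One way to keep that cost down, sketched here under assumptions, is to run the expensive calculation once per unique molecule and reuse the result wherever that molecule appears in the reaction database. The function `build_inputs` and its arguments are hypothetical; `descriptor_fn` stands for whatever quantum chemistry call you would actually use, and the stand-ins in the demo just return placeholder numbers.

```python
def build_inputs(reactions, fingerprint_fn, descriptor_fn):
    """Build one input row per reaction, caching the expensive descriptors.

    reactions:      list of reactions, each a list of molecule strings
    fingerprint_fn: cheap function, molecule -> list of numbers
    descriptor_fn:  expensive function, molecule -> list of numbers
    Returns (rows, number_of_expensive_calls).
    """
    cache = {}
    rows = []
    for molecules in reactions:
        row = []
        for molecule in molecules:
            if molecule not in cache:
                # The expensive call happens once per unique molecule.
                cache[molecule] = list(descriptor_fn(molecule))
            row.extend(fingerprint_fn(molecule))
            row.extend(cache[molecule])
        rows.append(row)
    return rows, len(cache)


if __name__ == "__main__":
    # Placeholder functions: NOT real chemistry, just shapes of the data.
    def cheap(molecule):
        return [len(molecule)]

    def expensive(molecule):
        return [0.0, 0.0]  # stand-in for e.g. two orbital energies

    reactions = [["CCBr", "[OH-]"], ["CCCBr", "[OH-]"]]
    rows, calls = build_inputs(reactions, cheap, expensive)
    print(len(rows), calls)
```

With many reactions sharing a small pool of reagents, the number of expensive calls grows with the number of distinct molecules rather than with the number of reactions.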
Forgive my ignorance:
What actually makes it hard to predict a chemical reaction? Can't we empirically deduce them from quantum mechanics?
Theoretically, yes, if we knew all of the physics laws accurately enough. However, there are a lot of behaviors that are difficult to predict in the aggregate. That's one of the reasons we end up with different specializations: quantum mechanics, atomic physics, molecular chemistry, organic chemistry, etc.
Another example would be protein folding. Even within the same "level" of chemistry, predicting the three-dimensional structure of a protein molecule based purely on the chemistry we understand and the protein sequence is a hard problem. We're getting better at it, but it's still hard.
I get "Server Busy".