Protein Design the AI Way

Here’s some of the latest work on de novo protein design, a field that has been changing very rapidly indeed. A few years ago, it was a collection of a few very-hard-won partial successes (and many other unreported failures). But the success of the machine-learning approaches to protein structure (AlphaFold, RoseTTAFold et al.) dramatically shook things up, and the shaking up continues.

As this overview goes into, one of these approaches, christened “hallucination” by the Baker lab at Washington, starts from a more or less random string of amino acids and refines it with the ML tools above into something that should fold into a specific desired three-dimensional structure. You can also start from a bit less of a blank slate with the “inpainting” technique, where a smaller piece of a known protein structure is isolated and gets more built up around it computationally. Both these techniques have had successes, but plenty of failures as well - they can break down in making longer proteins, or fail to produce soluble candidates that don’t aggregate. Another way to generate new proteins is to adapt the techniques used in large language models, and I wrote about that technique here.
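To make the "start random, then refine" idea concrete, here is a toy sketch of that kind of search loop. It is emphatically not the Baker lab's code: the real method scores candidates with a structure-prediction network, while `toy_score` below is a hypothetical stand-in that only checks whether each residue is hydrophobic or polar against an invented target pattern. The two-class residue split, the pattern string, and the greedy acceptance rule are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVLIMFWC")  # crude two-class split, for illustration only


def toy_score(sequence, target_pattern):
    """Hypothetical stand-in for a structure predictor: counts positions whose
    hydrophobic (H) / polar (P) class matches the target pattern."""
    return sum(
        ("H" if aa in HYDROPHOBIC else "P") == want
        for aa, want in zip(sequence, target_pattern)
    )


def hallucinate(target_pattern, n_steps=2000, seed=0):
    """Greedy random search: mutate one position at a time and keep the
    change only if the score does not get worse."""
    rng = random.Random(seed)
    # Start from a random string of amino acids.
    seq = [rng.choice(AMINO_ACIDS) for _ in target_pattern]
    current = toy_score(seq, target_pattern)
    for _ in range(n_steps):
        pos = rng.randrange(len(seq))
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)
        new = toy_score(seq, target_pattern)
        if new >= current:
            current = new      # accept the mutation
        else:
            seq[pos] = old     # reject and revert
    return "".join(seq), current


design, score = hallucinate("HPPHHPPHHPPH")
print(design, score)
```

Swapping `toy_score` for a real predictor is where all the difficulty (and all the compute) lives; the loop itself really is about this simple.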

The latest methods, though, are based on the diffusion neural network approaches that are used to generate prompted images and artwork (Stable Diffusion, DALL-E, Midjourney and so on). Greatly simplified, these work by being trained on real data (photos or human-created images, in the case of the three mentioned), which then have Gaussian noise added to them (forward diffusion), and the neural network’s task is to “learn” to reverse this and produce an actual image again (denoising or reverse diffusion). This allows new images to be produced (from various types of noise!) that can resemble the images used to train the models but do not actually copy them. As many readers will be aware, this is setting off all kinds of fireworks in copyright law, as well as profoundly mixed emotions from human illustrators and artists, along with reams of speculation about what we mean by words like “creativity”, “copy”, “inspiration”, “imitation”, “original”, etc.
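The forward (noising) half of that process is simple enough to sketch in a few lines. The version below assumes the standard Gaussian formulation with a linear noise schedule, where a noised sample at step t can be drawn in one shot as sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise. The schedule endpoints, the step count, and the four-number "data" list are illustrative assumptions, not values taken from any of the models named above.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t, linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def forward_diffuse(x0, t, rng, num_steps=1000):
    """Draw a noised sample x_t from clean data x0; also return the noise used."""
    a_bar = alpha_bar(t, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
        for x, e in zip(x0, eps)
    ]
    return xt, eps


rng = random.Random(0)
x0 = [1.0, -0.5, 0.25, 2.0]                    # stand-in for real training data
x_early, eps_early = forward_diffuse(x0, 10, rng)    # still close to x0
x_late, eps_late = forward_diffuse(x0, 1000, rng)    # essentially pure noise
print(x_early)
print(x_late)
```

The network's job in training is to look at a noised sample and guess the noise that was added; if it guesses correctly, the clean data can be recovered by inverting the formula above, and that is the step that gets run over and over in the reverse direction to generate something new.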

There are no such concerns - well, not yet - with the extension of these techniques to protein design. There are several of these working right now (RFdiffusion, for example). In this case, you do the forward diffusion step with the structures of real proteins from the PDB. Then you let the models grind away on de-noising until they can spit out plausible protein structures in the reverse direction. If you just hand such software a pile of noise, it generates proteins that aren’t in the PDB, but don’t seem to have anything wrong with them, and whose structures are as believable as any real one. Experimentally, about half of them turn out both to be soluble and to show CD spectra consistent with the designed degree of helicity, etc. The new paper linked in the opening of this blog also shows cryo-EM structures of many proteins that match the aimed-for structures.

And if you add some spatial constraints along the way during the denoising step, you can aim the results toward specific shapes - and that includes complementary shapes to other protein surfaces. The paper tests this with the p53/MDM2 interaction, generating new replacement candidates for p53 in this pair, some of which are two to three orders of magnitude more potent binders than p53 itself. Binders to several other protein surfaces are shown as well.
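One common way to steer a denoising run, sketched here as a toy and not as a description of how RFdiffusion actually does it, is to add the gradient of a constraint penalty to every update step. In the sketch below, a handful of numbers stand in for atomic coordinates, the trained network is replaced by the score of a plain standard-normal prior (a loud assumption: a real model learns this from PDB structures), and the "spatial constraint" is a quadratic penalty pulling one coordinate toward a chosen target. The function name, the weight, and the step sizes are all hypothetical.

```python
import math
import random


def guided_denoise(n_coords, constrained_index, target, weight=9.0,
                   n_steps=200, step=0.05, seed=0):
    """Toy guided sampler: noisy updates with an extra pull on one coordinate."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(n_coords)]
    for k in range(n_steps):
        # Injected noise shrinks linearly and is exactly zero on the last step.
        sigma = 1.0 - (k + 1) / n_steps
        for i in range(n_coords):
            # Stand-in for the trained network: score of a standard normal prior.
            drift = -x[i]
            if i == constrained_index:
                # Guidance: negative gradient of 0.5 * weight * (x - target)**2.
                drift += weight * (target - x[i])
            x[i] += step * drift + math.sqrt(2.0 * step) * sigma * rng.gauss(0.0, 1.0)
    return x


coords = guided_denoise(n_coords=6, constrained_index=0, target=2.0)
print(coords)
```

With these settings the constrained coordinate settles close to weight * target / (1 + weight) = 1.8, a compromise between what the "model" prefers and what the constraint demands, while the unconstrained coordinates relax toward the prior. That compromise is the whole point: the constraint bends the output toward the shape you want without throwing away what the model has learned about plausible structures.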

Overall, Baker’s team estimates that they have about a 15% success rate with such designs, which is far, far above where things were just two or three years ago. And that rate may have already improved. The bottleneck is making and testing the proteins themselves; these techniques are spitting out so many plausible hits that it’s hard to keep up. Of course, that’s a better situation than we used to face, a long list of things where almost none of them actually worked. There are worse problems!

This ability to just come up with new proteins on demand is of course going to permanently alter protein science, to the point that it’s eventually going to be hard to explain to the young ‘uns what it used to be like without it. The possibilities for chemical biology, model systems, and eventually outright therapeutics are so numerous that it’s hard to even know where to start. But to get these to work, we’ll need to know more than a shape we’re trying to mimic: toxicity, immunogenicity, and binding to other proteins that we haven’t even studied are all factors that will have to be explored experimentally. All that knowledge (and what we know about these topics already) will surely get fed back into more advanced versions of protein-design software.

Another challenge will be attempting to design functional proteins like enzymes, or things that need to change their shape in different situations. The current diffusion models are working off more or less a static picture, and while that can take you very far indeed, it leaves a lot on the table. Extending these to more dynamic states is (as you’d imagine) a very hot topic of current research.

And keep in mind, these are all proteins that are based off of known folds and structural motifs (after all, the software was trained on the PDB). You have to wonder what else is out there: are there useful designs that evolution has never gotten around to exploring? That’s a longstanding question in the field, and answering that one is going to take abilities that we don’t have even now. But the ones we have, and the ones that we are gaining right now - they really are something to see. . .