Meet Evo, the DNA-trained AI that creates genomes from scratch


ChatGPT, the famous artificial intelligence (AI) chatbot, can summarize Moby Dick, write computer code, and serve up a recipe for chicken à la king because it has much of the written information on the internet at its silicon fingertips. What if it could do the same for DNA?

That’s the advance behind a new study published today in Science. Researchers describe an AI model, schooled on billions of lines of genetic sequences, that can deduce how bacterial and viral genomes operate and use that information to design new proteins and even whole microbial genomes. The model, known as Evo, could help scientists probe evolution, investigate diseases, develop new treatments, and potentially answer a host of other biomedical questions.

“This work is extremely significant,” says computational biologist Arvind Ramanathan of Argonne National Laboratory, who wasn’t connected to the study. The tests the authors put Evo through, he says, provide “a great showcase of applications” for the AI.

Researchers have designed specialized AI models that perform particular tasks related to certain types of molecules. A well-known example is AlphaFold, which predicts the structure of proteins from their amino acid sequences. But ChatGPT and many other AIs are general-purpose programs, what some researchers call foundation models. Their versatility is advantageous because scientists don’t have to build and train a different model for each task, saving time and money. ChatGPT is known as a large language model (LLM) because it works with almost any kind of document containing words, whether that’s a government report or a recipe. In molecular biology, nothing is more fundamental than DNA, and scientists have developed a few foundation models that analyze DNA sequences as if they were words in an LLM. However, these AIs can only interpret and predict relatively short sections of DNA.

Developed to overcome such limitations, Evo is the brainchild of computational biologist Brian Hie of Stanford University and colleagues, including some researchers from the recently formed Arc Institute, which is funded by several philanthropists and focuses on high-risk, high-reward projects. One of the team’s improvements was to increase what’s called the context length, the search window the model uses as it tries to find patterns in DNA. Larger context lengths can increase the model’s ability to identify connections among genes or other DNA sequences. The design also allowed the team to boost Evo’s resolution to the level of individual nucleotides, the building blocks of DNA, whereas previous models had only been able to work with groups of nucleotides.

Once the researchers had built Evo, they gave it 4 weeks of training, during which the model educated itself on 80,000 microbial genomes as well as millions of sequences from bacteria-targeting viruses and semi-independent DNA loops known as plasmids. In theory, malicious users could exploit a model like Evo to design a biological weapon, Hie says, so the researchers excluded from the AI’s training set any sequences from viruses that attack humans or other eukaryotes, the organisms whose cells boast nuclei. Overall, Evo learned from 300 billion nucleotides of sequence information.

To test the AI, the researchers asked it to predict the impact of mutations on protein performance. This knowledge is important for understanding how DNA glitches lead to disease and for designing new drugs. The team checked Evo’s predictions by comparing them with published experiments in which other scientists had induced the same mutations in bacterial cells. Evo bested previous AI models that infer mutation effects from DNA sequence data; it worked about as well as other AI models that rely on protein sequences.

One reason that AI models like ChatGPT are so useful is that they can create new content. “We wanted to show our model had [this] capability,” Hie says. So he and his colleagues told Evo to design new versions of the CRISPR genome editor. This assignment is challenging because CRISPR includes two types of components that have to work together: DNA-slicing Cas proteins and RNA molecules that usher the enzymes to the genome locations to be edited.

Evo first boned up on more than 70,000 bacterial DNA sequences that encoded Cas proteins and their partner RNAs. Then the model devised millions of potential versions of the molecules. The researchers picked the 11 most promising variants of Cas9, the workhorse version of Cas in biotechnology, and synthesized the proteins in the lab.

In test tube experiments, the best of the Evo-designed Cas9 enzymes was as good at cutting DNA as a commercial version of the protein, the researchers found. To improve Cas proteins, scientists have traditionally searched for bacteria with more effective versions of the enzymes. With Evo, Hie says, “we don’t have to wait for evolution to create a new Cas9.” Like many LLMs, however, Evo also “hallucinated,” proposing Cas9s that had no chance of working. Despite its hallucinations, Hie says, the AI is still better at finding new molecular options than “brute-force screening or random guessing.”

In what Hie calls the “most futuristic and crazy” part of the study, the researchers asked Evo to generate DNA sequences that are long enough to serve as genomes for bacteria. They found that these mock genomes carried many of the genes needed by cells but were short on others that are necessary. Still, Hie believes the results could be a step toward AI-designed synthetic genomes.

Foundation models are important because “they enhance our ability to understand and characterize the genome,” says computational biologist Ramana Davuluri of Stony Brook University, who was not involved with the study. “I think this is a big step beyond current models.”

One reason the work stands out is how far the researchers went to experimentally confirm the model’s predictions, says computational biologist Yunha Hwang of the New York City–based nonprofit Tatta Bio, which focuses on improving genomic AI models. “Being able to do laboratory validation is very powerful,” says Hwang, who was not connected to the research. The enormous amount of data that Evo learned from also sets the study apart, adds statistician Chong Wu of the University of Texas MD Anderson Cancer Center. The more information the model absorbs, he says, the more reliable it is.

Much of the work on AI occurs in secret at companies. But the researchers have released Evo publicly so other researchers can use it, and Hie says the team has no plans to commercialize its creation. “For now, I see this as a research project.”