Simulating 500 million years of evolution with a language model


Editor’s summary

A protein sequence encodes the information needed to determine that protein's three-dimensional structure and cellular function. Advances in machine learning and the availability of large public repositories of sequence, structural, and functional data are enabling researchers to understand this code and build on it. Hayes et al. now present ESM3, a protein language model that enables the programmed generation of protein structure and sequence in response to user prompts. The authors demonstrate versatility across a range of motif scaffolding and key word–prompted generation tasks. As an example of the functional sensitivity of ESM3, they produced highly diverged variants of green fluorescent protein that retain the ability to fold and produce the protein-derived chromophore. —Michael A. Funk

Abstract

More than 3 billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here, we show that language models trained at scale on evolutionary data can generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to alignment to improve its fidelity. We have prompted ESM3 to generate fluorescent proteins. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins, which we estimate is equivalent to simulating 500 million years of evolution.
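The abstract's headline number, 58% sequence identity to known fluorescent proteins, is a pairwise-alignment statistic. As a minimal illustration of the metric (not the authors' pipeline, which would use a proper aligner such as MAFFT and a curated fluorescent-protein database), here is a sketch that computes percent identity over the non-gap columns of two pre-aligned sequences; note that identity conventions vary in their choice of denominator:

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned (non-gap) columns with identical residues.

    Both inputs must come from the same alignment, so they have equal
    length; '-' marks a gap. Columns where either sequence has a gap
    are excluded from the denominator (one common convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a, aligned_b)
        if a != "-" and b != "-"  # keep only columns aligned in both
    ]
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Toy aligned fragments (hypothetical, not real GFP sequences):
# 8 non-gap columns, 6 identical -> 75.0% identity.
print(percent_identity("MSKGE-ELFT", "MSKGDAELY-"))
```

Under this convention, the paper's generated protein shares 58 of every 100 aligned residues with its closest known fluorescent-protein relative.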


Supplementary Materials

The PDF file includes:

Materials and Methods

Figs. S1 to S24

Tables S1 to S17

References (62–117)

Other Supplementary Material for this manuscript includes the following:

MDAR Reproducibility Checklist
