Accurate prediction of protein structures and interactions using a three-track neural network

Deep learning takes on protein folding

In 1972, Anfinsen won a Nobel prize for demonstrating a connection between a protein’s amino acid sequence and its three-dimensional structure. Since 1994, scientists have competed in the biennial Critical Assessment of Structure Prediction (CASP) protein-folding challenge. Deep learning methods took center stage at CASP14, with DeepMind’s AlphaFold2 achieving remarkable accuracy. Baek et al. explored network architectures based on the DeepMind framework. They used a three-track network to process sequence, distance, and coordinate information simultaneously and achieved accuracies approaching those of DeepMind. The method, RoseTTAFold, can solve challenging x-ray crystallography and cryo–electron microscopy modeling problems and generate accurate models of protein-protein complexes. —VV

Abstract

DeepMind presented notably accurate predictions at the recent 14th Critical Assessment of Structure Prediction (CASP14) conference. We explored network architectures that incorporate related ideas and obtained the best performance with a three-track network in which information at the one-dimensional (1D) sequence level, the 2D distance map level, and the 3D coordinate level is successively transformed and integrated. The three-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables the rapid solution of challenging x-ray crystallography and cryo–electron microscopy structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate protein-protein complex models from sequence information alone, short-circuiting traditional approaches that require modeling of individual subunits followed by docking. We make the method available to the scientific community to speed biological research.
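The idea of passing information between 1D, 2D, and 3D representations can be illustrated with a toy sketch. The code below is not the authors' network and contains no learned parameters; the function names (`pair_from_sequence`, `distance_map`, `three_track_step`) and the simple concatenate-and-average scheme are assumptions made purely for illustration. It only shows how per-residue features can be lifted to pair features, how a distance map can be derived from coordinates, and how pair information can be pooled back to the residue level.

```python
import math

def pair_from_sequence(seq_feat):
    """1D -> 2D: pair (i, j) gets the features of residue i followed by those of residue j."""
    return [[fi + fj for fj in seq_feat] for fi in seq_feat]

def distance_map(coords):
    """3D -> 2D: Euclidean distance between every pair of residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def three_track_step(seq_feat, coords):
    """One hypothetical round of information exchange between the three levels."""
    n = len(seq_feat)
    pair = pair_from_sequence(seq_feat)
    dmap = distance_map(coords)
    # Merge the coordinate-derived distance into each pair feature vector.
    pair = [[pair[i][j] + [dmap[i][j]] for j in range(n)] for i in range(n)]
    # 2D -> 1D: average each residue's pair features over all partner residues.
    width = len(pair[0][0])
    seq_update = [
        [sum(pair[i][j][k] for j in range(n)) / n for k in range(width)]
        for i in range(n)
    ]
    return pair, seq_update

# Made-up inputs: three residues, two features each, and arbitrary coordinates.
seq_feat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
coords = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 12.0]]
pair, seq_update = three_track_step(seq_feat, coords)
print(pair[0][1])  # [1.0, 0.0, 0.0, 1.0, 5.0]
```

In a real network each of these fixed operations would be replaced by trainable layers, and the exchange would be repeated over many blocks; the sketch only makes the data shapes at each level concrete.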

Supplementary Material

Summary

Materials and Methods

Figs. S1 to S17

Tables S1 to S4

References (37–82)

MDAR Reproducibility Checklist

Resources

File (abj8754_baek_sm.pdf)

File (abj8754_mdar_reproducibility_checklist.pdf)

References and Notes

1

A. W. Senior, R. Evans, J. Jumper, J. Kirkpatrick, L. Sifre, T. Green, C. Qin, A. Žídek, A. W. R. Nelson, A. Bridgland, H. Penedones, S. Petersen, K. Simonyan, S. Crossan, P. Kohli, D. T. Jones, D. Silver, K. Kavukcuoglu, D. Hassabis, Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).

2

J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, K. Tunyasuvunakool, O. Ronneberger, R. Bates, A. Žídek, A. Bridgland, C. Meyer, S. A. A. Kohl, A. Potapenko, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, M. Steinegger, M. Pacholska, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli, D. Hassabis, “High accuracy protein structure prediction using deep learning” in Fourteenth Critical Assessment of Techniques for Protein Structure Prediction: CASP14 Abstract Book (Protein Structure Prediction Center, 2020), pp. 22–24.

3

J. Yang, I. Anishchenko, H. Park, Z. Peng, S. Ovchinnikov, D. Baker, Improved protein structure prediction using predicted interresidue orientations. Proc. Natl. Acad. Sci. U.S.A. 117, 1496–1503 (2020).

4

I. Anishchenko, T. M. Chidyausiku, S. Ovchinnikov, S. J. Pellock, D. Baker, De novo protein design by deep network hallucination. bioRxiv 2020.07.22.211482 [Preprint] (2020); https://doi.org/10.1101/2020.07.22.211482.

5

S. Chaudhury, S. Lyskov, J. J. Gray, PyRosetta: A script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691 (2010).

6

F. B. Fuchs, D. E. Worrall, V. Fischer, M. Welling, SE(3)-Transformers: 3D roto-translation equivariant attention networks. arXiv:2006.10503 [cs.LG] (2020).

7

J. Haas, A. Barbato, D. Behringer, G. Studer, S. Roth, M. Bertoni, K. Mostaguir, R. Gumienny, T. Schwede, Continuous Automated Model EvaluatiOn (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins 86, 387–398 (2018).

8

L. J. McGuffin, R. Adiyaman, A. H. A. Maghrabi, A. N. Shuid, D. A. Brackenridge, J. O. Nealon, L. S. Philomina, IntFOLD: An integrated web resource for high performance protein structure and function prediction. Nucleic Acids Res. 47, W408–W413 (2019).

9

J. Haas, R. Gumienny, A. Barbato, F. Ackermann, G. Tauriello, M. Bertoni, G. Studer, A. Smolinski, T. Schwede, Introducing “best single template” models as reference baseline for the Continuous Automated Model Evaluation (CAMEO). Proteins 87, 1378–1387 (2019).

10

A. Waterhouse, M. Bertoni, S. Bienert, G. Studer, G. Tauriello, R. Gumienny, F. T. Heer, T. A. P. de Beer, C. Rempfer, L. Bordoli, R. Lepore, T. Schwede, SWISS-MODEL: Homology modelling of protein structures and complexes. Nucleic Acids Res. 46, W296–W303 (2018).

11

A. Jaegle, F. Gimeno, A. Brock, A. Zisserman, O. Vinyals, J. Carreira, Perceiver: General perception with iterative attention. arXiv:2103.03206 [cs.CV] (2021).

12

N. Hiranuma, H. Park, M. Baek, I. Anishchenko, J. Dauparas, D. Baker, Improved protein structure refinement guided by deep learning based accuracy estimation. Nat. Commun. 12, 1340 (2021).

13

M. Steinegger, M. Meier, M. Mirdita, H. Vöhringer, S. J. Haunsberger, J. Söding, HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019).

14

A. J. Kooistra, S. Mordalski, G. Pándy-Szekeres, M. Esguerra, A. Mamyrbekov, C. Munk, G. M. Keserű, D. E. Gloriam, GPCRdb in 2021: Integrating GPCR sequence, structure and function. Nucleic Acids Res. 49, D335–D343 (2021).

15

B. J. Bender, B. Marlow, J. Meiler, Improving homology modeling from low-sequence identity templates in Rosetta: A case study in GPCRs. PLOS Comput. Biol. 16, e1007597 (2020).

16

L. S. Kremer, F. Distelmaier, B. Alhaddad, M. Hempel, A. Iuso, C. Küpper, C. Mühlhausen, R. Kovacs-Nagy, R. Satanovskij, E. Graf, R. Berutti, G. Eckstein, R. Durbin, S. Sauer, G. F. Hoffmann, T. M. Strom, R. Santer, T. Meitinger, T. Klopstock, H. Prokisch, T. B. Haack, Bi-allelic truncating mutations in TANGO2 cause infancy-onset recurrent metabolic crises with encephalocardiomyopathy. Am. J. Hum. Genet. 98, 358–362 (2016).

17

C. Rabouille, V. Kondylis, TANGOing along the protein secretion pathway. Genome Biol. 7, 213 (2006).

18

M. P. Milev, D. Saint-Dic, K. Zardoui, T. Klopstock, C. Law, F. Distelmaier, M. Sacher, The phenotype associated with variants in TANGO2 may be explained by a dual role of the protein in ER-to-Golgi transport and at the mitochondria. J. Inherit. Metab. Dis. 44, 426–437 (2021).

19

S. R. Lalani, P. Liu, J. A. Rosenfeld, L. B. Watkin, T. Chiang, M. S. Leduc, W. Zhu, Y. Ding, S. Pan, F. Vetrini, C. Y. Miyake, M. Shinawi, T. Gambin, M. K. Eldomery, Z. H. C. Akdemir, L. Emrick, Y. Wilnai, S. Schelley, M. K. Koenig, N. Memon, L. S. Farach, B. P. Coe, M. Azamian, P. Hernandez, G. Zapata, S. N. Jhangiani, D. M. Muzny, T. Lotze, G. Clark, A. Wilfong, H. Northrup, A. Adesina, C. A. Bacino, F. Scaglia, P. E. Bonnen, J. Crosson, J. Duis, G. H. B. Maegawa, D. Coman, A. Inwood, J. McGill, E. Boerwinkle, B. Graham, A. Beaudet, C. M. Eng, N. A. Hanchard, F. Xia, J. S. Orange, R. A. Gibbs, J. R. Lupski, Y. Yang, Recurrent muscle weakness with rhabdomyolysis, metabolic crises, and cardiac arrhythmia due to bi-allelic TANGO2 mutations. Am. J. Hum. Genet. 98, 347–357 (2016).

20

T. G. Wolfsberg, P. Primakoff, D. G. Myles, J. M. White, ADAM, a novel family of membrane proteins containing A Disintegrin And Metalloprotease domain: Multipotential functions in cell-cell and cell-matrix interactions. J. Cell Biol. 131, 275–278 (1995).

21

T. Klein, R. Bischoff, Active metalloproteases of the A Disintegrin and Metalloprotease (ADAM) family: Biological function and structure. J. Proteome Res. 10, 17–33 (2011).

22

S. Zhong, R. A. Khalil, A Disintegrin and Metalloproteinase (ADAM) and ADAM with thrombospondin motifs (ADAMTS) family in vascular biology and disease. Biochem. Pharmacol. 164, 188–204 (2019).

23

P. Orth, P. Reichert, W. Wang, W. W. Prosise, T. Yarosh-Tomaine, G. Hammond, R. N. Ingram, L. Xiao, U. A. Mirza, J. Zou, C. Strickland, S. S. Taremi, H. V. Le, V. Madison, Crystal structure of the catalytic domain of human ADAM33. J. Mol. Biol. 335, 129–137 (2004).

24

S. Takeda, T. Igarashi, H. Mori, S. Araki, Crystal structures of VAP1 reveal ADAMs’ MDC domain architecture and its unique C-shaped scaffold. EMBO J. 25, 2388–2396 (2006).

25

D. R. Flower, A. C. North, C. E. Sansom, The lipocalin protein family: Structural and sequence overview. Biochim. Biophys. Acta 1482, 9–24 (2000).

26

H. E. Van Wart, H. Birkedal-Hansen, The cysteine switch: A principle of regulation of metalloproteinase activity with potential applicability to the entire matrix metalloproteinase gene family. Proc. Natl. Acad. Sci. U.S.A. 87, 5578–5582 (1990).

27

M. Levy, A. H. Futerman, Mammalian ceramide synthases. IUBMB Life 62, 347–356 (2010).

28

J. L. Kim, B. Mestre, S.-H. Shin, A. H. Futerman, Ceramide synthases: Reflections on the impact of Dr. Lina M. Obeid. Cell. Signal. 82, 109958 (2021).

29

E. Winter, C. P. Ponting, TRAM, LAG1 and CLN8: Members of a novel family of lipid-sensing domains? Trends Biochem. Sci. 27, 381–383 (2002).

30

S. Spassieva, J.-G. Seo, J. C. Jiang, J. Bielawski, F. Alvarez-Vasquez, S. M. Jazwinski, Y. A. Hannun, L. M. Obeid, Necessary role for the Lag1p motif in (dihydro)ceramide synthase activity. J. Biol. Chem. 281, 33931–33938 (2006).

31

N. Vanni, F. Fruscione, E. Ferlazzo, P. Striano, A. Robbiano, M. Traverso, T. Sander, A. Falace, E. Gazzerro, P. Bramanti, J. Bielawski, A. Fassio, C. Minetti, P. Genton, F. Zara, Impairment of ceramide synthesis causes a novel progressive myoclonus epilepsy. Ann. Neurol. 76, 206–212 (2014).

32

Q. Cong, I. Anishchenko, S. Ovchinnikov, D. Baker, Protein interaction networks revealed by proteome coevolution. Science 365, 185–189 (2019).

33

Y. Zhang, J. Skolnick, Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004).

34

C. R. Glassman, Y. K. Mathiharan, K. M. Jude, L. Su, O. Panova, P. J. Lupardus, J. B. Spangler, L. K. Ely, C. Thomas, G. Skiniotis, K. C. Garcia, Structural basis for IL-12 and IL-23 receptor sharing reveals a gateway for shaping actions on T versus NK cells. Cell 184, 983–999.e24 (2021).

35

E. F. Pettersen, T. D. Goddard, C. C. Huang, E. C. Meng, G. S. Couch, T. I. Croll, J. H. Morris, T. E. Ferrin, UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Sci. 30, 70–82 (2021).

36

M. Baek, F. DiMaio, I. Anishchenko, J. Dauparas, S. Ovchinnikov, J. Wang, D. Baker, RoseTTAFold: The first release of RoseTTAFold. Zenodo (2021); https://zenodo.org/record/5068265.

37

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need. arXiv:1706.03762 [cs.CL] (2017).

38

J. Ho, N. Kalchbrenner, D. Weissenborn, T. Salimans, Axial attention in multidimensional transformers. arXiv:1912.12180 [cs.CV] (2019).

39

K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, A. Weller, Rethinking attention with Performers. arXiv:2009.14794 [cs.LG] (2020).

40

R. Rao, J. Liu, R. Verkuil, J. Meier, J. F. Canny, P. Abbeel, T. Sercu, A. Rives, MSA Transformer. bioRxiv 2021.02.12.430858 [Preprint] (2021).

41

F. Ju, J. Zhu, B. Shao, L. Kong, T.-Y. Liu, W.-M. Zheng, D. Bu, CopulaNet: Learning residue co-evolution directly from multiple sequence alignment for protein structure prediction. Nat. Commun. 12, 2535 (2021).

42

Y. Shi, Z. Huang, S. Feng, H. Zhong, W. Wang, Y. Sun, Masked label prediction: Unified message passing model for semi-supervised classification. arXiv:2009.03509 [cs.LG] (2020).

43

V. Mariani, M. Biasini, A. Barbato, T. Schwede, lDDT: A local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722–2728 (2013).

44

M. Mirdita, L. von den Driesch, C. Galiez, M. J. Martin, J. Söding, M. Steinegger, Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017).

45

M. Steinegger, M. Mirdita, J. Söding, Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods 16, 603–606 (2019).

46

L. Zimmermann, A. Stephens, S.-Z. Nam, D. Rau, J. Kübler, M. Lozajic, F. Gabler, J. Söding, A. N. Lupas, V. Alva, A completely reimplemented MPI bioinformatics toolkit with a new HHpred server at its core. J. Mol. Biol. 430, 2237–2243 (2018).

47

G. Bunkóczi, R. J. Read, Improvement of molecular-replacement models with Sculptor. Acta Crystallogr. D Biol. Crystallogr. 67, 303–312 (2011).

48

G. Bunkóczi, R. J. Read, phenix.ensembler: A tool for multiple superposition. Comput. Crystallogr. Newsl. 2, 8–9 (2011).

49

A. J. McCoy, R. W. Grosse-Kunstleve, P. D. Adams, M. D. Winn, L. C. Storoni, R. J. Read, Phaser crystallographic software. J. Appl. Crystallogr. 40, 658–674 (2007).

50

A. Vagin, A. Lebedev, MoRDa, an automatic molecular replacement pipeline. Acta Crystallogr. A Found. Adv. A71, S19 (2015).

51

Y. Wang, J. Virtanen, Z. Xue, Y. Zhang, I-TASSER-MR: Automated molecular replacement for distant-homology proteins using iterative fragment assembly and progressive sequence truncation. Nucleic Acids Res. 45, W429–W434 (2017).

52

A. J. McCoy, R. D. Oeffner, A. G. Wrobel, J. R. M. Ojala, K. Tryggvason, B. Lohkamp, R. J. Read, Ab initio solution of macromolecular crystal structures without direct methods. Proc. Natl. Acad. Sci. U.S.A. 114, 3637–3641 (2017).

53

G. Bunkóczi, B. Wallner, R. J. Read, Local error estimates dramatically improve the utility of homology models for solving crystal structures by molecular replacement. Structure 23, 397–406 (2015).

54

T. C. Terwilliger, Maximum-likelihood density modification. Acta Crystallogr. D Biol. Crystallogr. 56, 965–972 (2000).

55

D. Liebschner, P. V. Afonine, M. L. Baker, G. Bunkóczi, V. B. Chen, T. I. Croll, B. Hintze, L. W. Hung, S. Jain, A. J. McCoy, N. W. Moriarty, R. D. Oeffner, B. K. Poon, M. G. Prisant, R. J. Read, J. S. Richardson, D. C. Richardson, M. D. Sammito, O. V. Sobolev, D. H. Stockwell, T. C. Terwilliger, A. G. Urzhumtsev, L. L. Videau, C. J. Williams, P. D. Adams, Macromolecular structure determination using x-rays, neutrons and electrons: Recent developments in Phenix. Acta Crystallogr. D Struct. Biol. 75, 861–877 (2019).

56

T. C. Terwilliger, R. W. Grosse-Kunstleve, P. V. Afonine, N. W. Moriarty, P. H. Zwart, L. W. Hung, R. J. Read, P. D. Adams, Iterative model building, structure refinement and density modification with the PHENIX AutoBuild wizard. Acta Crystallogr. D Biol. Crystallogr. 64, 61–69 (2008).

57

P. Emsley, B. Lohkamp, W. G. Scott, K. Cowtan, Features and development of Coot. Acta Crystallogr. D Biol. Crystallogr. 66, 486–501 (2010).

58

P. V. Afonine, R. W. Grosse-Kunstleve, N. Echols, J. J. Headd, N. W. Moriarty, M. Mustyakimov, T. C. Terwilliger, A. Urzhumtsev, P. H. Zwart, P. D. Adams, Towards automated crystallographic structure refinement with phenix.refine. Acta Crystallogr. D Biol. Crystallogr. 68, 352–367 (2012).

59

C. J. Williams, J. J. Headd, N. W. Moriarty, M. G. Prisant, L. L. Videau, L. N. Deis, V. Verma, D. A. Keedy, B. J. Hintze, V. B. Chen, S. Jain, S. M. Lewis, W. B. Arendall III, J. Snoeyink, P. D. Adams, S. C. Lovell, J. S. Richardson, D. C. Richardson, MolProbity: More and better reference data for improved all-atom structure validation. Protein Sci. 27, 293–315 (2018).

60

R. J. Read, A. J. McCoy, Using SAD data in Phaser. Acta Crystallogr. D Biol. Crystallogr. 67, 338–344 (2011).

61

J. Xu, M. McPartlon, J. Li, Improved protein structure prediction by deep learning irrespective of co-evolution information. Nat. Mach. Intell. (2021).

62

J. Yang, Y. Zhang, I-TASSER server: New development for protein structure and function predictions. Nucleic Acids Res. 43, W174–W181 (2015).

63

D. Xu, Y. Zhang, Toward optimal fragment generations for ab initio protein structure assembly. Proteins 81, 229–239 (2013).

64

The UniProt Consortium, UniProt: The universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017).

65

J. Pei, N. V. Grishin, The DBSAV database: Predicting deleteriousness of single amino acid variations in the human proteome. J. Mol. Biol. 433, 166915 (2021).

66

L. S. Johnson, S. R. Eddy, E. Portugaly, Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 11, 431 (2010).

67

S. El-Gebali, J. Mistry, A. Bateman, S. R. Eddy, A. Luciani, S. C. Potter, M. Qureshi, L. J. Richardson, G. A. Salazar, A. Smart, E. L. L. Sonnhammer, L. Hirsh, L. Paladin, D. Piovesan, S. C. E. Tosatto, R. D. Finn, The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2019).

68

S. Bienert, A. Waterhouse, T. A. P. de Beer, G. Tauriello, G. Studer, L. Bordoli, T. Schwede, The SWISS-MODEL Repository-new features and functionality. Nucleic Acids Res. 45, D313–D319 (2017).

69

B. Mészáros, G. Erdos, Z. Dosztányi, IUPred2A: Context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res. 46, W329–W337 (2018).

70

J. Hanson, K. K. Paliwal, T. Litfin, Y. Zhou, SPOT-Disorder2: Improved protein intrinsic disorder prediction by ensembled deep learning. Genomics Proteomics Bioinformatics 17, 645–656 (2019).

71

F. Gabler, S.-Z. Nam, S. Till, M. Mirdita, M. Steinegger, J. Söding, A. N. Lupas, V. Alva, Protein sequence analysis using the MPI Bioinformatics Toolkit. Curr. Protoc. Bioinformatics 72, e108 (2020).

72

H. Cheng, R. D. Schaeffer, Y. Liao, L. N. Kinch, J. Pei, S. Shi, B.-H. Kim, N. V. Grishin, ECOD: An evolutionary classification of protein domains. PLOS Comput. Biol. 10, e1003926 (2014).

73

R. Ayoub, Y. Lee, RUPEE: A fast and accurate purely geometric protein structure search. PLOS ONE 14, e0213712 (2019).

74

J. Pei, N. V. Grishin, AL2CO: Calculation of positional conservation in a protein sequence alignment. Bioinformatics 17, 700–712 (2001).

75

K. Katoh, D. M. Standley, MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).

76

A. M. Altenhoff, C.-M. Train, K. J. Gilbert, I. Mediratta, T. Mendes de Farias, D. Moi, Y. Nevers, H.-S. Radoykova, V. Rossier, A. Warwick Vesztrocy, N. M. Glover, C. Dessimoz, OMA orthology in 2021: Website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. 49, D373–D379 (2021).

77

P. Benkert, M. Biasini, T. Schwede, Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics 27, 343–350 (2011).

78

L. Holm, Using Dali for protein structure comparison. Methods Mol. Biol. 2112, 29–42 (2020).

79

S. J. Hubbard, J. M. Thornton, “naccess,” computer program (Department of Biochemistry and Molecular Biology, University College London, 1993).

80

A. Lafita, S. Bliven, A. Kryshtafovych, M. Bertoni, B. Monastyrskyy, J. M. Duarte, T. Schwede, G. Capitani, Assessment of protein assembly prediction in CASP12. Proteins 86 (Suppl 1), 247–256 (2018).

81

P. Conway, M. D. Tyka, F. DiMaio, D. E. Konerding, D. Baker, Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci. 23, 47–55 (2014).

82

M. A. Larkin, G. Blackshields, N. P. Brown, R. Chenna, P. A. McGettigan, H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson, T. J. Gibson, D. G. Higgins, Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947–2948 (2007).