Abstract: In 2022, Meta AI Research released ESMFold, the first large-scale single-sequence structure prediction method built on a protein language model. Using the ESM-2 language model, trained at scales of up to 15 billion parameters, ESMFold infers atomic-level 3D structure directly from an amino acid sequence without a multiple sequence alignment (MSA). Its accuracy is comparable to RoseTTAFold on CAMEO and below AlphaFold2 on CASP14, while inference is up to 60x faster than AlphaFold2 on short sequences.
1. Background: From MSA Dependency to Single-Sequence Prediction Paradigm
The core challenge in protein structure prediction lies in inferring three-dimensional conformations from primary sequences. Traditional methods rely on multiple sequence alignment (MSA) to extract co-evolutionary information: amino acid covariation across homologous sequences indicates which residue pairs are spatially close. State-of-the-art methods such as AlphaFold2 and RoseTTAFold are built on this paradigm and achieve near-experimental accuracy through deep integration of MSA information.
However, MSA construction requires searching massive sequence databases, a process that can exceed 10 minutes when using high-sensitivity search protocols, creating a computational bottleneck.
The emergence of protein language models (PLMs) offers a way around this limitation. PLMs are trained with a masked language modeling objective on millions of evolutionarily diverse protein sequences, and in doing so learn the statistical dependencies among the amino acids of a sequence. Researchers hypothesized that, because evolutionary constraints encode a protein's structure and function in its sequence patterns, a language model may implicitly learn structural information while predicting masked amino acids. If the hypothesis holds, 3D structure could be decoded directly from the model's internal representations, bypassing MSA construction entirely.
2. ESM-2: Scale-Driven Emergence of Structural Information
The core of ESMFold is the ESM-2 language model family, whose members range from 8 million to 15 billion parameters, a span of more than three orders of magnitude. All models use the Transformer architecture and are trained with a masked language modeling objective: a portion of the amino acids in each sequence is randomly masked, and the model must predict the identity of each masked position from its context.
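The masking step described above can be sketched in a few lines. This is a minimal illustration, not the ESM-2 training code: the 15% masking fraction, the `<mask>` token name, and the function name `mask_sequence` are assumptions made for the example, since the text only says that portions of the amino acids are masked.

```python
import random

MASK = "<mask>"  # placeholder token name (assumption)


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Build one masked-language-modeling example.

    Returns the token list with some residues replaced by MASK, plus a dict
    mapping each masked position to the residue the model must predict.
    mask_fraction=0.15 is an assumed value, not taken from the text.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # ground truth for the training loss
        tokens[pos] = MASK
    return tokens, targets


tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

The model sees `tokens` and is scored only on the positions in `targets`, so everything it learns about structure has to come from predicting residues from their sequence context.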
Although the training objective involves only sequences, the authors found that structural information emerges predictably as model scale increases. Language modeling performance is measured by perplexity, which can be read as the effective number of amino acids the model is choosing among at each masked position: 1 for a perfect model, 20 for uniform guessing over the standard amino acids. After 270,000 training steps, the 8-million-parameter model reached a perplexity of 10.45, while the 15-billion-parameter model reached 6.37, indicating that the larger model captures substantially more of the statistical structure of protein sequences.
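To make the perplexity figures concrete, the following sketch computes perplexity as the exponential of the mean negative log-likelihood, which is the standard definition. The function name and the toy probability lists are illustrative only; a real evaluation would use the probabilities a model assigns to the true residue at each masked position.

```python
import math


def perplexity(true_residue_probs):
    """Exponential of the mean negative log-likelihood.

    true_residue_probs: probability the model assigned to the correct
    amino acid at each masked position.
    """
    nll = [-math.log(p) for p in true_residue_probs]
    return math.exp(sum(nll) / len(nll))


uniform = perplexity([1 / 20] * 100)  # random guessing over 20 amino acids
perfect = perplexity([1.0] * 100)     # always certain and always correct
```

Uniform guessing gives a perplexity of 20 and a perfect model gives 1, so the reported values of 10.45 and 6.37 sit between those two bounds.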
More importantly, this improvement in sequence modeling capability is highly correlated with the emergence of structure prediction ability.
Two Levels of Structural Information Emergence
Low-resolution level: Transformer attention patterns naturally correspond to residue contact maps. By extracting contact predictions from attention maps through linear projection, researchers found that long-range contact prediction accuracy continuously improves with increasing model scale. For proteins with high evolutionary depth (those with more homologous sequences in the training set), improvements saturate at smaller scales; for proteins with low evolutionary depth, improvements continue to the maximum scale.
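A linear projection from attention maps to contacts might look like the sketch below. It assumes that each head's attention map is symmetrized and that the weighted sum is passed through a sigmoid; neither detail is stated in the text, and the weights, bias, and toy attention values are made up. In practice the weights would be fit on proteins with known structures.

```python
import math


def contact_probabilities(attention, weights, bias):
    """attention[h][i][j]: attention from residue i to residue j in head h.

    Returns an L x L matrix of contact probabilities: a linear combination
    of the symmetrized heads followed by a sigmoid.
    """
    n_heads, length = len(attention), len(attention[0])
    probs = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(length):
            logit = bias
            for h in range(n_heads):
                symmetric = 0.5 * (attention[h][i][j] + attention[h][j][i])
                logit += weights[h] * symmetric
            probs[i][j] = 1.0 / (1.0 + math.exp(-logit))
    return probs


# Toy input: 2 heads over a 3-residue sequence (values are made up).
toy_attention = [
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.6, 0.1, 0.3]],
    [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.5, 0.2, 0.3]],
]
toy_probs = contact_probabilities(toy_attention, weights=[4.0, -1.0], bias=-2.0)
```

Because only one weight per head is learned, any contact signal recovered this way must already be present in the attention patterns of the language model.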
High-resolution level: Researchers used equivariant Transformers to project atomic coordinates from language model internal representations. The 15-billion-parameter model achieved a TM-score of 0.71 on the CAMEO test set and 0.54 on the CASP14 test set, an improvement of 0.064 points over the 150-million-parameter model. Notably, perplexity and TM-score show nearly perfect negative correlation (CASP14: -0.99, CAMEO: -1.00), indicating a deep connection between language modeling objectives and structure learning.
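TM-score is used throughout this article without a definition, so a short sketch may help. It follows the standard formula, in which each aligned residue pair contributes 1 / (1 + (d / d0)^2) and d0 = 1.24 * (L - 15)^(1/3) - 1.8 depends on the reference length L. The sketch scores one fixed superposition; the full metric takes the maximum over superpositions, which is omitted here. The floor of 0.5 on d0 for short chains is a simplifying guard chosen for this example.

```python
def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    distances: distances in angstroms between aligned C-alpha pairs.
    target_length: number of residues in the reference structure.
    """
    if target_length > 15:
        d0 = max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)
    else:
        d0 = 0.5  # guard for very short chains (assumption for this sketch)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


identical = tm_score([0.0] * 100, 100)   # every residue superimposed exactly
half_aligned = tm_score([0.0] * 50, 100)  # only half the residues aligned
```

Scores run from 0 to 1, and because the sum is normalized by the reference length, unaligned residues lower the score rather than being ignored.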
3. ESMFold Architecture: End-to-End Single-Sequence Prediction
Building on the language modeling capabilities of ESM-2, the researchers developed the ESMFold structure prediction network. The protein sequence is passed through ESM-2, and the language model's internal representations are handed to the folding head. The folding head consists of a series of folding blocks that alternately update a sequence representation and a pair representation; its output is fed into an equivariant Transformer structure module. After three cycles of iterative refinement, the system outputs atomic-level coordinates and confidence predictions.
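The data flow described above can be laid out as a control-flow sketch. Every function here is a stand-in with toy arithmetic, not a real network, and the names, the number of folding blocks, and the choice to run the language model once and repeat only the folding blocks and structure module in each cycle are assumptions made for illustration.

```python
def language_model(sequence):
    """Stand-in for ESM-2: returns one number per residue."""
    return [float(ord(residue) % 5) for residue in sequence]


def folding_block(seq_repr, pair_repr):
    """Stand-in block: update the pair track, then the sequence track."""
    n = len(seq_repr)
    pair_repr = [[pair_repr[i][j] + seq_repr[i] + seq_repr[j] for j in range(n)]
                 for i in range(n)]
    seq_repr = [seq_repr[i] + sum(pair_repr[i]) / n for i in range(n)]
    return seq_repr, pair_repr


def structure_module(seq_repr, pair_repr):
    """Stand-in for the structure module: coordinates plus confidence."""
    coords = [(float(i), value, 0.0) for i, value in enumerate(seq_repr)]
    confidence = [0.5 for _ in seq_repr]
    return coords, confidence


def predict_structure(sequence, n_blocks=2, n_cycles=3):
    trace = ["language_model"]
    seq_repr = language_model(sequence)  # assumed to run once per sequence
    n = len(seq_repr)
    pair_repr = [[0.0] * n for _ in range(n)]
    coords, confidence = None, None
    for _cycle in range(n_cycles):
        for _block in range(n_blocks):
            seq_repr, pair_repr = folding_block(seq_repr, pair_repr)
            trace.append("folding_block")
        coords, confidence = structure_module(seq_repr, pair_repr)
        trace.append("structure_module")
    return coords, confidence, trace


coords, confidence, trace = predict_structure("MKTAYIAK")
```

The point of the sketch is the ordering: no alignment or template lookup appears anywhere in the loop, and the only input is the sequence itself.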
Comparison with AlphaFold2 and RoseTTAFold
Compared to AlphaFold2 and RoseTTAFold, ESMFold's architecture is significantly simplified. AlphaFold2 and RoseTTAFold integrate MSA information through dedicated modules such as AlphaFold2's Evoformer, which applies attention across the rows and columns of the MSA. ESMFold removes the MSA construction and template search steps entirely and relies only on the representations that the language model extracts from a single sequence.
This simplification brings speed advantages: on NVIDIA V100 GPUs, ESMFold predicts structures for 384-residue proteins in 14.2 seconds, 6x faster than single-model AlphaFold2; on shorter sequences, the speedup can reach approximately 60x. When accounting for MSA search time (over 10 minutes for high-sensitivity protocols), overall acceleration can reach one to two orders of magnitude.
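The end-to-end figure can be checked with simple arithmetic on the numbers quoted above. This is a rough calculation only: the AlphaFold2 model time is implied by the stated 6x ratio rather than measured, and the MSA search time uses the 10-minute lower bound.

```python
esmfold_seconds = 14.2        # 384 residues on a V100, from the text
alphafold2_speedup = 6        # "6x faster than single-model AlphaFold2"
msa_search_seconds = 10 * 60  # lower bound: "over 10 minutes"

# Model-only time for AlphaFold2 implied by the 6x ratio.
alphafold2_model_seconds = esmfold_seconds * alphafold2_speedup

# Speedup once the MSA search is counted as part of the AlphaFold2 pipeline.
end_to_end_speedup = (msa_search_seconds + alphafold2_model_seconds) / esmfold_seconds

print(f"implied AlphaFold2 model time: {alphafold2_model_seconds:.1f} s")
print(f"end-to-end speedup with MSA search: {end_to_end_speedup:.0f}x")
```

For a 384-residue protein this gives about 48x, which falls within the one to two orders of magnitude stated above; longer MSA searches would push the ratio higher.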
Accuracy Performance
In terms of accuracy, ESMFold achieves an average TM-score of 0.83 on the CAMEO test set (194 structures), comparable to RoseTTAFold (0.82); on the CASP14 test set (51 structures), it reaches 0.68, well below AlphaFold2 with full MSA and templates (0.85). The gap is more pronounced on CASP14, likely because that test set contains more orphan proteins (proteins lacking homologous sequences). These targets are difficult for any method that depends on evolutionary signal, including a language model that saw few related sequences during training.
Interestingly, when MSA inputs are removed from AlphaFold2 and RoseTTAFold, their performance drops significantly below ESMFold, indicating that ESMFold has advantages in single-sequence scenarios.
Confidence Scoring
ESMFold's confidence score (pLDDT, reported here on a 0 to 1 scale) is well calibrated. On CAMEO, high-confidence predictions (pLDDT > 0.7) achieve an LDDT of 0.83, close to AlphaFold2's 0.85. When confidence is very high (pLDDT > 0.9), the median all-atom RMSD95 is 1.42 Å, with a backbone RMSD95 of 0.94 Å, approaching experimental accuracy. Calibration of this kind makes pLDDT a usable filter for selecting reliable predictions at scale.
4. ESM Metagenomic Atlas: Evolutionary-Scale Characterization of 617 Million Structures
ESMFold's speed makes structure characterization at metagenomic scale practical. The research team predicted structures for 617 million sequences (20 to 1,024 residues long) from the MGnify90 database, covering 99% of the database's sequences. The task was completed in two weeks on a heterogeneous cluster of approximately 2,000 GPUs, demonstrating the method's scalability.
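A back-of-the-envelope calculation shows what this implies per structure. It assumes all GPUs were busy for the full two weeks, which is an idealization for a heterogeneous cluster, so the result is an upper bound on the average cost rather than a measured figure.

```python
n_structures = 617_000_000
n_gpus = 2_000        # "approximately 2,000 GPUs"
wall_clock_days = 14  # "two weeks"

# Total GPU time available, assuming full utilization (an idealization).
gpu_seconds = n_gpus * wall_clock_days * 24 * 3600
seconds_per_structure = gpu_seconds / n_structures

print(f"{seconds_per_structure:.1f} GPU-seconds per structure")
```

This comes to roughly 3.9 GPU-seconds per predicted structure on average across the whole database.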
Prediction Results Statistics
- Approximately 365 million structures (59%) achieved good confidence (average pLDDT > 0.5 and pTM > 0.5)
- Approximately 225 million structures (36%) achieved high confidence (average pLDDT > 0.7 and pTM > 0.7)
- Approximately 113 million structures achieved very high confidence (pLDDT > 0.9), expected to have reliability approaching experimental structures
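The thresholds in the list above can be written as a small classifier, together with a check of the quoted percentages against the 617 million total. The function name and tier labels are illustrative. The sketch returns the highest tier a prediction qualifies for and, following the list, applies no pTM condition to the very-high tier.

```python
def confidence_tier(mean_plddt, ptm):
    """Highest confidence tier met, using the thresholds listed above.

    mean_plddt is on the 0 to 1 scale used in this article.
    """
    if mean_plddt > 0.9:
        return "very high"
    if mean_plddt > 0.7 and ptm > 0.7:
        return "high"
    if mean_plddt > 0.5 and ptm > 0.5:
        return "good"
    return "low"


TOTAL_MILLIONS = 617
counts_millions = {"good": 365, "high": 225, "very high": 113}
shares = {tier: 100 * n / TOTAL_MILLIONS for tier, n in counts_millions.items()}

for tier, share in shares.items():
    print(f"{tier}: {share:.0f}% of predictions")
```

The shares come out at 59%, 36%, and about 18%, which matches the two percentages given in the list and supplies the missing one for the very-high tier.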
Novelty Discoveries
These high-confidence predictions contain substantial novelty. In a random sample of 1 million high-confidence structures:
- 76.8% have less than 90% sequence identity to any sequence in UniRef90, indicating that they are distinct from existing UniRef90 sequences
- 3.4% have no significant match in UniRef90 at all
- 12.6% have no structural match in the PDB with a TM-score above 0.5
- 25.4% have no structural match in the PDB with a TM-score above 0.7
Notably, 10.4% of high-confidence structures have neither a structural match (TM-score ≤ 0.5) nor a close sequence homolog (sequence identity < 30%), placing them in previously uncharacterized regions of the protein universe.
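The combined criterion can be expressed as a simple filter. The function and argument names are hypothetical, and the sketch assumes the best structural hit and best sequence hit have already been found by separate searches that are not shown here; `None` stands for a search that returned no hit at all.

```python
def novelty_flags(best_pdb_tm_score, best_uniref90_identity):
    """Apply the thresholds from the text to one predicted structure.

    best_pdb_tm_score: highest TM-score against any PDB entry, or None.
    best_uniref90_identity: highest sequence identity (0 to 1) to any
        UniRef90 entry, or None.
    """
    no_structural_match = best_pdb_tm_score is None or best_pdb_tm_score <= 0.5
    no_close_homolog = (best_uniref90_identity is None
                        or best_uniref90_identity < 0.30)
    return {
        "no_structural_match": no_structural_match,
        "no_close_homolog": no_close_homolog,
        "novel_on_both": no_structural_match and no_close_homolog,
    }


# Scale of the finding in the 1 million structure sample.
sample_size = 1_000_000
novel_on_both_count = round(sample_size * 0.104)
```

In the sample of 1 million structures, the 10.4% figure corresponds to about 104,000 predictions that are novel on both criteria.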
Functional Relationships Revealed by Structural Similarity
ESMFold also reveals remote structural similarities that cannot be detected from sequence alone. For example, the metagenomic sequence MGYP000936678158 has no significant sequence matches in UniRef90 or reference proteomes, but its predicted structure resembles experimental structures of several nucleases (PDB 5YET, 3HR4; TM-score ~0.67). Another sequence, MGYP004000959047, similarly lacks sequence matches, but its predicted structure is highly similar to bacterial sterol-binding domains (PDB 6BYM, 5YQP; TM-score 0.78-0.80). These findings show that structure prediction can reach beyond the limits of sequence similarity and suggest functional relationships through structural similarity.
All predicted structures are openly accessible through the ESM Metagenomic Atlas (https://esmatlas.com), supporting bulk downloads, programmatic API access, and online search, providing new resources for large-scale structural biology research.
5. Discussion: Advantages, Limitations, and Future Outlook
ESMFold represents significant technical progress in protein structure prediction. Its core contribution lies in demonstrating that language models can extract sufficient evolutionary information from single sequences to support atomic-level structure prediction, thereby eliminating dependence on traditional MSA. This paradigm shift brings multiple advantages:
- Speed improvements enable large-scale structure characterization
- Simplified architecture reduces computational resource requirements
- Single-sequence characteristics provide unique value for orphan proteins and rapid design cycles
Limitations
However, the method also has clear limitations. On test sets such as CASP14, which contain many orphan proteins, ESMFold's accuracy remains lower than that of AlphaFold2 with full MSA, indicating that MSA-based methods retain an advantage even when evolutionary information is sparse. In addition, ESMFold's accuracy is strongly correlated with language model perplexity, so improving the language model is the main route to better structure prediction, yet training is expensive: the 15-billion-parameter model requires substantial computational resources.
Future Outlook
From a broader perspective, ESMFold's success provides empirical support for scaling laws in protein language models: as parameters, data, and compute increase, language models continue to show new capabilities. The authors note that current models are still far from the limits of scale that could in principle be applied, and that further scaling may improve the modeling of proteins with low evolutionary depth.
At the application level, ESMFold's metagenomic atlas demonstrates the potential of rapid structure prediction for exploring unknown regions of the protein universe. The discovery of millions of novel structures provides rich material for drug target identification, enzyme engineering, and new functional protein design. As prediction methods continue to improve and computational capabilities advance, the goal of structurally characterizing all known proteins is becoming practically achievable.
References
Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic level protein structure with a language model. bioRxiv, 2022. doi: 10.1101/2022.07.20.500902