Abstract
Protein Language Models (PLMs), as an interdisciplinary field bridging natural language processing and computational biology, have achieved remarkable progress in recent years. Based on the comprehensive review by researchers from Huazhong University of Science and Technology published on arXiv (arXiv:2502.06881v1), this article systematically reviews the architectural evolution, positional encoding strategies, scaling laws, dataset construction, and downstream applications of PLMs, while objectively analyzing current core challenges and future development trends.
1. Background: When Proteins Meet Language Models
Protein sequences and natural languages share significant conceptual similarities: both consist of discrete units (amino acids or words) arranged linearly, and both follow an implicit "grammar" that constrains which arrangements are meaningful. This insight laid the foundation for transferring natural language processing techniques to protein research.
With the rapid development of sequencing technologies, unlabeled protein sequence data has grown exponentially. The combination of Transformer architecture and large-scale self-supervised learning has catalyzed the explosive growth of PLMs. These models learn distributed representations of proteins and demonstrate capabilities approaching or even exceeding traditional experimental methods in structure prediction, functional annotation, and protein design tasks.
2. Evolution of Model Architectures
2.1 Early Explorations (Pre-Transformer)
Before the emergence of Transformers, researchers had already experimented with various neural network architectures:
- ProtVec (2015): First applied word embedding techniques to protein sequences, treating amino acid triplets as "words" for embedding learning
- MIF-ST: Combined convolutional neural networks with graph neural networks for joint sequence-structure representation
- UniRep, SeqVec: Utilized recurrent neural networks to capture long-range dependencies
These early explorations provided valuable experience for later Transformer applications, but limited parallelization and difficulty in modeling long sequences kept them from achieving breakthrough results.
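The "triplets as words" idea behind ProtVec can be illustrated with a minimal sketch. It assumes the sequence is split into non-overlapping 3-mers at each of the three possible offsets; the function names are hypothetical, and the embedding training itself (for example with a word2vec-style model) is not shown.

```python
def kmer_tokens(sequence: str, k: int = 3, offset: int = 0) -> list[str]:
    """Split a sequence into non-overlapping k-mers, starting at `offset`."""
    return [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]


def shifted_kmer_sentences(sequence: str, k: int = 3) -> list[list[str]]:
    """Build one 'sentence' of k-mer 'words' per reading offset (0 .. k-1)."""
    return [kmer_tokens(sequence, k, offset) for offset in range(k)]


# Hypothetical toy sequence, used only for illustration.
for sentence in shifted_kmer_sentences("MKTAYIAK"):
    print(sentence)
```

Each resulting list can then be treated like a sentence in a text corpus when learning embeddings.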
2.2 Mainstream Architectures in the Transformer Era
Current mainstream PLMs are all based on the Transformer architecture and can be categorized into three types based on their design paradigms:
Encoder-only Models
These models adopt BERT-style bidirectional encoding and are well suited to representation learning and downstream feature extraction. Representative models:
- ESM-2: Up to 15B parameters
- ESM-3: 98B parameters, representing the current scale limit for encoder models
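The BERT-style training objective can be sketched as follows. This is a simplified illustration, not the procedure of any specific model: it assumes a 15% masking rate and always substitutes a mask token, whereas BERT itself also leaves some selected tokens unchanged or replaces them with random ones. The names `MASK` and `mask_sequence` are hypothetical.

```python
import random

MASK = "<mask>"


def mask_sequence(sequence: str, mask_rate: float = 0.15, seed: int = 0):
    """Hide a random subset of residues; return masked tokens and the targets."""
    tokens = list(sequence)
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    # The model is trained to recover these residues from both-sided context.
    targets = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK
    return tokens, targets


tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFS")
print(tokens)
print(targets)
```

Because every unmasked position is visible to the model, the prediction for a masked residue can draw on context from both directions, which is what makes the resulting embeddings useful for feature extraction.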
Decoder-only Models
These models adopt GPT-style autoregressive generation and focus on protein sequence generation tasks. Representative models:
- ProGen2: Up to 6.4B parameters, demonstrating the ability to generate proteins with catalytic activity
- RITA: Based on rotary position encoding
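Autoregressive generation means sampling one residue at a time, each conditioned on the prefix generated so far. The loop below is a minimal sketch of that idea. `toy_next_token_probs` is a hypothetical stand-in for a trained decoder (it is uniform over residues with a stop probability that grows with length) and does not reflect any real model's output.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EOS = "<eos>"


def toy_next_token_probs(prefix: list[str]) -> dict[str, float]:
    """Hypothetical stand-in for a decoder: returns p(next token | prefix)."""
    p_eos = min(1.0, len(prefix) / 50)
    p_residue = (1.0 - p_eos) / len(AMINO_ACIDS)
    probs = {aa: p_residue for aa in AMINO_ACIDS}
    probs[EOS] = p_eos
    return probs


def generate(next_token_probs, max_len: int = 100, seed: int = 0) -> str:
    """Sample left to right until the end token or `max_len` is reached."""
    rng = random.Random(seed)
    prefix: list[str] = []
    while len(prefix) < max_len:
        probs = next_token_probs(prefix)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == EOS:
            break
        prefix.append(token)
    return "".join(prefix)


print(generate(toy_next_token_probs))
```

Swapping the toy function for a real decoder's next-token distribution leaves the loop unchanged, which is why this family of models lends itself to sequence design.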
Encoder-Decoder Models
These models support sequence-to-sequence transformation tasks. Representative models:
- ProstT5: Achieves bidirectional translation between sequences and 3Di structure tokens
- xTrimoPGLM: 100B parameters, exploring unified modeling of understanding and generation
2.3 Structure Integration Trends
While pure sequence models can capture evolutionary and structural information, they lack explicit structural supervision. Recent models have attempted various structure fusion strategies:
- SaProt: Converts structural data into 3Di tokens
- ESM-3: Unifies sequence, structure, and function into a single latent space
- LM-GVP: Connects sequence and graph features
- PeTriBERT: Uses Fourier embeddings to encode 3D structures
- MSA-Transformer: Extends masked language modeling to multiple sequence alignments
These attempts reflect the development trend of PLMs from single-modality to multi-modal fusion.
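One simple way to picture token-level structure fusion is to pair each residue with a structure letter at the same position, so that the model sees a single combined token per site. The sketch below assumes the structure string has already been computed by an external tool and is aligned one-to-one with the sequence. The function name, the casing convention, and the example strings are illustrative assumptions, not a description of any model's actual tokenizer.

```python
def structure_aware_tokens(residues: str, structure_letters: str) -> list[str]:
    """Pair each residue with the structure letter at the same position."""
    if len(residues) != len(structure_letters):
        raise ValueError("sequence and structure strings must have equal length")
    # Upper case marks the amino acid, lower case the structure state.
    return [aa.upper() + s.lower() for aa, s in zip(residues, structure_letters)]


# Hypothetical three-residue example.
print(structure_aware_tokens("MEV", "DVP"))  # ['Md', 'Ev', 'Vp']
```

The combined alphabet is the product of the two original alphabets, so the vocabulary grows but the sequence length, and therefore the attention cost, stays the same.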
3. Technical Choices for Positional Encoding
Self-attention is by itself insensitive to token order, so Transformers rely on positional encoding to introduce positional information. Over the development of PLMs, positional encoding strategies have evolved from absolute to relative approaches:
| Encoding Type | Characteristics | Representative Models |
|---|---|---|
| Absolute Positional Encoding | Simple to implement and computationally efficient, but lacks length extrapolation capability | ESM-1b, ProtTrans |
| Rotary Positional Encoding (RoPE) | Combines length flexibility with long-range decay characteristics, outperforms ALiBi | ESM-2, ProGen2, RITA |
| Relative Positional Encoding | Insensitive to sequence length, more suitable for capturing structural information | T5, DeBERTa |
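To make the rotary row of the table concrete, here is a minimal pure-Python sketch of RoPE applied to a single query or key vector. It assumes the interleaved-pair layout and the commonly used base of 10000; real implementations are vectorized and may pair dimensions differently.

```python
import math


def rope(vector: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of `vector` by position-dependent angles."""
    dim = len(vector)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    rotated = []
    for i in range(0, dim, 2):
        # Pair i // 2 uses frequency base ** (-i / dim).
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Both pairs of positions are 3 apart, so the two scores agree.
print(dot(rope(q, 5), rope(k, 2)))
print(dot(rope(q, 103), rope(k, 100)))
```

The attention score between a rotated query and key depends only on the offset between their positions, which is why RoPE behaves like a relative encoding even though it is applied to each position separately.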
4. Applicability Boundaries of Scaling Laws
The scaling laws proposed by OpenAI describe the power-law relationship between model performance and parameters, data volume, and computational resources. In the PLM field, these laws exhibit unique characteristics:
- The ESM series clearly demonstrates performance improvements from model scale expansion
- PLM modeling losses typically follow strict power-law relationships
- Compared to NLP models, PLMs are more prone to underfitting: even training on data volumes far beyond the optimal points established for NLP models remains insufficient
This finding suggests that further expanding model scale and training data may still significantly improve PLM performance. However, the cost of scaling cannot be ignored: ultra-large-scale models are hard to adapt to downstream tasks, so they call for efficient architectural design and fine-tuning strategies.
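A power-law relationship of the form L(N) = a · N^(−α) becomes a straight line in log-log space, so its exponent can be estimated with ordinary least squares. The sketch below shows this on synthetic numbers. The model sizes, the coefficient 12.0, and the exponent 0.08 are made up for illustration and are not measurements from any PLM.

```python
import math


def fit_power_law(sizes: list[float], losses: list[float]) -> tuple[float, float]:
    """Fit loss = a * size ** (-alpha) by linear regression in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data generated from an exact power law (illustrative only).
sizes = [1e7, 1e8, 1e9, 1e10]
losses = [12.0 * n ** -0.08 for n in sizes]
a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

With real training runs the points scatter around the line, and a systematic departure from it at large scale is one sign that a model is limited by data or compute rather than by parameter count.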
5. Logic of Data System Construction
5.1 Sequence Data
- UniProt Series: Includes UniRef50/90/100, UniParc, and UniProtKB, the most widely used protein sequence databases
- BFD: Large-scale integrated database containing hundreds of millions of sequences
- MGnify: 2.4 billion metagenomic predicted sequences, enhancing training data diversity
- OAS: Over 500 million antibody sequences, supporting antibody-specific model training
5.2 Structure Data
- PDB: Gold standard for experimentally determined biomolecular structures, limited in volume but highest in quality
- AlphaFoldDB: Complements scarce experimental structures through AlphaFold predictions
- ESMAtlas: 617 million metagenomic protein structure predictions, millions of which are novel structures
5.3 Evaluation Benchmarks
- Structure Prediction: CASP, CAMEO, SCOP, CATH
- Function Prediction: CAFA, EC, GO, FLIP
- Comprehensive Capability: TAPE, PEER, ProteinGym
6. Capability Boundaries of Downstream Applications
6.1 Structure Prediction
MSA-free models have become the mainstream direction recently. Single-sequence models such as ESMFold and HelixFold-Single implicitly learn co-evolutionary information through large-scale training, outperforming single-sequence AlphaFold2 on orphan proteins while significantly improving computational speed.
6.2 Function Prediction
The rich embedding information provided by PLMs offers new pathways for function prediction. Models like DeepFRI and GPSFun attempt to integrate structural information, while PhiGnet introduces residue functional contribution quantification methods, enhancing prediction interpretability.
6.3 Protein Design
- ProGen: Generates novel sequences with natural enzymatic activity
- IgLM: Optimizes antibody sequence design
- ESM-3, ProteinMPNN: Support structure-based sequence optimization
- Sapiens, AbLang: Achieve expert-level performance in antibody humanization tasks
6.4 Mutation Effect Prediction
Zero-shot prediction has become an important application scenario for PLMs. Models like ESM-1v and MSA-Transformer can predict the impact of mutations on protein fitness without experimental data, while multi-modal models like AlphaMissense and ProSST achieve state-of-the-art performance.
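A common way to score a mutation without any experimental labels is to compare the probabilities a masked language model assigns to the mutant and wild-type residues at the mutated site. The sketch below shows only that final scoring step. The probability table is hypothetical and stands in for a model's output at one masked position; the function names are illustrative.

```python
import math


def parse_mutation(mutation: str) -> tuple[str, int, str]:
    """Parse a string such as 'A42G' into (wild type, 0-based index, mutant)."""
    wild_type, mutant = mutation[0], mutation[-1]
    index = int(mutation[1:-1]) - 1
    return wild_type, index, mutant


def log_odds_score(probs_at_site: dict[str, float], wild_type: str, mutant: str) -> float:
    """Log-odds of mutant vs. wild type; negative means the model prefers wild type."""
    return math.log(probs_at_site[mutant]) - math.log(probs_at_site[wild_type])


# Hypothetical model output for the masked position of mutation A42G.
probs_at_site = {"A": 0.6, "G": 0.3, "W": 0.1}
wild_type, index, mutant = parse_mutation("A42G")
print(index, log_odds_score(probs_at_site, wild_type, mutant))
```

Scores computed this way can be ranked against measured fitness values, which is how benchmarks such as ProteinGym evaluate zero-shot predictors.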
7. Challenges and Future Directions
7.1 Core Challenges
- Unclear Design Standards: Optimal configurations for model architecture, dataset scale, and distribution still lack systematic guidance
- Long Sequence Modeling Difficulties: Protein sequence lengths span a wide range (from about 30 to about 33,000 amino acids), imposing demanding hardware requirements
- Limited Generalization: Ultra-large-scale models still generalize insufficiently to downstream tasks
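The hardware pressure from long sequences follows from the quadratic size of the attention matrix. As a back-of-the-envelope sketch, the snippet below computes the memory needed to store one full attention matrix, assuming a naive implementation that materializes it and 2 bytes per value (16-bit floats). At 33,000 residues this comes to roughly 2.2 GB per head per layer. Memory-efficient attention kernels avoid storing the full matrix, so this is an upper-bound illustration rather than a measurement.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store one seq_len x seq_len attention matrix."""
    return seq_len * seq_len * bytes_per_value


for length in (30, 1_000, 33_000):
    gib = attention_matrix_bytes(length) / 2**30
    print(f"{length:>6} residues -> {gib:.6f} GiB per head per layer")
```

Multiplying by the number of heads and layers shows why the longest proteins are usually truncated or split into windows during training.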
7.2 Future Directions
MSA-free Models: Represent the pursuit of efficiency and universality. Although MSA can significantly improve performance, issues such as high computational cost, unstable results, and failure on orphan proteins have driven the development of MSA-free models.
Multi-modal Fusion: Represents the pursuit of stronger representation capability. Joint modeling of sequence, structure, and function has become the mainstream trend. The success of structure prediction models like AlphaFold has eased the scarcity of structural training data, and this direction promises to provide new understanding for more general protein language modeling.
8. Conclusion
Protein Language Models are in a period of rapid development. From early RNN explorations to Transformer dominance, from pure sequence modeling to multi-modal fusion, the technical roadmap is becoming increasingly mature. Scaling laws exhibit unique characteristics in the protein domain, suggesting room for further scaling, but data quality and model efficiency are equally important.
MSA-free models and multi-modal fusion represent the two current mainstream trends: the former pursues efficiency and universality, while the latter pursues representation capability and prediction accuracy. Future PLM development needs to balance scale, efficiency, and generalization capability, while addressing core technical challenges such as long-sequence modeling.
Reference: Wang L, Li X, Zhang H, et al. A Comprehensive Review of Protein Language Models. arXiv preprint arXiv:2502.06881, 2025.