Protein Language Models: Technical Evolution, Core Challenges, and Future Directions

Abstract

Protein language models (PLMs) sit at the intersection of natural language processing and computational biology and have advanced rapidly in recent years. Drawing on the comprehensive review by researchers from Huazhong University of Science and Technology published on arXiv (arXiv:2502.06881v1), this article surveys the architectural evolution, positional encoding strategies, scaling laws, dataset construction, and downstream applications of PLMs, then assesses the main open challenges and likely future directions.

1. Background: When Proteins Meet Language Models

Protein sequences and natural languages share significant conceptual similarities: both consist of discrete units (amino acids or words) arranged linearly, and in both the arrangement is governed by underlying rules (grammar for language, physical and evolutionary constraints for proteins). This insight laid the foundation for transferring natural language processing techniques to protein research.

With the rapid development of sequencing technologies, unlabeled protein sequence data has grown exponentially. The combination of Transformer architecture and large-scale self-supervised learning has catalyzed the explosive growth of PLMs. These models learn distributed representations of proteins and demonstrate capabilities approaching or even exceeding traditional experimental methods in structure prediction, functional annotation, and protein design tasks.

2. Evolution of Model Architectures

2.1 Early Explorations (Pre-Transformer)

Before the emergence of Transformers, researchers had already experimented with various neural network architectures:

ProtVec (2015): First applied word embedding techniques to protein sequences, treating amino acid triplets as "words" for embedding learning
MIF-ST: Combined convolutional neural networks with graph neural networks for joint sequence-structure representation
UniRep, SeqVec: Utilized recurrent neural networks to capture long-range dependencies

These early efforts laid groundwork for later Transformer-based models, but limited parallelism and difficulty modeling long sequences kept them from a breakthrough.
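The idea of treating amino acid triplets as "words" can be sketched in a few lines. This is a minimal illustration, not the original ProtVec code: the function names are hypothetical, and the choice to produce one non-overlapping triplet list per starting offset is an assumption about how such a tokenizer might be set up.

```python
def kmer_tokens(sequence, k=3, offset=0):
    """Split a sequence into non-overlapping k-mers, starting at `offset`."""
    return [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]


def shifted_kmer_views(sequence, k=3):
    """Return k token lists, one per reading offset (0, 1, ..., k-1)."""
    return [kmer_tokens(sequence, k, offset) for offset in range(k)]


seq = "MKTAYIAKQR"  # arbitrary example sequence
views = shifted_kmer_views(seq)
for offset, words in enumerate(views):
    print(offset, words)
```

Each list can then be fed to a word-embedding method exactly as a sentence of words would be.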

2.2 Mainstream Architectures in the Transformer Era

Current mainstream PLMs are all based on the Transformer architecture and can be categorized into three types based on their design paradigms:

Encoder-only Models

Encoder-only models adopt BERT-style bidirectional encoding and are suited to representation learning and downstream feature extraction. Representative models:

ESM-2: up to 15B parameters
ESM-3: 98B parameters, representing the current scale limit for encoder models
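BERT-style training hides a fraction of residues and asks the model to recover them from both sides of the context. The sketch below only prepares the masked input and the targets. The 15% rate follows the usual BERT convention and is an assumption here, as are the `<mask>` token name and the function name. Real pipelines often also replace some selected positions with random residues or leave them unchanged.

```python
import random

MASK = "<mask>"  # hypothetical mask token name


def mask_sequence(sequence, mask_rate=0.15, seed=0):
    """Return (masked tokens, {position: original residue}) for one sequence."""
    rng = random.Random(seed)
    tokens = list(sequence)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    labels = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK
    return tokens, labels


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
tokens, labels = mask_sequence(sequence)
print(tokens)
print(labels)
```

Because `labels` holds targets only at masked positions, a training loss built on it would be computed at those positions alone.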

Decoder-only Models

Decoder-only models adopt GPT-style autoregressive generation and focus on protein sequence generation tasks. Representative models:

ProGen2: 6.4B parameters, demonstrating the ability to generate proteins with catalytic activity
RITA: Based on rotary position encoding
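Autoregressive generation builds a sequence one residue at a time, each drawn from a distribution conditioned on the residues so far. The loop below shows that mechanic only. `toy_next_token_probs` is a hypothetical stand-in for a trained decoder, and its probabilities are invented for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EOS = "<eos>"  # hypothetical end-of-sequence token


def toy_next_token_probs(prefix):
    """Stand-in for a trained model: uniform over residues, with an
    end-of-sequence probability that grows with the prefix length."""
    p_eos = min(1.0, len(prefix) / 50)
    p_res = (1.0 - p_eos) / len(AMINO_ACIDS)
    probs = {aa: p_res for aa in AMINO_ACIDS}
    probs[EOS] = p_eos
    return probs


def generate(max_len=60, seed=0):
    """Sample residues left to right until EOS or max_len is reached."""
    rng = random.Random(seed)
    prefix = []
    while len(prefix) < max_len:
        probs = toy_next_token_probs(prefix)
        candidates, weights = zip(*probs.items())
        token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == EOS:
            break
        prefix.append(token)
    return "".join(prefix)


sample = generate()
print(len(sample), sample)
```

Swapping the stand-in for a real decoder's next-token distribution leaves the loop unchanged. Temperature or top-k filtering would be applied to `probs` before sampling.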

Encoder-Decoder Models

Encoder-decoder models support sequence-to-sequence transformation tasks. Representative models:

ProstT5: Achieves bidirectional translation between sequences and 3Di structure tokens
xTrimoPGLM: 100B parameters, exploring unified modeling of understanding and generation

2.3 Structure Integration Trends

While pure sequence models can capture evolutionary and structural information, they lack explicit structural supervision. Recent models have attempted various structure fusion strategies:

SaProt: Converts structural data into 3Di tokens
ESM-3: Unifies sequence, structure, and function into a single latent space
LM-GVP: Connects sequence and graph features
PeTriBERT: Uses Fourier embeddings to encode 3D structures
MSA-Transformer: Extends masked language modeling to multiple sequence alignments

These attempts reflect a shift in PLMs from single-modality modeling toward multi-modal fusion.
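Structure-aware vocabularies of the kind used when structures are converted into 3Di tokens can be pictured as pairing each residue with the structure token at the same position. The sketch below assumes one structure letter per residue and an uppercase-residue plus lowercase-structure naming convention. The structure string is made up and the function name is hypothetical.

```python
def structure_aware_tokens(residues, struct_tokens):
    """Pair each residue with its per-position structure letter."""
    if len(residues) != len(struct_tokens):
        raise ValueError("sequence and structure strings must align one-to-one")
    return [aa.upper() + st.lower() for aa, st in zip(residues, struct_tokens)]


# Made-up structure letters, not the output of a real structure encoder.
paired = structure_aware_tokens("MKTA", "DVPL")
print(paired)
```

The combined tokens form a larger vocabulary (residue alphabet times structure alphabet), which lets an otherwise standard sequence model consume structural context.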

3. Technical Choices for Positional Encoding

Self-attention is inherently order-agnostic, so Transformers need positional encoding to inject residue order. Over the course of PLM development, positional encoding strategies have shifted from absolute to relative approaches:

Encoding Type	Characteristics	Representative Models
Absolute Positional Encoding	Simple to implement and computationally efficient, but lacks length extrapolation capability	ESM-1b, ProtTrans
Rotary Positional Encoding (RoPE)	Combines length flexibility with long-range decay characteristics; reported to outperform ALiBi	ESM-2, ProGen2, RITA
Relative Positional Encoding	Insensitive to sequence length, more suitable for capturing structural information	T5, DeBERTa
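The property that makes rotary encoding attractive is that the query-key score depends only on the distance between two positions, not on where they sit in the sequence. The sketch below is a pure-Python illustration of that property. It rotates adjacent dimension pairs and uses a base of 10000, both assumptions: real implementations differ in how they pair dimensions, and the vectors are arbitrary.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by an angle proportional to position."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even embedding dimension")
    out = []
    for i in range(0, d, 2):
        # Pair index is i/2, so its frequency is base ** (-2 * (i/2) / d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_score(q, k, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key."""
    rq, rk = rope(q, q_pos), rope(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))


q = [1.0, 0.0, 0.5, -1.0]
k = [0.3, 0.7, -0.2, 0.4]
near_start = attention_score(q, k, q_pos=3, k_pos=1)
far_along = attention_score(q, k, q_pos=10, k_pos=8)
print(near_start, far_along)  # same offset of 2, so the scores agree
```

Since nothing in the rotation is tied to a maximum length, the same function applies to positions never seen in training, which is the length flexibility noted in the table.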

4. Applicability Boundaries of Scaling Laws

The scaling laws proposed by OpenAI describe the power-law relationship between model performance and parameters, data volume, and computational resources. In the PLM field, these laws exhibit unique characteristics:

The ESM series clearly demonstrates performance improvements from model scale expansion
PLM modeling losses typically follow strict power-law relationships
Compared to NLP models, PLMs are more prone to underfitting: even training on data volumes well beyond the optimum reported for NLP models remains insufficient

This finding suggests that further expanding model scale and training data may still significantly improve PLM performance. However, the cost of scaling cannot be ignored: ultra-large-scale models can generalize poorly to downstream tasks, so efficient architectural design and fine-tuning strategies are needed.
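A power-law relationship of the form loss = a * N^(-alpha) becomes a straight line in log-log space, so its exponent can be estimated with ordinary least squares. The sketch below shows that fit. The data points are synthetic and generated from a known curve purely to check the arithmetic. They are not measurements from any PLM, and the function name is hypothetical.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size ** (-alpha) by least squares on log-log values."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points drawn from loss = 5.0 * N ** -0.07 (illustrative only).
sizes = [1e7, 1e8, 1e9, 1e10]
losses = [5.0 * n ** -0.07 for n in sizes]
a, alpha = fit_power_law(sizes, losses)
print(round(a, 3), round(alpha, 3))
```

With real training runs, the residuals from this line indicate how closely the losses follow a power law, and the fitted exponent indicates how much a tenfold increase in scale is expected to help.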

5. Logic of Data System Construction

5.1 Sequence Data

UniProt family: UniRef50/90/100, UniParc, and UniProtKB, the most widely used protein sequence databases
BFD: Large-scale integrated database containing hundreds of millions of sequences
MGnify: 2.4 billion metagenomic predicted sequences, enhancing training data diversity
OAS: Over 500 million antibody sequences, supporting antibody-specific model training

5.2 Structure Data

PDB: Gold standard for experimentally determined biomolecular structures, limited in volume but highest in quality
AlphaFoldDB: Complements scarce experimental structures through AlphaFold predictions
ESMAtlas: 617 million metagenomic protein structure predictions, millions of which are novel structures

5.3 Evaluation Benchmarks

Structure Prediction: CASP, CAMEO, SCOP, CATH
Function Prediction: CAFA, EC, GO, FLIP
Comprehensive Capability: TAPE, PEER, ProteinGym

6. Capability Boundaries of Downstream Applications

6.1 Structure Prediction

MSA-free models have become the mainstream direction recently. Single-sequence models such as ESMFold and HelixFold-Single implicitly learn co-evolutionary information through large-scale training, outperforming single-sequence AlphaFold2 on orphan proteins while significantly improving computational speed.

6.2 Function Prediction

The rich embedding information provided by PLMs offers new pathways for function prediction. Models like DeepFRI and GPSFun attempt to integrate structural information, while PhiGnet introduces residue functional contribution quantification methods, enhancing prediction interpretability.

6.3 Protein Design

ProGen: Generates novel sequences with natural enzymatic activity
IgLM: Optimizes antibody sequence design
ESM-3, ProteinMPNN: Support structure-based sequence optimization
Sapiens, AbLang: Achieve expert-level performance in antibody humanization tasks

6.4 Mutation Effect Prediction

Zero-shot prediction has become an important application scenario for PLMs. Models like ESM-1v and MSA-Transformer can predict the impact of mutations on protein fitness without experimental data, while multi-modal models like AlphaMissense and ProSST achieve state-of-the-art performance.
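One common way to turn a masked language model into a zero-shot mutation scorer is to mask the mutated position and compare the log-probability of the mutant residue with that of the wild-type residue. The sketch below shows only the scoring step. The probabilities are invented, and the source does not specify which scoring rule each model uses, so treat this as an assumed formulation.

```python
import math


def mutation_score(position_probs, wild_type, mutant):
    """Log-odds of the mutant versus the wild-type residue at one position.

    Negative values mean the model finds the mutant less plausible.
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])


# Made-up probabilities a model might assign at a masked position.
probs_at_site = {"L": 0.60, "I": 0.25, "V": 0.10, "P": 0.05}
conservative = mutation_score(probs_at_site, wild_type="L", mutant="I")
disruptive = mutation_score(probs_at_site, wild_type="L", mutant="P")
print(round(conservative, 3), round(disruptive, 3))
```

No experimental labels enter the calculation, which is what makes the approach zero-shot. For multiple mutations, per-site scores are often summed.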

7. Challenges and Future Directions

7.1 Core Challenges

Unclear Design Standards: Optimal choices of model architecture, dataset scale, and data distribution still lack systematic guidance
Long Sequence Modeling Difficulties: Protein sequence lengths span a wide range (roughly 30 to 33,000 amino acids), imposing demanding hardware requirements
Limited Generalization: Ultra-large-scale models still generalize insufficiently to downstream tasks
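The hardware pressure from long sequences is easy to quantify for standard full self-attention, where the score matrix has one entry per pair of positions. The back-of-the-envelope sketch below assumes 16-bit values and a fully materialized matrix for a single head in a single layer. Memory-efficient attention kernels avoid storing this matrix, so the figure is an upper-end illustration rather than a requirement.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=2):
    """Bytes needed to hold one seq_len x seq_len attention score matrix."""
    return seq_len * seq_len * bytes_per_value


for length in (30, 1_000, 33_000):
    size = attention_matrix_bytes(length)
    print(f"{length:>6} residues: {size:>13,} bytes ({size / 2**30:.4f} GiB)")
```

Going from 30 to 33,000 residues multiplies the length by 1,100 but the matrix size by 1,210,000, which is why the longest proteins are often cropped or handled with sparse or linear attention variants.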

7.2 Future Directions

MSA-free Models: Represent the pursuit of efficiency and universality. Although MSA can significantly improve performance, issues such as high computational cost, unstable results, and failure on orphan proteins have driven the development of MSA-free models.

Multi-modal Fusion: Represents the ultimate pursuit of representation capability. Joint modeling of sequence, structure, and function has become the mainstream trend. The success of structure prediction models like AlphaFold has eased the scarcity of structural training data by supplying predicted structures at scale, and this direction promises to provide new understanding for more general protein language modeling.

8. Conclusion

Protein Language Models are in a period of rapid development. From early RNN explorations to Transformer dominance, from pure sequence modeling to multi-modal fusion, the technical roadmap is becoming increasingly mature. Scaling laws exhibit unique characteristics in the protein domain, suggesting room for further scaling, but data quality and model efficiency are equally important.

MSA-free models and multi-modal fusion represent the two current mainstream trends: the former pursues efficiency and universality, the latter representation capability and prediction accuracy. Future PLM development needs to balance scale, efficiency, and generalization capability while addressing core technical challenges such as long-sequence modeling.

Reference: Wang L, Li X, Zhang H, et al. A Comprehensive Review of Protein Language Models. arXiv preprint arXiv:2502.06881, 2025.
