Original: Benegas et al., arXiv:2407.11435v2
Institution: UC Berkeley
Abstract
Genomic Language Models (gLMs) represent an emerging field that applies natural language processing techniques to DNA sequence analysis, gradually demonstrating potential in functional constraint prediction, sequence design, and transfer learning. However, compared to protein language models, gLMs face unique challenges including vast genome scale, sparse functional regions, and divergent cross-species regulatory logic. This article systematically analyzes the technical status, core applications, and future development directions of gLMs based on a comprehensive review by the UC Berkeley team published on arXiv.
1. Background: Paradigm Shift from Proteins to Genomes
The success of protein language models has opened new pathways for biological sequence analysis. Transformer-based models have achieved breakthrough progress in protein structure prediction and variant effect prediction. The core hypothesis is that billions of years of evolution have explored the protein sequence space relevant to life, so large-scale unlabeled protein sequence data contains rich biological information. This success naturally raises a question: Can similar language modeling methods be applied to DNA sequences to drive transformative changes in genomics?
However, applying language models to genomes differs from the protein setting in fundamental ways:
- Scale difference: Proteins are well-defined functional units with relatively limited length; most genomes are vast, containing large amounts of non-functional regions where functional elements are submerged in massive background sequences.
- Data availability: Whole-genome sequences are far scarcer than protein sequences. Protein databases contain hundreds of millions of sequences, whereas whole-genome sequences across the tree of life remain comparatively few, which limits the diversity of functionally important DNA elements in training data.
Nevertheless, researchers believe gLMs still hold tremendous potential, with the key being to adapt model architectures and training strategies specifically for genomic characteristics.
2. Core Applications: Progress and Limitations of Three Task Categories
2.1 Functional Constraint Prediction
One of the most mature applications of gLMs is unsupervised functional constraint prediction. The underlying logic is that reference genomes typically come from healthy individuals and are therefore relatively depleted of deleterious variants, so a model trained on them tends to assign lower probabilities to harmful alleles. The log-likelihood ratio (LLR) between two alleles at a site then serves as an estimate of their relative fitness.
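As a minimal sketch of the LLR idea, the snippet below scores a single-nucleotide variant from the probabilities a model assigns to each base at one site. The probability values are invented for illustration, and the sign convention (alternate allele over reference allele) is an assumption rather than something fixed by the review.

```python
import math

def llr(probs: dict[str, float], ref: str, alt: str) -> float:
    """Log-likelihood ratio log P(alt) - log P(ref) at a single site."""
    return math.log(probs[alt]) - math.log(probs[ref])

# Hypothetical model output for one site whose reference base is "A".
site_probs = {"A": 0.85, "C": 0.05, "G": 0.08, "T": 0.02}

score = llr(site_probs, ref="A", alt="T")
# A negative score means the model finds the alternate allele less plausible.
print(f"LLR(A>T) = {score:.3f}")
```

Under this convention, more negative scores would be read as stronger predicted constraint against the alternate allele.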
This approach has achieved significant success in plant genomes:
- GPN achieved state-of-the-art variant effect prediction performance on the model plant Arabidopsis thaliana, with LLR scores correlating with allele frequencies in natural populations, despite the model being trained on only a single genome from this species.
- AgroNT and PlantCaduceus also obtained excellent results in other plant species.
On the human genome, however, the LLR from Nucleotide Transformer underperforms existing baselines, whereas GPN-MSA achieved competitive performance by leveraging whole-genome multiple sequence alignments (MSA) across vertebrates. Notably, observed nucleotide distributions are shaped not only by functional constraint but also by mutational biases; explicitly incorporating this information into functional constraint prediction is a promising direction for future research.
2.2 Sequence Design
Sequence generation based on causal language models (CLM) is another important application of gLMs. By recursively predicting the next token given a sequence fragment (prompt or control tag), models can generate entirely new sequences.
For regulatory sequence design, regLM based on the HyenaDNA model achieved de novo generation of promoter and enhancer sequences, where prepending control tags enables designing promoter sequences that drive high or low expression in specific cell types.
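The tag-then-generate loop can be sketched as follows. The function `toy_next_token_probs` is a hypothetical stand-in for a trained causal model, the tags `<HIGH>` and `<LOW>` are made up, and the way the tag shifts base composition is arbitrary; the sketch only shows how a prepended tag conditions every later sampling step and makes no claim about how regLM is implemented.

```python
import random

NUCLEOTIDES = "ACGT"

def toy_next_token_probs(context: list[str]) -> list[float]:
    """Stand-in for a causal LM: P(next base | context), ordered as A, C, G, T.

    Only the tag in context[0] matters here; a real model would condition on
    the whole context and learn the tag's effect from labeled sequences.
    """
    if context[0] == "<HIGH>":
        return [0.1, 0.4, 0.4, 0.1]
    return [0.4, 0.1, 0.1, 0.4]

def generate(tag: str, length: int, seed: int = 0) -> str:
    """Sample one base at a time, feeding each sampled base back as context."""
    rng = random.Random(seed)
    context = [tag]
    for _ in range(length):
        probs = toy_next_token_probs(context)
        base = rng.choices(NUCLEOTIDES, weights=probs, k=1)[0]
        context.append(base)
    return "".join(context[1:])  # drop the control tag from the output

print(generate("<HIGH>", 30))
print(generate("<LOW>", 30))
```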
In more complex tasks, the EVO model has been used to design novel CRISPR-Cas systems. Large-scale DNA sequence design (at chromosome or genome level) represents a more ambitious goal:
- EVO generated 20 sequences of approximately 650 kb each, with realistic coding sequence density and plausible predicted protein structures
- MegaDNA generated complete phage genomes up to 96kb in length
However, these attempts still face challenges. EVO-generated sequences lack the highly conserved marker genes typically present in functional prokaryotic genomes, and their predicted protein structures show only limited matches to natural proteins in existing databases. Independent evaluations likewise show that MegaDNA-generated genomes still differ from natural genomes in sequence composition.
2.3 Transfer Learning
Transfer learning is the third category of gLM applications. By pre-training on raw sequence data, gLMs transform input genomic sequences into intermediate vector representations (embeddings) that can be extracted and used as features for other models, or fine-tuned for downstream tasks.
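The "embeddings as features" route can be illustrated with a small sketch. The per-token vectors below are invented and two-dimensional for readability, and the nearest-centroid classifier is just one simple choice of downstream model, not one prescribed by the review.

```python
def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length vectors into a single vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def squared_distance(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroid(x: list[float], centroids: dict[str, list[float]]) -> str:
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: squared_distance(x, centroids[label]))

# Hypothetical per-token embeddings for labeled training sequences.
training = {
    "CDS": [[[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1], [0.7, 0.1]]],
    "UTR": [[[0.0, 1.0], [0.2, 0.8]], [[0.1, 0.9], [0.1, 0.7]]],
}

# One pooled vector per sequence, then one centroid per class.
centroids = {
    label: mean_pool([mean_pool(tokens) for tokens in seqs])
    for label, seqs in training.items()
}

query_tokens = [[0.6, 0.3], [0.8, 0.1]]
predicted = nearest_centroid(mean_pool(query_tokens), centroids)
print(predicted)  # prints "CDS"
```

Fine-tuning differs in that the pre-trained weights themselves are updated on the downstream task, rather than kept frozen as in this sketch.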
Unsupervised embedding visualizations show that models can distinguish between different classes of genomic elements (such as coding sequences and untranslated regions), indicating that learned representations contain biologically relevant information.
Practical cases include:
- SegmentNT: Achieved state-of-the-art performance in gene and cis-regulatory element annotation by fine-tuning Nucleotide Transformer
- AgroNT: After pre-training on diverse plant species, fine-tuned to predict chromatin accessibility and gene expression
- DNABERT-S: Applied contrastive learning for metagenomic binning
- IsoFormer: Explored multimodal transfer learning between DNA and protein language models
However, two recent studies evaluating multiple gLMs on human genome prediction tasks found they generally failed to surpass specially designed models. This finding raises important questions: In the field of human genetics, where high-quality annotated data and carefully designed models already exist, can gLMs provide significant added value?
3. Technical Considerations: Data, Architecture, and Training Decisions
3.1 Data Selection and Quality Control
Unlike NLP and protein domains, genomics lacks universally accepted standardized datasets. The complexity of data quality control lies in:
- Only about 3.3% of bases in the human reference genome are considered significantly constrained and potentially functional
- Typical training sequences contain both functional and non-functional sites, making it difficult to simply classify them as high-quality or low-quality samples
Repeat sequence handling is another critical issue. Approximately 50% of the human genome consists of repetitive sequences (and the proportion is generally high in eukaryotes), yet few gLM studies discuss the problem adequately, let alone propose solutions such as downsampling or downweighting. Distinguishing genuine generalization from memorization requires reporting perplexity on non-repetitive regions separately.
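A minimal sketch of separate perplexity reporting is given below. It assumes a soft-masked sequence, where lowercase letters mark repeats, and a list of the probabilities the model assigned to each observed base; the sequence and probability values are invented.

```python
import math

def perplexity(true_base_probs: list[float]) -> float:
    """exp of the mean negative log-probability assigned to the observed bases."""
    if not true_base_probs:
        raise ValueError("no positions to score")
    mean_nll = -sum(math.log(p) for p in true_base_probs) / len(true_base_probs)
    return math.exp(mean_nll)

def split_perplexity(soft_masked_seq: str, true_base_probs: list[float]) -> dict[str, float]:
    """Report perplexity separately for repeat (lowercase) and other positions."""
    pairs = list(zip(soft_masked_seq, true_base_probs))
    repeat = [p for base, p in pairs if base.islower()]
    non_repeat = [p for base, p in pairs if base.isupper()]
    return {"repeat": perplexity(repeat), "non_repeat": perplexity(non_repeat)}

# Hypothetical values: the model is confident on repeats, less so elsewhere.
seq = "ACGTacgtacgt"
probs = [0.30, 0.25, 0.35, 0.30] + [0.9] * 8

result = split_perplexity(seq, probs)
print(result)
```

In this toy case a pooled perplexity would be dominated by the easily predicted repeat positions, which is the effect separate reporting is meant to expose.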
Ensuring data sufficiency is equally important. A single genome may be insufficient to train large models, especially when non-functional regions are downsampled. Adding within-species sequence variation is one approach, but many species (including humans) have relatively limited inter-individual variation. Cross-species training is a more common strategy, but as species divergence increases, regulatory logic diverges much faster than proteins do, which may require explicit species identifiers as model inputs.
3.2 Trade-offs in Architecture and Learning Objectives
gLMs show diversity in architectural choices:
- Transformers and their variants (such as BigBird, DNABERT, Nucleotide Transformer) dominate
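- State space models (SSMs) and related long-convolution architectures, as used in HyenaDNA (built on Hyena) and Caduceus (built on Mamba), show advantages in handling long sequences because their cost scales sub-quadratically with sequence length (linearly for Mamba)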
- CNN-Transformer hybrid architectures have also been explored
For tokenization strategies, nucleotide-level, overlapping k-mer, non-overlapping k-mer, and Byte Pair Encoding (BPE) are all used.
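The first three tokenization schemes are simple enough to sketch directly. BPE is omitted because it depends on a merge table learned from a corpus. Dropping the trailing remainder in the non-overlapping case is an assumption of this sketch; real tokenizers may pad or handle it differently.

```python
def nucleotide_tokens(seq: str) -> list[str]:
    """One token per base."""
    return list(seq)

def overlapping_kmers(seq: str, k: int) -> list[str]:
    """Sliding window with stride 1: len(seq) - k + 1 tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def non_overlapping_kmers(seq: str, k: int) -> list[str]:
    """Stride k; a trailing remainder shorter than k is dropped here."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

seq = "ACGTACGT"
print(nucleotide_tokens(seq))         # 8 tokens
print(overlapping_kmers(seq, 3))      # ['ACG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT']
print(non_overlapping_kmers(seq, 3))  # ['ACG', 'TAC']
```

The token counts make the trade-off visible: non-overlapping k-mers shorten the input by roughly a factor of k, while overlapping k-mers keep nearly one token per base.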
For learning objectives, Masked Language Modeling (MLM) and Causal Language Modeling (CLM) are the two main paradigms:
- MLM allows bidirectional context utilization, suitable for representation learning
- CLM supports autoregressive generation, suitable for sequence design
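For functional constraint prediction, the two objectives trade off differently. With MLM, a single query that masks the variant site yields the probabilities of both alleles, so the LLR for a SNP costs one forward pass; CLM needs two queries, one per allele. Conversely, CLM handles multiple substitutions, insertions, and deletions naturally by comparing whole-sequence likelihoods, whereas MLM must fall back on more expensive pseudo-LLR methods.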
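The two-query CLM procedure can be sketched with a toy model. A first-order Markov chain with made-up transition probabilities stands in for a trained causal LM here; the point is only that each allele is scored by a full-sequence log-likelihood, so the same function covers SNPs and indels.

```python
import math

# Toy stand-in for a causal LM: TRANSITIONS[prev][nxt] = P(nxt | prev).
# All values are invented for illustration.
TRANSITIONS = {
    "A": {"A": 0.40, "C": 0.20, "G": 0.30, "T": 0.10},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.30, "C": 0.30, "G": 0.20, "T": 0.20},
    "T": {"A": 0.10, "C": 0.30, "G": 0.20, "T": 0.40},
}
START_PROB = 0.25  # uniform probability for the first base

def log_likelihood(seq: str) -> float:
    """One query: sum of log P(base | previous base) over the sequence."""
    total = math.log(START_PROB)
    for prev, nxt in zip(seq, seq[1:]):
        total += math.log(TRANSITIONS[prev][nxt])
    return total

def clm_llr(ref_seq: str, alt_seq: str) -> float:
    """Two queries, one per allele; alt minus ref."""
    return log_likelihood(alt_seq) - log_likelihood(ref_seq)

ref = "ACGTA"
snp = "ACTTA"       # G>T substitution at the third base
deletion = "ACTA"   # the G is deleted

print(f"SNP LLR:      {clm_llr(ref, snp):.3f}")
print(f"Deletion LLR: {clm_llr(ref, deletion):.3f}")
```

Note that the deletion allele has one fewer term in its sum, so sequence length affects the score; whether and how to normalize for that is a design choice outside the scope of this sketch.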
Long-range interaction modeling is a genome-specific challenge. Enhancer-promoter contacts can span hundreds of thousands of bases, and determining the appropriate receptive field size remains unresolved. Multi-scale architectures (such as MEGABYTE's hierarchical Transformer) and efficient attention mechanisms (such as FlashAttention) are directions for addressing this, but genome-scale modeling (billions of base pairs) remains beyond current methods' capabilities.
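A back-of-the-envelope calculation shows why context length is the bottleneck for full self-attention. The sketch assumes nucleotide-level tokenization and counts the pairwise scores computed per head per layer; the two lengths are illustrative round numbers (a 100 kb regulatory window and a genome of roughly 3 billion bases). Memory-efficient kernels such as FlashAttention avoid storing the full score matrix, but the number of pairwise computations remains quadratic.

```python
def attention_scores(n_tokens: int) -> int:
    """Pairwise scores computed by full self-attention, per head per layer."""
    return n_tokens ** 2

examples = [
    ("100 kb window", 100_000),
    ("3 Gb genome", 3_000_000_000),
]

for label, n_tokens in examples:
    print(f"{label}: {attention_scores(n_tokens):.1e} pairwise scores")
```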
4. Evaluation Challenges: The Benchmarking Dilemma
Evaluating gLMs faces multiple difficulties:
Functional constraint prediction requires large-scale functional experimental data (such as saturation mutagenesis) to validate predictions, but such data is scarce and carries circular validation risks.
For sequence design, test-set perplexity may not reliably indicate a model's utility for design, so evaluation requires a comprehensive examination of the composition, motif patterns, and predicted functional activity of generated sequences. The Polygraph benchmark proposed a series of analysis dimensions for regulatory sequence design, but evaluating whole-genome or chromosome design additionally requires examining the presence and positioning of essential genes and regulatory elements, as well as their interactions.
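One simple composition check is to compare k-mer frequency profiles of generated and natural sequences. The sketch below uses total variation distance as the comparison; this is an illustrative choice and is not taken from Polygraph or any other benchmark, and the sequences are toy examples.

```python
from collections import Counter

def kmer_freqs(seqs: list[str], k: int) -> dict[str, float]:
    """Relative frequency of every k-mer observed across the sequences."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """0 for identical profiles, 1 for profiles with no k-mers in common."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)

natural = ["ACGTACGT", "ACGTTGCA"]
generated = ["ACGTACGA", "AAAAAAAA"]

distance = total_variation(kmer_freqs(natural, 2), kmer_freqs(generated, 2))
print(f"2-mer total variation distance: {distance:.3f}")
```

A low distance on such a summary is necessary rather than sufficient: sequences can match natural composition while still lacking functional motifs or essential genes.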
Transfer learning evaluation has unique challenges: any benchmark set must reliably indicate model performance on relevant tasks. Functional genomics data (such as ENCODE or Roadmap Epigenomics projects) can be transformed into prediction tasks for genomic regions and variant annotations, but current benchmarks differ in task and methodology selection while providing seemingly redundant insights into model capabilities. The computational genomics community needs to develop standardized, scalable, and widely trusted benchmarks.
5. Conclusion: A Rational View of "Foundation Model" Claims
gLMs are developing rapidly and show potential in functional constraint prediction, regulatory sequence design, and transfer learning. However, rather than the magical breakthroughs that the term "artificial intelligence" might suggest, gLMs should be viewed as another useful modeling tool, much as Hidden Markov Models were when first introduced. The term "foundation model" implies substantial improvements in downstream task performance, but whether a pre-trained model delivers them is an empirical question, not an inherent property. In a field as young as genomic language modeling, establishing appropriate benchmarks may take considerable time.
Early gLMs were mostly direct transfers of NLP models, but further integration of deep genomics expertise may yield the greatest returns. Evaluating gLM capabilities is challenging because metrics can be misleading, especially when over-optimized. The advantage of NLP is that humans are natural language experts who can calibrate benchmarks to match professional judgment; in genomics, one must rely on data and expert knowledge to falsify models, making the problem more challenging while highlighting the necessity of collaboration with domain experts and deliberate experiments for benchmark development.
Key Questions for Future Research
- How to best model cross-scale patterns from motifs to genes to whole genomes?
- Which applications require modeling long-range interactions and how to determine receptive field size?
- How to incorporate structural variation into gLMs?
- How to leverage population genetics data?
- How to best integrate transcriptomics and epigenetics data?
- Does the scaling hypothesis hold for gLMs and for how long?
The answers to these questions will determine whether gLMs can evolve from promising tools to pillars of genomics research.
References:
[1] Benegas G, Ye C, Albors C, et al. Genomic Language Models: Opportunities and Challenges. arXiv preprint arXiv:2407.11435v2, 2024.