Original: Nguyen et al., Science 2024
DOI: 10.1126/science.ado9336
Institution: Arc Institute & Stanford University

Abstract

Evo is a genomic foundation model published in 2024 by a joint team from the Arc Institute and Stanford University. It has 7 billion parameters and a context length of 131K tokens, and it uses the StripedHyena architecture to model long sequences at single-nucleotide resolution. Trained on 2.7 million prokaryotic and phage genomes, Evo makes zero-shot functional predictions across DNA, RNA, and protein modalities. It was also used for multimodal generative design of CRISPR-Cas systems and transposon systems, a significant step forward for genomic foundation models.

1. Background: Challenges in Cross-Modal and Cross-Scale Biological Modeling

The fundamental instructions of life are encoded in the DNA sequences of all organisms. Understanding these instructions can deepen our knowledge of biological processes and open new avenues for reprogramming biology to create useful technologies. However, even the simplest microbial genomes are extraordinarily complex, with millions of base pairs encoding interactions between DNA, RNA, and proteins. These are the three modalities of the central dogma of molecular biology, and together they carry out cellular function.

This complexity exists at multiple scales, from individual molecules to entire genomes, representing a vast landscape of genetic information functionally selected over evolutionary time.

Limitations of Existing Methods:

A DNA model capable of unifying information at molecular, systems, and genome scales could learn from large genomic regions to capture system-wide interactions, enabling the design of more complex biological functions.

Technical Obstacles: Applying large language model technology to DNA sequence modeling faces specific challenges. Mainstream dense Transformer architectures have a computational cost that scales quadratically with input sequence length, which becomes prohibitive once sequences grow long relative to model width. They also typically perform worse at single-nucleotide or byte-level resolution than at coarser resolutions. Consequently, Transformer-based DNA models are limited to short context lengths and rely on schemes that aggregate nucleotides into tokens, sacrificing single-nucleotide resolution.
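To make the trade-off concrete, the following minimal sketch counts the query-key pairs in one full self-attention map for a 131,072-nucleotide input, once with one token per nucleotide and once with non-overlapping 6-mers. The helper names and the choice of k = 6 are illustrative assumptions, not details from the paper.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs in one full (dense) self-attention map."""
    return num_tokens * num_tokens


def kmer_tokens(seq, k):
    """Split a sequence into non-overlapping k-mers (the last one may be shorter)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


seq_len = 131_072                    # nucleotides in the input
num_6mers = -(-seq_len // 6)         # ceiling division: tokens after 6-mer aggregation

print(f"single-nucleotide: {seq_len:,} tokens, {attention_pairs(seq_len):,} pairs")
print(f"6-mer tokens:      {num_6mers:,} tokens, {attention_pairs(num_6mers):,} pairs")

# Aggregation shrinks the attention map, but a single-base change now alters
# a whole token instead of one position.
print(kmer_tokens("ATGACCGT", 3))
```

Doubling the sequence length quadruples the pair count, which is why dense attention at single-nucleotide resolution becomes expensive at genome-scale contexts.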

2. Technical Architecture: StripedHyena and Long-Sequence Modeling

Evo adopts the StripedHyena architecture, a hybrid model design combining attention mechanisms with data-controlled convolutional operators. Specifically:

This hybrid design aims to combine the advantages of both mechanisms:

Hyena layers belong to the category of deep signal processing primitives, achieving efficient, input-dependent computation through structured operators compatible with fast multiplication algorithms that can be evaluated in sub-quadratic time. This design enables Evo to process sequences up to 131,072 tokens at single-nucleotide resolution while maintaining computational efficiency.
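As a rough illustration of why convolution-based operators can be evaluated in sub-quadratic time, the sketch below computes a causal long convolution with the FFT in O(L log L) and checks it against the direct O(L^2) computation. It is a simplified stand-in, not the StripedHyena implementation: the decaying filter, the sigmoid gate, and the function names are assumptions chosen for illustration.

```python
import numpy as np


def causal_fft_conv(x, h):
    """Causal convolution y[t] = sum_{s<=t} h[s] * x[t-s], computed via FFT."""
    L = x.shape[0]
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    y = np.fft.irfft(np.fft.rfft(x, n=n) * np.fft.rfft(h, n=n), n=n)
    return y[:L]


def gated_long_conv(x, h, gate):
    """Toy input-dependent operator: an elementwise gate scales the conv output."""
    return gate * causal_fft_conv(x, h)


rng = np.random.default_rng(0)
L = 1024
x = rng.standard_normal(L)
h = np.exp(-0.01 * np.arange(L))      # hypothetical decaying filter, length L
gate = 1.0 / (1.0 + np.exp(-x))       # hypothetical gate computed from the input

y_fast = causal_fft_conv(x, h)
y_direct = np.convolve(x, h)[:L]      # direct O(L^2) reference

print("FFT and direct results agree:", np.allclose(y_fast, y_direct))
print("gated output shape:", gated_long_conv(x, h, gate).shape)
```

The gate is what makes the operator depend on the input rather than being a fixed linear filter; the FFT is what keeps the cost below quadratic as L grows.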

Scaling Law Analysis

The research team conducted scaling law analysis for DNA pre-training, systematically comparing four architectures: Transformer++, Mamba, Hyena, and StripedHyena:

3. Training Data and Scaling Laws

Evo was trained on a dataset called OpenGenome, containing:

Safety Considerations: For biosafety reasons, the training data excluded viruses that infect eukaryotic hosts.

Pre-training Stages:

Scaling Law Findings: DNA sequence modeling follows patterns similar to natural language and vision: as computational resources, model size, and data volume increase, model performance shows predictable improvements. For the Evo 7B model, the estimated compute-optimal token count is 250 billion, while actual training on 300 billion tokens places it at a 17% offset from compute-optimal model size.
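A scaling-law analysis of this kind typically fits a power law relating loss to compute. The sketch below shows the basic fitting step on synthetic points. The data, the constants, and the simple form L = a * C^(-b) are assumptions for illustration only and are not values or the exact functional form from the paper.

```python
import numpy as np

# Synthetic (compute, loss) points generated from a known power law.
# These are NOT measurements from the Evo paper.
true_a, true_b = 12.0, 0.05
compute = np.logspace(18, 21, 8)            # training FLOPs
loss = true_a * compute ** (-true_b)

# A power law L = a * C^(-b) is a straight line in log-log space,
# so an ordinary least-squares line fit recovers a and b.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
b_hat = -slope
a_hat = np.exp(intercept)


def predict_loss(c):
    """Extrapolate the fitted power law to a new compute budget."""
    return a_hat * c ** (-b_hat)


print(f"fitted exponent b = {b_hat:.4f}, prefactor a = {a_hat:.3f}")
print(f"predicted loss at 1e22 FLOPs: {predict_loss(1e22):.4f}")
```

Real fits are noisier and often include an irreducible-loss term, but the same idea underlies the claim that performance improves predictably with scale.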

4. Zero-Shot Functional Prediction: Cross-Modal Capability Assessment

4.1 Protein Function Prediction

Evo was evaluated for its ability to predict the effects of mutations on protein function in a zero-shot setting. Using Deep Mutational Scanning (DMS) datasets, experimental fitness scores for amino acid sequences were predicted via language model likelihood or pseudo-likelihood.
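The scoring procedure can be sketched as follows: compute the log-likelihood of each mutant under the language model, subtract the wild-type log-likelihood, and rank-correlate the differences with measured fitness. The code below uses a toy bigram model over nucleotides as a stand-in for Evo. The corpus, sequences, and fitness values are made up for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def train_bigram(corpus):
    """Toy stand-in for a language model: add-one-smoothed bigram log-probs."""
    counts = defaultdict(Counter)
    for seq in corpus:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    logp = {}
    for prev in ALPHABET:
        total = sum(counts[prev].values()) + len(ALPHABET)
        logp[prev] = {n: math.log((counts[prev][n] + 1) / total) for n in ALPHABET}
    return logp


def log_likelihood(seq, logp):
    """Sum of next-token log-probabilities over the sequence."""
    return sum(logp[prev][nxt] for prev, nxt in zip(seq, seq[1:]))


def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = math.sqrt(sum((x - ma) ** 2 for x in ra))
    sb = math.sqrt(sum((y - mb) ** 2 for y in rb))
    return cov / (sa * sb) if sa and sb else float("nan")


# Made-up data: a tiny corpus, a wild-type sequence, and three point mutants.
logp = train_bigram(["ATGGCGATCGCGATTGCG", "ATGGCGTTCGCGATCGCG"])
wild_type = "ATGGCGATCGCG"
mutants = ["ATGGCGATCGCC", "ATGACGATCGCG", "ATGGCGTTCGCG"]
measured_fitness = [0.2, 0.5, 0.9]          # hypothetical DMS scores

wt_ll = log_likelihood(wild_type, logp)
scores = [log_likelihood(m, logp) - wt_ll for m in mutants]
print("zero-shot scores:", [round(s, 3) for s in scores])
print("Spearman vs. measured fitness:", spearman(scores, measured_fitness))
```

No labeled fitness data enters the scoring step, which is what makes the evaluation zero-shot; the DMS measurements are used only to judge the ranking afterward.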

Key Findings:

Limitations: On human protein DMS datasets, Evo could not predict mutation effects on fitness, likely because the pre-training dataset consisted solely of prokaryotic sequences. However, the study observed a strong correlation between wild-type sequence language model perplexity and fitness prediction performance, suggesting that fine-tuning on mammalian coding sequences or future pre-training could extend Evo's performance beyond bacterial proteins.

4.2 Non-Coding RNA Function Prediction

Evo was evaluated on mutation effect prediction tasks for non-coding RNAs (ncRNAs) such as tRNA, ribosomal RNA, and ribozymes.

Key Findings:

4.3 Regulatory DNA Activity Prediction

Promoter Activity Prediction:

Protein Expression Prediction:

5. Multimodal Generative Design: From CRISPR to Transposons

5.1 Codesign of CRISPR-Cas Systems

Evo was used to generate CRISPR-Cas molecular complexes containing interacting protein and ncRNA components.

Fine-tuning Strategy:

Generation Results: Some predicted ORFs shared less than 40% protein sequence similarity with the closest natural Cas9.

Functional Validation: From approximately 2 million Evo-generated sequences, 11 Cas9 systems with robust predicted pLDDT scores were selected for functional validation. One generated product named EvoCas9-1 showed robust activity:

5.2 IS200/IS605 Transposon Systems

Evo was also used to generate IS200/IS605 family transposon systems, which catalyze "cut-and-paste" transposition through interactions between TnpA transposase and terminal hairpins.

Fine-tuning and Generation:

Experimental Validation: Among 48 experimentally tested Evo-generated designs, 11 IS200-like elements and 3 IS605-like elements showed evidence of in vitro excision and insertion. That is 14 of 48 designs overall (about 29%), so the success rate approaching 50% likely refers to the IS200-like designs considered on their own. These active elements used diverse hairpins and encoded TnpA proteins with sequence identity as low as 67% to the fine-tuning database.

Significance: This is the first example of protein-DNA codesign with a language model.

6. Genome-Scale Learning: Gene Essentiality and Sequence Generation

6.1 Gene Essentiality Prediction

Second-stage pre-training at a context length of 131,072 tokens lets Evo analyze sequence at genome scale. The study evaluated the model's sensitivity to gene essentiality:

Key Findings: Across 58 whole-genome essentiality studies, log-likelihood changes in Evo's 66k context significantly correlated with gene essentiality in 49 genomes. Providing additional genomic context (from gene sequences only to 8k context) significantly improved performance, but average performance from 8k to 66k context was comparable.
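Comparing "gene only", 8k, and 66k inputs requires cutting a window of a given length around each gene before scoring it. The sketch below shows one way to do that. Splitting the flanks evenly and clipping at the sequence ends are assumptions made here for illustration, and the function name is hypothetical; the source does not describe how the windows were constructed.

```python
def gene_with_context(genome, start, end, context_len):
    """Return (window, gene_offset) for the gene genome[start:end].

    The window holds the gene plus flanking sequence and is at most
    context_len long. Flanks are split evenly and clipped at the sequence
    ends (a linear-genome simplification; a circular genome would wrap).
    """
    gene_len = end - start
    if context_len <= gene_len:
        # No room for flanks: return the gene itself, truncated if needed.
        return genome[start:end][:context_len], 0
    flank = (context_len - gene_len) // 2
    left = max(0, start - flank)
    right = min(len(genome), end + flank)
    return genome[left:right], start - left


# Made-up toy genome with a 6-nt "gene" at positions 100..106.
genome = "A" * 100 + "ATGCCC" + "T" * 100

for context_len in (6, 26, 1000):
    window, offset = gene_with_context(genome, 100, 106, context_len)
    print(context_len, len(window), offset, window[offset:offset + 6])
```

With windows like these, the same likelihood-change score can be computed at each context length and compared against essentiality labels.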

6.2 Genome-Scale Sequence Generation

Evo was used to generate 16 sequences of approximately 1 million bases each, representing a scale more than 7 times the model's context length.

Generation Quality:

Limitations:

These results are consistent with findings from generative models in other domains (such as natural language or image generation): direct sampling from pre-trained models typically produces sequences that are syntactically correct but locally biased toward simpler constructions and globally incoherent.

7. Discussion: Capability Boundaries and Future Directions

Evo represents significant progress in genomic foundation models, achieving prediction and generation tasks at molecular, systems, and genome scales. However, as a first-generation DNA foundation model, it faces several technical limitations and challenges.

7.1 Technical Limitations

Pre-training Data:

7.2 Biosafety Considerations

Models capable of genome-scale design have potential to advance therapeutic discovery, sustainability, and fundamental biological understanding, but also raise biosafety and ethical considerations. The research team implemented the following measures:

7.3 Future Directions

  • Increase model scale
  • Extend context length
  • Introduce more diverse pre-training data (including eukaryotic genomes)
  • Combine with advances in large-scale genome engineering
  • Extend the scope of bioengineering and design to entire genome scale

Integrating eukaryotic genomes will require accounting for their higher complexity, along with substantial investment in engineering, compute, and safety-related model alignment.

References:
[1] Nguyen E, Poli M, Durrant MG, et al. Sequence modeling and design from molecular to genome scale with Evo. Science. 2024;386(6723):eado9336.
