Original: Nguyen et al., Science 2024
DOI: 10.1126/science.ado9336
Institution: Arc Institute & Stanford University
Abstract
Evo is a genomic foundation model published in 2024 by a joint team from the Arc Institute and Stanford University. It has 7 billion parameters and a context length of 131K tokens, and it uses the StripedHyena architecture to model long sequences at single-nucleotide resolution. Trained on 2.7 million prokaryotic and phage genomes, Evo shows zero-shot function prediction across DNA, RNA, and protein modalities, and it has been used for multimodal generative design of CRISPR-Cas systems and transposon systems.
1. Background: Challenges in Cross-Modal and Cross-Scale Biological Modeling
The fundamental instructions of life are encoded in the DNA sequences of all organisms. Understanding these instructions can deepen our knowledge of biological processes and open new avenues for reprogramming biology to create useful technologies. However, even the simplest microbial genomes are extraordinarily complex, with millions of base pairs encoding interactions between DNA, RNA, and proteins. These are the three modalities of the central dogma of molecular biology, and together they carry out cellular function.
This complexity exists at multiple scales, from individual molecules to entire genomes, representing a vast landscape of genetic information functionally selected over evolutionary time.
Limitations of Existing Methods:
- Existing machine learning methods focus primarily on modality-specific models, optimized separately for proteins, coding sequences, RNA, or regulatory DNA
- Generative applications are limited to single molecules, simple complexes, or short DNA sequence design
- Complex biological processes (such as gene regulation, CRISPR immunity, or genetic transposition) depend on numerous interactions between molecules of multiple modalities
A DNA model capable of unifying information at molecular, systems, and genome scales could learn from large genomic regions to capture system-wide interactions, enabling the design of more complex biological functions.
Technical Obstacles: Applying large language model technology to DNA sequence modeling faces specific challenges. In mainstream dense Transformer architectures, the cost of attention grows quadratically with input sequence length, which becomes the dominant cost once sequence length is large relative to model width. These architectures also typically perform worse at single-nucleotide or byte-level resolution than at coarser resolutions. Consequently, Transformer-based DNA models are limited to short context lengths and rely on tokenization schemes that aggregate several nucleotides into one token, sacrificing single-nucleotide resolution.
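To make the trade-off concrete, the following minimal sketch counts the query-key pairs a dense attention layer must score at two context lengths, and shows how a non-overlapping k-mer tokenizer shortens the input at the price of resolution. The function names and the choice of k = 6 are illustrative assumptions, not details from the paper.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by one dense self-attention head (L * L)."""
    return seq_len * seq_len


def kmer_tokens(seq: str, k: int) -> list[str]:
    """Split a sequence into non-overlapping k-mers (illustrative tokenizer)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


short_ctx, long_ctx = 8_192, 131_072
length_growth = long_ctx // short_ctx
pair_growth = attention_pairs(long_ctx) // attention_pairs(short_ctx)
print(f"context grows {length_growth}x, attention pairs grow {pair_growth}x")

# k-mer tokenization shortens the input, so attention is cheaper...
wild_type = "ACGTACGTACGT"
mutant = "ACGTACGTACGA"  # single-base substitution at the last position
wt_tokens = kmer_tokens(wild_type, 6)
mut_tokens = kmer_tokens(mutant, 6)
print(wt_tokens, mut_tokens)

# ...but a one-nucleotide change now swaps out a whole 6-base token.
changed = [i for i, (a, b) in enumerate(zip(wt_tokens, mut_tokens)) if a != b]
print("tokens changed by one substitution:", changed)
```

A 16-fold longer context costs 256 times as many attention pairs, which is why coarse tokenization is tempting and why it conflicts with single-nucleotide modeling.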
2. Technical Architecture: StripedHyena and Long-Sequence Modeling
Evo adopts the StripedHyena architecture, a hybrid model design combining attention mechanisms with data-controlled convolutional operators. Specifically:
- Evo contains 32 blocks, 29 of which are Hyena layers (data-controlled convolutional operators)
- The remaining 3 blocks (roughly 10%) use multi-head attention equipped with Rotary Position Embeddings (RoPE)
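The block counts above can be written down as a simple layout. This is only a sketch: the text gives the number of attention blocks but not where they sit in the stack, so the evenly spaced positions below are an assumption made purely for illustration.

```python
NUM_BLOCKS = 32
NUM_ATTENTION = 3


def build_layout(num_blocks: int, num_attention: int) -> list[str]:
    """Return a list of block types, one entry per block.

    ASSUMPTION: attention blocks are placed at the centres of equal-sized
    segments of the stack. The source does not state the real positions.
    """
    attention_idx = {
        (2 * i + 1) * num_blocks // (2 * num_attention)
        for i in range(num_attention)
    }
    return ["attention" if i in attention_idx else "hyena"
            for i in range(num_blocks)]


layout = build_layout(NUM_BLOCKS, NUM_ATTENTION)
attention_fraction = layout.count("attention") / len(layout)
print(layout.count("hyena"), "Hyena blocks,",
      layout.count("attention"), "attention blocks")
print(f"attention share: {attention_fraction:.1%}")
```

Three of 32 blocks is about 9.4%, which matches the figure of roughly 10% quoted above.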
This hybrid design aims to combine the advantages of both mechanisms:
- Hyena layers process sequences in an input-dependent manner through combinations of short and long convolutional filters, which makes them particularly adept at filtering the noisy patterns that can appear in DNA and at aggregating individual nucleotides into motifs
- Attention layers provide global context aggregation capabilities
Hyena layers belong to the category of deep signal processing primitives, achieving efficient, input-dependent computation through structured operators compatible with fast multiplication algorithms that can be evaluated in sub-quadratic time. This design enables Evo to process sequences up to 131,072 tokens at single-nucleotide resolution while maintaining computational efficiency.
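The "fast multiplication algorithms" mentioned above can be illustrated with an FFT-based long convolution. The sketch below is a deliberately simplified stand-in, not the Hyena operator itself: it shows that a causal convolution with a filter as long as the sequence can be computed in O(L log L) via the FFT and matches the O(L^2) direct sum, and it adds an elementwise gate as a rough analogue of input-dependent ("data-controlled") computation. The gate and all function names are assumptions for illustration.

```python
import numpy as np


def causal_conv_direct(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Reference O(L^2) causal convolution: y[t] = sum_{s<=t} h[s] * x[t-s]."""
    L = len(x)
    y = np.zeros(L)
    for t in range(L):
        for s in range(t + 1):
            y[t] += h[s] * x[t - s]
    return y


def causal_conv_fft(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Same result in O(L log L) using zero-padded real FFTs."""
    L = len(x)
    n = 2 * L  # padding prevents circular wrap-around
    spectrum = np.fft.rfft(x, n=n) * np.fft.rfft(h, n=n)
    return np.fft.irfft(spectrum, n=n)[:L]


def gated_long_conv(x: np.ndarray, h: np.ndarray) -> np.ndarray:
    """Toy input-dependent operator: a gate computed from x scales the output.

    ASSUMPTION: a sigmoid of the input is used as the gate. Real Hyena layers
    use learned projections, which are omitted here.
    """
    gate = 1.0 / (1.0 + np.exp(-x))
    return gate * causal_conv_fft(x, h)


rng = np.random.default_rng(0)
x = rng.standard_normal(64)
h = rng.standard_normal(64)  # filter as long as the sequence
y_direct = causal_conv_direct(x, h)
y_fft = causal_conv_fft(x, h)
print("max abs difference:", float(np.max(np.abs(y_direct - y_fft))))
```

The two implementations agree to floating-point precision; only the FFT version remains practical as the sequence length approaches 131,072.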
Scaling Law Analysis
The research team conducted a scaling law analysis for DNA pre-training, systematically comparing four architectures (Transformer++, Mamba, Hyena, and StripedHyena):
- Under compute-optimal protocols, Transformer++ produced significantly worse perplexity across all compute budgets, reflecting the architecture's inefficiency at byte resolution
- State space and deep signal processing architectures both showed better scaling rates than Transformer++, with Hyena and StripedHyena performing best
3. Training Data and Scaling Laws
Evo was trained on a dataset called OpenGenome, containing:
- Over 80,000 bacterial and archaeal genomes
- Millions of predicted phage and plasmid sequences
- A total of 300 billion nucleotide tokens
Safety Considerations: For biosafety reasons, the training data excluded viruses that infect eukaryotic hosts.
Pre-training Stages:
- Stage 1: Context length of 8,192 tokens
- Stage 2: Context expanded to 131,072 tokens
Scaling Law Findings: DNA sequence modeling follows patterns similar to natural language and vision: as computational resources, model size, and data volume increase, model performance shows predictable improvements. For the Evo 7B model, the estimated compute-optimal token count is 250 billion, while actual training on 300 billion tokens places it at a 17% offset from compute-optimal model size.
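For a sense of scale, a common rule of thumb for dense Transformers estimates training compute as roughly 6 FLOPs per parameter per token. The sketch below applies it to the parameter and token counts quoted above. This rule is an outside assumption brought in for illustration: it is not taken from the Evo paper, and StripedHyena's convolutional layers need not follow it, so the output is an order-of-magnitude guide at best.

```python
def approx_training_flops(num_params: float, num_tokens: float,
                          flops_per_param_token: float = 6.0) -> float:
    """Rule-of-thumb compute estimate C ~ 6 * N * D (ASSUMPTION, see above)."""
    return flops_per_param_token * num_params * num_tokens


evo_params = 7e9        # 7 billion parameters (from the text)
actual_tokens = 300e9   # tokens actually trained on (from the text)
optimal_tokens = 250e9  # estimated compute-optimal token count (from the text)

actual_flops = approx_training_flops(evo_params, actual_tokens)
token_ratio = actual_tokens / optimal_tokens

print(f"approximate training compute: {actual_flops:.2e} FLOPs")
print(f"tokens seen relative to the compute-optimal count: {token_ratio:.2f}x")
```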
4. Zero-Shot Functional Prediction: Cross-Modal Capability Assessment
4.1 Protein Function Prediction
Evo was evaluated for its ability to predict the effects of mutations on protein function in a zero-shot setting. Using Deep Mutational Scanning (DMS) datasets, experimental fitness scores for amino acid sequences were predicted via language model likelihood or pseudo-likelihood.
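The evaluation described above reduces to two steps: score each variant with the model's likelihood, then rank-correlate the scores with measured fitness. The sketch below shows that loop with a pure-Python Spearman correlation. `toy_log_likelihood` is a hypothetical stand-in for a real language model (it uses fixed per-base probabilities), and the variants and fitness values are made-up placeholders, not data from any DMS study.

```python
import math
from typing import Callable, Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


def toy_log_likelihood(seq: str) -> float:
    """HYPOTHETICAL scorer: independent per-base probabilities, not Evo."""
    probs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
    return sum(math.log(probs[base]) for base in seq)


def zero_shot_correlation(variants: Sequence[str], fitness: Sequence[float],
                          scorer: Callable[[str], float]) -> float:
    """Correlate model scores with experimental fitness, with no training."""
    return spearman([scorer(v) for v in variants], fitness)


# Placeholder variants and fitness values, for illustration only.
variants = ["AAAA", "AAAC", "AACC", "ACCC"]
fitness = [4.0, 3.0, 2.0, 1.0]
rho = zero_shot_correlation(variants, fitness, toy_log_likelihood)
print(f"Spearman correlation on toy data: {rho:.2f}")
```

In practice the scorer would be the language model's sequence log-likelihood (or a pseudo-likelihood), and the rank correlation is what makes the comparison insensitive to the scale of the scores.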
Key Findings:
- On prokaryotic protein DMS datasets, Evo outperformed all other tested nucleotide models, including GenSLM models specifically trained on coding sequences
- Evo achieved performance comparable to leading protein-specific language models
- This indicates that despite being trained on long genomic sequences without explicit coding sequence annotations, Evo can still acquire deep understanding of protein-coding sequences
Limitations: On human protein DMS datasets, Evo could not predict mutation effects on fitness, likely because the pre-training dataset consisted solely of prokaryotic sequences. However, the study observed a strong correlation between wild-type sequence language model perplexity and fitness prediction performance, suggesting that fine-tuning on mammalian coding sequences or future pre-training could extend Evo's performance beyond bacterial proteins.
4.2 Non-Coding RNA Function Prediction
Evo was evaluated on mutation effect prediction tasks for non-coding RNAs (ncRNAs) such as tRNA, ribosomal RNA, and ribozymes.
Key Findings:
- Evo again outperformed all other tested nucleotide language models, including RNA-FM models specifically trained on ncRNA sequences
- In studies measuring the effects of 5S rRNA mutations on E. coli growth rate, Evo showed strong prediction performance (Spearman correlation r = 0.60)
- These results demonstrate that Evo can learn mutation effects on ncRNA function, extending beyond the realm of protein sequences
4.3 Regulatory DNA Activity Prediction
Promoter Activity Prediction:
- Evo's zero-shot likelihood showed non-zero correlation with promoter activity across four independent studies (average Spearman r = 0.43)
- This exceeded the correlations obtained from sequence GC content and from GenSLM's zero-shot likelihood
- When Evo embeddings were combined with a supervised CNN architecture, performance approached that of Promoter Calculator, a state-of-the-art promoter activity prediction method
Protein Expression Prediction:
- Zero-shot likelihood of RBS sequences alone showed weak correlation (r = 0.17)
- Correlation improved significantly when the promoter and RBS sequences were concatenated and scored together
5. Multimodal Generative Design: From CRISPR to Transposons
5.1 Codesign of CRISPR-Cas Systems
Evo was used to generate CRISPR-Cas molecular complexes containing interacting protein and ncRNA components.
Fine-tuning Strategy:
- Fine-tuned on a dataset of 72,831 CRISPR-Cas loci
- Added special prompt tokens for Cas9, Cas12, and Cas13
- The model could generate coherent sequences containing corresponding Cas coding sequences and CRISPR arrays
Generation Results: Some predicted ORFs had less than 40% protein sequence similarity to the closest natural Cas9.
Functional Validation: From approximately 2 million Evo-generated sequences, 11 Cas9 systems with robust predicted pLDDT scores were selected for functional validation. One of them, named EvoCas9-1, showed robust activity:
- It was recombinantly expressed, purified, and paired with a chemically synthesized, Evo-generated sgRNA
- Its in vitro cleavage activity was comparable to that of SpCas9
- Its amino acid sequence has 79.9% identity to the closest Cas9 in the fine-tuning database
- Its amino acid sequence has 73.1% identity to SpCas9
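Identity figures like those above are computed from a pairwise alignment. The sketch below shows one common convention, counting identical residues over aligned columns; the text does not say which denominator the authors used (alignment length, shorter sequence, and so on), so this choice is an assumption, and the two short peptide strings are invented placeholders rather than real Cas9 sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned columns with identical residues.

    Inputs must already be aligned (equal length, gaps marked with `gap`).
    ASSUMPTION: columns where both sequences have a gap are ignored, and a
    residue opposite a gap counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Invented toy alignment: one substitution (R/Q) and one gap.
identity = percent_identity("MKRT-LV", "MKQTALV")
print(f"{identity:.1f}% identity")
```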
5.2 IS200/IS605 Transposon Systems
Evo was also used to generate IS200/IS605 family transposon systems, which catalyze "cut-and-paste" transposition through interactions between TnpA transposase and terminal hairpins.
Fine-tuning and Generation:
- Fine-tuned on 10,720 IS605 elements and 219,866 IS200 elements
- The model learned representations of mobile genetic element (MGE) boundaries
- Given one end of an element, the model could specify the other, reflecting the tight evolutionary coupling between the two terminal elements
Experimental Validation: Of 48 Evo-generated designs tested experimentally, 11 IS200-like elements and 3 IS605-like elements showed evidence of both excision and insertion in vitro, an overall success rate of about 29% (14 of 48). These active elements used diverse hairpins and encoded TnpA proteins with sequence identity as low as 67% to the fine-tuning database.
Significance: According to the authors, this is the first example of protein-DNA codesign with a language model.
6. Genome-Scale Learning: Gene Essentiality and Sequence Generation
6.1 Gene Essentiality Prediction
Through second-stage pre-training with a 131,072-token context, Evo can take genome-scale context into account. The study evaluated the model's sensitivity to gene essentiality:
- Inserted premature stop codons at the beginning of each coding sequence
- Measured the impact of these changes on Evo likelihood
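The two steps above can be sketched as a small in silico perturbation routine. Everything beyond the basic idea is an assumption made for illustration: the stop codon used (TAA), the offset of one codon after the start codon, the forward-strand-only handling, and the `scorer` callable, which stands in for a model log-likelihood and is replaced here by a trivial placeholder.

```python
from typing import Callable


def insert_premature_stops(genome: str, cds_start: int, offset_codons: int = 1,
                           n_stops: int = 1, stop_codon: str = "TAA") -> str:
    """Insert stop codon(s) in frame near the start of a forward-strand CDS.

    ASSUMPTIONS: `cds_start` is the 0-based index of the start codon, and the
    stop codons go `offset_codons` codons downstream of it.
    """
    pos = cds_start + 3 * offset_codons
    return genome[:pos] + stop_codon * n_stops + genome[pos:]


def likelihood_change(genome: str, cds_start: int,
                      scorer: Callable[[str], float]) -> float:
    """Score of the disrupted sequence minus score of the original.

    A strongly negative value would suggest the model treats the gene as
    important in its context.
    """
    mutated = insert_premature_stops(genome, cds_start)
    return scorer(mutated) - scorer(genome)


def placeholder_scorer(seq: str) -> float:
    """HYPOTHETICAL scorer: each base costs one unit. Not a real model."""
    return -float(len(seq))


# Toy sequence: 2 flanking bases, a 4-codon ORF (ATG AAA CCC TGA), 2 flanking bases.
toy_genome = "GG" + "ATGAAACCCTGA" + "GG"
mutated_genome = insert_premature_stops(toy_genome, cds_start=2)
delta = likelihood_change(toy_genome, 2, placeholder_scorer)
print(mutated_genome, delta)
```

With a real model as the scorer, repeating this for every annotated gene yields one likelihood change per gene, which can then be compared against experimental essentiality labels.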
Key Findings: Across 58 whole-genome essentiality studies, log-likelihood changes in Evo's 66k context significantly correlated with gene essentiality in 49 genomes. Providing additional genomic context (from gene sequences only to 8k context) significantly improved performance, but average performance from 8k to 66k context was comparable.
6.2 Genome-Scale Sequence Generation
Evo was used to generate 16 sequences of approximately 1 million bases each, representing a scale more than 7 times the model's context length.
Generation Quality:
- Used species-level tokens to prompt the model to generate bacterial genomes
- Generated sequences had coding density almost identical to natural genomes, much higher than random sequences
- Visualizations showed that both natural and generated sequences exhibited similar coding organization patterns, with adjacent sequences typically having the same strand orientation
- Protein structure predictions obtained using ESMFold showed that almost all sequences had predicted secondary structure and globular folds
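Coding density, used above to compare generated and natural sequences, is the fraction of bases covered by predicted genes. A minimal sketch of the calculation follows, assuming gene coordinates are already available as half-open intervals from some gene caller; the interval values are invented for illustration.

```python
def coding_density(gene_intervals: list[tuple[int, int]],
                   genome_length: int) -> float:
    """Fraction of bases covered by at least one gene.

    Intervals are half-open [start, end) and may overlap; overlapping
    stretches are merged so that no base is counted twice.
    """
    covered = 0
    current_start = None
    current_end = None
    for start, end in sorted(gene_intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


# Invented example: two overlapping genes and one separate gene on 1,000 bases.
density = coding_density([(0, 300), (250, 600), (700, 1000)], 1000)
print(f"coding density: {density:.0%}")
```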
Limitations:
- Generated sequences did not contain many highly conserved marker genes indicative of complete genomes
- Only 3 rRNAs were generated across the roughly 16 million bases of sampled sequence
- Many protein structure predictions had low confidence, biased toward evolutionarily simpler alpha-helix secondary structures
- Limited matching with natural protein databases
These results are consistent with findings from generative models in other domains (such as natural language or image generation): direct sampling from a pre-trained model typically produces sequences that are syntactically correct but locally biased toward simpler constructions and globally incoherent.
7. Discussion: Capability Boundaries and Future Directions
Evo represents significant progress in genomic foundation models, achieving prediction and generation tasks at molecular, systems, and genome scales. However, as a first-generation DNA foundation model, it faces several technical limitations and challenges.
7.1 Technical Limitations
Pre-training Data:
- Evo was trained on 300 billion prokaryotic tokens, representing only a tiny fraction of publicly available genomic data
- Since the model was trained only on prokaryotic data, its ability to predict functional effects of human protein mutations is limited
Generation Quality:
- Many CRISPR-Cas generations contain obviously problematic sequences, such as missing or truncated cas genes
- At genome scale, Evo struggles to include key marker genes such as complete rRNA sets
7.2 Biosafety Considerations
Models capable of genome-scale design have potential to advance therapeutic discovery, sustainability, and fundamental biological understanding, but also raise biosafety and ethical considerations. The research team implemented the following measures:
- Excluding viruses that infect eukaryotic hosts from the training data as a safety precaution
- Open-sourcing the model to promote transparency and dialogue with the broader scientific community
7.3 Future Directions
- Increase model scale
- Extend context length
- Introduce more diverse pre-training data (including eukaryotic genomes)
- Combine with advances in large-scale genome engineering
- Extend the scope of bioengineering and design to entire genome scale
Integrating eukaryotic genomes will require accounting for their higher complexity, as well as substantial investment in engineering, compute, and safety-related model alignment.
References:
[1] Nguyen E, Poli M, Durrant MG, et al. Sequence modeling and design from molecular to genome scale with Evo. Science. 2024;386(6723):eado9336.