Original: Brixi et al., bioRxiv 2025
DOI: 10.1101/2025.02.18.638918
Institution: Arc Institute, Stanford University, NVIDIA
Abstract
Evo 2 is a biological foundation model released in 2025 by a joint team from Arc Institute, Stanford University, and NVIDIA, trained on 9.3 trillion DNA base pairs covering all domains of life. Evo 2 comes in two sizes, 7B and 40B parameters, with an unprecedented 1 million token context window and single-nucleotide resolution. In zero-shot settings, the model accurately predicts the functional effects of genetic variation, from non-coding pathogenic mutations to clinically significant BRCA1 variants, and it demonstrates for the first time controllable design of epigenomic structure through inference-time search.
1. Background: From Prokaryotes to Eukaryotes
The fundamental instructions of life are encoded in DNA sequences. While tools for sequencing, synthesizing, and editing genomic code have transformed biological research, intelligently composing new biological systems requires a deep understanding of the immense complexity encoded in genomes. Previous studies showed that machine learning models trained on bacterial genome sequences can model the functions of DNA, RNA, and proteins, as well as the interactions through which they form complex molecular machines.
However, extending this sequence modeling paradigm to eukaryotic genomes requires advances in data curation, model architecture, training and inference infrastructure, and inference-time computation to address:
- Complex genome architecture: Eukaryotic evolution produced extensive non-coding regions, alternative splicing patterns, and multi-layer epigenomic control
- Multicellularity and complex traits: These features underpin the emergence of multicellularity, complex traits, and intelligent behaviors unique to eukaryotic life
2. Technical Architecture: StripedHyena 2
Evo 2 adopts the StripedHyena 2 architecture, the first convolution-based multi-hybrid architecture. Multi-hybrids are a new class of architecture that arranges different types of operators in a striped pattern to exploit the synergies between them.
Architecture Features:
- Combines three variants of input-dependent convolution operators with attention
- Improves training efficiency on both short and long sequences
- At the 40B-parameter scale, trains up to 1.3x faster than optimized Transformer baselines at 16K context length
- At 1 million context length, the speedup over the same baselines reaches up to 3x
Two-Stage Training Strategy
- Stage 1: Pre-training with 8,192 token context length, data weighted to focus on gene windows
- Stage 2: Expands context to 1 million tokens through multi-stage mid-training
3. Training Data and Open Science
Evo 2 is trained on OpenGenome2:
- Over 8.8 trillion nucleotides from bacteria, archaea, eukaryotes, and phages
- 7B parameter version: Trained on 2.4 trillion tokens
- 40B parameter version: Trained on 9.3 trillion tokens
Safety: Eukaryotic virus genomes were excluded from the training data.
Open Source: Model weights, training code, inference code, and training data are released under an open license.
4. Zero-Shot Functional Prediction
4.1 Cross-Domain Mutation Effect Prediction
- In 20 prokaryotic and 16 eukaryotic species, model likelihood changes align with known biological constraints
- Nonsynonymous variants, premature stop codons, and frameshift mutations cause greater likelihood changes than synonymous mutations
- 40B parameter model shows higher sensitivity to deletions in miRNA and snoRNA sequences
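The likelihood-change idea above can be sketched in a few lines. This is a minimal illustration, not the paper's scoring protocol: a toy order-2 Markov model stands in for Evo 2, and the names fit_markov_model, sequence_log_likelihood, and delta_log_likelihood are hypothetical. The only point is that a variant is scored by comparing the model's log-likelihood of the mutated sequence with that of the reference.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def fit_markov_model(sequences, order=2, pseudocount=1.0):
    """Fit a toy order-k Markov model (assumed stand-in for a genomic language model)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def log_prob(context, base):
        # Smoothed conditional probability of `base` given the preceding k bases.
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx.get(base, 0.0) + pseudocount) / total)

    return log_prob, order


def sequence_log_likelihood(model, seq):
    """Sum of per-base conditional log-probabilities under the model."""
    log_prob, order = model
    return sum(log_prob(seq[i - order:i], seq[i]) for i in range(order, len(seq)))


def delta_log_likelihood(model, ref_seq, alt_seq):
    """log P(alt) - log P(ref); more negative means less plausible under the model."""
    return sequence_log_likelihood(model, alt_seq) - sequence_log_likelihood(model, ref_seq)


# Toy usage: the model has only ever seen a perfect ATG repeat.
model = fit_markov_model(["ATG" * 20])
reference = "ATGATGATGATG"
substitution = "ATGATCATGATG"  # single-base change that breaks the repeat
print(delta_log_likelihood(model, reference, substitution))
```

Because the score is a difference of whole-sequence log-likelihoods, the same function applies to insertions and deletions, where the reference and alternate sequences differ in length.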
4.2 Clinical Variant Effect Prediction
ClinVar: For coding non-SNV variants (indels), Evo 2 outperforms other models in zero-shot classification. For non-coding variants, Evo 2 surpasses other models on both SNVs and non-SNVs.
BRCA1/BRCA2: Evo 2 sets a new state of the art for BRCA1 non-coding SNVs. When coding and non-coding variants are evaluated together, it outperforms all other models.
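Zero-shot classification results of this kind are typically summarized with a ranking metric such as AUROC. The sketch below computes AUROC directly from its definition, the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. The scores and labels are made up for illustration and do not come from the paper; the convention that a more negative likelihood change means "more likely pathogenic" is an assumption of this example.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical likelihood changes for six variants (1 = pathogenic, 0 = benign).
delta_scores = [-5.2, -3.1, -0.4, -0.2, -0.3, -0.1]
labels = [1, 1, 0, 0, 1, 0]

# Negate so that a larger score means "more likely pathogenic".
pathogenicity_scores = [-d for d in delta_scores]
print(auroc(pathogenicity_scores, labels))
```

No threshold or labeled training data is involved, which is what makes the evaluation zero-shot: the labels are used only to measure how well the raw scores rank the variants.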
5. Mechanistic Interpretability
Sparse autoencoders (SAEs) reveal features corresponding to:
- Mobile genetic elements (prophage regions, CRISPR spacers)
- Protein secondary structures (alpha-helices and beta-sheets)
- Human transcription factor binding sites
- Exon-intron architecture (the features also transfer to extinct species such as the mammoth)
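For readers unfamiliar with sparse autoencoders, the sketch below shows the basic forward pass: a dense activation vector is expanded into a much wider feature vector in which only a few entries are non-zero, then reconstructed from those few features. It is a generic top-k variant with random, untrained weights, chosen here for brevity; the SAE configuration actually used for Evo 2 is not described in this summary, and all names and dimensions are hypothetical.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def sae_encode(activation, w_enc, b_enc, k):
    """Project to feature space, apply ReLU, and keep only the k largest features."""
    pre = [relu(sum(w * a for w, a in zip(row, activation)) + b)
           for row, b in zip(w_enc, b_enc)]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre)]


def sae_decode(features, w_dec, b_dec):
    """Reconstruct the activation as a weighted sum of decoder directions."""
    out = list(b_dec)
    for value, direction in zip(features, w_dec):
        if value != 0.0:
            for j in range(len(out)):
                out[j] += value * direction[j]
    return out


# Toy dimensions and random weights (a trained SAE would learn these).
rng = random.Random(0)
d_model, n_features, k = 8, 32, 4
w_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
w_dec = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
b_dec = [0.0] * d_model

activation = [rng.gauss(0, 1) for _ in range(d_model)]
features = sae_encode(activation, w_enc, b_enc, k)
reconstruction = sae_decode(features, w_dec, b_dec)
active = [i for i, f in enumerate(features) if f > 0.0]
print("active feature indices:", active)
```

Interpretation then amounts to asking where in a genome each feature is active, for example whether a given feature fires mainly inside prophage regions or at exon-intron boundaries.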
6. Genome-Scale Generation
6.1 Mitochondrial Genome Generation
Generated 250 unique 16 kb mitochondrial sequences with correct numbers of CDS, tRNA, and rRNA genes.
6.2 Minimal Bacterial Genome
Using M. genitalium (~580 kb) as the model organism: nearly 70% of the genes in Evo 2 40B generations contain significant Pfam hits, a dramatic improvement over Evo 1 131k (18%).
6.3 Eukaryotic Chromosome
Generated 330 kb of DNA using a 10.5 kb segment of S. cerevisiae chromosome III (316 kb) as the prompt. The output is eukaryotic-like DNA with predicted tRNAs, properly positioned promoters, and genes showing intron structure.
7. Inference-Time Search: Generative Epigenomics
First example of inference-time scaling in biological language modeling. Using Enformer and Borzoi to guide Evo 2 generation with beam search:
- Designed chromatin accessibility patterns encoding messages in Morse code: "LO", "ARC", "EVO2"
- Predictable log-linear relationship: Increasing inference-time compute leads to better quality designs
- Achieves AUROC ~0.9 with sufficient beam search width
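The guided search described above can be sketched as a generic beam search in which a generator proposes sequence chunks and a scoring function ranks the partial designs. Everything in this example is a toy stand-in: propose_chunk replaces sampling from Evo 2, design_score replaces predictions from Enformer or Borzoi, and GC content replaces chromatin accessibility. It is meant only to show how beam width and the number of candidates per step control how much inference-time compute is spent.

```python
import random


def propose_chunk(rng, length=8):
    """Hypothetical stand-in for sampling a continuation from a generative model."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def gc_fraction(chunk):
    return sum(base in "GC" for base in chunk) / len(chunk)


def design_score(chunks, target):
    """Hypothetical stand-in for a predictor-based score (higher is better).

    Chunks should be GC-rich where the target pattern is 1 and GC-poor where it is 0.
    """
    return -sum(abs(gc_fraction(c) - t) for c, t in zip(chunks, target))


def beam_search(target, beam_width, n_candidates, seed=0):
    """Extend designs one chunk at a time, keeping the best `beam_width` partial designs."""
    rng = random.Random(seed)
    beams = [[]]  # each beam is the list of chunks generated so far
    for _ in range(len(target)):
        expanded = [beam + [propose_chunk(rng)]
                    for beam in beams
                    for _ in range(n_candidates)]
        expanded.sort(key=lambda b: design_score(b, target), reverse=True)
        beams = expanded[:beam_width]
    best = beams[0]
    return "".join(best), design_score(best, target)


# Toy target pattern: 1 = "open", 0 = "closed", loosely analogous to a coded message.
target_pattern = [1, 0, 1, 1, 0, 0]
sequence, score = beam_search(target_pattern, beam_width=4, n_candidates=8)
print(sequence, score)
```

Each step scores beam_width * n_candidates partial designs, so widening the beam or sampling more candidates spends more compute in exchange for a better chance of matching the target pattern.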
8. Biosafety and Risk Assessment
- Virus risk mitigation: Excluded eukaryotic viruses from training data. Red teaming shows generations in this domain are essentially random.
- Ancestry bias: Evo 2 generalizes well across human populations.
9. Conclusion
Evo 2 represents significant progress in biological foundation models, performing prediction and generation tasks at the molecular, systems, and genome scales across all domains of life.
Key Achievements
- Learning from 9.3 trillion tokens of genome sequence
- Robust prediction of pathogenicity for different mutation types including indels
- State-of-the-art for non-coding and splice-related variants
- Genome-length sequence design at scale of human mitochondrial genome, minimal bacterial genome, or yeast chromosome
- First demonstration of inference-time scaling in biological language modeling
Reference:
[1] Brixi G, Durrant MG, Ku J, et al. Genome modeling and design across all domains of life with Evo 2. bioRxiv. 2025. doi: 10.1101/2025.02.18.638918