Original: Brixi et al., bioRxiv 2025
DOI: 10.1101/2025.02.18.638918
Institution: Arc Institute, Stanford University, NVIDIA

Abstract

Evo 2 is a biological foundation model released in 2025 by a joint team from Arc Institute, Stanford University, and NVIDIA, trained on 9.3 trillion DNA base pairs spanning all domains of life. It comes in two sizes, 7B and 40B parameters, with a context window of up to 1 million tokens at single-nucleotide resolution. In zero-shot settings, the model accurately predicts the functional effects of genetic variation, from non-coding pathogenic mutations to clinically significant BRCA1 variants, and it demonstrates for the first time controllable design of epigenomic structure through inference-time search.

1. Background: From Prokaryotes to Eukaryotes

The fundamental instructions of life are encoded in DNA sequences. While tools for sequencing, synthesizing, and editing genomes have transformed biological research, intelligently composing new biological systems requires a deep understanding of the immense complexity encoded in genomes. Previous studies showed that machine learning models trained on bacterial genome sequences can model the functions of DNA, RNA, and proteins, as well as the interactions through which they form complex molecular machines.

However, extending this sequence modeling paradigm to eukaryotic genomes requires advances in data curation, model architecture, training and inference infrastructure, and inference-time computation to address:

2. Technical Architecture: StripedHyena 2

Evo 2 adopts the StripedHyena 2 architecture, the first convolution-based multi-hybrid architecture. Multi-hybrid architectures are a new class of models that arrange different types of operators in striped patterns to exploit the synergies between them.

Architecture Features:

Two-Stage Training Strategy

3. Training Data and Open Science

Evo 2 is trained on OpenGenome2:

Safety: Eukaryotic virus genomes were excluded from the training data.

Open Source: Model weights, training code, inference code, and training data are released under an open license.

4. Zero-Shot Functional Prediction

4.1 Cross-Domain Mutation Effect Prediction

4.2 Clinical Variant Effect Prediction

ClinVar: For coding non-SNV variants (indels), Evo 2 outperforms other models in zero-shot classification. For non-coding variants, Evo 2 surpasses other models on both SNVs and non-SNVs.

BRCA1/BRCA2: Evo 2 sets a new state of the art for BRCA1 non-coding SNVs. When coding and non-coding variants are evaluated together, it outperforms all other models.
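A common way to score variants zero-shot with a sequence model is to compare the model's log-likelihood of the variant sequence against that of the reference sequence: a strongly negative change suggests the variant is disruptive. The sketch below illustrates that idea only. The function names (`fit_toy_model`, `apply_variant`, `delta_score`) are hypothetical, and the first-order Markov chain is a toy stand-in for a genomic language model, not Evo 2's API.

```python
import math
from typing import Callable, Dict

ALPHABET = "ACGT"


def fit_toy_model(train_seq: str, pseudocount: float = 1.0) -> Callable[[str], float]:
    """Fit a first-order Markov chain and return a log-likelihood function.

    Assumption: this is a toy stand-in for a genomic language model,
    used only to keep the sketch self-contained and runnable.
    """
    counts: Dict[str, Dict[str, float]] = {
        a: {b: pseudocount for b in ALPHABET} for a in ALPHABET
    }
    for prev, nxt in zip(train_seq, train_seq[1:]):
        counts[prev][nxt] += 1.0
    log_p = {
        a: {b: math.log(c / sum(row.values())) for b, c in row.items()}
        for a, row in counts.items()
    }

    def log_likelihood(seq: str) -> float:
        # Sum of log transition probabilities along the sequence.
        return sum(log_p[p][n] for p, n in zip(seq, seq[1:]))

    return log_likelihood


def apply_variant(ref_seq: str, pos: int, ref: str, alt: str) -> str:
    """Replace the ref allele at 0-based pos with alt (works for SNVs and indels)."""
    if ref_seq[pos:pos + len(ref)] != ref:
        raise ValueError("reference allele does not match the sequence")
    return ref_seq[:pos] + alt + ref_seq[pos + len(ref):]


def delta_score(log_likelihood: Callable[[str], float],
                ref_seq: str, pos: int, ref: str, alt: str) -> float:
    """Log-likelihood of the variant sequence minus that of the reference."""
    var_seq = apply_variant(ref_seq, pos, ref, alt)
    return log_likelihood(var_seq) - log_likelihood(ref_seq)


model = fit_toy_model("ACGT" * 20)
window = "ACGTACGT"
print("SNV G>T:", delta_score(model, window, 2, "G", "T"))
print("insertion G>GA:", delta_score(model, window, 2, "G", "GA"))
```

Because the same substitution routine handles insertions and deletions, indels can be scored without a separate code path. Note that an indel changes the sequence length, so whether to compare total or length-normalized likelihoods is a design choice this sketch leaves open.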

5. Mechanistic Interpretability

Sparse autoencoders (SAEs) reveal features corresponding to:

6. Genome-Scale Generation

6.1 Mitochondrial Genome Generation

Evo 2 generated 250 unique 16 kb mitochondrial sequences with the correct numbers of CDS, tRNA, and rRNA genes.

6.2 Minimal Bacterial Genome

Using M. genitalium (~580 kb) as a model: nearly 70% of the genes in Evo 2 40B generations contain significant Pfam hits, a dramatic improvement over Evo 1 131k (18%).

6.3 Eukaryotic Chromosome

Evo 2 generated 330 kb of DNA when prompted with 10.5 kb from S. cerevisiae chromosome III (316 kb). The output is eukaryotic-like DNA with predicted tRNAs, properly positioned promoters, and genes showing intron structure.

7. Inference-Time Search: Generative Epigenomics

This is the first example of inference-time scaling in biological language modeling. Enformer and Borzoi are used to guide Evo 2 generation with beam search.
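The guided-generation loop can be sketched as a plain beam search: at each step every kept sequence is extended with several sampled chunks, an external scorer ranks the candidates, and only the best few survive. The code below is a minimal illustration under stated assumptions. `toy_propose` stands in for sampling a chunk from the generative model, `toy_score` stands in for a predictor such as Enformer or Borzoi (here it just rewards a target GC fraction), and the chunk length and beam settings are arbitrary values, not the paper's.

```python
import random
from typing import Callable, List


def beam_search(propose: Callable[[str], str],
                score: Callable[[str], float],
                prompt: str,
                n_steps: int,
                beam_width: int,
                samples_per_beam: int) -> str:
    """Extend the prompt chunk by chunk, keeping the top-scoring partial sequences."""
    beams: List[str] = [prompt]
    for _ in range(n_steps):
        # Sample several continuations for every sequence currently in the beam.
        candidates = [
            seq + propose(seq)
            for seq in beams
            for _ in range(samples_per_beam)
        ]
        # Rank candidates with the external scorer and keep the best ones.
        candidates.sort(key=score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


rng = random.Random(0)


def toy_propose(seq: str, chunk_len: int = 8) -> str:
    # Hypothetical stand-in for sampling the next chunk from a generative model.
    return "".join(rng.choice("ACGT") for _ in range(chunk_len))


def toy_score(seq: str, target_gc: float = 0.6) -> float:
    # Hypothetical stand-in for a learned predictor: closeness to a target GC fraction.
    gc = sum(base in "GC" for base in seq) / len(seq)
    return -abs(gc - target_gc)


best = beam_search(toy_propose, toy_score, "ATGC",
                   n_steps=5, beam_width=2, samples_per_beam=4)
print(best, toy_score(best))
```

Raising `samples_per_beam` or `beam_width` spends more compute at inference time in exchange for candidates that score better under the guide, which is the sense in which this kind of search scales with inference-time compute.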

8. Biosafety and Risk Assessment

9. Conclusion

Evo 2 represents significant progress in biological foundation models, supporting prediction and generation at molecular, systems, and genome scales across all domains of life.

Key Achievements

  • Learning from 9.3 trillion tokens of genome sequence
  • Robust prediction of pathogenicity for different mutation types including indels
  • State-of-the-art for non-coding and splice-related variants
  • Genome-length sequence design at scale of human mitochondrial genome, minimal bacterial genome, or yeast chromosome
  • First demonstration of inference-time scaling in biological language modeling

Reference:
[1] Brixi G, Durrant MG, Ku J, et al. Genome modeling and design across all domains of life with Evo 2. bioRxiv. 2025. doi: 10.1101/2025.02.18.638918
