Evo 2: Genome Modeling and Design Across All Domains of Life

Original: Brixi et al., bioRxiv 2025
DOI: 10.1101/2025.02.18.638918
Institution: Arc Institute, Stanford University, NVIDIA

Abstract

Evo 2 is a biological foundation model released in 2025 by a joint team from Arc Institute, Stanford University, and NVIDIA, trained on up to 9.3 trillion DNA base pairs spanning all domains of life. It comes in two sizes, 7B and 40B parameters, with a context window of up to 1 million tokens at single-nucleotide resolution. In zero-shot settings, the model predicts the functional effects of mutations, from non-coding pathogenic variants to clinically significant BRCA1 variants, and it demonstrates controllable design of epigenomic structure through inference-time search.

1. Background: From Prokaryotes to Eukaryotes

The fundamental instructions of life are encoded in DNA sequences. While tools for sequencing, synthesizing, and editing genomes have transformed biological research, intelligently designing new biological systems requires a deep understanding of the immense complexity encoded in genomes. Previous studies showed that machine learning models trained on bacterial genome sequences can model the functions of DNA, RNA, and proteins, as well as the interactions through which they form complex molecular machines.

However, extending this sequence modeling paradigm to eukaryotic genomes requires advances in data curation, model architecture, training and inference infrastructure, and inference-time computation to address:

Complex genome architecture: Eukaryotic evolution produced extensive non-coding regions, alternative splicing patterns, and multi-layer epigenomic control
Multicellularity and complex traits: These features underpin the emergence of multicellularity, complex traits, and intelligent behaviors unique to eukaryotic life

2. Technical Architecture: StripedHyena 2

Evo 2 adopts the StripedHyena 2 architecture, the first convolution-based multi-hybrid architecture. Multi-hybrid architectures are a new class of models that arrange different types of operators in striped patterns to exploit the synergies between them.

Architecture Features:

Combines three different variants of input-dependent convolutional operators and attention mechanisms
Improves training efficiency on both short and long sequences
At the 40B parameter scale, achieves up to a 1.3x training speedup at 16K context length relative to optimized Transformer baselines
Achieves up to a 3x speedup at 1 million context length

Two-Stage Training Strategy

Stage 1: Pre-training with 8,192 token context length, data weighted to focus on gene windows
Stage 2: Expands context to 1 million tokens through multi-stage mid-training

3. Training Data and Open Science

Evo 2 is trained on OpenGenome2:

Over 8.8 trillion nucleotides from bacteria, archaea, eukaryotes, and phages
7B parameter version: Trained on 2.4 trillion tokens
40B parameter version: Trained on 9.3 trillion tokens

Safety: Eukaryotic virus genomes were excluded from the training data.
Open Source: Model weights, training code, inference code, and training data are released under an open license.

4. Zero-Shot Functional Prediction

4.1 Cross-Domain Mutation Effect Prediction

In 20 prokaryotic and 16 eukaryotic species, model likelihood changes align with known biological constraints
Nonsynonymous variants, premature stop codons, and frameshift mutations cause greater likelihood changes than synonymous mutations
The 40B parameter model shows higher sensitivity than the 7B model to deletions in miRNA and snoRNA sequences
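
The scoring idea behind these observations can be sketched in a few lines: a variant is scored by how much it changes the sequence log-likelihood relative to the reference. This is a minimal sketch, not the Evo 2 API. The functions `fit_markov`, `log_likelihood`, `delta_score`, and `apply_snv` are hypothetical names, and a toy first-order Markov model stands in for the genomic language model.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def fit_markov(training_seq, pseudocount=1.0):
    """Fit a toy first-order Markov model (a stand-in for a genomic language model)."""
    pair_counts = Counter(zip(training_seq, training_seq[1:]))
    probs = {}
    for prev in ALPHABET:
        total = sum(pair_counts[(prev, nxt)] for nxt in ALPHABET)
        total += pseudocount * len(ALPHABET)
        for nxt in ALPHABET:
            probs[(prev, nxt)] = (pair_counts[(prev, nxt)] + pseudocount) / total
    return probs

def log_likelihood(seq, probs):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(probs[(a, b)]) for a, b in zip(seq, seq[1:]))

def delta_score(ref_seq, var_seq, probs):
    """Variant minus reference log-likelihood.

    A more negative value means the variant is less expected under the model.
    """
    return log_likelihood(var_seq, probs) - log_likelihood(ref_seq, probs)

def apply_snv(seq, pos, alt):
    """Return a copy of seq with the base at 0-based index pos replaced by alt."""
    return seq[:pos] + alt + seq[pos + 1:]

# Toy data: the "model" has only ever seen ATG repeats.
probs = fit_markov("ATGATGATGATGATGATGATGATG")
reference = "ATGATGATGATG"
variant = apply_snv(reference, 4, "C")  # T -> C at index 4

score = delta_score(reference, variant, probs)
print(f"delta log-likelihood: {score:.3f}")
```

With a real model, the same difference would be computed from the model's per-token log-probabilities. How to normalize scores for indels, where the two sequences differ in length, is a design choice this sketch does not address.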

4.2 Clinical Variant Effect Prediction

ClinVar: For coding non-SNV variants (indels), Evo 2 outperforms other models in zero-shot classification. For non-coding variants, Evo 2 surpasses other models in both SNV and non-SNV.

BRCA1/BRCA2: Sets a new state of the art for BRCA1 non-coding SNVs. When coding and non-coding variants are evaluated together, Evo 2 outperforms all other models.
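
Zero-shot classification of this kind is typically summarized with AUROC: the likelihood-based score is used directly as a classifier score and compared against pathogenic/benign labels. The sketch below shows that calculation in plain Python. The scores and labels are made-up illustrative numbers, not results from the paper, and the assumption that a more negative likelihood change indicates pathogenicity is stated in the code.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Labels are 1 (positive) or 0 (negative).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical likelihood changes for six variants (illustrative numbers only).
delta_scores = [-6.1, -4.8, -0.9, -0.3, -1.1, -1.4]
labels = [1, 1, 0, 0, 1, 0]  # 1 = pathogenic, 0 = benign

# Assumption: a more negative likelihood change suggests pathogenicity,
# so the sign is flipped to make "higher score" mean "more likely pathogenic".
pathogenicity = [-d for d in delta_scores]
result = auroc(pathogenicity, labels)
print(f"AUROC: {result:.3f}")
```

The pairwise loop is quadratic in the number of variants. For large benchmarks a rank-based formulation or a library routine would be the usual choice.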

5. Mechanistic Interpretability

Sparse autoencoders (SAEs) reveal features corresponding to:

Mobile genetic elements (prophage regions, CRISPR spacers)
Protein secondary structures (alpha-helices and beta-sheets)
Human transcription factor binding sites
Exon-intron architecture (the features also apply to genomes of extinct species such as the mammoth)

6. Genome-Scale Generation

6.1 Mitochondrial Genome Generation

Generated 250 unique 16 kb mitochondrial sequences with the correct numbers of CDS, tRNA, and rRNA genes.

6.2 Minimal Bacterial Genome

Using M. genitalium (~580 kb) as the model organism: nearly 70% of the genes in Evo 2 40B generations contain significant Pfam hits, a dramatic improvement over Evo 1 131k (18%).

6.3 Eukaryotic Chromosome

Using 10.5 kb of S. cerevisiae chromosome III (316 kb) as a prompt, Evo 2 generated 330 kb of DNA. The output is eukaryotic-like, with predicted tRNAs, properly positioned promoters, and genes showing intron structure.

7. Inference-Time Search: Generative Epigenomics

First example of inference-time scaling in biological language modeling. Using Enformer and Borzoi to guide Evo 2 generation with beam search:

Designed chromatin accessibility patterns encoding messages in Morse code: "LO", "ARC", "EVO2"
Predictable log-linear relationship: Increasing inference-time compute leads to better quality designs
Achieves AUROC ~0.9 with sufficient beam search width
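
The guided-generation loop described above can be sketched as a beam search in which a generator proposes sequence chunks and a scoring model ranks the partial designs. Everything in the example is a stand-in chosen for illustration: `propose_chunk` samples random bases in place of Evo 2, `predicted_signal` uses GC fraction as a fake accessibility signal in place of Enformer and Borzoi, and the chunk size, beam width, and target pattern are hypothetical.

```python
import random

CHUNK = 8  # bases appended per search step (hypothetical value)

def propose_chunk(rng):
    """Stand-in for sampling a continuation from the generative model."""
    return "".join(rng.choice("ACGT") for _ in range(CHUNK))

def predicted_signal(chunk):
    """Stand-in for a sequence-to-function predictor: GC fraction as fake accessibility."""
    return sum(base in "GC" for base in chunk) / len(chunk)

def score(seq, target):
    """Negative squared error between predicted and target signal per chunk.

    Higher is better; only the chunks generated so far are compared.
    """
    chunks = [seq[i:i + CHUNK] for i in range(0, len(seq), CHUNK)]
    return -sum((predicted_signal(c) - t) ** 2 for c, t in zip(chunks, target))

def beam_search(target, beam_width, samples_per_beam, seed=0):
    """Extend each beam with sampled chunks, then keep the best-scoring partial designs."""
    rng = random.Random(seed)
    beams = [""]
    for _ in range(len(target)):
        candidates = [
            beam + propose_chunk(rng)
            for beam in beams
            for _ in range(samples_per_beam)
        ]
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Desired open (1.0) / closed (0.0) pattern, one value per chunk.
target = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
best = beam_search(target, beam_width=4, samples_per_beam=16)
print(best, round(score(best, target), 3))
```

Raising `beam_width` or `samples_per_beam` spends more compute per design, which is the knob that the inference-time scaling result refers to. In this toy setting the effect on the final score is illustrative only.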

8. Biosafety and Risk Assessment

Virus risk mitigation: Excluded eukaryotic viruses from training data. Red teaming shows generations in this domain are essentially random.
Ancestry bias: Evo 2 generalizes well across human populations.

9. Conclusion

Evo 2 represents significant progress in biological foundation models, achieving prediction and generation tasks across molecular, systems, and genome scales, across all domains of life.

Key Achievements

Learning from 9 trillion tokens of genome sequences
Robust prediction of pathogenicity for different mutation types including indels
State-of-the-art for non-coding and splice-related variants
Genome-length sequence design at scale of human mitochondrial genome, minimal bacterial genome, or yeast chromosome
First demonstration of inference-time scaling in biological language modeling

Reference:
[1] Brixi G, Durrant MG, Ku J, et al. Genome modeling and design across all domains of life with Evo 2. bioRxiv. 2025. doi: 10.1101/2025.02.18.638918