Original: Passaro et al., bioRxiv 2025
DOI: 10.1101/2025.06.14.659707

Abstract

Boltz-2 is an open-source biomolecular structure prediction model developed by MIT, Valence Labs, and ETH Zurich that improves substantially on Boltz-1. In addition to structure prediction, it is the first AI model whose binding affinity predictions approach the accuracy of Free Energy Perturbation (FEP) methods, while running more than 1000× faster than FEP.

1. Research Background

1.1 Separation of Structure and Affinity Prediction

In recent years, models such as AlphaFold 3 and Boltz-1 have significantly improved the accuracy of biomolecular complex structure prediction. However, these models still show clear deficiencies in predicting binding affinity (a key property measuring molecular binding strength). The separation of structure prediction and affinity prediction capabilities limits the practical application value of these models in drug discovery.

1.2 FEP Accuracy vs Computational Cost Trade-off

Free Energy Perturbation (FEP) is currently the most accurate affinity calculation technique, but its computational cost is extremely high, making it impractical for large-scale screening. Molecular docking methods are faster but lack sufficient accuracy to provide reliable signals. This long-standing trade-off between accuracy and computational cost constrains the efficiency of computational drug discovery.

1.3 Limitations of Existing AI Methods

Existing AI-based affinity prediction models have not yet reached the accuracy of FEP methods or laboratory assays. The main challenges are experimental variation and noise in publicly available binding data, the difficulty of selecting training signals, and the representation learning gap between structure prediction and affinity prediction.

2. Data Pipeline Innovations

2.1 Structure Data Expansion

Unlike Boltz-1, which is trained only on single structures, Boltz-2 also uses ensemble data from experimental techniques (NMR) and computational methods (molecular dynamics, MD). The experimental data consists of PDB structures released before June 1, 2023. The MD data comes from three large-scale open projects: MISATO, ATLAS, and mdCATH.

2.2 Affinity Data Curation

Affinity data comes from public databases such as ChEMBL, PubChem, and BindingDB. Data curation strategy focuses on four aspects: retaining only high-quality assays, reducing data bias through synthetic decoy data, reducing overfitting, and ensuring structure quality through confidence score filtering.

2.3 Hybrid Supervision Strategy

To support both hit discovery and lead optimization, the model is trained on a hybrid dataset that contains binary classification labels (binder or non-binder) and continuous affinity values. Continuous measurements (Ki, Kd, IC50, etc.) are first expressed in µM and then converted to a log10 scale, so lower values indicate tighter binding.
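The unit conversion described above can be sketched as follows. This is a minimal illustration, not the paper's preprocessing code: the function name `log10_affinity_um` and the unit labels in `TO_MICROMOLAR` are hypothetical, and real database exports may use other conventions.

```python
import math

# Multipliers that convert a concentration in the given unit to micromolar.
# The unit labels are illustrative assumptions, not a database schema.
TO_MICROMOLAR = {"M": 1e6, "mM": 1e3, "uM": 1.0, "nM": 1e-3, "pM": 1e-6}


def log10_affinity_um(value: float, unit: str) -> float:
    """Return log10 of an affinity measurement expressed in micromolar."""
    if value <= 0:
        raise ValueError("affinity values must be positive")
    return math.log10(value * TO_MICROMOLAR[unit])


if __name__ == "__main__":
    # A 10 nM measurement is 0.01 uM, so its label is close to -2.
    print(log10_affinity_um(10.0, "nM"))
```

On this scale a one-unit difference corresponds to a tenfold change in potency, which keeps measurements that span several orders of magnitude in a comparable range for regression.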

2.4 Synthetic Decoy Generation

To expand the negative sample pool and improve chemical space coverage, the model generates synthetic decoys by randomly shuffling binders identified in hit-to-lead screening across different targets. The final dataset contains approximately 1.4 million binders and over 3 million decoys, covering approximately 3,000 unique protein clusters.
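A minimal sketch of the shuffling idea, assuming that a binder of one target can be treated as a presumed non-binder of a randomly chosen different target. The function `make_decoys` and its input layout are hypothetical; any additional filtering the authors apply to the decoy pool is not shown.

```python
import random


def make_decoys(binders_by_target, seed=0):
    """Pair each known binder with a random target it is not known to bind.

    binders_by_target maps a target name to a set of ligand identifiers.
    Returns a list of (target, ligand) pairs to be labeled as decoys.
    """
    rng = random.Random(seed)
    targets = sorted(binders_by_target)
    decoys = []
    for source in targets:
        for ligand in sorted(binders_by_target[source]):
            # Exclude every target for which this ligand is a known binder,
            # which also excludes the source target itself.
            candidates = [t for t in targets if ligand not in binders_by_target[t]]
            if candidates:
                decoys.append((rng.choice(candidates), ligand))
    return decoys


if __name__ == "__main__":
    known = {"T1": {"a", "b"}, "T2": {"b", "c"}, "T3": {"d"}}
    for target, ligand in make_decoys(known):
        print(target, ligand)
```

The assumption that a shuffled pair is inactive is a heuristic: some decoys will be false negatives, which is one reason to treat them as a noisy supervision signal.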

3. Model Architecture Improvements

3.1 Architecture Components

The Boltz-2 architecture contains four main components: the trunk, the denoising module, the confidence module, and the affinity module.

Training crop size is expanded to 768 tokens, comparable to AlphaFold 3.

3.2 Controllability Enhancements

The model introduces three key control functions: method conditioning, template conditioning, and contact and pocket conditioning.

3.3 Affinity Module

The affinity module consists of a PairFormer and two prediction heads: one predicting binding likelihood and another regressing continuous affinity values. The module operates on Boltz-2's structure predictions, leveraging pair representations and predicted coordinates refined by a PairFormer model specifically attending to protein-ligand and intra-ligand interactions.
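One way to picture "attending to protein-ligand and intra-ligand interactions" is a mask over token pairs that drops protein-protein pairs. The sketch below is only an illustration of that idea; how Boltz-2 actually restricts the pair representation is not described here, and `affinity_pair_mask` is a hypothetical helper.

```python
def affinity_pair_mask(is_ligand):
    """Boolean mask over token pairs, True where at least one token is ligand.

    This keeps protein-ligand and intra-ligand pairs and drops
    protein-protein pairs.
    """
    n = len(is_ligand)
    return [[is_ligand[i] or is_ligand[j] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Two protein tokens followed by two ligand tokens.
    for row in affinity_pair_mask([False, False, True, True]):
        print(row)
```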

3.4 Physical Quality Constraints

Boltz-2 integrates Boltz-steering (introduced as part of the Boltz-1x release), a method that applies physics-based potentials at inference time to improve physical plausibility without sacrificing accuracy. The version that integrates this method is called Boltz-2x.
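To illustrate what applying a potential at inference time can mean, here is a toy sketch that pushes clashing atoms apart with one gradient step on a simple penalty. The quadratic clash penalty, the step size, and the function names are assumptions chosen for illustration; the actual potentials used by Boltz-steering and the way they are combined with diffusion sampling are not described in this summary.

```python
import math


def clash_energy(coords, min_dist):
    """Sum of squared violations for atom pairs closer than min_dist."""
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            energy += max(0.0, min_dist - d) ** 2
    return energy


def steer_step(coords, min_dist, step_size=0.25):
    """One gradient-descent step on the clash energy; returns new coordinates."""
    grads = [[0.0, 0.0, 0.0] for _ in coords]
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if 0.0 < d < min_dist:
                # dE/dx_i = -2 * (min_dist - d) * (x_i - x_j) / d
                scale = -2.0 * (min_dist - d) / d
                for k in range(3):
                    g = scale * (coords[i][k] - coords[j][k])
                    grads[i][k] += g
                    grads[j][k] -= g
    return [
        tuple(x - step_size * g for x, g in zip(atom, grad))
        for atom, grad in zip(coords, grads)
    ]


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(clash_energy(atoms, min_dist=2.0))
    print(clash_energy(steer_step(atoms, min_dist=2.0), min_dist=2.0))
```

Because the penalty is zero once the distance constraint is met, a step like this leaves already plausible geometry untouched, which matches the stated goal of improving plausibility without degrading accuracy.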

4. Training Strategy

4.1 Three-Stage Training

Model training is divided into three stages: structure training, confidence training, and affinity training. Affinity training takes place after structure and confidence training, with gradients detached from the trunk so that the affinity loss does not alter the learned structure representations.

4.2 Affinity Training Details

The affinity training pipeline has several key components: pre-computation and cropping of binding pockets, preprocessing of trunk representations, custom sampling strategies, and batch construction that focuses on local chemical variations. Supervision is applied jointly to the binary and continuous affinity tasks.
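A minimal sketch of joint supervision over a binary head and a regression head. The specific choices here are assumptions for illustration, since this summary does not name the losses: a focal loss for the binder-versus-decoy label, a Huber loss for the continuous value, and equal default weights. All function names are hypothetical.

```python
import math


def focal_loss(prob_binder, label, gamma=2.0):
    """Binary focal loss for one example; label is 1 (binder) or 0 (decoy)."""
    p_true = prob_binder if label == 1 else 1.0 - prob_binder
    p_true = min(max(p_true, 1e-12), 1.0)  # avoid log(0)
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


def huber_loss(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def joint_affinity_loss(prob_binder, label, pred_value=None, target_value=None,
                        weight_binary=1.0, weight_value=1.0):
    """Combine both terms; the regression term is skipped when no value exists."""
    loss = weight_binary * focal_loss(prob_binder, label)
    if pred_value is not None and target_value is not None:
        loss += weight_value * huber_loss(pred_value, target_value)
    return loss


if __name__ == "__main__":
    # A binder with a measured log10 affinity, and a decoy with a label only.
    print(joint_affinity_loss(0.8, 1, pred_value=-1.5, target_value=-2.0))
    print(joint_affinity_loss(0.3, 0))
```

Skipping the regression term for examples without a measured value is what lets screening-style data (labels only) and optimization-style data (continuous values) share one training loop.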

4.3 Coupling with Generative Models

Boltz-2 serves as the scoring function for training a molecular generator (SynFlowNet) to produce small molecules with high predicted binding scores. The generative agent is trained with a GFlowNet loss function, which lets it sample from arbitrary and multimodal score distributions.
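The text does not say which GFlowNet objective is used. As one common example, the trajectory balance loss can be sketched as below, where the reward is assumed to be a positive function of the Boltz-2 score. The function name and argument layout are hypothetical.

```python
import math


def trajectory_balance_loss(log_z, log_pf_steps, log_pb_steps, reward):
    """Squared log-ratio between forward flow and reward-weighted backward flow.

    log_z        : learned estimate of the log partition function
    log_pf_steps : log forward-policy probabilities along one trajectory
    log_pb_steps : log backward-policy probabilities along the same trajectory
    reward       : positive reward of the final molecule
    """
    forward = log_z + sum(log_pf_steps)
    backward = math.log(reward) + sum(log_pb_steps)
    return (forward - backward) ** 2


if __name__ == "__main__":
    # A two-step construction trajectory with a hypothetical reward of 2.0.
    loss = trajectory_balance_loss(
        log_z=math.log(4.0),
        log_pf_steps=[math.log(0.5), math.log(0.5)],
        log_pb_steps=[0.0, 0.0],
        reward=2.0,
    )
    print(loss)
```

When this loss is zero for all trajectories, molecules are sampled with probability proportional to their reward, so the generator covers several high-scoring modes rather than collapsing to one.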

5. Performance Evaluation and Limitations

5.1 Structure Prediction Performance

On the evaluation set of PDB structures submitted in 2024-2025, Boltz-2 shows comparable performance or moderate improvements over Boltz-1 across modalities. The largest improvements are on RNA chains and DNA-protein complexes. Compared to other commercially available models such as Chai-1 and Protenix, Boltz-2 performs competitively but still lags slightly behind AlphaFold 3.

5.2 Dynamic Performance

On held-out clusters from mdCATH and ATLAS datasets, MD conditioning has a noticeable effect on predicted ensembles, resulting in more diverse structures that better capture the conformational diversity of simulations. Boltz-2 performs comparably to recent specialized models in predicting key dynamic properties such as RMSF.
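For reference, RMSF (root mean square fluctuation) measures how far each atom deviates from its mean position across an ensemble. A minimal sketch, assuming the frames have already been aligned to a common reference and using a hypothetical nested-list layout:

```python
import math


def rmsf(ensemble):
    """Per-atom RMSF for an ensemble shaped [frames][atoms][3].

    Assumes the frames are already aligned to a common reference.
    """
    n_frames = len(ensemble)
    n_atoms = len(ensemble[0])
    result = []
    for a in range(n_atoms):
        # Mean position of atom a over all frames.
        mean = [sum(frame[a][k] for frame in ensemble) / n_frames for k in range(3)]
        # Mean squared deviation from that mean position.
        msd = sum(
            sum((frame[a][k] - mean[k]) ** 2 for k in range(3)) for frame in ensemble
        ) / n_frames
        result.append(math.sqrt(msd))
    return result


if __name__ == "__main__":
    frames = [
        [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
        [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    ]
    print(rmsf(frames))
```

Comparing the per-residue RMSF profile of a predicted ensemble with that of an MD trajectory is one way to check whether the predicted flexibility appears in the right regions.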

5.3 Affinity Prediction Performance

On the FEP+ benchmark (4-target subset: CDK2, TYK2, JNK1, P38), Boltz-2 significantly outperforms deep learning baselines, approaching the accuracy of FEP-based methods while achieving over 1000× speedup. In retrospective evaluation of the CASP16 affinity track, Boltz-2 outperforms all submitted entries out-of-the-box.
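The summary does not state which metric underlies this comparison. Affinity benchmarks of this kind commonly report the Pearson correlation between predicted and experimental values, computed per target and then averaged; the sketch below assumes that setup, and the data in the usage example is made up purely to exercise the code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_per_target_r(results):
    """Average Pearson r over targets; results maps target -> (preds, expts)."""
    scores = [pearson_r(preds, expts) for preds, expts in results.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up numbers for two hypothetical targets, not benchmark data.
    toy = {
        "target_a": ([-1.0, -2.0, -3.0], [-1.2, -1.9, -3.1]),
        "target_b": ([0.5, 0.0, -0.5], [0.4, 0.1, -0.7]),
    }
    print(mean_per_target_r(toy))
```

Averaging per target matters because pooling ligands across targets can inflate the correlation when targets differ in their typical affinity range.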

5.4 Prospective Case Study

In prospective screening on the TYK2 target, the workflow coupling Boltz-2 with the generative model (SynFlowNet) successfully generated diverse, synthesizable high-affinity binders, validated by Absolute Binding Free Energy (ABFE) simulations.

5.5 Limitations and Information Gaps

The paper does not fully disclose the following technical details: specific numbers for model parameter scale, training computational resources, inference speed benchmarks directly compared to AlphaFold 3, and evaluation of affinity prediction generalization across different chemical series. The gap with AlphaFold 3 in structure prediction (especially in antibody-antigen prediction) indicates that open models still lag behind proprietary models in certain tasks.

6. Conclusion

Boltz-2 represents significant progress in open-source biomolecular modeling: for the first time, a single framework combines affinity prediction accuracy approaching that of FEP with structure prediction capability close to that of AlphaFold 3. Because its affinity predictions run more than 1000× faster than FEP, it is a practical tool for large-scale virtual screening and lead optimization.

Future Development Directions

  • Further narrow the structure prediction gap with proprietary models
  • Expand chemical space coverage for affinity prediction
  • Deepen integration with generative models for end-to-end molecular design
  • Establish more standardized affinity prediction benchmarks to facilitate community collaboration

References:
[1] Passaro S, Corso G, Wohlwend J, et al. Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. bioRxiv 2025. https://doi.org/10.1101/2025.06.14.659707
