Original: Boitreaud et al., bioRxiv 2024
DOI: 10.1101/2024.10.10.615955
Abstract
Chai-1 is a multimodal foundation model for molecular structure prediction that achieves state-of-the-art performance across multiple tasks, including protein-ligand interaction prediction and protein multimer prediction. Two features distinguish the model: it can be prompted with experimental constraints (such as crosslinking mass spectrometry and epitope mapping data), which significantly improves prediction accuracy, and it can run in single-sequence mode, maintaining high performance without a multiple sequence alignment (MSA). Model weights and inference code are released for non-commercial use, and a web interface is free to use, including for commercial applications.
1. Background: A New Phase in Structure Prediction
In 2024, molecular structure prediction entered a new phase of multimodal fusion. The release of AlphaFold3 showed that a single framework can handle many types of biomolecules, and Chai-1 goes a step further by exploring how experimental data can be integrated into a computational model.
The traditional paradigm for protein structure prediction relies on multiple sequence alignment (MSA) to capture co-evolutionary information. Building an informative MSA requires homologous sequences, which are scarce for certain proteins (such as antibody variable regions). Experimental techniques such as crosslinking mass spectrometry and epitope mapping can supply spatial constraints, but how to integrate this information effectively into prediction models has remained an open question.
The design goals of Chai-1 address these challenges: single-sequence prediction capability, experimental constraint integration, and multi-task unification.
2. Technical Architecture and Innovations
2.1 Base Architecture
Chai-1's neural network architecture is primarily based on AlphaFold3's design and employs pair-bias self-attention. The key difference is that a single model is used for all evaluation tasks; its training data cutoff is 2021-01-12.
2.2 Language Model Embeddings
Chai-1 introduces protein language model embeddings as additional input tracks, using a 3-billion-parameter language model to generate residue-level embeddings. This design enables Chai-1 to maintain high accuracy even in single-sequence mode.
2.3 Constraint Features
Chai-1 supports multiple experimental constraint features:
- Pocket Constraints: Specify a distance threshold between a particular residue and a chain, simulating information from epitope mapping experiments
- Contact Constraints: Specify a distance threshold between two residues, simulating information from crosslinking mass spectrometry experiments
- Docking Constraints: Encode inter-chain distances using four distance intervals, complementary to template information
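The three constraint types above can be illustrated with a minimal Python sketch. It is not taken from the Chai-1 code base: the class names, the helper functions, the use of one coordinate per residue, and in particular the bin edges in `docking_one_hot` are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass(frozen=True)
class ContactConstraint:
    """Two residues expected to lie within max_dist angstroms (hypothetical type)."""
    res_a: int
    res_b: int
    max_dist: float


@dataclass(frozen=True)
class PocketConstraint:
    """One residue expected to lie within max_dist angstroms of some residue
    of a partner chain (hypothetical type)."""
    residue: int
    chain_residues: Tuple[int, ...]
    max_dist: float


def contact_satisfied(coords: np.ndarray, c: ContactConstraint) -> bool:
    """coords has shape (n_residues, 3), one representative atom per residue."""
    dist = np.linalg.norm(coords[c.res_a] - coords[c.res_b])
    return bool(dist <= c.max_dist)


def pocket_satisfied(coords: np.ndarray, c: PocketConstraint) -> bool:
    """True if the residue is close enough to at least one residue of the chain."""
    chain = coords[list(c.chain_residues)]
    dists = np.linalg.norm(chain - coords[c.residue], axis=1)
    return bool(dists.min() <= c.max_dist)


def docking_one_hot(dist: np.ndarray, edges=(4.0, 8.0, 16.0)) -> np.ndarray:
    """One-hot encode distances into len(edges) + 1 intervals.

    The edges are assumed values; the text only states that four
    distance intervals are used.
    """
    bins = np.digitize(dist, edges)        # integer bin index per entry
    return np.eye(len(edges) + 1)[bins]    # shape: dist.shape + (n_bins,)
```

A contact constraint then corresponds to a single crosslink-style restraint, a pocket constraint to an epitope-style restraint that does not name the partner residue, and the one-hot tensor to a coarse description of how two chains sit relative to each other.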
3. Performance Evaluation
3.1 Protein-Ligand Prediction
On the PoseBusters benchmark, Chai-1 achieves a 77% success rate (ligand RMSD < 2Å), comparable to AlphaFold3's 76%. When prompted with the apo structure, the success rate improves to 81%.
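The success criterion can be made concrete with a short Python sketch. It assumes that the predicted and reference ligand coordinates are already in the same frame and share the same atom order; a real benchmark evaluation would also need to handle alignment and ligand symmetry, which this sketch leaves out. The function names are illustrative.

```python
import numpy as np


def ligand_rmsd(pred, ref) -> float:
    """Root-mean-square deviation between two (n_atoms, 3) coordinate sets.

    Assumes matching atom order and a shared coordinate frame.
    """
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    if pred.shape != ref.shape:
        raise ValueError("pred and ref must have the same shape")
    sq_dist = np.sum((pred - ref) ** 2, axis=1)  # squared deviation per atom
    return float(np.sqrt(sq_dist.mean()))


def success_rate(rmsds, threshold: float = 2.0) -> float:
    """Fraction of complexes whose ligand RMSD is below the threshold (angstroms)."""
    rmsds = np.asarray(rmsds, dtype=float)
    return float(np.mean(rmsds < threshold))
```

Under this definition, a reported 77% means that 77 out of every 100 benchmark complexes have a ligand RMSD below 2Å.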
3.2 Protein Multimer Prediction
On the low-homology protein-protein interface evaluation set (n=929 interface clusters):
- Chai-1 (full mode): 75.1% success rate (DockQ > 0.23)
- Chai-1 (single-sequence mode): 69.8%, comparable to AF-Multimer 2.3 with MSA (67.7%)
Statistical testing shows Chai-1 significantly outperforms AF-Multimer 2.3 (p = 6.24 × 10^-10).
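The DockQ-based success rate used in this evaluation can be sketched in a few lines of Python. The 0.23 and 0.8 thresholds appear in this article; the 0.49 boundary for "medium" quality follows the usual DockQ convention and is an assumption here, as are the function names. How scores are aggregated within an interface cluster is not covered.

```python
from typing import Iterable

ACCEPTABLE = 0.23  # success threshold quoted in this article
MEDIUM = 0.49      # assumed: conventional DockQ boundary, not stated in the article
HIGH = 0.80        # high-quality threshold quoted in this article


def dockq_label(score: float) -> str:
    """Map a DockQ score in [0, 1] to a quality class."""
    if score > HIGH:
        return "high"
    if score > MEDIUM:
        return "medium"
    if score > ACCEPTABLE:
        return "acceptable"
    return "incorrect"


def dockq_success_rate(scores: Iterable[float], threshold: float = ACCEPTABLE) -> float:
    """Fraction of interfaces whose DockQ score exceeds the threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s > threshold for s in scores) / len(scores)
```

Calling `dockq_success_rate(scores, threshold=HIGH)` gives the stricter high-quality rate instead of the default acceptable-or-better rate.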
3.3 Antibody-Protein Interface Prediction
On the antibody-protein interface subset, Chai-1's single-sequence mode performs comparably to its full mode and even outperforms AF-Multimer 2.3 with MSA. This result is notable: antibody variable regions have high sequence diversity, so an MSA carries limited information for them, which gives single-sequence methods a natural advantage in such tasks.
3.4 Effects of Constraint Prompting
In antibody-antigen complex prediction, adding experimental constraints markedly raises the fraction of predictions that are acceptable by DockQ:
- No constraints (baseline): 35%
- Single distance constraint (≤15Å): 57%
- Four epitope residues: Significant improvement
4. Open Source and Availability
Chai-1 adopts a tiered openness strategy: model weights and inference code are released as a Python package for non-commercial use, while the web interface is free to use, including for commercial drug discovery. This strategy balances promoting academic research with supporting commercial applications.
5. Limitations and Discussion
5.1 Known Limitations
- Inter-chain orientation prediction: The model sometimes predicts individual chains correctly but misplaces their relative orientation
- Modified residue sensitivity: Predictions are highly sensitive to modified residues; removing or substituting them may significantly change the predicted structure
- Antibody-antigen prediction remains challenging: High-quality predictions (DockQ > 0.8) still account for only a small proportion (4-8%)
5.2 Comparison with AlphaFold3
Chai-1 and AlphaFold3 perform comparably on benchmark tests, but Chai-1's constraint prompting functionality and single-sequence capabilities provide differentiated advantages in specific application scenarios.
6. Conclusion
Chai-1 is an important step toward multimodal fusion in molecular structure prediction. By integrating protein language model embeddings and experimental constraint features, the model adds single-sequence prediction and experimental data integration while maintaining performance comparable to AlphaFold3. For drug discovery, Chai-1's open release and free web interface, which may also be used commercially, lower the barrier to entry, and its antibody-protein interface prediction capability has direct value for antibody drug development.
References
Boitreaud, J., et al. (2024). Chai-1: Decoding the molecular interactions of life. bioRxiv. https://doi.org/10.1101/2024.10.10.615955
Code: https://github.com/chaidiscovery/chai-lab/
Web Interface: https://lab.chaidiscovery.com/