Original: Boitreaud et al., bioRxiv 2024
DOI: 10.1101/2024.10.10.615955

Abstract

Chai-1 is a multimodal foundation model for molecular structure prediction that achieves state-of-the-art performance across multiple tasks, including protein-ligand interaction prediction and protein multimer prediction. It has two distinctive features. First, it can be prompted with experimental constraints (such as crosslinking mass spectrometry and epitope mapping data), which significantly improves prediction accuracy. Second, it can run in single-sequence mode, maintaining high performance without a multiple sequence alignment (MSA). Model weights and inference code are released for non-commercial use, and a free web interface is available for commercial applications.

1. Background: A New Phase in Structure Prediction

In 2024, the field of molecular structure prediction entered a new phase of multimodal fusion. The release of AlphaFold3 demonstrated the possibility of a unified framework for handling various biomolecular types, while Chai-1 further explored the integration pathway between experimental data and computational models.

The traditional paradigm for protein structure prediction relies on multiple sequence alignment (MSA) to capture co-evolutionary information. However, building an MSA requires homologous sequences, which can be scarce for certain proteins (such as antibody variable regions). In addition, experimental techniques such as crosslinking mass spectrometry and epitope mapping provide spatial constraint information, but how to integrate that information effectively into prediction models has remained an open question.

The design goals of Chai-1 address these challenges: single-sequence prediction capability, experimental constraint integration, and multi-task unification.

2. Technical Architecture and Innovations

2.1 Base Architecture

Chai-1's neural network architecture is primarily based on AlphaFold3's design and employs pair-bias self-attention. The key difference is that a single model is used for all evaluation tasks; its training data has a cutoff date of 2021-01-12.
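
To make "pair-bias self-attention" concrete, here is a minimal single-head sketch in plain Python. It assumes the common formulation in which a per-pair scalar bias is added to the scaled dot-product logits before the softmax. The function name, the list-of-lists inputs, and the omission of learned projections, gating, and multiple heads are all simplifications for illustration, not Chai-1's actual implementation.

```python
import math


def pair_bias_attention(q, k, v, pair_bias):
    """Single-head attention whose logits are shifted by a pairwise bias.

    q, k, v   : lists of n vectors (lists of floats); q and k share dimension d.
    pair_bias : n x n matrix; pair_bias[i][j] is added to the logit for
                position i attending to position j.
    Hypothetical helper for illustration only.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # Scaled dot-product logit plus the bias for the (i, j) pair.
        logits = [
            sum(q[i][c] * k[j][c] for c in range(d)) / math.sqrt(d) + pair_bias[i][j]
            for j in range(n)
        ]
        # Numerically stable softmax over j.
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the value vectors.
        out.append(
            [sum(weights[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))]
        )
    return out
```

With all-zero queries, keys, and biases, every position simply averages the values; raising `pair_bias[i][j]` pushes position `i` to attend to position `j`, which is how pairwise information can steer the sequence-level update.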

2.2 Language Model Embeddings

Chai-1 introduces protein language model embeddings as additional input tracks, using a 3-billion-parameter language model to generate residue-level embeddings. This design enables Chai-1 to maintain high accuracy even in single-sequence mode.

2.3 Constraint Features

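Chai-1 supports multiple experimental constraint features, allowing restraints derived from data such as crosslinking mass spectrometry and epitope mapping to be supplied as prompts at prediction time.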
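
As a rough illustration of how such restraints might be turned into model inputs, the sketch below encodes crosslinks as a symmetric pairwise matrix of distance upper bounds and epitope residues as a per-residue mask. The function name, the 0-based indexing, and the feature layout are hypothetical choices made for this example; they are not Chai-1's actual feature definitions.

```python
def encode_constraints(n_residues, crosslinks, epitope_residues):
    """Turn experimental restraints into simple input features (illustrative only).

    crosslinks       : iterable of (i, j, max_distance_angstrom) with 0-based indices.
    epitope_residues : iterable of 0-based residue indices known to be in the epitope.
    Returns (pair_feature, residue_mask); entries are 0.0 where nothing is known.
    """
    pair = [[0.0] * n_residues for _ in range(n_residues)]
    for i, j, max_dist in crosslinks:
        if not (0 <= i < n_residues and 0 <= j < n_residues):
            raise ValueError(f"crosslink ({i}, {j}) is out of range")
        # A crosslink is undirected, so the feature is stored symmetrically.
        pair[i][j] = pair[j][i] = float(max_dist)

    mask = [0.0] * n_residues
    for r in epitope_residues:
        if not 0 <= r < n_residues:
            raise ValueError(f"epitope residue {r} is out of range")
        mask[r] = 1.0
    return pair, mask
```

For example, `encode_constraints(4, [(0, 3, 25.0)], [1, 2])` marks residues 0 and 3 as lying within 25 Å of each other and flags residues 1 and 2 as epitope positions.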

3. Performance Evaluation

3.1 Protein-Ligand Prediction

On the PoseBusters benchmark, Chai-1 achieves a 77% success rate (ligand RMSD < 2 Å), comparable to AlphaFold3's 76%. When prompted with the apo structure, the success rate improves to 81%.
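
The success-rate metric itself is simple to compute. The sketch below is a simplified version: it assumes predicted and reference ligand atoms are already matched one-to-one and expressed in the same frame, and it skips the alignment and symmetry handling that a full benchmark pipeline would apply. The function names are illustrative.

```python
import math


def ligand_rmsd(pred, ref):
    """RMSD in angstroms between matched atom coordinates (no realignment)."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and the same length")
    squared = sum(
        (p[0] - r[0]) ** 2 + (p[1] - r[1]) ** 2 + (p[2] - r[2]) ** 2
        for p, r in zip(pred, ref)
    )
    return math.sqrt(squared / len(pred))


def success_rate(rmsds, threshold=2.0):
    """Fraction of predictions whose ligand RMSD is below the threshold."""
    if not rmsds:
        raise ValueError("rmsds must be non-empty")
    return sum(r < threshold for r in rmsds) / len(rmsds)
```

A reported 77% therefore means that 77% of benchmark complexes have a predicted ligand pose within 2 Å RMSD of the experimental pose.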

3.2 Protein Multimer Prediction

On the low-homology protein-protein interface evaluation set (n=929 interface clusters), statistical testing shows that Chai-1 significantly outperforms AF-Multimer 2.3 (p = 6.24 × 10^-10).
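
This summary does not say which statistical test produced the p-value. For intuition about how a paired comparison over interface clusters can yield one, here is a two-sided exact sign test, chosen only because it is easy to follow; it is not necessarily the test used in the paper, and the function name and inputs are illustrative.

```python
from math import comb


def paired_sign_test(scores_a, scores_b):
    """Two-sided exact sign test on paired scores; tied pairs are dropped."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # every pair tied: no evidence either way
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

The more consistently one method wins across clusters, the smaller the p-value; with hundreds of clusters, even a moderate but consistent advantage produces a very small p-value.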

3.3 Antibody-Protein Interface Prediction

On the antibody-protein interface subset, Chai-1's single-sequence mode performs comparably to its full mode, even outperforming AF-Multimer 2.3 with MSA. This finding is significant: antibody variable regions have high sequence diversity with limited MSA information, giving single-sequence methods a natural advantage in such tasks.

3.4 Effects of Constraint Prompting

In antibody-antigen complex prediction, prompting the model with experimental constraints has a significant effect on prediction accuracy.

4. Open Source and Availability

Chai-1 adopts a tiered openness strategy: model weights and inference code are released as a Python package for non-commercial use, while the web interface is free to use, including for commercial drug discovery. This strategy balances promoting academic research with supporting commercial applications.

5. Limitations and Discussion

5.1 Known Limitations

5.2 Comparison with AlphaFold3

Chai-1 and AlphaFold3 perform comparably on benchmark tests, but Chai-1's constraint prompting functionality and single-sequence capabilities provide differentiated advantages in specific application scenarios.

6. Conclusion

Chai-1 is an important step toward multimodal fusion in molecular structure prediction. By integrating protein language model embeddings and experimental constraint features, the model adds single-sequence prediction and experimental data integration while maintaining performance comparable to AlphaFold3. For drug discovery, Chai-1's open release and free commercial web interface lower the barrier to entry, and its antibody-protein interface prediction capability is directly applicable to antibody drug development.

References

Boitreaud, J., et al. (2024). Chai-1: Decoding the molecular interactions of life. bioRxiv. https://doi.org/10.1101/2024.10.10.615955

Code: https://github.com/chaidiscovery/chai-lab/
Web Interface: https://lab.chaidiscovery.com/
