Abstract

VirSentAI is an autonomous trimodal agent developed by the University of A Coruña and other institutions to shorten the gap between viral emergence and therapeutic response. The system parses unstructured submission records with the MedGemma large language model, uses a fine-tuned HyenaDNA model (v2-hyena-dna-16k) to process complete viral genomes of up to 160,000 bases for human infectivity prediction, and calculates affinities between viral proteins and approved drugs with the PLAPT model. On 31,728 complete viral genomes, the zoonotic prediction module achieves an AUROC of 0.95. The system has scanned 16,060 viruses, flagged 33 as high-risk (predicted zoonotic probability ≥90%), and generated 29,625 viral protein-drug interaction predictions. The platform is freely accessible, though its clinical validity and its effectiveness for real-time surveillance remain to be evaluated.

1. Background: Challenges in Zoonotic Surveillance

Zoonotic viruses capable of crossing species barriers from animal hosts to humans pose a persistent and unpredictable threat to global health. The societal disruption caused by the COVID-19 pandemic reminds us that most emerging human pathogens—including coronaviruses, filoviruses, and influenza viruses—originate from animals. Theoretically, genomic sequencing of novel animal viruses can determine their potential to infect humans; however, in practice, laboratory host range determination is resource-intensive and retrospective.

Computational models have evolved from highly specialized to generalized approaches. Early tools (e.g., HostPredictor, Flu-CNN) reported AUC or accuracy values of 0.95-0.99 on specific viral families (e.g., avian influenza) but had narrow applicability. Subsequent pan-viral systems (e.g., VIDHOP, BiLSTM-VHP) expanded the scope but often sacrificed long-sequence context. More recently, large-scale foundation models such as Evo 2 have been able to process million-scale nucleotide contexts, but they deliberately exclude human pathogen training data to prevent misuse, which limits their direct applicability to host prediction.

VirSentAI was developed to fill this gap: an agent that combines three AI models to process text, DNA and protein sequences, and drug SMILES. It uses a fine-tuned HyenaDNA model, which benefits from large-scale pretraining, to scan complete viral genomes, flag the viruses most likely to infect humans, and automatically trigger downstream therapeutic modules for drug repurposing.

2. Technical Architecture and Methods

2.1 Trimodal Agent Architecture

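VirSentAI employs a three-stage agent workflow in which each stage uses a dedicated AI model for a different data type: MedGemma for the text of submission records, a fine-tuned HyenaDNA model for viral genome sequences, and PLAPT for viral protein sequences paired with drug SMILES.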

2.2 Viral Sentinel Layer and Data Flow

The system's data flow begins with the NCBI Nucleotide API, which it scans automatically for newly released complete viral DNA sequences. Notably, the system queries the NCBI Nucleotide database directly rather than the curated RefSeq collection, prioritizing access to the most recently submitted viral sequences and avoiding the delays introduced by expert curation workflows.
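As a rough illustration of this first step, the sketch below builds a query URL for the NCBI E-utilities ESearch endpoint against the Nucleotide (`nuccore`) database. The search term, the 7-day window, and the function name `build_esearch_url` are assumptions made for the example; the actual query used by VirSentAI is not given here. The code only constructs the URL and makes no network request.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(days_back: int = 7, max_records: int = 100) -> str:
    """Build an ESearch URL for recently published complete viral genomes.

    Hypothetical helper: the search term below is an assumption, not the
    query used by the VirSentAI authors.
    """
    params = {
        "db": "nuccore",  # NCBI Nucleotide database
        "term": 'Viruses[Organism] AND "complete genome"[Title]',
        "datetype": "pdat",    # filter on publication date
        "reldate": days_back,  # only records from the last N days
        "retmax": max_records,  # cap on the number of returned IDs
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_esearch_url(days_back=7, max_records=50)
print(url)
```

Fetching this URL would return a list of record IDs, which could then be passed to the EFetch endpoint to download the sequences themselves.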

Retrieved sequences undergo MedGemma text processing before being input to the HyenaDNA model for zoonotic score calculation. For viruses scoring above 90%, the system automatically extracts viral protein sequences from NCBI and inputs them along with FDA-approved drug SMILES from ChEMBL into the PLAPT model to calculate interaction affinities. All data is stored in an SQLite database, processed through Python scripts to generate JSON/CSV summaries, and finally displayed through a web interface as tables, dynamic charts, and viral-protein-drug interaction networks.
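The threshold-gated flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `VirusRecord`, `run_pipeline`, and the toy scoring functions are hypothetical stand-ins for the HyenaDNA and PLAPT models, and the sequences and scores are invented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

ZOONOTIC_THRESHOLD = 0.90  # the 90% cut-off mentioned in the text


@dataclass
class VirusRecord:
    accession: str
    genome: str
    proteins: List[str]
    zoonotic_score: Optional[float] = None
    # Maps (protein sequence, drug name) to a predicted affinity.
    affinities: Dict[Tuple[str, str], float] = field(default_factory=dict)


def run_pipeline(
    records: List[VirusRecord],
    score_genome: Callable[[str], float],
    score_affinity: Callable[[str, str], float],
    drugs: Dict[str, str],
) -> List[VirusRecord]:
    """Score every genome; run the drug stage only for high-risk viruses."""
    for rec in records:
        rec.zoonotic_score = score_genome(rec.genome)
        if rec.zoonotic_score < ZOONOTIC_THRESHOLD:
            continue  # low-risk virus: skip the protein-drug stage
        for protein in rec.proteins:
            for drug_name, smiles in drugs.items():
                rec.affinities[(protein, drug_name)] = score_affinity(protein, smiles)
    return records


# Toy stand-ins for the real models (assumptions for illustration only).
toy_scores = {"ACGT": 0.42, "GGCC": 0.97}
toy_genome_model = lambda genome: toy_scores[genome]
toy_affinity_model = lambda protein, smiles: 8.5

drugs = {"aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O"}
records = [
    VirusRecord("TOY_A", "ACGT", ["MKT"]),
    VirusRecord("TOY_B", "GGCC", ["MAA", "MCC"]),
]
results = run_pipeline(records, toy_genome_model, toy_affinity_model, drugs)
for rec in results:
    print(rec.accession, rec.zoonotic_score, len(rec.affinities))
```

Passing the models in as callables keeps the gating logic separate from the models themselves, so the same loop works with real predictors or with test doubles.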

2.3 Model Training Details

The core prediction model virsentai-v2-hyena-dna-16k is fine-tuned from HyenaDNA-medium-160k-seqlen-hf using 31,728 complete viral genomes (from NCBI, VirusHostDB, and BV-BRC), with strictly balanced human vs. non-human host labels to mitigate classification bias. The model was trained for 15 epochs using 16-bit mixed precision and the AdamW optimizer. To fit within the memory of a single 24GB NVIDIA GPU, a batch size of 2 was combined with 8-step gradient accumulation, giving an effective batch size of 16. The entire 150-hour training process was completed under typical academic research infrastructure constraints.
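To see why 8-step accumulation with micro-batches of 2 behaves like a batch of 16, consider the toy example below. It uses a one-parameter linear model with a mean-squared-error loss (an assumption chosen for simplicity, unrelated to the actual HyenaDNA training code) and checks that averaging the 8 micro-batch gradients reproduces the gradient of the full 16-sample batch.

```python
import math

MICRO_BATCH = 2   # samples per forward/backward pass
ACCUM_STEPS = 8   # passes accumulated before one optimizer step
EFFECTIVE_BATCH = MICRO_BATCH * ACCUM_STEPS  # 2 * 8 = 16


def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data: 16 samples, enough for exactly one effective batch.
xs = [float(i) for i in range(1, EFFECTIVE_BATCH + 1)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

# Gradient computed on all 16 samples at once.
full_grad = mse_grad(w, xs, ys)

# Same gradient, accumulated over 8 micro-batches of 2 samples each.
# Dividing each contribution by ACCUM_STEPS keeps the result a mean.
accum_grad = 0.0
for step in range(ACCUM_STEPS):
    start = step * MICRO_BATCH
    stop = start + MICRO_BATCH
    accum_grad += mse_grad(w, xs[start:stop], ys[start:stop]) / ACCUM_STEPS

print(EFFECTIVE_BATCH, full_grad, accum_grad)
```

The two gradients agree, so the optimizer sees the same update it would with a true batch of 16 while only 2 sequences occupy GPU memory at a time.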

3. Model Performance and Surveillance Results

3.1 Prediction Performance

In cross-validation on 31,728 complete viral genomes, VirSentAI achieved an AUROC of 0.9496 and an overall accuracy of 0.8724. These metrics are competitive within the field and comparable to specialized models that report AUC or accuracy values of 0.95-0.99 (e.g., HostPredictor, Flu-CNN).
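AUROC and accuracy measure different things, which is why the two figures above can diverge. The sketch below computes both on a small invented set of labels and scores; the 0.5 decision threshold is an assumption, as the threshold behind the reported accuracy is not stated here. AUROC is computed as the fraction of positive/negative pairs in which the positive example receives the higher score.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Pairwise O(n^2) version, fine for a demo.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at a fixed threshold."""
    correct = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return correct / len(labels)


# Toy data: 1 = human host, 0 = non-human host (invented values).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.60, 0.30, 0.10]

toy_auroc = auroc(labels, scores)
toy_accuracy = accuracy(labels, scores)
print(round(toy_auroc, 3), round(toy_accuracy, 3))
```

In this toy case 8 of the 9 positive/negative pairs are ranked correctly, while only 4 of 6 examples fall on the right side of the threshold, so AUROC is higher than accuracy, the same direction as the gap reported for VirSentAI.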

Notably, VirSentAI's core architectural advantage lies in its ability to process complete viral genomes as single continuous sequences, capturing both local mutation features and long-range dependencies across the genome—context that is frequently lost in fragment-dependent or feature-engineering approaches.

Model         | Core Method                | Scope             | Performance (AUC/ACC)
HostPredictor | Gradient Boosting Ensemble | Avian Influenza   | AUC = 0.95
Flu-CNN       | 1D CNN                     | Influenza A       | ACC = 0.99
VIDHOP        | Deep Neural Network        | Rabies, Rotavirus | AUC = 0.93-0.98
VirSentAI     | HyenaDNA                   | Pan-viral         | AUC = 0.95, ACC = 0.87

3.2 Real-world Surveillance Results

To date, VirSentAI has scanned 16,060 viruses (including novel viruses and those with unknown hosts), of which 33 were predicted to have ≥90% zoonotic risk. Based on viruses with predicted zoonotic probability ≥80%, the system generated 29,625 viral protein-drug PLAPT interaction affinity predictions (affinity ≥8.0). The web interface displays statistics using stricter filtering criteria (zoonotic score ≥90%, PLAPT affinity ≥10.0).
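The effect of the two threshold pairs can be illustrated with a small in-memory SQLite table. The schema, column names, and rows below are hypothetical; only the thresholds (≥80% with affinity ≥8.0, and ≥90% with affinity ≥10.0) come from the text.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real VirSentAI database layout is not described here.
conn.execute(
    "CREATE TABLE predictions ("
    "virus TEXT, zoonotic_score REAL, protein TEXT, drug TEXT, affinity REAL)"
)
rows = [
    ("toy_virus_A", 0.97, "protein_1", "drug_X", 10.4),
    ("toy_virus_A", 0.97, "protein_1", "drug_Y", 8.6),
    ("toy_virus_B", 0.84, "protein_2", "drug_X", 11.2),
    ("toy_virus_C", 0.55, "protein_3", "drug_Z", 12.0),
]
conn.executemany("INSERT INTO predictions VALUES (?, ?, ?, ?, ?)", rows)


def select_hits(conn, min_score, min_affinity):
    """Return interactions passing both thresholds, strongest affinity first."""
    cur = conn.execute(
        "SELECT virus, protein, drug, affinity FROM predictions "
        "WHERE zoonotic_score >= ? AND affinity >= ? "
        "ORDER BY affinity DESC",
        (min_score, min_affinity),
    )
    columns = [c[0] for c in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


broad = select_hits(conn, 0.80, 8.0)    # thresholds behind the full prediction set
strict = select_hits(conn, 0.90, 10.0)  # stricter thresholds used by the web interface
print(len(broad), len(strict))
print(json.dumps(strict, indent=2))
```

Here the broader criteria keep three of the four toy rows, while the stricter criteria keep one, which mirrors how the web interface shows fewer interactions than the full set of predictions.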

Top 10 high-risk viruses include: Isavirus salaris (salmon virus, 98.15%), Longquan virus (bat/insectivore/rodent virus, 98.00%), Choristoneura fumiferana entomopoxvirus (97.90%), Influenza B virus (97.58%), among others. Notably, some predictions involve non-mammalian host viruses (e.g., algal viruses, insect viruses), whose biological plausibility warrants further investigation.

4. Discussion

4.1 Technical Contributions and Significance

VirSentAI represents an innovative application of multimodal agents in zoonotic surveillance. Its three-stage architecture (text-genome-chemistry) demonstrates the feasibility of integrating heterogeneous data types, filling the gap between specialized tools and generalized foundation models.

The application of HyenaDNA's long-context architecture supports the value of processing complete viral genomes (rather than fragments) for capturing host adaptation signals. The system's computational efficiency (120 million parameters, single-GPU training) makes it operable under academic institutional resource constraints, while open science practices (freely accessible code and platform) help lower the barrier to entry for global health surveillance.

4.2 Limitations and Unverified Issues

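Despite the encouraging technical architecture, this study leaves several important aspects unverified: the predictions lack wet-lab validation, the prediction thresholds have no established biological basis, and the actual effectiveness of real-time surveillance is unknown.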

4.3 Comparison with Related Work

Compared with the recently published Fleming (antibiotic design agent) and Latent-Y (biologics design agent), VirSentAI demonstrates the applicability of multimodal agent frameworks across different biomedical domains. All three combine specialized AI models with LLM coordination layers to achieve end-to-end workflows; they differ in application domain and data modalities. Compared with specialized tools like HostPredictor and Flu-CNN, VirSentAI's advantage is that it generalizes across the full viral spectrum; its disadvantages are slightly lower accuracy and the absence of experimental validation. Compared with foundation models like Evo 2, VirSentAI is specifically fine-tuned for zoonotic prediction, whereas Evo 2 excludes human pathogen training data for safety reasons.

5. Conclusion

As an autonomous multimodal zoonotic surveillance agent, VirSentAI demonstrates the feasibility of an end-to-end architecture integrating text, genomic, and chemical data. Its AUROC of 0.95 and its ability to process long contexts of up to 160k bases indicate the potential of the HyenaDNA architecture for viral genome analysis.

However, the lack of wet-lab validation, the insufficient biological basis for the prediction thresholds, and the unknown effectiveness of real-time surveillance make it difficult to judge its public health value at present. Future research should prioritize:

  1. Laboratory host range validation of predicted viruses
  2. Integration testing with public health agency early warning systems
  3. Retrospective validation of historical outbreaks to assess actual early warning timeliness

Only after completing these validations can VirSentAI transform from a research prototype into an actionable pandemic preparedness tool.

References

  • Munteanu CR, Vázquez-Naya JM, Tejera E. Viral Sentry AI (VirSentAI) - Automated Zoonotic Surveillance & Drug Repurposing Agent. bioRxiv. 2025. DOI: 10.64898/2025.12.29.684576
  • Code Repository: https://github.com/muntisa/virsentai
  • Platform: https://muntisa.github.io/virsentai
  • Related Models: HyenaDNA, MedGemma, PLAPT, Evo 2
  • Comparison Tools: HostPredictor, VIDHOP, BiLSTM-VHP, Flu-CNN
  • Related Agents: Fleming (Harvard) - Antibiotic Design, Latent-Y (Latent Labs) - Biologics Design