Large language models improve annotation of prokaryotic viral proteins

Zachary N. Flamholz, Steven J. Biller, Libusha Kelly

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Viral genomes are poorly annotated in metagenomic samples, representing an obstacle to understanding viral diversity and function. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by the paucity of characterized viral proteins and divergence among viral sequences. Here we show that protein language models can capture prokaryotic viral protein function, enabling new portions of viral sequence space to be assigned biologically meaningful labels. When applied to global ocean virome data, our classifier expanded the annotated fraction of viral protein families by 29%. Among previously unannotated sequences, we highlight the identification of an integrase defining a mobile element in marine picocyanobacteria and a capsid protein that anchors globally widespread viral elements. Furthermore, improved high-level functional annotation provides a means to characterize similarities in genomic organization among diverse viral sequences. Protein language models thus enhance remote homology detection of viral proteins, serving as a useful complement to existing approaches.

Original language: English (US)
Pages (from-to): 537-549
Number of pages: 13
Journal: Nature Microbiology
Volume: 9
Issue number: 2
State: Published - Feb 2024

ASJC Scopus subject areas

  • Microbiology
  • Immunology
  • Applied Microbiology and Biotechnology
  • Genetics
  • Microbiology (medical)
  • Cell Biology
