Automating Viral Replication Module Detection: A Pipeline for Environmental Phage Metagenomics

Researcher(s)

  • Nolan Vasil, Computer Science, University of Delaware

Faculty Mentor(s)

  • Shawn Polson, Center for Bioinformatics and Computational Biology, University of Delaware

Abstract

The Viral Ecology and Informatics Lab (VEIL) at the University of Delaware investigates viral communities by identifying viral populations through the replication proteins encoded in their genomes. These replication proteins are critical to the biology of phages (viruses that infect microbes). They influence phenotypic characteristics of infection such as replication speed, burst size, and infection strategy (virulent vs. temperate), which ultimately affect microbial host communities and nutrient cycling. VEIL uses in vitro (e.g., enzyme biochemistry assays), in vivo (e.g., phage mutagenesis and infection assays), and in silico (e.g., bioinformatics, phylogenetics) approaches to characterize phages by their replication proteins, including DNA polymerase A, ribonucleotide reductase, and helicase. Replication proteins and replication modules (collections of proteins co-occurring in a genome) are important for making phenotypic predictions from sequence data. Currently, VEIL relies on a complex combination of bioinformatics tools and analyses, which makes it difficult to standardize analyses among researchers and to make this approach accessible to the broader scientific community. This research project addresses the problem by developing a standardized, automated pipeline in Nextflow, a workflow management system designed for reproducible and scalable execution across computing environments, including cloud-based platforms and HPC clusters such as the Biomix cluster at UD. The pipeline integrates Bash and Python scripts to automate three key processes: identification of viral replication proteins, annotation of genofeature metadata (e.g., labels describing biochemistry, active sites, motifs, and enzyme families), and detection of replication modules on contigs representing viral populations. The pipeline also outputs statistical summaries and visualizations. By automating these steps, it improves efficiency, reduces manual intervention, and supports more scalable analysis of viral metagenomes.
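As an illustration of the module-detection step described above, the following minimal Python sketch groups protein hits by contig and reports contigs that carry more than one replication protein family. It is not the VEIL pipeline's actual code: the function name detect_modules, the family labels ("PolA", "RNR", "Helicase"), the min_families threshold, and the example hits are all hypothetical, chosen only to show the idea of co-occurrence on a contig.

```python
from collections import defaultdict

# Hypothetical labels for the three replication proteins named in the abstract.
REPLICATION_FAMILIES = {"PolA", "RNR", "Helicase"}


def detect_modules(hits, min_families=2):
    """Return contigs carrying at least `min_families` distinct replication families.

    hits: iterable of (contig_id, protein_family) tuples, e.g. parsed from
    a homology-search result table (the input format is an assumption).
    """
    families_by_contig = defaultdict(set)
    for contig_id, family in hits:
        # Ignore proteins that are not part of the replication set.
        if family in REPLICATION_FAMILIES:
            families_by_contig[contig_id].add(family)

    # A "module" here is simply the sorted set of co-occurring families.
    return {
        contig: tuple(sorted(families))
        for contig, families in families_by_contig.items()
        if len(families) >= min_families
    }


# Made-up example hits, for illustration only.
hits = [
    ("contig_1", "PolA"),
    ("contig_1", "Helicase"),
    ("contig_2", "RNR"),
    ("contig_3", "PolA"),
    ("contig_3", "RNR"),
    ("contig_3", "Helicase"),
    ("contig_3", "Capsid"),
]
modules = detect_modules(hits)
for contig, module in sorted(modules.items()):
    print(contig, "+".join(module))
```

In a Nextflow workflow, a script like this would typically run inside a single process that receives the hit table from an upstream protein-identification step. A real implementation would likely also consider gene order and distance along the contig, which this sketch omits.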