Researcher(s)
- Sean Fletcher, Medical Diagnostics, University of Delaware
Faculty Mentor(s)
- Subhasis Biswas, Medical & Molecular Sciences, University of Delaware
- Esther Biswas-Fiss, Medical & Molecular Sciences, University of Delaware
Abstract
Human papillomavirus (HPV) is a prevalent viral pathogen that causes a variety of malignancies, including cervical cancer. The viral life cycle is controlled by the E1-E2 replication initiation complex, whose functional efficiency is a key determinant of oncogenic potential. To better understand the molecular determinants of pathogenicity, this study integrates a multi-genera conservation analysis with a machine learning (ML) framework trained on viral protein sequences.
Our conservation analysis revealed that while the DNA-binding domain of E2 is highly conserved, the N-terminal transactivation domain, which is responsible for the critical interaction with E1, exhibits significant clade-specific variability. Building on this, we tasked our ML framework with predicting the oncogenic risk of HPV types from their E1 protein sequences. After initial models showed that the distinction between subtypes was not learnable from the sequence data, we reformulated the problem as a binary classification task: separating “Low-Risk” from “Any High-Risk” types. This reformulation yielded a classifier with 100% cross-validated accuracy, indicating a perfectly separable signal between the two classes.
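As an illustration of the kind of per-position scoring a conservation analysis can rest on, the sketch below computes Shannon entropy for each column of a multiple sequence alignment. This is an assumed, commonly used measure and not necessarily the one applied in this study; the function names and the toy alignment are hypothetical and do not represent real E2 sequences.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_profile(aligned_seqs):
    """Per-column entropy for equal-length aligned sequences.

    An entropy of 0 means the column is fully conserved; larger values
    mean more variability at that position.
    """
    lengths = {len(s) for s in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy(col) for col in zip(*aligned_seqs)]


# Toy alignment (made-up sequences for illustration only).
toy_alignment = ["MKTLA", "MKSLA", "MRTLA", "MKT-A"]
profile = conservation_profile(toy_alignment)
```

Averaging such a profile over the residues of each domain would give one simple way to compare a conserved region, such as the DNA-binding domain, with a more variable one.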
Analysis of this classifier identified critical features distributed across functional domains of the E1 protein. Strikingly, the feature-based Random Forest and the sequence-based 1D-CNN showed strong consensus: the highest density of predictive features was concentrated in the C-terminal helicase domain and the N-terminal regulatory domain. This convergence of the two models on key regions, such as the area around Gln-60, provides computational evidence of their importance in oncogenesis. This work establishes a clear direction for future studies, in which these features will guide molecular dynamics simulations. These simulations will aim to dissect the precise molecular determinants of high-risk HPV types by revealing how variations at these sites affect E1’s structure and function. The mechanistic insights gained will inform our use of state-of-the-art protein design tools to develop novel therapeutic inhibitors that target these critical viral functions.
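One simple way to quantify agreement between two models is to intersect their top-ranked residue positions and count the shared positions per domain. The sketch below assumes each model yields a mapping from residue position to an importance score; the scores, the cutoff `k`, and the domain boundaries are all hypothetical placeholders, not values from this study.

```python
def top_positions(scores, k):
    """Return the set of the k positions with the highest importance scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def domain_of(position, domains):
    """Map a 1-based residue position to a domain name, or 'unassigned'."""
    for name, (start, end) in domains.items():
        if start <= position <= end:
            return name
    return "unassigned"


def consensus_by_domain(scores_a, scores_b, domains, k):
    """Positions ranked in the top k by both models, plus per-domain counts."""
    shared = top_positions(scores_a, k) & top_positions(scores_b, k)
    counts = {}
    for pos in sorted(shared):
        name = domain_of(pos, domains)
        counts[name] = counts.get(name, 0) + 1
    return shared, counts


# Hypothetical importance scores keyed by residue position.
rf_scores = {12: 0.05, 60: 0.30, 61: 0.10, 250: 0.02, 480: 0.25, 510: 0.20}
cnn_scores = {12: 0.01, 60: 0.40, 61: 0.03, 250: 0.15, 480: 0.22, 510: 0.18}

# Placeholder domain boundaries (assumed for illustration, not real E1 coordinates).
e1_domains = {
    "N-terminal regulatory": (1, 150),
    "DNA-binding": (151, 350),
    "C-terminal helicase": (351, 650),
}

shared, counts = consensus_by_domain(rf_scores, cnn_scores, e1_domains, k=4)
```

With real attributions, the Random Forest scores might come from feature importances and the 1D-CNN scores from a saliency method; both choices are assumptions here.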