Using Machine Learning for Robust Protein Function Prediction With a Large Language Model

Researcher(s)

Aaron Oster, Computer Science, University of Delaware

Faculty Mentor(s)

Jason Gleghorn, Biomedical Engineering, University of Delaware

Abstract

Aaron S. Oster1, Logan Hallee2, Jason P. Gleghorn3

Departments of Computer Science1, Bioinformatics2,3, and Biomedical Engineering3, University of Delaware, Newark, DE 19716

Proteins are complex biomolecules composed of one or more chains of amino acid residues. Biologists currently struggle to classify and predict the functional behavior of proteins at high throughput, because studying that behavior experimentally is time-intensive and expensive. Since language models have become markedly better at linguistic understanding in recent years, we aim to refine a transformer neural network model that reads amino acid sequences as an interpretable language. However, many traditional language models are so large that they cannot learn effectively from small datasets. Our aim is to enable protein function prediction across unique classes. For this, we turn to Enzyme Commission (EC) numbers, a compact numeric classification scheme that distinguishes unique enzymatic activity. Datasets of annotated proteins enable a mapping of EC numbers to unique classes and therefore offer a mechanism for designing a predictive model. The particular model we turn to is ANKH, an efficient protein language model (pLM) that offers unprecedented performance in protein understanding despite its modest size. Our lab has previously shown that small 1D convolutional networks are highly effective at interpreting the protein latent space of pLMs. To this end, we combined these two resources to predict EC numbers. Even for EC numbers with as few as 55 annotated instances, our model predicts the correct EC number across more than 3,100 unique classes with 86% accuracy and a top-5 accuracy of 97%. In addition to a robust and accurate function predictor, we developed an interactive front-end framework, a chatbot, that allows users to query our trained model. The chatbot has access to the trained EC prediction weights as well as our lab's interaction predictor, SYNTERACT. We hope to continue to enable fast protein annotation and inference to accelerate the pace of biomedical research.
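
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how accuracy and top-5 accuracy are typically computed for a multi-class predictor. It is not the code used in this work: the function name top_k_accuracy and the toy scores and labels are hypothetical, chosen only to illustrate the definition on a three-class example.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int = 1,
) -> float:
    """Fraction of examples whose true class is among the k highest-scoring classes.

    Hypothetical helper for illustration; `scores[i][c]` is the model's score
    for class c on example i, and `labels[i]` is the true class index.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    hits = 0
    for row, true_idx in zip(scores, labels):
        # Indices of the k largest scores for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += true_idx in top_k
    return hits / len(labels)


# Toy data (made up): 4 examples, 3 classes.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.1, 0.3],
]
toy_labels = [1, 1, 0, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

With k=1 this reduces to ordinary accuracy. With k=5, a prediction counts as correct whenever the true EC class appears among the five highest-scoring classes, which is why top-5 accuracy is always at least as high as top-1 accuracy.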