Using Machine Learning to Advance Protein Understanding through Contrastive Language Alignment

Researcher(s)

  • William Sharp, Computer Science, University of Delaware

Faculty Mentor(s)

  • Jason Gleghorn, Biomedical Engineering, University of Delaware

Abstract

Proteins are complex biomolecules composed of smaller subunits called amino acids. The availability of amino acid sequences and their annotations has grown tremendously with the advent of high-throughput assays and inexpensive DNA sequencing. This wealth of annotated data has enabled the rapid development of artificial intelligence (AI) in the biomedical field. By treating the amino acid vocabulary as a semantic language, protein language models (pLMs) enable rapid understanding of proteins. In particular, this recent surge in language models has opened new doors in protein synthesis: given high-quality protein annotations, AI can learn the patterns of known data and extrapolate them into novel sequences. This project aims to demonstrate the ability to search for protein sequences from text descriptions by leveraging contrastive learning. Contrastive learning is a technique in which corresponding multimodal data points are juxtaposed to teach a model which entries are similar and which are different. Initially developed by OpenAI, CLIP (Contrastive Language-Image Pretraining) is an AI model that combines vision and language understanding in a single model, enabling the captioning of images or the generation of images from captions. To accommodate protein sequences in lieu of images, we created ProtCLP (Protein Contrastive Language Pretraining), an AI model that pairs a scientific-text BERT encoder with a ProtBERT protein encoder. Using high-quality datasets from Mol-Instructions, our results show that contrastive learning enables search across a multimodal protein space, opening doors for protein understanding and generation.
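
For readers curious how this kind of contrastive alignment works in practice, the sketch below illustrates a CLIP-style objective over paired protein sequences and text descriptions. It is a minimal illustration under stated assumptions, not the project's exact implementation: the checkpoint names (Rostlab/prot_bert for the protein encoder, allenai/scibert_scivocab_uncased as a stand-in for the scientific BERT model), the projection dimension, and the toy inputs are all illustrative choices.

```python
# Minimal sketch of CLIP-style contrastive alignment between protein
# sequences and text descriptions (PyTorch + Hugging Face transformers).
# Checkpoint names and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class ProtCLPSketch(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Protein encoder: ProtBERT, which tokenizes space-separated amino acids.
        self.prot_encoder = AutoModel.from_pretrained("Rostlab/prot_bert")
        # Text encoder: a BERT model pretrained on scientific literature.
        self.text_encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
        # Linear projections map both modalities into one shared embedding space.
        self.prot_proj = nn.Linear(self.prot_encoder.config.hidden_size, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        # Learnable temperature, initialized to ln(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, prot_inputs, text_inputs):
        # Use each encoder's [CLS] token as the sequence-level representation.
        prot_emb = self.prot_encoder(**prot_inputs).last_hidden_state[:, 0]
        text_emb = self.text_encoder(**text_inputs).last_hidden_state[:, 0]
        prot_emb = F.normalize(self.prot_proj(prot_emb), dim=-1)
        text_emb = F.normalize(self.text_proj(text_emb), dim=-1)
        # Pairwise cosine similarities, scaled by the learned temperature.
        return self.logit_scale.exp() * prot_emb @ text_emb.t()

def clip_loss(logits: torch.Tensor) -> torch.Tensor:
    # Matching protein/text pairs sit on the diagonal; the symmetric
    # cross-entropy pulls them together and pushes mismatched pairs apart.
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

if __name__ == "__main__":
    model = ProtCLPSketch()
    prot_tok = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
    text_tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
    # Toy batch: ProtBERT expects amino acids separated by spaces.
    prots = ["M K T A Y I A K", "G S H M L E D"]
    descs = ["a putative kinase fragment", "a short helical linker"]
    prot_in = prot_tok(prots, return_tensors="pt", padding=True)
    text_in = text_tok(descs, return_tensors="pt", padding=True)
    loss = clip_loss(model(prot_in, text_in))
```

The symmetric cross-entropy over the similarity matrix is the core of contrastive pretraining: each protein is trained to be most similar to its own description and vice versa, which is what later makes text-to-protein search across the shared embedding space possible.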