Researcher(s)
- Tamar Peleg, Applied Molecular Biology & Biotechnology, University of Delaware
Faculty Mentor(s)
- Logan Hallee, Center for Bioinformatics and Computational Biology, University of Delaware
- Nikolaos Rafailidis, Center for Bioinformatics and Computational Biology, University of Delaware
- Jason Gleghorn, Biomedical Engineering, University of Delaware
Abstract
Protein language models (pLMs) have revolutionized the life sciences by deriving sequence-based embeddings that support diverse downstream prediction tasks. pLM-based pipelines offer broad utility in drug discovery, enzyme engineering, and synthetic biology, where rapid prediction of physicochemical and biological properties can accelerate target identification and therapeutic development. However, their adoption in chemical and biomedical research is constrained by complex workflows and heavy computational demands. We introduce Protify, a platform that simplifies pLM-based property prediction through end-to-end benchmarking, flexible pipeline construction, publication-ready visualization, and built-in reproducibility, without requiring advanced programming skills. To demonstrate Protify's utility for therapeutic screening, we present two case studies with direct translational relevance.
(1) Multi-label subcellular localization prediction. A curated set of 34,693 sequences spanning 13 compartments was analyzed using embeddings from nine pLMs, compared against three controls, with a novel transformer-based probing scheme that enhances interpretability. ESMC-600 achieved the best performance with an F1max of 0.82, significantly surpassing the random (0.46), random-transformer (0.70), and one-hot (0.70) controls. The probing scheme also produced interpretable attention maps that pinpoint biologically meaningful motifs and domains. Protein localization prediction is an essential task for therapeutic antibody and peptide screening: precise in silico localization profiles facilitate the development of compartment-specific delivery, in which therapeutics co-localize with their targets.
(2) Taxonomic classification across eight hierarchical ranks (domain through species; up to 268,989 sequences per rank), performed with a three-layer linear probe. With ESMC-600, F1 scores ranged from 0.96 (domain) to 0.27 (species), revealing that pLMs inherently encode phylogenetic signals and highlighting the risk of unintended data leakage in multispecies studies.
Together, these two distinct case studies demonstrate Protify's ability to deliver state-of-the-art performance and biological insight, as well as its potential to enable large-scale screening of protein libraries and to make cutting-edge computational methods more accessible.
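For readers unfamiliar with the F1max metric reported for the localization task, the following is a minimal sketch of one common way to compute it: sweep a decision threshold over the predicted probabilities and keep the highest resulting F1. It assumes micro-averaging over all (sequence, label) pairs and an evenly spaced threshold grid; the function name `f1_max` and these choices are illustrative assumptions, not Protify's confirmed implementation.

```python
import numpy as np

def f1_max(y_true, y_prob, thresholds=None):
    """Return (best F1, threshold) for multi-label predictions.

    Assumption: micro-averaged F1 over all (sequence, label) pairs,
    maximized over a grid of decision thresholds.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_prob = np.asarray(y_prob, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)  # hypothetical grid

    best_f1, best_t = 0.0, None
    for t in thresholds:
        y_pred = y_prob >= t
        tp = np.sum(y_pred & y_true)    # true positives
        fp = np.sum(y_pred & ~y_true)   # false positives
        fn = np.sum(~y_pred & y_true)   # false negatives
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_f1, best_t = float(f1), float(t)
    return best_f1, best_t

# Toy example: 2 sequences, 3 compartments (made-up values).
labels = [[1, 0, 1], [0, 1, 0]]
probs = [[0.9, 0.2, 0.8], [0.1, 0.7, 0.3]]
score, threshold = f1_max(labels, probs)
```

Because the threshold is chosen after seeing the labels, F1max is an optimistic summary of ranking quality rather than the F1 obtained at a fixed operating point.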