Researcher(s)
- Liam Stapley, Computer Science, University of Delaware
Faculty Mentor(s)
- Matthew Mauriello, Computer and Information Sciences, University of Delaware
Abstract
Recent advances in contrastive audio-language models, such as CLAP (Contrastive Language-Audio Pretraining), offer scalable methods for annotating music with emotionally relevant tags by aligning audio embeddings with natural language descriptions. However, these annotations are weakly supervised and often lack validation against human perception, limiting their reliability for therapeutic applications.
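As a concrete illustration of this alignment, the sketch below scores a set of candidate emotion tags against an audio clip using a publicly released CLAP checkpoint via HuggingFace transformers; the checkpoint name, tag list, and file path are illustrative assumptions rather than details of our pipeline.

```python
# Minimal sketch of zero-shot emotion tagging with CLAP. The checkpoint,
# tag list, and audio path are illustrative assumptions, not our pipeline.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

TAGS = ["calm", "joyful", "melancholic", "tense", "hopeful"]  # hypothetical tag set

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

# CLAP's audio encoder expects 48 kHz mono audio.
audio, _ = librosa.load("track_001.wav", sr=48000, mono=True)

inputs = processor(text=TAGS, audios=audio, sampling_rate=48000,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the audio-text similarity logits yields per-tag confidences.
probs = outputs.logits_per_audio.softmax(dim=-1).squeeze(0)
top = probs.topk(k=3)
for score, idx in zip(top.values, top.indices):
    print(f"{TAGS[int(idx)]}: {score:.3f}")
```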
We propose a human-in-the-loop framework for evaluating and refining CLAP-generated emotion annotations for therapeutic music. CLAP is used to generate top-k emotion tags and confidence scores for a curated library of 432 modular music tracks designed for therapeutic use. Based on the top three predicted tags, GPT-4o produces concise emotional profiles in natural language. These annotations are evaluated through a crowdsourcing study on Amazon Mechanical Turk (MTurk), involving 16 qualified workers who completed a gold-standard clearance test co-designed with music composers to ensure domain familiarity and annotation quality.
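To sketch the profile-generation step (the prompt wording and helper function below are hypothetical, not the exact prompt used in the study), the top three tags could be turned into a short natural-language profile with the OpenAI API as follows:

```python
# Hypothetical sketch of the profile-generation step; the prompt text and
# function name are illustrative, not the study's exact prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def emotional_profile(top_tags: list[str]) -> str:
    """Ask GPT-4o for a concise emotional profile from the top-3 CLAP tags."""
    prompt = (
        "A therapeutic music track was tagged with these emotions, in "
        f"descending order of model confidence: {', '.join(top_tags)}. "
        "Write a concise one- to two-sentence emotional profile of the track."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

print(emotional_profile(["calm", "hopeful", "melancholic"]))
```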
To support this process, we developed a custom web interface using React and FastAPI. The interface enables workers to listen to audio clips, rate their agreement with CLAP-generated tags, rank the top three features, and select between competing machine-generated descriptions. Safeguards such as skip prevention, form submission limits, and URL parameters that identify the worker and load the assigned song ensure usability and data integrity. Emotion tag definitions are included to support clarity and consistency across annotators. All responses are stored in a structured SQL database for downstream analysis.
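As a rough sketch of the backend's shape (the route, field names, and table schema below are illustrative assumptions, not our actual implementation), a response-submission endpoint might look like this:

```python
# Illustrative sketch of a response-submission endpoint; the route, fields,
# and table schema are assumptions, not the project's actual implementation.
import sqlite3
from fastapi import FastAPI
from pydantic import BaseModel

DB_PATH = "responses.db"  # stand-in for the structured SQL database

app = FastAPI()

class AnnotationResponse(BaseModel):
    worker_id: str           # parsed from the task URL
    song_id: str             # parsed from the task URL
    tag_agreement: int       # e.g., 1-5 rating of the CLAP-generated tags
    tag_ranking: list[str]   # worker's ranking of the top three tags
    chosen_description: str  # preferred machine-generated description

@app.on_event("startup")
def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS responses ("
            "worker_id TEXT, song_id TEXT, tag_agreement INTEGER, "
            "tag_ranking TEXT, chosen_description TEXT)"
        )

@app.post("/responses")
def submit_response(r: AnnotationResponse):
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "INSERT INTO responses VALUES (?, ?, ?, ?, ?)",
            (r.worker_id, r.song_id, r.tag_agreement,
             ",".join(r.tag_ranking), r.chosen_description),
        )
    return {"status": "ok"}
```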
Our planned analyses will examine correlations between model confidence and human agreement, apply pseudo-labeling to refine annotations, and explore inter-worker agreement patterns. This work contributes a human-validated therapeutic music dataset, a scalable evaluation platform, and practical guidance for integrating emotional metadata into music therapy applications.
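As a sketch of two of these analyses (all numbers below are toy placeholders, not study data), the confidence-agreement correlation and a chance-corrected inter-worker agreement statistic could be computed as follows:

```python
# Sketch of two planned analyses; all numbers here are toy placeholders.
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.inter_rater import fleiss_kappa

# One value per track: CLAP's top-tag confidence and the mean worker
# agreement rating for that tag (hypothetical values).
model_confidence = np.array([0.91, 0.72, 0.55, 0.83, 0.40])
human_agreement = np.array([4.6, 3.8, 2.9, 4.1, 2.5])
rho, p = spearmanr(model_confidence, human_agreement)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

# Inter-worker agreement: rows are tracks, columns count how many of the
# 16 workers preferred each of three competing descriptions.
counts = np.array([[14, 1, 1],
                   [9, 4, 3],
                   [5, 6, 5]])
print(f"Fleiss' kappa = {fleiss_kappa(counts):.2f}")
```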