Assessing the Utility of LLMs in Generating Effective Student Feedback

Researcher(s)

  • Brendan Lewis, Computer Science, University of Delaware

Faculty Mentor(s)

  • John Aromando, Computer & Information Sciences (College of Engineering), University of Delaware

Abstract

Programmers, especially novice programmers, often rely on automatically generated feedback to fix dysfunctional code, whether that is an error message produced by the language’s compiler or interpreter or feedback generated by an autograding tool. With the advent of generative AI, and large language models in particular, it is now theoretically possible to generate human-like feedback on faulty code and to enhance existing deterministic feedback that may be hard to understand. Ideally, the addition of AI-generated feedback would further improve students’ comprehension of their code. To explore this possibility, we added a new tool to the autograder/feedback pipeline Pedal that submits a prompt containing the programmer’s code to OpenAI’s GPT-4 model. Several variations of the prompt were run on 80 anonymized student programs from an introductory computer science course, drawn from problems with a wide range of observed difficulty. Comparing the utility of the AI-generated feedback to the pre-existing Pedal feedback required a new rubric for measuring feedback utility. Accuracy, conciseness, clarity, amount of jargon, and sentiment were each rated on a one-to-five Likert scale for all AI-generated feedback, and the unmodified Pedal feedback was graded the same way for each student program. Analysis of the ratings showed differences in several categories, with accuracy and conciseness being the most intriguing. This paper tests the viability of using a large language model to generate feedback on student-submitted code and builds on existing work by introducing a rubric for measuring feedback utility.
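
The core step described above, submitting a student's program and the existing autograder output to GPT-4 for plain-language feedback, can be approximated with the short Python sketch below. It is illustrative only: it uses the openai client library, and the function names (build_prompt, generate_ai_feedback), the prompt wording, and the way Pedal's feedback is passed in are hypothetical stand-ins rather than the project's actual code.

    # Hypothetical sketch of the GPT-4 feedback step; not the actual Pedal tool.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def build_prompt(student_code: str, pedal_feedback: str) -> str:
        """Combine the student's program and Pedal's deterministic feedback
        into a single prompt asking for novice-friendly guidance."""
        return (
            "You are a tutor for an introductory Python course.\n"
            "Explain, in plain language, what is wrong with this program "
            "and how the student might fix it.\n\n"
            f"Student code:\n{student_code}\n\n"
            f"Autograder feedback:\n{pedal_feedback}\n"
        )

    def generate_ai_feedback(student_code: str, pedal_feedback: str) -> str:
        """Send the combined prompt to GPT-4 and return the model's feedback text."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": build_prompt(student_code, pedal_feedback)}],
        )
        return response.choices[0].message.content

In the study, prompt variations of this kind were applied to each of the 80 anonymized submissions, and the resulting text was scored alongside the unmodified Pedal feedback using the rubric described above.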