Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties - Report - MDSpire
Psychometric Evaluation of GPT-4o-Generated vs Human MCQs in Imaging Disciplines
Overview
This study compared multiple-choice questions (MCQs) generated by GPT-4o with those authored by experts across three imaging specialties. Results showed no significant differences in item difficulty or discriminatory power between AI-generated and human-written items, and participants could not reliably identify item origin beyond chance.
Background
High-quality MCQ banks are essential for medical education but require substantial expert effort and psychometric validation. Large language models (LLMs) such as GPT-4o could accelerate question generation, yet their reliability and validity in clinical assessment remain underexplored. This study examined whether AI-generated items in imaging disciplines are psychometrically equivalent to expert-authored items, and whether medical students and physicians can tell the two apart.
Key Findings
GPT-4o–generated MCQs demonstrated psychometric properties (difficulty and discrimination) statistically indistinguishable from expert-authored items.
Neither medical students nor physicians could reliably identify whether an item was AI-generated or human-written, with accuracy near chance level (0.50).
There were no significant differences in psychometric indices or origin identification between clinicians and students.
A human-in-the-loop process was used to screen AI-generated items for factual accuracy before administration.
The study was preregistered, blinded, and conducted across three imaging specialties with a balanced participant cohort.
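For readers unfamiliar with the indices named above, the following is a minimal sketch of how item difficulty, item discrimination, and a chance-level check are commonly computed in classical test theory. This summary does not state which formulas or tests the study used, so the choices here (proportion correct, corrected item-total correlation, exact two-sided binomial test) are assumptions, and all function names and data are hypothetical.

```python
from math import comb
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """Proportion of examinees who answered the item correctly (higher = easier)."""
    return mean(item_scores)


def item_discrimination(item_scores, total_scores):
    """Corrected item-total correlation (a point-biserial coefficient).

    The item's own score is removed from each total so that the item
    is not correlated with itself.
    """
    rest = [t - s for s, t in zip(item_scores, total_scores)]
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest)
    if sd_item == 0 or sd_rest == 0:
        return float("nan")  # undefined when there is no variance
    m_item, m_rest = mean(item_scores), mean(rest)
    cov = mean([(s - m_item) * (r - m_rest) for s, r in zip(item_scores, rest)])
    return cov / (sd_item * sd_rest)


def binomial_two_sided_p(successes, trials, p0=0.5):
    """Exact two-sided binomial p-value against a chance rate p0.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one.
    """
    pmf = [comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# Hypothetical data: one item scored 0/1 for eight examinees,
# plus each examinee's total score on a ten-item test.
item = [1, 1, 1, 0, 0, 1, 0, 1]
totals = [9, 8, 7, 3, 4, 8, 2, 6]

difficulty = item_difficulty(item)
discrimination = item_discrimination(item, totals)

# Hypothetical detection task: 52 correct origin guesses out of 100.
p_value = binomial_two_sided_p(52, 100)
```

In this framing, a detection accuracy near 0.50 yields a large p-value, meaning the data are consistent with guessing. Comparing AI-generated and human-authored items would then amount to comparing the distributions of `difficulty` and `discrimination` across the two item sets.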
Clinical Implications
GPT-4o can effectively augment the development of formative MCQ assessments in imaging disciplines without compromising psychometric quality. Its use may reduce faculty workload while maintaining item reliability and learner engagement. However, human oversight remains critical to ensure factual accuracy and appropriateness before deployment.
Conclusion
GPT-4o–generated MCQs are psychometrically comparable to expert-written items, and neither medical students nor physicians could distinguish them from human-authored items at better than chance. These findings support their potential integration into medical education assessment workflows in imaging specialties, provided items are screened by human reviewers.
by Philipp Linde, Florian Fichter, Markus Dietlein, Ferdinand Sudbrock, Kambiz Afshar, Hendrik Dapper, Emmanouil Fokas, Anna-Lena Hillebrecht, Tobias Raupach, Matthias Carl Laupichler