Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties - Scorecard - MDSpire
Clinical Scorecard: Evaluation of Psychometric Characteristics and Identifiability of Multiple-Choice Questions Generated by GPT-4o Versus Those Created by Humans in Imaging Disciplines
At a Glance
Category | Detail
Condition | Assessment quality of AI-generated versus human-authored multiple-choice questions (MCQs) in medical imaging education
Key Mechanisms | Comparison of item difficulty, discriminatory power, and origin identifiability between GPT-4o-generated and expert-written MCQs
Target Population | Medical students and clinical physicians in imaging specialties (radiation oncology, radiology, nuclear medicine)
Care Setting | Medical education and assessment settings within imaging disciplines
Key Highlights
GPT-4o-generated MCQs showed no statistically significant difference from expert-authored items in core psychometric properties (item difficulty and discrimination).
Neither examinees nor experts identified the origin of MCQs (AI vs human) at better-than-chance rates.
A human-in-the-loop process was used to screen AI-generated items for factual accuracy prior to administration.
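One way to check a claim such as "no statistically significant difference" in item difficulty is a two-sample permutation test on the per-item difficulty values. The sketch below is illustrative only: the summary does not state which statistical test the study used, and the function name, the difficulty values, and the resample count are all assumptions.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test_mean_diff(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means between two item sets.

    Returns an estimated p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        # Small tolerance guards against floating-point rounding in the comparison.
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical item difficulties (proportion correct); NOT data from the study.
ai_items = [0.62, 0.71, 0.55, 0.80, 0.67]
human_items = [0.65, 0.69, 0.58, 0.77, 0.70]
p_value = permutation_test_mean_diff(ai_items, human_items)
print(f"estimated p-value: {p_value:.3f}")
```

A large p-value here only means the data give no evidence of a difference; it does not by itself demonstrate equivalence.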
Guideline-Based Recommendations
Diagnosis
Use psychometric evaluation (item difficulty and discrimination) to assess MCQ quality in medical education.
Incorporate blinded, within-subject study designs to compare AI-generated and human-authored assessment items.
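As a minimal sketch of the two psychometric parameters named above, the following uses their common classical test theory definitions: difficulty as the proportion of examinees answering an item correctly, and discrimination as the corrected item-total (point-biserial) correlation. These definitions are an assumption; the summary does not say how the study computed either index. The response matrix is hypothetical.

```python
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """Proportion of examinees who answered the item correctly (0/1 scores)."""
    return mean(item_scores)


def item_discrimination(item_scores, total_scores):
    """Corrected item-total correlation.

    The item's own score is removed from each total first, so the item
    is not correlated with itself.
    """
    rest = [t - x for x, t in zip(item_scores, total_scores)]
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest)
    if sd_item == 0 or sd_rest == 0:
        return float("nan")  # undefined when everyone scores the same
    m_item, m_rest = mean(item_scores), mean(rest)
    cov = mean([(x - m_item) * (r - m_rest) for x, r in zip(item_scores, rest)])
    return cov / (sd_item * sd_rest)


# Hypothetical data: rows are examinees, columns are items (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
totals = [sum(row) for row in responses]
columns = list(zip(*responses))
difficulties = [item_difficulty(list(col)) for col in columns]
discriminations = [item_discrimination(list(col), totals) for col in columns]
```

Note that a higher difficulty value under this definition means an easier item, since it is the share of correct answers.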
Management
Employ human expert review to screen AI-generated MCQs for factual accuracy before use in assessments.
Leverage AI-generated items to augment formative assessment banks while maintaining psychometric standards.
Monitoring & Follow-up
Monitor item performance metrics continuously to detect domain drift or error propagation in AI-generated content.
Assess examinee and expert ability to detect item origin to ensure integrity and acceptance of AI-generated assessments.
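Whether raters detect item origin better than chance can be checked with an exact binomial test on their guesses. The sketch below assumes a chance level of 0.5 (two possible origins, guessed equally often) and uses invented counts; neither the chance level nor the counts come from the study, and the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided binomial test.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under the null hypothesis of guessing at p_null.
    """
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # Relative tolerance so ties are not lost to floating-point rounding.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical example: 54 correct origin guesses out of 100 items.
p_near_chance = binomial_two_sided_p(54, 100)
# Hypothetical contrast: 70 correct out of 100 would be clear detection.
p_detectable = binomial_two_sided_p(70, 100)
print(p_near_chance, p_detectable)
```

Re-running such a check on each new batch of items would show whether AI-generated questions become easier to spot over time.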
Risks
Some AI-generated MCQs may need to be excluded because of errors or unclear wording, so human oversight remains necessary.
Heterogeneous trainee confidence and competency with AI tools highlight the need for transparent educational policies.
Patient & Prescribing Data
Medical students and physicians undergoing assessment in imaging disciplines
In this comparison, AI-generated MCQs matched human-authored questions in psychometric quality and could not be reliably told apart from them, which supports using them to supplement human-authored items in formative assessments.
Clinical Best Practices
Implement blinded, within-subject comparisons to validate AI-generated educational content against expert standards.
Maintain human-in-the-loop review processes to ensure factual accuracy and appropriateness of AI-generated MCQs.
Use psychometric parameters such as item difficulty and discrimination to guide item selection and refinement.
Educate trainees and clinicians on the capabilities and limitations of AI-generated assessment tools to foster appropriate trust and use.
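The item selection practice above can be sketched as a simple screening step that flags items for expert review. The thresholds below (difficulty between 0.2 and 0.9, discrimination of at least 0.2) are illustrative rules of thumb chosen for this example, not values reported in the source, and `ItemStats` and `flag_items` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ItemStats:
    item_id: str
    difficulty: float       # proportion correct, 0..1
    discrimination: float   # item-total correlation, -1..1


def flag_items(items, min_difficulty=0.2, max_difficulty=0.9, min_discrimination=0.2):
    """Return {item_id: [reasons]} for items outside the assumed thresholds."""
    flagged = {}
    for item in items:
        reasons = []
        if item.difficulty < min_difficulty:
            reasons.append("too hard")
        if item.difficulty > max_difficulty:
            reasons.append("too easy")
        if item.discrimination < min_discrimination:
            reasons.append("low discrimination")
        if reasons:
            flagged[item.item_id] = reasons
    return flagged


# Hypothetical item statistics; NOT data from the study.
items = [
    ItemStats("Q1", 0.95, 0.10),
    ItemStats("Q2", 0.60, 0.35),
    ItemStats("Q3", 0.15, 0.25),
]
flagged = flag_items(items)
for item_id, reasons in flagged.items():
    print(item_id, ", ".join(reasons))
```

Flagged items are candidates for human review and revision rather than automatic removal, in keeping with the human-in-the-loop approach described above.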
by Philipp Linde, Florian Fichter, Markus Dietlein, Ferdinand Sudbrock, Kambiz Afshar, Hendrik Dapper, Emmanouil Fokas, Anna-Lena Hillebrecht, Tobias Raupach, Matthias Carl Laupichler