Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties

By
Philipp Linde
Florian Fichter
Markus Dietlein
Ferdinand Sudbrock
Kambiz Afshar
Hendrik Dapper
Emmanouil Fokas
Anna-Lena Hillebrecht
Tobias Raupach
Matthias Carl Laupichler
January 8, 2026

npj Digital Medicine

Overview

This study compared multiple-choice questions (MCQs) generated by GPT-4o with those authored by experts across three imaging specialties. Results showed no significant differences in item difficulty or discriminatory power between AI-generated and human-written items, and participants could not reliably identify item origin beyond chance.

Background

High-quality MCQ banks are essential for medical education but require substantial expert effort and psychometric validation. Large language models (LLMs) such as GPT-4o could accelerate question generation, yet their reliability and validity in clinical assessment remain underexplored. This study examines two questions in imaging disciplines: whether AI-generated items are psychometrically equivalent to human-authored items, and whether medical students and physicians can detect which items were AI-generated.

Data Highlights

Metric	GPT-4o MCQs	Human MCQs	Statistical Significance
Item Difficulty	Comparable	Comparable	No significant difference
Discriminatory Power	Comparable	Comparable	No significant difference
Origin Identification Accuracy	~50% (chance)	~50% (chance)	No significant difference

Participants: 128 total (82 medical students, 46 physicians)
Specialties: Radiation oncology, nuclear medicine, radiology, others

Key Findings

GPT-4o–generated MCQs demonstrated psychometric properties (difficulty and discrimination) statistically indistinguishable from expert-authored items.
Neither medical students nor physicians could reliably identify whether an item was AI-generated or human-written, with accuracy near chance level (0.50).
There were no significant differences in psychometric indices or origin identification between clinicians and students.
A human-in-the-loop process was used to screen AI-generated items for factual accuracy before administration.
The study was preregistered and blinded, and was conducted across three imaging specialties with a cohort of medical students and physicians.
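
One way to check a claim such as "accuracy near chance level (0.50)" is an exact two-sided binomial test against p = 0.5. The sketch below is an illustration only: the summary does not say which test the study used, and the counts (52 correct out of 100 judgments) are hypothetical, not taken from the study.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_test_two_sided(k, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes no more likely
    than the observed count under the null success probability p."""
    observed = binom_pmf(k, n, p)
    # Small relative tolerance so ties are not lost to floating-point error.
    total = sum(
        prob
        for prob in (binom_pmf(i, n, p) for i in range(n + 1))
        if prob <= observed * (1 + 1e-9)
    )
    return min(1.0, total)

# Hypothetical counts for illustration only.
correct, judgments = 52, 100
p_value = binom_test_two_sided(correct, judgments)
print(f"accuracy={correct / judgments:.2f}, p-value vs. chance={p_value:.3f}")
```

A large p-value here means the observed accuracy is consistent with guessing. It does not by itself prove that raters cannot tell the items apart.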

Clinical Implications

GPT-4o can effectively augment the development of formative MCQ assessments in imaging disciplines without compromising psychometric quality. Its use may reduce faculty workload while maintaining item reliability and learner engagement. However, human oversight remains critical to ensure factual accuracy and appropriateness before deployment.

Conclusion

GPT-4o–generated MCQs were psychometrically comparable to expert-written items, and neither medical students nor physicians could reliably tell the two apart. These results support their potential integration into medical education assessment workflows in imaging specialties, provided that items are screened by human experts.

