Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties

By
Philipp Linde
Florian Fichter
Markus Dietlein
Ferdinand Sudbrock
Kambiz Afshar
Hendrik Dapper
Emmanouil Fokas
Anna-Lena Hillebrecht
Tobias Raupach
Matthias Carl Laupichler
January 8, 2026

npj Digital Medicine

At a Glance

Category	Detail
Condition	Assessment quality of AI-generated versus human-authored multiple-choice questions (MCQs) in medical imaging education
Key Mechanisms	Comparison of item difficulty, discriminatory power, and origin identifiability between GPT-4o-generated and expert-written MCQs
Target Population	Medical students and clinical physicians in imaging specialties (radiation oncology, radiology, nuclear medicine)
Care Setting	Medical education and assessment settings within imaging disciplines

Key Highlights

GPT-4o-generated MCQs showed no statistically significant difference from expert-authored items in core psychometric properties (item difficulty and discrimination).
Examinees and experts could not identify the origin of MCQs (AI vs human) at a rate better than chance.
A human-in-the-loop process was used to screen AI-generated items for factual accuracy prior to administration.

Guideline-Based Recommendations

Diagnosis

Use psychometric evaluation (item difficulty and discrimination) to assess MCQ quality in medical education.
Incorporate blinded, within-subject study designs to compare AI-generated and human-authored assessment items.
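
The two psychometric parameters named above can be sketched in a few lines. This is a minimal illustration, not the study's analysis: it assumes item difficulty is the proportion of correct answers and discrimination is the corrected item-total correlation (the summary does not state which discrimination index the authors used). The function names and the response matrix are hypothetical.

```python
from statistics import mean, pstdev

def item_difficulty(responses, item):
    """Proportion of examinees who answered the item correctly (0 to 1)."""
    return mean(row[item] for row in responses)

def item_discrimination(responses, item):
    """Corrected item-total correlation: Pearson r between the item score
    and the examinee's total score on all OTHER items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest_scores)
    if sd_item == 0 or sd_rest == 0:
        return None  # undefined when either score has no variance
    m_item, m_rest = mean(item_scores), mean(rest_scores)
    cov = mean((i - m_item) * (r - m_rest)
               for i, r in zip(item_scores, rest_scores))
    return cov / (sd_item * sd_rest)

# Hypothetical scored responses: rows are examinees, columns are items
# (1 = correct, 0 = incorrect). Not data from the study.
responses = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

for item in range(4):
    print(item, item_difficulty(responses, item),
          item_discrimination(responses, item))
```

An item that everyone answers correctly (the last column here) has a difficulty of 1.0 and no defined discrimination, which is why such items are usually flagged for review.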

Management

Employ human expert review to screen AI-generated MCQs for factual accuracy before use in assessments.
Leverage AI-generated items to augment formative assessment banks while maintaining psychometric standards.

Monitoring & Follow-up

Monitor item performance metrics continuously to detect domain drift or error propagation in AI-generated content.
Assess examinee and expert ability to detect item origin to ensure integrity and acceptance of AI-generated assessments.
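
Whether raters detect item origin better than chance can be checked with an exact one-sided binomial test. The sketch below assumes each judgment is a binary AI-versus-human guess with a chance level of 0.5; that chance level, the function name, and the example counts are illustrative assumptions, not values from the study.

```python
from math import comb

def binomial_p_greater(successes, trials, chance=0.5):
    """Exact one-sided p-value: probability of at least `successes`
    correct guesses in `trials` attempts if raters guess at `chance`."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical example: 5 correct origin guesses out of 10 items.
p_value = binomial_p_greater(5, 10)
print(f"p = {p_value:.3f}")
```

A large p-value means the observed hit rate is compatible with guessing; it does not by itself prove that items are indistinguishable, particularly with few judgments.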

Risks

AI-generated MCQs may contain errors or unclear wording that require their exclusion, so human oversight remains necessary.
Heterogeneous trainee confidence and competency with AI tools highlight the need for transparent educational policies.

Patient & Prescribing Data

Medical students and physicians undergoing assessment in imaging disciplines

AI-generated MCQs can supplement human-authored questions in formative assessments: in this study they showed no significant difference from expert items in psychometric quality, and their origin could not be identified better than chance.

Clinical Best Practices

Implement blinded, within-subject comparisons to validate AI-generated educational content against expert standards.
Maintain human-in-the-loop review processes to ensure factual accuracy and appropriateness of AI-generated MCQs.
Use psychometric parameters such as item difficulty and discrimination to guide item selection and refinement.
Educate trainees and clinicians on the capabilities and limitations of AI-generated assessment tools to foster appropriate trust and use.

References

Original Source(s)

Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties
