Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties - Scorecard - MDSpire

By Philipp Linde, Florian Fichter, Markus Dietlein, Ferdinand Sudbrock, Kambiz Afshar, Hendrik Dapper, Emmanouil Fokas, Anna-Lena Hillebrecht, Tobias Raupach, Matthias Carl Laupichler

January 8, 2026

Clinical Scorecard: Evaluation of Psychometric Characteristics and Identifiability of Multiple-Choice Questions Generated by GPT-4o Versus Those Created by Humans in Imaging Disciplines

At a Glance

Category | Detail
Condition | Assessment quality of AI-generated versus human-authored multiple-choice questions (MCQs) in medical imaging education
Key Mechanisms | Comparison of item difficulty, discriminatory power, and origin identifiability between GPT-4o-generated and expert-written MCQs
Target Population | Medical students and clinical physicians in imaging specialties (radiation oncology, radiology, nuclear medicine)
Care Setting | Medical education and assessment settings within imaging disciplines

Key Highlights

  • GPT-4o-generated MCQs showed no statistically significant difference from expert-authored items in core psychometric properties (item difficulty and discrimination).
  • Neither examinees nor experts could identify the origin of MCQs (AI vs human) better than chance.
  • A human-in-the-loop process was used to screen AI-generated items for factual accuracy prior to administration.

Guideline-Based Recommendations

Diagnosis

  • Use psychometric evaluation (item difficulty and discrimination) to assess MCQ quality in medical education.
  • Incorporate blinded, within-subject study designs to compare AI-generated and human-authored assessment items.
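
The two psychometric parameters named above can be computed from a scored response matrix. The following is a minimal sketch, not the study's analysis code: it assumes item difficulty is the proportion of correct responses and item discrimination is the correlation between an item and the rest-of-test score (one common definition; the source does not state which index the authors used). The function names and the matrix layout are hypothetical.

```python
from statistics import mean, pstdev


def item_difficulty(responses):
    """Proportion of examinees who answered the item correctly (0 to 1)."""
    return mean(responses)


def item_discrimination(item, rest_scores):
    """Pearson correlation between a 0/1 item and the rest-of-test score.

    Assumption: discrimination is defined as the corrected item-total
    correlation. Returns NaN when either variable has no variance.
    """
    sx, sy = pstdev(item), pstdev(rest_scores)
    if sx == 0 or sy == 0:
        return float("nan")
    mx, my = mean(item), mean(rest_scores)
    cov = mean([(x - mx) * (y - my) for x, y in zip(item, rest_scores)])
    return cov / (sx * sy)


def analyze(matrix):
    """matrix[e][i] is 1 if examinee e answered item i correctly, else 0.

    Returns one (difficulty, discrimination) pair per item.
    """
    totals = [sum(row) for row in matrix]
    results = []
    for i in range(len(matrix[0])):
        item = [row[i] for row in matrix]
        # Exclude the item itself from the total so it is not correlated with itself.
        rest = [t - x for t, x in zip(totals, item)]
        results.append((item_difficulty(item), item_discrimination(item, rest)))
    return results
```

Calling `analyze` on a matrix with one row per examinee and one column per item yields values that can be compared between AI-generated and human-authored items.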

Management

  • Employ human expert review to screen AI-generated MCQs for factual accuracy before use in assessments.
  • Leverage AI-generated items to augment formative assessment banks while maintaining psychometric standards.

Monitoring & Follow-up

  • Monitor item performance metrics continuously to detect domain drift or error propagation in AI-generated content.
  • Assess examinee and expert ability to detect item origin to ensure integrity and acceptance of AI-generated assessments.
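
Whether raters detect item origin better than chance can be checked with an exact one-sided binomial test. This is a sketch under stated assumptions: each judgment is a binary AI-or-human guess, the chance level is 0.5, and judgments are treated as independent. The source does not report which test the authors used, and the function name is hypothetical.

```python
from math import comb


def binomial_p_greater(successes, trials, chance=0.5):
    """Exact one-sided p-value: P(X >= successes) for X ~ Binomial(trials, chance).

    Assumption: chance=0.5 corresponds to guessing between two origins.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )
```

A large p-value means the observed number of correct origin judgments is compatible with guessing; it does not by itself prove that the items are indistinguishable.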

Risks

  • Some AI-generated MCQs may have to be excluded because of errors or unclear wording, so human oversight remains necessary.
  • Heterogeneous trainee confidence and competency with AI tools highlight the need for transparent educational policies.

Patient & Prescribing Data

Medical students and physicians undergoing assessment in imaging disciplines

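AI-generated MCQs that have passed expert review can supplement human-authored questions in formative assessments: they showed no statistically significant difference in item difficulty or discrimination, and their origin could not be identified better than chance.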

Clinical Best Practices

  • Implement blinded, within-subject comparisons to validate AI-generated educational content against expert standards.
  • Maintain human-in-the-loop review processes to ensure factual accuracy and appropriateness of AI-generated MCQs.
  • Use psychometric parameters such as item difficulty and discrimination to guide item selection and refinement.
  • Educate trainees and clinicians on the capabilities and limitations of AI-generated assessment tools to foster appropriate trust and use.
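
In a within-subject design, each examinee answers both AI-generated and human-authored items, so the comparison can be made on paired scores. The sketch below uses an exact sign-flip permutation test on per-examinee differences. This is an illustrative technique chosen here, not the method reported in the source; the function name and inputs are hypothetical, and full enumeration is only practical for small samples because it visits 2^n sign patterns.

```python
from itertools import product


def paired_sign_flip_p(ai_scores, human_scores):
    """Two-sided exact sign-flip permutation p-value for paired scores.

    ai_scores[e] and human_scores[e] are the same examinee's scores on
    AI-generated and human-authored items. Under the null hypothesis the
    sign of each paired difference is arbitrary, so every sign pattern
    is enumerated and compared with the observed absolute sum.
    """
    diffs = [a - h for a, h in zip(ai_scores, human_scores)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point rounding.
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

For larger cohorts, sampling random sign patterns instead of enumerating all of them would be the usual substitute.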
