Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties

By Philipp Linde, Florian Fichter, Markus Dietlein, Ferdinand Sudbrock, Kambiz Afshar, Hendrik Dapper, Emmanouil Fokas, Anna-Lena Hillebrecht, Tobias Raupach, Matthias Carl Laupichler

January 8, 2026

Psychometric Evaluation of GPT-4o-Generated vs Human MCQs in Imaging Disciplines

Overview

This study compared multiple-choice questions (MCQs) generated by GPT-4o with those authored by experts across three imaging specialties. Results showed no significant differences in item difficulty or discriminatory power between AI-generated and human-written items, and participants could not reliably identify item origin beyond chance.

Background

High-quality MCQ banks are essential for medical education but require substantial expert effort and psychometric validation. Large language models (LLMs) such as GPT-4o could accelerate question generation, yet their reliability and validity in clinical assessment remain underexplored. This study examines whether AI-generated items in imaging disciplines are psychometrically equivalent to expert-written items, and whether medical students and physicians can tell the two apart.

Data Highlights

Metric | GPT-4o MCQs | Human MCQs | Statistical Significance
Item Difficulty | Comparable | Comparable | No significant difference
Discriminatory Power | Comparable | Comparable | No significant difference
Origin Identification Accuracy | ~50% (chance) | ~50% (chance) | No significant difference

Participants: 128 total (82 medical students, 46 physicians)
Specialties: Radiation oncology, nuclear medicine, radiology, others

Key Findings

  • GPT-4o–generated MCQs demonstrated psychometric properties (difficulty and discrimination) statistically indistinguishable from expert-authored items.
  • Neither medical students nor physicians could reliably identify whether an item was AI-generated or human-written, with accuracy near chance level (0.50).
  • There were no significant differences in psychometric indices or origin identification between clinicians and students.
  • A human-in-the-loop process was used to screen AI-generated items for factual accuracy before administration.
  • The study was preregistered, blinded, and conducted across three imaging specialties with a balanced participant cohort.
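
One way to check a claim such as "accuracy near chance level (0.50)" is an exact two-sided binomial test against a 50% guessing rate. The sketch below is illustrative only: the summary does not say which test the authors used, and the function name and the counts are hypothetical, not study data.

```python
from math import comb


def binomial_two_sided_p(correct, total, chance=0.5):
    """Exact two-sided binomial p-value against a fixed chance rate.

    Sums the probability of every outcome that is at most as likely
    as the observed number of correct origin judgments.
    """
    pmf = [
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(total + 1)
    ]
    observed = pmf[correct]
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts: 52 correct origin judgments out of 100.
p = binomial_two_sided_p(52, 100)
print(f"p-value against chance: {p:.3f}")
```

A large p-value here would mean the observed accuracy is compatible with guessing; it does not by itself prove that raters cannot detect AI-generated items.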

Clinical Implications

GPT-4o can effectively augment the development of formative MCQ assessments in imaging disciplines without compromising psychometric quality. Its use may reduce faculty workload while maintaining item reliability and learner engagement. However, human oversight remains critical to ensure factual accuracy and appropriateness before deployment.

Conclusion

GPT-4o–generated MCQs are psychometrically comparable to expert-written items, and neither medical students nor physicians could distinguish them from human-written items beyond chance. These results support their potential integration into medical education assessment workflows in imaging specialties.
