Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties - Summary - MDSpire

By Philipp Linde, Florian Fichter, Markus Dietlein, Ferdinand Sudbrock, Kambiz Afshar, Hendrik Dapper, Emmanouil Fokas, Anna-Lena Hillebrecht, Tobias Raupach, Matthias Carl Laupichler

January 8, 2026

Objective:

To compare the psychometric properties of multiple-choice questions (MCQs) generated by GPT-4o with those of MCQs written by human experts in imaging disciplines, and to test whether participants can tell the two kinds of item apart.

Key Findings:
  • Item difficulty and item discrimination did not differ significantly between GPT-4o-generated and human-authored items.
  • Participants identified the origin of an item no better than chance, so AI-generated and human-authored items were effectively indistinguishable.
  • Taken together, these results suggest that AI-generated items can match the psychometric quality of human-authored items, which supports their potential use in assessments.
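
The summary does not say how difficulty, discrimination, or the chance-level comparison were computed. As a rough illustration only, the sketch below uses common classical test theory conventions: difficulty as the proportion of correct answers, discrimination as a corrected item-total correlation, and an exact two-sided binomial test of origin guesses against a 50% chance level. The response matrix, the function names, and the 50% chance level are all assumptions made for the example, not details taken from the study.

```python
from math import comb
from statistics import mean, pstdev

# Hypothetical scored responses: one row per examinee, one column per item
# (1 = correct, 0 = incorrect). Not data from the study.
responses = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]


def item_difficulty(rows, item):
    """Proportion of examinees who answered the item correctly."""
    return mean([row[item] for row in rows])


def item_discrimination(rows, item):
    """Corrected item-total correlation: the item score correlated with
    the total score on all other items (Pearson r)."""
    x = [row[item] for row in rows]
    y = [sum(row) - row[item] for row in rows]
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when either score has no variance
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def binomial_two_sided_p(successes, trials, p0=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed one under chance level p0."""
    pmf = [comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


if __name__ == "__main__":
    for i in range(len(responses[0])):
        print(i, item_difficulty(responses, i), item_discrimination(responses, i))
    # Hypothetical: 27 correct origin guesses out of 50 attempts.
    print(binomial_two_sided_p(27, 50))
```

With per-item values like these in hand, the AI-generated and human-authored item sets could be compared with whatever group test suits the data. A large p-value from the binomial test would be consistent with guessing at chance level, though it would not by itself prove that the two kinds of item are indistinguishable.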
Interpretation:

The findings indicate that GPT-4o can produce MCQs that are psychometrically comparable to those written by experts, which supports integrating AI-generated items into medical education assessments.

Limitations:
  • The study was limited to three imaging specialties, which may affect generalizability; future studies should include a broader range of specialties.
  • A substantial share of the AI-generated MCQs had to be excluded because of errors or unclear wording, which shows that AI-generated items still need expert review before use.
Conclusion:

GPT-4o-generated MCQs can augment assessment in medical education while preserving psychometric quality, but further validation across diverse clinical contexts is necessary to ensure reliability and effectiveness.
