Psychometric properties and detectability of GPT-4o–generated multiple-choice questions compared with human-authored items across imaging specialties

By
Philipp Linde
Florian Fichter
Markus Dietlein
Ferdinand Sudbrock
Kambiz Afshar
Hendrik Dapper
Emmanouil Fokas
Anna-Lena Hillebrecht
Tobias Raupach
Matthias Carl Laupichler
January 8, 2026

npj Digital Medicine

Objective:

To compare the psychometric properties of multiple-choice questions (MCQs) generated by GPT-4o with those of items written by human experts in imaging disciplines, and to test whether participants can tell the two apart, as evidence for the role of AI in medical education assessment.

Key Findings:

Item difficulty and discrimination did not differ significantly between GPT-4o-generated and human-authored items, suggesting comparable quality.
Participants could not identify the origin of an item more reliably than chance, indicating that AI-generated and human-authored items were indistinguishable to them.
Taken together, these results suggest that AI-generated items can match the psychometric quality of human-authored items, supporting their potential use in assessments.
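
The quantities named in these findings can be illustrated with a minimal sketch. It assumes the standard classical test theory definitions (difficulty as the proportion of correct answers, discrimination as the corrected item-total correlation) and a one-sided exact binomial test against a 50% guessing rate. The summary does not state which formulas or tests the authors used, so the function names, the chance rate, and the toy response matrix below are illustrative assumptions, not study data.

```python
from math import comb, sqrt

# Toy response matrix (made up for illustration, NOT study data):
# one row per examinee, one column per item, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]


def item_difficulty(rows, item):
    """Proportion of examinees who answered the item correctly."""
    return sum(row[item] for row in rows) / len(rows)


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def item_discrimination(rows, item):
    """Corrected item-total correlation: item score vs. total score without that item."""
    item_scores = [row[item] for row in rows]
    rest_scores = [sum(row) - row[item] for row in rows]
    return pearson(item_scores, rest_scores)


def binomial_p_at_least(k, n, p=0.5):
    """One-sided exact binomial p-value: P(at least k correct guesses out of n) at chance rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


for item in range(len(responses[0])):
    print(item, item_difficulty(responses, item), item_discrimination(responses, item))

# Hypothetical rater who labels 10 items as AI- or human-written and gets 5 right.
print(binomial_p_at_least(5, 10))
```

In this framing, a p-value from the binomial test that stays well above the chosen significance level is consistent with raters guessing, which is the sense in which items are called indistinguishable above.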

Interpretation:

The findings indicate that GPT-4o can produce MCQs that are psychometrically comparable to those written by experts, supporting the potential integration of AI into medical education assessment.

Limitations:

The study covered only three imaging specialties, which may limit generalizability; future studies should include a broader range of specialties.
A substantial share of AI-generated MCQs had to be excluded because of errors or unclear wording, highlighting the need for improved AI training and oversight.

Conclusion:

GPT-4o-generated MCQs can augment assessment in medical education while preserving psychometric quality, but further validation across diverse clinical contexts is necessary to ensure reliability and effectiveness.
