Clinical Report: Evaluating GPT-4V for Radiologic Image Interpretation
Overview
This study assessed GPT-4V’s diagnostic accuracy across multiple imaging modalities, with and without accompanying clinical context. Performance varied by modality and improved when clinical information was supplied. GPT-4V generated plausible reports and showed some diagnostic capability, but it also missed findings and produced hallucinations.
Background
Large language models, such as those behind ChatGPT, have changed text-based interaction, and multimodal models such as GPT-4V extend these capabilities to image interpretation by integrating computer vision. In radiology, GPT-4V has potential applications ranging from report summarization to clinical decision support. Early evaluations, however, report mixed diagnostic accuracy and reliability, and raise concerns about the model’s probabilistic output and imperfect performance in medical imaging.
Data Highlights
Imaging Modality | Number of Studies | Diagnostic Accuracy (Contextualized) | Diagnostic Accuracy (Uncontextualized)
Radiography | 60 | Variable (improved with context) | Lower accuracy
CT | 60 | Variable (improved with context) | Lower accuracy
MRI | 60 | Variable (improved with context) | Lower accuracy
Angiography | 26 | Variable (improved with context) | Lower accuracy
Key Findings
GPT-4V’s diagnostic accuracy varies significantly across imaging modalities, with performance generally better in CT and MRI than in radiography and angiography.
The model can identify relevant imaging findings and generate plausible differential diagnoses, but sometimes misses obvious abnormalities or misattributes findings (e.g., lesion laterality).
GPT-4V’s self-reported diagnostic confidence correlates with accuracy, but hallucinations and errors remain a concern.
Current evidence is limited by retrospective design, small sample sizes, and potential overlap with training data, highlighting the need for cautious clinical integration.
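As an illustration of how the stratified comparison described above could be tabulated, the following is a minimal Python sketch, not the study's actual analysis. It groups cases by modality and by whether clinical context was supplied, then reports accuracy with a Wilson score interval, which makes the uncertainty of small strata (such as the 26 angiography studies) visible. The field names ('modality', 'contextualized', 'correct') and the example cases are hypothetical and are not taken from the study.

```python
from collections import defaultdict
from math import sqrt


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (about 95% when z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def accuracy_by_stratum(cases):
    """Return accuracy per (modality, contextualized) group.

    Each case is a dict with hypothetical keys:
    'modality' (str), 'contextualized' (bool), 'correct' (bool).
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_correct, n_total]
    for case in cases:
        key = (case["modality"], case["contextualized"])
        counts[key][0] += int(case["correct"])
        counts[key][1] += 1
    return {
        key: {
            "n": total,
            "accuracy": correct / total,
            "ci95": wilson_interval(correct, total),
        }
        for key, (correct, total) in counts.items()
    }


# Made-up illustrative cases, NOT data from the study.
example_cases = [
    {"modality": "CT", "contextualized": True, "correct": True},
    {"modality": "CT", "contextualized": True, "correct": False},
    {"modality": "CT", "contextualized": False, "correct": False},
    {"modality": "CT", "contextualized": False, "correct": False},
]
summary = accuracy_by_stratum(example_cases)
```

With real case-level results in place of the made-up list, wide intervals in a given stratum would signal that an apparent difference between modalities or context conditions may not be reliable at that sample size.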
Clinical Implications
Clinicians should view GPT-4V as a supplementary tool rather than a definitive diagnostic resource, especially given its variable accuracy and potential for hallucinations. Incorporating clinical context enhances its performance, suggesting that integration with patient data is critical. Ongoing validation and cautious use are essential before routine clinical deployment.
Conclusion
GPT-4V shows promise in radiologic image interpretation with improved accuracy when clinical context is provided, but limitations in diagnostic reliability and hallucination risk necessitate further rigorous evaluation before clinical adoption.