Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation

By Marc Sebastian Huppertz, Robert Siepmann, David Topp, Omid Nikoubashman, Can Yüksel, Christiane Katharina Kuhl, Daniel Truhn, and Sven Nebelung

October 18, 2024

Clinical Report: Evaluating GPT-4V for Radiologic Image Interpretation

Overview

This study assessed GPT-4V’s diagnostic accuracy across four imaging modalities (radiography, CT, MRI, and angiography), both with and without accompanying clinical context. Performance varied by modality and by whether clinical information was supplied. Although GPT-4V generated plausible reports and showed some diagnostic capability, it also missed findings and produced hallucinations.

Background

Large language models such as those behind ChatGPT have revolutionized text-based interactions, and multimodal models such as GPT-4V extend these capabilities to image interpretation by integrating computer vision. In radiology, GPT-4V has potential applications ranging from report summarization to clinical decision support. However, early evaluations show mixed results for its diagnostic accuracy and reliability, and they raise concerns about the probabilistic nature of its output and its imperfect performance in medical imaging.

Data Highlights

Imaging Modality | Number of Studies | Diagnostic Accuracy (Contextualized) | Diagnostic Accuracy (Uncontextualized)
Radiography | 60 | Variable (improved with context) | Lower accuracy
CT | 60 | Variable (improved with context) | Lower accuracy
MRI | 60 | Variable (improved with context) | Lower accuracy
Angiography | 26 | Variable (improved with context) | Lower accuracy

Key Findings

  • GPT-4V’s diagnostic accuracy varies significantly across imaging modalities, with performance generally better in CT and MRI than in radiography and angiography.
  • Providing clinical context (e.g., patient age, sex, chief complaint) improves GPT-4V’s diagnostic accuracy compared to uncontextualized image interpretation.
  • The model can identify relevant imaging findings and generate plausible differential diagnoses, but sometimes misses obvious abnormalities or misattributes findings (e.g., lesion laterality).
  • GPT-4V’s self-reported diagnostic confidence correlates with accuracy, but hallucinations and errors remain a concern.
  • Current evidence is limited by retrospective design, small sample sizes, and potential overlap with training data, highlighting the need for cautious clinical integration.
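
The comparison between contextualized and uncontextualized readings could be tabulated along the following lines. This is a minimal sketch, not the authors' analysis code: the record layout, the function name `accuracy_by_group`, and the example readings are all hypothetical and chosen only to illustrate the bookkeeping.

```python
from collections import defaultdict


def accuracy_by_group(readings):
    """Return {(modality, contextualized): fraction of correct readings}.

    Each reading is a dict with hypothetical keys:
      "modality"       - e.g. "CT", "MRI"
      "contextualized" - True if clinical context was supplied
      "correct"        - True if the model's diagnosis matched the reference
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for reading in readings:
        key = (reading["modality"], reading["contextualized"])
        total[key] += 1
        correct[key] += int(reading["correct"])
    return {key: correct[key] / total[key] for key in total}


# Hypothetical readings for illustration only; NOT data from the study.
example_readings = [
    {"modality": "CT", "contextualized": True, "correct": True},
    {"modality": "CT", "contextualized": True, "correct": False},
    {"modality": "CT", "contextualized": False, "correct": False},
    {"modality": "CT", "contextualized": False, "correct": False},
    {"modality": "MRI", "contextualized": True, "correct": True},
    {"modality": "MRI", "contextualized": False, "correct": True},
]

if __name__ == "__main__":
    results = accuracy_by_group(example_readings)
    for (modality, contextualized), accuracy in sorted(results.items()):
        label = "with context" if contextualized else "without context"
        print(f"{modality} {label}: {accuracy:.0%}")
```

Grouping on the pair (modality, context) keeps the two conditions for each modality side by side, which is the comparison the findings above describe.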

Clinical Implications

Clinicians should view GPT-4V as a supplementary tool rather than a definitive diagnostic resource, especially given its variable accuracy and potential for hallucinations. Incorporating clinical context enhances its performance, suggesting that integration with patient data is critical. Ongoing validation and cautious use are essential before routine clinical deployment.

Conclusion

GPT-4V shows promise in radiologic image interpretation with improved accuracy when clinical context is provided, but limitations in diagnostic reliability and hallucination risk necessitate further rigorous evaluation before clinical adoption.
