Retrieval-augmented generation improves precision and trust of a GPT-4 model for emergency radiology diagnosis and classification: a proof-of-concept study

By
Anna Fink
Johanna Nattenmüller
Stephan Rau
Alexander Rau
Hien Tran
Fabian Bamberg
Marco Reisert
Elmar Kotter
Thierno Diallo
Maximilian F. Russe
February 14, 2025

European Radiology

Overview

This proof-of-concept study demonstrates that augmenting GPT-4 Turbo with retrieval-augmented generation (RAG) significantly improves its accuracy and reliability in diagnosing and classifying traumatic injuries from radiology reports. By integrating a curated trauma radiology knowledge base, the enhanced model, TraumaCB, better handles complex classification tasks across diverse injury types and imaging modalities.

Background

Trauma radiology faces increasing demands due to faster imaging techniques and the complexity of injury classification systems, which are critical for guiding treatment decisions. Large language models like GPT-4 Turbo offer potential support by summarizing and interpreting radiologic data, but their performance is limited by training data scope and potential hallucinations. Retrieval-augmented generation (RAG) introduces task-specific expert knowledge into prompts, potentially improving diagnostic precision and accountability. This study evaluates the impact of RAG on GPT-4 Turbo’s ability to classify traumatic injuries using synthetic radiology reports.

Data Highlights

Two experienced radiologists independently created 100 synthetic radiology reports representing 50 traumatic diagnoses, covering various imaging modalities (radiography, CT, MRI) and anatomical regions. A curated knowledge base from 70 peer-reviewed trauma radiology articles was indexed using embedding vectors to provide targeted context. The TraumaCB chatbot used a two-step prompting approach to first diagnose and then classify injuries with grading, leveraging the indexed expert knowledge.
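The retrieval step described above can be illustrated with a minimal sketch. It is not the study's implementation: the summary does not say which embedding model or index settings were used, so a toy bag-of-words vector stands in for a real embedding model, and the three knowledge passages are invented placeholders rather than text from the curated articles. The names `embed`, `build_index`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Pair every knowledge passage with its vector once, up front."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, top_k=2):
    """Return the top_k passages most similar to the query text."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


# Invented placeholder passages, NOT taken from the study's knowledge base.
KNOWLEDGE_CHUNKS = [
    "Splenic injury grading depends on laceration depth and hematoma size on CT.",
    "Ankle fracture classification considers the level of the fibular fracture.",
    "Cervical spine injury classification considers morphology and ligament integrity.",
]

index = build_index(KNOWLEDGE_CHUNKS)
report = "CT abdomen: spleen laceration with subcapsular hematoma."
context = retrieve(index, report, top_k=1)
print(context)
```

In a real system the retrieved passages would be placed into the model prompt alongside the report, so that the answer can be traced back to the passages that informed it.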

Key Findings

GPT-4 Turbo’s diagnostic accuracy improved when the model was augmented with RAG over a trauma-specific knowledge base.
The TraumaCB model effectively handled variations in report phrasing and terminology introduced by different radiologists.
The two-step prompting approach, which mimics the clinical workflow of diagnosing first and classifying second, enhanced classification and grading precision.
Incorporation of the RadioGraphics top ten reading list enabled the chatbot to select appropriate classification systems and provide expert explanations.
RAG reduced hallucinations and increased transparency by grounding responses in curated, updatable external knowledge.
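The two-step prompting approach mentioned in the findings can be sketched as follows. This is an illustration under stated assumptions, not TraumaCB's actual code: the study's prompts are not reproduced in this summary, so the prompt wording is invented, and `fake_retrieve` and `fake_llm` are hypothetical stubs standing in for the knowledge-base lookup and the GPT-4 Turbo call.

```python
from typing import Callable, Dict, List


def build_prompt(task: str, report: str, context: List[str]) -> str:
    """Assemble a grounded prompt: task, retrieved passages, then the report."""
    context_block = "\n".join(f"- {c}" for c in context) or "- (no context retrieved)"
    return (
        f"Task: {task}\n"
        f"Reference passages:\n{context_block}\n"
        f"Radiology report:\n{report}\n"
        "Answer using only the reference passages and the report."
    )


def two_step(
    report: str,
    retrieve: Callable[[str], List[str]],
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Step 1 asks for a diagnosis; step 2 classifies and grades that diagnosis."""
    # Step 1: retrieve context for the raw report and ask for the diagnosis.
    diagnosis = llm(
        build_prompt("State the most likely traumatic diagnosis.", report, retrieve(report))
    )
    # Step 2: retrieve again, now conditioned on the diagnosis, and ask for a grade.
    classification = llm(
        build_prompt(
            f"Name a suitable classification system for '{diagnosis}' and assign a grade.",
            report,
            retrieve(f"{diagnosis} classification grading"),
        )
    )
    return {"diagnosis": diagnosis, "classification": classification}


# Hypothetical stubs so the sketch runs without a model or a knowledge base.
def fake_retrieve(query: str) -> List[str]:
    return [f"(placeholder passage for: {query})"]


def fake_llm(prompt: str) -> str:
    task_line = prompt.splitlines()[0]
    if "classification system" in task_line:
        return "classification placeholder"
    return "diagnosis placeholder"


result = two_step("Hypothetical report text.", fake_retrieve, fake_llm)
print(result)
```

Running retrieval a second time, conditioned on the diagnosis from step 1, is one plausible way to surface the passages that describe the matching classification system. Whether the study retrieved once or twice is not stated in this summary.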

Clinical Implications

Integrating retrieval-augmented generation with GPT-4 Turbo can support radiologists by improving the accuracy and reliability of trauma diagnosis and classification from imaging reports. This approach may help manage the increasing workload and complexity in trauma radiology by providing expert-guided, context-aware decision support. Anyone adopting such AI tools should plan to update the knowledge base continuously so that it stays clinically relevant and accountable.

Conclusion

Augmenting GPT-4 Turbo with retrieval-augmented generation and a curated trauma radiology knowledge base enhances its diagnostic and classification capabilities, offering a promising tool to assist radiologists in trauma care. Further validation in clinical settings is warranted to confirm these findings.

References

OpenAI 2023 -- GPT-4 Turbo Model
RadioGraphics Top Ten Reading List for Trauma Radiology
LlamaIndex Framework v0.10.6


Original Source(s)

Retrieval-augmented generation improves precision and trust of a GPT-4 model for emergency radiology diagnosis and classification: a proof-of-concept study
