Utilizing Large Language Models to Extract OPS Codes from Meningioma Surgical Reports
Overview
This study evaluated the accuracy of GPT-based large language models (LLMs) in extracting OPS procedure codes from 100 meningioma surgical reports. Professional coders achieved the highest optimal coding accuracy (94%), but the medically fine-tuned GPT CodeMedic (34%) outperformed both the general GPT-4o model (24%) and surgeons (31%) in optimal coding accuracy.
Background
In the German hospital system, surgical procedures are coded using OPS (Operationen- und Prozedurenschlüssel) codes, which feed into Diagnosis-Related Groups (DRGs) for revenue accounting. Accurate coding is critical because errors can lead to financial penalties. Traditionally, surgeons assign initial codes, which are then reviewed by professional coders. Recent advances in artificial intelligence, particularly large language models, have shown promise in automating medical coding tasks, but their performance in OPS coding for neurosurgical procedures has not previously been studied.
Data Highlights
| Group | Sufficient Coding (%) | Optimal Coding (%) |
|---|---|---|
| Surgeons | 99-100 | 31 |
| Professional Coders | 99-100 | 94 |
| GPT-4o | 78 | 24 |
| GPT CodeMedic | 89 | 34 |
Key Findings
Professional coders achieved the highest optimal coding accuracy at 94%.
Surgeons had high sufficient coding rates (99-100%) but low optimal coding accuracy (31%).
GPT CodeMedic outperformed GPT-4o by 11 percentage points in sufficient coding (89% vs. 78%) and by 10 percentage points in optimal coding (34% vs. 24%).
GPT CodeMedic was significantly superior to surgeons in optimal coding (p = 0.03).
Both LLMs performed significantly worse than professional coders in sufficient and optimal coding (p < 0.01).
There was no significant difference between surgeons and GPT-4o in sufficient coding (p = 0.88).
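The percentage-point gaps behind these findings can be recomputed from the Data Highlights figures. The Python sketch below is illustrative only and is not part of the study: the names RESULTS and point_gap are hypothetical, and the 99-100% range reported for surgeons and professional coders is stored as a (low, high) pair, which is an assumption about how to represent it. It does not reproduce the reported p-values, since the statistical test used is not stated here.

```python
# Percentages from the Data Highlights table (100 reports per group).
# Each value is a (low, high) pair so that the reported 99-100% range
# can be represented; single values use the same number twice.
RESULTS = {
    "Surgeons": {"sufficient": (99, 100), "optimal": (31, 31)},
    "Professional Coders": {"sufficient": (99, 100), "optimal": (94, 94)},
    "GPT-4o": {"sufficient": (78, 78), "optimal": (24, 24)},
    "GPT CodeMedic": {"sufficient": (89, 89), "optimal": (34, 34)},
}


def point_gap(group_a, group_b, category):
    """Return the (min, max) percentage-point lead of group_a over group_b."""
    a_low, a_high = RESULTS[group_a][category]
    b_low, b_high = RESULTS[group_b][category]
    # Smallest possible lead: a's low end against b's high end, and vice versa.
    return (a_low - b_high, a_high - b_low)


if __name__ == "__main__":
    for category in ("sufficient", "optimal"):
        gap = point_gap("GPT CodeMedic", "GPT-4o", category)
        print(f"GPT CodeMedic vs GPT-4o, {category}: {gap} points")
```

Under these assumptions, point_gap("GPT CodeMedic", "GPT-4o", "sufficient") returns (11, 11) and the same call for "optimal" returns (10, 10), while professional coders lead GPT CodeMedic by 60 points in optimal coding.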
Clinical Implications
Medically fine-tuned LLMs like GPT CodeMedic demonstrate promising capabilities in extracting accurate OPS codes from neurosurgical reports, potentially supporting clinical coding workflows. However, professional coders currently maintain superior accuracy, underscoring the need for human oversight when integrating AI tools. Continued refinement and validation of LLMs could enhance coding efficiency and reduce administrative burden in surgical departments.
Conclusion
This study is the first to assess GPT-based LLMs for OPS coding in meningioma surgery. It shows that specialized models can approach, and in optimal coding surpass, surgeon coding accuracy, but still lag behind professional coders. These findings support further development of AI-assisted coding to improve hospital revenue processes.