Large language models for extraction of OPS-codes from operative reports in meningioma surgery

By
Sebastian Lehmann
Florian Wilhelmy
Nikolaus von Dercks
Erdem Güresir
Johannes Wach
July 31, 2025

Acta Neurochirurgica

Overview

This study evaluated the accuracy of GPT-based large language models (LLMs) in extracting OPS procedure codes from 100 meningioma surgical reports. Professional coders achieved the highest accuracy, but the medically fine-tuned GPT CodeMedic outperformed both the general GPT-4o model and surgeons in optimal coding accuracy.

Background

In the German hospital system, surgical procedures are coded using OPS (Operationen- und Prozedurenschlüssel) codes, which feed into Diagnosis Related Groups (DRGs) for revenue accounting. Accurate coding is critical because errors can lead to financial penalties. Traditionally, surgeons assign the initial codes, which professional coders then review. Recent advances in artificial intelligence, particularly large language models, have shown promise in automating medical coding tasks, but their performance in OPS coding for neurosurgical procedures had not previously been studied.

Data Highlights

Group	Sufficient Coding (%)	Optimal Coding (%)
Surgeons	99-100	31
Professional Coders	99-100	94
GPT-4o	78	24
GPT CodeMedic	89	34

Key Findings

Professional coders achieved the highest optimal coding accuracy at 94%.
Surgeons had high sufficient coding rates (99-100%) but low optimal coding accuracy (31%).
GPT CodeMedic outperformed GPT-4o by 11 percentage points in sufficient coding (89% vs. 78%) and by 10 percentage points in optimal coding (34% vs. 24%).
GPT CodeMedic was significantly superior to surgeons in optimal coding (p = 0.03).
Both LLMs performed significantly worse than professional coders in sufficient and optimal coding (p < 0.01).
There was no significant difference between surgeons and GPT-4o in sufficient coding (p = 0.88).
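
The gap between the two LLMs can be recomputed directly from the Data Highlights table. The following is a minimal Python sketch, assuming the table values are percentages of the same 100 reports; the names `RATES` and `point_gap` are illustrative and do not come from the study, and the sketch does not reproduce the study's significance tests.

```python
# Rates (%) copied from the Data Highlights table. The sufficient-coding rate
# for surgeons and professional coders is reported as a range (99-100), so it
# is stored as a tuple rather than a single number.
RATES = {
    "Surgeons": {"sufficient": (99, 100), "optimal": 31},
    "Professional Coders": {"sufficient": (99, 100), "optimal": 94},
    "GPT-4o": {"sufficient": 78, "optimal": 24},
    "GPT CodeMedic": {"sufficient": 89, "optimal": 34},
}


def point_gap(group_a, group_b, category):
    """Return rate(group_a) - rate(group_b) in percentage points."""
    a = RATES[group_a][category]
    b = RATES[group_b][category]
    if isinstance(a, tuple) or isinstance(b, tuple):
        # A range has no single difference; refuse rather than guess.
        raise ValueError("rate is reported as a range, not a single value")
    return a - b


if __name__ == "__main__":
    for category in ("sufficient", "optimal"):
        gap = point_gap("GPT CodeMedic", "GPT-4o", category)
        print(f"GPT CodeMedic vs GPT-4o, {category}: {gap:+d} points")
```

Expressing the gaps in percentage points rather than percent avoids ambiguity: a rise from 24% to 34% is 10 points, but roughly a 42% relative increase.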

Clinical Implications

Medically fine-tuned LLMs like GPT CodeMedic demonstrate promising capabilities in extracting accurate OPS codes from neurosurgical reports, potentially supporting clinical coding workflows. However, professional coders currently maintain superior accuracy, underscoring the need for human oversight when integrating AI tools. Continued refinement and validation of LLMs could enhance coding efficiency and reduce administrative burden in surgical departments.

Conclusion

This study is the first to assess GPT-based LLMs for OPS coding in meningioma surgery. It shows that a specialized model can surpass surgeons in optimal coding accuracy, although both LLMs still lag behind professional coders. These findings support further development of AI-assisted coding to improve hospital revenue processes.

Original Source(s)

Large language models for extraction of OPS-codes from operative reports in meningioma surgery
