Large language models for extraction of OPS-codes from operative reports in meningioma surgery

By Sebastian Lehmann, Florian Wilhelmy, Nikolaus von Dercks, Erdem Güresir, Johannes Wach

July 31, 2025


Utilizing Large Language Models to Extract OPS Codes from Meningioma Surgical Reports

Overview

This study evaluated how accurately GPT-based large language models (LLMs) extract OPS procedure codes from 100 meningioma surgical reports. Professional coders achieved the highest optimal coding accuracy (94%). The medically fine-tuned GPT CodeMedic (34%) outperformed both the general GPT-4o model (24%) and surgeons (31%) in optimal coding accuracy.

Background

In the German hospital system, surgical procedures are coded using OPS (Operationen- und Prozedurenschlüssel) codes, which feed into the Diagnosis Related Groups (DRGs) used for revenue accounting. Accurate coding is critical because errors can lead to financial penalties. Traditionally, surgeons assign the initial codes, which professional coders then review. Recent advances in artificial intelligence, particularly large language models, have shown promise in automating medical coding tasks, but their performance in OPS coding for neurosurgical procedures had not previously been studied.

Data Highlights

Group                 Sufficient Coding (%)   Optimal Coding (%)
Surgeons              99-100                  31
Professional Coders   99-100                  94
GPT-4o                78                      24
GPT CodeMedic         89                      34

Key Findings

  • Professional coders achieved the highest optimal coding accuracy at 94%.
  • Surgeons had high sufficient coding rates (99-100%) but low optimal coding accuracy (31%).
  • GPT CodeMedic outperformed GPT-4o by 11 percentage points in sufficient coding (89% vs. 78%) and by 10 percentage points in optimal coding (34% vs. 24%).
  • GPT CodeMedic was significantly superior to surgeons in optimal coding (p = 0.03).
  • Both LLMs performed significantly worse than professional coders in sufficient and optimal coding (p < 0.01).
  • There was no significant difference between surgeons and GPT-4o in sufficient coding (p = 0.88).
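
As a quick arithmetic check, the gaps between groups can be recomputed from the Data Highlights figures. The sketch below is illustrative only: the dictionary layout and the helper names `point_gap` and `relative_gain` are assumptions made here, not part of the study. Rates reported as a range (99-100) are left out rather than collapsed to a single value.

```python
# Accuracy figures copied from the Data Highlights table (percent of 100 reports).
# Sufficient-coding rates given as a range (99-100) are deliberately omitted.
RATES = {
    "GPT-4o": {"sufficient": 78, "optimal": 24},
    "GPT CodeMedic": {"sufficient": 89, "optimal": 34},
    "Professional Coders": {"optimal": 94},
    "Surgeons": {"optimal": 31},
}


def point_gap(group_a, group_b, category):
    """Return group_a minus group_b, in percentage points, for one category."""
    return RATES[group_a][category] - RATES[group_b][category]


def relative_gain(group_a, group_b, category):
    """Return the gap as a percentage of group_b's rate (relative change)."""
    base = RATES[group_b][category]
    return 100 * (RATES[group_a][category] - base) / base


if __name__ == "__main__":
    for category in ("sufficient", "optimal"):
        gap = point_gap("GPT CodeMedic", "GPT-4o", category)
        rel = relative_gain("GPT CodeMedic", "GPT-4o", category)
        print(f"{category}: {gap} percentage points ({rel:.1f}% relative)")
```

Keeping percentage points and relative change as separate functions avoids the ambiguity of a bare "10%": a 10-point gap on a 24% baseline is a much larger relative change.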

Clinical Implications

Medically fine-tuned LLMs like GPT CodeMedic demonstrate promising capabilities in extracting accurate OPS codes from neurosurgical reports, potentially supporting clinical coding workflows. However, professional coders currently maintain superior accuracy, underscoring the need for human oversight when integrating AI tools. Continued refinement and validation of LLMs could enhance coding efficiency and reduce administrative burden in surgical departments.

Conclusion

This study is the first to assess GPT-based LLMs for OPS coding in meningioma surgery, showing that specialized models can approach and in some aspects surpass surgeon coding accuracy but still lag behind professional coders. These findings support further development of AI-assisted coding to improve hospital revenue processes.

