Improving disease misclassification and prevalence estimates by linking primary and secondary care electronic health records: an illustration from arthritis research

By Belay Birlie Yimer, Fangyuan Zhang, Jenny Humphreys, Mark Lunt, Meghna Jani, John McBeth, William G Dixon

September 17, 2025

Enhancing Disease Classification Accuracy by Linking Primary and Secondary Care EHRs in Psoriatic Arthritis

Overview

This study showed that linking primary care electronic health records (EHRs) with text-mined secondary care outpatient letters improves the accuracy of psoriatic arthritis (PsA) prevalence estimates. Primary care data alone gave a prevalence of 0.13%, roughly half the 0.25% obtained after correcting for misclassification. The shortfall was driven mainly by false negatives: patients with a hospital diagnosis of PsA but no primary care code, who could be identified only through linkage to secondary care data.

Background

Accurate disease classification in routinely collected EHR data is essential for reliable prevalence estimates and research. Primary care databases often rely on coded diagnoses, which may contain false positives and fail to capture all true cases, leading to misclassification. Secondary care outpatient diagnoses are typically recorded as unstructured free text, limiting their use in validation. Advances in natural language processing enable extraction of diagnostic information from these texts, offering an opportunity to improve case identification and correct prevalence estimates.
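To make the terms concrete, the following minimal sketch shows how linked records can be cross-classified once diagnoses have been extracted from the letters. The function name, the patient identifiers and the choice to treat the hospital diagnosis as the reference standard are all illustrative assumptions, not details taken from the study.

```python
def cross_classify(attendees: set, primary_coded: set, hospital_diagnosed: set) -> dict:
    """Cross-classify patients by primary care code and hospital diagnosis.

    Assumption: the text-mined hospital diagnosis is the reference standard,
    so only patients seen in the hospital clinic can be validated.
    """
    coded = primary_coded & attendees          # coded patients who can be validated
    diagnosed = hospital_diagnosed & attendees
    return {
        "true_positive": coded & diagnosed,    # code and hospital diagnosis agree
        "false_positive": coded - diagnosed,   # code not supported by the letters
        "false_negative": diagnosed - coded,   # hospital diagnosis, no code
        "unverified": primary_coded - attendees,  # coded, never seen in clinic
    }


# Hypothetical patient identifiers, for illustration only.
attendees = {"p1", "p2", "p3", "p4"}
primary_coded = {"p1", "p2", "p9"}
hospital_diagnosed = {"p1", "p3"}

result = cross_classify(attendees, primary_coded, hospital_diagnosed)
for label, patients in result.items():
    print(label, sorted(patients))
```

Patients with a primary care code who never attended the hospital clinic are kept in a separate "unverified" group, because the linked letters can neither confirm nor refute their diagnosis.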

Data Highlights

Metric | Value | 95% CI
Primary care PsA cases identified | 245 |
Primary care population | 188,286 adults |
Observed PsA prevalence (primary care only) | 0.13% | 0.11% - 0.15%
Subgroup attending hospital rheumatology clinic | 7,532 patients |
Primary care PsA codes in subgroup | 202 |
True positives confirmed in subgroup | 188 |
False positives in subgroup | 14 |
False negatives (hospital-diagnosed, no primary care code) | 196 |
Adjusted PsA prevalence (corrected for misclassification) | 0.25% | 0.21% - 0.28%
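
The observed prevalence and its interval can be reproduced from the two counts in the table. The sketch below uses the Wilson score interval, which is an assumption: the summary does not say which interval method the authors used, and the function name is hypothetical.

```python
import math


def prevalence_with_ci(cases: int, population: int, z: float = 1.96):
    """Return (prevalence, lower, upper) using the Wilson score interval."""
    p = cases / population
    denom = 1 + z**2 / population
    centre = (p + z**2 / (2 * population)) / denom
    half_width = (z / denom) * math.sqrt(
        p * (1 - p) / population + z**2 / (4 * population**2)
    )
    return p, centre - half_width, centre + half_width


# Counts from the table: 245 coded PsA cases among 188,286 adults.
p, lo, hi = prevalence_with_ci(245, 188_286)
print(f"{p:.2%} ({lo:.2%} - {hi:.2%})")  # prints "0.13% (0.11% - 0.15%)"
```

With these inputs the result agrees with the reported 0.13% (0.11% - 0.15%) to two decimal places.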

Key Findings

  • Primary care EHR data alone identified 245 PsA cases among 188,286 adults, yielding an observed prevalence of 0.13%.
  • In a subgroup of 7,532 patients attending hospital rheumatology clinics, 202 had a primary care PsA code; 14 of these were false positives upon validation.
  • Primary care codes missed 196 hospital-diagnosed PsA cases (false negatives), indicating substantial under-ascertainment.
  • Linkage with text-mined secondary care outpatient letters enabled identification of both false positives and false negatives.
  • Adjusting for misclassification using linked data nearly doubled the estimated PsA prevalence, from 0.13% to 0.25% (95% CI 0.21% - 0.28%).
  • Text mining of outpatient letters compensates for the lack of coded secondary care diagnoses in national datasets.
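
The counts above are enough to work through the arithmetic of a misclassification correction. The sketch below scales the observed prevalence by the positive predictive value (PPV) divided by the sensitivity. This is one common correction and is assumed here for illustration; the summary does not state the authors' exact adjustment method. It also assumes that the PPV and sensitivity measured in the rheumatology clinic subgroup apply to the whole primary care population.

```python
# Counts reported in the summary.
primary_care_cases = 245
population = 188_286
true_pos = 188    # primary care code confirmed by hospital letters
false_pos = 14    # primary care code not confirmed
false_neg = 196   # hospital-diagnosed, no primary care code

# Accuracy of the primary care code within the validated subgroup.
ppv = true_pos / (true_pos + false_pos)
sensitivity = true_pos / (true_pos + false_neg)

observed = primary_care_cases / population

# Assumed correction: remove the expected false positives (multiply by PPV),
# then scale up for missed cases (divide by sensitivity).
adjusted = observed * ppv / sensitivity

print(f"PPV: {ppv:.1%}")                  # prints "PPV: 93.1%"
print(f"Sensitivity: {sensitivity:.1%}")  # prints "Sensitivity: 49.0%"
print(f"Observed: {observed:.2%}")        # prints "Observed: 0.13%"
print(f"Adjusted: {adjusted:.2%}")        # prints "Adjusted: 0.25%"
```

Under these assumptions the corrected figure matches the reported 0.25%. The sensitivity of about 49% shows why false negatives dominate: the primary care code captured roughly half of the hospital-diagnosed cases in the subgroup.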

Clinical Implications

Clinicians and researchers should be aware that relying solely on primary care coded data may substantially underestimate disease prevalence due to false negatives. Integrating secondary care data, especially through text mining of outpatient letters, can improve case ascertainment and provide more accurate epidemiological estimates. This approach supports better disease surveillance and resource allocation in clinical practice.

Conclusion

Linking primary care EHRs with text-mined secondary care outpatient letters improves the accuracy of disease classification and prevalence estimates for psoriatic arthritis. The approach addresses the limitations of primary care coding, particularly missed diagnoses, and highlights the value of data integration in epidemiological research.
