Objective:
To automate the annotation of clinical severity for phenotypic abnormalities in the Human Phenotype Ontology (HPO) using GPT-4, enhancing the efficiency of clinical metadata curation.
Key Findings:
GPT-4 achieved true positive recall rates between 89% and 100% (mean = 97%) in severity annotation, indicating that it recovers nearly all of the annotations expected from expert-level curation.
The severity scoring system integrates both the type and frequency of clinical characteristics, providing a comprehensive metric for assessing phenotypic severity.
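As a rough illustration of how a score might combine the type and the frequency of clinical characteristics, the sketch below takes a weighted sum over characteristics and normalizes it to a 0-100 range. The characteristic names, the weights, the frequency labels, and the `severity_score` function are all illustrative assumptions, not the scheme used in the study.

```python
from typing import Dict

# Hypothetical weights: how much each TYPE of clinical characteristic
# contributes to severity. These values are assumptions for illustration.
CHARACTERISTIC_WEIGHTS: Dict[str, float] = {
    "death": 5.0,
    "intellectual_disability": 4.0,
    "impaired_mobility": 3.0,
    "reduced_fertility": 1.0,
}

# Hypothetical mapping from a FREQUENCY label to a numeric value.
FREQUENCY_VALUES: Dict[str, int] = {"never": 0, "rarely": 1, "often": 2, "always": 3}


def severity_score(annotations: Dict[str, str]) -> float:
    """Return a 0-100 score from {characteristic: frequency label}.

    Characteristics missing from `annotations` contribute nothing,
    which is equivalent to treating them as "never".
    """
    max_freq = max(FREQUENCY_VALUES.values())
    # Weighted sum: type weight multiplied by frequency value.
    raw = sum(
        CHARACTERISTIC_WEIGHTS[name] * FREQUENCY_VALUES[freq]
        for name, freq in annotations.items()
    )
    # Highest possible raw score: every characteristic at maximum frequency.
    max_raw = sum(CHARACTERISTIC_WEIGHTS.values()) * max_freq
    return 100.0 * raw / max_raw


# Made-up annotations for a single phenotype.
example = {
    "death": "often",
    "intellectual_disability": "always",
    "impaired_mobility": "rarely",
    "reduced_fertility": "never",
}
print(round(severity_score(example), 1))
```

Normalizing by the maximum attainable score keeps phenotypes comparable on one scale, so they can be sorted directly when ranking by severity.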
Interpretation:
The study demonstrates that LLMs such as GPT-4 can automate much of the curation of clinical metadata, reducing the need for manual expert annotation and improving curation efficiency.
Limitations:
The study focused primarily on phenotypic severity and did not explore other clinical dimensions, so the findings may not generalize beyond severity annotation.
The annotations still need to be validated against independent clinical datasets to confirm their robustness and reliability.
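One way such a validation could be set up is to compare the model's annotations with an independent reference set and compute recall, that is, the fraction of reference-positive annotations that the model also marks as positive. The data layout, the placeholder term names, and the `recall` helper below are hypothetical; this is a minimal sketch under those assumptions, not the study's evaluation code.

```python
from typing import Set, Tuple

# An annotation is a (term, clinical characteristic) pair.
Annotation = Tuple[str, str]


def recall(reference_positive: Set[Annotation], model_positive: Set[Annotation]) -> float:
    """Fraction of reference positives that the model also labelled positive."""
    if not reference_positive:
        raise ValueError("reference set has no positive annotations")
    true_positives = reference_positive & model_positive
    return len(true_positives) / len(reference_positive)


# Made-up data with placeholder term names (not real HPO annotations).
reference = {
    ("TERM_A", "death"),
    ("TERM_B", "blindness"),
    ("TERM_C", "impaired_mobility"),
    ("TERM_D", "death"),
}
model = {
    ("TERM_A", "death"),
    ("TERM_B", "blindness"),
    ("TERM_C", "impaired_mobility"),
    ("TERM_E", "death"),  # extra model positive; does not affect recall
}

print(recall(reference, model))
```

Recall alone does not penalize spurious positives such as the extra model annotation above, so a fuller validation would likely report it alongside a measure such as precision.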
Conclusion:
The findings provide a foundation for systematically ranking human phenotypes by their health impact, aiding in therapeutic prioritization for rare diseases and enhancing clinical decision-making.