For Doctors in a Hurry
- Clinicians require automated tools to accurately classify pediatric chest radiograph reports for community-acquired pneumonia.
- The researchers analyzed 1,000 pediatric emergency department encounters using five open-source large language models.
- The Gemma2 9B model achieved a pneumonia F1 score of 0.82 in three-class classification tasks.
- The authors concluded that these models classify pediatric pneumonia reports accurately when measured against human adjudication.
- Integrating these models into clinical workflows may enhance radiographic interpretation and support pediatric emergency care quality.
Automated Interpretation of Pediatric Radiographic Findings
Pediatric community-acquired pneumonia remains a leading cause of morbidity worldwide, requiring precise diagnostic pathways to ensure timely treatment and effective antimicrobial stewardship [1, 2]. While chest radiographs are a standard diagnostic tool, the interpretation of free-text reports is often complicated by variable terminology and subjective clinician adjudication [3, 4]. Recent efforts to improve diagnostic accuracy have explored deep learning (a subset of artificial intelligence that mimics human neural networks to process complex data) to assist in identifying bacterial infections and differentiating them from other respiratory pathologies [4]. Implementing these automated systems in clinical practice could optimize patient outcomes while minimizing the unintended consequences of inappropriate antibiotic use [5]. A new study evaluates how open-source large language models (computational systems trained on vast datasets to understand and generate human-like text) handle the complexities of pediatric imaging documentation.
Retrospective Analysis of Emergency Department Encounters
The researchers conducted a retrospective single-center study to evaluate the efficacy of automated classification systems within a high-volume clinical environment. The dataset comprised 1,000 pediatric emergency department encounters recorded between 2016 and 2022, all of which included at least one chest radiograph. This longitudinal timeframe allowed the authors to capture a wide variety of reporting styles and clinical presentations, ensuring the models were tested against the natural linguistic variability found in electronic health records. By focusing on real-world emergency department documentation, the study aimed to address the practical challenges of identifying community-acquired pneumonia in a setting where rapid diagnostic turnaround is essential for patient management. The study population reflected a broad pediatric demographic, with a median patient age of 4.2 years and an interquartile range (the middle 50 percent of the data distribution, representing the spread from the 25th to the 75th percentile) of 1.7 to 10.5 years. Clinical severity within the cohort was notable, as 54.4 percent of the patients were admitted to the hospital from the emergency department following their initial evaluation. This high admission rate suggests that the large language models were evaluated on a population with a significant burden of acute respiratory illness, rather than mild or incidental findings. To establish a ground truth for model comparison, two physicians adjudicated each report, classifying each as positive, negative, or indeterminate for pneumonia. This rigorous human review process served as the gold standard for measuring the accuracy of the subsequent machine learning analyses.
Physician Adjudication and Model Selection
To establish a reliable benchmark for the automated system, the study used a gold standard of physician adjudication. Two clinicians independently reviewed the 1,000 pediatric chest radiograph reports to categorize them based on the presence of community-acquired pneumonia. This process resulted in 27.8 percent of reports being labeled as positive for pneumonia, while 58.5 percent were classified as negative. A subset of 13.7 percent of reports was labeled as indeterminate, reflecting the clinical reality of ambiguous radiographic findings, such as vague opacities that do not clearly meet diagnostic criteria. This distribution provided a diverse dataset for the internal validation of the automated classification system, ensuring the models were tested against both clear-cut and borderline clinical cases. The researchers evaluated the performance of five distinct open-source large language models: Gemma2 9B, Gemma2 27B, Falcon3 7B, DeepSeek R1 Distill Llama 8B, and Llama3.1 8B. To ensure rigorous testing, the study employed a 70/30 train-test split (a methodological framework in which 70 percent of the data is used to teach the model to recognize patterns and the remaining 30 percent is reserved to evaluate its accuracy on unseen reports). This design matters for clinicians because it demonstrates the model's ability to generalize its findings to new patients rather than simply memorizing the training set. By validating these models against physician-adjudicated labels, the study aimed to develop and internally validate an automated system for classifying chest radiograph reports for community-acquired pneumonia in children, potentially streamlining the interpretation of free-text clinical documentation.
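As a concrete illustration of this design, the sketch below shows how a 70/30 split stratified on the adjudicated label might be set up with scikit-learn. The toy reports, labels, and variable names are assumptions introduced for illustration; they are not the authors' code or data.

```python
# A minimal sketch of a 70/30 train-test split stratified on the
# physician-adjudicated label, using scikit-learn. The toy reports and
# labels below are illustrative placeholders, not data from the study.
from sklearn.model_selection import train_test_split

reports = [
    "Right lower lobe consolidation consistent with pneumonia.",
    "Focal airspace opacity in the left lower lobe.",
    "Dense consolidation with air bronchograms.",
    "Lungs are clear bilaterally. No focal consolidation.",
    "No acute cardiopulmonary abnormality.",
    "Clear lungs without effusion or pneumothorax.",
    "Normal chest radiograph.",
    "Patchy perihilar opacity; atelectasis versus early pneumonia.",
    "Vague right basilar opacity of uncertain significance.",
    "Streaky perihilar markings, nonspecific.",
]
labels = ["positive"] * 3 + ["negative"] * 4 + ["indeterminate"] * 3

# Stratifying on the adjudicated label keeps the positive / negative /
# indeterminate proportions similar across the training and test sets.
train_reports, test_reports, train_labels, test_labels = train_test_split(
    reports,
    labels,
    test_size=0.30,   # 30 percent of reports held out for evaluation
    stratify=labels,
    random_state=0,   # fixed seed so the split is reproducible
)
```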
Diagnostic Accuracy Across Classification Tiers
The researchers evaluated the models using two distinct frameworks to assess their utility in clinical workflows. First, performance was measured for three-class classification, which required the models to distinguish between reports labeled as pneumonia, negative, or indeterminate. Among the tested architectures, Gemma2 9B achieved the best overall performance. To quantify diagnostic accuracy, the study used the F1 score (a statistical measure that balances precision, the proportion of positive identifications that were actually correct, and recall, the proportion of actual positives that were identified correctly). In the three-class analysis, Gemma2 9B achieved a pneumonia F1 score of 0.82 and a no-pneumonia F1 score of 0.97. These metrics indicate that while the model is highly reliable at ruling out pneumonia, the presence of indeterminate reports introduces more complexity into the classification of positive cases.

To further refine the clinical utility of these tools, the researchers also measured performance for binary classification, a method that grouped pneumonia and indeterminate reports together against no-pneumonia reports. This grouping reflects a common clinical threshold where any report that is not explicitly negative may require further review or follow-up. Collapsing the task to binary classification improved performance across the board. Specifically, Gemma2 9B reached an F1 score of 0.97 in this category, while the larger Gemma2 27B achieved an F1 score of 0.93. These high scores suggest that the models are particularly adept at identifying clear negative findings, which could help clinicians prioritize reports that require immediate attention or further diagnostic workup.

The study also compared these modern large language models against established computational techniques to determine whether newer architectures provided a measurable advantage. All large language models substantially outperformed traditional natural language processing classifiers, including XGBoost (an optimized gradient boosting library), random forest, and logistic regression. While these traditional machine learning methods have been used for text classification in the past, they often struggle with the nuanced and sometimes ambiguous language found in pediatric radiology reports. The superior performance of the large language models suggests they are better equipped to handle the linguistic variability inherent in free-text clinical documentation, providing a more robust foundation for automated decision support systems in the emergency department.
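To make these metrics concrete, the short sketch below shows, under stated assumptions, how per-class F1 scores for the three-class task and an F1 score for the collapsed binary task might be computed with scikit-learn. The example labels and predictions are invented for illustration and do not reproduce the study's data or results.

```python
# Illustrative computation of per-class F1 scores and the binary collapse
# described above. The labels and predictions are invented placeholders.
from sklearn.metrics import f1_score

true_labels = ["positive", "negative", "indeterminate", "negative",
               "positive", "negative", "indeterminate", "negative"]
predicted = ["positive", "negative", "negative", "negative",
             "indeterminate", "negative", "indeterminate", "negative"]

# Three-class task: one F1 score per label, each balancing precision
# (how many predicted positives were correct) against recall (how many
# true positives were found).
per_class_f1 = f1_score(
    true_labels,
    predicted,
    labels=["positive", "negative", "indeterminate"],
    average=None,
)
print(dict(zip(["positive", "negative", "indeterminate"], per_class_f1)))

# Binary task: group pneumonia and indeterminate reports together, so the
# question becomes "explicitly negative versus anything else".
def collapse(label: str) -> str:
    return "no-pneumonia" if label == "negative" else "pneumonia-or-indeterminate"

binary_true = [collapse(label) for label in true_labels]
binary_pred = [collapse(label) for label in predicted]
binary_f1 = f1_score(
    binary_true, binary_pred, pos_label="pneumonia-or-indeterminate"
)
print(binary_f1)
```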
Addressing Ambiguity in Clinical Documentation
The researchers conducted a qualitative review of the instances where the large language models and the adjudicating physicians disagreed on the classification of a chest radiograph. This analysis revealed that discrepancies between model and human labels often involved ambiguous language in the reports, such as descriptions of patchy opacities or perihilar streaking that lack definitive diagnostic certainty. Because the models were evaluated on a dataset in which 13.7 percent of reports were classified as indeterminate, they had to navigate the same linguistic nuances that challenge clinicians in daily practice. The findings indicate that the models generally captured the essence of the radiologist's narrative, even when the final label was contested during the adjudication of the 1,000 pediatric emergency department encounters. Further investigation into these conflicting cases provided insight into the limitations of both automated and human classification. The analysis of discrepancies suggested interpretive subjectivity in the reports rather than specific model error, meaning the disagreement was often rooted in the varying ways a clinician might interpret a vague radiographic description. For the practicing physician, this underscores that the large language model is not merely making random errors but is reflecting the real-world uncertainty present in clinical documentation. By identifying these areas of subjectivity, the study suggests that these models could eventually serve as a secondary tool to flag reports where the clinical picture remains unclear, thereby supporting more consistent diagnostic standards and improving the reliability of pediatric emergency care.
References
1. Bradley JS, Byington CL, Shah SS, et al. The Management of Community-Acquired Pneumonia in Infants and Children Older Than 3 Months of Age: Clinical Practice Guidelines by the Pediatric Infectious Diseases Society and the Infectious Diseases Society of America. Clinical Infectious Diseases. 2011. doi:10.1093/cid/cir531
2. Orso D, Ban A, Guglielmo N. Lung ultrasound in diagnosing pneumonia in childhood: a systematic review and meta-analysis. Journal of Ultrasound. 2018. doi:10.1007/s40477-018-0306-5
3. Abid I, Qureshi N, Lategan N, Williams S, Shahid S. Point-of-care lung ultrasound in detecting pneumonia: A systematic review. Canadian Journal of Respiratory Therapy. 2024. doi:10.29390/001c.92182
4. Zubair M. Clinical applications of artificial intelligence in identification and management of bacterial infection: Systematic review and meta-analysis. Saudi Journal of Biological Sciences. 2024. doi:10.1016/j.sjbs.2024.103934
5. Dellit TH, Owens RC, McGowan JE, et al. Infectious Diseases Society of America and the Society for Healthcare Epidemiology of America Guidelines for Developing an Institutional Program to Enhance Antimicrobial Stewardship. Clinical Infectious Diseases. 2006. doi:10.1086/510393