For Doctors in a Hurry
- Clinicians require automated tools to accurately classify pediatric chest radiograph reports for community-acquired pneumonia in emergency settings.
- The researchers analyzed 1,000 pediatric emergency department encounters using five open-source large language models to categorize radiographic findings.
- The best-performing model, Gemma2 9B, achieved a pneumonia F1 score of 0.82 and a no-pneumonia F1 score of 0.97 in three-class classification.
- The authors concluded that these large language models effectively classify pediatric chest radiograph reports by interpreting complex clinical language.
- Integrating these models into clinical workflows may assist physicians with decision support and quality improvement for pediatric pneumonia diagnosis.
Automating Radiographic Interpretation in Pediatric Pneumonia
Pediatric community-acquired pneumonia remains a leading cause of global morbidity and is the third leading cause of death for children under five years of age [1, 2]. While the Pediatric Infectious Diseases Society and the Infectious Diseases Society of America provide management frameworks [3], a cross-sectional survey of emergency physicians found that chest radiograph use was nonadherent to guidelines in more than 50% of clinical cases [4]. This diagnostic inconsistency is further complicated by the limitations of traditional imaging; a meta-analysis of 2,897 patients demonstrated that lung ultrasound (a bedside imaging technique using high-frequency sound waves) achieves a sensitivity of 92.13% and a specificity of 81.91% compared to standard radiographs [5]. Furthermore, the manual review of unstructured radiology reports creates a significant bottleneck for antimicrobial stewardship (coordinated interventions to optimize antibiotic selection and dosing) and real-time clinical decision support [6, 1]. A new study evaluates whether open-source artificial intelligence models can bridge this gap by accurately identifying pneumonia cases from free-text clinical narratives.
Validation Against Physician Adjudication
To establish a robust reference standard for the artificial intelligence models, the researchers conducted a retrospective single-center study of 1,000 pediatric emergency department encounters recorded between 2016 and 2022. Each encounter included a chest radiograph, and two physicians adjudicated the associated free-text reports, classifying each as positive, negative, or indeterminate for pneumonia. This physician-led adjudication served as the ground truth (the definitive reference standard used to evaluate the accuracy of a diagnostic test) against which the large language models were measured, ensuring that automated classifications were judged against expert clinical judgment.

The cohort reflected a typical pediatric emergency population, with a median patient age of 4.2 years and an interquartile range (the middle 50% of the data) of 1.7 to 10.5 years. Clinical severity was notable: 54.4% of patients were admitted to the hospital from the emergency department. This high admission rate meant the sample captured a broad spectrum of pulmonary presentations, from mild infections to cases requiring inpatient stabilization, providing a realistic baseline for evaluating automated systems against the nuances of pediatric respiratory diagnostics.

Physician adjudication of the 1,000 reports revealed the inherent complexity of interpreting pediatric imaging. The clinicians identified pneumonia in 27.8% of reports and no pneumonia in 58.5%, while 13.7% were classified as indeterminate, a category that reflects the diagnostic uncertainty encountered in practice when radiographic findings are subtle or non-specific. These figures underscore the difficulty of clear-cut classification in pediatric care and the need for diagnostic tools that can navigate ambiguous clinical language.
The primary objective of the study was to develop and internally validate an automated system for classifying chest radiograph reports for community-acquired pneumonia in children. The researchers evaluated five open-source large language models (artificial intelligence systems trained on vast datasets that can be run on local servers to preserve patient data privacy): Gemma2 9B, Gemma2 27B, Falcon3 7B, DeepSeek R1 Distill Llama 8B, and Llama3.1 8B. To ensure rigorous testing, the dataset was split 70/30 into training and test sets for the pneumonia outcome: 70% of the physician-adjudicated reports were used to develop the models, while the remaining 30% served as an independent set for verifying diagnostic accuracy.

Performance was reported for both three-class classification (pneumonia, no pneumonia, or indeterminate) and binary classification, in which pneumonia and indeterminate cases were grouped together against no pneumonia. Gemma2 9B achieved the best overall performance, with a pneumonia F1 score of 0.82 in three-class classification. The F1 score balances precision (the proportion of positive identifications that were actually correct) and recall (the proportion of actual positives that were correctly identified); it is their harmonic mean, 2 × precision × recall / (precision + recall). In the same three-class analysis, Gemma2 9B achieved a no-pneumonia F1 score of 0.97, demonstrating high reliability in ruling out the condition from free-text reports.

Accuracy increased further in the binary task: Gemma2 9B yielded an F1 score of 0.97, while the larger Gemma2 27B reached 0.93. All five large language models outperformed traditional natural language processing classifiers (older machine learning methods that rely on specific word frequencies rather than contextual understanding), including XGBoost, random forest, and logistic regression, suggesting that modern language models are better equipped to handle the linguistic nuance and varied terminology of pediatric radiology reports. A minimal evaluation sketch follows.
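The study's evaluation code is not reproduced here; the snippet below is a minimal sketch of how the binary collapse, the 70/30 split, and the F1 metrics could be computed with scikit-learn, using a word-frequency baseline of the kind the language models were compared against. The example reports, label strings, and variable names are invented for illustration, not taken from the study.

```python
# Minimal sketch (not the authors' code): TF-IDF + logistic regression
# baseline evaluated with a 70/30 split and per-class F1 scores.
# Report texts and labels below are invented examples; in the study,
# labels came from two-physician adjudication.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

reports = [
    "Right lower lobe opacity consistent with pneumonia.",
    "Focal consolidation in the left lung base.",
    "Clear lungs. No acute cardiopulmonary process.",
    "Normal chest radiograph for age.",
    "Streaky perihilar markings; infection not excluded.",
    "Patchy airspace disease, possibly infectious.",
    "No focal consolidation, effusion, or pneumothorax.",
    "Dense right middle lobe consolidation.",
    "Hazy opacity; atelectasis versus early pneumonia.",
    "Lungs are well expanded and clear.",
]
labels = ["pneumonia", "pneumonia", "no_pneumonia", "no_pneumonia",
          "indeterminate", "indeterminate", "no_pneumonia", "pneumonia",
          "indeterminate", "no_pneumonia"]

# Binary collapse used in the study: pneumonia + indeterminate vs. no pneumonia.
y = [0 if label == "no_pneumonia" else 1 for label in labels]

# 70/30 train-test split mirroring the study design.
X_train, X_test, y_train, y_test = train_test_split(
    reports, y, test_size=0.30, stratify=y, random_state=0)

# Word-frequency baseline: TF-IDF features feeding logistic regression.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

# Per-class precision, recall, and F1 (F1 = 2PR / (P + R)).
print(classification_report(y_test, baseline.predict(X_test), zero_division=0))
```

Collapsing indeterminate reports into the positive class, as the study's binary framing does, biases an automated screen toward flagging uncertain films for human review rather than silently clearing them.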
Clinical Subjectivity and Implementation Potential
The researchers also analyzed the cases in which the models disagreed with physician adjudication. Discrepancies between model and human labels most often involved ambiguous language, suggesting that the conflicts were rooted in interpretive subjectivity rather than technical model error. Radiologic reporting frequently relies on hedged or nuanced terminology, and the models encountered the same challenges as human clinicians when navigating these linguistic uncertainties. In other words, the models captured the genuine diagnostic gray areas of pediatric radiographic interpretation rather than introducing new errors of their own.

Beyond classification accuracy, these findings support the feasibility of integrating large language models into decision support and quality improvement pipelines within the clinical workflow. Because Gemma2 9B both identified pneumonia and ruled it out with an F1 score of 0.97 in binary classification, such models could serve as automated screening tools that flag high-risk cases or deliver real-time feedback to clinicians, potentially reducing diagnostic lag and improving the consistency of radiographic interpretation. By automating the extraction of actionable data from free-text reports, open-source models offer a scalable alternative to manual, labor-intensive chart review for pediatric emergency care and institutional quality monitoring. A hypothetical screening hook is sketched below.
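The paper does not describe a specific deployment stack, so the following is a hypothetical sketch of how a locally hosted open-source model could screen incoming reports. It assumes Gemma2 9B is served through an Ollama instance at its default local endpoint; the prompt wording, model tag, and flagging rule are illustrative assumptions, not the authors' configuration.

```python
# Hypothetical screening hook (not the authors' implementation): send a
# free-text radiograph report to a locally served Gemma2 9B via Ollama's
# REST API and flag anything that is not clearly negative for review.
# Assumes an Ollama server at the default localhost endpoint with the
# "gemma2:9b" model pulled; running locally keeps reports on-premises,
# which matches the privacy rationale for open-source models.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama default endpoint

PROMPT = (
    "Classify the following pediatric chest radiograph report as exactly one "
    "of: pneumonia, no_pneumonia, indeterminate. Reply with the label only.\n\n"
    "Report: {report}"
)

def classify_report(report: str) -> str:
    """Return the model's label for one free-text report."""
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": "gemma2:9b",
            "prompt": PROMPT.format(report=report),
            "stream": False,                 # one complete JSON response
            "options": {"temperature": 0},   # deterministic output for screening
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip().lower()

def needs_review(report: str) -> bool:
    """Flag pneumonia and indeterminate reports for clinician review,
    mirroring the study's binary grouping."""
    return classify_report(report) != "no_pneumonia"

if __name__ == "__main__":
    example = "Patchy right lower lobe opacity; infection not excluded."
    print("flag for review:", needs_review(example))
```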
References
1. Zubair M. Clinical applications of artificial intelligence in identification and management of bacterial infection: systematic review and meta-analysis. Saudi Journal of Biological Sciences. 2024. doi:10.1016/j.sjbs.2024.103934
2. Orso D, Ban A, Guglielmo N. Lung ultrasound in diagnosing pneumonia in childhood: a systematic review and meta-analysis. Journal of Ultrasound. 2018. doi:10.1007/s40477-018-0306-5
3. Bradley JS, Byington CL, Shah SS, et al. The Management of Community-Acquired Pneumonia in Infants and Children Older Than 3 Months of Age: Clinical Practice Guidelines by the Pediatric Infectious Diseases Society and the Infectious Diseases Society of America. Clinical Infectious Diseases. 2011. doi:10.1093/cid/cir531
4. McLaren SH, Mistry RD, Neuman MI, Florin TA, Dayan PS. Guideline adherence in diagnostic testing and treatment of community-acquired pneumonia in children. Pediatric Emergency Care. 2021. doi:10.1097/PEC.0000000000001745
5. Abid I, Qureshi N, Lategan N, Williams S, Shahid S. Point-of-care lung ultrasound in detecting pneumonia: a systematic review. Canadian Journal of Respiratory Therapy. 2024. doi:10.29390/001c.92182
6. Dellit TH, Owens RC, McGowan JE, et al. Infectious Diseases Society of America and the Society for Healthcare Epidemiology of America Guidelines for Developing an Institutional Program to Enhance Antimicrobial Stewardship. Clinical Infectious Diseases. 2006. doi:10.1086/510393