For Doctors in a Hurry
- Radiologists require efficient tools to reduce reporting time and minimize transcription errors during routine clinical practice.
- This prospective study compared 200 reports using conventional speech recognition against 200 reports using a large language model.
- The model reduced the median report generation time from 318 to 238 seconds and significantly decreased grammar and transcription errors (p < 0.01).
- The researchers concluded that the model improves efficiency but introduces new error patterns, such as non-compliance with instructions and occasional confabulations.
- Clinical implementation requires caution because time savings remain heterogeneous and dependent on individual radiologist dictation habits.
The Evolution of Clinical Documentation in Diagnostic Imaging
The integration of artificial intelligence into clinical workflows has transitioned from experimental computer vision tasks to sophisticated documentation support [1, 2]. While convolutional neural networks (a type of deep learning architecture used to identify patterns in medical images) have long assisted in diagnosis, the focus is shifting toward streamlining the administrative burden of reporting [2]. Generative artificial intelligence, particularly large language models (AI systems trained on vast text datasets to predict and generate human-like language), offers potential benefits in medical education and clinical communication [3, 4]. However, the transition to these tools in daily practice raises significant concerns regarding the accuracy of clinical reasoning and the risk of hallucinations (the generation of false or nonsensical information by an AI model) [3, 4]. A new study now evaluates whether these general-purpose language tools can effectively replace conventional speech recognition in the high-stakes environment of radiology reporting.
Comparative Analysis in Routine Clinical Practice
The researchers conducted a prospective, multicenter trial to evaluate the performance of a general-purpose large language model against conventional speech recognition (CSR) systems. In this study, five radiologists generated a total of 400 reports during their routine clinical practice. The workload was evenly split: 200 reports were produced using CSR and 200 using a general-purpose large language model with integrated speech recognition capabilities. To ensure data security and patient privacy, the authors implemented a strict protocol in which no patient-identifying or clinical information was uploaded to the large language model. This methodological safeguard is critical for clinicians considering the adoption of cloud-based AI tools, as it mitigates the risk of violating patient confidentiality regulations. The dataset comprised a variety of diagnostic imaging studies: 301 of the 400 reports (75.3 percent) were from CT scans and 99 (24.8 percent) from MRI scans. To analyze the differences between the two reporting methods, the researchers used the Mann-Whitney U-test (a non-parametric statistical test used to compare differences between two independent groups when the data are not normally distributed) to evaluate quantitative variables such as generation times. For categorical variables, such as the presence or absence of specific error types, they applied the chi-square test (a statistical method used to determine whether there is a significant association between two categorical variables).
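To make these two tests concrete, the following minimal sketch applies them to small, hypothetical datasets using Python's scipy.stats module; all numbers are illustrative assumptions, not the study's data.

```python
# Hedged illustration of the study's two statistical tests on
# hypothetical data (NOT the trial's actual measurements).
from scipy.stats import chi2_contingency, mannwhitneyu

# Mann-Whitney U-test: compares report generation times (seconds)
# between two independent groups without assuming normality.
llm_times = [190, 238, 260, 310, 154, 349, 220, 275]
csr_times = [318, 402, 290, 478, 330, 218, 365, 410]
u_stat, p_time = mannwhitneyu(llm_times, csr_times, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_time:.3f}")

# Chi-square test: checks whether a categorical outcome (reports with
# vs. without a transcription error) is associated with the modality.
#               with error  without error
contingency = [[60,         140],   # hypothetical LLM group
               [120,         80]]   # hypothetical CSR group
chi2, p_cat, dof, _ = chi2_contingency(contingency)
print(f"chi-square = {chi2:.1f} (dof = {dof}), p = {p_cat:.4f}")
```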
Efficiency Gains and Individual Variability
The researchers tracked the workflow by recording generation times for all 400 reports to determine the impact of each technology on clinical throughput. When comparing the two modalities, the large language model demonstrated a measurable increase in efficiency. The overall median total generation time was 238 seconds (interquartile range: 154 to 349 seconds) in the large language model group, compared with a median of 318 seconds (interquartile range: 218 to 478 seconds) in the conventional speech recognition group, an absolute median saving of 80 seconds, or roughly 25 percent, per report. This reduction was statistically significant (p < 0.01), suggesting that the automated processing and structuring capabilities of the large language model can streamline the documentation process. Despite the aggregate improvement in speed, the data revealed significant variability among the participating clinicians: a time reduction with the large language model was observed in only 3 of the 5 radiologists. This finding indicates that the efficiency gains are not universal and may be heavily influenced by the specific dictation habits and technical proficiency of the user. For the practicing radiologist, these results suggest that while large language models can accelerate reporting, the actual impact on daily productivity is heterogeneous and depends on how an individual clinician interacts with the software and integrates it into an established workflow. This variability underscores the importance of personalized training and pilot testing before full-scale implementation in a clinical department.
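A department running such a pilot could reproduce this per-reader analysis on its own timing logs. The sketch below assumes a simple data layout (a mapping of reader and modality to per-report generation times; all reader names and seconds are hypothetical) and computes the median and interquartile range that the study reports.

```python
# Minimal per-radiologist timing summary on hypothetical pilot data;
# readers, modalities, and times below are illustrative assumptions.
import numpy as np

# Generation time in seconds for each report, keyed by (reader, modality).
pilot_times = {
    ("reader_A", "llm"): [210, 190, 250, 300],
    ("reader_A", "csr"): [320, 410, 280, 350],
    ("reader_B", "llm"): [400, 380, 420, 390],  # slower with the LLM
    ("reader_B", "csr"): [310, 330, 300, 290],
}

for (reader, modality), times in sorted(pilot_times.items()):
    median = np.median(times)
    q1, q3 = np.percentile(times, [25, 75])
    print(f"{reader} / {modality}: median {median:.0f} s "
          f"(IQR {q1:.0f} to {q3:.0f} s)")
```

In this toy dataset, reader_A saves time with the large language model while reader_B does not, mirroring the heterogeneity the study observed across its five radiologists.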
Quantifying Accuracy and New Error Patterns
The researchers evaluated the accuracy of the reports by comparing the frequency of linguistic and technical errors between the two modalities. Grammar and spelling errors totaled 79 in the large language model group compared to 293 in the conventional speech recognition group (p < 0.01). Similarly, transcription errors totaled 225 in the large language model group compared to 445 in the conventional speech recognition group (p < 0.01). These figures suggest that the large language model is more effective at producing syntactically correct and accurately transcribed text during dictation, potentially reducing the time clinicians spend proofreading for minor clerical mistakes.

To provide a more granular assessment of the differences between the dictated audio and the final text, the researchers used the Levenshtein distance (a metric measuring the difference between two sequences of characters by counting the minimum number of edits required to change one string into another). This analysis revealed that the Levenshtein distance at the character scale was 43 (interquartile range: 8 to 156) in the large language model group, significantly higher than the 20 (interquartile range: 5 to 43) observed in the conventional speech recognition group (p < 0.01). A higher Levenshtein distance indicates that the large language model made more extensive modifications to the raw input than the conventional system, which tends to follow the literal dictation more closely. In other words, although the large language model group contained fewer errors, the software was more prone to altering the radiologist's original phrasing.

While the large language model reduced common clerical errors, it introduced unique qualitative challenges that clinicians must monitor. The researchers identified 69 instances of rewording without loss of meaning and 99 instances of non-compliance with instructions in the large language model group. Most critically, the study documented 4 confabulations (the generation of false or nonsensical information). These errors represent a distinct category of risk not typically found in conventional speech recognition. For the practicing radiologist, these findings emphasize that while large language models can minimize common transcription mistakes, they require vigilant oversight to ensure that the final report remains clinically accurate and adheres to specific formatting requirements without the introduction of fabricated data.
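For readers unfamiliar with the metric, the following minimal sketch implements the character-level Levenshtein distance with the classic dynamic-programming recurrence; the example strings are hypothetical and not drawn from the study.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

# Hypothetical example: one added trailing character gives a distance of 1,
# while heavier rephrasing by a language model yields a larger distance.
print(levenshtein("no focal liver lesion", "no focal liver lesions"))  # 1
print(levenshtein("no focal liver lesion", "the liver shows no focal lesion"))
```

A higher value therefore flags reports in which the software departed further from the literal dictation, which is why the large language model, despite producing fewer errors, scored a larger median distance.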
References
1. Jerome JTJ, Jain V. Clinical applications of artificial intelligence in hand surgery: A systematic review and meta-analysis. Journal of Clinical Orthopaedics and Trauma. 2026. doi:10.1016/j.jcot.2026.103335
2. Yamashita R, Nishio M, Gian RK, Togashi K. Convolutional neural networks: an overview and application in radiology. Insights into Imaging. 2018. doi:10.1007/s13244-018-0639-9
3. Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare. 2023. doi:10.3390/healthcare11060887
4. Hack S, Attal R, Farzad A, et al. Performance of generative AI across ENT tasks: A systematic review and meta-analysis. Auris Nasus Larynx. 2025. doi:10.1016/j.anl.2025.08.010