For Doctors in a Hurry
- Clinicians lack efficient methods to document adherence to oncology communication guidelines within unstructured medical record notes.
- The researchers analyzed 134 clinical notes from 30 patients with advanced cancer using a secure artificial intelligence tool.
- The model achieved accuracy ranging from 0.51 to 0.99 across six communication domains compared to manual chart review.
- The authors concluded that large language models can effectively identify specific communication domains within clinical documentation.
- This automated approach reduces abstraction time to approximately 7 seconds per note, facilitating rapid quality improvement feedback for oncologists.
Automating the Audit of Clinical Communication
Effective communication between oncologists and patients with advanced malignancies is a cornerstone of high-quality care, influencing everything from treatment adherence to psychological well-being [1]. To standardize these interactions, professional organizations have established guidelines that emphasize the importance of documenting specific communication domains, such as goals of care and prognosis, within the medical record [2]. However, the manual extraction of this information from unstructured free-text notes remains a significant barrier to systematic quality improvement, often requiring several minutes of focused review per document [3, 4]. While large language models (computational systems trained on vast datasets to process and generate human-like text) have shown potential in streamlining clinical workflows and enhancing documentation, their reliability in capturing nuanced oncological dialogue is still being established [3, 5]. A recent feasibility study now evaluates whether these automated tools can match the precision of expert chart review in identifying critical communication elements.
Validation Against Gold-Standard Manual Review
The researchers evaluated the utility of large language models by analyzing a dataset of 134 clinical notes derived from 30 patients with advanced cancer. These patients were treated in June 2024 across seven Dana-Farber Cancer Institute clinics located in Boston, MA. To ensure the methodology adhered to modern clinical standards, the study utilized the communication guidelines established by the American Society of Clinical Oncology (ASCO). These guidelines were the result of a multidisciplinary panel convened in 2017 to define the essential elements of patient-oncologist dialogue, such as discussing prognosis or end-of-life preferences. The study specifically focused on identifying six communication domains within the unstructured free text of the medical records, which often contain complex narratives that are difficult to categorize through traditional automated means.

To process this sensitive data, the researchers employed a HIPAA-secure artificial intelligence tool based on GPT-4o (a multimodal large language model capable of processing text and images) to develop a specialized LLM prompt, which is a set of specific instructions used to guide the model's analysis. This prompt was designed to identify the specific communication domains within the notes while maintaining patient privacy. The performance of the LLM was then validated against gold-standard chart review, a process where expert human reviewers manually examine records to verify the presence of specific information. The researchers used standard performance metrics, including sensitivity, specificity, and accuracy, to compare the LLM prompt's output to the manual review across all six domains. This rigorous comparison aimed to determine whether the automated system could reliably replicate the nuanced judgment of a clinician when auditing documentation for adherence to the 2017 ASCO standards.
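For readers less familiar with these validation metrics, the comparison against gold-standard review reduces, for each domain, to counting agreements and disagreements between binary note-level labels. The sketch below is illustrative only: the function name and the example labels are hypothetical and are not data from the study.

```python
# Hypothetical sketch: note-level validation metrics for one communication
# domain, comparing LLM output to gold-standard manual chart review.
# Labels are 1 (domain documented) or 0 (absent); the data are invented
# for illustration, not taken from the study.

def validation_metrics(llm_labels, gold_labels):
    """Return (sensitivity, specificity, accuracy) for binary labels."""
    tp = sum(1 for p, g in zip(llm_labels, gold_labels) if p == 1 and g == 1)
    tn = sum(1 for p, g in zip(llm_labels, gold_labels) if p == 0 and g == 0)
    fp = sum(1 for p, g in zip(llm_labels, gold_labels) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(llm_labels, gold_labels) if p == 0 and g == 1)
    sensitivity = tp / (tp + fn)            # share of true mentions found
    specificity = tn / (tn + fp)            # share of true absences found
    accuracy = (tp + tn) / len(gold_labels) # overall agreement
    return sensitivity, specificity, accuracy

# Illustrative example: 10 notes audited for a single domain.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
llm  = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sens, spec, acc = validation_metrics(llm, gold)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

In this toy example the model misses one documented discussion (lowering sensitivity) and reports one that the reviewers did not find (lowering specificity), mirroring the two kinds of error the study's per-domain ranges capture.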
The researchers evaluated the technical proficiency of the GPT-4o model by comparing its output against manual chart review across the six defined communication domains. The note-level analysis demonstrated a wide range of statistical performance depending on the specific domain being identified. Specifically, the model achieved a sensitivity ranging from 0.43 to 1.0, indicating that while it captured all instances of certain communication topics, it missed more than half of the instances in others. The specificity of the model ranged from 0.32 to 0.99, reflecting its varying ability to correctly identify the absence of a communication domain. Overall accuracy ranged from 0.51 to 0.99, a spread that suggests the model is highly reliable for certain types of documentation but requires further refinement for others before it can be used for comprehensive clinical auditing.

To address concerns regarding the reliability of generative artificial intelligence in clinical settings, the study included a hallucination index (a metric used to assess the frequency of false or fabricated information produced by the model). This index is critical for clinicians to understand, as it quantifies how often the software might report a communication event that never actually occurred in the patient record. The researchers found that the average hallucination index for all domains was low, suggesting that the model rarely invented documentation. This low rate of fabrication, combined with the high accuracy observed in specific domains, provides a baseline for using these tools in quality improvement initiatives where rapid, large-scale feedback on physician-patient communication is required.
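One plausible way to operationalize a hallucination index is as the fraction of model-reported domain mentions that reviewers could not verify in the note. The study does not give its exact formula, so the definition, function name, and data below are all assumptions for illustration.

```python
# Hypothetical sketch of a hallucination index: the share of domain
# mentions reported by the model that the gold-standard chart review
# could NOT verify in the note. This definition is an assumption; the
# study does not publish its exact formula. Data are illustrative.

def hallucination_index(llm_labels, gold_labels):
    """Fraction of model-reported positives absent from the gold standard."""
    reported = [(p, g) for p, g in zip(llm_labels, gold_labels) if p == 1]
    if not reported:
        return 0.0  # nothing reported, nothing fabricated
    fabricated = sum(1 for _, g in reported if g == 0)
    return fabricated / len(reported)

# Illustrative example: the model reports 4 mentions; 1 is unverified.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
llm  = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(hallucination_index(llm, gold))
```

Under this definition, a low index means that when the model says a conversation was documented, the chart almost always backs it up, which is the property the study highlights as a prerequisite for trusting automated audits.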
Efficiency Gains and Clinical Implementation
The primary utility of large language models in this context lies in their ability to address the current inefficiency of manual chart review for communication topics in medical records. While traditional human auditing is a labor-intensive process, these computational methods can rapidly identify communication domains in unstructured free-text notes and provide immediate clinician feedback. In this study, the LLM abstraction required approximately 7 seconds per note, a significant reduction in time compared to the 5 to 7 minutes required for manual chart review. This speed differential suggests that automated tools could facilitate large-scale quality assessments that were previously logistically impossible for busy clinical departments. Beyond simple data extraction, the researchers emphasize that LLMs have the potential to identify ASCO communication domains for broader quality improvement efforts. By automating the auditing process, healthcare systems can monitor adherence to communication guidelines across thousands of patient encounters simultaneously. Future applications of this technology include generating feedback for oncologists on topics requiring follow-up, such as end-of-life preferences or treatment goals that may have been omitted or left unresolved in previous visits. This capability ensures more comprehensive patient care by flagging specific gaps in documentation that require the clinician's attention during subsequent consultations, potentially improving the longitudinal continuity of the patient-physician relationship.
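The scale of the efficiency gain can be checked with quick arithmetic on the study's own figures; the sketch below assumes a 6-minute midpoint for the 5-to-7-minute manual review.

```python
# Back-of-the-envelope audit-time comparison for the study's 134 notes,
# using the reported ~7 seconds/note for LLM abstraction and an assumed
# 6-minute midpoint of the 5-7 minutes/note for manual chart review.

notes = 134
llm_minutes = notes * 7 / 60   # total LLM time, in minutes (~15.6)
manual_minutes = notes * 6     # total manual time, in minutes (804, ~13.4 h)

print(f"LLM: {llm_minutes:.1f} min total; "
      f"manual: {manual_minutes} min total; "
      f"speedup: {manual_minutes / llm_minutes:.0f}x")
```

Even for this modest dataset, the automated approach compresses roughly a day and a half of reviewer effort into about a quarter of an hour, which is what makes department-scale audits plausible.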
References
1. El-Shami K, Oeffinger KC, Erb NL, et al. American Cancer Society Colorectal Cancer Survivorship Care Guidelines. CA: A Cancer Journal for Clinicians. 2015. doi:10.3322/caac.21286
2. Smith RA, Manassaram-Baptiste D, Brooks D, et al. Cancer screening in the United States, 2014: A review of current American Cancer Society guidelines and current issues in cancer screening. CA: A Cancer Journal for Clinicians. 2014. doi:10.3322/caac.21212
3. Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare. 2023. doi:10.3390/healthcare11060887
4. Li C, Zhao Y, Bai Y, et al. Unveiling the Potential of Large Language Models in Transforming Chronic Disease Management: Mixed Methods Systematic Review. Journal of Medical Internet Research. 2024. doi:10.2196/70535
5. Mudrik A, Tsur A, Nadkarni G, et al. Leveraging Large Language Models in Gynecologic Oncology: A Systematic Review of Current Applications and Challenges. medRxiv. 2024. doi:10.1101/2024.08.08.24311699