AI in Brief: Swimming to the Deep End with Large Language Models

March 14, 2024 | Po-Hao (Howard) Chen, MD, MBA

Po-Hao "Howard" Chen, MD, MBA

Vice Chair for Artificial Intelligence
Medical Director for Enterprise Radiology
Staff Radiologist in Musculoskeletal Imaging
Cleveland Clinic

In the rapidly evolving intersection of radiology and artificial intelligence (AI), large language models (LLMs) stand out as game-changers, offering both opportunities and challenges. This quarter’s AI in Brief delves into literature, blog posts, and other resources from the past three to six months that explore the multifaceted impact of LLMs in radiology. Through studies, practical examples, and a hands-on exercise, these works illuminate the capabilities of LLMs to interpret and summarize clinical texts. Yet they also caution against the unchecked adoption of LLMs in radiology, highlighting the importance of addressing hallucinations, bias, and ethical considerations.

This quarter’s AI in Brief is a primer on the current state of LLMs in radiology, presenting a balanced view of their transformative potential alongside the hurdles that must be overcome to realize their full value.

Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications

The Radiology article by Rajesh Bhayana is a good introduction to the world of chatbots and LLMs in radiology, highlighting LLMs’ potential to revolutionize clinical practice and research. The article is written at a high level, outlining the strengths and limitations of LLMs, emphasizing their human-like performance in interpreting both text and images, which mirrors current diagnostic pathways. However, it also discusses challenges such as hallucinations, complex reasoning, and bias. Strategies for optimizing LLM performance and real-world applications that enhance efficiency in radiology are reviewed, indicating a promising future with the integration of AI tools.

If you are looking to understand how radiology and LLMs will collide, start here.

Evaluating the performance of Generative Pre-trained Transformer-4 (GPT-4) in standardizing radiology reports

Reporting and communication standardization has been a cornerstone of radiology for years. The ACR has long been a champion of report standardization, both through a formal practice parameter and through the creation of Reporting and Data Systems (RADS). Emerging evidence now suggests there may be a way to bring LLMs into the mix. This study by Hasani et al. in European Radiology evaluates GPT-4's ability to generate standardized radiology reports, comparing them against those created by radiologists. It found that AI-generated reports were comparable in quality, offering more concise and structurally clear reports, with high content similarity scores. GPT-4 shows promise for improving efficiency and communication in clinical practice, but limitations and ethical concerns need careful consideration for safe, effective use.

Adapted large language models can outperform medical experts in clinical text summarization

In addition to reorganizing a radiology report, LLMs can ingest, incorporate, and summarize a body of knowledge. This study by Van Veen et al. from Stanford, published in Nature Medicine, demonstrates that adapted LLMs may surpass medical experts in clinical text summarization tasks, including radiology reports, patient questions, progress notes, and doctor-patient dialogues. The publication is an excellent read not only because it is a well-designed study but also because the authors shared their source code on GitHub, detailing how the publicly available MIMIC-IV dataset was used and processed. The study employed eight LLMs and analyzed their performance using syntactic, semantic, and conceptual NLP metrics to identify the most effective models and adaptation strategies. A significant finding is that summaries from the best-adapted LLMs were often equivalent to or better than those from medical experts, according to a clinical reader study involving 10 physicians. The study also highlights the importance of prompt engineering and adaptation methods such as in-context learning (ICL) for improving LLM performance on specific tasks, and the role of safety analysis in identifying errors and potential downstream patient harm.
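For readers who want a feel for two of the ideas mentioned above, here is a minimal Python sketch, not the authors' code. It assumes a simple "Findings / Impression" prompt layout for in-context learning and uses a plain unigram-overlap F1 as a stand-in for a syntactic metric; the function names, the prompt wording, and the example texts are all hypothetical.

```python
import re
from collections import Counter


def build_icl_prompt(examples, findings,
                     instruction="Summarize the radiology findings into an impression."):
    """Assemble a few-shot prompt: instruction, worked examples, then the new case.

    `examples` is a list of (findings, impression) pairs. The layout is an
    assumption for illustration, not the format used in the study.
    """
    parts = [instruction, ""]
    for ex_findings, ex_impression in examples:
        parts.append(f"Findings: {ex_findings}")
        parts.append(f"Impression: {ex_impression}")
        parts.append("")
    parts.append(f"Findings: {findings}")
    parts.append("Impression:")  # left open for the model to complete
    return "\n".join(parts)


def tokenize(text):
    """Lowercase and keep alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(candidate, reference):
    """Word-overlap F1 between a generated summary and a reference summary."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # shared words, respecting counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    shots = [("Lungs are clear. No effusion.", "No acute cardiopulmonary process.")]
    print(build_icl_prompt(shots, "Right lower lobe consolidation."))
    print(unigram_f1("no fracture", "No acute fracture."))
```

Word-overlap scores like this reward shared vocabulary rather than clinical correctness, which is one reason a study of this kind pairs automated metrics with a physician reader study.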

Amazon Web Services publishes tutorial blog on generating impressions from findings in radiology reports

As radiologists, we take immense pride in building an informative radiology report. Radiology reports, critical for clinical decision-making, are detailed documents summarizing the results of imaging exams. These reports are vital but challenging to compose: much of residency training is spent learning to build a concise, informative report for referring providers, other radiologists, and patients. Automating report summarization is an ongoing commercial and research interest and, according to Amazon, a project accessible to anyone with an internet connection (and a credit card). The article proposes automating summarization with generative AI, specifically by fine-tuning pre-trained large language models (LLMs) such as FLAN-T5 XL. Focused on a proof of concept, the proposed approach takes the publicly available MIMIC-CXR dataset and processes it through a collection of proprietary tools and a public LLM to build a model that could improve the accessibility of radiology reports and streamline radiological reporting.
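Fine-tuning a model to write impressions starts with turning each report into an input/target pair. The Python sketch below illustrates that preprocessing step only and is not taken from the AWS tutorial. It assumes reports mark their sections with "FINDINGS:" and "IMPRESSION:" headers, and the "summarize: " prefix, the function names, and the sample report are hypothetical.

```python
import re

# Assumed section headers; real reports vary and may need more patterns.
SECTION_RE = re.compile(r"^\s*(FINDINGS|IMPRESSION)\s*:", re.IGNORECASE | re.MULTILINE)


def split_report(report):
    """Return (findings, impression), or None if either section is missing."""
    sections = {}
    matches = list(SECTION_RE.finditer(report))
    for i, match in enumerate(matches):
        # A section runs from its header to the next recognized header.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        body = " ".join(report[match.end():end].split())  # collapse whitespace
        sections[match.group(1).upper()] = body
    findings = sections.get("FINDINGS")
    impression = sections.get("IMPRESSION")
    if not findings or not impression:
        return None
    return findings, impression


def to_training_pair(report, prefix="summarize: "):
    """Build one input/target example for a sequence-to-sequence model."""
    pair = split_report(report)
    if pair is None:
        return None  # skip reports that cannot be split
    findings, impression = pair
    return {"input": prefix + findings, "target": impression}


if __name__ == "__main__":
    sample = (
        "EXAMINATION: Chest radiograph.\n\n"
        "FINDINGS: The lungs are clear.\nNo pleural effusion.\n\n"
        "IMPRESSION: No acute cardiopulmonary process.\n"
    )
    print(to_training_pair(sample))
```

Reports missing either section are dropped rather than guessed at, so the model only sees pairs in which a radiologist actually wrote the impression.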

Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study

Thanks to the work of the AI community, it is now routine in most AI discussions to also discuss this technology’s potential to automate bias at a large scale. But it is never too routine to do a deep dive into how AI can do so. This study from Zack et al., published in The Lancet Digital Health and funded by Priscilla Chan and Mark Zuckerberg, assesses GPT-4's potential to perpetuate racial and gender biases in healthcare, focusing on four clinical applications: medical education, diagnostic reasoning, clinical plan generation, and patient assessment. Despite GPT-4's promise of enhancing healthcare delivery, the study reveals that it inaccurately models demographic diversity in medical conditions, leading to stereotypical portrayals and recommendations. This misrepresentation could impact clinical decisions and patient care, underscoring the need for comprehensive bias assessments and mitigation strategies before integrating such AI tools into clinical practice. The findings call for a cautious approach to deploying large language models in healthcare to avoid exacerbating health disparities.