The time is right for all radiologists-in-training to develop the skills that will be in demand when image classification “goes to the machines.” In this post, I lay out a four-step approach to ensuring you are well-prepared for integrating AI into your clinical practice and capitalizing on the gains in efficiency it can deliver.
First, I’ll propose two steps geared toward preparing yourself to critically evaluate the purchase and clinical implementation of a machine learning (ML) tool. The third step is intended to keep you flexible when ML tools increase our efficiency as diagnostic radiologists to the point where we’re able to start focusing on other tasks. Finally, I’ll close with a few recommendations for those who want to leverage their expertise as clinical radiologists to engage in active development of clinical ML tools.
Step 1: Develop a Solid Foundation in Biostatistics
The ML world is filled with discussion of various metrics for assessing the diagnostic performance of an ML model. Many of these metrics refer to the overall performance of the model on a relatively large dataset. While you might be able to extrapolate good performance to the “average” case encountered in that dataset, much more information is required before performance can be extrapolated to individual cases from another institution, a different model of scanner, or even particularly challenging cases from within the original dataset, often referred to as “edge cases.”
You’ll want to become familiar with standard metrics of diagnostic performance, such as accuracy, sensitivity, specificity, and positive (or negative) predictive value – including differences in the dependence of these metrics on the prevalence of the “positive” condition. You’ll also want to learn more about receiver operating characteristic (ROC) analysis and the area-under-the-curve (AUC) metric. ROC curves and AUC values are not calculated at a particular operating point of the model, but over the entire range of operating points, which can be misleading when trying to understand how a model would perform in practice.
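To make the prevalence point concrete, here is a minimal Python sketch. The function names and the counts are hypothetical and chosen only for illustration. It computes the standard metrics from confusion-matrix counts, then uses Bayes' rule to show how positive predictive value shifts with prevalence while sensitivity and specificity stay fixed.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return standard diagnostic metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "prevalence": (tp + fn) / total,
    }


def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """Bayes' rule: PPV for a fixed sensitivity/specificity at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical test set: 1,000 studies, 100 of which contain the target finding.
metrics = diagnostic_metrics(tp=90, fp=90, tn=810, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Same model (90% sensitivity, 90% specificity) at three different prevalences.
for prevalence in (0.01, 0.10, 0.50):
    ppv = ppv_at_prevalence(0.9, 0.9, prevalence)
    print(f"prevalence {prevalence:.2f} -> PPV {ppv:.3f}")
```

With these made-up numbers, the PPV moves from roughly 8% at 1% prevalence to 50% at 10% prevalence and 90% at 50% prevalence, even though sensitivity and specificity never change. That is why a PPV reported on an enriched research dataset may not carry over to a low-prevalence screening population.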
For screening tasks, whether or not the results are intended to be reviewed by a radiologist prior to affecting the clinical workflow, it is usually appropriate to trade off some degree of specificity (the ability to correctly identify negative cases) in favor of higher sensitivity (the ability to detect the target pathology when it is present). However, for assisted diagnosis tasks, one may prefer higher specificity or positive predictive value, particularly if the model output could be challenging for a human to interpret.
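The trade-off between sensitivity and specificity at different operating points, and the way a single AUC value hides it, can be sketched in a few lines of Python. The scores and labels below are invented for illustration, and the helper names are my own. AUC is computed here as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, which is equivalent to the area under the ROC curve.

```python
def sens_spec_at_threshold(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores, labels):
    """AUC as P(positive score > negative score), counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical model outputs for 4 positive and 4 negative studies.
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.45, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(f"AUC: {auc(scores, labels):.4f}")
for threshold in (0.35, 0.50, 0.75):
    sens, spec = sens_spec_at_threshold(scores, labels, threshold)
    print(f"threshold {threshold:.2f}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

On this toy data the AUC is a single number (0.8125), yet the three thresholds give very different behavior: a low threshold catches every positive at the cost of specificity, which may suit screening, while a high threshold does the reverse. When a vendor quotes an AUC, it is worth asking which operating point the product actually uses.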
Step 2: Familiarize Yourself with ML Terminology
In addition to understanding statistics, it is important to understand the basic terminology of data science and ML, so you’ll be ready to ask the right questions when evaluating a potential ML tool for clinical implementation. Recent articles from Esteva et al. in Nature Medicine [1] and Zaharchuk et al. in AJNR [2] provide a nice foundation in ML terminology in the context of healthcare (generally) and neuroimaging (specifically).
ML tasks typically require a large amount of data for “training” the model, which is the process of adjusting the parameters (or weights) in the mathematical formula for the model to achieve the desired outcome. Training is typically performed with a separate “validation” dataset for intermediate evaluation of the model’s performance and subsequent adjusting of other model characteristics, called “hyperparameters.”
Notoriously, ML models can “overfit” to training data or validation data, so a separate or “held-out” testing dataset, to which neither the model nor the person(s) training it is ever exposed, is also required for the final test of a model’s performance. If a model overfits to the training or validation data, it is likely identifying confounding patterns in the data not directly related to the target pathology, such as a particular technologist’s radiopaque laterality marker on chest radiographs or the noise pattern on the CT scanner from a given hospital’s emergency department or ICU.
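A rough Python sketch of the training, validation, and held-out test split follows. Splitting by patient rather than by individual study is an assumption on my part, not something described above; the idea is simply that studies from one patient should not land on both sides of a split. The function name, split fractions, and identifiers are all hypothetical.

```python
import random


def split_by_patient(study_to_patient, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign each study to 'train', 'val', or 'test' based on its patient ID."""
    patients = sorted(set(study_to_patient.values()))
    random.Random(seed).shuffle(patients)  # fixed seed keeps the split reproducible

    n_test = max(1, round(len(patients) * test_frac))
    n_val = max(1, round(len(patients) * val_frac))
    test_ids = set(patients[:n_test])
    val_ids = set(patients[n_test:n_test + n_val])

    splits = {"train": [], "val": [], "test": []}
    for study, patient in sorted(study_to_patient.items()):
        if patient in test_ids:
            splits["test"].append(study)   # set aside; used once, at the very end
        elif patient in val_ids:
            splits["val"].append(study)    # used for tuning hyperparameters
        else:
            splits["train"].append(study)  # used for fitting model weights
    return splits


# Hypothetical dataset: 20 patients with 2 studies each.
study_to_patient = {
    f"study_{p}_{k}": f"patient_{p}" for p in range(20) for k in range(2)
}
splits = split_by_patient(study_to_patient)
print({name: len(studies) for name, studies in splits.items()})
```

When evaluating a product, a useful question is how the developer constructed the test set and whether it was truly untouched until the final evaluation.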
You may also want to be familiar with the different types of ML algorithms that are out there and the types of tasks on which they typically perform well. Broadly speaking, ML algorithms are categorized into “supervised learning” — where the algorithm is trained on labeled data — and “unsupervised learning” — where the training data is not labeled. Most healthcare ML models are based on supervised algorithms.
In the realm of image analysis, you’ll typically hear more about deep learning and various convolutional neural network (CNN) architectures, such as ResNet for classification and U-Net for segmentation. For text-based applications or natural language processing, you’ll encounter recurrent neural networks (RNNs) and Bayesian networks. For tabular data, you may also come across decision trees and random forests. Having a basic familiarity with this terminology may be helpful in evaluating the suitability of a given algorithm for its intended task.
Once a trained model is available, the next critical step is performing a comprehensive analysis of model successes and failures. A model may be (a) working well, but for the wrong reasons, or (b) failing on an important subset of cases. Either way, safe and effective implementation depends on a thorough understanding of failure modes and their causes.
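One simple form of failure analysis is to break performance down by subgroup instead of reporting a single overall number. The sketch below is only an illustration: the subgroup labels (two scanners), the case counts, and the function name are all hypothetical.

```python
from collections import defaultdict


def sensitivity_by_subgroup(cases):
    """cases: iterable of (subgroup, true_label, predicted_label), labels 0 or 1."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for subgroup, truth, pred in cases:
        if truth == 1:  # sensitivity only considers truly positive cases
            if pred == 1:
                tp[subgroup] += 1
            else:
                fn[subgroup] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in set(tp) | set(fn)}


# Hypothetical audit: positives from two scanners, plus some negatives.
cases = (
    [("scanner_A", 1, 1)] * 9 + [("scanner_A", 1, 0)] * 1
    + [("scanner_B", 1, 1)] * 3 + [("scanner_B", 1, 0)] * 2
    + [("scanner_A", 0, 0)] * 20
)

by_group = sensitivity_by_subgroup(cases)
for group in sorted(by_group):
    print(f"{group}: sensitivity {by_group[group]:.2f}")
```

In this made-up audit the overall sensitivity is 12 of 15 (80%), which hides the fact that the model finds 90% of positives on one scanner and only 60% on the other. The same breakdown could be applied to any subgroup you suspect matters, such as site, patient age, or study protocol.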
One cause of algorithm failure is called “data drift.” Data drift refers to a shift in the underlying distribution of data over time, usually due to a number of variables that are difficult to pinpoint. In radiology, one could imagine subtle changes to noise patterns on CT scans due to changes in the reconstruction algorithm or adjustments to technique intended to reduce radiation dose. Many ML models are deployed as static models (indeed, the current FDA approval pathway requires this for marketed ML tools), so these models can be quite susceptible to data drift.
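As a very rough illustration of what checking for data drift could look like, the sketch below compares a recent batch of a per-study image statistic against a baseline batch. Everything here is an assumption for the sake of the example: the "noise" measurement, the values, and the choice of a simple standardized mean difference. Real drift monitoring would likely use more robust methods.

```python
import statistics


def drift_z_score(baseline, recent):
    """Standardized difference between the recent mean and the baseline mean."""
    base_mean = statistics.mean(baseline)
    base_sd = statistics.stdev(baseline)            # sample standard deviation
    standard_error = base_sd / len(recent) ** 0.5   # expected spread of a batch mean
    return (statistics.mean(recent) - base_mean) / standard_error


# Hypothetical per-study noise measurements (arbitrary units).
baseline_noise = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.0]
recent_noise = [11.0, 11.3, 10.9, 11.2]  # e.g., after a reconstruction change

z = drift_z_score(baseline_noise, recent_noise)
print(f"z-score of recent batch vs. baseline: {z:.1f}")
if abs(z) > 3:  # the cutoff of 3 is an arbitrary illustrative choice
    print("Input distribution may have shifted; review model performance.")
```

A large shift in an input statistic does not prove the model has degraded, but it is a reasonable trigger for taking a closer look.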
Since the current state-of-the-art in deep learning remains highly susceptible to degradation in performance, I also recommend that you approach any new ML venture with a plan for ongoing monitoring of the model’s performance with periodic comprehensive failure analysis. Such analysis will help you detect when the performance dips below acceptable levels early on and, hopefully, avoid major failures.
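A monitoring plan can start out very simply. The sketch below assumes you periodically audit a sample of cases with confirmed ground truth and want to flag any period in which sensitivity falls below a level you have agreed is acceptable. The period labels, the 0.85 cutoff, and the function name are hypothetical.

```python
def flag_performance_dips(batches, min_sensitivity=0.85):
    """batches: list of (period, [(truth, pred), ...]); return periods below the cutoff."""
    flagged = []
    for period, results in batches:
        tp = sum(1 for truth, pred in results if truth == 1 and pred == 1)
        fn = sum(1 for truth, pred in results if truth == 1 and pred == 0)
        if tp + fn == 0:
            continue  # no confirmed positives to audit in this period
        sensitivity = tp / (tp + fn)
        if sensitivity < min_sensitivity:
            flagged.append((period, sensitivity))
    return flagged


def audit(n_detected, n_missed):
    """Build a hypothetical audit batch of confirmed-positive cases."""
    return [(1, 1)] * n_detected + [(1, 0)] * n_missed


batches = [
    ("month_1", audit(19, 1)),  # 95% sensitivity
    ("month_2", audit(18, 2)),  # 90% sensitivity
    ("month_3", audit(16, 4)),  # 80% sensitivity
]

for period, sensitivity in flag_performance_dips(batches):
    print(f"{period}: sensitivity {sensitivity:.2f} is below the acceptable level")
```

Here only the third period is flagged. In practice you would likely track specificity and other metrics the same way, and pair any flagged period with the kind of failure analysis described earlier.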
Step 3: Broaden Your Clinical Skillset
After preparing to assess potential ML tools for clinical implementation, imagine yourself in a radiology department in the future. Your department has multiple ML tools in place and you’re finally seeing the gains in efficiency that have been promised since you were a resident. How will you spend the time you’ve gained as a result?
There may be some increase in volume that arises due to improvements in workflow efficiency, but eventually the gains in interpretive efficiency will probably outpace the increase in volume. When that happens, the job of a diagnostic radiologist will necessarily change. There is also a perhaps more urgent pressure to adapt our jobs to the long-promised (or threatened, depending on your perspective) shift to quality-based reimbursement. The most urgent and logical use of extra time afforded by technological gains in interpretive efficiency is to focus on ways to improve the quality of care we deliver to patients and our referring colleagues.
There are a number of admirable efforts out there championing greater involvement with patients and innovative approaches to interdisciplinary collaborations — including the ACR Patient- and Family-Centered Care initiative and Imaging 3.0. Many of these may very well become common practice when radiologists are “liberated from the reading room.”
Step 4: Dig Deeper
For those with a desire to delve deeper into the technical aspects of deep learning, I recommend reviewing previous ACR RFS Journal Club events with recordings on the RFS YouTube Channel, including my own session, “Hands-on Session for the Non-technical Beginner with Model Building in Kaggle.”
For those who want to engage actively in the development of ML technologies for radiology and medical imaging, I recommend starting with a solid foundation in informatics, so you can better understand how these technologies might be implemented within the medical imaging workflow. The National Imaging Informatics Course and Curriculum (NIIC), co-sponsored by the RSNA and the Society for Imaging Informatics in Medicine (SIIM), is a great place to start. Many radiology training programs offer standalone or integrated clinical informatics fellowships, providing more comprehensive training in imaging informatics.
Finally, if you want to try your hand at building and training an ML model, try out the ACR Data Science Institute’s ACR AI-LAB™ platform. It’s a great way to get involved in ML development without having to learn how to code. Plus, there’s no better way to learn than by getting your hands dirty.
Alternatively, consider forming a team and building a model for a past, present, or future Kaggle Challenge. The recent SIIM-ACR Pneumothorax Segmentation Challenge and the current RSNA Intracranial Hemorrhage Detection Challenge are great examples. Even if the competition has ended, you can try to recreate one of the winning solutions (posted on the Discussion board) or create something new!
Walter Wiggins, MD, PhD | Neuroradiology Fellow, Duke University Hospital
References
1. Esteva A, Robicquet A, Ramsundar B, et al. A guide to deep learning in healthcare. Nat Med. 2019 Jan;25(1):24-29.
2. Zaharchuk G, Gong E, Wintermark M, Rubin D, Langlotz CP. Deep learning in neuroradiology. Am J Neuroradiol. 2018 Oct;39(10):1776-1784.
As radiologists, we strive to deliver high-quality images for interpretation while maintaining patient safety, and to deliver accurate, concise reports that will inform patient care. We have improved image quality with advances in technology and attention to optimizing protocols. We have made a stronger commitment to patient safety, comfort, and satisfaction with research, communication, and education about contrast and radiation issues. But when it comes to radiology reports, little has changed over the past century.