In radiology AI, generalizability means that a decision-making machine can provide the same quality of decisions about new, previously unseen data in different settings, with different scanners, protocols, and groups of patients.
For radiology ML that evaluates images, also known as computer vision machine learning (CVML), generalization is a vexing hurdle, because the new images and other medical data fed to a CVML model are never exactly the same as the data the model was trained on. This is known as distribution shift, also called dataset shift. Covariate shift, in which only the input data change, is one common form.
New settings and the effects of time often impact performance
Suppose I build a model that makes decisions like an A+ radiologist when I use exams conducted with my specific image acquisition machines, my protocols, and my patients’ unique phenotypes. If you install my model in your setting, with your different brands or versions of imaging machines, different protocols, or different patient populations, it might continue to work like an A+ radiologist, but it might also work like a C, D, or even a failing radiologist.
In my own setting, if I change protocols, update scanner software, get a new X-ray tube, or change something about the patient population, my model could fall to average or even worse-than-average performance. Even if a CVML model works well at the start, the imaging data generated will inevitably drift over time due to changes such as:
• Protocol tweaks
• New image acquisition and reconstruction software and hardware
• Patient phenotypes and genotypes
• Demographic variations
• Social, cultural, and environmental modifications
Slowly but surely, our decision-making machines will do worse as time passes.
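One simple way to watch for this kind of input drift is to compare a summary statistic of recent exams against a baseline period. The sketch below is only an illustration, not an established monitoring method: it computes a two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distributions) in plain Python. The choice of feature (mean pixel intensity per exam), the sample values, and the 0.3 alert threshold are all assumptions made for the example.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples (0 = identical, 1 = disjoint)."""
    if not reference or not current:
        raise ValueError("both samples must contain at least one value")
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


def drift_detected(reference, current, threshold=0.3):
    """Flag drift when the KS statistic exceeds a threshold.

    The default threshold is a hypothetical placeholder; a real deployment
    would need to choose it from local data.
    """
    return ks_statistic(reference, current) > threshold


# Hypothetical per-exam mean pixel intensities before and after a scanner software update.
baseline = [100.2, 98.7, 101.5, 99.9, 100.8, 97.6, 102.1, 100.0]
after_update = [108.4, 110.1, 107.9, 109.5, 111.2, 108.8, 110.6, 109.0]
print(drift_detected(baseline, after_update))
```

A single summary feature like this will miss many kinds of drift, so a check of this type is at best a first alarm, not proof that a model is still working.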
Concept drift is another challenge for AI
A related topic is concept drift, which is a change over time in the relationship between input data and the output decision, or output data. COVID-19 offers an example. Suppose in 2019 we built a model to diagnose different types of pneumonia on chest CT, and we classified patchy ground glass opacities as influenza. This would have worked fine until early 2020 when those input findings no longer accurately predicted that answer, and instead meant an entirely new disease.
Concept drift can occur acutely, as in the case of COVID-19, or gradually or even cyclically, such as with changing seasons or economic cycles, when people have varying environmental exposures. For radiologists, changes in practice patterns over time can cause concept drift. For example, the 2016 WHO classification of central nervous system tumors changed how radiologists label some brain tumors, and the gradual evolution of high-resolution CT, together with growing knowledge and reclassification of the idiopathic interstitial pneumonias, has changed how those diseases are described.
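Because concept drift changes the relationship between inputs and the correct answer, it tends to show up as a drop in how often the model's output matches the final diagnosis. A minimal sketch of that idea follows, assuming the radiologist's final label is available for each case and is used as a stand-in for ground truth. The class name, the window size, and the 0.85 minimum agreement are hypothetical choices for illustration only.

```python
from collections import deque


class RollingAgreementMonitor:
    """Track model-versus-radiologist agreement over the most recent cases."""

    def __init__(self, window=50, min_agreement=0.85):
        # deque with maxlen drops the oldest case automatically once the window is full.
        self.outcomes = deque(maxlen=window)
        self.min_agreement = min_agreement

    def record(self, model_label, radiologist_label):
        self.outcomes.append(model_label == radiologist_label)

    def agreement_rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_drifting(self):
        # Stay quiet until the window is full, to avoid alarms on a handful of cases.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.agreement_rate() < self.min_agreement


# Hypothetical sequence: the model keeps calling ground glass opacities "influenza"
# while radiologists begin labeling the same findings "COVID-19".
monitor = RollingAgreementMonitor(window=4, min_agreement=0.75)
for _ in range(4):
    monitor.record("influenza", "influenza")
print(monitor.is_drifting())
for _ in range(2):
    monitor.record("influenza", "COVID-19")
print(monitor.is_drifting())
```

Falling agreement does not by itself distinguish concept drift from other problems, such as a scanner change, so an alert like this is a prompt to investigate, not a diagnosis.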
The best way to handle close monitoring of AI products is still under debate
Because today’s CVML products work variably depending on local data, can we trust them to provide accurate and appropriately actionable decisions in our setting? In a word, “No.” We cannot trust them blindly. Nor should we rely on FDA clearance as it exists today, since it currently offers no assurances about how well a CVML product will work on our own data.
Before using a CVML product in busy clinical practice at scale, it is imperative to understand when, how, and why the product is expected to be clinically useful and trustworthy — and under what conditions it might not work as expected. We want to verify that the product works in our individual setting, with data from any input device, which can require weeks or even months of closely monitored trial use. At a minimum, the trial should include input data from a robust portion of all imaging devices in our system that would provide data to the CVML product, using all protocol variations or tweaks used clinically.
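One practical check during such a trial is whether every device and protocol combination used clinically has actually been seen. The sketch below assumes a hypothetical inventory of (device, protocol) pairs and a list of trial exams stored as dictionaries with "device" and "protocol" keys. These names and values are illustrative, not part of any real product or registry.

```python
def uncovered_combinations(clinical_inventory, trial_exams):
    """Return the (device, protocol) pairs in clinical use that the trial has not yet covered."""
    seen = {(exam["device"], exam["protocol"]) for exam in trial_exams}
    return sorted(set(clinical_inventory) - seen)


# Hypothetical inventory of combinations that would feed the CVML product.
inventory = [
    ("CT-1", "chest"),
    ("CT-1", "abdomen"),
    ("CT-2", "chest"),
]

# Hypothetical exams processed so far during the trial.
trial = [
    {"device": "CT-1", "protocol": "chest"},
    {"device": "CT-2", "protocol": "chest"},
]

print(uncovered_combinations(inventory, trial))
```

Coverage only says a combination was tried at least once. Judging whether the product performed well on it still requires enough cases and a radiologist's review.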
Once we verify a product in our setting, it should be monitored closely and continuously for drift in the decisions the product is making. This is similar to how medical physicists evaluate radiation therapy systems, with frequent calibrations and regular full simulations, or to a clinical laboratory’s QA policies and procedures.
Because AI tools are so new, we don’t yet know how best to monitor them as they run in busy clinical practice. We don’t currently understand how to monitor output “decisions” on a continuous or semi-continuous basis, similar to performing peer review on every 10th or 50th case. These are hot research topics among highly technical academic and industry computer scientists and systems engineers, and many questions remain.
One possibility ACR is working on would monitor models with interpretive functions, using clinical data registries to record the AI model inference, radiologist agreement or disagreement, and metadata about the examination, including patient demographics, equipment manufacturer, protocol, and other relevant parameters. Data from the registry could be filtered to assess performance on individual machines, which in a busy practice could help drill down to a specific problem. Even so, CVML relies on tens or even hundreds of thousands of patterns (or features) of clusters of pixels in each image, so at some level we will need to monitor image pixel data at a much more granular scale than we’ve ever done. It is still unclear how best to do that, but radiologists must be involved to ensure the safety of our patients.
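To make the registry idea concrete, here is a minimal sketch of how recorded inferences might be grouped by scanner or manufacturer to find where agreement is low. It is not the ACR registry's actual schema or software: the RegistryEntry fields, the agreement_by helper, and the sample entries are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RegistryEntry:
    """One hypothetical registry record for a single AI inference."""
    exam_id: str
    manufacturer: str
    scanner_id: str
    protocol: str
    ai_finding: str
    radiologist_agrees: bool


def agreement_by(entries, field):
    """Radiologist agreement rate grouped by any metadata field, such as 'scanner_id'."""
    counts = defaultdict(lambda: [0, 0])  # field value -> [agreements, total cases]
    for entry in entries:
        bucket = counts[getattr(entry, field)]
        bucket[0] += int(entry.radiologist_agrees)
        bucket[1] += 1
    return {value: agree / total for value, (agree, total) in counts.items()}


# Hypothetical registry contents.
entries = [
    RegistryEntry("e1", "VendorA", "CT-1", "chest", "nodule", True),
    RegistryEntry("e2", "VendorA", "CT-1", "chest", "no finding", True),
    RegistryEntry("e3", "VendorB", "CT-2", "chest", "nodule", False),
    RegistryEntry("e4", "VendorB", "CT-2", "chest", "nodule", True),
]

print(agreement_by(entries, "scanner_id"))
```

Grouping by a different field, for example agreement_by(entries, "manufacturer") or agreement_by(entries, "protocol"), uses the same records to look at the problem from another angle.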
If you purchase a CVML product today, your vendor might offer to fine-tune it on your existing exams, because their product doesn’t generalize to your data right out of the box. “Existing” is a keyword here because, as your data change over time, it is up to you to be sure your AI model continues to work as expected.
J. Raymond Geis, MD | Senior Scientist, ACR Data Science Institute | Adjunct Associate Professor of Radiology, National Jewish Health, Denver, CO
• Blog posts on how to monitor AI in production: https://evidentlyai.com/blog
• Introductory video on generalization from Google: https://developers.google.com/machine-learning/crash-course/generalization/video-lecture