Benefits and Dangers of Synthetic Data for Radiology AI

December 12, 2022 | Bryant Chang, BS, MS, MS3; Tessa Cook, MD, PhD, CIIP, FSIIM, FCPP

Artificial intelligence (AI) has the potential to dramatically change the practice of radiology. Developing AI models requires accurate and reliable training data, which can be difficult and expensive to obtain. Synthetic data, meaning artificially generated data that is not produced by scanning an actual patient, can address this need, but it brings its own set of challenges.

Synthetic Data in Radiology Today

Techniques to generate synthetic data exist today for modalities such as MRI and CT:

• Synthetic MRI. This involves acquiring preliminary MRI sequences and using them to generate new images based on parametric maps of tissue properties. This technique can decrease scan time and obviate the need for gadolinium-based contrast agents. Synthetic MRI image quality has been shown to be comparable to that of conventional contrast-weighted images for the identification of brain pathology, and its diagnostic utility is similar to that of conventional MRI.
• Synthetic CT. The recent iodinated contrast media shortage highlighted the potential need for synthetic CT. “CT-like” images synthesized from MRI sequences can reduce patient radiation exposure and have the potential to replace CT in the identification of certain musculoskeletal findings.

Techniques for Generating Purely Synthetic Data

Unlike existing synthetic data generation approaches, which require some imaging of the patient of interest as preliminary data, newer techniques can generate truly synthetic data that is not tied to a particular patient.

Generative adversarial networks (GANs) consist of two neural networks trained in opposition to each other. The “generator” network produces new data that resembles the actual training data. The “discriminator” network evaluates samples to determine whether each is “true” data or synthesized data. As the two networks work against one another, with the generator creating new data and the discriminator attempting to recognize the generator’s output, the quality of the synthetic data improves over many training iterations. The ultimate goal is for synthetic data to be indistinguishable from real data.
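The generator/discriminator loop described above can be sketched on a toy problem. The example below is only an illustration, not a method from any study mentioned here: the "real data" is a one-dimensional Gaussian rather than an image, the generator is a single linear function, and the discriminator is a single logistic unit. All names and constants (REAL_MEAN, REAL_STD, train_toy_gan, the learning rate) are assumptions chosen for the sketch.

```python
import math
import random

# Hypothetical "real" data distribution for the toy example.
REAL_MEAN = 4.0
REAL_STD = 0.5


def sigmoid(s: float) -> float:
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps: int = 5000, lr: float = 0.01, seed: int = 0):
    """Alternate discriminator and generator updates on 1-D data.

    Generator:      G(z) = a * z + b, with z drawn from N(0, 1)
    Discriminator:  D(x) = sigmoid(w * x + c), the estimated
                    probability that x is real.
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.1, 0.0  # discriminator parameters

    for _ in range(steps):
        x_real = rng.gauss(REAL_MEAN, REAL_STD)
        z = rng.gauss(0.0, 1.0)
        x_fake = a * z + b

        # Discriminator step: gradient ascent on
        # log D(x_real) + log(1 - D(x_fake)).
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w += lr * ((1.0 - d_real) * x_real - d_fake * x_fake)
        c += lr * ((1.0 - d_real) - d_fake)

        # Generator step: gradient ascent on log D(G(z)),
        # i.e. try to make the discriminator call the fake "real".
        d_fake = sigmoid(w * x_fake + c)
        grad_x = (1.0 - d_fake) * w  # d/dx of log D(x) at x_fake
        a += lr * grad_x * z
        b += lr * grad_x

    return (a, b), (w, c)


if __name__ == "__main__":
    (a, b), (w, c) = train_toy_gan()
    print(f"generator: G(z) = {a:.3f} * z + {b:.3f}")
    print(f"discriminator: D(x) = sigmoid({w:.3f} * x + {c:.3f})")
```

In this simplified setup the generator offset b typically drifts toward the mean of the real data, although adversarial training can oscillate rather than settle. Because a single logistic unit mostly detects a shift in location, this sketch should not be expected to match the spread of the real data; image GANs use deep networks for both roles.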

Stable Diffusion is a recently released text-to-image deep learning model that is gaining popularity for synthetic data creation. It takes a text description as input and produces a matching image.

Advantages of Synthetic Data

One of the biggest advantages of purely synthetic data is that it does not contain any actual patients’ protected health information, since images are not produced by scanning real individuals. This makes synthetic data more easily shareable among institutions for research, allowing for greater collaboration and innovation.

GANs can be used to augment training datasets. Researchers have used GANs to generate synthetic chest X-ray images to train models to detect COVID-19. Model accuracy was 85% before data augmentation and improved to 95% once the additional synthetic images were used. More than one group has reported that synthetic images can be indistinguishable from images acquired on scanners.

Synthetic data can also be generated to address instances where there is a scarcity of real data in patients who meet certain criteria or have particular conditions. Robust, heterogeneous datasets reflecting a diversity of patient characteristics, imaging equipment, and geographic locations are essential to minimizing data-related bias introduced into AI models. Artificial datasets can be designed to reflect nearly all possible scenarios. For example, researchers were able to generate synthetic abnormal MRI images with brain tumors. This allowed the team to use synthetic MRI images to train automated tumor segmentation software and enhance the accuracy of segmentations on real data.
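As a rough illustration of how a team might decide where synthetic data is needed, the sketch below counts the cases per label in a dataset and reports how many synthetic examples each label would need to reach a target count. The function name, the labels, and the counts are hypothetical and are not taken from the studies described above.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Optional


def synthetic_samples_needed(
    labels: Iterable[Hashable],
    target_per_class: Optional[int] = None,
) -> Dict[Hashable, int]:
    """Return how many synthetic samples each class needs to reach the target.

    If no target is given, every class is brought up to the size of the
    largest class. Classes already at or above the target need zero.
    """
    counts = Counter(labels)
    if not counts:
        return {}
    if target_per_class is None:
        target_per_class = max(counts.values())
    return {label: max(0, target_per_class - n) for label, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical label distribution for a brain MRI dataset.
    labels = ["normal"] * 900 + ["glioma"] * 80 + ["meningioma"] * 20
    print(synthetic_samples_needed(labels))
```

Balancing by count alone is a simplification. In practice the synthetic cases would also need to vary across patient characteristics, scanners, and sites for the dataset to be heterogeneous in the sense described above.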

Synthetic data can mitigate other biases found in real data. Datasets used to train dermatology AI models have been found to be heavily skewed towards lighter skin tones on the Fitzpatrick scale. Models trained with such data do not perform as well when applied to patients at the other end of the scale. Researchers used GANs to create clinical images of skin conditions that closely resembled real skin pathologies, and were able to vary the size, location, and underlying skin color for each disease. The team was then able to train a dermatologic disease classifier that performed comparably to their baseline model but also more frequently classified rare skin malignancies.
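A performance gap of this kind is usually found by reporting accuracy separately for each subgroup rather than as one overall number. The sketch below shows one minimal way to do that. The grouping by Fitzpatrick range and the example predictions are invented for illustration only.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def accuracy_by_group(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    groups: Sequence[Hashable],
) -> Dict[Hashable, float]:
    """Compute classification accuracy separately for each subgroup."""
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("y_true, y_pred and groups must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


if __name__ == "__main__":
    # Hypothetical labels (1 = malignant, 0 = benign) and model predictions.
    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 1, 0, 1, 1]
    groups = ["I-II", "I-II", "I-II", "V-VI", "V-VI", "V-VI"]
    for group, acc in accuracy_by_group(y_true, y_pred, groups).items():
        print(f"Fitzpatrick {group}: accuracy {acc:.2f}")
```

Running the same check before and after adding synthetic images gives a simple way to see whether augmentation narrowed the gap for underrepresented groups.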

Recently, researchers fine-tuned Stable Diffusion using a large dataset of chest radiographs, and used descriptions of findings, such as “large pleural effusion,” to produce realistic synthetic chest radiographs depicting the specified abnormality.

Risks of Synthetic Data

Synthetic data is not without limitations and potential pitfalls. The quality of the artificially generated data depends heavily on the quality of the model that created it, as well as the dataset from which it was developed. Furthermore, the process of creating synthetic data requires exhaustive curation of medical data, with ground-truth labels and additional verification steps, such as comparing artificial datasets with real-world data to ensure reliability of results. In some cases, synthetic data might not be sufficiently accurate to substitute for real data.
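One simple form such a comparison could take is a two-sample Kolmogorov-Smirnov statistic on a summary value computed for every image, for example mean pixel intensity. The sketch below is an assumption about how this check might look, not a procedure prescribed by the article; the function name and the example values are hypothetical. The statistic is the largest gap between the two empirical cumulative distributions: 0 means the samples are distributed identically, and 1 means they do not overlap at all.

```python
from typing import Sequence


def ks_statistic(real: Sequence[float], synthetic: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical
    cumulative distribution functions of the two samples.
    """
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")

    a = sorted(real)
    b = sorted(synthetic)
    i = j = 0
    largest_gap = 0.0
    while i < len(a) and j < len(b):
        # Step to the next distinct value and move both pointers past it,
        # so that tied values are handled together.
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap


if __name__ == "__main__":
    # Hypothetical per-image mean intensities.
    real_means = [0.41, 0.44, 0.47, 0.50, 0.52]
    synthetic_means = [0.43, 0.46, 0.49, 0.55, 0.58]
    print(f"KS statistic: {ks_statistic(real_means, synthetic_means):.2f}")
```

A low value on one summary statistic does not show that synthetic images are clinically equivalent to real ones. It is only a first screen, and expert review against ground-truth labels remains necessary.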

Synthetic data might also not be complex and nuanced enough to mimic real data. Artifacts in synthetic data may differ from those seen in real data and may occur more frequently. Synthetic images of certain modalities, such as MRI for neuroimaging, are still inferior to their real counterparts. The creation, curation, and utilization of synthetic data is still a relatively new field, with a host of complications that must be addressed.

Conclusions

Synthetic data will play a large role in medical research and informatics for the foreseeable future. A review by Gartner predicts that by 2030, the total amount of artificially generated data will be greater than real data. The synthetic data used today relies on preliminary data acquired from patients and contributes to clinical decision making. Synthetic data generated de novo, by contrast, could play an indirect role in future practice.

Present-day AI tools vary in accuracy for several reasons, one of which is the availability of appropriate training data during model development. Synthetic data can fill gaps where real data does not yet exist or cannot be used due to privacy concerns. As interest in the positive uses of synthetic data grows, it is also important for researchers and clinicians to be aware of its potential drawbacks and of the lack of consensus surrounding its use. Improving AI models that assist radiologists is one of the many ways synthetic data will impact future practice.

Bryant Chang, BS, MS, MS3 | Drexel University College of Medicine

Tessa S. Cook, MD, PhD, CIIP, FSIIM, FCPP | Associate Professor of Radiology | Vice Chair of Practice Transformation, Department of Radiology | Perelman School of Medicine at the University of Pennsylvania