Are You Sure You Want to Share Your Data to Develop AI?

June 14, 2022 | Rebecca Driessen, MD; Nabile Safdar, MD, MPH

In the 1990s, the Massachusetts Group Insurance Commission released anonymized data on all state employees, including records of every hospital visit. In 1997, while still a computer science student at MIT, Dr. Latanya Sweeney obtained the data set and re-identified records in it, sending the governor of Massachusetts his own health records. She later showed that 87% of people in the U.S. can be uniquely identified by just three pieces of information: their 5-digit ZIP code, birthdate, and gender.
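To make the re-identification risk concrete, here is a minimal Python sketch that measures how many records in a table are unique on ZIP code, birthdate, and gender. The records are invented for illustration, and the function names are our own; this is not Dr. Sweeney's method or data, only a simple uniqueness check in the same spirit.

```python
from collections import Counter

# Hypothetical toy records: (5-digit ZIP, birthdate, gender). Not real data.
records = [
    ("02138", "1945-07-31", "M"),
    ("02138", "1945-07-31", "M"),
    ("02139", "1962-03-14", "F"),
    ("02139", "1971-11-02", "F"),
    ("02140", "1980-01-20", "M"),
]

def unique_fraction(rows):
    """Fraction of rows whose combination of values appears exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)

def smallest_group(rows):
    """Size of the smallest group of rows sharing one combination of values."""
    return min(Counter(rows).values())

print(unique_fraction(records))  # 0.6: three of the five rows are unique
print(smallest_group(records))   # 1: at least one person stands alone
```

A row that is unique on these three fields can be matched against any outside list, such as a voter roll, that carries the same fields alongside a name.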

Over 20 years later, the appropriate use and privacy of patient data remain as much of a concern as ever. Today, we’re not likely to see broad disclosures of individual data by the government, but large hospital systems use big data, including clinical, imaging, genomic, and demographic data, to drive health care innovations.

To develop AI algorithms that have widespread applicability, organizations must share anonymized patient data. Radiology practices are sometimes reluctant to share their data, in part because de-identification in imaging is notoriously difficult. Here’s a look at various methods for helping radiologists protect patient privacy and what the ACR Data Science Institute is doing to advance solutions that enable data sharing for AI development.

The Regulatory Environment

In the U.S., healthcare data is protected under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). HIPAA covers Protected Health Information (PHI), defined as individually identifiable health information that a covered entity holds or transmits in any form or medium. The HIPAA Privacy Rule also describes the circumstances under which health data can be shared with third parties once it has been de-identified.

HIPAA outlines two methods for de-identification. The first is the Expert Determination method, in which a person with appropriate knowledge of and experience with accepted statistical and scientific principles for rendering information not individually identifiable applies those principles, determines that the risk of re-identification using available information is very small, and documents the methods and results that justify this determination. The second is the Safe Harbor method, which requires the removal of 18 specific types of identifiers.

This regulatory environment informs how radiologists representing the interests of their practices, their patients, and their research subjects approach issues of privacy, consent, data ownership, and the concerns of vulnerable populations when embarking on their own AI journey together with third parties.

Special Considerations for Radiologists

Sharing data for research and innovation often hinges on de-identifying images and related data, usually by the Safe Harbor method, but de-identification in imaging is notoriously difficult. De-identifying medical images requires addressing the metadata stored in DICOM files. While several tools are available, few remove every identifier reliably, especially in large, heterogeneous data sets. Even when the DICOM metadata is de-identified, identifying information might still be “burned in” to the image pixels by modalities, in scanned reports, or by associated processing software.
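As a rough illustration of the metadata step, the Python sketch below scrubs a DICOM header that has been represented as a plain dictionary keyed by standard attribute keywords. The dictionary representation, the short keyword list, and the function names are assumptions made for this example. The list is far from a complete Safe Harbor profile, and a real pipeline would use a DICOM library and a vetted confidentiality profile.

```python
# Illustrative subset of DICOM attribute keywords that carry direct
# identifiers. This list is an assumption for the sketch, not a complete profile.
IDENTIFYING_KEYWORDS = {
    "PatientName", "PatientID", "PatientBirthDate", "PatientAddress",
    "AccessionNumber", "InstitutionName", "ReferringPhysicianName",
}

def scrub_header(header):
    """Return a copy of the header without the listed identifying attributes."""
    cleaned = {k: v for k, v in header.items() if k not in IDENTIFYING_KEYWORDS}
    # Keep only the year of the study date (DICOM dates are YYYYMMDD strings).
    if "StudyDate" in cleaned:
        cleaned["StudyDate"] = cleaned["StudyDate"][:4]
    return cleaned

def needs_pixel_review(header):
    """Flag images for pixel review unless the header says nothing is burned in."""
    return header.get("BurnedInAnnotation", "").upper() != "NO"

# Hypothetical header values, invented for the example.
header = {
    "PatientName": "DOE^JANE",
    "PatientID": "123456",
    "StudyDate": "20210405",
    "Modality": "MR",
    "BurnedInAnnotation": "YES",
}

print(scrub_header(header))
print(needs_pixel_review(header))
```

Note that scrubbing the header does nothing about text burned into the pixels, which is why the sketch treats a missing or positive BurnedInAnnotation value as a reason for separate review.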

Any imaging of the face raises further concerns. Several open-source de-facing software applications are available; however, a review of six de-facing applications for brain MRIs found that the most successful had only an 89% success rate. De-identification of radiology reports is also a challenge, as they might include PHI within their text.

With the limitations of de-identification in medical imaging, there is a need for other methods of protecting data privacy. Differential privacy and federated learning are two methods being explored:

• Differential privacy is a mathematical definition of privacy with roots in cryptography. It allows patterns from a large data set to be published, typically by adding calibrated random noise to the results, so that no individual’s personal data is distinguishable. It works best on large data sets. Because it answers queries approximately, it is useful for general statistics and pattern recognition but has limited utility for answering specific questions.
• Federated learning is another method of protecting privacy. Each participating site trains a copy of a model on its own population’s data and reports the independently trained model, rather than the data itself, back to a central server, which combines the results into a centralized model.
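The approximate answers that differential privacy gives can be sketched with the Laplace mechanism applied to a simple counting query. This is a minimal Python illustration, assuming a hypothetical list of patient ages and a privacy parameter epsilon chosen arbitrarily; the function names are our own, and a production system would rely on a vetted differential privacy library.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(values, predicate, epsilon, rng):
    """Answer 'how many values satisfy predicate?' with noise added.

    Adding or removing one person changes a count by at most 1,
    so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)

# Hypothetical ages, invented for the example.
ages = [34, 45, 52, 61, 67, 70, 71, 78, 83, 90]
rng = random.Random(2022)

# Noisy answer to "how many patients are 65 or older?" (the true count is 6).
print(private_count(ages, lambda a: a >= 65, epsilon=1.0, rng=rng))
```

A smaller epsilon means more noise and stronger privacy. On a data set of ten people the noise can swamp the answer, which is one way to see why the method works best on large data sets.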

Both approaches are promising but still face practical challenges, such as dealing with heterogeneity in distributed systems and maintaining performance in light of increased computational overhead. These challenges include finding efficient methods of communication between the central and decentralized learning models and working within the limits of the memory and processing capacity that the required computations demand.
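The federated learning loop described above can be sketched in a few lines of Python. This toy example fits a one-parameter model, y = w * x, across two hypothetical sites and combines the locally trained weights with a sample-size-weighted average. The data, learning rate, and function names are all assumptions for illustration; it is not the ACR's implementation, and real systems exchange the parameters of much larger models.

```python
def local_update(weight, data, lr=0.01, steps=20):
    """Run a few least-squares gradient steps for y = w * x on one site's data."""
    for _ in range(steps):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

def federated_average(site_weights, site_sizes):
    """Combine site models, weighting each by how many samples it trained on."""
    total = sum(site_sizes)
    return sum(w * n for w, n in zip(site_weights, site_sizes)) / total

def train(sites, rounds=10):
    """Alternate local training and central averaging; only weights are shared."""
    global_weight = 0.0
    for _ in range(rounds):
        local = [local_update(global_weight, data) for data in sites]
        global_weight = federated_average(local, [len(d) for d in sites])
    return global_weight

# Hypothetical (x, y) pairs held at two sites; the data never leaves each list.
sites = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 5.9), (4.0, 8.0), (5.0, 10.2)],
]

print(train(sites))  # lands close to 2, the slope underlying both sites' data
```

Each round costs one exchange of model parameters per site, which is where the communication overhead mentioned above comes from.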

How the ACR Data Science Institute Is Helping

The ACR Data Science Institute (ACR DSI) has been at the forefront of dealing with these challenges and others. Besides defining use cases and a dataset directory for AI development, the ACR DSI provides practical tools in a data science toolkit that radiologists can use to develop their models through the ACR AI-LAB™. The ACR DSI is also spearheading a collaborative, multi-institutional federated learning experiment utilizing a combination of central ACR servers and localized institutional datasets that are never shared with other partners.

The ACR has also been addressing some of the stickiest issues associated with working with data. In 2019, the ACR created a data sharing workgroup that identified five key elements within data sharing — informed consent, data standardization, contracts, valuation, and privacy. The workgroup proposed that a governance board might be necessary for developing a system for informed consent in data-sharing agreements, creating a uniform consent process, and determining if scenarios exist where sharing of patient data poses a low enough risk to the patient that informed consent would not be required.

Ongoing Challenges and Opportunities

Despite all the progress, challenges to data sharing remain. HIPAA and related regulations in the U.S. were codified well before the current AI environment took hold, and they are criticized both for not adequately protecting privacy and for being overly restrictive at a time when the benefits of AI in healthcare are still limited.

While federated learning shows promise for model training and validation, it is in its early stages. Researchers and ethicists are still grappling with the best way to deal with the potential bias that using unrepresentative datasets introduces into AI models. These challenges define the opportunities for improvement and innovation in the responsible development of data-driven technologies and partnerships.

Rebecca Driessen, MD | Diagnostic Radiology Resident | Emory University School of Medicine

Nabile M. Safdar, MD, MPH | Endowed Professor and Vice-Chair of Informatics, Dept. of Radiology and Imaging Sciences, Emory University; Associate Chief Medical Information Officer, Emory Healthcare