How to Share Big Data for Artificial Intelligence

When discussing large datasets intended for automated or semi-automated analysis, “big data” might be a misnomer; perhaps a better name would be “unwieldy data.” Why unwieldy? Because at a certain size of dataset — say all the imaging reports over a decade's span — it becomes impossible to manually categorize those reports for use in trend-detection and longitudinal computer analysis.

To be usable, massive datasets must be trimmed and categorized — largely by computer algorithms. That’s where ethical sharing of patient data comes in. Proper oversight of those data management algorithms is critical to ensure protection of sensitive patient data, especially when it comes to health systems that are sharing their information with AI and big data companies.

Here’s why it matters. A health system’s voluminous stream of protected health information (PHI) has become an increasingly valuable source of data for third-party vendors to use as a springboard for AI in medicine. Sharing that PHI creates several challenges for health care providers to navigate:

• Who will oversee data integrity and stewardship?
• How will the scope of the project influence the most appropriate data subsets?
• What kind of data will be shared and how accessible is it?
• How is the data anonymized?
• How will data sharing occur?
• What are the downstream implications of your data's use?

How Baptist Health Tackled the Challenge

Since 2016, Baptist Health South Florida (BHSF) has been a key partner of IBM Watson Health. As one of only a handful of organizations partnering with IBM, BHSF took the collaboration very seriously with regard to data integrity and patient privacy.

Based on BHSF’s experience, here are six common-sense questions to work through if your organization is considering partnering with a third-party provider on an AI initiative:

1. Who will oversee data integrity and stewardship? At BHSF, there is a Chief Data Officer (CDO) whose purview includes business intelligence, data analytics, and the details of third-party data sharing. Whereas in 2012, surveys showed that only 12% of firms had a CDO, more recent surveys suggest that 90% of large organizations will have a CDO by the end of next year. Having a C-suite level voice at the table (whose primary role is to govern data acquisition, curation, and sharing) is critical to any collaboration with a third-party vendor. In the absence of a CDO, a similarly empowered administrator should take on the steward's role in a data-sharing venture.

2. How will the scope of the project influence the most appropriate data subsets? Much as an Institutional Review Board acts as a steward for research, a health system should take responsibility for granting sensible access to a select portion of all the PHI available from the medical record. Just as it would be difficult for a researcher to ask for every patient's medical record over the course of five years, so too should a conversation with the third-party vendor begin with a discussion of which subset of patient data is most relevant to the AI or big data effort, such as specific demographics, ICD-9/10 codes, admitting diagnoses, or keywords within the data (such as "aortic stenosis" within echocardiography reports).
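As a rough illustration of scoping a subset by diagnosis code or keyword, the following minimal sketch filters a list of reports. The Report structure, the select_subset function, and the sample codes are hypothetical and chosen only for illustration; they do not describe BHSF's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Report:
    # Hypothetical, simplified view of one report in a data warehouse
    patient_id: str
    icd_codes: tuple
    text: str


def select_subset(reports, icd_prefixes, keywords):
    """Keep reports that match any ICD code prefix or contain any keyword."""
    prefixes = tuple(p.upper() for p in icd_prefixes)
    lowered = [k.lower() for k in keywords]
    selected = []
    for report in reports:
        code_hit = any(code.upper().startswith(prefixes) for code in report.icd_codes)
        text_hit = any(k in report.text.lower() for k in lowered)
        if code_hit or text_hit:
            selected.append(report)
    return selected


# Illustrative sample data only
reports = [
    Report("a", ("I35.0",), "Normal valve motion."),
    Report("b", ("J18.9",), "Severe aortic stenosis noted."),
    Report("c", ("J18.9",), "Clear lungs."),
]
subset = select_subset(reports, ["I35"], ["aortic stenosis"])
print([r.patient_id for r in subset])
```

Matching on code prefixes as well as free-text keywords is one possible design; a real project would agree on the exact inclusion criteria with the data governance committee before any query is run.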

3. What kind of data will be shared and how accessible is it? At each health care site, there may be several "sources of truth" for patient data — a radiology information system, a hospital information system, an outpatient electronic medical record, etc. Ideally, the hospital has invested in a data warehouse that serves as a centralized repository for all medical data, structured to allow data analytics and, therefore, business intelligence insights. This lends itself easily to access for a big data project. However, even with a data warehouse, there may be images — for example, endoscopy, dermatology photographs, digital pathology slides, radiology images, and ECG strips — that are stored in various other locations. That information is likely more difficult to access and include in the shared datasets.

4. How is the data anonymized? It would be naive to think that a connection between a data warehouse and a third party could be anonymized by toggling a large button that reads "Convert to Anonymous." While PHI can be consistently found in certain data fields, it also shows up in surprising locations. Images might have burned-in DICOM annotations that include PHI; imaging reports and medical notes might include a name or a date of birth in an atypical location.

At BHSF, the approach was to anonymize data on the hospital side (rather than on the IBM side) to maintain control of the process. The NIH's National Library of Medicine has a free, HIPAA-compliant de-identification tool, the NLM-Scrubber, which uses natural language processing to remove identifiers from text-based medical records. After using the tool, BHSF conducted a “PHI safety check” on a sample of records from the data warehouse (approximately 1% of the output) to ensure data is correctly anonymized to the satisfaction of a data governance committee.
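A sample-based safety check of this kind could be sketched as follows. This is an assumption-laden illustration, not BHSF's procedure: the function names, the 1% default, and the two regular expressions are hypothetical, and the patterns catch only a narrow slice of possible identifiers. It does not call or imitate the NLM-Scrubber itself.

```python
import random
import re

# Hypothetical patterns for leftover identifiers; a real check would be far broader
RESIDUAL_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "mrn": re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE),
}


def draw_audit_sample(record_ids, fraction=0.01, seed=None):
    """Randomly pick roughly `fraction` of the records (at least one) for human review."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(record_ids)
    if not ids:
        return []
    size = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, size)


def flag_residual_phi(text):
    """Return the names of the patterns that still match de-identified text."""
    return sorted(name for name, pattern in RESIDUAL_PATTERNS.items() if pattern.search(text))


ids = [f"rec-{i}" for i in range(1000)]
sample = draw_audit_sample(ids, seed=7)
print(len(sample))
print(flag_residual_phi("Seen on 03/14/2015, MRN: 12345"))
```

An automated pattern scan like this can only prioritize records for reviewers; the decision that a sample is acceptably anonymized would still rest with the data governance committee.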

Radiology images for the IBM-BHSF collaboration required a separate process — the DICOM headers were anonymized, any non-DICOM data was removed (e.g., scanned paperwork), and the final product was hand-inspected for true anonymization. Because the imaging dataset was smaller, on the order of hundreds to thousands, it was possible for human eyes to quickly scan each dataset to make sure no burned-in DICOM annotations were included on the images.
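The header step might look something like the sketch below, which treats a DICOM header as a plain dictionary keyed by attribute keyword. That representation, the scrub_header function, and the short attribute list are assumptions made for illustration; reading real DICOM files requires a dedicated library, and a production profile would cover many more attributes.

```python
# Illustrative subset of DICOM attribute keywords that commonly carry identifiers
IDENTIFYING_KEYWORDS = frozenset({
    "PatientName",
    "PatientID",
    "PatientBirthDate",
    "PatientAddress",
    "AccessionNumber",
    "InstitutionName",
    "ReferringPhysicianName",
})


def scrub_header(header, replacement=""):
    """Return a copy of the header with identifying attributes blanked out."""
    return {
        keyword: (replacement if keyword in IDENTIFYING_KEYWORDS else value)
        for keyword, value in header.items()
    }


header = {"PatientName": "DOE^JANE", "PatientID": "123", "Modality": "CT"}
clean = scrub_header(header)
print(clean)
```

Blanking header fields does nothing about text burned into the pixels, which is why the hand inspection described above remains a separate step.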

Ultrasonography and echocardiography imaging studies were excluded from the initial data-sharing agreement, primarily because they pose particularly difficult challenges with anonymization (most of the equipment used at BHSF burns the PHI into the upper margin of each image). In the future, adding black strips along the top of image frames (so-called "masking") might allow true anonymization of ultrasonographic studies.
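The masking idea can be sketched in a few lines, assuming each frame is a list of pixel rows and that the burned-in text sits entirely within a known number of rows at the top. Both assumptions, along with the mask_top_strip name, are hypothetical; actual strip heights vary by equipment and would need to be verified per device.

```python
def mask_top_strip(frame, strip_rows, fill=0):
    """Return a copy of the frame with the top `strip_rows` rows overwritten by `fill`."""
    if strip_rows < 0:
        raise ValueError("strip_rows must be non-negative")
    rows = min(strip_rows, len(frame))
    # Replace the top rows with fill values of the same width
    masked = [[fill] * len(row) for row in frame[:rows]]
    # Copy the remaining rows unchanged
    masked.extend(list(row) for row in frame[rows:])
    return masked


frame = [[1, 2], [3, 4], [5, 6]]
masked_frame = mask_top_strip(frame, 1)
print(masked_frame)
```

The function returns a new frame rather than editing in place, so the original pixels stay inside the hospital's systems untouched while only the masked copy is considered for sharing.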

Since no system is perfect, the data-sharing agreement also includes instructions for the receiving vendor to delete any imperfectly anonymized studies and alert the data governance committee.

5. How will the data sharing occur? Because of the steps outlined above, the BHSF-IBM collaboration uses a "push only" system for data sharing. The vendor cannot directly pull data from the data warehouse. Instead, BHSF creates a batch of de-identified data corresponding to particular ICD-9/10 codes and pushes it via encrypted, secure File Transfer Protocol (FTP) to the vendor.
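One way to picture the hospital-side half of a push-only workflow is the sketch below, which assembles a batch for chosen codes and refuses any record not marked as de-identified. The record layout, the "deidentified" flag, and the build_push_batch function are hypothetical, and the encrypted transfer itself is deliberately left out.

```python
import hashlib
import json


def build_push_batch(records, icd_codes):
    """Assemble a newline-delimited JSON payload and a manifest for the chosen codes."""
    wanted = set(icd_codes)
    lines = []
    for record in records:
        if not wanted.intersection(record["icd_codes"]):
            continue
        # Refuse to batch anything that has not passed de-identification
        if not record.get("deidentified", False):
            raise ValueError(f"record {record['record_id']} has not been de-identified")
        shared = {
            "record_id": record["record_id"],  # assumed to be a surrogate key, not an MRN
            "icd_codes": record["icd_codes"],
            "text": record["text"],
        }
        lines.append(json.dumps(shared, sort_keys=True))
    payload = ("\n".join(lines) + "\n").encode("utf-8") if lines else b""
    manifest = {
        "record_count": len(lines),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    return payload, manifest


# Illustrative sample data only
records = [
    {"record_id": "r1", "icd_codes": ["I35.0"], "text": "x", "deidentified": True},
    {"record_id": "r2", "icd_codes": ["J18.9"], "text": "y", "deidentified": True},
]
payload, manifest = build_push_batch(records, ["I35.0"])
print(manifest["record_count"])
```

Including a checksum in the manifest would let the vendor confirm that the batch arrived intact, and failing loudly on an unflagged record keeps the decision about what leaves the warehouse on the hospital side.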

One can imagine a future where the de-identification tools are so reliable and the "walled garden" of data for the vendor is so well-circumscribed that a vendor might be allowed to pull data continuously from a data warehouse feed. This would reduce the need for human intervention in providing data and allow constantly updated data for the third party. However, it also raises the uncomfortable specter of the inadvertent release of non-anonymized data or accidental access to more data than originally agreed.

6. What are the downstream implications of your data's use? By nature, AI is built on historical data from a large number of patients, and the resulting algorithms will be applied to an even larger pool of new patients who might benefit from the technology. After data is shared, it may be several years before issues arise, and they may come as a surprise. Data quality problems were partially to blame for MD Anderson’s failed AI project for cancer diagnostics powered by IBM Watson. BHSF's partner, Watson for Oncology, was also recently highlighted as having delivered suboptimal recommendations on cancer patients when using algorithms derived from data provided by Memorial Sloan Kettering Cancer Center.

The questions then become:

• Should there be agreements on disclosure of the exact datasets used and transparency of the artificial intelligence process in generating recommendations intended to be used on new patients?
• Is there an obligation for the health system data provider to allow perpetual access to the historical data (and perhaps access to new data from those same historical patients on an ongoing basis) in order to improve and perfect AI-based clinical algorithms?
• Is there a responsibility or a particular set of standards to adhere to in order to make sure the data annotation and structuring were done correctly, in a way that is faithful to the AI-based algorithm? Almost by definition, the data provider is helping annotate and structure data. By identifying a patient as a "metastatic lung cancer" patient, for example, the health system has designated that patient's chart in a helpful way for the vendor.
• If a health system is often expected to share malpractice liability for errors committed on its premises or by its practitioners, how does that expectation extend to algorithms developed with a health system's patients as the foundational data source?

These and many more thorny questions arise from the entanglement of a health system with a third-party vendor's algorithms.

Get Prepared Before Jumping In

As large enterprises in health care operationalize the collection, curation, and analysis of data, there will be increasing opportunities to use AI to improve the business intelligence, clinical decision support, patient management, and patient outcomes for a health system. The unwieldy nature of the data, however, might make it impossible to move quickly when an opportunity presents itself.

If your organization is considering moving forward on this front, spend some time thinking through your answers to the questions put forth here early in the process. Simply put, ethical sharing of big data takes preparation and thought. It’s best to start early and consider the details in the data-sharing agreement and process well in advance.


By Juan Carlos Batlle, MD, MBA, M. Bioethics | Associate Professor, FIU College of Medicine | Chief of Thoracic Imaging, Baptist Health South Florida

