by Anthony B. Cardillo, MD; Resident Physician, PGY-2, Pathology and Laboratory Medicine, University of Rochester Medical Center


No matter the application, machine learning requires enormous amounts of input data to train models that can competently classify novel data. Let's talk about whole slide images (WSI), as they exemplify the problem of transferring large, legally protected data.


Presently, WSI databases are "siloed," or isolated. They are typically curated at a single academic institution and only occasionally (through great effort) aggregated into a large centralized database. In the United States, the legal protection of patient data under HIPAA makes data aggregation between institutions difficult. On a larger scale, the legal complexity of sharing patient data internationally makes it virtually impossible to create, in a timely manner, the global repositories of input data that will drive future models.


Perhaps more challenging for WSI specifically is the technical infrastructure required to transfer such large files. A single digitized slide is often several gigabytes. A database of 1,000 images can easily exceed a terabyte, and 1,000,000 slides could mean petabytes of data to transfer!
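As a rough back-of-envelope check (assuming, purely for illustration, an average of 2 GB per digitized slide; real slides range from hundreds of megabytes to tens of gigabytes depending on scanner resolution and compression):

```python
# Back-of-envelope estimate of WSI dataset sizes.
# The 2 GB/slide figure is an assumption for illustration only.
GB_PER_SLIDE = 2

for n_slides in (1_000, 1_000_000):
    total_gb = n_slides * GB_PER_SLIDE
    print(f"{n_slides:>9,} slides ~ {total_gb:,} GB ~ {total_gb / 1_000:,.0f} TB")

# Prints:
#     1,000 slides ~ 2,000 GB ~ 2 TB
# 1,000,000 slides ~ 2,000,000 GB ~ 2,000 TB (i.e., roughly 2 petabytes)
```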


Despite these cumbersome constraints, organizations like the Digital Pathology Association have created practical repositories that link to some of these siloed collections, which have been graciously anonymized and are suitable for sharing. Even these collections, however, may require proprietary viewers to be downloaded, new accounts to be created, or (in a few cases) external memberships to be purchased. Consider this current reality, and then compare it to the ideal posed by this question: What if we could train a deep learning model on ALL the curated whole slide images, regardless of where they were scanned?


The upsides are enormous: training on more input data reduces overfitting, exposes the model to more low-prevalence cases, and generally improves the metrics we care about as pathologists. Additionally, a model trained on whole slide images collected across the globe under real-life conditions would reduce ethnic bias by including minority populations that are currently underrepresented in datasets.


Enter "Federated Learning," the main topic of this post. At its core, federated learning trains a single model across numerous local collections without any of those collections revealing their input data. Each local collection shares what it has learned, but not the data it learned from.


This sounds complicated, so let's create a hypothetical example: imagine that ten major academic universities, each from a different country, want to learn from each other's whole slide imaging databases so they can train a new diagnostic neural network enriched with global material. Unfortunately, the legal barriers create immediate resistance that would significantly slow the project, and sharing all the original slides would require several petabytes (thousands of terabytes!) of data to be transferred to every university.


What if each university agreed to locally train the same standard neural network ("let's all start with the same Inception-ResNet model"), and then shared only the weights of its locally trained model? After all, the weights of a trained network do not directly reveal the input data; sharing them is simply sharing the network's final "wiring." All ten universities then exchange their model weights (measured in megabytes rather than gigabytes) and combine them. Once all ten universities are in sync (everyone has combined the weights, creating the same combined model), the process repeats, with each university training this new combined model on its local data.
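As a minimal sketch of that combination step (assuming each site's trained weights arrive as a list of NumPy arrays and that every site started from an identical architecture), the standard approach is a weighted average of the local parameters, often called federated averaging:

```python
import numpy as np

def federated_average(site_weights, site_counts):
    """Combine locally trained models by weighted averaging (FedAvg-style).

    site_weights: one entry per site, each a list of NumPy arrays
                  (one array per layer, identical shapes at every site).
    site_counts:  number of training slides (or tiles) at each site,
                  used to weight that site's contribution.
    """
    total = sum(site_counts)
    n_layers = len(site_weights[0])
    averaged = []
    for layer in range(n_layers):
        # Weighted sum of this layer's parameters across all sites.
        averaged.append(sum(
            (count / total) * weights[layer]
            for weights, count in zip(site_weights, site_counts)
        ))
    return averaged

# Hypothetical round: each of the ten universities sends its weights and
# slide count; everyone then loads the same averaged model and keeps training.
# new_global = federated_average([w_site1, w_site2, ...], [n_site1, n_site2, ...])
```

In practice, federated learning frameworks handle this exchange (along with refinements such as secure aggregation); the point is simply that what travels between universities is megabytes of weights, never the slides themselves.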


In this example, we achieve one "global" model, which is made by combining all of the "local" models. The global model, in theory, encodes information from each of the local models, yet none of the participating sites has revealed its original slides.


Certainly, this is only an overview, and there are potential downsides to this approach. Notably, because the original images are not shared (by design), pathologists cannot verify whether the model is performing appropriately except on their own images. Additionally, because whole slide images require pre-processing (e.g., color normalization and tiling) before being "fed" into the network, each university must agree on the same pre-processing procedure and build it into the shared training pipeline.
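One possible sketch of such a shared tiling step is shown below. It assumes the openslide-python library for reading slides and uses a crude brightness threshold to skip blank background; a real shared pipeline would also fix the magnification level and a stain-normalization method (e.g., Macenko) so every site produces comparable tiles:

```python
import numpy as np
import openslide  # assumes openslide-python is installed

TILE_SIZE = 512  # agreed-upon tile edge length, in pixels at full resolution

def tile_slide(path, tile_size=TILE_SIZE):
    """Yield RGB tiles from a whole slide image, skipping mostly-blank tiles.

    Every participating site would run this identical routine so that the
    shared network always sees comparably prepared inputs.
    """
    slide = openslide.OpenSlide(path)
    width, height = slide.dimensions  # full-resolution (level 0) size
    for y in range(0, height - tile_size + 1, tile_size):
        for x in range(0, width - tile_size + 1, tile_size):
            region = slide.read_region((x, y), 0, (tile_size, tile_size))
            tile = np.asarray(region.convert("RGB"))
            if tile.mean() > 230:  # mostly white background; skip it
                continue
            yield tile
```

The specific tile size, magnification level, and background filter here are illustrative; what matters is that every university hard-codes the same choices before local training begins.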


The upsides of "privacy-by-design" federated learning make it well worth exploring for machine learning on impractically large datasets. In whole slide imaging, where anonymization is usually as simple as cropping patient labels, the biggest advantage is that the need to share gigabyte-sized images is eliminated. For genetic information, this approach helps institutions comply with HIPAA and international laws on sharing patient information. Given the scientific discoveries that await, the value of this methodology is close to priceless. Federated learning can be used for any type of data where deep learning is a suitable approach, files are too cumbersome to transfer over the internet, or patient anonymity is an absolute necessity. The scientific community has always leaned towards collaboration, and federated learning is the next step in that enduring mindset.


Disclaimer: In seeking to foster discourse on a wide array of ideas, the Digital Pathology Association believes that it is important to share a range of prominent industry viewpoints. This article does not necessarily express the viewpoints of the DPA; however, we view it as a valuable perspective with which to facilitate discussion.