Keywords privacy-preserving AI techniques - federated learning - biomedicine
Introduction
Artificial intelligence (AI) strives to emulate the human mind and to solve complex
tasks by learning from available data. For many complex tasks, AI already surpasses
humans in terms of accuracy, speed, and cost. Recently, the rapid adoption of AI and
its subfields, specifically machine learning and deep learning, has led to substantial
progress in applications such as autonomous driving,[1 ] text translation,[2 ] and voice assistance.[3 ] At the same time, AI is becoming essential in biomedicine, where big data in health
care necessitates techniques that help scientists to gain understanding from it.[4 ]
Success stories such as acquiring the compressed representation of drug-like molecules,[5 ] modeling the hierarchical structure and function of a cell[6 ] and translating magnetic resonance images to computed tomography[7 ] using deep learning models illustrate the remarkable performance of these AI approaches.
AI has not only achieved remarkable success in analyzing genomic and biomedical data,[8][9][10][11][12][13][14][15][16][17][18] but has also surpassed humans in applications such as sepsis prediction,[19] malignancy detection in mammography,[20] and mitosis detection in breast cancer.[21]
Despite these AI-fueled advancements, important privacy concerns have been raised
regarding the individuals who contribute to the datasets. While taking care of the
confidentiality of sensitive biological data is crucial,[22] several studies showed that AI techniques often do not maintain data privacy.[23][24][25][26] For example, attacks known as membership inference can be used to infer an individual's membership by querying over the dataset[27] or the trained model,[23] or by having access to certain statistics about the dataset.[28][29][30] Homer et al[28] showed that under some assumptions, an adversary (an attacker who attempts to invade data privacy) can use the statistics published as the result of genome-wide association studies (GWAS) to find out if an individual was a part of the study. Another example of this kind of attack was demonstrated by attacks on Genomics Beacons,[27][31] in which an adversary could determine the presence of an individual in the dataset by simply querying the presence of a particular allele. Moreover, the attacker could identify the relatives of those individuals and obtain sensitive disease information.[27][32] Besides targeting the training dataset, an adversary may attack a fully trained AI model to extract individual-level membership by training an adversarial inference model that learns the behavior of the target model.[23]
As a result of the aforementioned studies, health research centers such as the National
Institutes of Health (NIH) as well as hospitals have restricted access to the pseudonymized
data.[22][33][34] Furthermore, data privacy laws such as those enforced by the Health Insurance Portability and Accountability Act (HIPAA) and the Family Educational Rights and Privacy Act (FERPA) in the U.S., as well as the EU General Data Protection Regulation (GDPR), restrict the use of sensitive data.[35][36] Consequently, getting access to these datasets requires a lengthy approval process,
which significantly impedes collaborative research. Therefore, both industry and academia
urgently need to apply privacy-preserving techniques to respect individual privacy
and comply with these laws.
This paper provides a systematic overview of various recently proposed privacy-preserving AI techniques in biomedicine that facilitate collaboration between health research institutes. Several efforts exist to tackle privacy concerns in various domains, some of which have been examined in previous surveys.[37][38][39] Aziz et al[37] investigated studies which employed differential privacy and cryptographic techniques for human genomic data. Kaissis et al[39] briefly reviewed federated learning, differential privacy, and cryptographic techniques applied in medical imaging. Xu et al[38] surveyed general solutions to challenges in federated learning, including communication efficiency, optimization, and privacy, and discussed possible applications including a few examples in health care. Compared with Aziz et al and Kaissis et al,[37][39] this paper covers a broader set of privacy-preserving techniques, including federated learning and hybrid approaches. In contrast to Xu et al,[38] we additionally discuss cryptographic techniques and differential privacy approaches and their applications in biomedicine. Moreover, this survey covers a wider range of studies that employed different privacy-preserving techniques in genomics and biomedicine and compares the approaches using different criteria such as privacy, accuracy, and efficiency. Notably, there are some hardware-based privacy-preserving approaches, such as Intel Software Guard Extensions[40][41][42] and AMD memory encryption,[43] which enable secure computation via trusted hardware; these are beyond the scope of this study.
The presented approaches are divided into four categories: cryptographic techniques,
differential privacy, federated learning, and hybrid approaches. First, we describe
how cryptographic techniques—in particular, homomorphic encryption (HE) and secure
multiparty computation (SMPC)—ensure secrecy of sensitive data by carrying out computations
on encrypted biological data. Next, we illustrate the differential privacy approach
and its capability in quantifying individuals' privacy in published summary statistics
of, for instance, GWAS data and deep learning models trained on clinical data. Then,
we elaborate on federated learning, which allows health institutes to train AI models locally and to share only selected model parameters, rather than sensitive data, with a coordinator, who aggregates them to build a global model. Following that, we discuss hybrid approaches
which enhance data privacy by combining federated learning with other privacy-preserving
techniques. We elaborate on the strengths and drawbacks of each approach as well as
its applications in biomedicine. More importantly, we provide a comparison among the
approaches with respect to different criteria such as computational and communication
efficiency, accuracy, and privacy. Finally, we discuss the most realistic approaches
from a practical viewpoint and provide a list of open problems and challenges that
remain for the adoption of these techniques in real-world biomedical applications.
Our review of privacy-preserving AI techniques in biomedicine yields the following main insights: First, cryptographic techniques such as HE and SMPC, which follow the paradigm of “bring data to computation”, are not computationally efficient and do not scale well to large biomedical datasets. Second, federated learning, which follows the paradigm of “bring computation to data”, is a more scalable approach; however, its network communication efficiency is still an open problem, and it does not provide privacy guarantees on its own. Third, hybrid approaches that combine cryptographic techniques or differential privacy with federated learning are the most promising privacy-preserving AI techniques for biomedical applications, because they promise to combine the scalability of federated learning with the privacy guarantees of cryptographic techniques or differential privacy.
Cryptographic Techniques
In biomedicine and GWAS in particular, cryptographic techniques have been used to collaboratively compute result statistics while preserving data privacy.[40][44][45][46][47][48][49][50][51][52][53][54][55][56] These cryptographic approaches are based on HE[57][58][59] or SMPC.[60] There are different HE-based techniques such as partially HE (PHE)[58] and fully HE (FHE).[57] PHE allows either addition or multiplication operations to be performed on the encrypted data, whereas FHE supports both. All HE-based approaches share the following three steps ([Fig. 1A]); a minimal code sketch is given after the list:
Fig. 1 Different privacy-preserving AI techniques: (A) Homomorphic encryption, where the participants encrypt their private data and share it with a computing party, which computes the aggregated results over the encrypted data from the participants; (B) Secure multiparty computation, in which each participant shares a separate, different secret with each computing party; the computing parties calculate the intermediate results, secretly share them with each other, and aggregate all intermediate results to obtain the final results; (C) Differential privacy, which ensures that models trained on datasets including and excluding a specific individual look statistically indistinguishable to the adversary; (D) Federated learning, where each participant downloads the global model from the server, computes the local model given its private data and the global model, and finally sends its local model to the server for aggregation and for updating the global model.
Participants (e.g., hospitals or medical centers) encrypt their private data and send
the encrypted data to a computing party.
The computing party calculates the statistics over the encrypted data and shares the
statistics (which are encrypted) with the participants.
The participants access the results by decrypting them.
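To make these three steps concrete, the following minimal sketch illustrates them with an additively homomorphic (Paillier) cryptosystem. It assumes the third-party python-paillier package (imported as phe) and uses hypothetical local case counts, so it should be read as an illustration of the paradigm rather than a production pipeline.

```python
# Minimal sketch of the three HE steps using an additively homomorphic (Paillier) scheme.
# Assumes the third-party python-paillier package (pip install phe); all values are hypothetical.
from phe import paillier

# Setup: the participants agree on a key pair; only they hold the private key.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Step 1: each participant (e.g., a hospital) encrypts its private value, here a local case count.
local_case_counts = [42, 17, 23]                       # private data of three hospitals
encrypted_counts = [public_key.encrypt(c) for c in local_case_counts]

# Step 2: the computing party aggregates the ciphertexts without ever seeing the plaintexts;
# in an additively homomorphic scheme, the sum of ciphertexts encrypts the sum of plaintexts.
encrypted_total = encrypted_counts[0]
for ciphertext in encrypted_counts[1:]:
    encrypted_total = encrypted_total + ciphertext

# Step 3: the participants decrypt the aggregated result with their private key.
print(private_key.decrypt(encrypted_total))            # 82
```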
In SMPC, there are multiple participants as well as a few computing parties, which perform computations on secret shares from the participants. Given M participants and N computing parties, SMPC-based approaches follow three steps ([Fig. 1B]):
Each participant sends a separate and different secret to each of the N computing parties.
Each computing party computes the intermediate results on the M secret shares from the participants and shares the intermediate results with the
other N − 1 computing parties.
Each computing party aggregates the intermediate results from all computing parties
including itself to calculate the final (global) results. In the end, the final results
computed by all computing parties are the same and can be shared by the participants.
To clarify the concepts of secret sharing[61] and multiparty computation, consider a scenario with two participants P1 and P2 and two computing parties C1 and C2.[46] P1 and P2 possess the private data X and Y, respectively. The aim is to compute X + Y, where neither P1 nor P2 reveals its data to the computing parties. To this end, P1 and P2 generate random numbers RX and RY, respectively; P1 reveals RX to C1 and (X − RX) to C2; likewise, P2 shares RY with C1 and (Y − RY) with C2; RX, RY, (X − RX), and (Y − RY) are the secret shares. C1 computes (RX + RY) and sends it to C2, and C2 calculates (X − RX) + (Y − RY) and reveals it to C1. Both C1 and C2 add the result they computed to the result each obtained from the other computing party. The sum is in fact (X + Y), which can be shared with P1 and P2.
Notice that to preserve data privacy, the computing parties C1 and C2 must be non-colluding. That is, C1 must not send RX and RY to C2, and C2 must not share (X − RX) and (Y − RY) with C1. Otherwise, the computing parties could reconstruct X and Y, revealing the participants' data. In general, in an SMPC protocol with N computing parties, data privacy is protected as long as at most N − 1 computing parties collude with each other. The larger N, the stronger the privacy, but the higher the communication overhead and processing time.
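The two-party example above can be written down directly. The following self-contained sketch (pure Python, hypothetical values) implements additive secret sharing over a prime field, under the stated assumption that the two computing parties do not collude and only ever see their own shares.

```python
# Additive secret-sharing sketch mirroring the two-participant, two-computing-party example.
# Values are illustrative; shares live in a prime field so a single share reveals nothing.
import secrets

PRIME = 2**61 - 1  # public modulus

def make_shares(value):
    """Split a private value into two additive shares (r, value - r) modulo PRIME."""
    r = secrets.randbelow(PRIME)
    return r, (value - r) % PRIME

# Participants P1 and P2 hold private values X and Y (e.g., local measurements).
X, Y = 1200, 3400
x_share_c1, x_share_c2 = make_shares(X)   # P1 sends one share to C1 and the other to C2
y_share_c1, y_share_c2 = make_shares(Y)   # P2 does the same

# Each computing party adds the shares it received; neither learns X or Y on its own.
partial_c1 = (x_share_c1 + y_share_c1) % PRIME
partial_c2 = (x_share_c2 + y_share_c2) % PRIME

# Exchanging and adding the partial results reconstructs X + Y (and nothing more).
result = (partial_c1 + partial_c2) % PRIME
assert result == (X + Y) % PRIME
print(result)                              # 4600
```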
Another point is that, in addition to secret sharing, there are other protocols used in SMPC, such as oblivious transfer[62] and garbled circuits,[63] a two-party computation protocol in which each party holds its private input and the two parties jointly learn the output of a function defined over their private inputs. Moreover, threshold cryptography combines a secret sharing scheme with cryptography to secretly share a key across distributed parties such that multiple parties (more than a threshold) must coordinate to encrypt/decrypt a message.[59][64] In this sense, threshold cryptography can be considered a combination of the HE and SMPC methods.
Most studies use HE or SMPC to develop secure, privacy-aware algorithms applicable to GWAS data. Kim and Lauter[47] and Lu et al[49] implemented a secure χ² test, and Lauter et al[48] developed privacy-preserving versions of common statistical tests in GWAS, such as the Pearson goodness-of-fit test, tests for linkage disequilibrium, and the Cochran-Armitage trend test, using HE. Kim et al[65] and Morshed et al[66] presented HE-based secure logistic and linear regression algorithms for medical data, respectively. Zhang et al,[53] Constable et al,[52] and Kamm et al[51] developed SMPC-based secure χ² tests. Shi et al[67] implemented a privacy-preserving logistic regression and Bloom[68] proposed a secure linear regression based on SMPC for GWAS data. Cho et al[44] introduced an SMPC-based framework to facilitate quality control and population stratification correction for large-scale GWAS and argued that their framework scales to one million individuals and half a million single nucleotide polymorphisms (SNPs).
There are also other types of encryption techniques such as somewhat homomorphic encryption
(SWHE),[57 ] which are employed to address privacy issues in genomic applications such as outsourcing
genomic data computation to the cloud, and are not the main focus of this review.
The main drawback of SWHE is that the number of successive addition and multiplication operations it can perform on the data is limited.[47] For more details, we refer to the comprehensive review by Mittos et al.[69]
Despite the promises of HE/SMPC-based privacy-preserving algorithms ([Table 1]), the road to wide adoption of HE/SMPC-based algorithms in genomics and biomedicine is still long.[70] The major limitations of HE are its small set of supported operations and its computational overhead.[71] HE supports only addition and multiplication operations; as a result, developing complex AI models with non-linear operations, such as deep neural networks (DNNs), using HE is very challenging. Moreover, HE incurs remarkable computational overhead since it performs operations on encrypted data. Although SMPC is more efficient than HE from a computational perspective, it still suffers from high computational overhead,[72] which stems from a few computing parties having to process the secret shares from a large number of participants or a large amount of data.
Table 1 Literature for cryptographic techniques and differential privacy in biomedicine

Authors | Year | Technique | Model | Application
Kim and Lauter[47] | 2015 | HE | χ² statistics; minor allele frequency; Hamming distance; edit distance | Genetic associations; DNA comparison
Lu et al[49] | 2015 | HE | χ² statistics; D′ measure | Genetic associations
Lauter et al[48] | 2014 | HE | D′ and r² measures; Pearson goodness-of-fit; expectation maximization; Cochran-Armitage | Genetic associations
Kim et al[65] | 2018 | HE | Logistic regression | Medical decision-making
Morshed et al[66] | 2018 | HE | Linear regression | Medical decision-making
Kamm et al[51] | 2013 | SMPC | χ² statistics | Genetic associations
Constable et al[52]; Zhang et al[53] | 2015; 2015 | SMPC | χ² statistics; minor allele frequency | Genetic associations
Shi et al[67] | 2016 | SMPC | Logistic regression | Genetic associations
Bloom[68] | 2019 | SMPC | Linear regression | Genetic associations
Cho et al[44] | 2018 | SMPC | Quality control; population stratification | Genetic associations
Johnson and Shmatikov[78] | 2013 | DP | Distance-score mechanism; p-value and χ² statistics | Querying genomic databases
Cho et al[95] | 2020 | DP | α-geometric mechanism | Querying biomedical databases
Aziz et al[79] | 2017 | DP | Eliminating random positions; biased random response | Querying genomic databases
Han et al[80]; Yu et al[81] | 2019; 2014 | DP | Logistic regression | Genetic associations
Honkela et al[82] | 2018 | DP | Bayesian linear regression | Drug sensitivity prediction
Simmons et al[83] | 2016 | DP | EIGENSTRAT; linear mixed model | Genetic associations
Simmons and Berger[84] | 2016 | DP | Nearest neighbor optimization | Genetic associations
Fienberg et al[85]; Uhlerop et al[86]; Yu and Ji[87]; Wang et al[88] | 2011; 2013; 2014; 2014 | DP | Statistics such as p-value, χ², and contingency table | Genetic associations
Abay et al[97] | 2018 | DP | Deep autoencoder | Generating artificial biomedical data
Beaulieu et al[98] | 2019 | DP | GAN | Simulating SPRINT trial
Jordon et al[99] | 2018 | DP | GAN | Generating artificial biomedical data

Abbreviations: DP, differential privacy; HE, homomorphic encryption; SMPC, secure multiparty computation.
Differential Privacy
One of the state-of-the-art concepts for eliminating and quantifying the chance of information leakage is differential privacy.[73][74][75] Differential privacy is a mathematical model that encapsulates the idea of injecting enough randomness or noise into sensitive data to camouflage the contribution of each single individual. This is achieved by inserting uncertainty into the learning process so that even a strong adversary with arbitrary auxiliary information about the data will still be uncertain in identifying any of the individuals in the dataset. It has become standard in data protection and has been effectively deployed by Google[76] and Apple[77] as well as agencies such as the United States Census Bureau. Furthermore, it has drawn the attention of researchers in privacy-sensitive fields such as biomedicine and health care.[78][79][80][81][82][83][84][85][86][87][88][89][90][91][92][93]
Differential privacy ensures that the model we train does not overfit the sensitive data of a particular user: a model trained on a dataset containing the information of a specific individual should be statistically indistinguishable from a model trained without that individual ([Fig. 1C]). As an example, assume that a patient would like to give consent to his/her doctor to include his/her personal health record in a biomedical dataset to study the association between age and cardiovascular disease. Differential privacy provides a mathematical guarantee which captures the privacy risk associated with the patient's participation in the study and quantifies to what extent the analyst or a potential adversary can learn about that particular individual in the dataset. Note that differential privacy is typically employed for centralized datasets, where the output of the algorithm is perturbed with noise, whereas SMPC and HE are leveraged for use cases where data are distributed across multiple clients and the computation is carried out over the encrypted data or secret shares of the clients' data. Formally, a randomized algorithm (an algorithm that has randomness in its logic and whose output can vary even on a fixed input) A: Dⁿ → Y is (ε, δ)-differentially private if, for all subsets y ⊆ Y and for all adjacent datasets D, D′ ∈ Dⁿ that differ in at most one record, the following inequality holds:
Pr[A(D) ∈ y] ≤ e^ε · Pr[A(D′) ∈ y] + δ
Here, ε and δ are privacy loss parameters, where lower values imply stronger privacy guarantees. δ is an exceedingly small value (e.g., 10⁻⁵) indicating the probability of an uncontrolled breach, in which the algorithm produces a specific output only in the presence of a specific individual and not otherwise. ε represents the worst-case privacy breach in the absence of any such rare breach. Setting δ = 0 yields a pure (ε)-differentially private algorithm, while δ > 0, which approximates the case in which pure differential privacy is broken, yields an approximate (ε, δ)-differentially private algorithm.
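As a concrete instance of this definition, the Laplace mechanism achieves pure ε-differential privacy for numeric queries by adding noise calibrated to the query's sensitivity. The sketch below (NumPy, with a hypothetical cohort and predicate) releases a differentially private patient count; its sensitivity is 1 because adding or removing one record changes the count by at most one.

```python
# Minimal sketch of the Laplace mechanism for an epsilon-differentially private count query.
# The dataset and predicate are hypothetical; the sensitivity of a counting query is 1.
import numpy as np

rng = np.random.default_rng()

def dp_count(records, predicate, epsilon):
    """Return a noisy count satisfying (epsilon, 0)-differential privacy."""
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0                                   # one record changes the count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: number of patients older than 60 in a hypothetical cohort.
patients = [{"age": 72}, {"age": 45}, {"age": 63}, {"age": 58}]
print(dp_count(patients, lambda p: p["age"] > 60, epsilon=0.5))
```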
Two important properties of differential privacy are composability[94] and resilience to post-processing. Composability means that combining multiple differentially private algorithms yields another differentially private algorithm; more precisely, combining k (ε, δ)-differentially private algorithms results in an algorithm that is at least (kε, kδ)-differentially private. Resilience to post-processing means that passing the output of an (ε, δ)-differentially private algorithm to any arbitrary randomized algorithm still upholds the (ε, δ)-differential privacy guarantee.
The community efforts to ensure the privacy of sensitive genomic and biomedical data using differential privacy can be grouped into four categories according to the problem they address ([Table 1]):
Approaches to querying biomedical and genomics databases.[78][79][93][95]
Statistical and AI modeling techniques in genomics and biomedicine.[80][81][82][83][84][92][96]
Data release, i.e., releasing summary statistics of a GWAS such as p-values and χ² contingency tables.[85][86][87][88]
Training privacy-preserving generative models.[97][98][99]
Studies in the first category proposed solutions to reduce the privacy risks of genomics databases such as GWAS databases and the genomics Beacon service.[100] The Beacon Network[31] is an online web service developed by the Global Alliance for Genomics and Health (GA4GH) through which users can query the data provided by owners or research institutes, ask about the presence of a genetic variant in the database, and get a YES/NO response. Studies have shown that an attacker can detect membership in the Beacon or GWAS by querying these databases multiple times and asking different questions.[27][101][102][103] Very recently, Cho et al[95] proposed a theoretical differential privacy mechanism to maximize the utility of count queries in biomedical systems while guaranteeing data privacy. Johnson and Shmatikov[78] developed a differentially private query-answering framework with which an analyst can retrieve statistical properties such as the correlation between SNPs and obtain an almost accurate answer while the GWAS dataset is protected against privacy risks. In another study, Aziz et al[79] proposed two algorithms that make the Beacon's responses inaccurate by controlling a bias variable. These algorithms decide when to answer a query correctly or incorrectly according to specific conditions on the bias variable, so that it becomes harder for the attacker to succeed.
Some of the efforts in the second category addressed the privacy concerns in GWAS
by introducing differentially private logistic regression to identify associations
between SNPs and diseases[80] or associations among multiple SNPs.[81] Honkela et al[82] improved drug sensitivity prediction by effectively employing differential privacy
for Bayesian linear regression. Moreover, Simmons et al[83 ] presented a differentially private EIGENSTRAT (PrivSTRAT)[104 ] and linear mixed model (PrivLMM)[105 ] to correct for population stratification. In another paper, Simmons et al[84 ] tackled the problem of finding significant SNPs by modeling it as an optimization
problem. Solving this problem provides a differentially private estimate of the neighbor
distance for all SNPs so that high scoring SNPs can be found.
The third category focused on releasing summary statistics such as p-values, χ² contingency tables, and minor allele frequencies in a differentially private fashion. The common approach in these studies is to add Laplacian noise to the true value of the statistic, so that sharing the perturbed statistic preserves the privacy of the individuals. The studies vary in the sensitivity of the algorithm (that is, the maximum change in the output of an algorithm caused by the presence or absence of a single data point) and hence in the amount of injected noise.[85][86][88]
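The same recipe underlies this kind of data release: compute the statistic, bound its sensitivity, and add Laplacian noise scaled to sensitivity/ε. The sketch below (NumPy, hypothetical genotype counts) releases a minor allele frequency; assuming the cohort size N is public and fixed, replacing one individual changes the allele count by at most 2 and hence the frequency by at most 1/N.

```python
# Sketch of releasing a differentially private minor allele frequency (MAF) for one SNP.
# Genotypes are hypothetical; the cohort size N is assumed public and fixed, so replacing
# one individual changes the minor allele count by at most 2 and the MAF by at most 1/N.
import numpy as np

rng = np.random.default_rng()

def dp_minor_allele_frequency(genotypes, epsilon):
    """genotypes: per-individual minor allele counts (0, 1, or 2) for a single SNP."""
    genotypes = np.asarray(genotypes)
    n = len(genotypes)                        # public cohort size
    maf = genotypes.sum() / (2 * n)           # true minor allele frequency
    sensitivity = 1.0 / n                     # worst-case change when one individual is replaced
    noisy = maf + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(np.clip(noisy, 0.0, 1.0))    # clipping is post-processing and keeps the guarantee

genotypes = [0, 1, 2, 0, 1, 0, 0, 1]          # hypothetical genotypes for one SNP
print(dp_minor_allele_frequency(genotypes, epsilon=1.0))
```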
The fourth category proposed novel privacy-protecting methods to generate synthetic
health care data leveraging differentially private generative models ([Fig. 2]). Deep generative models, such as generative adversarial networks (GANs),[106] can be trained on sensitive genomic and biomedical data to capture their properties and generate artificial data with characteristics similar to those of the original data.
Fig. 2 Differentially private deep generative models: the sensitive data holders (e.g., health institutes) train a differentially private generative model locally and share only the trained data generator with the outside world (e.g., researchers). The shared data generator can then be used to produce artificial data with the same characteristics as the sensitive data.
Abay et al[97 ] presented a differentially private deep generative model, DP-SYN, a generative autoencoder
that splits the input data into multiple partitions, then learns and simulates the
representation of each partition while maintaining the privacy of input data. They
assessed the performance of DP-SYN on sensitive datasets of breast cancer and diabetes.
Beaulieu et al[98 ] trained an auxiliary classifier GAN (AC-GAN) in a differentially private manner
to simulate the participants of the SPRINT trial (Systolic Blood Pressure Trial),
so that the clinical data can be shared while respecting participants' privacy. In
another approach, Jordon et al[99 ] introduced a differentially private GAN, PATE-GAN, and evaluated the quality of
synthetic data on Meta-Analysis Global Group in Chronic Heart Failure (MAGGIC) and
the United Network for Organ Transplantation (UNOS) datasets. Despite the aforementioned
achievements in adopting differential privacy in the field, several challenges remain
to be addressed. Although differential privacy involves less network communication,
memory usage, and time complexity compared with cryptographic techniques, it still
struggles to give highly accurate results within a reasonable privacy budget (i.e., the intended ε and δ) on large-scale datasets such as genomic datasets.[37][107] In more detail, since genomic datasets are huge, the sensitivity of the algorithms applied to them is large. Hence, the amount of distortion required for anonymization increases significantly, sometimes to the extent that the results are no longer meaningful.[108] Therefore, to make differential privacy more practical in the field, balancing the tradeoff between privacy and utility demands more attention than it has received.[88][90][91][92]
Federated Learning
Federated learning[109 ] is a type of distributed learning where multiple clients (e.g., hospitals) collaboratively
learn a model under the coordination of a central server while preserving the privacy
of their data. Instead of sharing its private data with the server or the other clients,
each client extracts knowledge (that is, model parameters) from its data and transfers
it to the server for aggregation ([Fig. 1D ]).
Federated learning is an iterative process in which each iteration consists of the following steps[110] (a simplified sketch follows the list):
The server chooses a set of clients to participate in the current training iteration.
The selected clients obtain the current model from the server.
Each selected client computes the local parameters using the current model and its
private data (e.g., runs gradient descent algorithm initialized by the current model
on its local data to obtain the local gradient updates).
The server collects the local parameters from the selected clients and aggregates
them to update the current model.
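The following NumPy sketch simulates one such round for a plain linear model on hypothetical client datasets; it runs in a single process purely for illustration, whereas a real deployment would exchange the parameters over the network.

```python
# Simplified simulation of one federated learning round with weighted parameter averaging.
# Clients, data, and the linear model are hypothetical; everything runs in one process.
import numpy as np

rng = np.random.default_rng(0)

def local_update(global_weights, X, y, lr=0.01, epochs=5):
    """Step 3: a client refines the global model on its private data via gradient descent."""
    w = global_weights.copy()
    for _ in range(epochs):
        gradient = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        w -= lr * gradient
    return w, len(y)                                 # local parameters and local sample count

# Hypothetical private datasets of three selected clients (e.g., hospitals), 4 features each.
clients = [(rng.normal(size=(n, 4)), rng.normal(size=n)) for n in (120, 80, 200)]

global_weights = np.zeros(4)                         # Step 2: clients obtain the current model
updates = [local_update(global_weights, X, y) for X, y in clients]   # Steps 1 and 3

# Step 4: the server aggregates the local parameters, weighting each client by its sample count.
total_samples = sum(n for _, n in updates)
global_weights = sum(n * w for w, n in updates) / total_samples
print(global_weights)
```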
The data of the clients can be considered as a table, where rows represent samples
(e.g., individuals) and columns represent features or labels (e.g., age, case vs.
control). We refer to the set of samples, features, and labels of the data as sample space , feature space , and label space , respectively. Federated learning can be categorized into three types based on the
distribution characteristics of the clients' data:
Horizontal (sample-based) federated learning[111]: Data from different clients shares a similar feature space but is very different in sample space. As an example, consider two hospitals in two different cities which
in sample space. As an example, consider two hospitals in two different cities which
collected similar information such as age, gender, and blood pressure about the individuals.
In this case, the feature spaces are similar; but because the individuals who participated
in the hospitals' data collections are from different cities, their intersection is
most probably very small, and the sample spaces are hence very different.
Vertical (feature-based) federated learning[111]: Clients' data are similar in sample space but very different in feature space.
For example, two hospitals with different expertise in the same city might collect
different information (different feature space) from almost the same individuals (similar
sample space).
Hybrid federated learning : Both feature space and sample space are different in the data from the clients.
For example, consider a medical center with expertise in brain image analysis located
in New York and a research center with expertise in protein research based in Berlin.
Their data are completely different (image vs. protein data) and disjoint groups of
individuals participated in the data collection of each center.
To illustrate the concept of federated learning, consider a scenario with two hospitals A and B. A and B possess lists X and Y, containing the ages of their cancer patients, respectively. A simple federated mean algorithm to compute the average age of the cancer patients in both hospitals without revealing the real values of X and Y works as follows (for the sake of brevity, we assume that both hospitals are selected in the first step and that the current global model parameter (average age) in the second step is zero; see the federated learning steps above):
Hospital A computes the average age MX and the number NX of its cancer patients. Hospital B does the same, resulting in MY and NY. Here, X and Y are private data, while MX, NX, MY, and NY are the parameters extracted from the private data.
The server obtains the values of the local model parameters from the hospitals and computes the global mean as the weighted average M = (NX · MX + NY · MY) / (NX + NY).
The emerging demand for federated learning gave rise to a wealth of both simulation[112][113] and production-oriented[114][115] open source frameworks. Additionally, there are AI platforms whose goal is to apply federated learning to real-world health care settings.[116][117] In the following, we survey studies on federated AI techniques in biomedicine and health care ([Table 2]). Recent studies in this regard mainly focused on horizontal federated learning, and there are only a few vertical or hybrid federated learning algorithms applicable to genomic and biomedical data.
Table 2 Literature for FL and hybrid approaches in biomedicine

Authors | Year | Technique | Model | Application
Sheller et al[118] | 2018 | FL | DNN | Medical image segmentation
Chang et al[123]; Balachandar et al[122] | 2018; 2020 | FL | Single weight transfer; cyclic weight transfer | Medical image segmentation
Nasirigerdeh et al[129] | 2020 | FL | Linear regression; logistic regression; χ² statistics | Genetic associations
Wu et al[126]; Wang et al[127]; Li et al[128] | 2012; 2013; 2016 | FL | Logistic regression | Genetic associations
Dai et al[124]; Lu et al[125] | 2020; 2015 | FL | Cox regression | Survival analysis
Brisimi et al[132] | 2018 | FL | Support vector machines | Classifying electronic health records
Huang et al[133] | 2018 | FL | Adaptive boosting ensemble | Classifying medical data
Liu et al[134] | 2018 | FL | Autonomous deep learning | Classifying medical data
Chen et al[135] | 2019 | FL | Transfer learning | Training wearable health care devices
Li et al[150] | 2020 | FL + DP | DNN | Medical image segmentation
Li et al[149] | 2019 | FL + DP | Domain adaptation | Medical image pattern recognition
Choudhury et al[159] | 2019 | FL + DP | Neural network; support vector machine; logistic regression | Classifying electronic health records
Constable et al[52] | 2015 | FL + SMPC | Statistical analysis (e.g., χ² statistics) | Genetic associations
Lee et al[158] | 2019 | FL + HE | Context-specific hashing | Learning patient similarity
Kim et al[156] | 2019 | FL + DP + HE | Logistic regression | Classifying medical data

Abbreviations: DP, differential privacy; FL, federated learning; HE, homomorphic encryption; SMPC, secure multiparty computation.
Several studies provided solutions for the lack of sufficient data due to the privacy challenges in the medical imaging domain.[117][118][119][120][121][122][123] For instance, Sheller et al developed a supervised DNN in a federated way for semantic segmentation of brain gliomas from magnetic resonance imaging scans.[118] Chang et al[123] simulated a distributed DNN in which multiple participants collaboratively update model weights using training heuristics such as single weight transfer and cyclical weight transfer (CWT). They evaluated this distributed model on image classification tasks using medical image datasets such as mammography and retinal fundus image collections, which were evenly distributed among the participants. Balachandar et al[122] optimized CWT for cases where the datasets are unevenly distributed across participants. They assessed their optimization methods on simulated diabetic retinopathy detection and chest radiograph classification.
Federated Cox regression, linear regression, and logistic regression, as well as the Chi-square test, have been developed for sensitive biomedical data that is vertically or horizontally distributed.[124][125][126][127][128][129] VERTICOX[124] is a vertical federated Cox regression model for survival analysis, which employs
the alternating direction method of multipliers (ADMM) framework[130] and is evaluated on acquired immunodeficiency syndrome (AIDS) and breast cancer
survival datasets. Similarly, WebDISCO[125 ] presents a federated Cox regression model but for horizontally distributed survival
data. The grid binary logistic regression (GLORE)[126 ] and the expectation propagation logistic regression (EXPLORER)[127 ] implemented a horizontally federated logistic regression for medical data.
Unlike GLORE, EXPLORER supports asynchronous communication and online learning functionality
so that the system can continue collaborating in case a participant is absent or if
communication is interrupted. Li et al presented VERTIGO,[128 ] a vertical grid logistic regression algorithm designed for vertically distributed
biological datasets such as breast cancer genome and myocardial infarction data. Nasirigerdeh
et al[129 ] developed a horizontally federated tool set for GWAS, called sPLINK , which supports Chi-square test, linear regression, and logistic regression. Notably,
federated results from sPLINK on distributed datasets are the same as those from aggregated analysis conducted
with PLINK .[131 ] Moreover, they showed that sPLINK is robust against heterogeneous (imbalanced) data distributions across clients and
does not lose its accuracy in such scenarios.
There are also studies that combine federated learning with other traditional AI modeling techniques such as ensemble learning, support vector machines (SVMs), and principal component analysis (PCA).[132][133][134][135][136] Brisimi et al[132] presented a federated soft-margin support vector machine (sSVM) for distributed electronic health records. Huang et al[133] introduced LoAdaBoost, a federated adaptive boosting method for learning from biomedical data such as intensive care unit data from distinct hospitals,[137] while Liu et al[134] trained a federated autonomous deep learner to this end. There have also been a couple of attempts at incorporating federated learning into multitask learning and transfer learning in general.[138][139][140] However, to the best of our knowledge, FedHealth[135] is the only federated transfer learning framework specifically designed for health care applications. It enables users to train personalized models for their wearable health care devices by aggregating data from different organizations without compromising privacy.
One of the major challenges for adopting federated learning in large-scale genomics and biomedical applications is the significant network communication overhead, especially for complex AI models such as DNNs, which contain millions of model parameters and require thousands of iterations to converge. A rich body of literature exists to tackle this challenge, known as communication-efficient federated learning.[141][142][143][144]
Another challenge in federated learning is the possible accuracy loss from the aggregation process if the data distribution across the clients is heterogeneous (i.e., not independent and identically distributed [IID]). More specifically, federated learning can deal with non-IID data while preserving the model accuracy if the learning model is simple, such as ordinary least squares linear regression (sPLINK[129]). However, when it comes to learning complex models such as DNNs, the global model might not converge on non-IID data across the clients. Zhao et al[145] showed that simple averaging of the model parameters in the server significantly diminishes the accuracy of a convolutional neural network model in highly skewed non-IID settings. Developing aggregation strategies that are robust against non-IID scenarios is still an open and interesting problem in federated learning.
Finally, federated learning is based on the assumption that the centralized server
is honest and not compromised, which is not necessarily the case in real applications.
To relax this assumption, differential privacy or cryptographic techniques can be
leveraged in federated learning, which is covered in the next section. For further reading on future directions of federated learning in general, we refer the reader to comprehensive surveys.[110][146][147]
Hybrid Privacy-Preserving Techniques
The hybrid techniques combine federated learning with the other paradigms (cryptographic
techniques and differential privacy) to enhance privacy or provide privacy guarantees
([Table 2 ]). Federated learning preserves privacy to some extent because it does not require
the health institutes to share the patients' data with the central server. However,
the model parameters that participants share with the server might be abused to reveal
the underlying private data if the coordinator is compromised.[148] To handle this issue, the participants can leverage differential privacy and add noise to the model parameters before sending them to the server (FL + DP),[149][150][151][152][153] or they can employ HE (FL + HE),[55][154][155] SMPC (FL + SMPC), or both DP and HE (FL + DP + HE)[103][156][157] to securely share the parameters with the server.[51][158]
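To illustrate how differential privacy can be layered onto federated learning on the client side, the sketch below (NumPy, with hypothetical shapes, clipping norm, and noise multiplier) clips a local model update and perturbs it with Gaussian noise before it would be transmitted to the server; calibrating the noise to a formal (ε, δ) budget would additionally require a privacy accountant, which is omitted here.

```python
# Client-side FL + DP sketch: clip the local model update and add Gaussian noise before
# sharing it with the aggregator. Shapes and hyperparameters are hypothetical, and no
# formal (epsilon, delta) accounting is performed.
import numpy as np

rng = np.random.default_rng()

def privatize_update(local_weights, global_weights, clip_norm=1.0, noise_multiplier=1.1):
    """Return a clipped, noised model update suitable for sending to the server."""
    update = local_weights - global_weights
    # Clip the update to bound each client's influence (its L2 sensitivity).
    norm = np.linalg.norm(update)
    update = update / max(1.0, norm / clip_norm)
    # Perturb the clipped update with Gaussian noise scaled to the clipping norm.
    noise = rng.normal(loc=0.0, scale=noise_multiplier * clip_norm, size=update.shape)
    return update + noise

global_weights = np.zeros(4)
local_weights = np.array([0.4, -0.2, 0.9, 0.1])         # hypothetical locally trained weights
print(privatize_update(local_weights, global_weights))  # what the client actually transmits
```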
In the genomic and biomedical field, several hybrid approaches have been presented
recently. Li et al[149 ] presented a federated deep learning framework for magnetic resonance brain image
segmentation in which the client side provides differential privacy guarantees when selecting and sharing the local gradient weights with the server for imbalanced data. A recent study[150] extracted neural patterns from brain functional magnetic resonance images by developing a privacy-preserving pipeline that analyzes image data of patients with different
psychiatric disorders using federated domain adaptation methods. Choudhury et al[159 ] developed a federated differential privacy mechanism for gradient-based classification
on electronic health records.
There are also some studies that incorporate federated learning with cryptographic techniques. For instance, Constable et al[52] implemented a privacy-protecting structure for federated statistical analysis, such as χ² statistics on GWAS, while maintaining privacy using SMPC. In a slightly different
approach, Lee et al[158 ] presented a privacy-preserving platform for learning patient similarity in multiple
hospitals using a context-specific hashing approach which employs HE to limit the
privacy leakage. Moreover, Kim et al[156 ] presented a privacy-preserving federated logistic regression algorithm for horizontally
distributed diabetes and intensive care unit datasets. In this approach, the logistic
regression ensures privacy by making the aggregated weights differentially private
and encrypting the local weights using HE.
Incorporating HE, SMPC, and differential privacy into federated learning enhances privacy but also combines the limitations of the individual approaches. FL + HE puts much more computational overhead on the server, since it requires performing the aggregation on the encrypted model parameters from the clients. The network communication overhead is exacerbated in FL + SMPC, because clients need to securely share the model parameters with multiple computing parties instead of one. FL + DP might result in inaccurate models because of the noise added to the model parameters by the clients.
Comparison
We compare the privacy-preserving techniques (HE, SMPC, differential privacy, federated learning, and the hybrid approaches) using various performance and privacy criteria: computational efficiency, communication efficiency, accuracy, privacy guarantee, exchange of sensitive traffic through the network, and privacy of the exchanged traffic ([Fig. 3]). We employ a generic ranking (lowest = 1 to highest = 6)[37] for all comparison criteria except for privacy guarantee and exchange of sensitive traffic through the network, which are binary criteria. This comparison is made under the assumption of applying a complex model (e.g., a DNN with a huge number of model parameters) to a large sensitive genomics dataset distributed across dozens of clients in an IID configuration. Additionally, there are a few computing parties in SMPC (a practical configuration).
Fig. 3 Comparison radar plots for all (A) and each of (B–H) the privacy-preserving approaches, including homomorphic encryption (HE), secure multiparty computation (SMPC), differential privacy (DP), federated learning (FL), and hybrid techniques (FL + DP, FL + HE, and FL + SMPC). (A) All. (B) HE. (C) SMPC. (D) DP. (E) FL. (F) FL + DP. (G) FL + HE. (H) FL + SMPC.
Computational efficiency is an indicator of the extra computational overhead an approach
incurs to preserve privacy. According to [Fig. 3 ], federated learning is best from this perspective because it follows the paradigm
of “bringing computation to data”, distributing computational overhead among the clients.
HE and SMPC are based on the paradigm of moving data to computation. In HE, encryption
of the whole private data in the clients and carrying out computation on encrypted
data by the computing party causes a huge amount of overhead. In SMPC, a couple of
computing parties process the secret shares from dozens of clients, incurring considerable
computational overhead. Among the hybrid approaches, FL + DP has the best computational
efficiency given the lower overhead of the two approaches whereas FL + HE has the
highest overhead because the aggregation process on encrypted parameters is computationally
expensive.
Network communication efficiency indicates how efficiently an approach utilizes the network bandwidth. The less data traffic is exchanged in the network, the more communication-efficient the approach. Federated learning is the least efficient approach from the communication aspect, since exchanging a large number of model parameter values between the clients and the server generates a huge amount of network traffic. Notice that the network bandwidth usage of federated learning is independent of the clients' data, because federated learning does not move data to computation; it depends instead on the model complexity (i.e., the number of model parameters). The next approach in this regard is SMPC, where not only does each participant send a large amount of traffic (almost as big as its data) to each computing party, but each computing party also exchanges intermediate results (which might be large) with the other computing parties through the network. Although recent research has shown that there is still potential for reducing the communication overhead in SMPC,[160] many limitations cannot be fully overcome. The network overhead of HE comes from sharing the encrypted data of the clients (assumed to be almost as big as the data itself) with the computing party, which is small compared with the network traffic generated by federated learning and SMPC. The best approach is differential privacy, with no network overhead. Accordingly, FL + DP and FL + SMPC are, respectively, the best and worst among the hybrid approaches from a communication efficiency viewpoint.
The accuracy of the model in a privacy-preserving approach is a crucial factor in deciding whether to adopt the approach. In the assumed configuration, SMPC and federated learning are the most accurate approaches, incurring little accuracy loss in the final model. Next is differential privacy, where the added noise can considerably affect the model accuracy. The worst approach is HE, whose accuracy loss is due to approximating non-linear operations using addition and multiplication (e.g., least squares approximation[65]). Among the hybrid approaches, FL + SMPC is the best and FL + DP is the worst, reflecting the accuracy of the underlying SMPC and differential privacy approaches.
The rest of the comparison measures are privacy related. The traffic transferred from
the clients (participants) to the server (computing parties) is highly sensitive if
it carries the private data of the clients. HE and SMPC send the encrypted form of
the clients' private data to the server. Federated learning and hybrid approaches
share only the model parameters with the server. In HE, if the server has the key
to decrypt the traffic from the clients, the whole private data of the clients will
be revealed. The same holds if the computing parties in SMPC collude with each other.
This might or might not be the case for the other approaches (e.g., federated learning)
depending on the exchanged model parameters and whether they can be abused to infer
the underlying private data.
Privacy of the exchanged traffic indicates how much the traffic is kept private from
the server. In HE/SMPC, the data are encrypted first and then shared with the server,
which is reasonable since it is the clients' private data. In federated learning,
the traffic (model parameters) is directly shared with the server assuming that it
does not reveal any details regarding individual samples in the data. The aim of the
hybrid approaches is to hide the real values of the model parameters from the server
to minimize the possibility of inference attacks using the model parameters. FL + HE
is the best among the hybrid approaches from this viewpoint.
Privacy guarantee is a metric which quantifies the degree to which the privacy of
the clients' data can be preserved. Differential privacy and the corresponding hybrid
approach (FL + DP) are the only approaches providing a privacy guarantee, whereas
all other approaches can protect privacy only under certain assumptions.
In HE, the server must not have the decryption key; in SMPC, not all computing parties
must collude with each other; in federated learning, the model parameters should not
give any detail about a sample in the clients' data.
Discussion and Open Problems
In HE, a single computing party carries out computation over the encrypted data from
the clients. In SMPC, multiple computing parties perform operations on the secret
shares from the clients. In federated learning, a single server aggregates the local
model parameters shared by the clients. From a practical point of view, HE and SMPC, which follow the paradigm of “move data to computation”, do not scale as the number of clients or the size of the clients' data becomes large. This is because they put the computational burden on a single or a few computing parties. Federated learning, on the other hand,
distributes the computation across the clients (aggregation on the server is not computationally
heavy) but the communication overhead between the server and clients is the major
challenge to scalability of federated learning. The hybrid approaches inherit this
issue and it is exacerbated in FL + SMPC. Combining HE with federated learning (FL + HE)
adds another obstacle (computational overhead) to the scalability of federated learning.
There is a growing body of literature on communication-efficient approaches to federated
learning that can dramatically improve the scalability of federated learning and make
it suitable for large-scale applications including those in biomedicine.
Given that federated learning is the most promising approach from the scalability
viewpoint, it can be used as a standalone approach as long as inferring the clients'
data from the model parameters is practically impossible. Otherwise, it should be
combined with differential privacy to avoid possible inference attacks and exposure
of clients' private data and to provide a privacy guarantee. The accuracy of the model will be satisfactory in federated learning but might deteriorate in FL + DP.
A realistic trade-off needs to be considered depending on the application of interest.
Moreover, differential privacy can have many practical applications in biomedicine
as a standalone approach. It works very well for low-sensitivity queries such as counting
queries (e.g., number of patients with a specific disease) on genomic databases and
their generalizations (e.g., histograms) since the presence or absence of an individual
changes the query's response by at most one. Moreover, it can be employed to release
summary statistics of GWAS such as χ² statistics and p-values in a differentially private manner while keeping the accuracy acceptable.
A novel promising research direction is to incorporate differential privacy in deep
generative models to generate synthetic genomic and biomedical data.
Future studies can investigate how to reach a compromise between scalability, privacy,
and accuracy in real-world settings. The communication overhead of federated learning
is still an open problem: although state-of-the-art approaches considerably reduce the network overhead, they adversely affect the accuracy of the model. Hence, novel
approaches are required to preserve the accuracy, which is of great importance in
biomedicine, while making federated learning communication efficient.
Adopting federated learning in non-IID settings, where genomic and biomedical datasets
across different hospitals/medical centers are heterogeneous, is another important
challenge to address. This is because typical aggregation procedures such as simple
averaging do not work well for these settings, yielding inaccurate models. Hence,
new aggregation procedures are required to tackle non-IID scenarios. Moreover, current
communication-efficient approaches which were developed for an IID setting might not
be applicable to heterogeneous scenarios. Consequently, new techniques are needed
to reduce network overhead in these settings, while keeping the model accuracy satisfactory.
Combining differential privacy with federated learning to enhance privacy and to provide
a privacy guarantee is still a challenging issue in the field. It becomes even more
challenging for health care applications, where accuracy of the model is of crucial
importance. Moreover, the concept of privacy guarantee in differential privacy has
been defined for local settings. In distributed scenarios, a dataset might be employed
multiple times to train different models with various privacy budgets. Therefore,
a new formulation of privacy guarantee should be proposed for distributed settings.
Conclusion
For AI techniques to succeed, big biomedical data needs to be available and accessible.
However, the more AI models are trained on sensitive biological data, the more awareness of privacy issues increases, which, in turn, necessitates strategies for shielding the data.[70] Hence, privacy-enhancing techniques are crucial to allow AI to benefit from the
sensitive biological data.
Cryptographic techniques, differential privacy, and federated learning can be considered
as the prime strategies for protecting personal data privacy. These emerging techniques
are based on either securing sensitive data, perturbing it or not moving it off site.
In particular, cryptographic techniques securely share the data with a single (HE)
or multiple computing parties (SMPC); differential privacy adds noise to sensitive
data and quantifies privacy loss accordingly, while federated learning enables collaborative
learning under orchestration of a centralized server without moving the private data
outside local environments.
All of these techniques have their own strengths and limitations. HE and SMPC are
more communication efficient compared with federated learning but they are computationally
expensive since they move data to computation and put the computational burden on
a server or a few computing parties. Federated learning, on the other hand, distributes
computation across the clients but suffers from high network communication overhead.
Differential privacy is an efficient approach from a computational and a communication
perspective but it introduces accuracy loss by adding noise to data or model parameters.
Hybrid approaches are studied to combine the advantages or to overcome the disadvantages
of the individual techniques. We argued that federated learning as a standalone approach
or in combination with differential privacy is the most promising approach to be adopted
in biomedicine. We discussed the open problems and challenges in this regard including
the balance of communication efficiency and model accuracy in non-IID settings, and
the need for a new notion of privacy guarantee for distributed biomedical datasets.
Incorporating privacy into the analysis of genomic and biomedical data is still an
open challenge, yet preliminary accomplishments are promising to bring practical privacy
even closer to real-world settings. Future research should investigate how to achieve
a trade-off between scalability, privacy, and accuracy in real biomedical applications.