Chen J, Podchiyska T, Altman R. OrderRex: clinical order decision support and outcome
predictions by data-mining electronic medical records. J Am Med Inform Assoc 2016;23:339-48
Compliance with evidence-based guidelines is low, and a majority of clinical decisions
are not supported by randomized controlled trials; a large part of medical practice
is thus driven by individual expert opinion. The authors present a clinical order
recommender system that operates on a database mined from existing patient data.
The input to the data mining system is around 1,500 common electronic medical record
(EMR) data elements (out of 5.4 million structured data elements) from lab results,
orders, and diagnosis codes, including temporal separation in the form of patient
timelines. These data were extracted for 18,000 patients and stored in an association
matrix. Queries to the database take the form of clinical terms for the captured data
elements of a patient, and a ranking of suggested orders based on the input data and
the association matrix is output to the user. By mixing outcomes such as death and
hospital readmission in with the order results, the system also acts as a predictor
of outcomes. The authors observe that including the temporal data increased precision
from 33% to 38%, but also note that continued work is required to differentiate
practices that are merely common for given data from those that are correct.
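The association-matrix idea can be sketched as a simple co-occurrence recommender. The records, item names, and scoring below are purely illustrative, not the authors' implementation, which operates on far richer EMR data and statistics:

```python
from collections import Counter
from itertools import combinations

def build_association_matrix(patient_records):
    """Count how often pairs of clinical items co-occur within one patient record."""
    cooccur = Counter()
    for items in patient_records:
        for a, b in combinations(sorted(set(items)), 2):
            cooccur[(a, b)] += 1
            cooccur[(b, a)] += 1
    return cooccur

def recommend(query_items, cooccur, top_n=3):
    """Rank candidate items by total co-occurrence with the query items."""
    scores = Counter()
    for (a, b), count in cooccur.items():
        if a in query_items and b not in query_items:
            scores[b] += count
    return [item for item, _ in scores.most_common(top_n)]

# Invented toy records: each list is the set of items seen for one patient.
records = [
    ["chest_pain", "ecg", "troponin"],
    ["chest_pain", "ecg", "cxr"],
    ["chest_pain", "troponin", "ecg"],
]
matrix = build_association_matrix(records)
print(recommend({"chest_pain"}, matrix))  # → ['ecg', 'troponin', 'cxr']
```

Outcomes such as death or readmission can be mixed into the item vocabulary, at which point the same ranking doubles as a crude outcome predictor, as the paper describes.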
Miotto R, Li L, Kidd BA, Dudley JT. Deep Patient: An Unsupervised Representation to
Predict the Future of Patients from the Electronic Health Records. Sci Rep 2016;6:26094
Proposed in this paper is a novel unsupervised deep feature learning method to derive
a patient representation from EHR data that facilitates the prediction of clinical
outcomes. Deep learning techniques, i.e., neural networks with more than one hidden
layer, had not previously been broadly applied to EHR data. The authors used aggregated
medical records from the Mount Sinai data warehouse with a stack of denoising autoencoders
to capture stable structures and regular patterns in pre-processed EHR data. They then
trained random forest classifiers (one-vs.-all learning) to predict the probability
that patients might develop a certain disease. On 76,214 test patients covering
78 diseases from diverse clinical domains and temporal windows, the results significantly
outperformed those achieved using representations based on raw EHR data and alternative
feature learning strategies such as principal component analysis and Gaussian mixture
models.
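As a rough illustration of the representation-learning step, the following is a minimal single-layer denoising autoencoder in NumPy. The toy data, layer sizes, and hyperparameters are invented; Deep Patient stacks several such layers over much larger pre-processed EHR matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae(X, n_hidden=4, corruption=0.3, lr=1.0, epochs=500):
    """Train one denoising-autoencoder layer; return an encoder plus
    the clean-input reconstruction error before and after training."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, n_hidden))  # encoder weights
    W2 = rng.normal(0, 0.1, (n_hidden, d))  # decoder weights
    b1, b2 = np.zeros(n_hidden), np.zeros(d)

    def recon_error():
        X_hat = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
        return float(((X_hat - X) ** 2).mean())

    err_before = recon_error()
    for _ in range(epochs):
        X_noisy = X * (rng.random(X.shape) > corruption)  # mask-out corruption
        H = sigmoid(X_noisy @ W1 + b1)                    # encode
        X_hat = sigmoid(H @ W2 + b2)                      # decode
        d_out = (X_hat - X) / n           # cross-entropy gradient at output logits
        d_hid = (d_out @ W2.T) * H * (1 - H)
        W2 -= lr * (H.T @ d_out); b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X_noisy.T @ d_hid); b1 -= lr * d_hid.sum(axis=0)
    return (lambda V: sigmoid(V @ W1 + b1)), err_before, recon_error()

# Toy "patients": repeated copies of two prototype feature patterns.
X = np.asarray([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 10, dtype=float)
encode, err_before, err_after = train_dae(X)
codes = encode(X)  # dense representations for a downstream classifier
```

The hidden activations `codes` play the role of the "deep patient" representation, which the authors then feed to random forest classifiers rather than using the raw features directly.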
Prasser F, Kohlmayer F, Kuhn KA. The Importance of Context: Risk-based De-identification
of Biomedical Data. Methods Inf Med 2016;55:347-55
As data sharing becomes more common, concerns about maintaining the privacy of patients
in shared data sets are growing as well. Laws and regulations such as the US HIPAA
and the European Directive on Data Protection emphasize the importance of context when
implementing measures for data protection. With de-identification methods such as
k-anonymity (the dataset is transformed so that each record is indistinguishable from
at least k-1 other records), the degree of protection is high, but it comes with a loss
of information content. Indeed, a major challenge of data sharing is striking an
adequate balance between data quality and privacy. The authors propose a generic
de-identification method based on risk models, which assess the risk of
re-identification. An experimental evaluation was performed to assess the impact of
different risk models and of assumptions about the background knowledge/context of an
attacker. Compared with reference methods, the loss of information was between 10%
and 24% lower, depending on the strength of the adversary being protected against.
Saez C, Zurriaga O, Perez-Panades J, Melchor I, Robles M, Garcia-Gomez JM. Applying
probabilistic temporal and multisite data quality control methods to a public health
mortality registry in Spain: a systematic approach to quality control of repositories.
J Am Med Inform Assoc 2016;23:1085-95
The authors propose the evaluation of variability in data distributions as a criterion
that can be used systematically in assessing data quality. This variability is assessed
first across different sources of data (i.e., different sites) and second over time.
The authors propose a novel statistics-based assessment method providing data quality
metrics and exploratory visualizations. The method is empirically demonstrated on a
public health mortality registry of the region of Valencia, Spain, with >500,000
entries from 2000 to 2012, separated into 24 health departments. The repository was
partitioned into two temporal subgroups following a change in the Spanish national
death certificate in 2009. Several types of data quality issues were identified,
including punctual temporal anomalies and outlying or clustered health departments.
The authors note that these issues can occur because of biases in practice, differences
in populations, and changes in protocols or guidelines over time, none of which are
solved through the usual techniques of mapping to standard semantics.
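The multisite variability assessment in the Saez et al. entry rests on comparing probability distributions across sites and over time. One distance of the kind used for such comparisons is the Jensen-Shannon divergence, sketched here on invented cause-of-death code frequencies:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (bits) between discrete distributions p and q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric, bounded divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Illustrative cause-of-death code frequencies at three hypothetical sites.
site_a = [0.70, 0.20, 0.10]
site_b = [0.68, 0.22, 0.10]   # similar practice to site A
site_c = [0.20, 0.20, 0.60]   # outlying site
print(jensen_shannon(site_a, site_b) < jensen_shannon(site_a, site_c))  # → True
```

Pairwise distances like these are what allow outlying or clustered health departments, and abrupt temporal changes such as the 2009 certificate switch, to show up as anomalies.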
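For the Prasser et al. entry, the k-anonymity notion can be made concrete with a small check-and-generalize sketch; the records, attributes, and band width below are invented for illustration:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """A dataset is k-anonymous if every quasi-identifier combination
    is shared by at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def generalize_age(records, band=10):
    """Generalize exact ages into bands, trading information for privacy."""
    return [{**r, "age": f"{(r['age'] // band) * band}-{(r['age'] // band) * band + band - 1}"}
            for r in records]

data = [
    {"age": 34, "zip": "941", "dx": "flu"},
    {"age": 36, "zip": "941", "dx": "copd"},
    {"age": 35, "zip": "941", "dx": "flu"},
]
print(is_k_anonymous(data, ["age", "zip"], 2))                  # → False
print(is_k_anonymous(generalize_age(data), ["age", "zip"], 2))  # → True
```

The coarser the generalization, the higher the protection and the greater the information loss; the paper's contribution is to tune this trade-off against explicit models of the attacker's risk rather than a fixed k.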