Keywords
data quality - electronic health record - secondary use - real-world data
Background and Significance
Medical real-world data (RWD) is increasingly used for research studies. Electronic
medical records (EMRs) and claims data are major sources, with more focused data repositories
such as tumor registries and clinical departmental systems also being mined. It is
important that users of RWD are aware of the limitations of a dataset that may not
be explicitly apparent. For instance, given the various sources of patient medications—inpatient
and outpatient hospital-based dispensing, community pharmacies—the completeness of
a dataset's medication data may not be evident.[1]
Other RWD limitations arise from patient privacy regulations.[2] Access to these data is governed by the Health Insurance Portability and Accountability
Act (HIPAA), which requires either that research access be approved by an institutional
review board (IRB) or that the data be deidentified per the HIPAA privacy rule.[3][4][5]
For a dataset to be deidentified, all protected health information (PHI) must be removed from
it. PHI includes direct identifiers such as name, address, social security
number, and medical record number. In addition, dates may be specified only by year.
Compliance with this restriction on dates poses both technical and analytical problems.
Database date formats typically require specifying a month and day as well as the
year. If this technical problem is addressed by setting all dates to the same arbitrary
day and month, such as January 1, then all temporal relationships among observations
within a given year are destroyed (as they would be if only the year was specified).
As an alternative to deidentified data, HIPAA also provides for “limited datasets”
(LDS) in which direct identifiers (such as name and address) cannot be included, but
actual complete dates for patient encounters may be used. IRB approval must be obtained
for deployment of LDS, and justification for complete dates must be provided. Many institutions
employ LDS in their research clinical data repositories. However, even though LDS
allow full actual dates, many health care organizations (HCO) still alter these dates
as an additional measure to increase the protection of patient privacy, retaining
temporal relationships but not actual temporal values.
Shifting encounter dates has become a common method employed by these institutions.
This approach aims to preserve the temporal integrity of a patient's chart by consistently
shifting all the dates in the chart by the same randomly chosen number of days.[6][7]
While the number used to shift dates is random among patients, it is constant within
a single patient's data. So, for a specific patient, a date-shifted prostate biopsy
precedes an initial radiation treatment by the same number of days as in the actual
data, whereas the deidentified dates differ from the actual dates. Typically, dates
are shifted randomly anywhere from ± 7 to ± 365 days. This algorithm of shifting dates
by a constant random value for a given patient, but varying the constant value between
patients, up to some maximum number of days, is the common method used across multiple
institutions. Only one variation was observed in which a site only selected random
multiples of 7 days to shift dates. This preserved the day of the week, which is potentially
of value for certain studies. No evidence of other shifting algorithms was observed
in the sites we evaluated.
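The common algorithm described above can be illustrated with a minimal sketch. The function names, the uniform distribution of offsets, and the possibility of a zero offset are assumptions made for illustration; the institutions' actual implementations were not available to us in code form.

```python
import random
from datetime import date, timedelta

def assign_offsets(patient_ids, max_days=365, multiple_of=1, seed=None):
    """Draw one random offset (in days) per patient within [-max_days, +max_days].

    Assumption: offsets are uniform and may be zero. Setting multiple_of=7
    mimics the single observed variation that preserves the day of the week.
    """
    rng = random.Random(seed)
    steps = max_days // multiple_of
    return {pid: rng.randint(-steps, steps) * multiple_of for pid in patient_ids}

def shift_records(records, offsets):
    """Apply each patient's constant offset to every one of that patient's dates."""
    return [(pid, d + timedelta(days=offsets[pid])) for pid, d in records]

records = [
    ("p1", date(2019, 3, 4)),   # e.g., prostate biopsy
    ("p1", date(2019, 4, 15)),  # e.g., initial radiation treatment
    ("p2", date(2019, 3, 4)),
]
offsets = assign_offsets(["p1", "p2"], max_days=365, seed=0)
shifted = shift_records(records, offsets)

# The interval between the first patient's two events survives the shift.
gap_before = records[1][1] - records[0][1]
gap_after = shifted[1][1] - shifted[0][1]

# Variation: offsets restricted to multiples of 7 days.
weekly_offsets = assign_offsets(["p1", "p2"], max_days=364, multiple_of=7, seed=0)
```

Because the offset is constant within a patient, `gap_before` and `gap_after` are equal, while the shifted dates themselves generally differ from the actual dates.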
The primary motivation for date shifting in the United States is HIPAA's base requirement
that dates have only the year specified. Adherence to this for encounter dates is
clearly unsuitable for many studies. Researchers and IRB panels may view limited datasets,
in which encounter dates can have their actual values, as posing too much of a risk
to patient privacy. Some researchers and panel members may not even
be aware of the possibility of using limited datasets. And in some instances, date
shifting may be seen as added assurance of maintaining patient privacy, although the
costs to studies where actual dates are necessary are likely not considered.
It should be noted that the ability of date shifting to protect patient privacy has
been questioned.[8][9] Dates can be inferred, and patients identified, by observing clinical content or
family relationships. A malevolent attempt to uncover patient identities, or even
one motivated by curiosity, is now punishable by law and is grounds for job dismissal
at most institutions. Of course, it can be argued that obscuring dates certainly protects
against technically unsophisticated individuals attempting to identify patients and
is definitely of value when a data breach has occurred.
While the efficacy of date shifting to protect patient privacy is defensible, there
are situations in which date shifting significantly detracts from the usefulness of
RWD, even making it altogether unusable for particular research purposes. These are
instances where having access to actual dates is more important than a temporal relationship
among a patient's clinical encounters. An example of a need for actual dates is a
study requiring an accurate number of patients initially diagnosed with liver cancer
in the calendar year 2017. It cannot be assumed that random date shifting results
in an equal number of patients being shifted in and out of the date range being studied.
Another example is a study that needs to know the number of coronavirus disease
2019 (COVID-19) patients diagnosed within a specific time period given the peculiarities
of timelines associated with viral variant emergence and approvals of treatments and
vaccines.[10][11] The fact that within a patient's data, temporal relationships have been preserved
is irrelevant. These studies will not yield valid results if the dates within the
real-world dataset have been shifted.
Objectives
Regardless of whether date-shifted RWD is appropriate for a given study,
researchers should be aware of whether the dates of a dataset being used have been
altered. Perhaps surprisingly, the authors' experience with datasets from multiple
institutions has been that this information is not easy to obtain. Not only is it not
always readily available, but it may not even be known by the current owners
of the dataset. Because it is not uncommon for reported date shifting status to be
unreliable or difficult to obtain, in this study we set out to develop analytic metrics
to detect the presence of date shifting and estimate its maximum magnitude in a given
dataset. We have defined and evaluated the reliability of several such metrics. We
propose that this methodology be used by data analysts to ascertain whether a dataset
they are working with contains actual or altered dates. Knowing this is of crucial
importance when a dataset is being used for such use cases as infectious disease research.
This methodology is meant to inform an analyst of the presence as well as an approximate
magnitude of a date shift. It is not intended to, nor is it capable of, deriving actual
dates in date-shifted data.
Methods
Feasibility of Detection
Given that certain studies require actual dates, it is crucial to know whether a particular
dataset contains actual or shifted dates before utilizing it. This might be learned
from a dataset's provenance, if available. But even if it is, it should be verified.
A reliable method for detecting possible shifted dates, therefore, is required prior
to any analysis of data.
Our first approach to creating a date shifting detection method was checking a dataset
to see whether a specific occurrence of medical significance occurred on the expected
date. A recent event presents a useful example: on March 19, 2020, then President
Trump declared hydroxychloroquine—an anti-malarial drug treatment—a “game changer”
in the fight against COVID-19.[12] Many datasets in the United States subsequently showed a pronounced spike in the
volume of hydroxychloroquine use ([Fig. 1]). This suggested that “sharp” temporal events—hydroxychloroquine spike and others
like it—can serve as markers to detect the presence of date shifting. In data where
dates have been shifted, a sharp spike may disappear, or appear in the wrong month,
indicating that date shifting has occurred.
Fig. 1 Patients treated with hydroxychloroquine show a spike at a large academic institution
in 2020.
However, the hydroxychloroquine spike, while a great illustration of a useful temporal
marker, has a limited applicability; only datasets with medication information and
preferably well-represented outpatient coverage would be expected to demonstrate this
feature. Of course, the dataset also must cover early 2020 to observe this event.
Using other possible sentinel dates entails similar potential pitfalls. The date of
the chosen event may not necessarily be as unique as thought. For example, if one
found the earliest instance of International Classification of Diseases (ICD)-10-CM
coding as an indication of the start of the use of ICD-10 as opposed to ICD-9, it
ignores institutions that delayed implementation of the new coding after its official
start. Also, it would not work at all for datasets in which all prior ICD-9 coding
was translated to their ICD-10 equivalent.
In addition to a singular event indicating date shifting, several other temporal features
which could give an indication of date shifting were studied. An obvious candidate
was the occurrence of seasonal medical events, such as influenza diagnosis or heat
stroke. One would expect surges in these diagnoses during appropriate months (fall
and winter or summer, respectively, for these examples). This approach of looking
for increased numbers of certain medical diagnoses appearing in expected months led
to yet another method: the day-of-the-week differences in the occurrence of medical
procedures that would not typically occur on weekends, such as routine physicals and
elective medical procedures.
Synthetic Models
To detect potential shifts of different magnitudes, we wanted to simulate how various
temporal “markers” would appear when dates were shifted by various amounts. We hypothesized
that small magnitude date shifts would obscure high-frequency (e.g., weekly) temporal
events, whereas large magnitude shifts would be necessary to obscure events that happen
only infrequently (e.g., once a year). We studied three date patterns:
- A pattern that occurs with high frequency is the drop in volume of observations during
weekends. For example, in the United States, it is very unlikely that an annual physical
would be scheduled on a Sunday.
- A seasonal pattern where a disease is more prominent in the summer or winter.
- A one-time drop, as in the case of patients postponing elective care at the beginning
of the COVID-19 pandemic.
For each of these scenarios, we modeled the behavior of these patterns when shifting
dates by various amplitudes (specifically, the maximum amplitude in the common date
shifting algorithm as described above) using synthetic data designed to exhibit the
given pattern. This allowed us to calibrate and understand, for example, how much
shift is needed to obscure a weekday/weekend or a seasonal or yearly pattern. We later
compared this information with the patterns observed in real-world datasets and
judged the possible presence of each degree of shift scenario. In [Fig. 2] we see the pattern observed for an encounter, such as a checkup, which normally
occurs only Monday through Friday. When the synthetic data are not shifted (i.e.,
number of days shifted = 0) we see almost no counts on Sunday and Saturday, with any
small counts attributable most likely to data entry error or special screening events.
As the number of days shifted assumes greater values (± 3, ± 7, ± 30, ± 90, ± 365),
we see a smoothing of the encounter counts over the days of the week such that there
is little difference between the weekend and weekdays. To an analyst unaware of a potential
date shift, this would erroneously imply that checkups are as likely to occur on the
weekend as during the week.
Fig. 2 Weekly pattern using synthetic data. The number above each graph indicates the number
of days shifted.
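The weekly scenario can be reproduced with a small simulation. The sketch below is not the code used to produce the figures; the generator, the two-year window, the patient and visit counts, and the function names are all assumptions chosen for illustration. It creates weekday-only encounters, applies a per-patient random offset up to a given maximum, and reports the share of encounters that land on a weekend.

```python
import random
from collections import Counter
from datetime import date, timedelta

def make_weekday_only_records(n_patients=2000, visits_per_patient=3, seed=1):
    """Synthetic (patient_id, date) pairs that fall only on Monday-Friday."""
    rng = random.Random(seed)
    start = date(2018, 1, 1)
    records = []
    for pid in range(n_patients):
        for _ in range(visits_per_patient):
            d = start + timedelta(days=rng.randrange(730))
            while d.weekday() >= 5:  # redraw until the date is a weekday
                d = start + timedelta(days=rng.randrange(730))
            records.append((pid, d))
    return records

def weekend_share(records, max_shift, seed=2):
    """Fraction of encounters on Saturday or Sunday after per-patient shifting."""
    rng = random.Random(seed)
    offsets = {}
    counts = Counter()
    for pid, d in records:
        if pid not in offsets:
            # One constant offset per patient, uniform in [-max_shift, +max_shift].
            offsets[pid] = rng.randint(-max_shift, max_shift)
        counts[(d + timedelta(days=offsets[pid])).weekday()] += 1
    return (counts[5] + counts[6]) / sum(counts.values())

records = make_weekday_only_records()
shares = {m: weekend_share(records, m) for m in (0, 3, 7, 30, 90, 365)}
for m, share in shares.items():
    print(f"max shift ±{m:>3} days: weekend share = {share:.3f}")
```

In this sketch the unshifted data have a weekend share of zero, and once the offset range covers a full week the share approaches roughly two sevenths, which is what a uniform distribution over the days of the week would give.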
[Fig. 3] shows the distortion seen when data with a one-time drop caused by a sentinel event
in certain encounters are date shifted. In the synthetic dataset an abrupt drop in
months 16 through 18 is totally obscured as shifting values are increased.
Fig. 3 One-time drop caused by a sentinel event using synthetic data. The number above each
graph indicates the number of days shifted.
In [Fig. 4] we see the effect of date shifting on synthetic data for seasonal encounters such
as influenza. Once again, increasing the number of days shifted obscures peaks actually
observed during certain seasons.
Fig. 4 Yearly pattern of seasonal events using synthetic data. The number above each graph
indicates the number of days shifted.
Our experimentation with synthetic models confirmed that date shifting affects the
appearance of expected temporal patterns such as volume fluctuations between weekdays
and weekends, significant events, such as the COVID-19 pandemic, and seasonal disease
patterns. Furthermore, they demonstrated that the maximum magnitude of date shift
affects these patterns in a predictable manner, obscuring them proportionally to the
magnitude of the shift.
Real-World Data
The TriNetX Global Network of data from health care organizations was utilized for
this study.[13] The patient data available is from EMR systems, tumor registries, and departmental
systems (e.g., pathology). We hypothesized that observations of the following medical
encounters for the health care organizations in our study would exhibit seasonal patterns
when date-shifting was not employed. The ICD-10-CM codes for these encounters are
noted in parentheses:
Sunburn, heatstroke, URI, and influenza all should have a strong yearly pattern, meaning
a few months of very high volume, a few months of a shoulder season, and otherwise
relatively low volume. If this pattern is not observed, it can be assumed the HCO
is shifting, perhaps by as many as 365 days. To evaluate whether this pattern was
present we considered only sites with at least 500 observations of each of these specific
diagnoses and determined the distribution of these encounters by computing monthly
sums (ignoring the year) and then the median of the monthly sums. If a site exhibited
at least 2 months with the number of encounters for a diagnosis exceeding 1.5 times
the median value, then we concluded the data were not shifted. Correspondingly, if
the monthly encounter numbers were more uniformly distributed, the site was deemed
to have provided date-shifted data.
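The seasonal rule above can be expressed as a short function. This is a sketch under stated assumptions: the function name and input format (a list of encounter dates for one diagnosis at one site) are hypothetical, and months with no encounters are counted as zero when computing the median.

```python
from collections import Counter
from datetime import date
from statistics import median

def shows_seasonal_peak(encounter_dates, min_observations=500,
                        ratio=1.5, min_peak_months=2):
    """Return True if the yearly pattern is present (data judged not shifted),
    False if the months look uniform, or None if the site has too few observations."""
    if len(encounter_dates) < min_observations:
        return None
    by_month = Counter(d.month for d in encounter_dates)  # year is ignored
    sums = [by_month.get(m, 0) for m in range(1, 13)]
    threshold = ratio * median(sums)
    peak_months = sum(1 for s in sums if s > threshold)
    return peak_months >= min_peak_months

# Illustrative inputs: a winter-peaked diagnosis and a flat one.
seasonal = ([date(2019, 12, 1)] * 200 + [date(2019, 1, 1)] * 200
            + [date(2019, m, 1) for m in range(2, 12) for _ in range(20)])
flat = [date(2019, m, 1) for m in range(1, 13) for _ in range(50)]

print(shows_seasonal_peak(seasonal))  # two months exceed 1.5 times the median
print(shows_seasonal_peak(flat))      # no month exceeds 1.5 times the median
```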
We further hypothesized that nonacute visits, such as routine physical encounters
or encounters for dermatitis, would follow a Monday–Friday occurrence, with infrequent
weekend occurrences in unshifted data:
- Routine physical (Z00)
- Dermatitis (L20–L30)
Routine checkups and treatment for dermatitis would have likely been postponed during
March and April 2020 due to COVID-19. Once again, only sites with at least 500 observations
for routine checkups and/or dermatitis treatment were considered. We computed the
median of monthly encounter sums and concluded that any site having a decrease in
such encounters of at least 80% in March 2020 and/or at least 50% in April 2020 had
not date shifted their data. Otherwise, a shift of 30 days or more would be assumed.
We selected these thresholds because we hypothesized that these values would capture
substantial changes and not minor variation.
In addition, a previous study[10] by our group showed decreases in breast and colorectal cancer screenings of 89%
and 84% for April 2020.
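A sketch of the sentinel-event rule follows. The text does not spell out the baseline in detail, so this example assumes the decrease is measured against the median of the monthly sums over all calendar months that contain at least one encounter; the function name and input format are likewise hypothetical.

```python
from collections import Counter
from datetime import date
from statistics import median

def shows_pandemic_drop(encounter_dates, march_drop=0.80, april_drop=0.50,
                        min_observations=500):
    """Return True if the March/April 2020 drop is visible (data judged not shifted),
    False otherwise, or None if the site has too few observations."""
    if len(encounter_dates) < min_observations:
        return None
    monthly = Counter((d.year, d.month) for d in encounter_dates)
    # Assumption: baseline is the median over months that have any encounters.
    baseline = median(monthly.values())

    def decrease(year_month):
        return 1 - monthly.get(year_month, 0) / baseline

    return decrease((2020, 3)) >= march_drop or decrease((2020, 4)) >= april_drop

def build(counts_by_month, default=50):
    """Illustrative data: `default` encounters per month in 2019-2020 unless overridden."""
    return [date(y, m, 15)
            for y in (2019, 2020) for m in range(1, 13)
            for _ in range(counts_by_month.get((y, m), default))]

dipped = build({(2020, 3): 5, (2020, 4): 20})  # sharp drop in spring 2020
flat = build({})                               # no visible drop

print(shows_pandemic_drop(dipped), shows_pandemic_drop(flat))
```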
Similarly, we expect checkups and dermatitis to occur mainly on weekdays and not weekends.
The day-of-week distribution of routine physicals and dermatitis treatments was computed
and then the median of the day-of-week encounter sums obtained. If the sum for Saturday
or Sunday exceeded 0.25 times the median, we concluded a shift of at least 7 days.
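The day-of-week rule can be sketched in the same way. The function name and input format are hypothetical, and the example assumes the median is taken over all seven day-of-week sums.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import median

def weekend_volume_suggests_shift(encounter_dates, ratio=0.25, min_observations=500):
    """Return True if Saturday or Sunday volume exceeds ratio * median of the
    seven day-of-week sums, False otherwise, or None if there are too few observations."""
    if len(encounter_dates) < min_observations:
        return None
    by_day = Counter(d.weekday() for d in encounter_dates)  # Monday=0 ... Sunday=6
    sums = [by_day.get(i, 0) for i in range(7)]
    threshold = ratio * median(sums)
    return sums[5] > threshold or sums[6] > threshold

monday = date(2019, 1, 7)  # a Monday, used to build illustrative data

def build(per_day_counts):
    """per_day_counts lists the number of encounters for Monday through Sunday."""
    return [monday + timedelta(days=i)
            for i, n in enumerate(per_day_counts) for _ in range(n)]

unshifted = build([120, 120, 120, 120, 120, 3, 2])  # almost no weekend volume
shifted = build([90] * 7)                           # uniform over the week

print(weekend_volume_suggests_shift(unshifted), weekend_volume_suggests_shift(shifted))
```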
[Fig. 5] summarizes the decision tree applied to the encounter data from the sites in our
study.
Fig. 5 Outline of the procedure used to detect the presence and degree of date shifting
in the datasets studied.
Results
We applied our date shifting detection methods—looking at sentinel events (COVID-19
pandemic), seasonal patterns for certain diagnoses, and weekday patterns for elective
encounters—to 76 sites in the United States. Twenty-two sites exhibited a conflict,
an inconsistency between our preexisting records of the presence of date shifting
and comparison of the observed data patterns with our synthetic models. We contacted
those 22 institutions in an attempt to reconcile these conflicts. We established a
dialog with 17 of 22 sites where we asked the sites to double-check their date shifting
status and we reevaluated our interpretation of the model to arrive at a consensus.
The remaining five sites did not respond to our inquiries, and we excluded these sites
from further analyses.
These conflict resolution attempts allowed us to fine-tune our date shifting detection
methodology. Most significantly, we made the decision to remove seasonal URI from
consideration as it did not appear as predictive as anticipated. Respiratory diseases
do not appear to be as sharply seasonal as we originally hypothesized. When we compared
URI with the other seasonal predictors, we found that it was more likely to give false
positives, potentially due to differences in onset and duration of URI waves by location
within the United States, or due to differences in how the URI diagnoses are captured.
Seasonal predictors also are dependent on geographic location, at the top level on whether
the data source is in the northern or southern hemisphere. Our decision to confine
this study to U.S. data providers was primarily based on removing geographic considerations
from seasonal predictors, although north–south location within the United States is
still a factor on the onset and degree of seasonal predictors.
For the 71 HCOs in this study, [Fig. 6] shows the observed presence of date shifting. Thirty-nine organizations, or 55%,
displayed no date shifting by our methodology. This conclusion was confirmed by individual
data providers. Our methodology concluded that another 28 organizations showed some
amount of date shifting, which was also confirmed by the providers. For four organizations
our methodology's conclusions differed from the provider's description and we were
not able to resolve the conflict to a mutual satisfaction.
Fig. 6 Observed presence of date shifting by study methodology, which was in agreement with
data provider's description of their dataset. Datasets where there was disagreement
between the study's methodology and the provider on the presence of date shifting
are shown as “conflict.”
[Fig. 7] shows the distribution of the number of days for which dates were shifted among
the 28 HCOs in the study for which the observed date shifting was confirmed by the
data provider. A quarter of the shifting institutions do so by 7 days, and a fifth
by 365, with the rest in the middle.
Fig. 7 Distribution of the magnitude of the date shift (in days) for the 28 health care
organizations with confirmed observed date shifting.
Discussion
Prediction of Date Shifting Presence
The most reliable encounter observation for predicting date shifting is the routine
medical checkup (ICD-10-CM Z00) when its distribution over day-of-week is extracted
([Table 1]). In the United States these encounters should rarely if ever occur over the weekend
(Saturday or Sunday). This conclusion is applicable only to countries like the United
States where these exams are normally scheduled for Mondays through Fridays exclusively.
Other medical encounters, such as for sunburn, are not reliably tracked in datasets.
Encounters for upper respiratory infections or influenza, while having seasonal increases,
are not sufficiently distinct during certain months.
Table 1
Correlation between observed encounter and date shift detection
Measure | Yearly/day-of-week | Correlation | Number of HCOs
Sunburn | Yearly | 0.71 | 30
Influenza | Yearly | 0.69 | 70
URI | Yearly | 0.20 | 70
Checkup | Yearly | 0.25 | 61
Dermatitis | Yearly | 0.53 | 70
Checkup | Day-of-week | 0.92 | 61
Dermatitis | Day-of-week | 0.79 | 70
Abbreviations: HCO, health care organization; URI, upper respiratory infection.
Note: The strength of the correlation demonstrates the measure's ability to correctly
predict date shifting.
[Fig. 8] shows the distribution by day of week of routine medical checkup encounters for
the HCOs studied. The left plot shows the expected pattern with almost no encounters
occurring on Sunday or Saturday. The data provider confirmed that no date shifting
was applied to this dataset. The right plot shows an almost uniform distribution of
these encounters over every day of the week. The degree of date shifting for this
dataset, confirmed by the data provider, was ± 7. Thus, date shifting of any magnitude
from a week to a year will obliterate the expected weekday-only occurrence of routine
medical checkups.
Fig. 8 Distribution by day of week of routine medical checkup encounters for the health
care organizations studied.
Another limitation of using day of week occurrence of routine encounters was observed
for the single data provider who chose to shift their data by a multiple of 7 days,
thus preserving the day of week of the encounter. This form of date shifting can be
exposed by the unexpected volume of routine encounters on holidays such as Christmas
day, New Year's Day, Memorial Day, Independence Day, Labor Day, and Thanksgiving.
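The holiday check can be sketched as a comparison of average volume on holidays with average volume on ordinary weekdays. The example below is illustrative only: it considers just three fixed-date holidays (the floating holidays named above would need a calendar rule), the function name is hypothetical, and no decision threshold is implied because none was defined in this study.

```python
from collections import Counter
from datetime import date, timedelta

FIXED_HOLIDAYS = {(1, 1), (7, 4), (12, 25)}  # New Year's Day, Independence Day, Christmas

def holiday_to_ordinary_ratio(encounter_dates):
    """Mean encounters on weekday holidays divided by mean encounters on other
    weekdays, over the date range covered by the data. Returns None if undefined."""
    if not encounter_dates:
        return None
    per_day = Counter(encounter_dates)
    day, last = min(per_day), max(per_day)
    holiday_counts, ordinary_counts = [], []
    while day <= last:
        if day.weekday() < 5:  # weekends are skipped entirely
            is_holiday = (day.month, day.day) in FIXED_HOLIDAYS
            (holiday_counts if is_holiday else ordinary_counts).append(per_day.get(day, 0))
        day += timedelta(days=1)
    if not holiday_counts or sum(ordinary_counts) == 0:
        return None
    return ((sum(holiday_counts) / len(holiday_counts))
            / (sum(ordinary_counts) / len(ordinary_counts)))

def is_open(d):
    return d.weekday() < 5 and (d.month, d.day) not in FIXED_HOLIDAYS

# Illustrative data: 10 patients seen on every open weekday of 2019.
days_2019 = [date(2019, 1, 1) + timedelta(days=i) for i in range(365)]
actual = [(pid, d) for d in days_2019 if is_open(d) for pid in range(10)]
# Each patient gets a constant offset that is a multiple of 7 days.
shifted = [(pid, d + timedelta(days=7 if pid % 2 else -7)) for pid, d in actual]

ratio_actual = holiday_to_ordinary_ratio([d for _, d in actual])
ratio_shifted = holiday_to_ordinary_ratio([d for _, d in shifted])
print(ratio_actual, round(ratio_shifted, 2))
```

In the unshifted data the ratio is zero because no routine encounters fall on the holidays, whereas shifting by multiples of 7 days moves a noticeable volume onto them even though the day of the week is preserved.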
No Potential for Date Reidentification
Concern that this methodology may be improperly used to restore (reidentify) actual
encounter dates is understandable. It should be kept in mind that use of actual encounter
dates is permissible if IRB approval has been obtained for use of an LDS. As mentioned
in the Background section above, whenever individual patient data are loaded or refreshed
in a dataset, a random patient-specific date offset value within the range chosen
by the provider is applied to that patient's dates. For example, if the dataset offset
range is ± 365, the first patient's data may randomly be assigned an offset value
of −34. All dates in that patient's data in the dataset then have 34 days subtracted. The
second patient in the dataset may be randomly assigned a date offset of +182. All
dates in that patient's data in the dataset then have 182 days added. Even after determining
with our methodology that the entire dataset is date shifted by up to 365 days, it is not
evident how one would know what offset value to apply to the dates in each patient
record in the dataset (in this example, adding 34 days for the first patient while
subtracting 182 days for the second patient). Possibly one could make multiple attempts,
adjusting each patient's dates by all possible combinations of date offset values
up to 365 to find the adjusted dataset with the minimal deviance from nonurgent encounters
occurring on weekdays. But there is no guarantee that this minimum deviance dataset
is correct or unique. To reiterate, the methodology to detect the presence and maximum
amplitude of a date shift we are proposing is neither meant to circumvent the deidentification
in any way, nor can the authors see how it could be used for this purpose.
Our study shows that shifting by a varying random number of days for each patient's
chart provides a signal that the dates have been shifted. This signal is desirable
as it indicates the dataset is not suitable for studies in which actual dates are
necessary. However, if the intent is to confuse or hide the date shifting from the
research data consumer, then this can be achieved by using nonrandom numbers of days
for shifting, particularly multiples of 7. We do not recommend this obscuring of the
application of date shifting.
Conclusion
The obscuring effect of date shifting diminishes the usefulness of RWD for studies
when actual dates are required. Dataset provenance is not always reliable even when
available. Knowledge of whether date shifting has occurred is necessary when using
real-world datasets. The objective of this study was to develop methods for detecting
date shifting of encounter data in datasets but not for date reidentification.
We found a simple test to detect the existence of date shifting in an EMR dataset
from U.S. HCOs: checking whether almost all routine medical exams or similar nonurgent
encounters were performed on weekdays. A substantial share of such encounters on weekends
is predictive of a date shift of at least
7 days. Other measures such as the patterns of seasonal illnesses or sentinel temporal
events allowed us to detect the magnitude of the shift.
For non-U.S. data sources, or datasets not containing routine physical encounters,
a comparable test could be developed using other nonurgent encounters having a customary
distribution over certain days of the week.
Slightly over one-half (39 of 71, 55%) of institutions in our U.S. sample do not shift
dates. Of those that shift, a quarter do so by ± 7 days and about one-fifth by ± 365
days, with the rest shifting by various amounts in between. The heterogeneity of the
shift magnitude is telling; there does not seem to be a widely accepted agreement
on the application of date shifting among HCOs.
Clinical Relevance Statement
This study provides methodology for users of RWD to ascertain whether the dates in
the data being used have been shifted and, if so, to what degree. This contributes
to providing valid results of studies utilizing RWD.
Multiple-Choice Questions
1. What is the most reliable encounter type for detecting date shifting?
  a. Emergency surgeries
  b. Routine physicals
  c. CT scans
  d. Phlebotomy
Correct Answer: The correct answer is option b. These encounters are scheduled and do not occur
unexpectedly. They are usually scheduled only on certain days of the week. (For example,
routine physicals usually occur only on weekdays Monday through Friday in the United
States.) Hence, they should display an appropriate day-of-week pattern.
2. Can real-world datasets ever contain actual complete dates and be HIPAA Safe Harbor-compliant?
  a. No, never
  b. Yes, if dates are at least 5 years ago
  c. Yes, if it is an LDS
  d. Yes, if leap years are not included
Correct Answer: The correct answer is option c. Under HIPAA, an LDS is health information that excludes
certain direct identifiers (such as patient's name) but may include certain dates
with all elements. Use of an LDS requires IRB approval and is not considered deidentified
data under the privacy rule.