CC BY-NC-ND 4.0 · Gesundheitswesen 2020; 82(S 02): S94-S100
DOI: 10.1055/a-0883-5098
Review Article
Owner and Copyright © Georg Thieme Verlag KG 2019

Data Privacy Compliant Validation of Health Insurance Claims Data: the IDOMENEO Approach

Data Privacy Compliant Validation of Routine Data: the IDOMENEO Method
Christian-Alexander Behrendt (1), Thea Schwaneberg (1), Sandra Hischke (1, 2), Tobias Müller (3), Tom Petersen (3), Ursula Marschall (4), Sebastian Debus (1), Levente Kriston (2)

1   Department of Vascular Medicine, Work Group GermanVasc, University Medical Center Hamburg-Eppendorf, Hamburg
2   Department of Medical Psychology, University Medical Center Hamburg-Eppendorf, Hamburg
3   Department of Informatics, University of Hamburg, Hamburg
4   BARMER, Healthcare Research, Wuppertal
Funding The IDOMENEO study is funded by the German Joint Federal Committee (Gemeinsamer Bundesausschuss, G-BA) (01VSF16008) and the GermanVasc registry is co-funded by the German Stifterverband as well as by the CORONA foundation (S199/10061/2015).

Correspondence

Dr. Christian-Alexander Behrendt, MD
Department of Vascular Medicine, Work Group GermanVasc,
University Medical Center Hamburg-Eppendorf
Martinistraße 52
20246 Hamburg

Publication History

Publication Date:
23 May 2019 (online)

 

Abstract

Recently, health insurance claims have regained the attention of the scientific community as a source of real-world evidence in health care research and quality improvement. To date, very few studies are available which investigate the validity of health insurance claims; these may be affected by bias from several sources, such as possible upcoding of co-morbidities and complications for reimbursement advantages. The IDOMENEO study investigates the inpatient treatment of peripheral arterial disease (PAD) comprehensively using various data sources with a consortium involving experts from health care research and data privacy, a large health insurance fund, biostatisticians, jurists, and computer scientists. Prospective registry data were collected from 30–40 vascular centres in Germany using the GermanVasc registry. In addition, health insurance claims data were prospectively collected from BARMER, the second largest health insurance fund in Germany. The consortium is currently developing a data privacy compliant method of health insurance claims data validation, the methodological foundations of which are described here.


#

Summary

Routine data are attracting increasing attention from the scientific community in projects of health services research and quality development. To date, however, only a few studies are available that address the validity of routine data; these data may be subject to bias, for example through miscoding of co-morbidities or complications to gain reimbursement advantages. The IDOMENEO study investigates inpatient treatments of patients with peripheral arterial disease (PAD) and uses various data sources for this purpose. The study consortium comprises experts from the fields of health services research, data privacy, payers, biostatistics, law, and computer science. Primary data from registries are collected prospectively at 30–40 vascular centres via the GermanVasc registry. In addition, routine data from BARMER, the second largest statutory health insurance fund in Germany, are analysed. The consortium is currently developing data privacy compliant methods to validate the routine data. The methodological foundations are described in this article.


#

Background

Owing to the emerging digital revolution and the implementation of diagnosis-related groups (DRG) in the United States healthcare financing system, hospital data that were originally collected for statistical and administrative purposes are becoming accessible and sufficiently structured for broad research purposes [1]. This development is accompanied by an ongoing controversy regarding the validity of administrative health data.

Recently, health insurance claims have regained the attention of the scientific community as a source of real-world evidence in health care research, quality improvement, and the further development of so-called pragmatic trials [2]. To date, very few studies are available which investigate the validity of health insurance claims, which may suffer from several sources of bias, for example possible upcoding of co-morbidities and complications for reimbursement advantages. Since national reimbursement and classification systems differ, the transferability of validation studies from one country to another remains unknown [3] [4]. Various ways to validate data exist, but a direct comparison of data from a source with unknown validity with data from another source with known high validity remains the method of choice. Under most circumstances, the European Union General Data Protection Regulation (GDPR) and ethical considerations complicate the direct cross-linking of registries with claims for validation purposes [5] [6]. Thus, the development and testing of alternative methodological approaches for the validation of claims data is necessary. The IDOMENEO study investigates the inpatient treatment of peripheral arterial disease (PAD) comprehensively using various data sources with a consortium involving experts from healthcare research and data privacy, a large health insurance fund, biostatisticians, jurists, and computer scientists [7] [8]. The consortium is currently developing a data privacy compliant method of health insurance claims data validation, the methodological foundations of which are described here.


#

Data Sources

The IDOMENEO study

The IDOMENEO study aims to prospectively collect both primary data and health insurance claims data in the field of PAD, a widespread disease with more than 1 million affected inhabitants in Germany undergoing more than 300,000 invasive revascularisation procedures per year [9]. The target population of the IDOMENEO study comprises patients hospitalised with symptomatic PAD who are treated with catheter-based endovascular revascularisations, open-surgical endarterectomy, or bypass surgery. Information on patients’ co-morbidities and outcomes in claims data will be contrasted with prospectively collected registry data to answer the question of how claims data can be utilised for research and quality improvement in vascular medical care [7] [8] [10] [11].


#

Registry data

Registry data will be obtained through the GermanVasc registry, a prospective non-randomized multicentre registry including invasive revascularisations performed in 10,000 patients treated for symptomatic PAD at 30 to 40 German hospitals with 3 follow-up measures within 12 months. Automated completeness and plausibility checks and independent site visit monitoring will be performed to assure high internal validity [11] [12].


#

Health insurance claims data

Parallel to the registry, health insurance claims data routinely collected by the BARMER health insurance fund will be obtained. BARMER is Germany’s second largest insurance provider, documenting the medical care provided to approximately 9.4 million German citizens (13.2% of Germany’s population). Data from patients with symptomatic PAD, identified by International Classification of Diseases codes (ICD-10-GM: I70.0, I70.20–24, and I70.9 up to 2014 and I70.0, I70.20–25, I70.29, and I70.9 from 2015), who were treated by invasive revascularisation, identified by the Operationen- und Prozedurenschlüssel (OPS; the German version of the International Classification of Procedures in Medicine), in all German hospitals will be included. We expect to include data on around 10,000 to 20,000 patients.


#
#

Validation Methods

Validation approaches

We present 2 main approaches to contrast claims with registry data without individual cross-linking for validation purposes. The first approach (model-based validation) compares models fitted to both data sets, while the second approach (stratification-based validation) contrasts descriptive estimates for comparable subgroups from the 2 data sets ([Fig. 1]).

Fig. 1 Illustration of the IDOMENEO approaches to validate health insurance claims data (BARMER) with prospectively collected and quality assured registry data (GermanVasc).

#

Model-based validation

Basic principles

The central presumption of the first approach is that validity is not a feature of the data but rather of the interpretation and consequences of the analyses that are performed on the data [13]. Therefore, this approach validates not the health insurance claims data themselves but rather the models that are fitted to the data ([Fig. 1]). The essence of the approach is fitting the same statistical model to the claims and the registry data and using global and local indicators of model fit to compare the results. In order to account for the hierarchical structure of the data (patients clustered within hospitals), multilevel models of various complexity are an ideal choice for data analysis [14] [15].


#

Random intercept model without covariates

The random intercept model in the registry (R) can be defined as

$$y_{R,ij} = \beta_{R,0j} + r_{R,ij},$$

where $y_{R,ij}$ is the observed outcome of patient $i$ in hospital $j$, $\beta_{R,0j}$ is the mean outcome in hospital $j$, and the residuals are normally distributed as

$$r_{R,ij} \sim N(0, \sigma^2_{R,r})$$

with a mean of zero and a variance of $\sigma^2_{R,r}$.

The distribution of the hospital means (random intercept) can be defined as

$$\beta_{R,0j} = \gamma_{R,00} + u_{R,0j},$$

where $\gamma_{R,00}$ is the grand mean of the outcome, and the hospital-specific deviation from this mean is normally distributed as

$$u_{R,0j} \sim N(0, \sigma^2_{R,u})$$

with a mean of zero and a variance of $\sigma^2_{R,u}$.

Correspondingly, the same model fitted in the claims data (C) can be written as

$$y_{C,ij} = \beta_{C,0j} + r_{C,ij},$$

$$r_{C,ij} \sim N(0, \sigma^2_{C,r}),$$

$$\beta_{C,0j} = \gamma_{C,00} + u_{C,0j},$$

$$u_{C,0j} \sim N(0, \sigma^2_{C,u}).$$

Parameters $\gamma_{R,00}$ and $\gamma_{C,00}$ describe the mean level of an outcome in the population (e. g., amputation-free survival time or prevalence of major adverse limb events), which is frequently targeted in clinical and epidemiological research. Comparing $\gamma_{R,00}$ with $\gamma_{C,00}$ reveals whether the level of the outcome in the target population is estimated similarly across the data sources and thus whether claims data can be used to answer corresponding research questions. Comparing $\sigma^2_{R,r}$ with $\sigma^2_{C,r}$ shows whether the amount of variation in the outcome within hospitals is estimated similarly, and contrasting $\sigma^2_{R,u}$ with $\sigma^2_{C,u}$ informs about the comparability of estimates of the variation across hospitals. A statistically non-significant formal comparison of these parameters between the data sources, together with a reasonably narrow confidence interval of the estimated difference, is necessary to support the validity of the claims data for investigating the grand mean of the analysed outcome.
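The comparison described above can be illustrated with a minimal sketch. It is not the estimation procedure of the IDOMENEO study: it assumes a balanced design (equal numbers of patients per hospital), uses simple method-of-moments (one-way ANOVA) estimators instead of a full multilevel fit, and treats the two samples as independent when forming the confidence interval of the difference. All function names and the simulated values are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def simulate(n_hospitals, n_patients, grand_mean, sd_between, sd_within, seed):
    """Simulate a balanced data set: patients (i) nested within hospitals (j)."""
    rng = random.Random(seed)
    hospitals = []
    for _ in range(n_hospitals):
        hospital_mean = grand_mean + rng.gauss(0.0, sd_between)  # beta_0j
        hospitals.append([hospital_mean + rng.gauss(0.0, sd_within)
                          for _ in range(n_patients)])
    return hospitals


def fit_random_intercept(hospitals):
    """Method-of-moments estimates of gamma_00, sigma2_r and sigma2_u."""
    j, n = len(hospitals), len(hospitals[0])
    if any(len(h) != n for h in hospitals):
        raise ValueError("this sketch only handles equal hospital sizes")
    means = [fmean(h) for h in hospitals]
    grand = fmean(means)
    ms_within = sum((y - m) ** 2
                    for h, m in zip(hospitals, means) for y in h) / (j * (n - 1))
    ms_between = n * sum((m - grand) ** 2 for m in means) / (j - 1)
    return {
        "gamma00": grand,                                  # grand mean
        "var_r": ms_within,                                # within-hospital variance
        "var_u": max(0.0, (ms_between - ms_within) / n),   # between-hospital variance
        "se_gamma00": (ms_between / (j * n)) ** 0.5,       # standard error of grand mean
    }


def compare_grand_means(fit_registry, fit_claims, level=0.95):
    """Difference of grand means with a normal-approximation confidence interval.

    Assumes the two estimates are independent, which is a simplification.
    """
    diff = fit_claims["gamma00"] - fit_registry["gamma00"]
    se = (fit_registry["se_gamma00"] ** 2 + fit_claims["se_gamma00"] ** 2) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, (diff - z * se, diff + z * se)


if __name__ == "__main__":
    registry = fit_random_intercept(simulate(30, 50, 10.0, 1.0, 2.0, seed=1))
    claims = fit_random_intercept(simulate(30, 50, 10.0, 1.0, 2.0, seed=2))
    print(registry, claims, compare_grand_means(registry, claims))
```

A confidence interval that includes zero and is narrow relative to a clinically relevant difference would be in line with the criterion stated above; what counts as "reasonably narrow" has to be defined for the outcome at hand.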

For quality assessment, investigating the association of $u_{R,0j}$ with $u_{C,0j}$ reveals whether the ranking of the hospitals regarding the outcome is similar in the 2 data sources. A high rank correlation (e. g., 0.80 or higher) would support the validity of using claims data for ranking hospitals with regard to the outcome. Note that this ranking comparison requires hospitals to be identifiable in both data sources.
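As an illustration of the ranking comparison, the following sketch computes Spearman's rank correlation between hospital-specific deviations estimated from the two data sources. Spearman's coefficient is one possible rank correlation; the text does not prescribe a specific one. The hospital identifiers and deviation values are hypothetical, and only hospitals present in both sources are compared.

```python
from statistics import fmean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def hospital_rank_correlation(u_registry, u_claims):
    """Compare hospital deviations for hospitals identifiable in both sources."""
    common = sorted(set(u_registry) & set(u_claims))
    return spearman([u_registry[h] for h in common],
                    [u_claims[h] for h in common])


# Hypothetical hospital-level deviations (u_0j) from the two models
u_registry = {"H1": -0.4, "H2": 0.1, "H3": 0.6, "H4": -0.2, "H5": 0.3}
u_claims = {"H1": -0.3, "H2": 0.2, "H3": 0.5, "H4": 0.0, "H5": -0.1}
rho = hospital_rank_correlation(u_registry, u_claims)
```

In this made-up example the coefficient is 0.70, which would fall short of the 0.80 threshold mentioned above.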


#

Random intercept model with covariates

Adding patient-level covariates to the random intercept model results in the registry model

$$y_{R,ij} = \beta_{R,0j} + \beta_{R,1} \times x_{R,1} + r_{R,ij},$$

$$r_{R,ij} \sim N(0, \sigma^2_{R,r}),$$

$$\beta_{R,0j} = \gamma_{R,00} + u_{R,0j},$$

$$u_{R,0j} \sim N(0, \sigma^2_{R,u}),$$

and the claims data model

$$y_{C,ij} = \beta_{C,0j} + \beta_{C,1} \times x_{C,1} + r_{C,ij},$$

$$r_{C,ij} \sim N(0, \sigma^2_{C,r}),$$

$$\beta_{C,0j} = \gamma_{C,00} + u_{C,0j},$$

$$u_{C,0j} \sim N(0, \sigma^2_{C,u}).$$

Compared to the random intercept model without covariates, the interpretation of the parameters $\gamma_{R,00}$ and $\gamma_{C,00}$, $\sigma^2_{R,r}$ and $\sigma^2_{C,r}$, as well as $\sigma^2_{R,u}$ and $\sigma^2_{C,u}$ changes, as they are now controlled (adjusted) for the effect of the patient-level variables $x_{R,1}$ and $x_{C,1}$, respectively. Thus, $\gamma_{R,00}$ and $\gamma_{C,00}$ describe grand means given a fixed value of the covariate, and variance related to the covariate is removed from $\sigma^2_{R,r}$ and $\sigma^2_{C,r}$. The hospital-level variance parameters $\sigma^2_{R,u}$ and $\sigma^2_{C,u}$, as well as the deviations $u_{R,0j}$ and $u_{C,0j}$, should be interpreted in this model as variation between hospitals that is not attributable to differences in the covariate.
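To make the adjusted interpretation explicit, the hospital-level equation can be substituted into the patient-level equation. The following worked step is an illustration for the registry model (the claims model is analogous); the variance statement additionally assumes that the hospital deviations and the patient residuals are independent, which is a standard assumption of such models but is not stated explicitly above.

```latex
% Composite (single-equation) form of the registry model with one covariate
y_{R,ij} = \gamma_{R,00} + \beta_{R,1} \times x_{R,1} + u_{R,0j} + r_{R,ij}

% Assuming u_{R,0j} and r_{R,ij} are independent, the variance of the
% outcome at a fixed covariate value splits into the two components
\operatorname{Var}(y_{R,ij} \mid x_{R,1}) = \sigma^2_{R,u} + \sigma^2_{R,r}
```

Read this way, the grand mean describes the expected outcome at a covariate value of zero (or at the reference value after centring), and both variance components describe what remains after the covariate effect has been accounted for.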

Particularly when several patient-level covariates are included, comparing estimates of the models based on registry and claims data reveals the trustworthiness of claims data in answering prognostic and predictive research questions (i. e., associations between covariates and outcomes) as well as in case-mix adjusted description, comparison, and ranking of hospitals for benchmarking and quality improvement.


#

Model appraisal

In addition to comparing single estimates (local parameters) from models fitted to claims and registry data, classification of models as a whole is possible. Based on the literature on the cross-group invariance of measurement [16], the following levels of validity (i. e., correspondence between model estimates from claims and registry data) can be outlined:

  • Configural validity of a model from claims data is given when it is possible to include the same covariates ($x_1$, $x_2$, etc.) in the claims data model as in the reference registry data-based model.

  • Metric validity is given when the estimated regression coefficients ($\beta_1$, $\beta_2$, etc.) from the claims data display the same direction as the corresponding coefficients from the registry data with overlapping 95% confidence intervals.

  • Scalar validity requires that the intercepts ($\gamma$) are comparable between the models from claims and registry data.

  • Strict validity is given when the variance parameters ($\sigma^2$) estimated from the claims data are identical to those estimated from the registry data.

Particularly in complex models, it is possible that some parts of the model show strong (scalar or strict) validity while others show weak (configural or metric) validity, resulting in partial validity.
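The levels listed above can be sketched as a simple classification routine. The text does not define how "comparable" intercepts or "identical" variances should be operationalised, so this sketch makes two labelled assumptions: intercepts count as comparable when their 95% confidence intervals overlap, and variances count as identical when they differ by no more than a relative tolerance (`rel_tol`, hypothetical). The levels are treated as cumulative, and all class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    value: float
    ci_low: float
    ci_high: float


@dataclass
class FittedModel:
    coefficients: dict   # covariate name -> Estimate (beta_1, beta_2, ...)
    intercept: Estimate  # gamma_00
    variances: dict      # e.g. {"r": sigma2_r, "u": sigma2_u}


def ci_overlap(a, b):
    """True if the two confidence intervals share at least one point."""
    return a.ci_low <= b.ci_high and b.ci_low <= a.ci_high


def appraise(registry, claims, rel_tol=0.10):
    """Return the highest level of validity reached by the claims model."""
    # Configural: the same covariates can be included in both models
    if set(registry.coefficients) != set(claims.coefficients):
        return "none"
    # Metric: same direction and overlapping confidence intervals
    for name, reg in registry.coefficients.items():
        cla = claims.coefficients[name]
        same_direction = (reg.value > 0) == (cla.value > 0)
        if not (same_direction and ci_overlap(reg, cla)):
            return "configural"
    # Scalar (assumption): intercept confidence intervals overlap
    if not ci_overlap(registry.intercept, claims.intercept):
        return "metric"
    # Strict (assumption): variances agree within a relative tolerance
    for name, reg_var in registry.variances.items():
        if abs(claims.variances[name] - reg_var) > rel_tol * reg_var:
            return "scalar"
    return "strict"
```

Partial validity, as described above, could be examined by applying the same checks to individual coefficients or model parts instead of the model as a whole.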


#

Extensions

Multilevel models are flexible tools that can be adapted to a broad range of modelling situations. If the outcome is not quantitative and/or its distribution is not normal, generalised linear mixed models can be applied. If necessary, further data levels can be added (e. g., treatment episodes nested within patients). Furthermore, complex associations can be investigated by using random slope models (where the effect of covariates may vary across hospitals), adding hospital-level covariates (e. g., number of beds) that may explain variation in the intercept and/or the slope of the outcome, and modelling (cross-level) interaction and nonlinear effects. Further extensions include using latent (not directly observed) variables and performing path analyses for investigating even more complex research questions.


#
#

Stratification-based validation

Basic principles

The central presumption of the second approach is that results of descriptive or sophisticated methods in healthcare research usually focus on subgroups with comparable co-morbidities or treatments rather than single individuals. If a specific subgroup (e. g., females ≥80 years of age undergoing open surgery) in the registry data is comparable to a corresponding subgroup in the claims data in terms of their relevant co-morbidities, the measured outcomes will likely be comparable. The stratification-based validation of registry and health insurance claims data can be interpreted as a successive approximation to an individual cross-linking. In order to respect the patients’ privacy, the method ought to ensure the principles of k-anonymity of the data (group wise linking) [17]. An underlying assumption is the existence of comparable subgroups in both data sources that can be matched in terms of relevant descriptive estimates (e. g., mean and standard deviation).


#

Principles of k-anonymity for group wise linking in small subgroups

For this study, the patients did not give their informed consent for individual cross-linking of their personal data collected by the GermanVasc registry with the corresponding health insurance claims data. In order to still be able to process the patient data, the method ought to ensure k-anonymity [17]. This notion of privacy requires that every combination of quasi-identifying attribute values in a record is shared by at least k-1 other individuals in the data set, so that each record is indistinguishable from at least k-1 others. Several approaches to achieve k-anonymity exist. For example, attributes that take different values for different patients but are not of interest for the research question at hand can be removed to eliminate quasi-identifiers from a record. A generalisation mechanism can be used to group, for example, the age of the patients into the categories ≤60, 61 to 70, and ≥71 years. It has been proven that finding an optimal k-anonymisation is nondeterministic polynomial time (NP)-hard [18] and that even an optimal k-anonymity technique cannot prevent certain attacks [19]. For example, if all records sharing the same quasi-identifier values have the same value of a sensitive attribute, then stripping the identifying information does not prevent an attacker from inferring that value for all matching records. This homogeneity attack can be countered with a technique called l-diversity. Data are said to have the l-diversity property if the sensitive attribute takes at least l distinct values within each group of records sharing the same quasi-identifier values [20]. However, not all possible values are equally frequent or equally informative (entropy). For example, if a disease occurs very often in a given set of patients, then the positive indicator carries less information than the negative indicator. To defend against an attacker exploiting such knowledge, the notion of t-closeness has been established. If the distance between the distribution of values in each group of matching records and the distribution in the whole table is no more than t, the data set is said to possess the t-closeness property [21].
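The three privacy notions described above can be checked mechanically on a small table. The following sketch is illustrative only: the records, field names, and thresholds are hypothetical, and the total variation distance is used as the distance measure for t-closeness as a simplifying assumption (other distance measures can be used).

```python
from collections import Counter, defaultdict


def generalise_age(age):
    """Group age into the categories used as an example in the text."""
    if age <= 60:
        return "<=60"
    if age <= 70:
        return "61-70"
    return ">=71"


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for rec in records:
        classes[tuple(rec[q] for q in quasi_identifiers)].append(rec)
    return classes


def is_k_anonymous(records, quasi_identifiers, k):
    """Every equivalence class contains at least k records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len(group) >= k for group in classes.values())


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Every equivalence class shows at least l distinct sensitive values."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(len({rec[sensitive] for rec in group}) >= l
               for group in classes.values())


def max_distance_to_table(records, quasi_identifiers, sensitive):
    """Largest total variation distance between a class and the whole table."""
    overall = Counter(rec[sensitive] for rec in records)
    worst = 0.0
    for group in equivalence_classes(records, quasi_identifiers).values():
        local = Counter(rec[sensitive] for rec in group)
        distance = 0.5 * sum(abs(local[v] / len(group) - overall[v] / len(records))
                             for v in overall)
        worst = max(worst, distance)
    return worst


def is_t_close(records, quasi_identifiers, sensitive, t):
    return max_distance_to_table(records, quasi_identifiers, sensitive) <= t


# Hypothetical records: (age, sex) are quasi-identifiers, diabetes is sensitive
raw = [(55, "F", "yes"), (58, "F", "no"), (65, "M", "yes"),
       (68, "M", "yes"), (75, "F", "no"), (80, "F", "yes")]
table = [{"age_group": generalise_age(age), "sex": sex, "diabetes": diabetes}
         for age, sex, diabetes in raw]
qi = ["age_group", "sex"]
```

In this made-up table every class has 2 members, so the table is 2-anonymous, but the class of men aged 61 to 70 contains only patients with diabetes. It therefore fails 2-diversity, which is exactly the homogeneity problem described above.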


#

Group wise linking and comparison for internal validity

In the prospectively collected registry data (R), we have a pre-defined set of $p$ variables that were agreed upon beforehand [10] [22], and we have $q$ variables in the health insurance claims data (C), where coded information such as International Classification of Diseases (ICD) codes can be transformed into several dummy variables (e. g., diabetes yes/no). A set of $m$ variables with $m < p, q$ is present in both data sets (intersection variables, e. g., age, gender, diabetes, death). Further variables (e. g., treatment costs) cannot be matched. Using the $m$ variables present in both data sets, we identify $n$ relevant subgroups ($n > m$) in the registry data, $R_1, R_2, \ldots, R_n$ (e. g., $R_1$: females undergoing bypass surgery in the registry data), and $n$ corresponding subgroups in the health insurance claims data, $C_1, C_2, \ldots, C_n$ (e. g., $C_1$: females undergoing bypass surgery in the BARMER cohort). Obviously, subgroups $R_i$ and $C_i$ ($i = 1, 2, \ldots, n$) are identical neither in their crude sample size nor in their composition, because not every female patient in the registry is insured by BARMER and not every vascular centre in Germany is recruiting for the registry, but we assume that $R_i$ and $C_i$ are more similar to each other than to other subgroups. To preserve k-anonymity, we link the subgroups group wise: $R_1$-$C_1$, $R_2$-$C_2$, …, $R_n$-$C_n$. We suppose that if these linked subgroups are highly similar in their co-morbidity rates as measured by Elixhauser co-morbidity groups [23] [24], their outcomes will also be more similar to each other than to those of other subgroups (variance between groups smaller than or as small as variance within groups). If we consider registry data the gold standard, increasing similarity of the linked subgroups in the 2 data sets suggests increasing internal validity.
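The group wise linking described above can be sketched as follows. Each data holder aggregates its own records into subgroups defined by the intersection variables, suppresses subgroups with fewer than k members, and only the aggregated summaries are compared. The field names, records, and the choice of a simple rate as the summary are hypothetical; this is an illustration under these assumptions, not the study's implementation.

```python
from collections import defaultdict
from statistics import fmean


def subgroup_summary(records, strata, indicator, k):
    """Aggregate a 0/1 indicator per subgroup; drop subgroups smaller than k."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[s] for s in strata)].append(rec[indicator])
    return {key: {"n": len(values), "rate": fmean(values)}
            for key, values in groups.items() if len(values) >= k}


def link_groupwise(registry_summary, claims_summary):
    """Pair R_i with C_i for every subgroup available in both summaries."""
    common = registry_summary.keys() & claims_summary.keys()
    return {key: (registry_summary[key], claims_summary[key]) for key in common}


def rate_differences(linked):
    """Claims rate minus registry rate for each linked subgroup."""
    return {key: cla["rate"] - reg["rate"] for key, (reg, cla) in linked.items()}


def make(sex, procedure, diabetes):
    return {"sex": sex, "procedure": procedure, "diabetes": diabetes}


# Hypothetical patient-level records held separately by each data source
registry = [make("F", "bypass", 1), make("F", "bypass", 0),
            make("M", "endovascular", 0), make("M", "endovascular", 0),
            make("M", "endovascular", 1), make("F", "endovascular", 1)]
claims = [make("F", "bypass", 1), make("F", "bypass", 0),
          make("F", "bypass", 0), make("F", "bypass", 0),
          make("M", "endovascular", 0), make("M", "endovascular", 1),
          make("M", "bypass", 1), make("M", "bypass", 1)]

strata = ["sex", "procedure"]
linked = link_groupwise(subgroup_summary(registry, strata, "diabetes", k=2),
                        subgroup_summary(claims, strata, "diabetes", k=2))
differences = rate_differences(linked)
```

Only subgroups that exist in both sources and reach the minimum size are linked; in this made-up example the single female endovascular patient in the registry is suppressed before any comparison takes place.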


#
#
#

Discussion

Adding to the methodological foundations of the validation of health insurance claims data may contribute to a more efficient utilisation of this valuable resource for health services research and quality improvement. In times of increased workload in medical care, physicians and nurses often decline additional documentation requirements, emphasising the need to utilise data already collected for reimbursement or management of medical care provided to the patients. If health insurance claims data contain valid information regarding co-morbidities and quality indicators, research and quality improvement registries may be run with less effort by adding complementary information from these resources instead.

Recently, 2 comprehensive reforms of the European Union regulatory framework with major impact on real-world evidence have been implemented. On the one hand, the GDPR aims to modernise data protection and privacy in times of big data techniques and significantly strengthens the informed consent of data subjects [5] [25]. On the other hand, a new Medical Device Regulation promotes the utilisation of available real-world data for market access and surveillance of medical devices, which affects a wide spectrum of multidisciplinary vascular medicine. This extensive development of Union law fuels an ongoing controversy regarding real-world data complementing evidence from randomised controlled trials (RCT) [26] [27]. Despite the potential of real-world data, important risks are associated with their use [3]. Most importantly, the value of registry-based and claims-based research and quality improvement depends on the validity of the underlying data [28]. This leads to the question of whether data from health insurance claims can be considered valid for research or quality improvement purposes. To the best of our knowledge, there is no commonly accepted international standard on how to validate these data. This is mainly due to differences in coding classifications and reimbursement systems. Furthermore, it is caused by the paucity of appropriate data sources with which to compare claims data. Even the validity of data from well-designed prospective registries remains unknown until these data have been validated themselves. The question arises what data can be considered real-world, and there is no simple answer. Several research groups, such as the working group for the collection and utilisation of secondary data (AGENS) of the German society for social medicine and prevention (DGSMP) and the German society for epidemiology (DGEPI), are currently evaluating suitable methods to validate health insurance claims data.

Although matching patient files or registry data to claims data is usually recognised as the gold standard, several aspects limit this approach. Firstly, to ensure the lawfulness of any processing of personal data in terms of record linkage, explicit informed consent by the data subject is necessary unless another legal justification exists. Obtaining consent, however, requires a great deal of effort and cost and potentially introduces risks to individual privacy [5] [17] [25]. Secondly, the reference data (e. g., registry data) must have sufficient internal and external validity to be suitable for validation purposes. In particular, incomplete and missing data remain an important problem of registries and limit their scientific value. Against this backdrop, the completeness of follow-up visits in registries remains a critical issue. To counter these challenges, data quality assurance in the GermanVasc registry is implemented by various measures including random-sample and risk-based independent site visits [11]. Closely related to the aspect of internal validation, the question arises whether registry data and health insurance claims data cover the same target population. There is probably a significant selection bias that should be examined further to weigh the value of both data sources for research and quality improvement. The items defined by the data dictionary of a registry probably differ from the ICD codes used to identify PAD patients in health insurance claims. It is well known that PAD patients differ in their risk profiles among the different stages of disease and from other patient populations. By comparing these risk profiles and clusters of comorbidities, the approach aims to examine this aspect. Thirdly, it seems impossible to examine the validity of all different aspects of health insurance claims data. These rapidly growing data sources involve not only heterogeneous data collected during hospitalisations but also data on medication prescriptions, outpatient treatments, and others. Thus, it will never be justified to claim validity of health insurance claims data in general. Validation projects can only demonstrate sufficient validity of data for a specific context including the defined target population. Lastly, this study is limited to the data from the second largest health insurance company in Germany, covering approximately 13% of Germany’s population. In Germany, approximately 73 million inhabitants (87%) are insured by 110 statutory health insurance companies and an additional 9 million inhabitants (11%) are insured by 50 private insurance companies (data for 2017/2018). However, standardization methods can help to generalize results from single insurance providers to the German target population.
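One common standardization method is direct standardization, in which stratum-specific rates observed in one insurance population are weighted by the stratum sizes of the target population. The text does not specify which standardization method is intended, so the following is a sketch of this one option only; the strata, rates, and population counts are hypothetical illustration values, not study results.

```python
def directly_standardised_rate(stratum_rates, reference_population):
    """Weight stratum-specific rates by the reference population structure."""
    missing = set(reference_population) - set(stratum_rates)
    if missing:
        raise ValueError(f"no rate available for strata: {sorted(missing)}")
    total = sum(reference_population.values())
    return sum(stratum_rates[stratum] * count / total
               for stratum, count in reference_population.items())


# Hypothetical event rates per age group observed in one insurance fund
rates_in_fund = {"<=60": 0.02, "61-70": 0.05, ">=71": 0.10}
# Hypothetical age structure of the target population (counts per age group)
target_population = {"<=60": 600, "61-70": 250, ">=71": 150}

standardised = directly_standardised_rate(rates_in_fund, target_population)
```

With these made-up numbers the standardised rate is 0.02 × 0.60 + 0.05 × 0.25 + 0.10 × 0.15 = 0.0395. In practice the strata would need to capture the characteristics in which the insured population differs from the target population.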

Although PAD is widely distributed and causes a significant burden in modern healthcare systems, approximately 50% of all recommendations in practice guidelines are based on expert consensus owing to a lack of high-quality studies. Developing data privacy compliant validation methods can help to complement the insufficient knowledge base, especially in fields of medical care where evidence from RCTs remains uncommon. There is good reason to believe that these sources of real-world data will be of increasing importance in the future, and there is already a strong interest in using health insurance claims data in the further development of pragmatic trials [2]. Cross-disciplinary consortia involving healthcare researchers, statisticians, computer scientists, and jurists can help to develop suitable and feasible methods and technical infrastructures following the privacy by design principles. The IDOMENEO approach aims to contribute to this endeavour by providing insights on how and to what extent claims data can be utilised for research and quality improvement in vascular medical care.


#

Conclusions

The utilisation of health insurance claims data in health care research and quality improvement will increase in the future, emphasising the need to validate these data. The European Union data protection regulations complicate direct cross-linking of personal data without legal justification or informed consent. The IDOMENEO study aims to prospectively collect registry and claims data and to develop methods for a data privacy compliant validation.

Ethics

The GermanVasc registry trial complies with the Helsinki Declaration 2013. The primary ethics approval was granted by the Hamburg Medical Chamber Ethics Committee (PV5691, January 2018) and the approval was confirmed by the local ethics committees. An insurance contract was concluded for the 10,000 patients included in this registry trial. All technical and conceptual measures, and the access to the anonymised BARMER health insurance claims data, are in accordance with European Union and German regulations.


#
#

Conflicts of interest

The authors declare no conflicts of interest.

