Methods Inf Med 2016; 55(03): 276-283
DOI: 10.3414/ME15-01-0152
Original Articles
Schattauer GmbH

A Simple Sampling Method for Estimating the Accuracy of Large Scale Record Linkage Projects

James H. Boyd (1), Tenniel Guiver (2), Sean M. Randall (1), Anna M. Ferrante (1), James B. Semmens (1), Phil Anderson (2), Teresa Dickinson (3)

1   Curtin University, Centre for Population Health Research, Perth, WA, Australia
2   Australian Institute of Health and Welfare, Canberra, ACT, Australia
3   Statistics New Zealand, Wellington, New Zealand

Publication History

received: 20 September 2015

accepted in revised form: 11 March 2016

Publication Date:
08 January 2018 (online)

Summary

Background: Record linkage techniques allow different data collections to be brought together to provide a wider picture of the health status of individuals. Ensuring high linkage quality is important to guarantee the quality and integrity of research. Current methods for measuring linkage quality typically focus on precision (the proportion of accepted links that are correct), because the proportion of false negatives (true matches that were missed) is difficult to measure.

Objectives: The aim of this work is to introduce and evaluate a sampling-based method for estimating both precision and recall following record linkage.

Methods: In the sampling-based method, record-pairs from each threshold (including those below the identified cut-off for acceptance) are sampled and clerically reviewed. The review results are then extrapolated to the entire set of record-pairs, providing estimates of the numbers of false positives and false negatives. This method was evaluated on a synthetically generated dataset, where the true match status (which records belonged to the same person) was known.
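
As an illustration only, the extrapolation step can be sketched in Python. The sketch assumes that record-pairs are grouped into score bands, that a simple random sample from each band is clerically reviewed, and that the sampled match rate is scaled up to the whole band. The class and function names, the cut-off and all counts are hypothetical and are not taken from the study. Recall here counts only true matches among the record-pairs that were actually compared.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    """One score band of record-pairs (hypothetical structure)."""
    lower_score: float      # lower bound of the score band
    n_pairs: int            # total record-pairs in the band
    n_sampled: int          # pairs sent to clerical review
    n_true_in_sample: int   # sampled pairs judged to be true matches


def estimate_quality(strata, cutoff):
    """Scale each band's sampled match rate up to the whole band."""
    tp = fp = fn = 0.0
    for s in strata:
        match_rate = s.n_true_in_sample / s.n_sampled
        est_matches = match_rate * s.n_pairs
        if s.lower_score >= cutoff:
            # Band accepted as links: matches are true positives,
            # the remainder are false positives.
            tp += est_matches
            fp += s.n_pairs - est_matches
        else:
            # Band rejected: any matches in it are false negatives.
            fn += est_matches
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Made-up counts for three score bands, with an assumed cut-off of 15.
strata = [
    Stratum(lower_score=20, n_pairs=1000, n_sampled=100, n_true_in_sample=100),
    Stratum(lower_score=15, n_pairs=200, n_sampled=50, n_true_in_sample=45),
    Stratum(lower_score=10, n_pairs=400, n_sampled=50, n_true_in_sample=5),
]
precision, recall = estimate_quality(strata, cutoff=15)
print(f"estimated precision={precision:.3f}, recall={recall:.3f}")
```

Sampling the bands below the cut-off is what makes an estimate of false negatives, and therefore recall, possible at all.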

Results: The sampled estimates of linkage quality were relatively close to the actual linkage quality metrics calculated for the whole synthetic dataset. The precision and recall measures for the seven reviewers were very consistent, with little variation in the clerical assessment results (overall agreement using the Fleiss' kappa statistic was 0.601).
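
For readers unfamiliar with the agreement statistic, the following is a minimal sketch of the standard Fleiss' kappa formula in plain Python. It assumes every record-pair is rated by the same number of reviewers and that ratings fall into two categories (match, non-match). The function name and the toy ratings are hypothetical; this is not the tool or the data used in the study.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    reviewers who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])       # assumed identical for every item
    n_cats = len(counts[0])

    # Share of all assignments that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement among reviewers for each item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy ratings: 4 record-pairs, 7 reviewers, columns = [match, non-match].
ratings = [[7, 0], [0, 7], [4, 3], [3, 4]]
kappa = fleiss_kappa(ratings)
print(f"kappa={kappa:.3f}")
```

A value of 1 indicates complete agreement, and a value near 0 indicates agreement no better than chance.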

Conclusions: This method offers a possible means of accurately estimating matching quality and refining linkages in population-level linkage studies. The sampling approach is especially important for large linkage projects, where the number of record-pairs produced may be very large, often running into the millions.

 