Summary
Objectives: Mathematical methods have been widely developed and applied in record linkage,
an area of interdisciplinary research. However, comparative evaluations of record linkage
methods are still underrepresented. In this paper, improvements of the Fellegi-Sunter model
are compared with other elaborate classification methods in order to direct further research
toward the most promising methodologies.
Methods: The task of linking records can be viewed as a special form of object identification.
We consider several non-stochastic methods and procedures for the record linkage task
in addition to the Fellegi-Sunter model and perform an empirical evaluation on artificial
and real data in the context of iterative insertions. This evaluation provides deeper
insight into the empirical similarities and differences between the modelling frameworks
for the record linkage problem. In addition, we evaluate how the use of string comparators
affects the performance of the different matching algorithms.
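
To illustrate how a string comparator feeds into a Fellegi-Sunter style decision, the following is a minimal sketch and not the procedure evaluated in this paper. The field names, the m- and u-probabilities, and the agreement threshold are hypothetical, and Python's difflib ratio serves only as a stand-in for a dedicated string comparator.

```python
import math
from difflib import SequenceMatcher

# Hypothetical per-field probabilities (illustrative values, not from the paper):
# m = P(field agrees | records match), u = P(field agrees | records do not match).
M_PROB = {"first_name": 0.95, "last_name": 0.97, "birth_year": 0.99}
U_PROB = {"first_name": 0.01, "last_name": 0.005, "birth_year": 0.02}


def similarity(a, b):
    """Stand-in string comparator: difflib similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def agreement_pattern(rec_a, rec_b, threshold=0.85):
    """Binary comparison pattern: a field agrees if its similarity reaches the threshold."""
    return {
        field: similarity(str(rec_a[field]), str(rec_b[field])) >= threshold
        for field in M_PROB
    }


def match_weight(pattern):
    """Sum of log2 likelihood ratios over all fields, assuming conditional independence."""
    weight = 0.0
    for field, agrees in pattern.items():
        m, u = M_PROB[field], U_PROB[field]
        weight += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return weight


rec_a = {"first_name": "Katherine", "last_name": "Meyer", "birth_year": "1970"}
rec_b = {"first_name": "Katharine", "last_name": "Meyer", "birth_year": "1970"}
rec_c = {"first_name": "John", "last_name": "Smith", "birth_year": "1985"}

weight_ab = match_weight(agreement_pattern(rec_a, rec_b))  # spelling variant, same person
weight_ac = match_weight(agreement_pattern(rec_a, rec_c))  # clearly different person
```

With exact comparison instead of the similarity threshold, the first-name field of rec_a and rec_b would count as a disagreement, which lowers the weight of a pair that most likely belongs together.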
Results: Our central results show that stochastic record linkage based on the principle of
the EM algorithm yields the best classification results when the calibration data differ
structurally from the validation data. Bagging and boosting, together with support vector
machines, are the best classification methods when calibration and validation data show no
major structural differences.
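
As a rough illustration of the EM principle mentioned above, the sketch below estimates the match proportion and the per-field agreement probabilities from unlabelled binary comparison patterns. It assumes conditional independence between fields; the function name, starting values, and toy data are hypothetical and do not reproduce the implementation evaluated in this paper.

```python
def em_fellegi_sunter(patterns, n_iter=50, p=0.1, m0=0.9, u0=0.1, eps=1e-6):
    """Estimate p = P(match), m[j] = P(agree_j | match), u[j] = P(agree_j | non-match).

    `patterns` is a list of tuples of 0/1 agreement indicators, one tuple per record pair.
    """
    n, k = len(patterns), len(patterns[0])
    m, u = [m0] * k, [u0] * k

    def clamp(x):
        # Keep probabilities strictly inside (0, 1) to avoid zero likelihoods.
        return min(max(x, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        post = []
        for gamma in patterns:
            like_m, like_u = p, 1.0 - p
            for j, agrees in enumerate(gamma):
                like_m *= m[j] if agrees else 1.0 - m[j]
                like_u *= u[j] if agrees else 1.0 - u[j]
            post.append(like_m / (like_m + like_u))

        # M-step: re-estimate the parameters from the weighted agreement counts.
        total = sum(post)
        for j in range(k):
            m[j] = clamp(sum(g * gam[j] for g, gam in zip(post, patterns)) / total)
            u[j] = clamp(sum((1.0 - g) * gam[j] for g, gam in zip(post, patterns)) / (n - total))
        p = clamp(total / n)
    return p, m, u


# Toy comparison patterns (made up for illustration): mostly full agreement or full disagreement.
toy_patterns = (
    [(1, 1, 1)] * 20 + [(1, 1, 0)] * 5 + [(0, 0, 1)] * 10 + [(0, 0, 0)] * 80
)
p_hat, m_hat, u_hat = em_fellegi_sunter(toy_patterns)
```

Because no labelled calibration data enter the estimation, the parameters are fitted to the data being linked, which is one plausible reading of why this approach is less sensitive to structural differences between calibration and validation data.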
Conclusions: The most promising methodologies for record linkage in environments similar to the
one considered in this paper seem to be stochastic ones.
Keywords
Record linkage - object identification - decision trees - support vector machines
- EM algorithm