The digitization of health records is making the aggregation of EHRs, prescription claims information, and other clinical data easier than ever before. But before analyzing clinical data collected from different databases, researchers need to first determine where the databases overlap. Doing so has two major purposes:

  1. To identify duplicate information: For example, if John Smith’s medical history was collected at two hospitals where he sought care, treating the two records as belonging to separate people over-weights John Smith’s medical history in the combined data set.
  2. To piece together a fuller picture of a research subject: For example, if John Smith reported side-effects of a medication at two hospitals six months apart, identifying that both sets of data are for the same person can provide insight into the type, severity, and duration of the side-effects. In other words, determining the link between the data points provides a richer picture of John Smith’s experience.

Given the importance of understanding how different records are connected, two classes of record linkage approaches have arisen: deterministic matching and probabilistic matching.

Deterministic matching is straightforward: rules are coded that flag two records as linked if certain discrete fields match exactly. For example, a simple rule might identify two sets of patient EHRs as duplicates if they share the same surname, date of birth, and home address. Although intuitive, this approach can lead to false positives if the rules are too broad and false negatives if they are too narrow. False positives occur when two distinct patients happen to share identifying information, such as a common name, zip code, and birthdate. False negatives can be frequent given the transience of certain fields (e.g., home address), data-entry errors like misspellings, and missing values. Generally, deterministic matching should be limited to scenarios where the matching must be done quickly, the data sets are small, and matching errors are of little consequence (e.g., a ‘first pass’ at matching).
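To make this concrete, here is a minimal Python sketch of such a rule. The field names, sample records, and normalization steps are illustrative assumptions, not a prescribed schema; a production implementation would also standardize dates and addresses before comparing.

```python
# Hypothetical deterministic rule: flag two patient records as linked
# when surname, date of birth, and home address all match exactly
# (after trivial normalization of case and surrounding whitespace).

def is_deterministic_match(rec_a: dict, rec_b: dict) -> bool:
    fields = ("surname", "date_of_birth", "home_address")
    return all(
        rec_a[f].strip().lower() == rec_b[f].strip().lower()
        for f in fields
    )

rec_1 = {"surname": "Smith", "date_of_birth": "1980-04-02",
         "home_address": "12 Elm St"}
rec_2 = {"surname": "smith", "date_of_birth": "1980-04-02",
         "home_address": "12 Elm St "}

print(is_deterministic_match(rec_1, rec_2))  # True
```

Note how brittle the rule is: a single misspelled surname or an out-of-date address makes it fail, which is exactly the false-negative problem described above.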

Probabilistic matching is an alternative, statistics-based approach. Probabilistic matching algorithms look at many fields, assign each data value a weight based on how distinctive it is, and use these weights in aggregate to quantify the overall likelihood that two records are linked. A researcher can then tune the likelihood threshold above which two records are declared a link. The data-driven nature of this approach makes it both adaptable and robust. For example, an algorithm might weigh the surname “Rodriguez” differently when identifying links in data collected in Southern California versus Minnesota, because a surname that is common in a region is less distinctive there. Similarly, these algorithms can better manage errors, recognizing John Smith and John Smithh (a typo) as the same person, for example. Overall, probabilistic matching is generally more accurate than deterministic matching.
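The standard formalization of this idea is the Fellegi-Sunter model: each field contributes a log-likelihood weight of log2(m/u) when it agrees (and log2((1−m)/(1−u)) when it disagrees), where m is the probability the field agrees for true matches and u is the probability it agrees by chance among non-matches. The self-contained Python sketch below illustrates the arithmetic; the m/u values, field names, and threshold are assumptions chosen for illustration, not estimates from real data.

```python
import math

# Hypothetical m/u probabilities per field (assumed for illustration):
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are NOT a match)
FIELD_PROBS = {
    "surname":       {"m": 0.95, "u": 0.01},
    "date_of_birth": {"m": 0.97, "u": 0.003},
    "zip_code":      {"m": 0.90, "u": 0.05},
}

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Sum Fellegi-Sunter log-likelihood weights across fields."""
    score = 0.0
    for field, p in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            score += math.log2(p["m"] / p["u"])              # agreement weight
        else:
            score += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight
    return score

rec_1 = {"surname": "Smith", "date_of_birth": "1980-04-02", "zip_code": "06511"}
rec_2 = {"surname": "Smith", "date_of_birth": "1980-04-02", "zip_code": "06510"}

THRESHOLD = 8.0  # tunable: scores above this are declared links
score = match_score(rec_1, rec_2)
print(f"score = {score:.2f}, linked = {score > THRESHOLD}")
```

In practice, m and u are typically estimated from the data itself (e.g., with an EM algorithm), which is what lets the same algorithm down-weight a surname like “Rodriguez” where it is common and up-weight it where it is rare.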

Given their utility, probabilistic matching algorithms are being used more frequently. One such algorithm, for example, was used to examine records from outpatient clinics, antenatal care facilities, and a medical unit’s delivery ward in Senegal. From hundreds of disparate records, 75 patients were identified who received an antimalarial medication in early pregnancy, and these records were used to assess the medication’s effect on congenital malformations. Another study linked EHRs for almost 2,000 patients with breast cancer to corresponding data from a tumor registry (part of the California Cancer Registry). A third study used a probabilistic matching algorithm to link pregnant patients’ EHRs to those of their children post-delivery.

Altogether, collaborations and medical registries that seek to merge clinical data should actively employ probabilistic matching methodologies to maintain rich, high-integrity data sets. To do so, researchers and steering committees can take advantage of publicly available resources like probabilistic matching software (e.g., the RecordLinkage library in R, routines in SAS) and the National Library of Medicine’s extensive implementation guidelines. Ultimately, these resources can be combined with flexible, analyst-configurable data management platforms like RexDB that offer a centralized pipeline between clinical databases and high-quality, research-friendly data sets.
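For teams working in Python rather than R or SAS, the open-source recordlinkage package provides analogous building blocks. The sketch below, using small hypothetical tables and column names, shows the typical blocking, comparison, and classification flow; treat it as an orientation to one possible tool rather than a vetted pipeline.

```python
import pandas as pd
import recordlinkage  # pip install recordlinkage

# Two small, hypothetical patient tables (fields and values are illustrative).
df_a = pd.DataFrame({
    "surname": ["Smith", "Jones"],
    "dob": ["1980-04-02", "1975-11-30"],
    "zip": ["06511", "06510"],
})
df_b = pd.DataFrame({
    "surname": ["Smithh", "Jones"],  # note the typo in "Smithh"
    "dob": ["1980-04-02", "1975-11-30"],
    "zip": ["06511", "90210"],
})

# Block on date of birth so only plausible pairs are compared.
indexer = recordlinkage.Index()
indexer.block("dob")
pairs = indexer.index(df_a, df_b)

# Compare fields; fuzzy string comparison tolerates typos like "Smithh".
compare = recordlinkage.Compare()
compare.exact("dob", "dob", label="dob")
compare.string("surname", "surname", method="jarowinkler",
               threshold=0.85, label="surname")
compare.exact("zip", "zip", label="zip")
features = compare.compute(pairs, df_a, df_b)

# Simple unweighted rule: declare a link when at least 2 of 3 fields agree.
links = features[features.sum(axis=1) >= 2]
print(links)
```

Blocking on a stable field keeps the number of comparisons manageable on large data sets, and the two-of-three classification rule here is deliberately simple; the package also includes statistical classifiers (e.g., an unsupervised ECM classifier) that learn field weights in the Fellegi-Sunter spirit.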

For more information on how RexDB by Prometheus Research was used as an end-to-end pipeline for data aggregation and transformation, see our poster presentation from the 2014 American Medical Informatics Association (AMIA) conference.