by Frank Farach, Staff Scientist

Suppose you and I are clinical researchers who both work in the Boston area, and we want to collaborate on a project. We’re interested in finding genetic markers of treatment response and non-response among patients with heart disease. We need a large sample size to answer this question with acceptable precision. I have treatment outcome data from several clinical trials for heart disease using a variety of treatments. You have access to a large database of genetic data on patients with heart disease. Our plan is to find matching individuals between our two data sources, correlate your genetic data with my treatment outcome data, and publish a paper together on the findings.

We have a lot of work ahead of us before we can share our data. Both of our datasets are highly sensitive and include protected health information (PHI). We’ll need IRB approval from our respective institutions. I’ll probably have to get approval from the sponsors of the clinical trials involved on my end. There will be a careful review of whether the patients in our respective studies consented to have their data used this way. Since we work for different institutions, there will be a lot of administrative paperwork involved. We’ll both need to justify, on scientific grounds, why we need the other’s data, exactly what we plan to do with it, and how we will safeguard it.

However, all this is putting the cart before the horse. Before we go through all of the trouble to get approval, we need to know how many patients our datasets have in common. We hope it’s a large number (recall that we need a large sample size), but all we have at this point is an assumption of overlap based on the fact that your data and my data were both collected in Boston. The higher the overlap, the more worthwhile it will be to go through the trouble of getting approval to share our data. If the overlap is small, we won’t be able to answer our research question, in which case a data sharing agreement is not worth the effort (and may even be considered unethical).

So, we need to find a method for estimating the overlap between our datasets before we apply for approval to share our data. The most direct method of computing overlap would be to count the number of matching private IDs across datasets, but that would constitute sharing identifiable information, which is what we’re seeking approval for in the first place! Even if we had a way to get this estimate without sharing the data, we’d want to be sure that the method was, at least, informative and stable (i.e., accurate regardless of sample size), and we’d want to be able to quantify and adjust the risk of patient re-identification. It would also be nice to handle this without entrusting data to a third party, a so-called “honest broker.”
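To make the "most direct method" concrete, here is a minimal Python sketch of what counting matching IDs would look like. The function name and the patient IDs are made up for illustration. Note that it only works if both ID lists sit in the same place, which is exactly the sharing we don’t yet have approval for, so this is the approach we are trying to avoid, not the new method.

```python
def count_overlap(ids_a, ids_b):
    """Count the distinct IDs that appear in both collections."""
    return len(set(ids_a) & set(ids_b))


# Hypothetical, made-up identifiers for illustration only.
my_trial_ids = ["P001", "P002", "P003", "P004"]
your_genetic_ids = ["P003", "P004", "P005"]

# Both lists must be visible to whoever runs this line.
overlap = count_overlap(my_trial_ids, your_genetic_ids)
print(overlap)
```

The arithmetic is trivial; the hard part is getting the same count when neither of us can see the other’s IDs.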

As it turns out, previous methods for securely computing or estimating patient overlap don’t satisfy all of these requirements. However, as attendees learned earlier this month at the annual meeting of the American Medical Informatics Association (AMIA) – Clinical Research Informatics, there is a new method that does. This method was developed by Dr. S. Josh Swamidass of Washington University in St. Louis, with assistance from Dr. Leon Rozenblit of Prometheus Research.