Monday, April 6, 2009

De-duplication

I've been thinking about the "De-duplication" grid service a bit lately and want to get my thoughts out here in case someone else has been mulling it over as well.

De-duplication is related to the De-identification/Re-identification problem of anonymized data across multiple data sources (interesting MIT paper here). In a grid environment, though, the problem gets harder: the traditional, already-difficult approach of checking the various MPIs (master patient indexes) breaks down because data sources 1) frequently don't have MPIs and 2) don't want to share them if they do.

Here's the scenario I use to talk about the de-duplication problem set:
Three hospitals in Atlanta have overlapping catchment areas. A federated query is sent out for SyndromeX for April 1, 2009. Hospital A returns a count of 10, Hospital B returns a count of 15, and Hospital C returns a count of 20. The combined result from the federated query shows a total of 45 counts of SyndromeX for April 1, 2009.

This raises a data quality question: it's possible that the same patient visited multiple hospitals and had his SyndromeX recorded at multiple locations. In a federated/grid configuration this is even more likely, since hospitals may use data from multiple feeder sources (labs, small clinics, etc.) that may report the same patient encounter multiple times.

Dr. Lenert had a good idea: each hospital could send over a list of PII (name, address, DOB, SSN) without the medical information, strictly for purposes of de-duping. This requires a trusted intermediary to receive these lists, pass them on to each hospital, and collect back the number of matches each reports. A hospital cannot send its list directly to another hospital because doing so would reveal its patients to potential competitors.

So the de-duping scenario may change to something like this:
1) User submits query for SyndromeX on April 1, 2009 and selects de-duplication.
2) Trusted intermediary receives the query and passes it on to Hospitals A,B,C.
3) Hospital A returns count of 10, Hospital B returns count of 15, Hospital C returns count of 20. The interim total count of 45 is not returned to the user.
4) Trusted intermediary then requests patient id info from each hospital in a separate request.
5) Trusted intermediary compares the PII from each hospital (perhaps using a tool like febrl) and determines the patient overlap between sites: Hospital A shares 1 patient with Hospital B, Hospital A shares 2 patients with Hospital C, and Hospital B shares 3 patients with Hospital C. Therefore there are 6 duplicate records (a sketch follows below).
6) Total count of 39 is returned to the user (45-1-2-3).
Of course, this is going to require some pretty serious data use agreements between the hospitals and the trusted intermediary to control the storage of PII (i.e., don't persist any PII).
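
To make the intermediary's matching step concrete, here's a minimal Python sketch of steps 5 and 6. The record format, the normalize() helper, and the exact-match comparison are stand-ins I'm assuming purely for illustration; real matching would be probabilistic (that's where febrl comes in), and per the data use agreement nothing would be persisted.

# Minimal sketch of the intermediary's de-dup step (assumed record format,
# naive exact matching for illustration only).
from itertools import combinations

def normalize(record):
    # Reduce a PII record to a comparable key; a real implementation would
    # use probabilistic record linkage instead of exact matching.
    return (record["name"].strip().lower(), record["dob"])

def deduplicated_total(counts, pii_lists):
    # counts: {hospital: preliminary count}; pii_lists: {hospital: [PII records]}
    keys = {h: {normalize(r) for r in records} for h, records in pii_lists.items()}
    total = sum(counts.values())
    # Subtract each pairwise overlap (assumes no patient visited all three hospitals).
    for a, b in combinations(keys, 2):
        total -= len(keys[a] & keys[b])
    return total

# With counts A=10, B=15, C=20 and overlaps A/B=1, A/C=2, B/C=3,
# this returns 45 - 1 - 2 - 3 = 39.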

The alternative without the trusted intermediary:
1) User submits query for SyndromeX on April 1, 2009 and selects de-duplication.
2) Each hospital receives the query and the list of nodes to which it was submitted.
3) Hospital A identifies a preliminary count of 10. To check for duplicates, it submits only the PII (no condition data) plus a query ID to Hospitals B and C for comparison.
4) Hospital B compares the PII to its own results for query ID and returns the number of matches to Hospital A (duplicate count=1).
5) Hospital C compares the PII to its own results for query ID and returns the number of matches to Hospital A (duplicate count=2).
6) At the same time, Hospitals B&C are performing similar communications with the other nodes participating in the query.
7) Hospital B identifies a preliminary count of 15. To check for duplicates, it submits only the PII (no condition data) plus a query ID to Hospitals A and C for comparison.
8) Hospital A compares the PII to its own results for query ID and returns the number of matches to Hospital B (duplicate count=1).
9) Hospital C compares the PII to its own results for query ID and returns the number of matches to Hospital B (duplicate count=3).
10) Hospital C identifies a preliminary count of 20. To check for duplicates, it submits only the PII (no condition data) plus a query ID to Hospitals A and B for comparison.
11) Hospital A compares the PII to its own results for query ID and returns the number of matches to Hospital C (duplicate count=2).
12) Hospital B compares the PII to its own results for query ID and returns the number of matches to Hospital C (duplicate count=3).
13) Hospital A returns a count of 10 and notes that duplicates with B=1 and duplicates with C=2.
14) Hospital B returns a count of 15 and notes that duplicates with A=1 and duplicates with C=3.
15) Hospital C returns a count of 20 and notes that duplicates with A=2 and duplicates with B=3.
16) The user then feeds these results into another service that reconciles them and determines that the actual count is 39 based on the reported pairwise overlaps: duplicates between A and B = 1, between A and C = 2, between B and C = 3. These duplicates are subtracted from the preliminary total of 45 to yield 39 (see the sketch below).
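
Here's a minimal Python sketch of what that reconciliation service might look like. The input format is something I'm assuming for illustration: each node's preliminary count plus the pairwise duplicate reports (each pair reports the same overlap from both sides, so it gets subtracted only once). Like the first sketch, it assumes no patient shows up at all three hospitals; handling that case would need full inclusion-exclusion.

def reconcile(preliminary, duplicates):
    # preliminary: {node: count}; duplicates: {(node_a, node_b): overlap count}
    total = sum(preliminary.values())
    seen = set()
    for (a, b), overlap in duplicates.items():
        pair = frozenset((a, b))
        if pair not in seen:  # each pair reports twice; subtract its overlap once
            seen.add(pair)
            total -= overlap
    return total

# Scenario numbers from above: preliminary counts A=10, B=15, C=20.
print(reconcile(
    {"A": 10, "B": 15, "C": 20},
    {("A", "B"): 1, ("B", "A"): 1,
     ("A", "C"): 2, ("C", "A"): 2,
     ("B", "C"): 3, ("C", "B"): 3},
))  # prints 39 (45 - 1 - 2 - 3)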

Approach #2, while not requiring a trusted intermediary, requires a lot of coordination among the queried nodes and also involves sharing patient info with other partners, which may not be desired. Also, the number of service calls grows quadratically with the number of nodes in the query: n calls from the user to the nodes plus n(n-1) calls among the nodes (the permutations n!/(n-r)! with r=2), for n^2 total. For 3 nodes that's 9 service calls, for 4 nodes it's 16, and for 5 nodes it's 25. This is still back-of-the-napkin math, so double-check it.
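
A quick check of that growth, using the same assumed call pattern (n user-to-node calls plus n(n-1) node-to-node de-dup calls):

for n in range(3, 8):
    calls = n + n * (n - 1)  # simplifies to n**2
    print(n, "nodes ->", calls, "service calls")  # 3 -> 9, 4 -> 16, 5 -> 25, ...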

Anyway, I wanted to get these two approaches to de-duping out there and will update with more as I keep thinking.

1 comment:

Jim Tobias said...

Hi Brian,
The febrl stuff is difficult to get going because it requires many additional Python modules, such as NumPy, Numeric, and others.

The documentation is not great as far as the order of installing these modules, and it is also unclear exactly which versions of these modules work together.

Python is also making major changes in 3.0 versus 2.x, and I'm not sure how febrl will "roll with the changes"...

Python is really cool in some ways, but a lot of people are resisting the move to 3.0 because they fear that their 2.6 scripts will not work as modules are deprecated and such.

I have no time to work on these things in my new hectic life but just wanted to send some feedback to you on febrl and some issues....

Jim