Tuesday, January 6, 2009

Febrl - An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface

Given the vast amounts of data to be integrated, one of the key services that will eventually have to be provided on the grid is data cleaning, deduplication and record linkage. This domain has been a significant challenge for years, even off the grid, and it remains an important and exciting area of research.

I found out about Febrl yesterday. It has been developed since 2002 as part of a collaborative research project conducted between the Australian National University in Canberra and the New South Wales Department of Health in Sydney, Australia.

The group is working on Parallel Large Scale Techniques for High-Performance Record Linkage.
Link to their efforts: http://datamining.anu.edu.au/linkage.html

The NCPHI Grid team is going to investigate this and find out which aspects of Febrl, if any, might be leveraged to help provide this service on the PHGrid.

Note: It requires Python to be installed.
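
For anyone checking a Python setup before trying the GUI, here is a minimal sketch of how one might confirm that the required modules are importable. It is written for a current Python 3 interpreter using only the standard library, not for the Python 2 releases Febrl targeted at the time, and the helper name `missing_modules` is my own and not part of Febrl.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    missing = []
    for name in names:
        # find_spec returns None when no importable module of that name exists.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Module names here are only an example; substitute the ones your install needs.
    print(missing_modules(["json", "numpy"]))
```

Calling `missing_modules(["numpy", "gtk", "pygtk"])` would list whichever of those the interpreter cannot see, which is a quick way to tell a missing install from a path problem.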

Here are two papers on Febrl.

Here is the SourceForge link to Febrl.

Features include:

* Probabilistic and rules-based cleaning and standardisation routines for names, addresses, dates and telephone numbers.

* A geocoding matching system based on the Australian G-NAF (Geocoded National Address File) database.

* A variety of supplied look-up and frequency tables for names and addresses.

* Various comparison functions for names, addresses, dates and localities, including approximate string comparisons, phonetic encodings, geographical distance comparisons, and time and age comparisons. Two new approximate string comparison methods (bag distance and compression based) have been added in this release.

* Several blocking (indexing) methods, including the traditional compound key blocking used in many record linkage programs.

* Probabilistic record linkage routines based on the classical Fellegi and Sunter approach, as well as a 'flexible classifier' that allows a flexible definition of the weight calculation.

* Process indicators that give estimations of remaining processing times.

* Access methods for fixed format and comma-separated value (CSV) text files, as well as SQL databases (MySQL and new PostgreSQL).

* Efficient temporary direct random access data set based on the Berkeley database library.

* Possibility to save linkage and deduplication results into a comma-separated value (CSV) text file (new).

* One-to-one assignment procedure for linked record pairs based on the 'Auction' algorithm.

* Supports parallelism for higher performance on parallel platforms, based on MPI (Message Passing Interface), a standard for parallel programming, and Pypar, an efficient and easy-to-use module that allows Python programs to run in parallel on multiple processors and communicate using MPI.

* A data set generator which allows the creation of data sets of randomly generated records (containing names, addresses, dates, and phone and identifier numbers), with the possibility to include duplicate records with randomly introduced modifications. This allows for easy testing and evaluation of linkage (deduplication) processes.

* Example project modules and example data sets allowing simple running of Febrl projects without any modifications needed.
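
To make two of the features above more concrete, here is a rough sketch of a bag distance string comparison and of compound key blocking. This is not Febrl's code: the function names, the record fields (`surname`, `postcode`), the key recipe (first three letters of the surname plus the postcode) and the normalisation to a 0..1 similarity are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def bag_distance(a, b):
    """Bag distance: compare the two strings as multisets of characters."""
    bag_a, bag_b = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts (multiset difference).
    only_a = sum((bag_a - bag_b).values())
    only_b = sum((bag_b - bag_a).values())
    return max(only_a, only_b)


def bag_similarity(a, b):
    """Assumed normalisation: 1.0 for identical bags, 0.0 for no shared characters."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - bag_distance(a, b) / longest


def blocking_key(record):
    """Hypothetical compound key: surname prefix joined with the postcode."""
    return record["surname"][:3].lower() + "|" + record["postcode"]


def candidate_pairs(records):
    """Group records by blocking key and only pair up records within a block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return sorted(pairs)


# Made-up sample records, not taken from any Febrl data set.
RECORDS = [
    {"surname": "Smith", "postcode": "2600"},
    {"surname": "Smithe", "postcode": "2600"},
    {"surname": "Jones", "postcode": "2600"},
    {"surname": "Smith", "postcode": "2000"},
]

if __name__ == "__main__":
    print(bag_distance("peter", "pedro"))
    print(candidate_pairs(RECORDS))
```

The point of blocking is that only records sharing a key are compared in detail, so the four sample records produce a single candidate pair instead of six. The trade-off is that an error in a field used for the key can hide a true match.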

An Evaluation of Febrl

Conclusion of Evaluation:

Febrl is an excellent research tool to study the deduplication process. It provides refinements in record standardization, geocoding and field matching. It provides flexibility in how data is selected for scoring, how record fields are compared, and how the ultimate score is calculated. Febrl is also useful as an outboard record standardizer, of particular interest where names, addresses and phone numbers are not already separated into component parts.
It became clear, as we understood the inherent flexibility of Febrl, that with sufficient effort we could have used our long experience in deduplication to improve our results. It's all in the configuration.
However, we couldn't think of any way to deal effectively with twins. Probabilistic record linking schemes that rely solely on aggregating scores of individual field comparisons do not perform well on data with twins. Febrl couldn't distinguish between a true match with data errors and a twin. It didn't "understand" data in context.
Febrl places considerable resources at the researcher's command. It provides a deduplication laboratory allowing the experimenter lots of knobs to turn in order to arrive at a deduplication solution. But a laboratory is not a factory. We found ourselves "trying just one more thing" over and over again in the interests of fairness for this comparison with no way to decide when enough is enough. With endless opportunities for tinkering, adjustment and trying "just one more thing", it's an hourly consultant's paradise. For this reason, look very carefully at any consultant's recommendation to use Febrl.
While Febrl is free open source software, we believe that the actual cost of moving it from the laboratory into production could be significant. Its inability to deal with twins is a fatal flaw.
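
The twins problem described in the evaluation can be illustrated with a small sketch of Fellegi and Sunter style scoring, where each field contributes an agreement or disagreement weight and the weights are summed. This is not Febrl's implementation: the m and u probabilities, the field names and the sample records are all made up, and fields are compared by exact equality only.

```python
import math

# Hypothetical (m, u) probabilities per field:
# m = P(fields agree | records match), u = P(fields agree | records do not match).
FIELD_PROBS = {
    "first_name": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "date_of_birth": (0.98, 0.001),
    "address": (0.90, 0.001),
}


def field_weight(agree, m, u):
    """Log-likelihood ratio weight for one field comparison."""
    if agree:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(rec_a, rec_b):
    """Sum the per-field weights; each field is scored without any wider context."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        total += field_weight(rec_a[field] == rec_b[field], m, u)
    return total


# Made-up records for illustration only.
person = {
    "first_name": "Anna",
    "surname": "Nguyen",
    "date_of_birth": "2001-03-14",
    "address": "12 Example St",
}
typo = dict(person, first_name="Ana")  # same person, data-entry error
twin = dict(person, first_name="Amy")  # different person, same household and birthday

if __name__ == "__main__":
    print(match_weight(person, typo), match_weight(person, twin))
```

With exact comparisons, the record with the typo and the twin receive the same total weight, because in both cases only the first name disagrees. An approximate string comparison would separate "Ana" from "Amy" somewhat, but the score is still an aggregate of independent field comparisons, which is the limitation the evaluation points to.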

1 comment:

Jim Tobias said...

I tried to install and use FEBRL and ran into some difficulties.
FEBRL requires various Python modules, including NumPy, GTK, PyGTK and others. I followed the instructions and installed the modules with Python 2.5, but the guiFEBRL still did not find some of the installed modules.
I think that this could be a path issue, but I would suggest that this is not a 5-minute install.
Jim Tobias