Thursday, January 29, 2009

DRN upcoming presentations

Upcoming DRN presentations are as follows:

Demonstrate to DRN group ("internal presentation"): Thurs., 2/5 @ 4:30PM ET

Demonstrate to large group ("external presentation"): Thurs., 2/19 @ 2:30PM ET

WebEx Demo

Wednesday, January 28, 2009

Content for a Health Channel

Hello all,
This morning, the concept of the Health Channel was mentioned, and I recalled an article from Dr. Linda Pickle (National Cancer Institute) regarding the notion of creating weather maps of disease.

The article is here: http://www.ij-healthgeographics.com/content/5/1/49

Abstract
Background

To communicate population-based cancer statistics, cancer researchers have a long tradition of presenting data in a spatial representation, or map. Historically, health data were presented in printed atlases in which the map producer selected the content and format. The availability of geographic information systems (GIS) with comprehensive mapping and spatial analysis capability for desktop and Internet mapping has greatly expanded the number of producers and consumers of health maps, including policymakers and the public.

Because health maps, particularly ones that show elevated cancer rates, historically have raised public concerns, it is essential that these maps be designed to be accurate, clear, and interpretable for the broad range of users who may view them. This article focuses on designing maps to communicate effectively. It is based on years of research into the use of health maps for communicating among public health researchers.
Results

The basics for designing maps that communicate effectively are similar to the basics for any mode of communication. Tasks include deciding on the purpose, knowing the audience and its characteristics, choosing a media suitable for both the purpose and the audience, and finally testing the map design to ensure that it suits the purpose with the intended audience, and communicates accurately and effectively. Special considerations for health maps include ensuring confidentiality and reflecting the uncertainty of small area statistics. Statistical maps need to be based on sound practices and principles developed by the statistical and cartographic communities.
Conclusion

The biggest challenge is to ensure that maps of health statistics inform without misinforming. Advances in the sciences of cartography, statistics, and visualization of spatial data are constantly expanding the toolkit available to mapmakers to meet this challenge. Asking potential users to answer questions or to talk about what they see is still the best way to evaluate the effectiveness of a specific map design.

Tuesday, January 27, 2009

Dimdim - the leading provider of open source web meeting software

Dec 3, 2008 - Dimdim, the easy-to-use web conferencing tool that delivers live presentations, whiteboards, voice and video, exited its beta period today. With the exit, the service has also added features like co-browsing and its new SynchroLive Communication Platform, which automatically scales performance. The feature you might be most excited about, though, is Dimdim's decision to release its source code.

Thought you all might find this interesting.

CommVault adds deduplication to data management software

http://cwflyris.computerworld.com/t/4260806/6716305/164910/0/

Monday, January 26, 2009

PoiConDai II, revenge of PoiConDai

So, PoiConDai II (which I think is going to be called "PoiCenDai," because they are just called poison centers, not poison _control_ centers, anymore) is in the works.

The new service that will be feeding PoiCenDai is going to be giving back a lot more complex information, namely count aggregations based on region and time. Dr. Jeremy Espino has recommended Apache CXF as the service handler (basically the next generation of Apache Axis, which should let the new PoiCenDai work with Globus that much more seamlessly should we need to distribute its operations on the grid) and JAXB as the object marshaller (at least I am anticipating JAXB, because the last poison center web service returned the data as an XML "any" stream with a "schema" stream).
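If CXF and JAXB do end up being the combination, I imagine the count aggregations will come back marshalled from plain annotated classes rather than a raw "any" stream. Here is a minimal sketch of what such a class might look like (the class and field names are placeholders I made up, not the actual NPDS/PoiCenDai schema):

import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// Hypothetical JAXB-annotated count aggregation; names are illustrative only.
@XmlRootElement(name = "countAggregation")
@XmlAccessorType(XmlAccessType.FIELD)
public class CountAggregation {

    @XmlElement(required = true)
    private String regionCode;  // e.g. a state abbreviation or zip3

    @XmlElement(required = true)
    private String startDate;   // start of the aggregation window (ISO date)

    @XmlElement(required = true)
    private String endDate;     // end of the aggregation window (ISO date)

    @XmlElement
    private int count;          // number of cases for this region and window

    // getters and setters omitted
}

The appeal of this approach is that CXF should be able to build the WSDL types straight from the JAXB annotations, so the client ends up holding real objects instead of picking apart an "any" stream by hand.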

Thus, the PoiCenDai service is going to need to take a command, determine the output type, marshal it all out, and then take the results and filter them into the appropriate regional polygons which were also drummed up with the help of the Gmap Polygon libraries.

So it's becoming one of those little moments where it's awkward figuring out where to start: whether I delve in with the new technologies for objectifying the PoiCen data, or start planning ways to keep the methods for stuffing PoiCenDai data into polygons simple and relatively agnostic to JSPs.

I think I am going to focus on making the stubs for PoiCenDai closer to the JSP side for a couple of reasons. First, I have just been dealing with the JSP end of things, so it will be an easy bridge. Second, I am not sure of the service endpoints and will need to fiddle with JAXB and CXF, so it would be good to know where the resulting objects are going to go.

Of course this is all liable to change as assumptions will probably be made and then be proven wrong or not feasible. Another wacky day in coding big enterprise type things.

Friday, January 23, 2009

New Grid Node

I worked with Joe Terdiman on getting his DRN grid node installed. Globus has been installed and configured to the point where the container comes up with no errors. Monday at 12 EST we will install and test the SST service.

Thursday, January 22, 2009

Gmap-Polygon now demonstrable

Greetings, if you check out the new service registry entry you will find a description of the gmap-polygon project I have been working on and a link to a demo.

Now, I will work on taking the revamped NPDS service and getting it to populate the polygons so that little line graphs and colorations occur instead of just "0 cases" showing up all the time.

The other idea is that anyone can use this project if they need a Google Maps polygon mapper, especially if they need it to work in a secure environment and don't think they would be able to publicly expose a KML file.

Either way, I am happy with the results, I think it is nifty, and the more you poke things the more they cache and the faster they should (hopefully) load.

Cheers!

AMDS draft structure changing

Recently, Dr. Savel presented some ideas on AMDS to the CDC HIEs on a call set up to facilitate collaboration between the PHGrid team's aggregate data efforts and the emerging HIE group's data aggregation efforts.

The PowerPoint is available on the HIE/AMDS wiki and outlines the new data structure for AMDS. It's a simplification/consolidation of what Felicia is currently working on, in that there is a single operation that allows a query by syndrome for a group of zip codes. It also expands the specification by including specific fields for cell suppression rules.
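To make the "single operation" idea concrete, the kind of signature I picture is something like the sketch below. It is purely illustrative (the method and type names are mine, not the draft's), but it captures the query-by-syndrome-for-a-group-of-zip-codes shape:

import java.util.Date;
import java.util.List;

// Illustrative only -- not the actual AMDS draft interface.
public interface AmdsQueryService {

    // Query aggregate counts for one syndrome across a group of zip codes.
    // Providers are expected to apply their cell suppression rules before returning results.
    List<AggregateCount> getCounts(String syndrome,
                                   List<String> zipCodes,
                                   Date startDate,
                                   Date endDate);

    // Placeholder result type for the sketch.
    class AggregateCount {
        public String zipCode;
        public Date date;
        public int count;
        public boolean suppressed; // true when the cell fell below the suppression threshold
    }
}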

So this means that we're closer to a working draft, but all the example services we have up on SourceForge and in the service registry need updating.

Wednesday, January 21, 2009

Centroids.

I have the centroids loaded, I have the map moving and drawing and zooming based on what regions were picked or changed, and I am planning to move the examples of Gmap-Polygon over to the staging node tomorrow afternoon.

This also means I will be ramping up on the NPDS changes, and will be focusing on getting time series populated into the Gmap-Polygon, and then we will have NPDS II (or PoiConDai II, if you will).

Cheers!

Tuesday, January 20, 2009

Polygon Centroid Updates.

So, I have gotten the table and the logic set up for centroids so that the map will automagically zoom in on selected states and zip3s, but I need to tweak a few things and ran out of time today. I am hoping to get things set tomorrow morning, move some of the Java-in-JSP code I have written into the actual helper Java bean, and then plan out a migration of the sample to the staging node for a Thursday demo (so other people can tinker with the Gmap-Polygon drawer).
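For the curious, the centroid logic is little more than a lookup table keyed by region code; the helper bean I am moving the JSP code into amounts to something like this sketch (the table and column names here are made up for illustration, not the actual schema):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Rough sketch of a centroid lookup helper; the real bean and schema differ.
public class CentroidHelper {

    private final Connection conn;

    public CentroidHelper(Connection conn) {
        this.conn = conn;
    }

    // Returns {latitude, longitude} for a state or zip3 code, or null if unknown.
    public double[] findCentroid(String regionCode) throws SQLException {
        String sql = "SELECT latitude, longitude FROM region_centroid WHERE region_code = ?";
        PreparedStatement ps = conn.prepareStatement(sql);
        try {
            ps.setString(1, regionCode);
            ResultSet rs = ps.executeQuery();
            if (rs.next()) {
                return new double[] { rs.getDouble("latitude"), rs.getDouble("longitude") };
            }
            return null;
        } finally {
            ps.close();
        }
    }
}

The JSP side then just hands the returned coordinates to the map's setCenter call along with a zoom level appropriate to the region type.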

After that, it is time to sit down with all the NPDS sample services and see if I can't get the NPDS service that Dr. Jeremy Espino (who has already suggested a bevy of new technologies to look at) put together working with some of the new constructs, and create new time series to put into the polygons.

Then it is a hop, skip, and debug session away from the new NPDS demo, which will be the marriage of the Gmap-Polygon tech with the new NPDS service.

I'm hoping it shall be sufficiently nifty.

List of Open Geospatial Consortium (OGC) Web Mapping Services

This site lists Open Geospatial Consortium (OGC) Web Mapping Services:

http://www.ogc-services.net/

Friday, January 16, 2009

Gmap Polygon coming along nicely

I now have states, zip3s and zip5s all loading and caching nicely in GmapPolygon. The next step is to make it prettier on the back end and easier for other applications to load data.
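The caching, by the way, is nothing exotic; conceptually it is just a map from region code to the generated polygon JavaScript, along the lines of this simplified, hypothetical sketch:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration of the caching idea; not the actual Gmap-Polygon code.
public class PolygonCache {

    // Hypothetical interface for the expensive database/KML work.
    public interface PolygonLoader {
        String loadPolygonJs(String regionCode);
    }

    // region code (state, zip3, or zip5) -> generated polygon JavaScript
    private final Map<String, String> cache = new ConcurrentHashMap<String, String>();

    private final PolygonLoader loader;

    public PolygonCache(PolygonLoader loader) {
        this.loader = loader;
    }

    public String getPolygonJs(String regionCode) {
        String js = cache.get(regionCode);
        if (js == null) {
            js = loader.loadPolygonJs(regionCode); // only hits the database the first time
            cache.put(regionCode, js);
        }
        return js;
    }
}

Which is also why the more a region gets poked, the faster it should come back the next time.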

Also, while the polygons are all in the database, multiplied though they may be, I need to get the centroids into the database so that the map zooms in properly when individual states or zip3s are picked.

Another thing is that the NPDS service will be updating, so I am finishing up some of the tweaks to the gmapping service just in time.

I am going to go ahead and schedule a deploy early next week so people can start playing with it and seeing how the mapper behaves.

Thursday, January 15, 2009

AMDS Progress

I have been working with Dan to get the Simple Secure Transfer service working for the Harvard group for the last few days (see the SecureSimpleTransfer Odyssey postings). I did manage to upload Jim Tobias' RODS data set into my local PostgreSQL database yesterday and wade through a significant amount of reading on Globus security, which is essential, since the clients we develop need to be able to run without a Globus installation on the client machine, in much the same way that vBrowser works. I am getting my head back into where I left off with my code so that I can complete the AMDS Web UI. I still have a little work to do on the amds-rods-client in terms of resolving library dependencies.

Successful DRN Test

The following tests were successfully conducted with the Geisinger SAS server:
  1. List files on remote SAMBA share with SST service
  2. Upload a SAS file via SST
  3. Run SAS program on a remote SAS server
  4. Download SAS output file in CSV format via SST
  5. View output on client node


Lessons Learned:

The following errors were thrown by the SST service for various reasons:

Command: java -jar caGrid-Transfer-client-1.0.jar list .
Error Message: An error has occurred, error message = org.globus.gsi.GlobusCredentialException: Proxy file (/tmp/x509up_[uid-number]) not found
Explanation: You need to create a proxy before running SST commands. Run the grid-proxy-init command to correct this problem.

Command: java -jar caGrid-Transfer-client-1.0.jar list .
Error message: An error has occurred, error message = java.net.SocketException: Connection Reset
Explanation: Your client has trust issues with the remote node. Make sure your user credentials have been entered into the remote node's grid-map file and that your signing CA is trusted by the remote node.

Command: java -jar caGrid-Transfer-client-1.0.jar list .
Error message: An error has occurred, error message = java.lang.NullPointerException
Explanation: This means there is no data in the simpleTransfer directory.

Command: java -jar caGrid-Transfer-client-1.0.jar upload [file name]
Error message: An error has occurred, error message = java.io.IOException: Internal Server Error
Explanation: This means the mapped user account is unable to write to the simpleTransfer directory. You will need to create the directory or check the directory permissions to correct this issue.

Order of operation for SST installation:

  1. Start Globus Container with the Root user
  2. Copy the caGridTransfer.war to $CATALINA_HOME/webapps
  3. Run the globus-deploy-gar command on cagrid_TransferService.gar
  4. Run the ant command to deploy Globus to Tomcat
  5. Shut down the old Globus instance
  6. Start Tomcat with the Globus user (This starts Globus within Tomcat)
  7. Verify Globus is listening on the correct ports
  8. Run the grid-proxy-init command on the client node
  9. Test SST

Globus Toolkit User Perspectives Report Available

The report, Perspectives on Distributed Computing: Thirty People, Four User Types, and the Distributed Computing User Experience, is available for download. This report chronicles and analyzes the responses of thirty users to questions about using the Globus Toolkit - starting with summaries of results and conclusions but also including very detailed appendices and even transcripts of the interviews. Very interesting information for those involved in distributed computing:

http://www.mcs.anl.gov/~childers/perspectives/

The focus is HPC (high performance computing), but the toolkit comments are useful.

R and Cloud Computing

The following was posted on the R statistical computing group on LinkedIn:

Got models in R? Deploy and score them in ADAPA in minutes on the Amazon EC2 cloud computing infrastructure!

Zementis ( http://www.zementis.com ) has been working with the R community, specifically to extend the support for the Predictive Model Markup Language (PMML) standard which allows model exchange among various statistical software tools ( http://adapasupport.zementis.com/2008/02/how-can-i-export-pmml-code-from-r.html ).

If you develop your models in R, you can easily deploy and execute these models in the Zementis ADAPA scoring engine ( http://www.zementis.com/products.htm ) using the PMML standard. This not only eliminates potential memory constraints in R but also speeds execution and allows SOA-based integration. For the IT department, ADAPA delivers reliability and scalability needed for production-ready deployment and real-time predictive analytics.

SST Security Settings

The SST has been configured to use the grid-map file for user credential verification. Here is an excerpt from the log that demonstrates an unauthorized user attempting to access the SST.

Log File:
Jan 13, 2009 3:20:18 PM org.globus.wsrf.impl.security.authorization.GridMapAuthorization isPermitted
WARNING: Gridmap authorization failed: peer "/O=Messed/OU=updomain.net/OU=Some Organization/OU=updomain.net/CN=Bubba Gump" not in gridmap file.
Jan 13, 2009 3:20:18 PM org.globus.wsrf.impl.security.authorization.ServiceAuthorizationChain authorize
WARNING: "/O=Messed/OU=updomain.net/OU=Some Organization/OU=updomain.net/CN=Bubba Gump" is not authorized to use operation: {http://transfer.blahdomain.com/TransferService}storeFile on this service

We are currently researching a method for configuring this authorization via the Globus / Tomcat configuration files; the configuration is currently built into the SST gar file.

Wednesday, January 14, 2009

DRN Notes

DRN Notes:

Current error message:
An error has occurred, error message = java.lang.ClassCastException: org.globus.gsi.gssapi.GlobusGSSContextImpl

Actions Taken:

1. Verified that a valid proxy was created for the SST client side
2. Verified all Globus user and host certificates were installed with the correct permissions.
3. Checked the grid-mapfile to verify the remote user DN was mapped to a valid user account
4. Created a simpleTransfer directory at the default location. (Permissions may need to be changed – action for tomorrow)
5. Verified the permissions were set correctly for containercert.pem and containerkey.pem
6. Viewed the Tomcat server.xml to verify the paths for the cert, key, and cacertdir variables were set correctly
7. Decided to do a full service SST redeploy. Currently in progress - shut down Tomcat and deleted the webapps/caGridTransfer and wsrf directories
8. Remote admin had to leave during the operation. We will continue the redeploy first thing in the morning.

Next Steps:

1. Start Globus via VDT
2. Run the globus-deploy-gar command on the SST gar file
3. Shut down VDT Globus
4. Redeploy Globus to Tomcat
5. Start Tomcat
6. Test Transfer

Other Notes:

We were able to generate output with the SAS code and the SAS Template that was provided by Roy.

Harvard paper on SAS Macros for Satscan inputs

This paper from Harvard provides a set of SAS Macros to automate the generation of input files for SatScan and could be useful as we move ahead with the SatScan grid service from the Univ of Utah.

A SatScan macro accessory for cartography (SMAC) package implemented with SAS

http://www.ij-healthgeographics.com/content/6/1/6

Jim Tobias

Tuesday, January 13, 2009

Zip3's now much better.

Well, the zip3's are now all showing up. But, as you can see from the wonderful comment that Jim left, zip3's are so-so geographic locators and spurious population references. As Jim pointed out, having one contiguous zip3 for 'Manhattan' pretty much makes epidemiology by zip3 useless in New York City. Then again, I don't think we're using zip3's for epidemiology; they are "things which are bigger than zip5s but smaller than states".

But this all brings up a good point: Our system is going to have to be all sorts of flexible. This is because we are dealing with GeoData, and in addition to GeoData changing all the time (zipcodes move, census numbers are updated), everyone will want to use a different set.

I don't think anyone will ever agree on a good way to geographically handle reporting systems like RODS or PoiConDai, and the requirements will be completely different depending on whether things are being viewed based on geography or population density, so I feel like the easiest thing we can do is allow our systems to easily start using whatever new-fangled (or old-fangled) visualization technique is preferred at the time by whomever wants to use our system.

I feel like Gmap-Polygon is a good step in that direction. It is basically a way to put polygons on the map for whatever region you want to draw, so as long as you have the polygons put into the database beforehand, and you know how you want to shade them, you should be able to use the software easily. This is of course assuming that you want to put it in a secure portal and don't just want to expose KML (because for drawing polygons with nifty shading and effects, KML is a bit more straightforward).

I feel like the other thing that PHGrid products will need to start focusing on is ways to securely associate data, because a lot of people will want to focus on things like the census tracts that Jim mentioned, and no one will know what census tract they live in when they go to a hospital, and pulling up address data for a count item is going to be a very big no-no for things outside the secure network. I wonder if one of the applications of the Natural Language Processor could be backfilling the AMDS regional info with census tract data and the like.

Monday, January 12, 2009

Oh look, Zip3s have multiple polygons in them.

I am beginning to think there is a new adage: New Polygon, new thing discovered about your system that doesn't support the new polygon.

So, zip3's do not form one big contiguous block of area in all cases. In some cases, they will form multiple polygons because some zip code changed from 61299 to 61304 or whatever.

But the polygon system I had set up was expecting only one polygon per region. Thus, I will need to do a bit of an overhaul (it actually won't be too bad, yay object-oriented coding), but will need to massage a new list of zip3's (and probably zip5s and states) out of the KML files.
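In case anyone is wondering what the overhaul amounts to: the region object just needs to carry a list of polygons instead of assuming exactly one. Roughly like this (class names here are illustrative, not the actual Gmap-Polygon classes):

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a region that owns several polygons instead of one.
public class Region {

    // Placeholder for whatever the real polygon type looks like.
    public static class Polygon {
        private final List<double[]> points = new ArrayList<double[]>();

        public void addPoint(double lat, double lng) {
            points.add(new double[] { lat, lng });
        }

        public List<double[]> getPoints() {
            return points;
        }
    }

    private final String regionCode;  // e.g. "612" for a zip3
    private final List<Polygon> polygons = new ArrayList<Polygon>();

    public Region(String regionCode) {
        this.regionCode = regionCode;
    }

    public String getRegionCode() {
        return regionCode;
    }

    public void addPolygon(Polygon polygon) {
        polygons.add(polygon);
    }

    // Every polygon in the list gets drawn with the same shading for the region.
    public List<Polygon> getPolygons() {
        return polygons;
    }
}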

Then, hopefully, I'll have entire states full of zip3s, and then be able to select zip5s within zip3s, and then comes the PoiConDai refurb.

Friday, January 9, 2009

Obama's Speech (January 8, 2009)

Smart Grid technology (not what you think; it's an upgrade to the existing national power grid, which still uses 120-year-old technology; but it has a strong analog to what we're trying to accomplish with grid computing).

Read about Smart Grid here.

Read article on Obama's speech below.

Obama includes broadband, smart grid in stimulus package

January 8, 2009 (IDG News Service) U.S. President-elect Barack Obama laid out his plan for a huge economic stimulus package, with a broadband rollout, an Internet-based smart energy grid and computers for schools as part of the plan.

During his campaign, Obama included rolling out broadband, energy issues and computers for schools in his list of goals, but in Thursday's speech in Fairfax, Va., he called for those items to be included in a giant stimulus package he'll push Congress to pass within weeks. The stimulus package could cost close to $1 trillion.

The president-elect called the economic situation in the U.S. a "crisis unlike any we have seen in our lifetime."

He also called for all U.S. medical records to be computerized within five years. "This will cut waste, eliminate red tape and reduce the need to repeat expensive medical tests," he said. "But it just won't save billions of dollars and thousands of jobs -- it will save lives by reducing the deadly but preventable medical errors that pervade our health care system."

Obama called on Congress to approve funding for rolling out broadband to unserved and underserved areas, although his speech did not provide details on how he wants it to happen. Several tech groups have called for a national broadband policy that would include a mixture of tax credits, loans and payments to broadband providers that bring broadband to new areas.
Part of the package should include rebuilding physical infrastructure such as roads and bridges, Obama said. "But we'll also do more to retrofit America for a global economy," he added. "That means updating the way we get our electricity by starting to build a new smart grid that will save us money; protect our power sources from blackout or attack; and deliver clean, alternative forms of energy to every corner of our nation. It means expanding broadband lines across America, so that a small business in a rural town can connect and compete with their counterparts anywhere in the world."

Smart energy grids would allow real-time monitoring of a customer's energy use through Internet technology. Proponents of a national smart grid say it would likely result in decreased electricity use, allow energy companies to more efficiently distribute electricity, and encourage homeowners to install alternative energy generators such as solar panels and sell their excess energy back to the grid.

Obama also called for Congress to approve money for "21st-century" classrooms, laboratories and libraries. "We'll provide new computers, new technology and new training for teachers so that students in Chicago and Boston can compete with kids in Beijing for the high-tech, high-wage jobs of the future," he said.
Obama's priorities line up with several tech groups that have been calling for more broadband and smart-grid funding. The Information Technology Industry Council (ITI), a trade group, praised Obama's stimulus plan. The package outlined by Obama represents an "excellent starting point," ITI President Dean Garfield said in a statement.

"Our firms know that technology investments are the quickest way to dramatically turn the
economy around," he added. "Increased broadband spending, electronic medical records, green energy investments and new computers for schools and libraries are all smart ways to keep America competitive while also creating new jobs and spending."

Wednesday, January 7, 2009

Updates to the AMDS draft

After Tom received some feedback from Farzad Mostashari, we changed the AMDS schema a bit to make it more usable.
Now Level 1 has:
Syndrome
Syndrome classifier
Age Group (0<2; 2<5; 5<18; 18<54; 54<65; 65+)
Patient State
Patient Zip (either zip3 or zip5)
Date Start
Date End (in increments of at least 1 day)
Count

We removed denominator as that's not really necessary (since you can query for all syndromes).
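For reference, the Level 1 fields above boil down to a record shaped roughly like this (just an illustration of the field list; the actual draft is an XML schema, and these Java names are mine):

// Illustration of the Level 1 field list only; not the actual AMDS schema artifact.
public class AmdsLevel1Record {
    public String syndrome;
    public String syndromeClassifier;
    public String ageGroup;      // one of: 0<2, 2<5, 5<18, 18<54, 54<65, 65+
    public String patientState;
    public String patientZip;    // either zip3 or zip5
    public String dateStart;     // ISO date
    public String dateEnd;       // at least one day after dateStart
    public int count;
}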

Level 2 (not developed yet):
Gender
Race
Ethnicity
Detailed Geography

Also, Dr. Lenert created different stages of access support for AMDS:
Stage 0: Scheduled transfers of AMDS documents to a supernode/central database
Stage 1: AMDS data available using secure grid services
Stage 2: AMDS data de-duped via secure grid services
Stage 3: AMDS data de-duped and combined together from multiple nodes (although after Stage 1, results can be combined together)

Update on Collaboration with SatScan on Cloud

Ron Price and I chatted yesterday about some progress that he has made with the SatScan on the cloud service:

I have built a simulation dataset with 53,018 random points within NC, SC, and GA that is based upon a RODS dataset that was provided to me.

I am working with a statistician in NCZVED to write scripts to automatically generate SatScan parameter files (.prm) as well as case files (.cas), population files (.pop) and coordinate files (.geo) that would all be used as inputs to the SatScan batch mode (for the grid service).
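To give a rough picture of what those generation scripts produce: a SatScan case (.cas) file is essentially one line per location with a case count and a date. Here is a minimal, hypothetical sketch of writing one out (the real NCZVED scripts may look nothing like this, and the exact column layout should be checked against the SatScan user guide):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

// Minimal sketch: write simulated points out as a SatScan case (.cas) file.
// Assumed layout per line: "<location id> <case count> <date>".
public class CaseFileWriter {

    public static class CasePoint {
        public final String locationId; // e.g. a zip code or county identifier
        public final int cases;
        public final String date;       // e.g. "2008/12/01"

        public CasePoint(String locationId, int cases, String date) {
            this.locationId = locationId;
            this.cases = cases;
            this.date = date;
        }
    }

    public static void write(String fileName, List<CasePoint> points) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(fileName));
        try {
            for (CasePoint p : points) {
                out.println(p.locationId + " " + p.cases + " " + p.date);
            }
        } finally {
            out.close();
        }
    }
}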

Also, I have emailed the CCID TB group several singular genotype spatial cluster maps as examples as well as built one web mapping service to visualize spatial clusters. It is my hope that Ron and I can collaborate on a demo of the grid service and simulated data to reignite the interest of the CCID TB epidemiologists.

Jim Tobias

R and Rweb

R is an open source statistical software package: http://www.r-project.org/

R is used extensively by epidemiologists and geographers.

One can download R (free): http://cran.cnr.berkeley.edu/

The base install of R can be extended with R packages.
A fairly comprehensive list of R packages is here: http://cran.r-project.org/web/packages/

The R software is level III approved for use at the CDC.

The R software runs on your desktop but the R code can be run through a web service.

An example of an R web service is RWeb: http://www.math.montana.edu/Rweb/

Specific packages, such as EpiTools developed by UC Berkeley, support full epidemiology coursework with R.

If you have questions or wish to see this in action... just stop by.

Jim Tobias

Tuesday, January 6, 2009

package changes and SVN

So, most of this morning was spent doing some data file massaging into CSVs, which were then loaded into the postgresql database.

It was sort of humbling, because my prior approach would have been to build a big parser library and then spend another day or two debugging it. In this case, I just took the files, ran a macro on them, and then told the database to import them. The result is I spent just two-thirds of a day getting all the data I wanted into the right spot. The downside is I'll have to do it all again (well, probably spend a third of a day) when database updates need to be made; then again, I probably would have had to tweak things in my loader scripts anyway, and it's still the same cost to the end user... in fact probably less. Before, it was: download files, download code, adjust properties, package, build, run. Now it's: download CSV, create table in database, import CSV into table in database. Yeah. It's easier now.
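For anyone who wants the programmatic version of "create table, import CSV": with the PostgreSQL JDBC driver it boils down to a single COPY call. A hedged example follows (the table name, columns, CSV file name, and connection details are all placeholders):

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

// Example only: table, columns, and connection details are made up for illustration.
public class Zip3Loader {

    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/gmap", "gmap_user", "secret");

        // "create table in database"
        Statement st = conn.createStatement();
        st.execute("CREATE TABLE zip3_polygon (zip3 varchar(3), state varchar(2), polygon_points text)");
        st.close();

        // "import CSV into table in database" -- PostgreSQL's COPY through the JDBC driver
        CopyManager copy = new CopyManager((BaseConnection) conn);
        FileReader csv = new FileReader("zip3_polygons.csv");
        copy.copyIn("COPY zip3_polygon FROM STDIN WITH CSV", csv);
        csv.close();

        conn.close();
    }
}

Or skip the code entirely and just run the COPY (or \copy) from psql, which is exactly what makes this route so much cheaper than a custom parser.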

Otherwise, there are lists of relational data (all the zip3s in Nevada, all the zip5s in a given zip3, etc.). Now I get to start building the code libraries that will serve it up as needed. Before I did that, I had to do a package update... which involves changing one line in a lot of places.

Luckily, Eclipse handled the refactoring for me beautifully, but it terribly broke the Subversion management of it all. Thus, I spent the rest of the day deleting out the appropriate files and committing the changes by hand.

Tomorrow is the implementation of the lists, and updating documentation for the loader and the list-libraries and, well, everything else.

Dallas Network Issues

  • The Tarrant node is back online. The root cause of the problem was that the VM failed to restart after a host server reboot. The VM was manually restarted successfully.

  • The Dallas node is experiencing slow response times due to network issues that are currently being addressed by the Dallas network team. They are in the process of investigating the root cause of the issue. This issue currently affects the RPDSAdai demo and any data transfers to and from the Dallas node. I was informed that this issue would be corrected by close of business on 01-06-2009.

Link Plus: From CDC's National Program of Cancer Registries (NPCR)

Link Plus is a probabilistic record linkage program developed at CDC's Division of Cancer Prevention and Control in support of CDC's National Program of Cancer Registries (NPCR). Link Plus is a record linkage tool for cancer registries. It is an easy-to-use, standalone application for Microsoft® Windows® that can run in two modes: to detect duplicates in a cancer registry database, or to link a cancer registry file with external files. Although originally designed to be used by cancer registries, the program can be used with any type of data in fixed width or delimited format. Used extensively across a diversity of research disciplines, Link Plus is rapidly becoming an essential linkage tool for researchers and organizations that maintain public health data. http://www.cdc.gov/cancer/NPCR/tools/registryplus/lp.htm

Download Application Here:
ftp://ftp.cdc.gov/pub/Software/RegistryPlus/Link_Plus/RPLinkPLus-2.0.exe

Features
http://www.cdc.gov/cancer/NPCR/tools/registryplus/lp_features.htm

Febrl - An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface

One of the key services that will eventually have to be provided on the grid, given the vast amounts of data to be integrated, is one that focuses on Data Cleaning, Deduplication and Record Linkage. This domain has been a significant challenge for years - even off the grid. Clearly, this is a very important and exciting area of research.

I found out about Febrl yesterday. It has been developed since 2002 as part of a collaborative research project conducted between the Australian National University in Canberra and the New South Wales Department of Health in Sydney, Australia.

The group is working on Parallel Large Scale Techniques for High-Performance Record Linkage.
Link to their efforts: http://datamining.anu.edu.au/linkage.html

The NCPHI Grid team is going to investigate this - and find out what, if any, aspects of Febrl might be leveraged to help provide this service on the PHGrid.

Note: It requires Python to be installed.

Here are two papers on Febrl.
http://cs.anu.edu.au/~Peter.Christen/publications/kdd2008christen-febrl-demo.pdf
http://crpit.com/confpapers/CRPITV80Christen.pdf

Here is the sourceforge link to Febrl
http://sourceforge.net/projects/febrl

Features include:

* Probabilistic and rules-based cleaning and standardisation routines for names, addresses, dates and telephone numbers.

* A geocoding matching system based on the Australian G-NAF (Geocoded National Address File) database.

* A variety of supplied look-up and frequency tables for names and addresses.

* Various comparison functions for names, addresses, dates and localities, including approximate string comparisons, phonetic encodings, geographical distance comparisons, and time and age comparisons. Two new approximate string comparison methods (bag distance and compression based) have been added in this release.

* Several blocking (indexing) methods, including the traditional compound key blocking used in many record linkage programs.

* Probabilistic record linkage routines based on the classical Fellegi and Sunter approach, as well as a 'flexible classifier' that allows a flexible definition of the weight calculation.

* Process indicators that give estimations of remaining processing times.

* Access methods for fixed format and comma-separated value (CSV) text files, as well as SQL databases (MySQL and new PostgreSQL).

* Efficient temporary direct random access data set based on the Berkeley database library.

* Possibility to save linkage and deduplication results into a comma-separated value (CSV) text file (new).

* One-to-one assignment procedure for linked record pairs based on the 'Auction' algorithm.

* Supports parallelism for higher performance on parallel platforms, based on MPI (Message Passing Interface), a standard for parallel programming, and Pypar, an efficient and easy-to-use module that allows Python programs to run in parallel on multiple processors and communicate using MPI.

* A data set generator which allows the creation of data sets of randomly generated records (containing names, addresses, dates, and phone and identifier numbers), with the possibility to include duplicate records with randomly introduced modifications. This allows for easy testing and evaluation of linkage (deduplication) processes.

* Example project modules and example data sets allowing simple running of Febrl projects without any modifications needed.

-----
An Evaluation of Febrl
http://www.qsinc.com/dedup/results_FEBRL.html

Conclusion of Evaluation:

Febrl is an excellent research tool to study the deduplication process. It provides refinements in record standardization, geocoding and field matching. It provides flexibility in how data is selected for scoring, how record fields are compared, and how the ultimate score is calculated. Febrl is also useful as an outboard record standardizer, of particular interest where names, addresses and phone numbers are not already separated into component parts.
It became clear, as we understood the inherent flexibility of Febrl, that with sufficient effort we could have used our long experience in deduplication to improve our results. It's all in the configuration.
However, we couldn't think of any way to deal effectively with twins. Probabilistic record linking schemes that rely solely on aggregating scores of individual field comparisons do not perform well on data with twins. Febrl couldn't distinguish between a true match with data errors and a twin. It didn't "understand" data in context.
Febrl places considerable resources at the researcher's command. It provides a deduplication laboratory allowing the experimenter lots of knobs to turn in order to arrive at a deduplication solution. But a laboratory is not a factory. We found ourselves "trying just one more thing" over and over again in the interests of fairness for this comparison with no way to decide when enough is enough. With endless opportunities for tinkering, adjustment and trying "just one more thing", it's an hourly consultant's paradise. For this reason, look very carefully at any consultant's recommendation to use Febrl.
While Febrl is free open source software, we believe that the actual cost of moving it from the laboratory into production could be significant. Its inability to deal with twins is a fatal flaw.

Monday, January 5, 2009

Setting up the locational library.

Well, right before I went off for the holidays, I completed the part of the Gmap-Polygon libraries that would easily provide JavaScript for an array of polygons.

Now, I am in the process of building lists so that any application that wants to provide locations will be able to get lists of states, zip3's in said states, and zip5s in said zip3s. That, and improving the graphic awareness to be able to center on given areas after they are selected.

This of course means massaging large KML lists, and discussions on what would be most useful to someone trying to build their own databases on their own systems. Right now it seems the best answer would be "comma-delimited files," considering every database has a way to very quickly ingest such files.

I also am happy to have gotten some examples of text massaging using macros, and am once again making sure to earmark some time in the new year for basic scripting niftiness, namely because there are just some times when creative uses of grep and regular expressions will do so much more with a file than trying to create a SAX parser.

Friday, January 2, 2009

NIH Webcast - Transforming Science: Cancer Control on caBIG

Videocast of the November 10th 2008 Forum:

http://videocast.nih.gov/Summary.asp?File=14755.

Special thanks to Abdul R Shaikh, PhD, MHSc, Behavioral Scientist Program Director, Health Communication and Informatics Research Branch, Behavioral Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute.

FYI - the NCI is moving forward to launch a division web site on grid-enabled research in population health.

Thursday, January 1, 2009

HealthGrid 2009 conference Berlin - Call for Papers

Papers are now being accepted for the HealthGrid 2009 Conference in Berlin - June 28 - July 1, 2009

KEY DATES
Call for papers opens: 18 December 2008
Call for papers closes: 1 February 2009
Registration opens: 1 February 2009

More information about submission:

http://berlin2009.healthgrid.org/fileadmin/templates/download/HealthGRID2009_Call-for-Papers.pdf

Conference information:

http://berlin2009.healthgrid.org/index.php?id=1