Monday, December 29, 2008

Columbia’s Natural Language Processing Technology Licensed to NLP International

Congratulations to our friends at Columbia

NEW YORK -- Columbia University reports that it has licensed MedLEE, a natural language processing technology, to the Columbia spin-out NLP International Corp.

Millions of dictated and typed medical reports require laborious and time-consuming processing by highly trained and expensive experts who manually review and extract the required information. MedLEE reduces the time and expense associated with these processes by automatically extracting and encoding all relevant clinical information. Once coded, the information is easily available and accessible for further clinical processes like billing, reimbursement, quality assurance analytics, data mining, accreditation and others.

“A significant proportion of the [electronic] health care record (EHR) resides in the form of unstructured, natural language text and both the amount of data and the number of records is growing dramatically,” said Bernie Keppler, founder and chief executive of NLP International. “Ready access to this information is critical to save costs and improve the quality and access to health care.”

MedLEE has been successfully tested by large hospital systems and government agencies, including NewYork-Presbyterian Hospital, the National Cancer Institute and the U.S. Department of Defense. Several pharmaceutical companies and healthcare information system vendors are currently evaluating MedLEE for a variety of applications.

“I am excited that this technology will now be more broadly available to hospitals and other health care organizations, where it can continue to contribute to improving patient care,” said Carol Friedman, Ph.D., Professor of Biomedical Informatics at Columbia University, who developed the technology. “MedLEE has been used in the academic community for many years to develop clinical applications that have been shown to improve the quality of health care.”

“MedLEE is considered by many in the field as the gold standard for unstructured medical text processing, but it has not been available as a commercial, enterprise-ready product,” said Donna See, Director of Strategic Initiatives at STV, which brokered the deal. “We are very pleased to be partnering with NLP International to introduce MedLEE to these markets.”

Robert Sideli, M.D., CIO at Columbia University Medical Center, said the already widespread deployment and use of MedLEE throughout the research and healthcare communities bodes well for the system’s commercial success. “It will contribute substantially to higher efficiency in the electronic medical record industry due to its superior functionality in medical data extraction, coding, analytics and data mining,” Sideli said.

David Lerner, who oversees new ventures for STV, said Columbia has found the right partner in NLP International. “This venture will add to the many successful technology spin-off ventures for which Columbia University is known,” Lerner said.

Tuesday, December 23, 2008

A Build Point

I have reached a point where I was able to hand the code for gmap-polygon to Felicia so she could use it to draw polygons for the data her services were returning. Many thanks to Felicia and Brian: Felicia for her patience through all the "oh crap, that's not doing what I thought, sync and try again" moments, and Brian for doing some TextEdit niftiness to turn the KML files into easier-to-parse flat files.
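For anyone curious, that flattening step can also be scripted. Here is a rough sketch of the idea, pulling the text of each KML `<coordinates>` element out as one flat, space-separated line; the `KmlFlattener` name and the regex approach are illustrative (the actual conversion was done by hand in TextEdit):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch; not the actual process used on our KML files.
public class KmlFlattener {
    // Grab the text content of each <coordinates> element, across newlines.
    private static final Pattern COORDS =
            Pattern.compile("<coordinates>(.*?)</coordinates>", Pattern.DOTALL);

    /** Returns one flat, space-separated line per polygon found in the KML. */
    public static List<String> flatten(String kml) {
        List<String> lines = new ArrayList<>();
        Matcher m = COORDS.matcher(kml);
        while (m.find()) {
            // Collapse newlines and runs of whitespace into single spaces.
            lines.add(m.group(1).trim().replaceAll("\\s+", " "));
        }
        return lines;
    }
}
```

A regex is fine for a one-off flattening pass like this, though a real XML parser is the safer choice if the KML has nested or namespaced elements.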

Otherwise, I am going to be off on vacation for the next couple of days, and Felicia will still be coding for some of them, so I am glad I got a chance to explain the inner workings of the code, and that I spent a lot of time trying to make it as agnostic as possible.

That being said, I really need to step up my Javadoc documentation. I'm committing to spending the last hour of each day documenting all my new code (and old code I forgot to document) and checking it in. That way, people who AREN'T sitting right next to me will have a much easier time figuring out how to use things and why I did them the way I did.

Cheers, and Happy Holidays.

AMDS White Paper

Wil posted an initial draft of an Aggregate Minimum Data Set white paper over on the wiki. Please read it and submit comments on how to improve the document. The purpose of the document is to show how aggregate data can be useful for biosurveillance. This isn't a new concept, but we wanted to put out a document clarifying how our AMDS-related activities can benefit biosurveillance.

I'll be offline due to Christmas and New Year's and won't be back online (I hope) until January 5. So Merry Christmas, Happy Hanukkah and Happy New Year.

Monday, December 22, 2008

A plague of ticks (and by ticks, I mean little truncating errors)

So, I am sort of excited because I think I am ready with Gmap Polygon, but on Friday I noticed that some of the polygons weren't getting drawn well at all, because the polygon strings (long lists of coordinates separated by spaces) were way too short; they were getting cut off.

So today I spent a lot of time monkeying around with the little SAX engine I was using, and now I am sure the answer is to stop using it. While it was a quick class and rather fast, it is really screwing up some of the things I am trying to do and some of the cases I am trying to handle, and I am no further along in figuring out how to make it stop.

So tonight, I am going to read up on JAXB and maybe work a tutorial, and tomorrow, I hope to get this fixed.
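For the record, one common cause of exactly this symptom (long text nodes coming back truncated) is that a SAX parser is allowed to deliver a single text node across several characters() calls, so a handler that keeps only one chunk silently drops the rest of a long coordinate list. A sketch of the accumulate-in-a-buffer fix, assuming a KML-style coordinates element; the class and element names here are illustrative, and the post above doesn't confirm this was the actual bug:

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative handler: accumulates every characters() chunk instead of
// assuming each text node arrives in a single call.
public class CoordinatesHandler extends DefaultHandler {
    private StringBuilder buffer;
    private final List<String> polygons = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("coordinates".equals(qName)) {
            buffer = new StringBuilder();   // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (buffer != null) {
            buffer.append(ch, start, length);   // append every chunk, never replace
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("coordinates".equals(qName)) {
            polygons.add(buffer.toString().trim());
            buffer = null;
        }
    }

    public List<String> getPolygons() { return polygons; }
}
```

JAXB sidesteps this entirely by handing you whole unmarshalled objects, so moving off hand-rolled SAX is a reasonable call either way.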

Otherwise, there are some other fun questions about what everything will end up looking like and how crazy some borders are. It seems that zip3s can be inside other zip3s, and a single zip3 can cover more than one area. I guess that's what happens when you base your regions on mailing routes. Zip3s have a lot of 'outer boundaries', which makes me wonder about the best way to store the polygons; we might have to go to a many-to-one relationship (many polygons per zip3).
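If we do go that route, the storage side could be as simple as one region key owning a list of boundary strings. A minimal sketch of the idea, with illustrative names:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: one zip3 key can own several outer-boundary polygons.
public class Zip3PolygonStore {
    private final Map<String, List<String>> polygonsByZip3 = new HashMap<>();

    public void addPolygon(String zip3, String polygonString) {
        // Create the list lazily the first time a zip3 shows up.
        polygonsByZip3.computeIfAbsent(zip3, k -> new ArrayList<>()).add(polygonString);
    }

    /** All outer boundaries for a zip3 (empty list if we have none). */
    public List<String> getPolygons(String zip3) {
        return polygonsByZip3.getOrDefault(zip3, Collections.emptyList());
    }
}
```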

I'll keep y'all posted, and hopefully there will be a cool little demo app soon next year, or even cooler, my mapper on top of Felicia's RODS service.


DRN Test

We ran into some network connectivity problems and configuration issues during the Geisinger test. We were unable to connect to port 8443 from the NCPHI node. Joel and I worked with Geisinger network support to resolve the issue. Once this was resolved, we began configuring the Simple Transfer Service. Joel was able to download the service, but we ran into some Java exceptions while trying to test the client from the command prompt. Felicia assisted us in resolving them. I sent Joel instructions on setting up the container to run the Transfer service.

We need to schedule another test of the DRN service. I talked to Beth and I left a voice-mail message for Jeff Brown informing him of the testing status.

SaTScan on the Cloud Update

Lately I've been working on the first draft of a paper regarding SaTScan on the cloud and familiarizing myself with the client-side objects of Nimbus that I will instantiate to programmatically create SaTScan resources on the cloud on demand. So, things are going pretty well.

Happy Holidays!

Wednesday, December 17, 2008

NHIN Presentation, AMDS

The presentation that would have been presented at the NHIN forum is now on the wiki. Unfortunately, the agenda gods conspired against us. Regardless, it is a good intro to the ideas of the AMDS, and how this is the type of service that enables population health situational awareness.

Monday, December 15, 2008

RODSA-DAI Script for NHIN Demo

In order to support the demonstration of the RODSA-DAI services tomorrow morning at the 5th NHIN Public Forum, John and I wrote a script to explain our PH scenario.

It goes something like this:
1. This is all sample data.
2. The counts and colors are not statistically meaningful. This is for demonstration purposes only.
Two public health data providers with overlapping catchment areas assist a local epidemiologist with biosurveillance. In this scenario, the systems do not share detailed data electronically due to technical or policy incompatibility.
As a part of his routine work, a county epidemiologist must check to see if there is a cumulative spike in flu activity in the region. He can do this by sending a query for ‘fever’ to a service on each node. This query is based on a draft AMDS that is mapped to each system.
• Step 1 – Launch page. You will see that NCPHI Server is ‘on’. This shows data from the NCPHI research lab node, with the pin points representing counts of cases of fever per ZIP code. When the user mouses over the pin for ZIP 15227, it is Green, indicating 10 cases.

• Step 2 – Click NCPHI Server ‘off’ and DallasServer ‘on’, and press submit. This map shows data from the other node, again with counts of fever per ZIP code. ZIP 15227 is Yellow, with 14 cases.

• Step 3 – Check both boxes, press submit and you will see the cumulative count. As a result of this query, you will see that ZIP code 15227 is now Red, with 24 cases.
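The cumulative step boils down to summing per-ZIP counts across whichever nodes are checked. A minimal sketch of that merge, using the values from the script above; the class name and map-based shape are assumptions, not the demo's actual code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the cumulative count shown in Step 3.
public class AmdsAggregator {
    /** Sums per-ZIP counts across the responses from each selected node. */
    public static Map<String, Integer> combine(List<Map<String, Integer>> nodeCounts) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> counts : nodeCounts) {
            // merge() adds to an existing ZIP's total or starts a new one.
            counts.forEach((zip, n) -> totals.merge(zip, n, Integer::sum));
        }
        return totals;
    }
}
```

With the demo's numbers, the NCPHI node's 10 cases and the Dallas node's 14 cases for ZIP 15227 combine to the 24 shown in Red.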

Friday, December 12, 2008

OSU/Emory Meeting

We had some interesting meetings with the Emory and Ohio State grid teams this week. Specifically we were able to sit in on the Emory/OSU collaboration session and to meet with Steve Langella in person and Shannon Hastings over the phone about the activities ongoing and planned for PHGrid.

Steve was especially helpful in meeting with Dan, Raja and Joseph about how caGrid's security components work and how they can help with our planning for the AMDS Pilot that is ramping up.

We're going to be updating our architecture diagrams to incorporate Steve's ideas and caGrid's infrastructure components.

Gmap Polygon, objectified.

I basically have the test I was running before working. Ironically, there are still about five lines of Java code in the JSP when there should be more like two or three.

But this is because I need to go ahead and implement the framework that will provide the polygons, rather than coding some into the test JSP. This means the two courses of action are to reformat poicondai to use the new system, or to make a set of forms and controllers that are more generic and allow for the automation of a lot of the map controls.

After some discussions with Brian and Felicia, it appears I will be working on the latter: A set of generic map controls... thus making it easier for anyone else to use the map controls for their own display needs.

I think it will also make it even easier to revamp poicondai and integrate with Felicia's RodsHDS.

SaTScan on the Cloud Progress

I just successfully deployed the SaTScan grid service to the Nimbus cloud and invoked the SaTScan service on the cloud from my Windows XP notebook. Essentially, I stood up a grid node on demand on the cloud and then ran a SaTScan job which uploaded files to the cloud and obtained results from it. For now the Grid Security Infrastructure is not being used. Now to move on to programmatically standing up several SaTScan services on demand on the Nimbus cloud.

Have a great weekend.


Distributed computing with Linux and Hadoop

Every day people rely on search engines to find specific content in the many terabytes of data that exist on the Internet, but have you ever wondered how this search is actually performed? One approach is Apache's Hadoop, which is a software framework that enables distributed manipulation of vast amounts of data. One application of Hadoop is parallel indexing of Internet Web pages. Hadoop is an Apache project with support from Yahoo!, Google, IBM, and others. This article introduces the Hadoop framework and shows you why it's one of the most important Linux®-based distributed computing frameworks.

Capturing this project here for potential future research. Read full article here.

Wednesday, December 10, 2008

Zicam Cold & Flu Companion Mobile app

Zicam has an app for the G1 (soon to be for the iPhone) that shows syndromes occurring in your zip code. It is available at

Pretty interesting and seems similar to the AMDS concept of the National Retail Data Monitor service put out by RODS.

ColorHandler updated, and boy is it huge.

Well, it seems that having a range-conscious coloring object is a bit more difficult than even I was expecting, especially since I am allowing nulls to mean "there is no limit, therefore it is infinite" for things like "all cases equal to or less than 0 need to have white shading" and "all cases above 20 need to have red shading".

But I got it to build, which means I wasn't blowing up anything obvious. Tomorrow I will saddle up the test harness for the color processor and run it through its paces.
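The null-means-unbounded rule described above can be sketched as a small range class. This is an illustration of the idea, not the actual ColorHandler code:

```java
// Illustrative sketch: a shading rule where a null bound means
// "unbounded on that side".
public class ColorRange {
    private final Integer min;   // exclusive lower bound; null = no lower limit
    private final Integer max;   // inclusive upper bound; null = no upper limit
    private final String color;

    public ColorRange(Integer min, Integer max, String color) {
        this.min = min;
        this.max = max;
        this.color = color;
    }

    /** True when the count falls inside (min, max], treating null as infinite. */
    public boolean contains(int count) {
        boolean aboveMin = (min == null) || count > min;
        boolean belowMax = (max == null) || count <= max;
        return aboveMin && belowMax;
    }

    public String getColor() { return color; }
}
```

With this shape, `new ColorRange(null, 0, "white")` and `new ColorRange(20, null, "red")` express the two examples above without any sentinel values like Integer.MIN_VALUE.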

Then, it is building a very simple popup handler with a simple template. Here's to hoping it is simple.

Alas, this means it might be Friday before I have another good tester of the Gmap Polygon JSP (but hopefully it will be very small and just sort of go 'create grid map object, draw grid map object').

Google Flu Trends spreads privacy concern

December 9, 2008 (Computerworld) Google's new Flu Trends tool, which collects and analyzes search queries to predict flu outbreaks around the country, is raising concern with privacy groups.

The Electronic Privacy Information Center filed a Freedom of Information Act request asking federal officials to disclose how much user search data the company has recently transmitted to the Centers for Disease Control and Prevention, or CDC, as part of its Google Flu Trends effort.

Concern stems from what privacy groups claim is a disturbing lack of transparency surrounding the method Google is using to predict flu outbreaks. Google has publicly stated that all the data used is made anonymous and is aggregated, but there has been no independent verification of how search queries are used and transformed into data for Google Flu Trends, said the privacy groups.

"What we are basically saying is that if Google has found a way to ensure that aggregate search data cannot be used to re-identify the people who provided the search information, they should be transparent about that technique," said Marc Rotenberg, Electronic Privacy Information Center's president.

Rotenberg said the issue is important because the same techniques Google is using to predict flu outbreaks could be applied to tracking other serious diseases, such as SARS. "Let's say we have a spike in Detroit of SARS and the police say we want to know who in Detroit submitted those searches. How can Google ensure that this can't be done? The burden is on Google," Rotenberg said.

Publicly disclosed in November, Google Flu Trends has been described by the company as a Web tool to help individuals and health care professionals obtain influenza-related activity estimates for all U.S. states, up to two weeks faster than traditional government disease surveillance systems.

Google said in a blog post introducing Flu Trends last month that search queries such as "flu symptoms" tend to be very common during flu season each year. A comparison of the number of such queries with the actual number of people reporting flu-like symptoms shows a very close relationship, it said. As a result, tallying each day's flu-related searches in a particular geography allows the company to estimate how many people have a flu-like illness in that region.

Google also noted that it had shared results from Flu Trends with the epidemiology and prevention branch of the influenza division at the CDC during the last flu season and noticed a strong correlation between its own estimates and the CDC's surveillance data based on actual reported cases. Google said that by making flu estimates available each day, Google Flu Trends could provide epidemiologists with an early-warning system for flu outbreaks.

Rotenberg said the service was potentially useful, but much depended on the kind of search data that Google is collecting and analyzing to make its predictions. Google has said that the database it uses for Flu Trends retains no identity information, IP addresses or any physical user locations. However, what is not clear is whether the company is completely deleting IP addresses, and if so, when it is doing it. Also, he said another issue was whether all Google is doing is anonymizing IP addresses by redacting some of the numbers in an IP string.

Google also claims that as part of its overall privacy policy it anonymizes all IP addresses associated with searches after nine months. Yet in an apparent contradiction, when introducing Flu Trends, Google noted that it uses both current and historic search data -- dating back to 2003 -- to make its predictions, Rotenberg said.

Jeffery Chester, executive director of the Center for Digital Democracy, said Google's growing presence in the health care space also makes it important for the company to disclose what kind of data it is collecting and using for Flu Trends.

"Google sees a potential profit center from targeting its vast user base with advertising that is related to health issues," Chester said. The company's announcement of Flu Trends in fact shows to pharmaceutical and medical markets precisely the kind of sophisticated analysis the company can do with search data to enable highly targeted medical marketing, he said. "This is about taking the tracking data that Google has at its disposal and focusing it on generating a new profit center for the company," Chester said.

Pam Dixon, executive director of the World Privacy Forum, echoed similar concerns and questioned whether the anonymization techniques used by Google provided enough of a guarantee that a search term could not be traced back to specific individuals. She pointed to an incident two years ago where AOL inadvertently posted search information on a public Web site. The search queries had supposedly been anonymized by AOL, but it was still relatively easy to track specific search terms back to IP addresses and even individuals in many cases, Dixon said.

Mike Yang, senior product counsel at Google, downplayed privacy concerns related to Flu Trends and insisted that the tool uses no personally identifiable data.

"Flu Trends uses aggregated data from hundreds of millions of searches over time," Yang said today in an e-mail. "Flu Trends uses aggregations of search query data which contain no information that can identify users personally. We also never reveal how many users are searching for particular queries."

Yang noted that the data used in Flu Trends comes from Google's standard search logs. He also referenced an article in the journal Nature, authored by the Google Flu Trends team, which he said explains the methodology behind the tool.

Amazon Public Data Sets on AWS

Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.

Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.
Click here for further information.

AMDS-RODS service development

I added a draft set of data structures for AMDS requests to the wiki. This led to further defining the initial set of operations for an AMDS-compliant web service.

These operations include:

Based on these two pages, I created a sourceforge tracker item that Felicia is working on.

We're working on the RODS version first for a few reasons: primarily, Peter and Felicia have familiarity with the RODS database structure, and we don't have data structures for any other biosurveillance databases. We'd like to start working on ESSENCE and BioSense data structures soon, to create AMDS-ESSENCE and AMDS-BioSense. John is working with partners to plan out which services we build next.

Tuesday, December 9, 2008

Thank goodness for other coders

So, I have this nasty tendency to get into ruts... to formulate something in my mind and consider it the only really good way to do something just because it's already formed.

Luckily, I also have this tendency to occasionally ask other people which parts seem like good ideas, and which parts would annoy the snot out of them if they ever came across my code or had to use it.

After getting a prototype of the polygon generator and drawer setup working, I asked Felicia to help me with a bit of a 'does the way I'm doing this seem painful to you' type review, and in discussion both she and Brian helped me piece everything together... including a few extra concepts:

1. Forcing people to write a class every time, rather than letting them just set a few fields to the values they want, really sucks; it's a lazy way of implementing something flexible. One can always add an interface to keep things flexible: if someone wants to do something so radical and complicated that they have to write their own class, the interface makes that easy, but there should still be a default with an easy way to change the most obvious options.

2. JSPs shouldn't have much Java code in them. At all. They should instantiate the class up top, and then call the right method in the spot where the variable comes in. JSPs should be more about the layout and less about the code. The code should go *ding* up in that class you instantiated at the top.
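Point 1 above, an interface for the radical cases plus a default with settable fields for everyone else, might look something like this (the names and fields here are illustrative, not the actual gmap-polygon API):

```java
// The flexible seam: anyone with radical needs implements this directly.
interface ShadeChooser {
    String shadeFor(int count);
}

// The easy path: a default implementation whose behavior is changed by
// setting a few fields, not by writing a whole new class.
public class DefaultShadeChooser implements ShadeChooser {
    private int threshold = 20;
    private String lowColor = "white";
    private String highColor = "red";

    public void setThreshold(int threshold) { this.threshold = threshold; }
    public void setLowColor(String color) { this.lowColor = color; }
    public void setHighColor(String color) { this.highColor = color; }

    @Override
    public String shadeFor(int count) {
        return count > threshold ? highColor : lowColor;
    }
}
```

The code that draws polygons only ever sees a ShadeChooser, so swapping in a custom implementation later costs nothing.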

Thus, for tomorrow, I am hoping to complete the color handler (with easily settable shading colors), write the popup handler (with easily managed string-placement options), and then write an example JSP that shows how to invoke the class with a minimum of Java in a simple init function.

I'm sure this isn't the first iteration, and a lot of stuff might change later.

But, I think it will be better than it was.

Saturday, December 6, 2008

OGSA-DAI 3.1 Released

Dec. 5 -- The OGSA-DAI project, a partner in OMII-UK, has released version 3.1 of its database access and integration software.

OGSA-DAI is an extensible framework for the access and management of distributed heterogeneous data resources -- whether these be databases, files or other types of data -- via Web services. OGSA-DAI provides a workflow engine for the execution of workflows implementing data access, update, transformation, federation and delivery scenarios.

The main features of OGSA-DAI 3.1 include:

  • A number of new OGSA-DAI activities for:
    • Advanced SQL joins and merging of relational data.
    • Dynamic OGSA-DAI data resource creation.
    • Running queries on remote OGSA-DAI servers.
    • Interacting with XML databases, including adding, listing, removing and getting XML documents, creating, listing and removing collections and running XPath, XQuery and XUpdate statements.
    • Splitting and combining lists (contributed by the ADMIRE project).
    • Retrieving physical schema for relational databases (contributed by the NextGrid project).
  • A document-based workflow client.
  • A data source servlet and data source client.
  • Prototype support for pluggable workflow transformation components.
  • Prototype support for configurable inter-activity pipes.
  • Resources can now be marked as "private", meaning they are hidden from clients and can only be used within sub-workflows.
  • An example workflow monitoring plug-in which records events which can be browsed via a JSP page.
  • Support for MDS registration in Globus-compliant versions.
  • A number of bugs have been fixed, and components have been made more efficient, usable and robust.
  • The user documentation has been extensively refactored and extended.

OGSA-DAI 3.1 is designed to be backwards compatible with OGSA-DAI 3.0 without the need for recompilation -- data resource, activity and presentation layer APIs and service WSDLs remain the same.

OGSA-DAI is a free, open source, 100% Java product and is now released under the Apache 2.0 licence. Downloads compatible with Apache Axis 1.2.1, Apache Axis 1.4, Globus 4.0.5, Globus 4.0.8, OMII 3.4.2 and, now, Globus 4.2, are available.

Friday, December 5, 2008

Track Flu Trends on Google Phone

Owners of a T-Mobile G1, also known as the "Google phone," can now download a program that tracks flu outbreaks by zip code. The makers of the flu remedy Zicam created the program and got their information from polling health care providers and pharmacies. A version for the iPhone is expected to be available later this month.
Click here to listen to the NPR article

Thursday, December 4, 2008

Design Patterns, I choose you.

I am still migrating and refactoring the code I already wrote for poicondai into the new Gmaps-Polygon code. I know, I know, I wanted Daigon too, but really, there isn't any DAI to this piece, just maps and polygons.

Which brings us to our next piece: trying to make sure that gmap-polygon is both elegant and relatively simple. Poicondai had a lot of duplicated code (not ideal), a lot of Java inside the JSP pages (also not ideal), and a lot of commented-out code and stuff that isn't necessarily called (once again, not ideal).

Thus, I am trying to make things pretty (relatively speaking) with lots of interfaces, so that if the way something is handled needs to be made radically different, you don't have to modify the original working code; you can just supply a different implementation. I am also catching myself making classes that are nearly identical and going "waitaminute, this can be done with one class and a smaller class that handles the differences". The end result is that I am moving from separate state, county and zip polygon classes (with lots of duplicated code) to one "region polygon" that takes different handlers. If you need a polygon for a zip code, you create a region polygon with a zip-code handler; that way the polygon code isn't duplicated, and it becomes pretty clear how to put everything together (to get a zip3 polygon, just use a zip3 handler).
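The region-polygon-plus-handler idea can be sketched in a few lines; these names are illustrative of the pattern, not the actual gmap-polygon classes:

```java
// The small class that handles the differences between region types.
interface RegionHandler {
    String lookupEncodedBoundary(String regionCode);  // e.g. fetch a zip3 outline
}

// The one shared class: all the common polygon logic lives here once.
public class RegionPolygon {
    private final RegionHandler handler;

    public RegionPolygon(RegionHandler handler) {
        this.handler = handler;
    }

    /** Delegates the region-specific lookup; everything else stays shared. */
    public String boundaryFor(String regionCode) {
        return handler.lookupEncodedBoundary(regionCode);
    }
}
```

So a zip3 map, a county map and a state map all reuse RegionPolygon; only the handler changes.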

I'm kind of excited about the code this will generate. I think what I turn out will be easily adaptable to a lot of the "I want a colored polygon for the Google Maps JavaScript control" applications we (and maybe even other people/groups) will be creating.

Tuesday, December 2, 2008

That popping is my paradigms shifting without a clutch.

So, for the past couple of days I have been cleaning up the rodsadai projects in anticipation of giving rodsadai polygons.

Then, we all got together and decided that the next step should actually be a generalized set of code that can return polygons given some sort of collection of spatial and time-series data.

This led to more discussion about how such a suite of code would behave, what pieces should go where, and a lot of "oh dear, I don't know how Google Maps would be able to tell the page outside of Google Maps what Google Maps is doing." The standard batch of logistics, heuristics, and metaphysics that goes into any good refactor.

Thus, all my paradigms have shifted... and I am very appreciative that Felicia will be here to help me with some of the AJAX/JavaScript and Struts/Spring stuff that I am not too familiar with.

Thus, look forward to lots of little posts like "I got the page to reload without reloading when you zoom to a certain level in google maps!" and other such stuff.

I think I'll call this latest suite of code daigon.

Another Demo

There's another demo scheduled on Wednesday, Dec 17th for Dr. Lenert to show the NCPHI Grid Lab to some of the COTPER leadership.