Thursday, April 30, 2009

Updated AMDS Schema + web app

I updated the AMDS schema with a new version in preparation for the release of AMDS-BioSense's Alpha services tomorrow.

I'm sure Vaughn will tell you about the details. Basically, the service tool we're using didn't like referenced types, so I remade the schema to use inline-defined elements. I know this is bad form, but it's what we had to do to get the tool to work. If anyone can generate a WSDL file with Introduce using the March 30 schema, we'll be very grateful. Otherwise, the example XML didn't change at all, so documents valid under the March 30 schema are still valid under the April 30 schema.

We also changed all of our xs:date and xs:dateTime elements to xs:string, again to make things easier for the AMDS-BioSense Alpha services; Axis 1.4 didn't handle date objects very well.

Finally, I added an optional "AllCount" sibling element to "Count" so that services that want to explicitly return the denominator can do so. This was added at the suggestion of the Harvard ESP:SS folks, as their data provider doesn't actually return counts for every single condition type, so a query for Classifier=ALL would not actually return the denominator.
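For illustration, here is a rough sketch of what the Count/AllCount pair and the string-typed dates might look like on the Java side once bound with JAXB (the class and field names here are made up for this post, not the actual generated types):

import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// Hypothetical sketch of a bound AMDS count element; names are illustrative only.
@XmlRootElement(name = "AggregateCount")
public class AggregateCount {

    // Dates now travel as plain strings (formerly xs:date/xs:dateTime),
    // so clients are responsible for parsing them.
    @XmlElement(name = "Date")
    public String date;

    // Numerator: the count for the requested classifier/condition.
    @XmlElement(name = "Count")
    public long count;

    // Optional denominator, for providers that cannot return counts for
    // every condition type under Classifier=ALL.
    @XmlElement(name = "AllCount")
    public Long allCount;
}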

I also created a sourceforge project for the schemas so 1) we can do some more detailed version control than the wiki provides and 2) others can host the schemas if they so desire.

The guts are there, the data isn't.

I spent a good portion of today wiring up all of the behavior for taking AMDS service data and populating it into polygons, but I ran into issues with the client so I can't get data, and I didn't have enough time to write a data-faking routine. So there isn't going to be anything for me to deploy tomorrow.

I am guessing tomorrow I will get the client working and then spend a lot of time debugging the behavior; next week I will spend a lot of time actually flattening and splitting the Google map and the server selector.

The flattening will allow for much better multi-selection of regions and conditions... I just hope the usability isn't too impacted. But I figure it will be rather obvious: the first pane is for selecting the servers you want to pull from... the second pane is for navigating the data with the given charts/maps and better highlighting the data from the chosen servers.

Wednesday, April 29, 2009

cagrid installer forum

The cagrid team has a thread on their knowledge center on potential scenarios for their installer. This may be of interest to PHGrid as it presents the potential for us to save a lot of the effort we currently spend on building VMs and supporting installs. We've been asked by cagrid to take a look at the scenarios and help them with use cases.

Tuesday, April 28, 2009

Metadata Madness Continued

Today, I got the code for the metadata capture executing through tests. I also re-learned a few things.

- Quick little stubs for imitating real services are neither quick nor little, but are always necessary if data is to be provided and you want to test your structures and plans.

- Use the types you're given. Creating new types because they seem a bit clearer in your mind means you have to convert them.

- Draw things out. While what you code will rarely look like what you ended up drawing... it gets that tossable first attempt out of the way without having to do a lot of SVN adjustment.

Tomorrow, I try and show the metadata in the grid-viewer pane, and then I try and replicate data.

Monday, April 27, 2009

Metadata madness.

Today, I took the first step in getting to Rodsadai-with-grid-viewer settings, setting up how all the metadata would be stored, and discussing it with other people.

For now, we have settled on storing the server names and URLs in a database, and then populating the metadata from the service(s) dynamically. Thus, most of today has been spent filling out the infrastructure for a one-time database load in gridviewer.
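As a rough sketch of that split (the names below are made up for illustration, not the actual gridviewer classes), the server list comes from a one-time database load, while each server's metadata is fetched from the service at runtime:

import java.util.List;

// Illustrative sketch only; these are not the actual gridviewer types.
public interface ServerRegistry {

    // One-time load of the registered servers (name + service URL) from the database.
    List<ServerEntry> loadServers();

    // Metadata (supported regions, conditions, date ranges) is pulled
    // dynamically from each service rather than stored in the database.
    ServiceMetadata fetchMetadata(ServerEntry server);
}

class ServerEntry {
    String name;
    String serviceUrl;
}

class ServiceMetadata {
    List<String> regions;
    List<String> conditions;
    String earliestDate;   // dates travel as strings in the AMDS schema
    String latestDate;
}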

Brian and I also discussed how the app would behave. Instead of the Quicksilver-esque selection and then map... there will be an intro page that will probably need the real estate formerly taken by the map to show a server list. The given list will change depending on what date ranges, regions, and conditions are picked (based on whether or not the servers support said items), and then, after the ranges and servers are chosen, the map will be loaded and viewed. Another nifty idea discussed was to have metadata popups over the available servers, so people would get an idea of which limitations each server had.

Otherwise, the big next step will be integrating with the client app Vaughn is developing. It will be needed to pull the metadata for given services and set up the loading pane.

Update to Service Registry structure

A new category has been added to the Service Registry: Type
http://sites.google.com/site/phgrid/Home/service-registry

Grid services can now be differentiated from the Clients / Applications which can leverage those services.

Saturday, April 25, 2009

Supercomputing gets its own Superhero

"DEISA (Distributed European Infrastructure for Supercomputing Applications), an EU-funded project that has linked a dozen of the world’s fastest supercomputers into one smoothly functioning transcontinental grid. "
.....
"Since DEISA started, just four and a half years ago, the aggregated peak computing power it can offer has multiplied by a factor of 300, from 30 teraflops (30 thousand billion floating point operations per second) to over a petaflop (a million billion operations per second)."
http://www.physorg.com/news158249258.html

Friday, April 24, 2009

VM Appliance with Mac Fusion

A second person has asked about running the VMWare appliance (v0.2 or higher) using VMWare's Mac player, Fusion. Fusion uses slightly different file formats. This at first seems like a deal breaker, but there is a very handy utility described on infrageeks.

This script converts your OVF file to the format that Fusion expects.

gridviewer is up

Greetings all,

So, this morning I posted gridviewer to the training node. Right now it is an empty shell being prepared for data and metadata, so it behaves a lot like the gmap-poly-web application, in that there is no data so the polygons will always show zero counts.

You can see it at http://ncphi.phgrid.net:8080/gridviewer/gmap-pane.jsp

The next steps are to set up a framework for handling multiple servers and metadata, reflect that capability in the framework by showing a changing server-list depending on which regions/time-periods were selected, and prepare for integrating Vaughn's AMDS client so that data can be fetched.

I am hoping to get to a "rodsadai" like state (click server A, get some data, click server B, get other data, click both, get combined data) for next Friday's release.

Cheers, and have a good weekend!

Final Publication of the 2009 Detailed Clinical Research Use Case

Value/use case from the Electronic Health Record Clinical Research Value Case Workgroup.  The first priority addressed by this group was the ability to exchange a set of core research data elements from an electronic health record to clinical research systems. 

The documents posted include a high-level value case, a value case extension document and a detailed use case.   

Very (very) interesting set of links


SAHANA - Home of the Free and Open Source Disaster Management System
http://www.sahana.lk

Python implementation of openEHR specifications
http://sourceforge.net/projects/oship

Open Source Software for Public Health
http://www.ibiblio.org/pjones/wiki/index.php/Open_Source_Software_for_Public_Health

Pentaho- Open Source Business Intelligence
www.pentaho.com

Models of Infectious Disease Agent Study
https://www.epimodels.org/midas/midasAvailableResources.do

Structuring an event ontology for disease outbreak detection
http://www.biomedcentral.com/content/pdf/1471-2105-9-S3-S8.pdf

BioCaster:  detecting public health rumors with a Web-based text mining system
http://bioinformatics.oxfordjournals.org/cgi/reprint/24/24/2940

Biocaster - Global Health Monitor
http://biocaster.nil.ac.jp

Global Health Monitor - A web-based system for detecting and mapping infectious diseases
http://www.aclweb.org/anthology-new/I/I08/I08-2140.pdf

A multilingual ontology for infectious disease surveillance:  rationale, design and challenges
http://research.nii.ac.jp/~collier/papers/LRE%202007.pdf

Mesh4x
http://code.google.com/p/mesh4x/

Global Disaster Alert and Coordination System
http://www.gdacs.org

…phew!


Thursday, April 23, 2009

GridViewer is working as a package

So I got gridviewer packaged up and most of the code I wanted transferred into it, and I am happy to say that code and web-code can happily coexist in a maven project and behave as you would expect.

Next steps are to pull in AMDS data and start modifying the gmap-pane to handle different sets of metadata and different services.

Tuesday, April 21, 2009

Make that 'gridviewer' instead of 'amdsgmap-web'

Three productive discussions were had today.

The first was with Brian, who pointed out that I shouldn't have separate packages for amdsgmap and amdsgmap-web, as it gets annoying with multiple properties files and the like, and that amdsgmap would never be used independently of amdsgmap-web. He also suggested naming it "gridviewer" as it would be easier for people to remember when typing it into the 'go' box of Internet Explorer or Firefox or browser-of-choice. Chalk this up to lessons learned from the beautiful but hard-to-remember 'npdsgmaps-web'. I spent some of today moving packages into a web project and checking to make sure maven would understand what I was trying to do. I think it did; I will check more tomorrow.

The second conversation was with Vaughn, and we discussed the AMDS client and how aware it would be of what servers it would need to call (the end result being 'not very aware at all'). It will have basic hooks for setting the URL to call, the level of security to use (for now it will be 'none' and 'globus', and hopefully things like 'siteminder' and 'cabig' and 'coolfuturenetwork' to help extend the trust mesh of multiple CAs Brian was talking about in an earlier post), and a plug for querying the AMDS service at a given URL/bunker.
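A minimal sketch of what those hooks might look like (hypothetical names; Vaughn's actual client API may well differ):

// Hypothetical sketch of the AMDS client hooks described above; not Vaughn's actual API.
public class AmdsClient {

    // 'siteminder', 'cabig', etc. could be added later as the trust mesh grows.
    public enum SecurityMode { NONE, GLOBUS }

    private String serviceUrl;
    private SecurityMode securityMode = SecurityMode.NONE;

    // The client is not aware of which servers exist; the caller supplies the URL.
    public void setServiceUrl(String serviceUrl) {
        this.serviceUrl = serviceUrl;
    }

    public void setSecurityMode(SecurityMode securityMode) {
        this.securityMode = securityMode;
    }

    // Plug for querying the AMDS service at the configured URL; a real
    // implementation would pick a plain or Globus-secured transport here.
    public AmdsResponse query(AmdsQuery query) {
        throw new UnsupportedOperationException("sketch only");
    }
}

class AmdsQuery { /* classifier, region, date range, etc. */ }
class AmdsResponse { /* aggregate counts */ }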

The third conversation was in my head, where I was drawing out all of the bits between the client and the gridviewer and plotting out data and function flows. Tomorrow, I hope to code a bunch of them in and commit them with a few tests.

Monday, April 20, 2009

User Credential Security

Peter, Vaughn and I were talking the other day about how security configuration for authentication will differ between the current release of Grid Publisher and the future roll out plan. The current release of Grid Publisher uses the existing CDC credential/authentication infrastructure of SDN for issuing user certificates and authenticating users and VeriSign for creating host certificates for nodes. We're doing this because it's simpler initially. But to scale this would be expensive and would also put a large amount of control solely in CDC's hands.

So I created some models on the wiki to try to represent where we want to be with PHGrid in the future. Basically, I see PHGrid supporting multiple CAs that users and nodes can use for certificates. This means that there isn't a single CA for PHGrid but a trust mesh of CAs who can identify participants on the grid.

Of course, nodes still assign rights and privileges to subjects identified by the certificates, but at least we don't get into an expensive component that could limit grid sustainability.

Please let me know your thoughts about how in/valid you think this will be.

AMDS-UI

So, AMDS-UI is going to be broken into three parts, and the structure will be very similar to Quicksilver but will be replicated because the two apps will be doing two very different things.

Amdsmulticlient will be the main model generator. It will contact AMDS clients with queries (securely or non-securely, depending on the URL), and deal with marshalling data into objects which can be understood by...

Amdsgmap, which is the main controller. It will take the data objects and convert them into data better understood by gmap-polygons (and will be using the gmap-polygon jar). It will pack up the data into gmap-polygons through a suite of JSP backing code for...

Amdsgmap-UI, which is the main view. This will take data from amdsgmap and display it and handle the requests and responses to the actual viewer. It will also hopefully have some admin pages for adding and removing AMDS sites and configuring some of the properties.
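A rough sketch of how the three pieces might hand data off to each other (illustrative names only, not the actual project classes):

import java.util.List;

// Model objects produced by amdsmulticlient.
class AmdsDataPoint {
    String zipCode;
    long count;
}

// What gmap-polygon wants to render.
class EncodedPolygon {
    String encodedPoints;
    long count;
}

// Model: talks to the AMDS services and marshals results into data objects.
interface AmdsMultiClient {
    List<AmdsDataPoint> query(List<String> serviceUrls, String condition,
                              String startDate, String endDate);
}

// Controller: converts model data into something gmap-polygons understand.
interface AmdsGmap {
    List<EncodedPolygon> toPolygons(List<AmdsDataPoint> data);
}

// The view (amdsgmap-UI) would hand the polygons to the JSP/gmap layer for
// display and also host admin pages for adding/removing AMDS sites.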

The look should be a lot like quicksilver, but the options will have shifted, and some of the features wanted for quicksilver will instead be provided by amds-web.

I have created all these projects, and tomorrow I will extend the code and set up mockups for handling data. Soon after that will come integrating Vaughn's client, and then many cycles of testing and coding.

My hope is by the end of the week to have Quicksilver-esque functionality coming from AMDS test data.

Are application silos inevitable? Maybe MODS instead.

Good article from Burton Group Blog (http://apsblog.burtongroup.com/) about application silos

Service Philosophy

Last week I worked on merging the service and the core functionality of AMDS. The bulk of that effort was spent trying to get the GDTE tool to call the specific JaxBElements as specified by the AMDS XSD. At one point I thought I had this licked, but after some initial testing I realized that there were some additional issues that would take even longer to resolve. So for now, I shelved trying to get the service to work in that fashion and just created manual marshal and unmarshal routines that would get the job done. I guess for the programmer that likes to just look at an interface and begin extending through raw guts, intuition and instinct, it forces them to read the README file before moving forward. I can talk about those types of programmers because I am guilty of the same. :-)

The entire code check-in process took place over the weekend. I wanted to wrap my head around a directory structure for services going forward so that it can be easy for someone who wants to download the code and build it. So through my thinking I came up with the idea that a service should have the following directory structure:

+--ServiceProjectName
   +--Service
   +--Common
   +--Client
   +--lib
   --build.xml
   --README.txt
   --POM.xml
This structure follows my philosophy for grid service development:

A grid service should have 3 components:
  1. The core or common components - these components do all of the work for the service, from making database calls to structuring inputs and outputs, etc.
  2. Service Implementation - Since there are a couple of techniques for creating Globus services, the service interface should be implemented on top of the core components, making the service implementation very thin (a rough sketch of this separation follows after this list). I have based my service implementation on GDTE, which is a more generic service implementation than Introduce. However, if someone wanted to implement a service using Introduce, then the common components are preserved in such a way that this is very easily done. Additionally, configuration file locations and environment variables are accessible in locations as prescribed by the GTK4 development manuals.
  3. Client Implementation - the client implementation is based on the service implementation. On my TODO list is making this more generic and basing the client off of the WSDL; the GDTE tool generates a client stub, which I will use as a base and transform into something generic later.
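Here is a rough sketch of that thin-service idea (hypothetical names, not the actual AMDS code): the Globus-facing class only translates the wire format and delegates to the common component, so the same core can sit behind an Introduce-generated service, a GDTE-generated service, or a plain client.

// Hypothetical sketch of the thin service implementation; not the actual AMDS service code.

// Common/core component: does the real work (database calls, building responses).
class AmdsCore {
    AmdsResult runQuery(AmdsRequest request) {
        // ...database access, aggregation, etc...
        return new AmdsResult();
    }
}

class AmdsRequest { }
class AmdsResult { }

// Service implementation: only unmarshals the request, delegates to the core,
// and marshals the result, so swapping Introduce for GDTE leaves the core untouched.
public class AmdsGridService {
    private final AmdsCore core = new AmdsCore();

    public String handleRequest(String requestXml) {
        AmdsRequest request = unmarshal(requestXml);
        AmdsResult result = core.runQuery(request);
        return marshal(result);
    }

    private AmdsRequest unmarshal(String xml) { return new AmdsRequest(); } // placeholder
    private String marshal(AmdsResult result) { return "<result/>"; }       // placeholder
}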

Included in my service is a GUI for the service configuration. A snapshot is below:


The AMDS configuration is fairly complex, and since this service will be moving to production, I wanted to make it as easy as possible to administer from a UI as opposed to just having a configuration file. From the UI it is easily seen that its design is intended to manage a suite of AMDS services at one node. This taps into my ideas about the service ecosystem: the service should be easy to extend/develop, deploy/undeploy, and manage/configure, and should include a default usable client. The intended audience for this targeted functionality is not just developers but also the non-technical person using the service. These specific ideas follow the iPhone App Store model of application development, a topic to go into detail on later.

Lastly, here are my TODOs for this week:
  • Complete the generic configuration of the AMDS service.
  • Build the default client to be used with the AMDS Service
  • Begin to think about multi-node access for the client (The client should be able to aggregate multiple node endpoints running the AMDS service)
  • Upload the updated service and remaining artifacts to source control, with a base level instruction set.
More to come...

Saturday, April 18, 2009

Friday, April 17, 2009

PHIN 2009 Presentation Coordination

I set up a page on the wiki for coordinating PHIN Presentations. Please fill in details where you can or add on items based on your plans.

Thursday, April 16, 2009

more Quicksilver fixes, some progress on NPDSAMDS

This morning I dealt with some more issues raised in security scans of Quicksilver, as well as fixing an error they had revealed and improving the error handling.

This afternoon I went to the Grid kickoff meeting, and after that I focused on cleanup and implementing some more of the service implementation for NPDSAMDS.

Tracker cleanup

I worked on the sourceforge.net trackers to help track the changes being made for the current production releases (Grid Publisher and Quicksilver). I added the following groups to help track which product release (and code branch) the Bug or Feature Request belongs within:

  • AMDS-1.0-Initial release of AMDS services

  • QS-1.0-Initial release of Quicksilver

  • QS-1.x-bug fixes and additional FRs for next release of Quicksilver

  • GV-1.0-Initial release of the Grid Viewer

  • GP-1.0-Initial release of the Grid Publisher



These changes should make it easier for us to communicate what changes and fixes go into specific releases.

Kick-off Grid Collaboration Call / Webinar/ in person in Atlanta


Conf call schedule: Thursday April 16, 2009 1:00 – 2:00 pm EDT

Agenda
Purpose of the call
Resources available to community
Update on NCPHI grid activities
How can community members get involved
Open discussion on grid-related issues
Next steps

Call-in info:
Call-in #: Toll-free for US-based callers 1-888-390-3412
Toll-# for International callers 1-212-287-1825
Participant passcode: 7351587#

Webinar info:
Attendee:
Copy this address and paste it into your web browser:
https://www.livemeeting.com/cc/cdc/join
Meeting ID: 2PMK3Q

Wednesday, April 15, 2009

Updated Quicksilver

So, Quicksilver is being deployed to a version of CDC production, and there are scans being made, and a few new possible vulnerabilities were found, requiring some adjustments and some repackaging.

On the plus side, I am learning a lot more about how to secure a JSP website, and how to manage multiple builds at the same time.

Tomorrow, svn merging :D

Tuesday, April 14, 2009

Discussions and progress

Today we had a meeting about some of the ways that the AMDS grid services and AMDS viewers are going to interact in their planned environments. It helped clear up a lot of ambiguities about how they would connect to each other and what sorts of security requirements would be needed and what sorts of security should be expected at different points on the grid.

We also had discussions about ways to handle CAs now and in the future, and how to package things so that we make it easier for other developers to play with stuff should they want to.

I have been making some progress with the NPDSAMDS project, and forgot how handy it is to just write the interface and start pulling in types to wrap your head around all the bits in a service. I really do grok java better than XML Schema still.

Finally, I have been stopping what I was doing to help deal with minor issues in the production deploy of QuickSilver. Said deploy will probably loom for some time, but I can't complain because, hey, code going live is neat!

Sunday, April 12, 2009

Medical Informatics in the New York Times - Today

http://www.nytimes.com/2009/04/12/jobs/12starts.html?hpw

Connecting the Dots of Medicine and Data
By CHRISTINE LARSON

Friday, April 10, 2009

Creating a new maven project: npdsamds

I have created a new maven project to hold the service end of the npds-to-amds code, and it is appropriately named npdsamds. It will essentially be a gridified version of the npds service that should be served up via a globus container (Thus requiring security and all that jazz).

Maven solves many worlds of hurt with its dependency management, and the repositories where you can store jars so that others may use them. It also makes other tools pale in comparison and seem much more annoying.

One of them is SVN, which for some reason is "old" on my copy of ubuntu and cannot be easily upgraded using the standard ubuntu tools. Thus, whenever I try and add new projects to the SVN repository or move files, I always have to create new folders and re-check-out the code, because otherwise the version of subclipse I use with (you guessed it) eclipse touches the svn special secret hidden files and makes it so that lowly command-line subversion is blocked from adding the files. I also have to do this every time that subclipse botches something.

The other tool is ant. Don't get me wrong, maven /uses/ ant. But maven is to ant as a chocolate cake is to baker's chocolate. The maven combination, of which ant is a part, is much better than the bitter chunks of ant by itself. Ant offers all sorts of commands that rely on all sorts of duck-aligning, and basically anyone who writes an ant script also has to write a very long doc about what each part of the script can do and how to invoke it. With Maven, you can usually just tell someone "type 'mvn package'" and it will produce the thingy they need, and it is a LOT easier for the person who coded the stuff inside the maven package to set up the project that way.

Don't get me wrong, both tools are very powerful and useful in their own right. I just have a small personal bias in that maven hasn't made me curse its name yet.
But, we might write a plugin that allows maven to build globus services. So, I'm sure I will be cursing its name soon enough.

Simplifying the service creation process

This is week 2 of me formally joining the Grid team in the NCPHI Lab. I am now focusing the majority of my efforts on specific Grid tasks but still contribute some of my time to the PHINMS team as an advisor and architect. In the past 2 weeks of my efforts in the lab I have been trying to wrap my head around some of the challenges of the Grid regarding its implementation in the public health community. Additionally, I have closely examined the whole process of building and deploying a Globus service.

For the first week or so, I spent my time trying to get my development environment set up. This effort took more time than I expected due to the shift in paradigm from a pure production environment to a more R&D environment. None-the-less, I have prevailed and the result is that I am now up and running on 2 VMs. My personal preference is the Ubuntu VM. I also have a RedHat VM which our admin, Dan, swears by. Since I always consider admins a programmer's best friend, I have to quietly defer. The Ubuntu environment is what I use heavily at home, so there is an added layer of comfort here.

To get to the meat of this post, I have successfully built and deployed (in my environment) a Globus service! In week one I went through the GTK4 programmers manual, examining the manual process of creating a service. Since I am a purist when it comes to programming, I figure if I cannot get any of the tools to work I can always just do it manually. After reviewing the book I poked around the Globus system. I downloaded the ws-core and deployed it to a Tomcat container. In my home environment this took no time. I then played with some of the packaged services. From there I was told about Introduce. This is the Globus service development toolkit that Peter and Felicia were using. Getting it set up was a struggle. There are a lot of moving parts, but I did have some mild success. After going through the generated service package, there were a lot of items in it whose purpose I was not sure of. I chalked it up to not having enough knowledge about globus and caBig. Later I discussed some of my findings with Felicia. She is very experienced with the Globus container and Introduce, and she shared with me that there are a lot of caBig-wrapped Globus jars used in the services that are generated, as well as some redundancy regarding the jars used in the caBig default container. After further examination and discussions, I quickly arrived at the conclusion that deploying services using the Introduce tool may add a support burden that could approach critical mass as PHGrid gains wider adoption. Regardless, it did work, so at a minimum this was a milestone for me to have created a service. This was all in week 1.

Week 2
Immediately, I began to search for another solution that was simpler to understand, develop and manage. I then came across GDTE. The website was a little choppy but informative. A little background: I am an old school developer, from the days of C/C++ and Motif, so for the majority of my career I have used vi as my editing tool and have only recently, in the past 2-3 years, begun to attach to an IDE; again, I am a purist. :-) While I believe over time the IDE may erode my mind, it does make sense, as the number of libs and methods/functions is too numerous to remember. So it does add some efficiency. All of that to say, I used mostly Netbeans in the past and only used eclipse intermittently. GDTE uses eclipse, so I had to put on my eclipse hat. It took me about 15 minutes to find the correct distro but then took a few days to get it configured properly in my environment. In my defense, there were some inconsistencies in my VM environment. However, in the course of a couple of days, Dan Washington, who is part of the League of Superior Admins, came to my rescue and resolved my issues. So early Thursday I had a working eclipse environment. From there, I went to the GDTE site and followed their straightforward approach to developing a service. The end result is that their service generation is much cleaner and uses the pure Globus jars for its deployment. It does add 2 extra jars, but these are minor. Overall, the process was much simpler than Introduce and the output was much cleaner, so for my efforts I am rolling with GDTE. Here is their link:

http://mage.uni-marburg.de/trac/gdt/wiki/WikiStart

Here are some hurdles I have overcome via trial and error. Do not use the latest version of eclipse (3.4.x); GDTE will not work with this version. I downloaded easyeclipse 1.3.1.1. Here is their link:

http://www.easyeclipse.org/site/distributions/index.html

Download the version: EasyEclipse for Plug-ins and RCP Apps 1.3.1.1
This version will have all of the plug-ins you need. From there you can just follow the instructions from the GDTE folks. This group seems to be very active; I looked through their e-mail archives to get a feel. They are aware of the issues with 3.4 and are going to make fixes in their next distro.

Lastly, I only installed the Service Generator and certificate tool; I had issues with the ViGO (GDT Visual Grid Orchestrator, Version 1.0) tool, as there are some dependencies needed by eclipse. I got my service built, compiled and deployed successfully using this tool last night at 10:23pm. So, having been up since 4 am, I was a little tired in trying to figure it out. I will later.

Next, I plan to merge the back end code (I wrote in week 1) with my newly created service and then formally deploy the results to the NCPHI training node. Additionally, I will determine the "fitness" of the resulting GAR from GDTE. The fitness determination will let me know how the gar generation process should be augmented to increase the supportability in a PHGrid federation. I will post more about this later.

One last note: if you are familiar with annotations, then you will love the GDTE tool. Their technique for Globus service development seems to be similar to WSIT/METRO (Sun's WS development). More to come.
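For readers who haven't seen annotation-driven service development, here is a small plain JAX-WS (WSIT/METRO-style) example of the general idea. Note these are the standard javax.jws annotations, not GDTE's own; GDTE's annotations differ, but the spirit is the same:

import javax.jws.WebMethod;
import javax.jws.WebService;

// Standard JAX-WS annotations; the tooling generates the WSDL and binding
// code from them, which is the general flavor of annotation-driven development.
@WebService(name = "AmdsExample", serviceName = "AmdsExampleService")
public class AmdsExampleService {

    @WebMethod(operationName = "getCount")
    public long getCount(String zipCode, String condition, String date) {
        return 0L; // placeholder implementation
    }
}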

Thursday, April 9, 2009

De-duplication Insight from the Field

After Monday's post Dave Heinbaugh, the tech manager for the Tarrant County APC, was kind enough to spend some time with me on the phone walking through some of the scenarios for de-duplication.

He basically broke it down like this: based on the statistical analysis, it's really not worth trying to account for a single patient who gets served by multiple hospitals or clinics. Based on Dave's experience near the data, this does not happen.

So, de-duplication is an issue for NHIN, where they are trying to figure out if 5 different EMR/EHRs are the same person, but for AMDS it may not be as immediate a data quality issue as feared.

This doesn't get rid of the question of de-duplication, but shifts it a bit from trying to reduce specific instances of patients recorded multiple times to trying to make sure that a data source is not reported multiple times (for example Tarrant County is a BioSense data source so if you queried BioSense-AMDS and Tarrant-AMDS services for Respiratory conditions and combined them together you would get double counts).

Transitioning

Today I spent some time transitioning with Felicia, learning where the docs are for dealing with library conflicts, that sort of thing.

Otherwise, I contributed a bit to Brian and Vaughn's discussions about how grid services will deal with different security schemes and how they should be loaded, and I have a much better idea for how to design my NPDS-AMDS service and the new AMDS viewer client.

Most of today was spent helping the SDN team get Quicksilver ready for deployment. I helped debug a database issue, and wrote the "these are the properties files you need to change to deploy to a new server" guide.

Cheers!

GAARDS Whitepaper

A new draft whitepaper which provides a primer of the GAARDS toolset has been posted to the Wiki. It is located at http://sites.google.com/site/phgrid/Publications/GAARDingtheGrid.doc?attredirects=0. I would love to get your comments on this as we continue to refine the ideas and messaging behind the usage of these critical grid security components.

This is the first draft ready for public comment. Special thanks to those who provided pre-public comment feedback initially. I look forward to further feedback from all!

Wednesday, April 8, 2009

PHGrid Calendar

I've updated the PHGrid Calendar on the wiki. With the overall objective of demonstrating pilot AMDS services through the Grid by PHIN, there are a lot of items across several of the dates, targeting to be demo-ready by July 15. I've marked the key deliverables with the creative term "Milestone". You'll notice there are 9 Milestones between now and July 15. For each Milestone (and the Milestones only) I have placed a short description of the deliverable. This calendar is a living entity, and will be updated to reflect important collaboration and deliverable activities.

Services and Deploys.

So today I did manage to get a secure service and a secure client up and deployed in my current environment, and while it is a bit of a pain, it shouldn't be insurmountable, just tedious. I look forward to seeing if Vaughn found a better way to adjust the service parameters as AMDS is probably going to be very dynamic.

Tomorrow will be adjusting some of the properties by hand (namely the security parameters and such) and trying to make a little website off the client (like I did with rodsadai) just so I can see what happens when I deploy things alongside rodsadai and what libraries have to change (with Felicia's help).

I know this is going to be rather heuristic. But I didn't learn Globus that well before, and it'll help me sew together all the little bits about how all these services are going to bump into each other, and help me get a better perspective about the appliance network we are going to be making.

AMDS Store for BioSense data

So part of the BioSense program strategic plan is to share national data sources in a federated manner. In order to support this using grid and AMDS, we've started working with the BioSense Data Quality team (they run the analysis on the Data Mart) to generate some sample files that we can use to populate an AMDS data store with all the aggregate counts of BioSense syndromes by zip code and by day.

So far, we've got a sample file for all the zip codes covered by VA, DoD and Real-Time data. Of course all the counts are redacted as we don't deal with real data in the lab. But this will let me start working on an ETL process to get this into the AMDS extract database that Vaughn is developing against.

So far it's only about 5MB/day of data and that's before I normalize out a lot of the content (we won't actually store syndrome names in the tables, etc. etc.) so this is still a good sign.

Also, it looks like Tom has started calling AMDS the Population Data Object (PDO) sometimes, so we may be relabeling soon. This actually makes more sense as we're not talking about a data set so much as we are a specification for sharing population data (as opposed to patient-specific data, which is NHIN's currency).

HHS makes software available for health IT network

Federal Computer Week
Tuesday, April 7, 2009

HHS makes software available for health IT network

     * By Mary Mosquera

   The Health and Human Services Department (HHS) said it is making
   software available to connect health information technology systems to
   the nationwide health information network (NHIN). The software
   availability is a first step toward encouraging public and private
   organizations to link with the NHIN, which would provide for the
   electronic exchange of health data, HHS has said.

   The Federal Health Architecture program, an e-government effort led by
   HHS' Office of the National Coordinator for Health Information
   Technology, is making the free software, named Connect, and supporting
   documentation available at www.connectopensource.org.

   The national coordinator's office has established the foundation for
   development of the NHIN, which it says will tie together health
   information exchanges, integrated delivery networks, pharmacies,
   government health facilities, laboratories, doctors, hospitals and
   private payers into a "network of networks." The NHIN uses
   interoperability standards, public- and private-sector specifications,
   participation agreements and policies, Robert Kolodner, the national
   coordinator for health IT, said April 6.

   The NHIN would provide for up-to-date records available at the point of
   care, enhanced population health screening, and the ability to collect
   case research faster to facilitate disability claims, he said.

   The Defense and Veterans Affairs departments, the Social Security
   Administration (SSA), and HHS' Centers for Disease Control and
   Prevention, the Indian Health Service, and the National Cancer
   Institute have tested and demonstrated Connect's ability to share data
   among each other and with private-sector organizations, Kolodner said.
   In February, SSA used the Connect software gateway for the first time
   when the agency began receiving live patient data from MedVirginia, a
   regional health information exchange, through the NHIN. SSA has said it
   is using the NHIN to speed up delivery of information to support
   disability claims applications.

   The Connect software is the result of a 2008 decision by 26 federal
   agencies to connect their health IT systems to the NHIN, Kolodner said.
   Instead of individually building software required to make this
   possible, the agencies produced the Connect software through the
   Federal Health Architecture program. The software establishes the core
   services defined by the NHIN, including standards for security to
   protect health information when it is exchanged with other trusted
   health organizations, he said.

   "Federal agencies accomplished something remarkable in developing
   Connect," Kolodner said. "They looked beyond their individual needs to
   the needs of the group as a whole, and they collectively built a
   solution that provides benefit to all involved much faster and at a
   significantly reduced cost than if they had worked independently."

   HHS plans to make Connect, which is based on Sun Microsystems
   open-source technology, available under an open-source license to
   encourage innovation and to keep costs low, said Vish Sankaran, the
   Federal Health Architecture program director. The software will also be
   available to the health care industry in order to speed NHIN adoption
   among health care organizations.

   The federal government has developed a working prototype, which can be
   deployed across multiple agencies, said Bill Vass, Sun president and
   chief operating officer.

   "The open nature of the IT foundation is critical to ensuring that
   government can work with the private health care sector to
   revolutionize the nation's health care system," he said.

   About the Author

   Mary Mosquera is a reporter for Federal Computer Week.

Tuesday, April 7, 2009

Whether introduce helps...

So, I have Introduce on my development box, but whenever I try to run the tutorial, it blows up in the very first step with an odd error bubbled up from ant about some method not working properly.

I subscribed to the introduce user group mailing list, asked my question and posted my error, and the response was "You must have used Introduce in the past and are now trying to use a new version of introduce, since introduce modifies the Globus libraries when it runs, you will need a fresh copy of Globus to use a new version of introduce or you get library conflicts"

That is kind of a dealbreaker for me. I am trying to use Introduce for something other than its intended purpose: to build regular grid services and not caGrid-specific services... so it appears that some of the changes that Introduce makes for caGrid (like mutating a Globus installation) are more than I bargained for. Also, Introduce is supposed to be something that simplifies the tedious process of populating all the nooks and crannies of a given grid service, and little is more tedious than scrubbing a Globus install.

Vaughn is currently working with a different Eclipse based tool, so another thing I am batting my head against is trying to make something that he, too, will be making: An AMDS service shell. Vaughn is a bit further ahead than me and has already made the Java interface and converted all of the XSDs into Java objects, so if he gets a Grid service that matches that interface with security, I'll probably use that and implement the interface for Poison Data while he implements it for RODS data.... and then start making the client that looks like quicksilver but in fact pulls AMDS data rather than NPDS data.

The other thing that Vaughn has been talking about is maintaining that firefox/apple/google model of easy to install-ness with easily pluggable extensions. Thus, we want to make sure that our service is easily installable or even included on the default grid node we hand out... which makes me start wondering about whether we even want multiple AMDS services or just one AMDS service with multiple return options. (I.e., instead of having Poison-AMDS and Rods-AMDS as I am envisioning it, do we want to just have one AMDS that returns two sets of metadata, one for Poison and one for Rods, and then however many extra sets depending on how many other data services are networked behind the scenes? These questions might have been answered already and I just missed it.)

Either way, I am sort of a mass of ambivalences right now. Introduce probably won't work for me, so I am wondering: do I want to try and modify a service from scratch (using the interface that Vaughn already created), do I want to play with the system Vaughn is using (or watch over his shoulder and help), or do I want to start refactoring npdsgmaps and npdsgmaps-web to deal with AMDS-style metadata and queries? Or do I just want to focus on getting rodsadai and the old AMDS client running on the same jboss without tripping over each other, since there will probably be multiple clients and they'll need to behave?

It's not unpleasant, but it is a bit frantic. Oh well, ambivalence means I have lots of ideas, and the next step is figuring out which are best or most appealing.

Cheers.

Data.gov

Re-purposing a digg feed from earlier today, the Obama Administration announced that it will soon launch data.gov, where the government will make a broad array of US government data available in downloadable formats. Wired has an interesting wiki suggesting "how to" do so...

Monday, April 6, 2009

Introduction to Introduce

So, I read much more about Globus and got Introduce up and running; more importantly, I figured out what I want to try and do with it in the coming weeks: make a service that will comply with the latest AMDS spec and a client that will allow me to ping it.

The part of converting Poison Center data into AMDS Responses and AMDS Queries into Poison Center queries, while tedious, is something I can envision doing. I can also envision doing the skeleton from scratch, but that would be absolutely horrible to try and do if I don't have any really powerful tools like XML-spy... or Introduce.

The only problems with Introduce for what we need to do are that it is a bit geared towards caGrid, and it likes to package up a few special libraries that end up causing some interesting deploy behaviors using globus tools when it is NOT part of caGrid. It also doesn't produce particularly maven-able results from the get-go, so there is some footwork to be done playing with packaging and building and deploy tools.

There is also some work to be done with the client: making sure that its security libraries don't conflict with existing secure clients (like RODSAdai), or figuring out how to place libraries so they don't interfere.

Thus, this week will be devoted to getting the service skeleton set up for a secure globus service.

The other cool thing might be that once this is done, it can be saved off as a shell for any given skeleton that needs to wrap other services, since AMDS is supposed to be a repeatable standard.

The President Speaks on International Cooperation to Fight Pandemics

Click here to see and listen to a short video

De-duplication

I've been thinking about the "De-duplication" grid service a bit lately and want to get my thoughts out here in case someone else has been mulling it over as well.

De-duplication is related to the De-identification/Re-identification problem of Anonymized data across multiple data sources (interesting MIT paper here). But in a grid environment the problem becomes tougher because the traditional, really-hard-to-do-solution of checking the various MPIs is even tougher because data sources 1) don't frequently have MPIs and 2) don't want to share them if they do.

Here's the scenario I use to talk about the de-duplication problem set:
3 hospitals in Atlanta with overlapping catchment areas. A federated query is sent out for SyndromeX for April 1, 2009. Hospital A returns a count of 10, Hospital B returns a count of 15 and Hospital C returns a count of 20. The combined result from the federated query shows a total of 45 counts of SyndromeX for April 1, 2009.

There exists a question of data quality as it's possible that the same patient visited multiple hospitals and had his SyndromeX recorded at multiple locations. In a federated/grid configuration this is even more probable as Hospitals may use data from multiple feeder sources (labs, small clinics, etc) that may report the same patient encounter multiple times.

Dr. Lenert had a good idea where each Hospital could send over a list of PII (name, address, dob, social) without the medical information for purposes of de-duping. This requires a trusted intermediary to receive these matches and send them on to each Hospital to respond back with number of matches. A hospital cannot send directly to each other hospital because it would reveal its patients to potential competitors.

So the de-duping scenario may change to something like this:
1) User submits query for SyndromeX on April 1, 2009 and selects de-duplication.
2) Trusted intermediary receives the query and passes it on to Hospitals A,B,C.
3) Hospital A returns count of 10, Hospital B returns count of 15, Hospital C returns count of 20. The interim total count of 45 is not returned to the user.
4) Trusted intermediary then requests patient id info from each hospital in a separate request.
5) Trusted intermediary compares the PII from each hospital (perhaps using a tool like febrl) and determines that there is overlap of patients and actual count. Hospital A shares 1 patient with Hospital B. Hospital A shares 2 patients with Hospital C. Hospital B shares 3 patients with Hospital C. Therefore there are 6 unique duplicates.
6) Total count of 39 is returned to the user (45-1-2-3).
Of course, this is going to require some pretty serious data use agreements with the hospitals and the trusted intermediary to control the storage of PII (i.e. don't persist any PII)

The alternative without the trusted intermediary:
1) User submits query for SyndromeX on April 1, 2009 and selects de-duplication.
2) Each hospital receives the query and the list of nodes to which it was submitted.
3) Hospital A identifies a preliminary count of 10. To check for duplicates, it submits only the PII without any condition data and a query ID to Hospital B and C for comparison.
4) Hospital B compares the PII to its own results for query ID and returns the number of matches to Hospital A (duplicate count=1).
5) Hospital C compares the PII to its own results for query ID and returns the number of matches to Hospital A (duplicate count=2).
6) At the same time, Hospitals B&C are performing similar communications with the other nodes participating in the query.
7) Hospital B identifies a preliminary count of 15. To check for duplicates, it submits only the PII without any condition data and a query ID to Hospital A and C for comparison.
8) Hospital A compares the PII to its own results for query ID and returns the number of matches to Hospital B (duplicate count=1).
9) Hospital C compares the PII to its own results for query ID and returns the number of matches to Hospital B (duplicate count=3).
10) Hospital C identifies a preliminary count of 20. To check for duplicates, it submits only the PII without any condition data and a query ID to Hospital A and B for comparison.
11) Hospital A compares the PII to its own results for query ID and returns the number of matches to Hospital C (duplicate count=1).
12) Hospital B compares the PII to its own results for query ID and returns the number of matches to Hospital C (duplicate count=3).
13) Hospital A returns a count of 10 and notes that duplicates with B=1 and duplicates with C=2.
14) Hospital B returns a count of 15 and notes that duplicates with A=1 and duplicates with C=3.
15) Hospital C returns a count of 20 and notes that duplicates with A=2 and duplicates with B=3.
16) The user then feeds these results into another service that processes the results and determines that the actual count is 39 based on the following calculated relationships: (A|B)-1; (A|C)-2; (B|C)-3. These reductions are subtracted from the preliminary total of 45 to yield 39.

Approach #2, while not requiring a trusted intermediary, requires a lot of coordination among the queried nodes and also involves the sharing of patient info with other partners, which may not be desired. Also, due to the set math, the number of service calls will grow quadratically with the number of nodes involved in the query (for 3 nodes, the number of service calls is 9 [3 from the user to the nodes, then 6 among the nodes, using n!/(n-2)! = n(n-1)]; for 4 nodes, the number of service calls is 16; and for 5 nodes the number of service calls is 25). This is also just some back-of-the-napkin math, so it is worth double-checking.
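To make the arithmetic concrete, here is a small sketch of the final aggregation step and the call-count growth, using the example numbers above (pure illustration, not part of any service):

public class DedupExample {
    public static void main(String[] args) {
        // Preliminary counts returned by each hospital in the example.
        int hospitalA = 10, hospitalB = 15, hospitalC = 20;

        // Pairwise duplicates discovered by the PII comparison: A|B=1, A|C=2, B|C=3.
        int dupAB = 1, dupAC = 2, dupBC = 3;

        int preliminaryTotal = hospitalA + hospitalB + hospitalC;     // 45
        int dedupedTotal = preliminaryTotal - dupAB - dupAC - dupBC;  // 39
        System.out.println("De-duplicated total: " + dedupedTotal);

        // Service calls for approach #2 grow quadratically:
        // n calls from the user plus n*(n-1) calls among the nodes = n^2.
        for (int n = 3; n <= 5; n++) {
            System.out.println(n + " nodes -> " + (n + n * (n - 1)) + " service calls");
        }
    }
}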

Anyway, I wanted to get these two approaches to de-duping out there and will update with more as I keep thinking.

Friday, April 3, 2009

Updated programming playground rules

I updated the initial blog posting from June 5, 2008 based on additional developments and moved it over to the wiki for easier updating. This will be the place where the NCPHI Lab's (and eventually the PHGrid's) programming conventions and policies are captured:


  • Use Sourceforge Subversion site to store changes - All changes are checked into the Subversion site on the sourceforge project. Only security related changes (passwords, etc) are not checked in, and these are factored out into a property or configuration file.

  • Everything builds - Maven is used to build (compile and package) and deploy changes. This will allow for changes to be made on different desktops without spending time trying to manually configure a new environment. This also means that if the build breaks then the developer who broke it needs to fix it as soon as possible.

  • Follow the Sun Code Conventions for Java. These are old, but still applicable for what we're trying to do. We picked this sort of as a default and this can change based on feedback and practice, but we want some uniform style to the project.

  • Document code- document code to improve clarity. Use javadocs where necessary. This is not documentation for documentation's sake, but enough so that someone coming across the code will be able to follow. This doesn't absolve a programmer from clean, concise coding but should improve on the clarity of a source file.

  • Write a Use Case first - the first step to development should be to make a post on this blog describing the use case. Nothing too formal, just a description of the steps, who will perform them, alternate flows and error handling. This will provide a way to capture the requirements before any code is written.

  • Next write a Test Case before coding- after the use case is posted and there's some sort of agreement, then a test case is written while the coding is performed. This speeds up development by institutionalizing testing in a standard manner.

  • Each project will have a separate folder in the sourceforge root with a Maven pom.xml file for the build (including dependencies).

  • All properties are stored in a configuration file to allow for projects to be easily built for various target environments. Actual values are not checked into sourceforge as these property values are likely sensitive (server names, passwords, etc.). For example, a project named foo would build based on the properties stored in foo.properties. The file checked into SVN would be named foo.properties.template and contain descriptions for each of the properties, but not values (see the sample template after this list). When a developer is building, they will take the properties file as configured for an environment, rename it to foo.properties and run their build. It is likely that you will have foo-dev.properties, foo-training.properties, foo-staging.properties, foo-production.properties (none checked into SVN) and choose a properties file based on where your deployment is targeted.

  • All changes are tracked using the SourceForge tracker for Bugs and Feature Requests. This is done to provide transparency into all our changes and the prioritization of changes.

  • Every Friday, the current project state is built to the training node (ncphi.phgrid.net) so that the community can see progress and comment. Eventually, our CM processes will be run nightly and deployed more frequently than once a week.

  • Each evening, before leaving each developer ensures that their changes adhere to these rules and checks them into SVN.
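As a sample of the properties convention described above, a hypothetical foo.properties.template might look like this (keys invented purely for illustration):

# foo.properties.template -- checked into SVN; descriptions only, no values.
# Copy to foo.properties (or foo-dev.properties, foo-training.properties, ...)
# and fill in the values for your target environment before building.

# Hostname of the target application server
server.host=

# Database connection URL for the foo datastore
db.url=

# Database credentials (never check real values into SVN)
db.user=
db.password=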

learning grid services.

So right now I am essentially reading the grid book, learning how to make grid services, and learning how to secure grid services. Then, I will be playing with Introduce to see how it makes all this stuff much easier and much more automatic with the help of Felicia.

I also hope to pick Felicia's brain about implementing security using the client.

Either way, Globus services are a lot more than I thought they were; they have a lot more involved for maintaining state and essentially having all of their interactions marshaled and grouped, and it is just crying out for a shiny (if not leaky) abstraction to make it all pretty and less scary to people who aren't programmers used to fighting with webservices and some of the gooey "I was generated by a machine" code that can come out of them. I look forward to those abstractions, but until then, I am going to be learning a lot of transport layer fun to get a secure/amds-based quicksilver running.

Cheers

Thursday, April 2, 2009

Minor deploy, and Introduce

So, the next step for Quicksilver is to make an AMDS-intermediary layer that is a secure Globus service. Thus, tonight I am going to be going over some of the GRID/Globus books again and tomorrow I will probably install introduce and run the tutorial.

The reason I want to read the grid books is so I can see what goes into creating a grid service so that I can see what the introduce client will do for me. Then, I can start better understanding the heuristics of making an intermediary layer for poison data showing up in AMDS web.

Oh yeah, I have to go over the AMDS schema as well.

The end result is that the viewer part of Quicksilver will end up becoming more of a generic AMDS viewer. Thus, one picks (a) secure service(s) that return(s) AMDS data, fills out the form that it/they require(s) (based on retrieved metadata for selected services), compiles the results into some sort of mappable data, and displays it in the map... and one of those secure AMDS services will be poison data.

One of the concerns I had was that making the NPDS->AMDS converter secure would be completely heuristic... then I remembered that people have to be given logins to be able to access poison data, and it would be much easier if they were just able to access it by having a node on the grid.

So yeah, big context switch over the next couple of days, expect flagellating expectations as I realize the can, can't, and how.

Also, I ported a bunch of the improvements I made to npdsgmaps-web (the viewer part of quicksilver) back to gmap-poly-web (a demonstration library made on top of Gmap-Polygon). That should be deployed to training tomorrow so the two apps act the same (and because the version currently on training doesn't use encoded polygons, and is therefore much slower and error prone).

Cheers!

Potential PHGrid-related projects for PHI/CS Fellows

Hi all,

Please take a look…and if you think of other ideas….reply back with comments or email the team.

http://sites.google.com/site/phgrid/Home/resource-documents/PHGridtasksforfellows.pdf?attredirects=0


Wednesday, April 1, 2009

Quicksilver Code Complete

The deploy of Quicksilver to training is complete. I have verified that the other applications are still working, and tested quicksilver on IE, Firefox, and Safari.

It can be reached here

The passwords have been changed to be more complex. Please contact Brian Lee at fya1@cdc.gov for your new password or login if you haven't gotten it.

Quicksilver now uses encoded gmap polygons. This reduces the size of the page sent back to the browser and greatly improves the performance. Internet Explorer no longer throws the "this script is taking a very long time" errors, and the polygons get more complex as you zoom in and less complex as you zoom out. They were encoded using the Douglas-Peucker algorithm described on this site.
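For the curious, here is a minimal sketch of the Douglas-Peucker simplification idea (illustration only; this is not the actual Quicksilver/gmap-polygon code, and the encoding itself follows the encoded-polyline approach described on the linked site):

import java.util.ArrayList;
import java.util.List;

// Minimal Douglas-Peucker line simplification sketch (illustration only).
public class DouglasPeucker {

    static class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // Recursively drop points whose distance from the line between the endpoints
    // is below the tolerance; keep the point of maximum deviation.
    static List<Point> simplify(List<Point> points, double tolerance) {
        if (points.size() < 3) {
            return new ArrayList<Point>(points);
        }
        int indexOfMax = 0;
        double maxDistance = 0.0;
        Point start = points.get(0), end = points.get(points.size() - 1);
        for (int i = 1; i < points.size() - 1; i++) {
            double d = perpendicularDistance(points.get(i), start, end);
            if (d > maxDistance) {
                maxDistance = d;
                indexOfMax = i;
            }
        }
        List<Point> result = new ArrayList<Point>();
        if (maxDistance > tolerance) {
            // Keep the farthest point and simplify both halves.
            List<Point> left = simplify(points.subList(0, indexOfMax + 1), tolerance);
            List<Point> right = simplify(points.subList(indexOfMax, points.size()), tolerance);
            result.addAll(left.subList(0, left.size() - 1)); // avoid duplicating the split point
            result.addAll(right);
        } else {
            result.add(start);
            result.add(end);
        }
        return result;
    }

    static double perpendicularDistance(Point p, Point a, Point b) {
        double dx = b.x - a.x, dy = b.y - a.y;
        double length = Math.hypot(dx, dy);
        if (length == 0) {
            return Math.hypot(p.x - a.x, p.y - a.y);
        }
        // Twice the area of triangle (a, b, p), divided by the base length.
        return Math.abs(dx * (a.y - p.y) - (a.x - p.x) * dy) / length;
    }
}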

The other improvements are tinkerings with the submit button (thus, it becomes disabled and says "loading..." when clicked), positioning improvements, and a link to the Wiki help page.

This should be the Code Complete version of Quicksilver that will be deployed to the CDC SDN, and it has been labeled as version 1.0 and branched. New Features and behavior changes will be done to the 1.x branch and deployed to training, while bug fixes will be done to both the 1.x branch and the 1.0 branch and deployed to training and (if allowed) CDC SDN.

Thanks for your patience. I am very excited about Quicksilver and I think it has grown by leaps and bounds in functionality and niftiness from its poicondai roots.

Otherwise, it looks like the next steps for me are to turn the service end of Quicksilver into an AMDS secure service, and to modify the visual end of Quicksilver (Gmap-polygon) to be flexible enough to combine and display data from multiple AMDS sources.

More immediately, I am going to merge some of the improvements I made for Quicksilver into the gmap-poly-web example project, and work with Felicia to learn the rudiments of a secure AMDS service and to work on making secure service clients coexist happily on the same server.

Once-secret 'cloud manifesto' sees light of day

http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9130706&source=NLT_PM

KMWorld Article: Tackling extreme data volume

Tackling extreme data volume

By Hugh McKellar - Posted Apr 1, 2009

We all know of the imperative of managing the extreme—and sometimes debilitating—volume of data in today’s organizations. In early March, Digital Reef emerged from two years of stealth mode to tackle that problem by introducing what it claims is a new approach to discovering and managing unstructured and semi-structured data.

The company says its massively scalable platform is designed to address critical business issues—such as e-discovery, data risk mitigation, knowledge reuse and strategic storage initiatives—that aren’t currently properly addressed using traditional solutions. Digital Reef says that as the volume of data expands unchecked, enterprises take on more expense and more risk.
"In the sea of solutions focusing on e-discovery and enterprise search, Digital Reef uniquely understands the requirements of data center infrastructure and contextual content management," says Tony Asaro, founder and senior consultant, The INI Group (http://contemplatingit.com). "This combination sets Digital Reef apart from other players. From a data center perspective, Digital Reef’s solution uses grid architecture to meet the performance requirements for indexing massive amounts of data.

"[Digital Reef] has a truly federated view of all of the data regardless of physical location, and the solution transparently works with all file systems from NTFS to WAFL to ZFS. From a content management perspective, Digital Reef combines intelligent keyword search with its unique similarity engine to enable users to efficiently access the appropriate and relevant information they seek."

Digital Reef says its namesake platform allows customers to: locate specific kinds of data, including sensitive data like Social Security and credit card numbers; identify regulated data for compliance; pinpoint relevant documents for pending legal action; and find intellectual property that can be reused for competitive advantage.