Thursday, January 29, 2009
DRN upcoming presentations
Demonstrate to DRN group ("internal presentation"): Thurs., 2/5 @ 4:30PM ET
Demonstrate to large group ("external presentation"): Thurs., 2/19 @ 2:30PM ET
WebEx Demo
Wednesday, January 28, 2009
Content for a Health Channel
This morning, the concept of the Health Channel was mentioned and I recalled an article from Dr. Linda Pickle (National Cancer Institute) regarding the notion of creating weather maps of disease.
The article is here: http://www.ij-healthgeographics.com/content/5/1/49
Abstract
Background
To communicate population-based cancer statistics, cancer researchers have a long tradition of presenting data in a spatial representation, or map. Historically, health data were presented in printed atlases in which the map producer selected the content and format. The availability of geographic information systems (GIS) with comprehensive mapping and spatial analysis capability for desktop and Internet mapping has greatly expanded the number of producers and consumers of health maps, including policymakers and the public.
Because health maps, particularly ones that show elevated cancer rates, historically have raised public concerns, it is essential that these maps be designed to be accurate, clear, and interpretable for the broad range of users who may view them. This article focuses on designing maps to communicate effectively. It is based on years of research into the use of health maps for communicating among public health researchers.
Results
The basics for designing maps that communicate effectively are similar to the basics for any mode of communication. Tasks include deciding on the purpose, knowing the audience and its characteristics, choosing a media suitable for both the purpose and the audience, and finally testing the map design to ensure that it suits the purpose with the intended audience, and communicates accurately and effectively. Special considerations for health maps include ensuring confidentiality and reflecting the uncertainty of small area statistics. Statistical maps need to be based on sound practices and principles developed by the statistical and cartographic communities.
Conclusion
The biggest challenge is to ensure that maps of health statistics inform without misinforming. Advances in the sciences of cartography, statistics, and visualization of spatial data are constantly expanding the toolkit available to mapmakers to meet this challenge. Asking potential users to answer questions or to talk about what they see is still the best way to evaluate the effectiveness of a specific map design.
Tuesday, January 27, 2009
Dimdim - the leading provider of open source web meeting software
Thought you all might find this interesting.
Monday, January 26, 2009
PoiConDai II, revenge of PoiConDai
The new service that will be feeding PoiConDai is going to return much more complex information, namely count aggregations by region and time. Dr. Jeremy Espino has recommended Apache CXF as the service handler (basically the next generation of Apache Axis, which should let the new PoiConDai work with Globus that much more seamlessly should we need to distribute its operations on the grid) and JAXB as the object marshaller. (At least I am anticipating JAXB, because the last poison center web service returned the data as an XML "any" stream with a "schema" stream.)
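To make that concrete, here is a rough sketch of the kind of JAXB-annotated class the new service might hand back. The class and field names are my own placeholders, not the actual NPDS schema.

// A minimal sketch of a JAXB-annotated count aggregation (placeholder names only).
import java.util.Date;
import java.util.List;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement(name = "regionCounts")
@XmlAccessorType(XmlAccessType.FIELD)
public class RegionCounts {
    @XmlElement private String regionType;         // e.g. "state", "zip3", or "zip5"
    @XmlElement private String regionValue;        // e.g. "GA" or "303"
    @XmlElement private Date startDate;
    @XmlElement private Date endDate;
    @XmlElement private List<Integer> dailyCounts; // one bucket per day in the range

    // only the getters used in the example below are shown
    public String getRegionValue() { return regionValue; }
    public List<Integer> getDailyCounts() { return dailyCounts; }
}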
Thus, the PoiConDai service is going to need to take a command, determine the output type, marshal it all out, and then filter the results into the appropriate regional polygons, which were also drummed up with the help of the Gmap-Polygon libraries.
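The filtering step could then be about this small; GmapPolygon and the lookup map here are stand-ins for whatever the Gmap-Polygon library actually provides, so treat this as a sketch rather than working code.

import java.util.List;
import java.util.Map;

public class PolygonFiller {
    // Hypothetical glue: push each aggregation into the matching polygon so the JSP
    // layer can draw its coloration and little line graph instead of "0 cases".
    static void fillPolygons(List<RegionCounts> results,
                             Map<String, GmapPolygon> polygonsByRegion) {
        for (RegionCounts counts : results) {
            GmapPolygon polygon = polygonsByRegion.get(counts.getRegionValue());
            if (polygon != null) {
                polygon.setTimeSeries(counts.getDailyCounts());
            }
        }
    }
}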
So it is becoming one of those awkward moments of figuring out where to start: do I delve into the new technologies for objectifying the PoiConDai data, or start planning ways to keep the methods for stuffing PoiConDai data into polygons simple and relatively agnostic to the JSPs?
I think I am going to focus on making the PoiConDai stubs closer to the JSP side, for a couple of reasons. First, I have just been dealing with the JSP end of things, so it will be an easy bridge. Second, I am not sure of the service endpoints yet and will need to fiddle with JAXB and CXF, so it would be good to know where the resulting objects are going to go.
Of course this is all liable to change as assumptions will probably be made and then be proven wrong or not feasible. Another wacky day in coding big enterprise type things.
Friday, January 23, 2009
New Grid Node
Thursday, January 22, 2009
Gmap-Polygon now demonstrable
Now, I will work on taking the revamped NPDS service and getting it to populate the polygons so that little line graphs and colorations occur instead of just "0 cases" showing up all the time.
The other idea is that anyone who needs a Google Maps polygon mapper can use this project, especially if it has to work in a secure environment where publicly exposing a KML file isn't an option.
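For that "secure portal, no public KML" case, the serving side can stay pretty small. A hypothetical servlet along these lines (PolygonDao is a made-up name for whatever reads the polygon tables) would hand the vertex list to the page as JSON, and the page-side JavaScript would build the Google Maps polygons from that.

// Sketch of serving polygon vertices from inside a secure portal instead of exposing KML.
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class PolygonServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String region = req.getParameter("region");                 // e.g. a state or zip3 code
        List<double[]> vertices = PolygonDao.lookupVertices(region); // hypothetical DAO over the polygon tables
        resp.setContentType("application/json");
        PrintWriter out = resp.getWriter();
        out.print("[");
        for (int i = 0; i < vertices.size(); i++) {
            double[] v = vertices.get(i);                            // v[0] = latitude, v[1] = longitude
            out.print("[" + v[0] + "," + v[1] + "]" + (i < vertices.size() - 1 ? "," : ""));
        }
        out.print("]");
    }
}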
Either way, I am happy with the results; I think it is nifty, and the more you poke at things, the more they cache and the faster they should (hopefully) load.
Cheers!
AMDS draft structure changing
The PowerPoint is available on the HIE/AMDS wiki and outlines the new data structure for AMDS. It is a simplification and consolidation of what Felicia is currently working on, in that there is a single operation that allows a query by syndrome for a group of ZIP codes. It also expands the specification by including specific fields for cell suppression rules.
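Read as a Java interface, that single operation might look roughly like the sketch below. The real element names are in the draft on the wiki; these are only illustrative, and AmdsCount stands for the Level 1 record.

// Illustrative only: one operation, one syndrome, a group of ZIP codes, a date range.
import java.util.Date;
import java.util.List;

public interface AmdsQueryService {
    /**
     * Returns aggregate counts for the given syndrome across a group of ZIP codes,
     * with the node's cell-suppression rules already applied.
     */
    List<AmdsCount> getCounts(String syndrome,
                              String syndromeClassifier,
                              List<String> zipCodes,
                              Date dateStart,
                              Date dateEnd);
}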
So this means that we're closer to a working draft, but all the example services we have up on sourceforge and in the service registry need updating.
Wednesday, January 21, 2009
Centroids.
This also means I will be ramping up on the NPDS changes, focusing on getting time series populated into the Gmap-Polygon, and then we will have NPDS II (or PoiConDai II, if you will).
Cheers!
Tuesday, January 20, 2009
Polygon Centroid Updates.
After that, it is time to sit down with all the NPDS sample services and see if I can't get the NPDS service that Dr. Jeremy Espino put together (he has already suggested a bevy of new technologies to look at) working with some of the new constructs and creating new time series to put into the polygons.
Then it is a hop, skip, and a debug session away from the new NPDS demo, which will be the marriage of the gmap-polygon tech with the new NPDS service.
I'm hoping it shall be sufficiently nifty.
List of Open Geospatial Consortium (OGC) Web Mapping Services
http://www.ogc-services.net/
Friday, January 16, 2009
Gmap Polygon coming along nicely
Also, while the polygons are all in the database (multiplied though they may be), I still need to get the centroids into the database so that the map zooms in better when individual states or zip3s are picked.
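For the centroid itself, a plain average of the vertices is probably close enough for a zoom target, but the standard area-weighted formula is not much more code and behaves better on the skinnier shapes. A sketch:

// Area-weighted centroid of a simple (non-self-intersecting) polygon ring.
// xs/ys are the vertex longitudes/latitudes in order; the ring need not repeat its first point.
static double[] centroid(double[] xs, double[] ys) {
    double area2 = 0, cx = 0, cy = 0;
    int n = xs.length;
    for (int i = 0; i < n; i++) {
        int j = (i + 1) % n;                          // wrap around to close the ring
        double cross = xs[i] * ys[j] - xs[j] * ys[i];
        area2 += cross;                               // twice the signed area, accumulated
        cx += (xs[i] + xs[j]) * cross;
        cy += (ys[i] + ys[j]) * cross;
    }
    return new double[] { cx / (3 * area2), cy / (3 * area2) };
}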
Another thing is that the NPDS service will be updating, so I am finishing up some of the tweaks to the gmapping service just in time.
I am going to go ahead and schedule a deploy early next week so people can start playing with it and seeing how the mapper behaves.
Thursday, January 15, 2009
AMDS Progress
Successful DRN Test
- List files on remote SAMBA share with SST service
- Upload a SAS file via SST
- Run SAS program on a remote SAS server
- Download SAS output file in CSV format via SST
- View output on client node
Lessons Learned:
The following errors were thrown by the SST service for various reasons:
Command: java -jar caGrid-Transfer-client-1.0.jar list .
Error Message: An error has occurred, error message = org.globus.gsi.GlobusCredentialException: Proxy file (/tmp/x509up_[uid-number]) not found
Explanation: You need to create a proxy before running SST commands. Run the grid-proxy-init command to correct this problem.
Command: java -jar caGrid-Transfer-client-1.0.jar list .
Error message: An error has occurred, error message = java.net.SocketException: Connection Reset
Explanation: Your client has trust issues with the remote node. Make sure your user credentials have been entered into the remote node's grid-map file and that your signing CA is trusted by the remote node.
Command: java -jar caGrid-Transfer-client-1.0.jar list .
Error message: An error has occurred, error message = java.lang.NullPointerException
Explanation: This means there is no data in the simpleTransfer directory.
Command: java -jar caGrid-Transfer-client-1.0.jar upload [file name]
Error message: An error has occurred, error message = java.io.IOException: Internal Server Error
Explanation: This means the mapped user account is unable to write to the simpleTransfer directory. You will need to create the directory or check the directory permissions to correct this issue.
Order of operation for SST installation:
- Start Globus Container with the Root user
- Copy the caGridTransfer.war to $CATALINA_HOME/webapps
- Run the globus-deploy-gar command on cagrid_TransferService.gar
- Run the ant command to deploy Globus to Tomcat
- Shutdown the old Globus instance
- Start Tomcat with the Globus user (This starts Globus within Tomcat)
- Verify Globus is listening on the correct ports
- Run the grid-proxy-init command on the client node
- Test SST
Globus Toolkit User Perspectives Report Available
http://www.mcs.anl.gov/~childers/perspectives/
Focus is HPC (high performance computing) but the toolkit comments are useful.
R and Cloud Computing
Got models in R? Deploy and score them in ADAPA in minutes on the Amazon EC2 cloud computing infrastructure!
Zementis ( http://www.zementis.com ) has been working with the R community, specifically to extend the support for the Predictive Model Markup Language (PMML) standard which allows model exchange among various statistical software tools ( http://adapasupport.zementis.com/2008/02/how-can-i-export-pmml-code-from-r.html ).
If you develop your models in R, you can easily deploy and execute these models in the Zementis ADAPA scoring engine ( http://www.zementis.com/products.htm ) using the PMML standard. This not only eliminates potential memory constraints in R but also speeds execution and allows SOA-based integration. For the IT department, ADAPA delivers reliability and scalability needed for production-ready deployment and real-time predictive analytics.
SST Security Settings
Log File:
Jan 13, 2009 3:20:18 PM org.globus.wsrf.impl.security.authorization.GridMapAuthorization isPermitted
WARNING: Gridmap authorization failed: peer "/O=Messed/OU=updomain.net/OU=Some Organization/OU=updomain.net/CN=Bubba Gump" not in gridmap file.
Jan 13, 2009 3:20:18 PM org.globus.wsrf.impl.security.authorization.ServiceAuthorizationChain authorize
WARNING: "/O=Messed/OU=updomain.net/OU=Some Organization/OU=updomain.net/CN=Bubba Gump" is not authorized to use operation: {http://transfer.blahdomain.com/TransferService}storeFile on this service
We are currently researching a method for managing these authorization settings via the Globus/Tomcat configuration files. This configuration is currently built into the SST gar file.
Wednesday, January 14, 2009
DRN Notes
Current error message:
An error has occurred, error message = java.lang.ClassCastException: org.globus.gsi.gssapi.GlobusGSSContextImpl
Actions Taken:
1. Verified that a valid proxy was created for the SST client side
2. Verified all Globus user and host certificates were installed with the correct permissions
3. Checked the grid-mapfile to verify the remote user DN was mapped to a valid user account
4. Created a simpleTransfer directory at the default location. (Permissions may need to be changed – action for tomorrow)
5. Verified the permissions were set correctly for containercert.pem and containerkey.pem
6. Viewed the Tomcat server.xml to verify the paths for the cert, key, and cacertdir variables were set correctly
7. Decided to do a full SST service redeploy. Currently in progress: shut down Tomcat and deleted the webapps/caGridTransfer and wsrf directories
8. The remote admin had to leave during the operation. We will continue the redeploy first thing in the morning.
Next Steps:
1. Start Globus via VDT
2. Run the globus-deploy-gar command on the SST gar file
3. Shut down VDT Globus
4. Redeploy Globus to Tomcat
5. Start Tomcat
6. Test transfer
Other Notes:
We were able to generate output with the SAS code and the SAS Template that was provided by Roy.
Harvard paper on SAS macros for SatScan inputs
This paper describes SAS macros for generating input files for SatScan and could be useful as we move ahead with the SatScan grid service from the University of Utah.
A SatScan macro accessory for cartography (SMAC) package implemented with SAS
http://www.ij-healthgeographics.com/content/6/1/6
Jim Tobias
Tuesday, January 13, 2009
Zip3's now much better.
But this all brings up a good point: Our system is going to have to be all sorts of flexible. This is because we are dealing with GeoData, and in addition to GeoData changing all the time (zipcodes move, census numbers are updated), everyone will want to use a different set.
I don't think anyone will ever agree on a good way to geographically handle reporting systems like RODS or PoiConDai, and the requirements will be completely different depending on whether things are being viewed by geography or by population density. So I feel the easiest thing we can do is allow our systems to easily adopt whatever new-fangled (or old-fangled) visualization technique is preferred at the time by whoever wants to use our system.
I feel like Gmap-Polygon is a good step in that direction. It is basically a way to put polygons on the map for whatever region you want to draw, so as long as you have the polygons loaded into the database beforehand, and you know how you want to shade them, you should be able to use the software easily. This of course assumes that you want to put it in a secure portal and don't just want to expose KML (because for drawing polygons with nifty shading and effects, KML is a bit more straightforward).
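The shading rule is really the only piece each caller has to supply, and a tiny callback like this hypothetical one (not an actual Gmap-Polygon interface) would keep the library agnostic about where the numbers come from. A weather-map style shader would just bucket the count into a color ramp.

// Hypothetical shading callback: Gmap-Polygon asks, the application answers.
public interface RegionShader {
    /** Return a hex color (e.g. "#cc0000") for the region with the given id and count. */
    String colorFor(String regionId, int count);
}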
I feel like the other thing PHGrid products will need to start focusing on is ways to securely associate data, because a lot of people will want to focus on things like the census tracts that Jim mentioned, no one will know what census tract they live in when they go to a hospital, and pulling up address data for a count item is going to be a very big no-no outside the secure network. I wonder if one application of the Natural Language Processor could be backfilling the AMDS regional info with census tract data and the like.
Monday, January 12, 2009
Oh look, Zip3s have multiple polygons in them.
So, ZIP3s do not form one big contiguous block of area in all cases. Sometimes they form multiple polygons because some ZIP code changed from 61299 to 61304 or whatever.
But the polygon system I had set up was expecting only one polygon per region. Thus, I will need to do a bit of an overhaul (it actually won't be too bad; yay, object-oriented coding), but I will need to massage a new list of zip3s (and probably zip5s and states) out of the KML files.
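The shape of the overhaul is roughly "a region has polygons" rather than "a region is a polygon"; something like the sketch below, where the names are mine rather than the actual gmap-polygon classes and Polygon stands in for whatever holds one ring of vertices.

// A region identified by its state/zip3/zip5 code, holding one or more polygon rings.
import java.util.ArrayList;
import java.util.List;

public class Region {
    private final String regionId;
    private final List<Polygon> polygons = new ArrayList<Polygon>();

    public Region(String regionId) { this.regionId = regionId; }

    public String getRegionId() { return regionId; }
    public void addPolygon(Polygon ring) { polygons.add(ring); }
    public List<Polygon> getPolygons() { return polygons; }
}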
Then, hopefully, I'll have entire states full of zip3s, and then be able to select zip5s within zip3s, and then comes the PoiConDai refurb.
Friday, January 9, 2009
Obama's Speech (January 8, 2009)
Read about Smart Grid here.
Read article on Obama's speech below.
Obama includes broadband, smart grid in stimulus package
January 8, 2009 (IDG News Service) U.S. President-elect Barack Obama laid out his plan for a huge economic stimulus package, with a broadband rollout, an Internet-based smart energy grid and computers for schools as part of the plan.
During his campaign, Obama included rolling out broadband, energy issues and computers for schools in his list of goals, but in Thursday's speech in Fairfax, Va., he called for those items to be included in a giant stimulus package he'll push Congress to pass within weeks. The stimulus package could cost close to $1 trillion.
The president-elect called the economic situation in the U.S. a "crisis unlike any we have seen in our lifetime."
He also called for all U.S. medical records to be computerized within five years. "This will cut waste, eliminate red tape and reduce the need to repeat expensive medical tests," he said. "But it just won't save billions of dollars and thousands of jobs -- it will save lives by reducing the deadly but preventable medical errors that pervade our health care system."
Obama called on Congress to approve funding for rolling out broadband to unserved and underserved areas, although his speech did not provide details on how he wants it to happen. Several tech groups have called for a national broadband policy that would include a mixture of tax credits, loans and payments to broadband providers that bring broadband to new areas.
Part of the package should include rebuilding physical infrastructure such as roads and bridges, Obama said. "But we'll also do more to retrofit America for a global economy," he added. "That means updating the way we get our electricity by starting to build a new smart grid that will save us money; protect our power sources from blackout or attack; and deliver clean, alternative forms of energy to every corner of our nation. It means expanding broadband lines across America, so that a small business in a rural town can connect and compete with their counterparts anywhere in the world."
Smart energy grids would allow real-time monitoring of a customer's energy use through Internet technology. Proponents of a national smart grid say it would likely result in decreased electricity use, allow energy companies to more efficiently distribute electricity, and encourage homeowners to install alternative energy generators such as solar panels and sell their excess energy back to the grid.
Obama also called for Congress to approve money for "21st-century" classrooms, laboratories and libraries. "We'll provide new computers, new technology and new training for teachers so that students in Chicago and Boston can compete with kids in Beijing for the high-tech, high-wage jobs of the future," he said.
Obama's priorities line up with several tech groups that have been calling for more broadband and smart-grid funding. The Information Technology Industry Council (ITI), a trade group, praised Obama's stimulus plan. The package outlined by Obama represents an "excellent starting point," ITI President Dean Garfield said in a statement.
"Our firms know that technology investments are the quickest way to dramatically turn the
economy around," he added. "Increased broadband spending, electronic medical records, green energy investments and new computers for schools and libraries are all smart ways to keep America competitive while also creating new jobs and spending."
Wednesday, January 7, 2009
Updates to the AMDS draft
Now Level 1 has:
Syndrome
Syndrome classifier
Age Group (0<2; 2<5; 5<18; 18<54; 54<65; 65+)
Patient State
Patient Zip (either zip3 or zip5)
Date Start
Date End (in at least 1 day increments)
Count
We removed denominator as that's not really necessary (since you can query for all syndromes).
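In code, a Level 1 record is just the field list above. A rough Java rendering (my field names, not the draft's exact element names) might be:

// Illustrative Level 1 AMDS record: one row per syndrome/age group/region/date range.
import java.util.Date;

public class AmdsCount {
    private String syndrome;
    private String syndromeClassifier;
    private String ageGroup;       // one of the six buckets listed above
    private String patientState;
    private String patientZip;     // zip3 or zip5
    private Date dateStart;
    private Date dateEnd;          // at least one-day increments
    private int count;
    // getters and setters omitted
}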
Level 2 (not developed yet):
Gender
Race
Ethnicity
Detailed Geography
Also, Dr. Lenert created different stages of access support for AMDS:
Stage 0: Scheduled transfers of AMDS documents to a supernode/central database
Stage 1: AMDS data available using secure grid services
Stage 2: AMDS data de-duped via secure grid services
Stage 3: AMDS data de-duped and combined together from multiple nodes (although after Stage 1, results can be combined together)
Update on Collaboration with SatScan on Cloud
I have built a simulation dataset with 53,018 random points within NC, SC, and GA, based upon a RODS dataset that was provided to me.
I am working with a statistician in NCZVED to write scripts to automatically generate SatScan parameter files (.prm) as well as case files (.cas), population files (.pop), and coordinate files (.geo) that would all be used as inputs to the SatScan batch mode (for the grid service).
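The case-file half of that is the simplest piece. Assuming SatScan's usual space-delimited location-id, case-count, date layout for the .cas file, the writer is just a loop; the .prm, .pop, and .geo generators would follow the same pattern.

// Sketch of writing a SatScan case (.cas) file from pre-aggregated rows.
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

public class CaseFileWriter {
    /** Each row: [0] location id (e.g. a ZIP code), [1] case count, [2] date string. */
    public static void write(String path, List<String[]> rows) throws IOException {
        PrintWriter out = new PrintWriter(new FileWriter(path));
        try {
            for (String[] row : rows) {
                out.println(row[0] + " " + row[1] + " " + row[2]);
            }
        } finally {
            out.close();
        }
    }
}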
Also, I have emailed the CCID TB group several singular-genotype spatial cluster maps as examples, and I have built a web mapping service to visualize spatial clusters. It is my hope that Ron and I can collaborate on a demo of the grid service and simulated data to reignite the interest of the CCID TB epidemiologists.
Jim Tobias
R and Rweb
R is used extensively by epidemiologists and geographers.
One can download R (free): http://cran.cnr.berkeley.edu/
The base install of R can be extended with R packages.
A fairly comprehensive list of R packages is here: http://cran.r-project.org/web/packages/
The R software is level III approved for use at the CDC.
The R software runs on your desktop but the R code can be run through a web service.
An example of an R web service is RWeb: http://www.math.montana.edu/Rweb/
Specific packages such as EpiTools have been developed by UC Berkeley and they support full epidemiology coursework with R.
If you have questions or wish to see this in action, just stop by.
Jim Tobias
Tuesday, January 6, 2009
package changes and SVN
It was sort of humbling, because my prior approach would have been to build a big parser library and then spend another day or two debugging it. In this case, I just took the files, ran a macro over them, and then told the database to import them. The result is that I spent just two-thirds of a day getting all the data I wanted into the right spot. The downside is that I'll have to do it all again (well, probably a third of a day) when database updates need to be made; then again, I probably would have had to tweak my loader scripts anyway, and it's still the same cost to the end user, and in fact probably less. Before, it was: download files, download code, adjust properties, package, build, run. Now it's: download CSV, create a table in the database, import the CSV into the table. Yeah. It's easier now.
Otherwise, there are now lists of relational data (all the zip3s in Nevada, all the zip5s in a given zip3, etc.), and I get to start building the code libraries that will serve them up as needed. Before I did that, though, I had to do a package update, which involves changing one line in a lot of places.
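The list libraries themselves should stay thin now that the data lives in the database; a lookup like this is about all there is to it. The table and column names here are placeholders for whatever the loader actually created.

// Placeholder query: all zip3s for a given state, in order.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class RegionLists {
    public static List<String> zip3sForState(Connection conn, String state) throws SQLException {
        List<String> zip3s = new ArrayList<String>();
        PreparedStatement ps = conn.prepareStatement(
                "SELECT DISTINCT zip3 FROM region_zip3 WHERE state = ? ORDER BY zip3");
        try {
            ps.setString(1, state);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                zip3s.add(rs.getString(1));
            }
        } finally {
            ps.close();
        }
        return zip3s;
    }
}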
Luckily, Eclipse handled the refactoring for me beautifully, but it terribly broke the Subversion management of it all. Thus, I spent the rest of the day deleting the appropriate files and committing the changes by hand.
Tomorrow is the implementation of the lists, and updating documentation for the loader and the list-libraries and, well, everything else.
Dallas Network Issues
- The Tarrant node is back online. The root cause of the problem was that the VM failed to restart after a host server reboot. The VM was manually restarted successfully.
- The Dallas node is experiencing slow response times due to network issues that are currently being addressed by the Dallas network team. They are in the process of investigating the root cause of the issue. This issue currently affects the RPDSAdai demo and any data transfers to and from the Dallas node. I was informed that this issue would be corrected by close of business on 01-06-2009.
Link Plus: From CDC's National Program of Cancer Registries (NPCR)
Download Application Here:
ftp://ftp.cdc.gov/pub/Software/RegistryPlus/Link_Plus/RPLinkPLus-2.0.exe
Features
http://www.cdc.gov/cancer/NPCR/tools/registryplus/lp_features.htm
Febrl - An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface
I found out about Febrl yesterday. It has been developed since 2002 as part of a collaborative research project conducted between the Australian National University in Canberra and the New South Wales Department of Health in Sydney, Australia.
The group is working on Parallel Large Scale Techniques for High-Performance Record Linkage
Link to their efforts: http://datamining.anu.edu.au/linkage.html
The NCPHI Grid team is going to investigate this - and find out what, if any, aspects of Febrl might be leveraged to help provide this service on the PHGrid.
Note: It requires Python to be installed.
Here are two papers on Febrl.
http://cs.anu.edu.au/~Peter.Christen/publications/kdd2008christen-febrl-demo.pdf
http://crpit.com/confpapers/CRPITV80Christen.pdf
Here is the sourceforge link to Febrl
http://sourceforge.net/projects/febrl
Features include:
* Probabilistic and rules-based cleaning and standardisation routines for names, addresses, dates and telephone numbers.
* A geocoding matching system based on the Australian G-NAF (Geocoded National Address File) database.
* A variety of supplied look-up and frequency tables for names and addresses.
* Various comparison functions for names, addresses, dates and localities, including approximate string comparisons, phonetic encodings, geographical distance comparisons, and time and age comparisons. Two new approximate string comparison methods (bag distance and compression based) have been added in this release.
* Several blocking (indexing) methods, including the traditional compound key blocking used in many record linkage programs.
* Probabilistic record linkage routines based on the classical Fellegi and Sunter approach, as well as a 'flexible classifier' that allows a flexible definition of the weight calculation.
* Process indicators that give estimations of remaining processing times.
* Access methods for fixed format and comma-separated value (CSV) text files, as well as SQL databases (MySQL and new PostgreSQL).
* Efficient temporary direct random access data set based on the Berkeley database library.
* Possibility to save linkage and deduplication results into a comma-separated value (CSV) text file (new).
* One-to-one assignment procedure for linked record pairs based on the 'Auction' algorithm.
* Supports parallelism for higher performance on parallel platforms, based on MPI (Message Passing Interface), a standard for parallel programming, and Pypar, an efficient and easy-to-use module that allows Python programs to run in parallel on multiple processors and communicate using MPI.
* A data set generator which allows the creation of data sets of randomly generated records (containing names, addresses, dates, and phone and identifier numbers), with the possibility to include duplicate records with randomly introduced modifications. This allows for easy testing and evaluation of linkage (deduplication) processes.
* Example project modules and example data sets allowing simple running of Febrl projects without any modifications needed.
-----
An Evaluation of Febrl
http://www.qsinc.com/dedup/results_FEBRL.html
Conclusion of Evaluation:
Febrl is an excellent research tool to study the deduplication process. It provides refinements in record standardization, geocoding and field matching. It provides flexibility in how data is selected for scoring, how record fields are compared, and how the ultimate score is calculated. Febrl is also useful as an outboard record standardizer, of particular interest where names, addresses and phone numbers are not already separated into component parts.
It became clear, as we understood the inherent flexibility of Febrl, that with sufficient effort we could have used our long experience in deduplication to improve our results. It's all in the configuration.
However, we couldn't think of any way to deal effectively with twins. Probabilistic record linking schemes that rely solely on aggregating scores of individual field comparisons do not perform well on data with twins. Febrl couldn't distinguish between a true match with data errors and a twin. It didn't "understand" data in context.
Febrl places considerable resources at the researcher's command. It provides a deduplication laboratory allowing the experimenter lots of knobs to turn in order to arrive at a deduplication solution. But a laboratory is not a factory. We found ourselves "trying just one more thing" over and over again in the interests of fairness for this comparison with no way to decide when enough is enough. With endless opportunities for tinkering, adjustment and trying "just one more thing", it's an hourly consultant's paradise. For this reason, look very carefully at any consultant's recommendation to use Febrl.
While Febrl is free open source software, we believe that the actual cost of moving it from the laboratory into production could be significant. Its inability to deal with twins is a fatal flaw.
Monday, January 5, 2009
Setting up the locational library.
Now, I am in the process of building lists so that any application that wants to provide locations will be able to get lists of states, zip3s within those states, and zip5s within those zip3s. That, and improving the graphic awareness so the map can center on a given area after it is selected.
This of course means massaging large KML lists, and discussions about what would be most useful to someone trying to build their own databases on their own systems. Right now the best answer seems to be comma-delimited files, considering every database has a way to ingest such files very quickly.
I am also happy to have gotten some examples of text massaging using macros, and I am once again making sure to earmark some time in the new year for basic scripting niftiness, namely because there are just some times when creative use of grep and regular expressions will do so much more with a file than trying to write a SAX parser.
Friday, January 2, 2009
NIH Webcast - Transforming Science: Cancer Control on caBIG
http://videocast.nih.gov/Summary.asp?File=14755
Special thanks to Abdul R Shaikh, PhD, MHSc, Behavioral Scientist Program Director, Health Communication and Informatics Research Branch, Behavioral Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute.
FYI - the NCI is moving forward to launch a division web site on grid-enabled research in population health.
Thursday, January 1, 2009
HealthGrid 2009 conference Berlin - Call for Papers
KEY DATES
Call for papers opens: 18 December 2008
Call for papers closes: 1 February 2009
Registration opens: 1 February 2009
More information about submission:
http://berlin2009.healthgrid.org/fileadmin/templates/download/HealthGRID2009_Call-for-Papers.pdf