We have set up a test web service at http://156.56.104.161:8080/axis2/services/DailyRdahmmResultService?wsdl to calculate markers' colors for a given date. We have also added a "please wait" window to the portlet while the page waits for results from the service. Calling the web service to calculate the colors is actually faster than using a managed bean, because we no longer have to refresh the page. The test page is located at http://156.56.104.161:8080/gridsphere/gridsphere; log in with an empty user name and password.
We'll update this on gw11 and in SVN after the performance measurement is done.
Tuesday, December 2, 2008
Saturday, November 1, 2008
Daily RDAHMM video updated
Because of changes to our treatment of missing-data sections, and the correction of several bugs, the historical Daily RDAHMM results now differ from the previous ones. So we have made a new movie covering the whole period since 1994 based on the new results. The new video can be accessed from the portal at http://gw11.quarry.iu.teragrid.org:8080/gridsphere/gridsphere
Modifications to DailyRDAHMM missing data treatment
We made the following modifications to the treatment of missing data in DailyRDAHMM:
1. For the input data, if the date range used for the query is 1994-01-01 to 2006-09-30, but we only get data for 1995-10-03 to 2006-08-22, then we only fill the gaps between 1995-10-03 and 2006-08-22, and leave the missing sections before and after as they are, in order to avoid distorting the results;
2. For the big flat file that contains input for all stations, we use NaNs to fill all missing lines.
Tuesday, October 7, 2008
Result Aggregator added to real-time RDAHMM service
A new component, the real-time RDAHMM results aggregator, has been added to the real-time RDAHMM service. The aggregator connects to the service through NaradaBrokering, receives messages containing the real-time RDAHMM analysis results from all seven networks, and aggregates these results into a single .xml file, so that the portlet can show the results by accessing this file.
One problem is the large volume of results: some stations went through more than 800 state changes within a 4-hour test run. So for now we only keep one day's results in this file.
Treatment to missing data corrected in daily RDAHMM analysis
A serious error in the daily RDAHMM analysis has been corrected, concerning the way we treat missing input data for stations. Previously we did nothing about the missing data except record it, which may lead to incorrect RDAHMM results because the hidden Markov model assumes data evenly distributed across time. We now correct this error by inserting "fake data lines" into the missing-data sections; these lines duplicate the available data from the time closest to each missing section.
After this correction, the total number of stations with state changes dropped significantly (from around 100 on average to fewer than 30) for the period from 2006 to 2008. Whether this new result is reasonable still needs verification.
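The gap-filling idea can be sketched like this. The data layout and function name below are illustrative, not the actual service code (which is Java): samples are assumed to carry an integer day index, and gaps inside the observed range are filled by duplicating the most recent available sample, so the series becomes evenly spaced as the HMM assumes.

```javascript
// Fill gaps between observed samples by duplicating the most recent sample.
// Each entry is assumed to look like {day: <integer day index>, values: [...]}.
function fillGaps(samples) {
  if (samples.length === 0) return [];
  var filled = [samples[0]];
  for (var i = 1; i < samples.length; i++) {
    var prev = filled[filled.length - 1];
    // insert duplicates of the most recent sample for each missing day
    for (var d = prev.day + 1; d < samples[i].day; d++) {
      filled.push({ day: d, values: prev.values });
    }
    filled.push(samples[i]);
  }
  return filled;
}
```

Note that, as described above, days before the first or after the last observed sample are deliberately left unfilled.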
Tuesday, September 16, 2008
Daily RDAHMM portlet files updated
The big DailyRdahmm.jsp file is now split into three files: DailyRdahmm.jsp and two utility JavaScript files, NmapAPI.js and dateUtil.js.
Saturday, September 6, 2008
Update to real-time RDAHMM portlet
We have changed the coloring scheme of the markers in the real-time RDAHMM portlet, so that the colors now describe the state-change information of each station rather than which network it belongs to. There are three colors: red for a state change within the last 2 hours, yellow for a state change within the last day, and green for no state change within the last day.
We use red for the last 2 hours because there is a lag between the time stamp of the real-time data and the "current" time when the data is received. For example, at 14:10 we might just be receiving a station's GPS data with a time stamp of 13:05. Therefore, a station's evaluation results always reflect its state-change information from about an hour ago.
In order to expose the state change information to the portlet, the real-time RDAHMM service saves this information for each network in a separate file in the same directory as the plots, so that the portlet can retrieve them with URLs.
Check out the new portlet at http://gw11.quarry.iu.teragrid.org:8080/gridsphere/gridsphere.
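The coloring rule above can be sketched as a small function (the name and signature are illustrative, not the actual portlet code):

```javascript
// Red if the last state change is within 2 hours, yellow within 1 day,
// green otherwise. Both arguments are millisecond timestamps.
function markerColor(lastChangeMillis, nowMillis) {
  var HOUR = 3600 * 1000;
  var age = nowMillis - lastChangeMillis;
  if (age <= 2 * HOUR) return "red";
  if (age <= 24 * HOUR) return "yellow";
  return "green";
}
```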
Video making added to Daily RDAHMM
We have integrated the video-making function into the daily RDAHMM service and portlet.
The service now creates a new thread to make a video of the whole time period after running the RDAHMM analysis on all stations, and adds an XML element to the result XML file giving the URL of the newly created video.
The portlet will show a link to this video, right below the link to the trace file of each day's number of stations with state changes.
Plot of state change numbers added to Daily RDAHMM
A new plotting function has been added to the daily RDAHMM service. After running the RDAHMM analysis on all stations each day, the service draws a plot showing how the number of stations with state changes varies over the whole period from 1994-1-1 to "today". We added new shell and gnuplot scripts to do this, and the service generates a file containing the trace of the state-change numbers as the input for these scripts.
On the daily RDAHMM portlet, a new tab is added to show this plot, as well as a link to the trace file, which contains the detailed information about each day's number of stations with state changes.
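For illustration, here is a sketch of reading such a trace into a (date, count) series. The line format assumed here, one "yyyy-mm-dd count" pair per line, is a guess; the actual file written by the service may differ.

```javascript
// Parse a state-change trace (assumed format: "yyyy-mm-dd count" per line)
// into an array of {date, count} records, skipping blank lines.
function parseTrace(text) {
  var series = [];
  var lines = text.split("\n");
  for (var i = 0; i < lines.length; i++) {
    var parts = lines[i].trim().split(/\s+/);
    if (parts.length < 2) continue; // skip blank or malformed lines
    series.push({ date: parts[0], count: parseInt(parts[1], 10) });
  }
  return series;
}
```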
Monday, August 4, 2008
Difficulties in running real-time RDAHMM on TeraGrid
Since building an RDAHMM model on one day's real-time data is compute-intensive, it turns out to be impossible to run the real-time RDAHMM service for seven networks on one machine:
it takes much longer to build a model when we start the service for all networks at the same time, and the build time varies a lot among stations. When we tried this on gf14, it ranged from 8 minutes to more than 9 hours.
So we are considering running the service on TeraGrid, expecting the grid services to balance the load across more resources. But so far we have run into the following difficulties:
1. We can't get our present rdahmm executable to run on the TeraGrid clusters. Some of them might not have the correct version of gcc and thus can't be configured with the right libraries. We set up all the necessary libraries on Cobalt, but got a floating-point exception when trying to run it. We may need to get the latest rdahmm source code and recompile it.
2. The TeraGrid services don't support continuous jobs, i.e., jobs that run "forever" after they are started, until we kill them manually. Further, child processes created by these jobs are not scheduled or managed by the job management system. Therefore, even if we started the services by submitting them as jobs with a fake "estimated CPU time" parameter value, their child processes, created for building RDAHMM models, would still run on the same machine as the service, so the problems remain the same.
3. We could redesign the real-time RDAHMM service so that the service runs on our server, submits jobs to TeraGrid to build the RDAHMM models, and then fetches the result files. But this would complicate the structure of the service, and the completion time of these jobs might still be long because of queue waiting times.
Friday, July 11, 2008
Daily RDAHMM video
I tried to make a video of the daily RDAHMM results for the time from 2008-1-1 to 2008-6-14 in the following way:
1. Get a big Google map picture of California and save it as the background of the video frames;
2. Get the position coordinates of all stations in the background picture. Google Maps has an API for this: fromLatLngToDivPixel(latlng);
3. Get the daily RDAHMM results for every day and draw the station markers on the background image, with their colors decided by the stations' states on each day. This gives one image for each day from 2008-1-1 to 2008-6-14;
4. Make a video from all these images with mencoder.
Right now we have a separate program to draw the images for each day; we can integrate these functions into the daily RDAHMM service later. In order to use mencoder to make the video, we need to install it on every server where the service runs. Besides, we'll figure out a way to append one more image to an existing video, so that we don't have to make a new video from all the raw images every day.
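Since the offline drawing program can't call fromLatLngToDivPixel (it needs a live map object in a browser), step 2 can also be computed directly with the standard web-Mercator formula that Google Maps uses internally. This is a sketch under that assumption; the function name is ours:

```javascript
// Project (lat, lng) to pixel coordinates on a Google-Maps-style Mercator
// image at a given zoom level; the world is 256 * 2^zoom pixels wide.
function latLngToPixel(lat, lng, zoom) {
  var scale = 256 * Math.pow(2, zoom); // world size in pixels at this zoom
  var x = (lng + 180) / 360 * scale;
  var sinLat = Math.sin(lat * Math.PI / 180);
  var y = (0.5 - Math.log((1 + sinLat) / (1 - sinLat)) / (4 * Math.PI)) * scale;
  return { x: Math.round(x), y: Math.round(y) };
}
```

To draw a marker on the background image, one would subtract the pixel coordinates of the image's top-left corner from the station's projected coordinates.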
Sunday, June 15, 2008
Parallel RDAHMM
I did a test of the performance of two different ways of building an RDAHMM model:
A. Train the model with 10 tries in one process;
B. Train the model with 10 processes, each carrying out 1 try with a different random seed, and then select the model with the largest L value.
Performance Comparison:
Station Name    Line count of input file    Time for A (sec)    Time for B (sec)
CVHS            3921                        23                  20
TABL            48138                       277                 212
TABL            72622                       457                 341
When the amount of input is not large, the performance of A and B is similar, because method B pays the cost of creating new processes. When the input file has many lines, using 10 processes improves performance by 20%-25%.
If we start the real-time service for all stations of one network at the same time, we might end up creating too many processes, because there are 7-8 stations per network and 10 processes per station. Having so many processes running at the same time is very costly. So I think we can just use one process with 10 tries for building the models temporarily; 8-10 minutes is not a very long time, anyway.
Thursday, May 22, 2008
data storage and plotting of realtime rdahmm
Since the present modifications to the real-time RDAHMM service are just an initial version that is not fully correct, we need to store all received data for possible remodeling and evaluation in the future. We now use one file to save one day's input data for one station, and the data for all stations is stored in a directory structure like this: "histParentDir/stationName/yyyy/mm/stationName_yyyy-mm-dd.dat".
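The path scheme above can be sketched as a small helper (the function name is ours, and zero-padding the month and day is an assumption about the actual layout):

```javascript
// Build "histParentDir/stationName/yyyy/mm/stationName_yyyy-mm-dd.dat"
// for a given station and UTC date.
function dataFilePath(histParentDir, station, date) {
  function pad(n) { return (n < 10 ? "0" : "") + n; }
  var yyyy = date.getUTCFullYear();
  var mm = pad(date.getUTCMonth() + 1); // getUTCMonth() is 0-based
  var dd = pad(date.getUTCDate());
  return histParentDir + "/" + station + "/" + yyyy + "/" + mm + "/" +
         station + "_" + yyyy + "-" + mm + "-" + dd + ".dat";
}
```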
For plotting, we'll first do tests with the existing plotting script, which plots with lines instead of points. We'll switch to points later.
Sunday, May 11, 2008
Modifications to real-time RDAHMM
The present real-time RDAHMM service just runs RDAHMM in training mode periodically on each station's real-time input, which is actually not the right way to do it.
We'll make the following modifications, which are not totally right either, but are a first step towards the right way:
For each station, collect its input data for a whole day, and build an RDAHMM model for it by running RDAHMM in training mode on that day's input;
Use this model to do periodic evaluations for the station from then on; the period could vary from 10 minutes to 1 hour.
Right now we just use the model built from one day's data for all subsequent evaluations. This is obviously not completely right, and the model may need to be rebuilt from time to time. We'll leave an argument for specifying the model-rebuilding period, to make the new implementation as general as possible, and discuss the proper period later.
Thread problem with GRWS queries
We came across some problems when issuing GRWS queries from multiple threads. Some threads get no input for the stations they query, while the input for those stations is available when only one thread is used. Paul explained that this is due to current threading-support problems in the GRWS services: when one thread queries right after another at nearly the same time, the later thread may get no input.
To solve this problem we temporarily use only one thread, which takes less than an hour to run the evaluations for all stations, which is acceptable. Paul will try to improve the threading support of their services. He also mentioned that we can query the input for all stations at once and get the result in one single file. This would be very helpful for improving the performance of the daily RDAHMM service, and we'll try it later.
Sunday, April 6, 2008
Late Post for Addition of Links to Model Files
A link to each station's model files has been added to the daily RDAHMM portlet, located below the links to the station's output files. The test page is still http://156.56.104.161:8080/gridsphere/
This link points to a .zip file containing a compressed package of the corresponding station's model files. The package is created by the daily RDAHMM service as follows: every time a station's model files are created, or verified as already created, the service checks whether a package for these files exists; if not, it creates one with the same name as the directory where the files are kept.
Late Post for Managed Bean Based Daily Rdahmm Portlet
The test page for the daily RDAHMM portlet based on a managed bean is:
http://156.56.104.161:8080/gridsphere/
We ran into more problems than expected while doing this. Most of the computation about state changes and missing data has been moved to a managed bean, a Java class running on the server side. Since Java code runs much faster than JavaScript, and the page size has also come down from 6 MB to 770 KB, the loading and coloring time is now much shorter. The only drawback is that a page refresh is needed whenever the managed bean is invoked. To reduce the number of refreshes, the managed bean is only invoked when a new date is selected, to calculate the colors of all stations. Station markers in a specific region are still created only when we move to that part of the map, but their colors are calculated beforehand, together with all other stations, when the managed bean is invoked. Calculating the colors for all stations instead of only the stations in one region makes that step 7-8 times slower, but in absolute terms it only goes from 4 ms to around 30 ms, which is trivial compared to the whole page-refresh procedure.
Friday, March 7, 2008
Alternative ways with managed bean
The main purpose of using a managed bean in the daily RDAHMM portlet is to move the state-change and color calculations into the bean, so that the loading and response time of the portlet becomes shorter.
The managed bean needs to do two things: (1) read the result XML file and create data structures to store the state-change and missing-data information for all stations; (2) given a specific date, calculate the proper color for every station based on this information.
A straightforward way to do this is to keep one managed bean per browser request, or per session, with each bean containing both the XML loading and the calculation functions. On the other hand, since the loading and calculation are the same for every bean, we can instead move them into a standalone service and leave the beans as a thin interface to the portlet; the beans then call the service to do the calculation and return the results to the portlet. This way the XML file is read just once, and only one copy of the state-change and missing-data information is kept, in the service.
We'll first try the straightforward way, and then switch to the latter. The beans are complete; we still need some adjustments to the portlet code before testing with the managed beans.
JavaScript runs faster on Windows?
The JavaScript of the daily RDAHMM portlet seems to run faster on Windows. For changing the stations' colors within the same map scope on the same selected date, the scripts take about 2 seconds on my laptop (dual-core Xeon 1.83 GHz, 2 GB RAM, Windows), but around 4 seconds on the lab machine (dual-core P4 3.4 GHz, 2 GB RAM, Linux). The same procedure also takes about 4 seconds on Windows on another lab machine with just a P4 1.7 GHz and 500 MB of RAM.
This might be interesting for people analyzing the performance of Linux and dual-core CPUs.
Monday, February 18, 2008
decouple daily rdahmm service and portlet
Under the previous configuration, the daily RDAHMM service had to run on the same machine as the portlet. Obviously this is not good, because the service does the same work on every machine where it is deployed, wasting resources, and it is hard to update the service once changes are made.
The portlet is now modified so that every portlet instance fetches the XML result file from gf13 through HTTP. This puts two requirements on gf13: 1) the web server must always be running; 2) the service must run there and be updated in time. Since the XML file is updated once per day, the portlet caches the file for a day after each fetch. Since the portal servers are all currently deployed near gf13, fetching the XML file (around 1.5 MB) adds only a trivial latency to the portlet's response time, and this happens only once a day, when the portlet is first requested.
Friday, January 25, 2008
Moving status checking stuff to a managed bean?
Checking whether each station has a state change on the chosen date, or within the 30 days before it, is now a time-consuming procedure for JavaScript, because JavaScript is not efficient enough. So I am thinking about moving this procedure into a JSF managed bean, and calling a method like "getColorsForStations" on it when we need to recolor the stations.
The problem is that a managed bean can only be used by binding it to a JSF tag of an HTML control; we can't call the methods of a managed bean directly from JavaScript. So the following tricks might be needed:
a. The method may need five parameters: the current view scope of the map (min latitude, max latitude, min longitude, max longitude) and the chosen date. So we need five invisible control tags to hold the values of these parameters;
b. We need an invisible commandButton or something like it to map its onClick action to the call to this method. That way we can call the method by invoking "btn.click()" after setting up the parameter values.
c. We need an invisible control to store the result. The result should be simple, like a string over {0, 1, 2, 3, 4, 5}, where each number denotes a different color.
After calling the method, we can decide the color of every station by parsing the resulting string.
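The decoding in step c could look like the sketch below. The digit-to-color mapping here is purely illustrative; the actual set of colors has not been decided above.

```javascript
// Hypothetical mapping from result digits to marker colors.
var COLOR_TABLE = ["green", "yellow", "red", "orange", "gray", "blue"];

// Decode the bean's result string, one digit (0-5) per station, into
// an array of color names in station order.
function decodeColors(resultString) {
  var colors = [];
  for (var i = 0; i < resultString.length; i++) {
    colors.push(COLOR_TABLE[parseInt(resultString.charAt(i), 10)]);
  }
  return colors;
}
```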
The miscoloring problem caused by time zone difference
Several days ago we ran into a miscoloring problem: for some specific dates, certain stations are colored red on some machines but yellow on others.
The reason is related to the way we record the state-change dates in the JavaScript code. The dates were previously stored as a millisecond count in local time since 1970-01-01 00:00:00.000.
Since only the date matters to our application, the time-of-day of every date is set to 12:00:00.000.
The initial dates are set up on the server side by analyzing the XML file containing the state-change information of all stations, so the time used is the local time of the machine where the web server runs. When a user then chooses a date in the portal, a time for the chosen date is generated with client-side JavaScript and used for coloring the stations. The color of each station is decided by comparing this time with each stored time initialized on the server side.
As a result, if the client and server have different time-zone configurations, the two millisecond counts for the same date will differ, and this leads to miscoloring of the station.
To solve this problem, we now use standard UTC time for generating both the initialized state-change times and the dynamic date chosen by the portal user. Moreover, since only the date matters, we use a day count instead of milliseconds since 1970-01-01.
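The UTC day count can be computed like this (the function name is ours; Date.UTC is time-zone independent, so client and server agree):

```javascript
// Days since 1970-01-01, computed purely in UTC so the result is the same
// on every machine regardless of its local time zone. month is 1-based.
function utcDayCount(year, month, day) {
  return Math.floor(Date.UTC(year, month - 1, day) / 86400000); // 86400000 ms per day
}
```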
Wednesday, January 2, 2008
Manipulate xml file in javascript
Consider the following sample samp.xml file:
/*
<xml>
<station>
<id>aacc</id>
<lat>33.124</lat>
<long>110.223</long>
</station>
<station>
<id>bbcc</id>
<lat>32.124</lat>
<long>110.289</long>
</station>
</xml>
*/
We can use Java code to analyze XML files in JSP, e.g., with SAXReader. But when mixed with JavaScript, this generates ugly HTML source. For instance, if there are hundreds of stations in the XML and we want to create one Google Map marker for every station, we might write the following in the JSP:
/*
<% for (int i = 0; i < count_station; i++) { %>
marker[idx++] = new GMarker('<%=station.element("id").getText()%>', ...);
<% } %>
*/
And this will result in hundreds of lines like "marker[idx++] = new GMarker('aacc', ...);" in the HTML generated from this JSP. On the other hand, if we can manipulate the XML file directly with JavaScript, the code is much more elegant. To be honest, I have always thought JSP is ugly. If we really want the web interface to do more, we would be better off back in the days of ActiveX objects, where the objects' code was compiled and executed much faster...
To avoid such ugly code, we can manipulate the XML file purely in JavaScript. The specific object to use depends on the browser:
function xmlMicoxLoader(url) {
  //by Micox: micoxjcg@yahoo.com.br.
  //http://elmicoxcodes.blogspot.com
  if (window.XMLHttpRequest) {
    var Loader = new XMLHttpRequest();
    //synchronous mode, so the file is fully loaded before the 'return' line
    Loader.open("GET", url, false);
    Loader.send(null);
    return Loader.responseXML;
  } else if (window.ActiveXObject) {
    var Loader = new ActiveXObject("Msxml2.DOMDocument.3.0");
    //synchronous mode, so the file is fully loaded before the 'return' line
    Loader.async = false;
    Loader.load(url);
    return Loader;
  }
}
var changeXml = xmlMicoxLoader("/samp.xml");
var xmlNode = changeXml.childNodes[0];
Then we can treat this xmlNode as the root of a tree structure corresponding to the structure of the XML elements, and reference elements at different levels through the corresponding levels of child nodes:
//xmlChildCount here is xmlNode.childNodes.length
for (var i = 2; i < xmlChildCount; i++, saIdx++) {
  stationNode = xmlNode.childNodes[i];
  var id = stationNode.childNodes[0].firstChild.nodeValue;
  ...
}
Note that in pretty-printed XML files, the indentation tabs, spaces, and line breaks are treated as a special kind of "text" node--in fact, even in "ugly-formatted" XML files with no spaces or line breaks we still find such "text" nodes--so unless we are very sure about the order in which "text" nodes and the XML element nodes we actually care about appear, we should always check a node's type before accessing its child nodes or node value. For example, to traverse the whole tree structure:
function xmlMicoxTree(xmlNode, ident) {
  //by Micox: micoxjcg@yahoo.com.br
  var treeTxt = ""; //temporary variable holding the content
  for (var i = 0; i < xmlNode.childNodes.length; i++) { //each child node
    if (xmlNode.childNodes[i].nodeType == 1) { //skip white-space "text" nodes
      //node name
      treeTxt = treeTxt + ident + xmlNode.childNodes[i].nodeName + ": ";
      if (xmlNode.childNodes[i].childNodes.length == 0) {
        //no children: get nodeValue
        treeTxt = treeTxt + xmlNode.childNodes[i].nodeValue;
        for (var z = 0; z < xmlNode.childNodes[i].attributes.length; z++) {
          var atrib = xmlNode.childNodes[i].attributes[z];
          treeTxt = treeTxt + " (" + atrib.nodeName + " = " + atrib.nodeValue + ")";
        }
        treeTxt = treeTxt + "\n";
      } else if (xmlNode.childNodes[i].childNodes.length > 0) {
        //has children: get the first child's value
        treeTxt = treeTxt + xmlNode.childNodes[i].firstChild.nodeValue;
        for (var z = 0; z < xmlNode.childNodes[i].attributes.length; z++) {
          var atrib = xmlNode.childNodes[i].attributes[z];
          treeTxt = treeTxt + " (" + atrib.nodeName + " = " + atrib.nodeValue + ")";
        }
        //recurse into the children's children
        treeTxt = treeTxt + "\n" + xmlMicoxTree(xmlNode.childNodes[i], ident + "> > ");
      }
    }
  }
  return treeTxt;
}
Besides the problem of "text" nodes, performance is another important drawback of using JavaScript to manipulate XML files, especially when the file is large and there are many nodes to process. For example, when used to analyze the result XML file of daily RDAHMM, there are so many stations to process that the loading time of the page becomes unacceptable. So we went back to the Java code solution, which is much faster, but produces ugly code and a large HTML file. When network connectivity is good, which is the normal case, the loading time depends mainly on the execution of the JSP code, so this makes the portlet respond more quickly.
Plotting problem and portlet performance
The present plotting of RDAHMM uses lines, which leaves out outliers whose status differs from that of the days immediately before and after them, and papers over periods of missing data when the statuses before and after the gap are the same. To solve this problem, we now plot with points: all outliers are plotted, and periods of missing data are simply left as sections with no points.
The portlet has a very long loading time for analyzing the status change XML file and creating the markers. One solution is to create only the resources needed for markers within the current range of the map, and build the rest on the fly as users move around the map.
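A minimal sketch of this lazy-marker idea, with the viewport filtering factored out into plain functions. The station object fields, `inBounds`, `stationsToCreate`, and the `created` bookkeeping are illustrative assumptions, not the portlet's actual code; only the Maps API names in the comment (GEvent, GLatLng, GMarker) come from the API the portlet already uses:

```javascript
// Decide whether a station falls inside the current map viewport.
// bounds is a plain object {south, west, north, east} in degrees.
function inBounds(station, bounds) {
  return station.lat >= bounds.south && station.lat <= bounds.north &&
         station.lon >= bounds.west && station.lon <= bounds.east;
}

// Return the stations that are visible but do not have a marker yet,
// so only those get GMarker resources built on this map move.
function stationsToCreate(stations, bounds, created) {
  var result = [];
  for (var i = 0; i < stations.length; i++) {
    if (!created[stations[i].id] && inBounds(stations[i], bounds)) {
      result.push(stations[i]);
    }
  }
  return result;
}

// In the portlet this would be wired to the map roughly like:
//   GEvent.addListener(map, "moveend", function () {
//     var todo = stationsToCreate(stations, currentBounds(map), created);
//     for (var i = 0; i < todo.length; i++) {
//       map.addOverlay(new GMarker(new GLatLng(todo[i].lat, todo[i].lon)));
//       created[todo[i].id] = true;
//     }
//   });
```

The `created` map ensures each marker is built at most once, so repeated panning over the same area costs nothing extra.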