Building an RDAHMM model from one day's realtime data is compute-intensive, so it turns out to be impossible to run the realtime RDAHMM service for seven networks on one machine:
when we start the service for all networks at the same time, each model takes much longer to build, and the build time varies widely among stations. When we tried this on gf14, it ranged from 8 minutes to more than 9 hours.
So we are considering running the service on TeraGrid, expecting that the grid services will balance this load across more resources. So far we have run into the following difficulties:
1. We can't get our present rdahmm executable to run on the TeraGrid clusters. Some of the clusters may not have the correct version of gcc and thus can't be configured with the right libraries. We set up all the necessary libraries on Cobalt, but got a floating-point exception when trying to run the executable. We may need to get the latest rdahmm source code and recompile it.
2. The TeraGrid services don't support continuous jobs, i.e., jobs that run "forever" once started, until we kill them manually. Further, child processes created by these jobs are not scheduled or managed by the job management system. So even if we started the services by submitting them as jobs with a fake "estimated cpu time" parameter value, the child processes they create to build RDAHMM models would still run on the same machine as the service, and the problem would remain the same.
3. We may be able to redesign the realtime RDAHMM service so that the service runs on our own server and submits jobs to TeraGrid to build the RDAHMM models, then retrieves the result files. But this will complicate the structure of the service. Also, the estimated completion time of these jobs may still be long because of the time they spend waiting in the queue.
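The redesign in item 3 could be sketched roughly as below. This is only an illustration under assumptions, not the actual service: `build_model` is a hypothetical stand-in for "submit one model-building job to TeraGrid and fetch its result file", and the station IDs, `max_workers` value, and return format are invented for the example. A local thread pool plays the role of the remote resources, so the sketch says nothing about real queue waiting times.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_model(station_id):
    """Hypothetical placeholder for one remote model-building job.

    In a real redesign this would submit a job to the grid and
    download the result file; here it just returns a fake file name.
    """
    return station_id + ".model"


def run_model_jobs(station_ids, submit_fn=build_model, max_workers=4):
    """Run one model-building job per station, at most max_workers at a time.

    Returns (results, failures): results maps station -> job output,
    failures maps station -> error message for jobs that raised.
    """
    results = {}
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which station each pending job belongs to.
        pending = {pool.submit(submit_fn, s): s for s in station_ids}
        for future in as_completed(pending):
            station = pending[future]
            try:
                results[station] = future.result()
            except Exception as exc:
                # One failed job should not stop the others.
                failures[station] = str(exc)
    return results, failures


if __name__ == "__main__":
    # Station IDs here are made up for the example.
    done, failed = run_model_jobs(["stn1", "stn2", "stn3"])
    print(sorted(done.items()), failed)
```

The service itself stays on our server and only keeps track of which jobs are pending, which is the extra bookkeeping that item 3 warns about.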
Monday, August 4, 2008