Sunday, October 28, 2012

Weekly report 20121028

Done in last week:

1. Solved the low data locality problem in HBase MapReduce programs. The problem is caused by two different ways of getting hostnames in the Hadoop JobTracker (JT) and the HBase TableInputFormat (TIF). When the JT tries to get the hostnames of TaskTrackers (TTs), it uses InetAddress.getHostName(), which returns something like "c056.cm.cluster". But when the TIF tries to get the locations of data splits (for the MR programs), which are actually table regions, it first gets the IP addresses of the region servers, then uses a reverse DNS lookup to resolve their hostnames. The problem comes from the reverse DNS lookup, which returns something like "c056.cm.cluster.". Note the trailing dot in this hostname. It is not an error, but a more complete, "fully qualified" domain name according to the DNS specification (RFC 1034).

Both hostnames are legal, but the difference in the trailing dot caused the JT to treat them as names of different nodes, and therefore fail to recognize the data locality between the region servers and the TTs. This subtle problem is hard to discover, since these hostnames always appear at the end of sentences in the JT logs, where the trailing dots look like periods. Furthermore, I had to dig deep into the source code of Hadoop and HBase to find the cause of the difference. To solve the problem, I ended up writing my own customized TableInputFormat class that removes the trailing dot from region server hostnames when getting region locations for data splits.
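The core of the fix is just normalizing the hostname that comes back from reverse DNS before it is attached to a data split. A minimal sketch of that normalization (the class and method names here are hypothetical illustrations, not the actual code of our customized TableInputFormat):

```java
// Sketch: strip the trailing dot from a reverse-DNS hostname so it
// matches the name the JobTracker sees. HostnameUtil is a
// hypothetical helper class, not part of Hadoop or HBase.
public class HostnameUtil {

    /**
     * Reverse DNS returns fully qualified names with a trailing dot
     * ("c056.cm.cluster."), while the JobTracker sees
     * "c056.cm.cluster". Strip the dot so the two names compare equal.
     */
    public static String stripTrailingDot(String hostname) {
        if (hostname != null && hostname.endsWith(".")) {
            return hostname.substring(0, hostname.length() - 1);
        }
        return hostname;
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingDot("c056.cm.cluster.")); // c056.cm.cluster
        System.out.println(stripTrailingDot("c056.cm.cluster"));  // unchanged
    }
}
```

With this normalization applied to every split location, the JT can match the split's hostname against the TT's hostname and schedule the map task as data-local.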

2. Found several papers by Richard McCreadie and Jimmy Lin about building inverted indexes with MapReduce. I need to read the details to make a complete comparison between their work and ours, but I think the basic difference is that they did not use HBase tables to store the raw text data and the index data. Our strategy may also have better support for real-time document insertion and indexing.

To do next:

1. Run the system over larger data sets with all these problems solved, and collect performance measurements.

2. Read the related papers and work on paper draft.

Sunday, October 21, 2012

Weekly report 20121021

Done in last week:

Solved a problem with our previous HBase and Hadoop configuration that had caused various errors at runtime. The problem was the lack of support for file "append" operations in the version of HDFS we had been using. Since HBase requires append operations in HDFS, this had caused errors such as missing data blocks and write timeouts in our previous runs. We did not find this problem earlier because at smaller data scales there are not enough write operations in HBase to expose it. The problem is solved now by using the latest stable versions of Hadoop (1.0.4) and HBase (0.94.2) and turning on append support in the HDFS configuration.
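For reference, the switch for this in Hadoop 1.x is the dfs.support.append property in hdfs-site.xml (a fragment only; the rest of our configuration is omitted here):

```xml
<!-- hdfs-site.xml: enable the append/sync support that HBase needs -->
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
```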

Now we no longer see the previous errors during job runs, but another problem has come up: data locality is low and the job runs very slowly. What is weirder is that the Hadoop JobTracker prints very confusing logs about map task scheduling. For example, it reports that a map task has its data split on one node, and deploys that task on that node, but still counts the task as rack-local instead of data-local. Also, the JobTracker does not always try to put tasks on the nodes containing their data splits, but seems to place them in a random way. One possible reason could be the hostname configuration on the nodes in alamo: the "hostname" command returns something like "c029", but the hostname obtained through System.getenv("HOSTNAME") in Java returns "c029.cm.cluster". We need more investigation to solve this problem.
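A quick way to check what names the Java side actually sees on a node is to print both sources side by side (a minimal sketch; the actual output depends on the node's DNS and environment setup, so none is shown here):

```java
import java.net.InetAddress;

// Print the two hostname views that may disagree on a cluster node.
public class HostnameCheck {
    public static void main(String[] args) throws Exception {
        // Name resolved by the JVM (roughly what the JobTracker sees).
        System.out.println("InetAddress: "
                + InetAddress.getLocalHost().getHostName());
        // Name from the shell environment; may be null if HOSTNAME is unset.
        System.out.println("HOSTNAME env: " + System.getenv("HOSTNAME"));
    }
}
```

If the two printed names differ (for example "c029" versus "c029.cm.cluster"), the JobTracker's locality bookkeeping can end up comparing mismatched names, which would explain the rack-local counting we are seeing.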

To do next:

Solve the low data locality problem and run the job on the whole ClueWeb09 CatB data set.

Thursday, October 4, 2012

Weekly report 20121004

Done in last week:

Solved the following issues with the MyHBase setup:

1. An unsynced-clock exception on a region server, caused by its out-of-sync system clock. A FutureGrid administrator helped solve this issue.

2. A timeout exception from a Hadoop data node. This was fixed by setting the timeout value to 0 in the Hadoop configuration files.

3. A "block-missing" exception from Hadoop data node. This exception was gone after the first two issues were resolved.

4. A "path not found" exception from Hadoop task tracker. After some investigation, I found that this might just be normal some status report from the task tracker that does not necessarily indicate an error. So we will try to live with it for now.

5. A crashing problem with an HBase region server in some runs. This might have been caused by the previous issues. After solving those issues and adjusting the memory limits of the data nodes and region servers, the problem seems to be gone. I will investigate again if it recurs.

6. A "socket connection refused" exception from the zookeepers. This seemed to happen during the start time of the servers, maybe because the servers were not totally started yet. I have put a 30-second wait time between the zookeeper start-up and the HBase start-up so that the zookeepers get a time to make everything right. We need more test runs to verify if this works.

To do next:

Start running on larger data scale.