Sunday, October 28, 2012

Weekly report 20121028

Done last week:

1. Solved the low data locality problem in HBase MapReduce programs. The problem is caused by Hadoop JobTracker (JT) and HBase TableInputFormat (TIF) obtaining hostnames in two different ways. When JT gets the hostnames of TaskTrackers (TT), it uses InetAddress.getHostName(), which returns something like "c056.cm.cluster". But when TIF gets the locations of data splits for the MR programs, which are actually table regions, it first gets the IP addresses of the region servers and then uses reverse DNS to resolve their hostnames. The reverse DNS lookup returns something like "c056.cm.cluster.". Note the trailing dot in this hostname. It is not an error, but the absolute, "fully qualified" form of the domain name as defined by the DNS specification (RFC 1034).

Both hostnames are legal, but the difference in the trailing dot caused the JT to treat them as names of different nodes, so it failed to recognize the data locality between a region server and the TaskTracker running on the same node. This subtle problem was hard to discover, since these hostnames always appear at the end of sentences in the JT logs, which makes the trailing dot look like a period. Furthermore, I had to dig deep into the source code of Hadoop and HBase to find the cause of the difference. To solve the problem, I ended up writing my own customized TableInputFormat class that removes the trailing dot from region server hostnames when getting region locations for data splits.
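The normalization step itself is small. The following is only a minimal sketch of it, not the actual customized TableInputFormat class: the class name HostnameNormalizer and the methods stripTrailingDot and sameHost are hypothetical names chosen for illustration, and the sketch leaves out the HBase-specific code that would call it when building split locations.

```java
// Hypothetical helper, for illustration only: it shows the trailing-dot
// normalization, not the real customized TableInputFormat.
final class HostnameNormalizer {

    private HostnameNormalizer() {
        // Static utility class, not meant to be instantiated.
    }

    /**
     * Removes one trailing dot from a hostname, so that the absolute form
     * "c056.cm.cluster." becomes "c056.cm.cluster". Null, empty, and
     * dot-free names are returned unchanged.
     */
    static String stripTrailingDot(String hostname) {
        if (hostname == null || hostname.isEmpty()) {
            return hostname;
        }
        if (hostname.endsWith(".")) {
            return hostname.substring(0, hostname.length() - 1);
        }
        return hostname;
    }

    /**
     * Compares two hostnames after normalization. DNS names are
     * case-insensitive, so the comparison ignores case as well.
     */
    static boolean sameHost(String a, String b) {
        if (a == null || b == null) {
            return false;
        }
        return stripTrailingDot(a).equalsIgnoreCase(stripTrailingDot(b));
    }
}
```

With this helper, HostnameNormalizer.sameHost("c056.cm.cluster", "c056.cm.cluster.") is true, whereas a plain String.equals on the two names is false, which is the kind of mismatch that hides the locality.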

2. Found several papers from Richard McCreadie and Jimmy Lin about building inverted indexes with MapReduce. I need to read the details to make a complete comparison between their work and ours, but I think the basic difference is that they didn't use HBase tables to store raw text data and index data. Our strategy may also have better support for real-time document insertion and indexing.

To do next:

1. Run the system over larger data sets with all these problems solved, and collect performance measurements.

2. Read the related papers and work on the paper draft.
