Solved a problem with our previous HBase and Hadoop configuration that had caused various errors at runtime. The root cause was the lack of support for file "append" operations in the version of HDFS we had been using. Because HBase requires append support in HDFS, its absence caused errors such as missing data blocks and write timeouts in our earlier runs. We had not noticed the problem before because, at smaller data scales, HBase issues too few write operations to trigger it. The problem is now solved by upgrading to the latest stable versions of Hadoop (1.0.4) and HBase (0.94.2) and turning on append support in the HDFS configuration.
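For reference, in Hadoop 1.0.x the change amounts to one property in hdfs-site.xml; this is a sketch of the relevant fragment, not our full configuration:

```xml
<!-- hdfs-site.xml: enable the append support that HBase requires -->
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
```

The NameNode and DataNodes need a restart after this change for it to take effect.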
Now the previous errors no longer appear during job runs, but another problem has come up: data locality is low and the job runs very slowly. What is stranger is that the Hadoop JobTracker prints very confusing logs about map task scheduling. For example, it reports that a map task's data split is on a given node and deploys the task on that node, yet still counts the task as rack-local instead of data-local. Also, the JobTracker does not consistently place tasks on the nodes containing their data splits; the assignment looks almost random. One possible cause is the hostname configuration on the nodes in alamo: the "hostname" command returns something like "c029", while "System.getenv( "HOSTNAME" )" in Java returns "c029.cm.cluster". We need more investigation to solve this problem.

To do next: solve the low data locality problem and run the job on the whole ClueWeb09 CatB data set.
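The hostname mismatch described above can be checked with a small Java snippet; the class name here is hypothetical, and it simply prints the two views of the local hostname side by side:

```java
import java.net.InetAddress;

// Hypothetical diagnostic: the shell environment may report a fully
// qualified name ("c029.cm.cluster") while resolution through the
// network stack reports the short name ("c029"), or vice versa.
public class HostnameCheck {
    public static void main(String[] args) throws Exception {
        // Hostname from the shell environment, as inherited by the JVM
        String envHost = System.getenv("HOSTNAME");
        // Hostname resolved through the local network stack
        String dnsHost = InetAddress.getLocalHost().getHostName();
        System.out.println("HOSTNAME env var : " + envHost);
        System.out.println("InetAddress name : " + dnsHost);
    }
}
```

If the two names differ, the JobTracker may fail to match a TaskTracker's reported hostname against the block locations returned by the NameNode, which would explain data-local tasks being miscounted as rack-local.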