Done in last two weeks:
1. Ran the system on a 46GB data set with 28, 32, and 48 nodes. The most time-consuming step is inverted index building; the time taken with each cluster size is given in the following table:
# of nodes    time taken (sec)    mappers (total / node-local)
28            17452               90 / 47
32            15607               94 / 65
48            12562               116 / 66
We can see that the speed improves as more nodes are used, but even with 48 nodes it still took almost 3.5 hours to build the inverted index. One possible reason is still-low node-level data locality. The third column of the table gives the total number of mappers used in each configuration, followed by the number of node-local mappers; the fraction of node-local mappers among all mappers is still not very high.
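A quick sanity check on the table: the node-local fractions and the speedups relative to the 28-node run can be computed directly from the numbers above.

```java
public class LocalityStats {
    public static void main(String[] args) {
        int[] nodes   = {28, 32, 48};
        int[] seconds = {17452, 15607, 12562};
        int[] mappers = {90, 94, 116};
        int[] local   = {47, 65, 66};
        for (int i = 0; i < nodes.length; i++) {
            double frac = 100.0 * local[i] / mappers[i];          // node-local fraction
            double speedup = (double) seconds[0] / seconds[i];     // vs. 28-node run
            System.out.printf("%d nodes: %.1f%% node-local, %.2fx speedup%n",
                              nodes[i], frac, speedup);
        }
    }
}
```

Note that going from 28 to 48 nodes (1.71x the nodes) gives only about a 1.39x speedup, which is consistent with the locality hypothesis.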
2. To address this, I investigated the task scheduler implementation in Hadoop. I modified the existing JobQueueTaskScheduler to create a new "LocalityFirstTaskScheduler", with the goal of achieving better node locality in our configuration. Because of limited resource availability on alamo, we have not had a chance to test it yet.
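The real scheduler plugs into Hadoop's TaskScheduler interface; stripped of that machinery, the selection policy it is meant to implement can be sketched as follows. The classes and field names here are simplified stand-ins for illustration, not the actual Hadoop classes.

```java
import java.util.List;

// Simplified stand-in for a pending map task; not the real Hadoop class.
class PendingTask {
    final String id;
    final List<String> splitHosts;   // hosts holding this task's data split
    PendingTask(String id, List<String> splitHosts) {
        this.id = id;
        this.splitHosts = splitHosts;
    }
}

public class LocalityFirstSketch {
    // Pick a task for the requesting host: prefer node-local tasks,
    // and fall back to a non-local task only if none is local.
    static PendingTask pickTask(String host, List<PendingTask> pending) {
        for (PendingTask t : pending) {
            if (t.splitHosts.contains(host)) {
                return t;                                   // node-local assignment
            }
        }
        return pending.isEmpty() ? null : pending.get(0);   // non-local fallback
    }
}
```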
3. I worked with Bingjing and the Quarry administrator on the issue of sharing nodes on PolarGrid. I can now access the shared nodes with higher priority, and I am still configuring MyHadoop on Quarry to make it work.
To do next:
1. Test the LocalityFirstTaskScheduler on alamo.
2. Run the system at a larger scale on PolarGrid.
Wednesday, November 14, 2012
Sunday, October 28, 2012
Weekly report 20121028
Done in last week:
1. Solved the low data locality problem in HBase MapReduce programs. The problem was caused by two different ways of obtaining hostnames in the Hadoop JobTracker (JT) and the HBase TableInputFormat (TIF). When the JT gets the hostnames of TaskTrackers (TTs), it uses InetAddress.getHostName(), which returns something like "c056.cm.cluster". But when the TIF gets the locations of the data splits for the MR programs, which are actually table regions, it first gets the IP addresses of the region servers (RSs), then uses a reverse DNS lookup to resolve their hostnames. The problem comes from the reverse DNS service, which returns something like "c056.cm.cluster.". Note the trailing dot in this hostname. It is not an error, but a more complete, "fully qualified" domain name according to the DNS specification (RFC 1034).
Both hostnames are legal, but the difference in the trailing dot caused the JT to treat them as names of different nodes, and therefore to fail to recognize the data locality between the RS and the TT. This subtle problem is hard to discover, since these hostnames always appear at the end of sentences in the JT logs, where the trailing dots look like sentence periods. Furthermore, I had to dig deep into the source code of Hadoop and HBase to find the cause of the difference. To solve the problem, I ended up writing my own customized TableInputFormat class that removes the trailing dot from RS hostnames when getting region locations for data splits.
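The fix itself is small: the customized TableInputFormat normalizes the hostname returned by reverse DNS before the job tracker compares it against task tracker names. The normalization step amounts to the following (the class and method names here are illustrative, not the actual HBase API):

```java
public class HostnameUtil {
    // Reverse DNS may return a fully qualified name with a trailing dot
    // (e.g. "c056.cm.cluster."); the job tracker knows task trackers by
    // the dotless form, so strip the dot before comparing.
    static String stripTrailingDot(String hostname) {
        if (hostname != null && hostname.endsWith(".")) {
            return hostname.substring(0, hostname.length() - 1);
        }
        return hostname;
    }
}
```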
2. Found several papers by Richard McCreadie and Jimmy Lin about building inverted indices with MapReduce. I need to read the details to make a complete comparison between their work and ours, but I think the basic difference is that they did not use HBase tables to store the raw text data and the index data. Our strategy may also have better support for real-time document insertion and indexing.
To do next:
1. Run the system over larger data sets with all these problems solved, and collect performance measurements.
2. Read the related papers and work on paper draft.
Sunday, October 21, 2012
Weekly report 20121021
Done in last week:
Solved a problem with our previous HBase and Hadoop configuration that had caused various errors at runtime. The problem was the lack of support for file "append" operations in the version of HDFS we had been using. Since HBase requires append operations in HDFS, this had caused errors such as missing data blocks and write timeouts in our previous runs. We did not discover the problem earlier because at smaller data scales there are not enough write operations in HBase to expose it. The problem is now solved by using the latest stable versions of Hadoop (1.0.4) and HBase (0.94.2) and enabling append support in the HDFS configuration.
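In Hadoop 1.0.x, append support is enabled with a single property in hdfs-site.xml:

```xml
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
```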
Now we don't see the previous errors during job runs, but another problem has come up: data locality is low and the jobs run very slowly. Even stranger, the Hadoop job tracker prints very confusing logs about map task scheduling. For example, it reports that a map task has its data split on a certain node, and deploys that task on that node, yet still counts it as rack-local instead of data-local. Also, the job tracker does not always try to place tasks on the nodes containing their data splits, but seems to assign them randomly. One possible reason could be the hostname configuration on the alamo nodes -- the "hostname" command returns something like "c029", but the hostname obtained through System.getenv("HOSTNAME") in Java returns "c029.cm.cluster". We need more investigation to solve this problem.
To do next:
Solve the low data locality problem and run the job on the whole ClueWeb09 CatB data set.
Thursday, October 4, 2012
Weekly report 20121004
Done in last week:
Solved the following issues with the MyHBase setup:
1. An exception on a region server caused by its out-of-sync system clock. A FutureGrid administrator helped solve this issue.
2. A timeout exception from a Hadoop data node. This was fixed by setting the timeout value to 0 in the Hadoop configuration files.
3. A "block-missing" exception from a Hadoop data node. This exception was gone after the first two issues were resolved.
4. A "path not found" exception from a Hadoop task tracker. After some investigation, I found that this might just be a normal status report from the task tracker that does not necessarily indicate an error, so we will live with it for now.
5. A crashing problem with an HBase region server in some runs. This might have been caused by the previous issues; after solving them and adjusting the memory limits of the data nodes and region servers, the problem seems to be gone. I will investigate again if it recurs.
6. A "socket connection refused" exception from the ZooKeeper servers. This seemed to happen while the servers were starting up, perhaps because they were not fully started yet. I have put a 30-second wait between the ZooKeeper start-up and the HBase start-up so that the ZooKeeper servers have time to initialize completely. We need more test runs to verify that this works.
To do next:
Start running on larger data scale.
Friday, September 28, 2012
Weekly report 20120928
Done in last week:
1. Investigated the impact of very large rows in our project, and found that they break several programs because those programs sometimes try to fetch whole rows. If a row's size exceeds the memory limit of a task, the program fails. I solved this by avoiding whole-row fetches with HTable.get(), and by setting the "batch" property to something like 10000 when scanning the tables.
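The idea behind the batch setting can be shown without the HBase API: instead of materializing a whole row at once, the client pulls a bounded number of cells per round trip. A minimal sketch of the pattern, with a plain List standing in for a wide HBase row:

```java
import java.util.List;

public class BatchedScanSketch {
    // Process a very wide "row" in fixed-size batches so that no more
    // than batchSize cells are held in memory at once.
    static int processInBatches(List<String> cells, int batchSize) {
        int batches = 0;
        for (int start = 0; start < cells.size(); start += batchSize) {
            int end = Math.min(start + batchSize, cells.size());
            List<String> batch = cells.subList(start, end);
            // ... process this bounded batch (at most batchSize cells) ...
            batches++;
        }
        return batches;
    }
}
```

In the real code this corresponds to HBase's Scan.setBatch(10000), which limits how many cells of a row the scanner returns per call instead of returning the entire row at once.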
2. Tried to run the project with a larger data set: 11GB compressed, ~40GB after being loaded into HBase tables. Got some errors from the HBase region servers and am investigating the issue. Possible causes include incompatible Java versions and a limited heap size.
To do next:
Solve the issues with large data set and test on larger scale.
Wednesday, September 19, 2012
Weekly report 20120919
Done in last week:
1. Got all signatures for the nomination of candidacy form.
2. Read articles about data locality and compression of HBase.
3. In the HBase inverted index project, added a Bloom filter to the term count table in the synonym scoring step, and modified some implementations so that terms with a count of only 1 are no longer stored in the table. This improved performance by 6% on a small 3.5GB data set; a more significant improvement is expected for larger data sets.
4. Investigated the impact of very large rows in the inverted index table on the performance and reliability of the whole system. So far it looks fine for our data set.
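On item 3: the Bloom filter is enabled on the HBase table itself, but the reason it helps can be seen from a toy version. Membership tests answer "definitely absent" cheaply, which lets the region server skip disk reads for terms that are not in the table. This is a minimal sketch of the data structure, not HBase's implementation:

```java
import java.util.BitSet;

public class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    ToyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th hash position from the key's hashCode.
    private int position(String key, int i) {
        int h = key.hashCode() * 31 + i * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(position(key, i));
    }

    // false => definitely not present; true => possibly present.
    // False positives are possible, false negatives are not.
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(position(key, i))) return false;
        }
        return true;
    }
}
```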
To do next:
1. Apply the HBase inverted index programs on larger data set.
2. More investigation about abstract data description and sharing model for cloud storage services.
Wednesday, September 12, 2012
Weekly report 20120912
Done in last week:
1. Contacted Prof. Fox and Prof. Van Gucht about nomination of candidacy form, and I expect to get the form submitted next week.
2. Read part of Gerald's dissertation.
3. Alamo was available, so I was able to apply and test some optimizations to the synonym analysis phase of the HBase Inverted Index project. The optimizations reduced the total execution time of that phase from 2852 seconds to 505 seconds on a test data set. The steps involved were:
(1) In the word pair frequency counting step, a combiner and a reducer were added to filter out word pairs that cannot possibly be synonyms.
(2) Added a word count table that records only the total hits of each word in the data set. Total-hit information is used intensively in the synonym scoring step, and this table not only makes access to it faster, but also eliminates the unnecessary recalculation of total hits that was needed before the table existed.
(3) In the synonym scoring step, added a buffer for word total hits, so that repeated accesses to the same term are served from local memory.
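The buffer in step (3) is essentially a memoizing cache in front of the word count table. A sketch of the pattern, with the table lookup abstracted as a function (the names here are illustrative, not the project's actual classes):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

public class TotalHitsCache {
    private final Map<String, Long> cache = new HashMap<>();
    private final ToLongFunction<String> tableLookup; // e.g. a Get on the word count table
    int lookups = 0; // counts actual table accesses, for illustration

    TotalHitsCache(ToLongFunction<String> tableLookup) {
        this.tableLookup = tableLookup;
    }

    // First access for a term hits the table; repeated accesses are
    // served from the in-memory cache.
    long totalHits(String term) {
        return cache.computeIfAbsent(term, t -> {
            lookups++;
            return tableLookup.applyAsLong(t);
        });
    }
}
```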
To do next:
1. Think about details of the two possible directions for my thesis topic.
2. Test the optimizations on larger data sets; use the whole ClueWeb09 CatB data set if possible.