Friday, September 28, 2012

Weekly report 20120928

Done in last week:

1. Investigated the impact of very large rows in our project and found that they break several programs, which sometimes try to fetch whole rows. If a row's size exceeds the memory limit of a task, the program fails. I solved this by no longer fetching whole rows with HTable.get(), and by setting the "batch" property to something like 10000 when scanning the tables.

2. Tried to run the project with a larger data set - 11GB compressed, about 40GB after being loaded into HBase tables. Got some errors from the HBase region servers and am still investigating them. Possible causes include incompatible Java versions and limited heap size.
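
The batching idea from item 1 can be illustrated with a small sketch. This is plain Python with no HBase client involved: the row is modeled as a dictionary of column qualifiers to values, and `scan_row_in_batches` is a hypothetical helper, not part of the project. In the HBase Java client the corresponding setting is, as far as I know, `Scan.setBatch(int)`, which caps the number of cells returned per call for a row.

```python
from typing import Dict, Iterator, List, Tuple


def scan_row_in_batches(
    row: Dict[str, bytes], batch: int
) -> Iterator[List[Tuple[str, bytes]]]:
    """Yield the cells of one wide row in chunks of at most `batch` cells.

    Only one chunk is held by the consumer at a time, so memory use is
    bounded by the batch size rather than by the full row width.
    """
    if batch <= 0:
        raise ValueError("batch must be positive")
    chunk: List[Tuple[str, bytes]] = []
    # Cells are visited in sorted qualifier order, as a stand-in for the
    # sorted column order of a real table (an assumption of this sketch).
    for qualifier in sorted(row):
        chunk.append((qualifier, row[qualifier]))
        if len(chunk) == batch:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


if __name__ == "__main__":
    wide_row = {f"doc{i:05d}": b"1" for i in range(25)}
    for cells in scan_row_in_batches(wide_row, batch=10):
        print(len(cells), cells[0][0], "...", cells[-1][0])
```

A consumer that processes each chunk and then discards it never needs the whole row in memory, which is the property the fix relies on.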

To do next:

Solve the issues with the large data set and test at a larger scale.

Wednesday, September 19, 2012

Weekly report 20120919

Done in last week:

1. Got all signatures for the nomination of candidacy form.

2. Read articles about data locality and compression of HBase.

3. In the HBase inverted index project, added a Bloom filter to the term count table used in the synonym scoring step, and changed the implementation so that terms with a count of only 1 are no longer stored in the table. This improved performance by 6% on a small 3.5GB data set. A more significant improvement is expected for larger data sets.

4. Thought about and investigated the impact of very large rows in the inverted index table on the performance and reliability of the whole system. So far it looks fine for our data set.
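
The change in item 3 can be sketched as follows. This is a minimal in-memory illustration in plain Python, not the project's code: `BloomFilter`, `build_term_count_table` and `lookup_count` are hypothetical names, the filter sizes are arbitrary, and the rule that a missing term has a count of 1 is an assumption (it holds only if every looked-up term occurs in the data set at least once).

```python
import hashlib
from typing import Dict, Iterator, Tuple


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def build_term_count_table(
    counts: Dict[str, int]
) -> Tuple[Dict[str, int], BloomFilter]:
    """Store only terms with count > 1 and index them in a Bloom filter."""
    table = {term: count for term, count in counts.items() if count > 1}
    bloom = BloomFilter()
    for term in table:
        bloom.add(term)
    return table, bloom


def lookup_count(term: str, table: Dict[str, int], bloom: BloomFilter) -> int:
    """Return the count of a term, skipping the table read when possible."""
    if not bloom.might_contain(term):
        return 1  # assumption: an unstored term occurred exactly once
    # A false positive still falls through to the table and defaults to 1.
    return table.get(term, 1)
```

Two savings are involved: the table is smaller because singleton terms are dropped, and most lookups for those dropped terms are answered by the filter without reading the table.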

To do next:

1. Run the HBase inverted index programs on a larger data set.

2. Investigate further the abstract data description and sharing model for cloud storage services.

Wednesday, September 12, 2012

Weekly report 20120912

Done in last week:

1. Contacted Prof. Fox and Prof. Van Gucht about the nomination of candidacy form. I expect to submit the form next week.

2. Read part of Gerald's dissertation.

3. Alamo was available so I was able to apply and test some optimizations to the synonym analysis phase of the HBase Inverted Index project. The optimizations were able to reduce the total execution time of that phase from 2852 seconds to 505 seconds on a test data set. Steps involved in the optimizations include:

(1) In the word pair frequency counting step, added a combiner and a reducer to filter out word pairs that cannot be synonyms.

(2) Added a word count table that records only the total hits of each word in the data set. The total hits information is used heavily in the synonym scoring step, and this table not only makes access to it faster but also eliminates the total hits recalculation that was needed before the table existed.

(3) In the synonym scoring step, added a buffer for word total hits, so that repeated access to the same term can be done in local memory.
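
The buffer in step (3) can be sketched as a small in-memory cache in front of the word count table. This is a plain Python illustration under stated assumptions: `TotalHitsBuffer` is a hypothetical name, the `fetch` callable stands in for a read from the word count table, and the least-recently-used eviction policy and the capacity are my own choices, since the report does not say how the buffer is bounded.

```python
from collections import OrderedDict
from typing import Callable


class TotalHitsBuffer:
    """Cache of word -> total hits, backed by a slower lookup function."""

    def __init__(self, fetch: Callable[[str], int], capacity: int = 10000) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._fetch = fetch          # stand-in for a word count table read
        self._capacity = capacity
        self._cache: "OrderedDict[str, int]" = OrderedDict()
        self.remote_lookups = 0      # how many times the table was consulted

    def get(self, term: str) -> int:
        if term in self._cache:
            # Served from local memory; mark the term as recently used.
            self._cache.move_to_end(term)
            return self._cache[term]
        value = self._fetch(term)
        self.remote_lookups += 1
        self._cache[term] = value
        if len(self._cache) > self._capacity:
            # Evict the least recently used entry (assumed policy).
            self._cache.popitem(last=False)
        return value


if __name__ == "__main__":
    word_counts = {"car": 120, "automobile": 45, "vehicle": 80}
    buffer = TotalHitsBuffer(word_counts.__getitem__, capacity=2)
    for term in ["car", "car", "automobile", "car"]:
        buffer.get(term)
    print(buffer.remote_lookups)
```

Because the scoring step looks up the same frequent terms many times, most calls are answered from the dictionary and only the first access to each term reaches the table.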

To do next:

1. Think about details of the two possible directions for my thesis topic.

2. Test the optimizations on a larger data set. Use the whole ClueWeb09 CatB data set if possible.