Done last week:
1. Contacted Prof. Fox and Prof. Van Gucht about the nomination of candidacy form; I expect to submit the form next week.
2. Read part of Gerald's dissertation.
3. Alamo was available, so I was able to apply and test some optimizations to the synonym analysis phase of the HBase Inverted Index project. On a test data set, the optimizations reduced the total execution time of that phase from 2852 seconds to 505 seconds, roughly a 5.6x speedup. The optimizations involved the following steps:
(1) In the word pair frequency counting step, added a combiner and a reducer to filter out the word pairs that cannot be synonyms.
(2) Added a word count table that records only the total hits of each word in the data set. The synonym scoring step uses this information intensively, so the table both makes access to it faster and eliminates the total hits recalculation that was needed before the table existed.
(3) In the synonym scoring step, added an in-memory buffer for word total hits, so that repeated lookups of the same term are served from local memory.
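The three steps above can be sketched in plain Python, outside Hadoop and HBase, to show the data flow. This is only a rough illustration under stated assumptions, not the project's code: the names `combine`, `reduce_and_filter`, `WordHits`, and `score` are made up for the example, the filter threshold `MIN_PAIR_COUNT` is hypothetical (the actual criterion for ruling out a pair is not described here), and the scoring formula is a placeholder.

```python
from collections import Counter
from functools import lru_cache

# Hypothetical threshold; the real filtering criterion is an assumption.
MIN_PAIR_COUNT = 2


def combine(pairs):
    """Combiner stand-in: sum word pair counts locally for one mapper."""
    return Counter(pairs)


def reduce_and_filter(partials, min_count=MIN_PAIR_COUNT):
    """Reducer stand-in: merge partial counts, then drop unlikely pairs.

    The threshold is applied only to the merged totals, because a partial
    count from a single mapper can be small even when the total is large.
    """
    total = Counter()
    for partial in partials:
        total.update(partial)
    return {pair: n for pair, n in total.items() if n >= min_count}


class WordHits:
    """Stand-in for the word count table with an in-memory buffer in front."""

    def __init__(self, table, buffer_size=1024):
        self._table = table      # a dict here; an HBase table in the project
        self.table_reads = 0     # counts lookups that reach the table
        self.get = lru_cache(maxsize=buffer_size)(self._read)

    def _read(self, word):
        self.table_reads += 1
        return self._table[word]


def score(pair_counts, hits):
    """Placeholder score (not the project's formula): pair count / rarer word's hits."""
    return {
        (a, b): n / min(hits.get(a), hits.get(b))
        for (a, b), n in pair_counts.items()
    }


# Toy input: word pairs emitted by two mappers.
mapper_1 = [("car", "auto"), ("car", "auto"), ("car", "tree")]
mapper_2 = [("car", "auto"), ("auto", "vehicle"), ("auto", "vehicle")]

pair_counts = reduce_and_filter([combine(mapper_1), combine(mapper_2)])
hits = WordHits({"car": 10, "auto": 6, "vehicle": 4, "tree": 8})
scores = score(pair_counts, hits)
```

In this toy run the pair ("car", "tree") is dropped before scoring, and "auto" is read from the table once even though it appears in two surviving pairs; the second lookup is served from the buffer.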
To do next:
1. Think about details of the two possible directions for my thesis topic.
2. Test the optimizations on a larger data set, using the whole ClueWeb09 CatB data set if possible.
Wednesday, September 12, 2012