Done in last two weeks:
1. Ran the system on a 46 GB dataset with 28, 32, and 48 nodes. The most time-consuming step is inverted index building; the time taken at each cluster size is given in the following table:
# of nodes | time taken (sec) | total mappers / node-local mappers
28         | 17452            | 90 / 47
32         | 15607            | 94 / 65
48         | 12562            | 116 / 66
We can see that speed improves as more nodes are used, but even with 48 nodes it still took almost 3.5 hours to build the inverted index. One likely reason is still-low node-level data locality. The third column in the table gives the total number of mappers in each configuration, followed by the number of node-local mappers; the node-local fraction remains fairly low across all three runs.
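For concreteness, the node-local fractions implied by the table can be computed directly. A small sketch using the numbers above (class and variable names are illustrative only):

```java
// Compute the node-local mapper fraction for each run in the table.
public class LocalityStats {
    public static void main(String[] args) {
        // {nodes, seconds, total mappers, node-local mappers}
        int[][] runs = { {28, 17452, 90, 47}, {32, 15607, 94, 65}, {48, 12562, 116, 66} };
        for (int[] r : runs) {
            double frac = 100.0 * r[3] / r[2];
            System.out.printf("%d nodes: %.1f%% node-local (%d of %d mappers)%n",
                    r[0], frac, r[3], r[2]);
        }
    }
}
```

So even the best run (32 nodes) keeps only about 69% of its mappers node-local, and the 48-node run drops to roughly 57%.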
2. To address this problem, I investigated the task scheduler implementation in Hadoop. Starting from the existing JobQueueTaskScheduler, I created a new "LocalityFirstTaskScheduler" aimed at achieving better node locality in our configuration. Because of resource availability on alamo, we haven't had a chance to test it yet.
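The core idea behind a locality-first scheduler can be sketched without the Hadoop API: when a node asks for work, prefer a pending task whose input split lives on that node, and fall back to a remote task only when no local one remains. The class and method names below are hypothetical, a simplified standalone sketch rather than the actual TaskScheduler interface:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of locality-first assignment (not the real Hadoop API).
class LocalityFirstScheduler {
    // Pending tasks indexed by the host that stores their input split.
    private final Map<String, Deque<String>> byHost = new HashMap<>();
    // All pending tasks in arrival order, used as the non-local fallback.
    private final Deque<String> pending = new ArrayDeque<>();

    void addTask(String taskId, String hostWithSplit) {
        byHost.computeIfAbsent(hostWithSplit, h -> new ArrayDeque<>()).add(taskId);
        pending.add(taskId);
    }

    // Returns a node-local task if one exists; otherwise any pending task
    // (which would become a rack-local or remote assignment).
    String assign(String requestingHost) {
        Deque<String> local = byHost.get(requestingHost);
        while (local != null && !local.isEmpty()) {
            String t = local.poll();
            if (pending.remove(t)) return t; // still pending, assign locally
        }
        return pending.poll(); // no local task left; fall back
    }
}

public class Demo {
    public static void main(String[] args) {
        LocalityFirstScheduler s = new LocalityFirstScheduler();
        s.addTask("m1", "nodeA");
        s.addTask("m2", "nodeB");
        s.addTask("m3", "nodeA");
        System.out.println(s.assign("nodeB")); // m2: node-local task preferred
        System.out.println(s.assign("nodeB")); // falls back to a remote task
    }
}
```

In Hadoop terms, the same preference would be applied inside assignTasks() when responding to a TaskTracker heartbeat, instead of taking tasks strictly in queue order as JobQueueTaskScheduler does.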
3. I worked with Bingjing and the Quarry administrator on the issue of sharing nodes on PolarGrid. I can now access the shared nodes with higher priority. I am still trying to configure MyHadoop on Quarry to make it work.
To do next:
1. Test the LocalityFirstTaskScheduler on alamo.
2. Run the system at a larger scale on PolarGrid.
Wednesday, November 14, 2012