Powered by GitBook

MapReduce

Count the word frequency of a web page

for loop + hashmap
多机器for loop + hashmap + merge，合并时候会是bottle neck
多机器map reduce
- Map: 机器1，2负责把文章拆分为一个一个单词
- Reduce：机器3，4各负责一部分word的合并

Map and Reduce

Framework flow:

input
split
map 实现怎么把文章切分成单词
传输整理
reduce 实现怎么把单词统一在一起
output

其中map和reduce函数是我们需要实现的

map (key, value) key为文章存储地址，value为文章内容
reduce (key, value) key为map输出的key，value为map输出的value

Question?

多少台机器map，多少台reduce？自己决定
机器越多越好？越多越快，但是启动机器的时间相应也变长了
不考虑启动时间的话，reduce的机器也不是越多越快，因为key的数目就是上限

传输整理的步骤

Map端用一个HashMap先做一次合并，把相同的key合并到一起（一台机器上的？）
Reducer端把相同的key排序到一起，快速排序

Build inverted index with MapReduce

//key: the id of a doc
//value: the content of the line
Map(String key, String value)
  for each word in value
    output(word, key)

//key: the name of the word
//valueList: the appearance of this word in docs
Reduce(String key, list valueList)
  List sumList;
  for value in valueList:
    sumList.append(value)
  outputFinal(key, sumList)

results matching ""

No results matching ""