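When learning MapReduce, the Word Count program is the most basic introductory exercise. Different implementations can differ in execution efficiency; below is one example written in Python. Both scripts read from standard input and write to standard output.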

Map:

#!/usr/bin/python
#
#  WordCount mapper in Python
#  Author: Zeng, Xi
#  SID:    1010105140
#  Email:  [email protected]
 
import sys
import re
 
def main(argv):
  # A word starts with a letter and continues with letters or digits.
  pattern = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
  # Iterating over stdin stops at end of input, so no EOF handling is needed.
  for line in sys.stdin:
    for word in pattern.findall(line):
      # Emit "word<TAB>1"; the sort phase groups these pairs by word.
      print(word + "\t" + "1")

if __name__ == "__main__":
  main(sys.argv)
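
Since different implementations differ in efficiency, one common variation is to sum counts inside the mapper before emitting anything, so that each distinct word produces one output line instead of one line per occurrence. The following is a minimal sketch of that idea, not part of the original scripts; the function name `map_with_local_counts` is hypothetical, and it assumes the distinct words of one input split fit in memory.

```python
import re
from collections import Counter

# Same word pattern as the mapper above.
WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")

def map_with_local_counts(lines):
    """Count words locally and return one (word, count) pair per distinct word."""
    counts = Counter()
    for line in lines:
        counts.update(WORD.findall(line))
    # Sorting is optional; it only makes the output deterministic.
    return sorted(counts.items())

if __name__ == "__main__":
    # Illustrative in-memory input; a real mapper would pass sys.stdin instead.
    sample = ["the cat and the hat", "the end"]
    for word, count in map_with_local_counts(sample):
        print("%s\t%d" % (word, count))
```

The reducer needs no change for this variant, because it already adds up whatever integer count follows the tab.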

Reduce:

#!/usr/bin/python
#
#  WordCount reducer in Python
#  Author: Zeng, Xi
#  SID:    1010105140
#  Email:  [email protected]
 
import sys

# Running total for each word seen so far.
word_list = {}

## collect (key,val) pairs from sort phase
for line in sys.stdin:
    try:
        # Each input line is "word<TAB>count"; split on the first tab only.
        word, count = line.strip().split("\t", 1)
        word_list[word] = word_list.get(word, 0) + int(count)
    except ValueError as err:
        # Report malformed lines (missing tab or non-integer count) and continue.
        sys.stderr.write("Value ERROR: %(err)s\n%(data)s\n" % {"err": str(err), "data": line})

## emit results
for word, count in word_list.items():
    print(" ".join([word, str(count)]))
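
The reducer above keeps every word in a dictionary until the input ends. If the input really arrives sorted by word, as the "sort phase" comment suggests, the totals can instead be emitted one word at a time with constant memory. The sketch below shows that alternative using `itertools.groupby`; the names `parse` and `reduce_sorted` are hypothetical, and the code assumes that lines with the same word are adjacent.

```python
from itertools import groupby

def parse(lines):
    """Yield (word, count) pairs from 'word<TAB>count' lines, skipping bad ones."""
    for line in lines:
        try:
            word, count = line.rstrip("\n").split("\t", 1)
            value = int(count)
        except ValueError:
            continue  # malformed line: no tab, or a non-integer count
        yield word, value

def reduce_sorted(lines):
    """Sum counts per word, assuming equal words arrive on adjacent lines."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(value for _, value in group)

if __name__ == "__main__":
    # Illustrative sorted input; a real reducer would pass sys.stdin instead.
    sample = ["a\t1\n", "a\t1\n", "b\t1\n", "broken line\n", "c\t2\n"]
    for word, total in reduce_sorted(sample):
        print("%s %d" % (word, total))
```

If the input is not sorted, this version emits the same word more than once, so the dictionary-based reducer remains the safer choice when the ordering is not guaranteed.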