massive data - minimal code
Disco in action
from disco.core import Job, result_iterator

def map(line, params):
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    input = ["http://discoproject.org/media/text/chekhov.txt"]
    job = Job().run(input=input, map=map, reduce=reduce)
    for word, count in result_iterator(job.wait()):
        print(word, count)
This is a fully working Disco script that computes word frequencies in a text corpus. Disco distributes the script to a cluster automatically, so it can utilize all available CPUs in parallel. For details, see the Disco tutorial.
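The map and reduce logic above can also be exercised locally, without a cluster. The sketch below mimics disco.util.kvgroup with itertools.groupby; the names local_map and local_reduce are illustrative, not part of the Disco API.

```python
# A local sketch of the same word-count logic, runnable without Disco.
# local_map/local_reduce are hypothetical names, not Disco functions.
from itertools import groupby
from operator import itemgetter

def local_map(line):
    # Emit a (word, 1) pair for every word on the line, like map above.
    for word in line.split():
        yield word, 1

def local_reduce(pairs):
    # Group sorted (word, count) pairs by word and sum the counts,
    # mirroring kvgroup(sorted(iter)) in the reduce above.
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == '__main__':
    for word, count in local_reduce(local_map("to be or not to be")):
        print(word, count)
```

In a real job, Disco sorts and partitions the map output across the cluster before the reduce phase; the local version only simulates the per-partition grouping step.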
Highlights
- Easy to install on Linux, Mac OS X, and FreeBSD.
- Efficient data-locality-preserving IO, either over HTTP or the built-in petabyte-scale Disco Distributed Filesystem.
- Supports profiling and debugging of MapReduce jobs.
- Randomly access data and auxiliary results through out-of-band results.
- Run jobs written in any language using the worker protocol.
- Build and query indices with billions of keys and values, using DiscoDB.
...and more! See the documentation for details.
Need help with Disco? You can reach us on our IRC channel, #discoproject on Freenode, on the Disco discussion group, or by opening an issue at the Disco repository on GitHub.