Disco MapReduce


massive data - minimal code


Disco in action


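    from disco.core import Job, result_iterator

    # Map phase: called once per input line; emits a (word, 1) pair per token.
    def map(line, params):
        for word in line.split():
            yield word, 1

    # Reduce phase: receives the (word, 1) pairs and sums the counts per word.
    def reduce(iter, params):
        # Imported inside the function so the name is available where the
        # function actually runs (on a worker, not in this script's process).
        from disco.util import kvgroup
        # kvgroup groups consecutive pairs by key, so the input is sorted first.
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        input = ["http://discoproject.org/media/text/chekhov.txt"]
        job = Job().run(input=input, map=map, reduce=reduce)
        # job.wait() blocks until the job finishes and returns its results.
        for word, count in result_iterator(job.wait()):
            print(word, count)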

This is a fully working Disco script that computes word frequencies in a text corpus. Disco automatically distributes the script to a cluster, so it can use all available CPUs in parallel. For details, see the Disco tutorial.
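To see what the map and reduce steps compute without a cluster, the same logic can be sketched in plain Python. This is only a single-process illustration, not how Disco executes a job: `map_words`, `local_kvgroup`, `reduce_counts`, and `word_count` are hypothetical names introduced here, and `local_kvgroup` is an assumed stand-in for `disco.util.kvgroup` built on `itertools.groupby`.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def local_kvgroup(pairs):
    """Group consecutive (key, value) pairs by key.

    Hypothetical stand-in for disco.util.kvgroup; like groupby, it only
    merges adjacent keys, so the input must already be sorted by key.
    """
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, (value for _, value in group)


def reduce_counts(pairs):
    """Sum the counts for each word, mirroring the reduce function above."""
    for word, counts in local_kvgroup(sorted(pairs)):
        yield word, sum(counts)


def word_count(lines):
    """Run the map step over every line, then the reduce step over all pairs."""
    pairs = (pair for line in lines for pair in map_words(line))
    return list(reduce_counts(pairs))


if __name__ == "__main__":
    for word, count in word_count(["to be or not", "to be"]):
        print(word, count)
```

Sorting before grouping matters: without it, repeated words that are not adjacent in the input would be reported as separate groups instead of being summed together.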

Highlights

...and more! See the documentation for details.

Need help with Disco? You can reach us on our IRC channel, #discoproject on Freenode, on the Disco discussion group, or by opening an issue in the Disco repository on GitHub.