MapReduce is making a comeback
(estuary.dev)
I question the premise that MapReduce really ever went away. Many migrated away from Hadoop, but in the frameworks that succeeded it, MapReduce was still a core pattern. And in some cases, moving away from Hadoop wasn't ideal because later frameworks still got some things wrong. Maybe we stopped talking about MapReduce because we were focused on new patterns and challenges -- how to support many complex jobs and pipelines, more interactive and exploratory analysis, etc.
I'm curious about the difference between "continuous MapReduce" and I guess a subgraph in a "differential dataflow" (which I have read about but never really used). https://github.com/TimelyDataflow/differential-dataflow
First let me say that I think Timely Dataflow and Materialize are both super cool. The two approaches are quite different, in part because they solve slightly different problems. Or maybe it's more fair to say that they think of the world in somewhat different ways. Probably most of the differences can be traced back to how Timely Dataflow relies on the expiration of timestamps in order to coordinate updates to its results. You can read the details on that in their docs (https://timelydataflow.github.io/timely-dataflow/chapter_5/c...).
I think a reasonable TLDR might be to say that continuous MapReduce has a better fault-tolerance story, while Timely Dataflow is more efficient for things like reactive joins. They both have their purpose, though, and I imagine that both Flow and Materialize will go on to co-exist as successful products.
> Rather than producing output data and then exiting, like a Hadoop job, a continuous MapReduce program continues to update its results incrementally and in real time as new data is added
So keeping track of min/max/average as you add new data is now "continuous MapReduce"?
Don't get me wrong, a data platform that ingests data and computes useful user-defined aggregates from that sounds useful. But this article feels like an attempt to position that as some kind of incredible industry-leading insight that is a novel take on $buzzword, when it really isn't.
> $buzzword
Yeah, the article is an odd spin on what they're building.
> min/max/average as you add new data
Or a 2D kernel density estimate for your dashboards, a real-time view of 3-neighbors in a graph (nodes+edges definition) sized by log1p(request frequency), and so on. I find it far easier to write a few custom incremental primitives and piece them together into that kind of algorithm than to write the whole algorithm from scratch.
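To make that concrete, here's a minimal sketch of my own (not anything from the article or from Flow) of the kind of incremental primitive I mean, in Java: a running count/min/max/mean that folds in one value at a time and can merge with another partial aggregate, which is what lets it behave like a reducer across shards:

    // Hypothetical incremental aggregate: absorbs values one at a time and
    // can merge with another partial result computed elsewhere.
    public class RunningStats {
        private long count = 0;
        private double min = Double.POSITIVE_INFINITY;
        private double max = Double.NEGATIVE_INFINITY;
        private double sum = 0.0;

        // Fold in a single new observation as it arrives.
        public void add(double x) {
            count++;
            sum += x;
            if (x < min) min = x;
            if (x > max) max = x;
        }

        // Merge a partial aggregate from another shard or time slice.
        public void merge(RunningStats other) {
            count += other.count;
            sum += other.sum;
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
        }

        public double mean() { return count == 0 ? Double.NaN : sum / count; }
    }

A 2D KDE or a neighborhood count is the same idea with a fancier add/merge.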
I'm not crazy about a general-purpose framework/product that tries to allow incremental updates of AllTheThings™ -- my experience thus far suggests that getting it to do what you want (or perform reasonably) on your own data will require enough kludges that you would have been far better off writing the WholeDamnedThing™ yourself.
If they do only support min/max/average and other simple transforms then that's probably not great; they'd be competing directly with something like QuestDB, which is a phenomenal product I'm leaning toward more and more. You don't need millisecond view update times if you can query the whole db in milliseconds.
Real-time aggregates are the core value pitch of ClickHouse.
Interesting, but how does this fare as an alternative to Apache Flink's stream processing model?
A big difference is the removal of windowing: Flink lets you aggregate or join events only so long as they arrive in the same temporally-bound window. You're required to have a window, and it's core to the semantics of your workflow.
Flow's model doesn't use windows, and allows for long-distance (in time) joins and aggregations. There's no concept of "late" data in Flow: it just keeps on updating the desired aggregate.
You do not need windowing. You can do everything you want with a regular KeyedStream. You can join non-windowed streams using IntervalJoin.
That's if you want to use the high-level API. If you use the lower-level ProcessFunction, you have even more flexibility.
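For instance, here's a rough sketch (mine, based only on my reading of the Flink docs, so treat the details as approximate) of a non-windowed running aggregate using a KeyedProcessFunction: per-key state holds a running sum, and every new event, however late, just updates and re-emits it.

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Keeps a running sum per key in keyed state; no window is involved.
    public class RunningSum
            extends KeyedProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>> {

        private transient ValueState<Long> sum;

        @Override
        public void open(Configuration parameters) {
            sum = getRuntimeContext().getState(new ValueStateDescriptor<>("sum", Long.class));
        }

        @Override
        public void processElement(Tuple2<String, Long> event, Context ctx,
                                   Collector<Tuple2<String, Long>> out) throws Exception {
            long current = sum.value() == null ? 0L : sum.value();
            current += event.f1;
            sum.update(current);
            out.collect(Tuple2.of(event.f0, current)); // emit the updated aggregate immediately
        }
    }

You'd wire it up with something like events.keyBy(e -> e.f0).process(new RunningSum()).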
Dataflow systems like Flink (or, even better, differential dataflow [0]) are far more flexible and subsume map-reduce. This article feels like hyping up the durability of the Model T.
IANAE on Flink, especially when it comes to the internals. But the decomposition of computations into distinct map and reduce functions does seem to afford a bit more flexibility, since it can be useful to apply reductions separately from map functions, and vice versa. For example, you could roll up updates to entities over time with just a reduce function, and you could easily do so eagerly (when the data is ingested) or lazily (when the data is materialized into an external system). That type of flexibility is important when you want a realtime data platform that needs to serve a broad range of use cases.
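A loose illustration of that flexibility (my own sketch, not Flow's or Flink's API): the same reducer can be applied eagerly as updates are ingested, or lazily over retained raw updates when you materialize.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.BinaryOperator;

    // Hypothetical roll-up of per-entity updates, decoupled from any map step.
    public class EntityRollup<V> {
        private final Map<String, V> state = new HashMap<>();
        private final BinaryOperator<V> reduce;

        public EntityRollup(BinaryOperator<V> reduce) { this.reduce = reduce; }

        // Eager: fold each update into per-entity state at ingest time.
        public void ingest(String key, V update) {
            state.merge(key, update, reduce);
        }

        // Lazy: apply the same reducer over retained raw updates at read time.
        public V materialize(Iterable<V> rawUpdates) {
            V acc = null;
            for (V u : rawUpdates) acc = (acc == null) ? u : reduce.apply(acc, u);
            return acc;
        }
    }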
I love how data structures (block chain!) and algorithmic design patterns are now sort of like fashion trends. This industry is so faddish.
"Reduced lock contention sharded hash tables are this season's new casual!"
MapReduce is a useful tool in some contexts, but the hype surrounding it always felt like people patting themselves on the back for re-inventing relational algebra.
The big hype was always due to the ability to shed large Oracle-based data warehouses. When Hadoop was full of hype in 2010, Oracle was charging $150k/CPU core/year for a RAC cluster license.
Considering that Oracle is not in fact magic, this meant that a large number of firms were spending 7-8 figures annually on Oracle licenses. MapReduce/Hadoop was the first accepted alternative that didn't involve spending outrageous license fees and instead involved outrageous hardware expenditure.
Enlightenment moment.
Con: OSS is less optimized than the proprietary solution, requiring bigger hardware
Pro: OSS allows you to buy bigger hardware, use all of it without logical restrictions, and scale infinitely beyond the arbitrary point you were locked into with licensing.
And then the new-found efficiency frees up time to discover/identify $(x,)xxx,xxx+ in manual work that can also now be done with your new-found compute...
Wow. Way to prevent us from progressing beyond the industrial revolution.
($catchup_speed++)
And there's also an added cost of engineers supporting all that hardware and the Hadoop components running on it.
My interpretation is that this is why it's so brilliant.
It's incredibly simple for the end user conceptually, but it encapsulates optimizing processing across a distributed file system, fault tolerance, shuffling key-value pairs, job stage planning, handling intermediates, etc.
Hadoop is a big data framework that reduced the level of competence required to write data pipelines, because it was able to hide a massive amount of complexity behind the map-reduce abstraction.
I'd even argue that Hive, Snowflake, and other SQL data warehouses have taken this idea further, where most SQL primitives can be implemented as map-reduce derivatives. With this next level of abstraction, DBAs and non-engineers are writing map-reduce computations.
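As a toy illustration of that point (my own sketch, nothing vendor-specific): SELECT key, SUM(value) ... GROUP BY key is just a map that emits key/value pairs, a shuffle that groups by key, and a reduce that sums.

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class GroupBySum {
        record Row(String key, long value) {}

        static Map<String, Long> query(List<Row> rows) {
            return rows.stream()                               // "map": each row yields a key/value pair
                       .collect(Collectors.groupingBy(         // "shuffle": group by key
                                Row::key,
                                Collectors.summingLong(Row::value))); // "reduce": sum per key
        }
    }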
I think my point is that abstractions like map reduce have had a democratizing effect on who can implement high scale data processing and their value is that they took something incredibly complex and made it simple.
I agree with this. As soon as the MapReduce paper came out, people were criticizing it for a lack of novelty, claiming that so-and-so has been using these same techniques for years. And of course those critics are still around saying the same things. But I think there's a reason we keep going back to these techniques, and I think it's because they repeatedly prove to be practical and effective.
It reminds me more of Timescale's continuous aggregates and the aggregation indexes of Firebolt, the new Snowflake slayer.
Isn't the entire field of software just people patting themselves on the back for some extremely basic relational algebra?
I don't know what I should be proud of when I learn something new; it all seems extremely basic as soon as I learn it.
I wouldn't say that any more than I'd say the soccer World Cup is a bunch of people being worshipped for jogging around passing a ball.
It’s pretty easy to simplify things down until they sound unimpressive.
It's kind of like math; all math concepts are impossible until you figure out why they're actually easy.
It was more about scaling than about relational algebra though.
The scale factor was really overblown and presented without caveats. If you have Google-scale computations, where it's not unlikely that at least 1 in 10k machines serving a request will bite the dust during the query, then of course MapReduce makes sense.
However, in most other cases there are now far better alternatives (although tbh I'm not sure how many were around when MapReduce was introduced).
The main limitation of MapReduce is the barriers imposed by the shuffle stage, and at the end of the reduce when chaining together multiple MapReduce operations. Dataflow frameworks remove these barriers to various degrees, which often lowers latency and can improve resource utilization.
Exactly this. I remember in 2007 being able to process TBs of data on commodity hardware with Hadoop. You got decent throughput, decent fault tolerance out of the box wrapped in Java that many average software developers (yours truly included) were comfortable with. You could scale data and people.
It dramatically reduced the cost of entry for many ad-tech applications.
> It dramatically reduced the cost of entry for many ad-tech applications.
You say that as if it was unequivocally a point in its favour.
Apart from Google, which has a patent related to their 2004 paper, I don't know how much people are trying to "take credit" for map-reduce. I'm certainly not. But I do think the approach of running map-reduce continuously in realtime is interesting and worth sharing. And I hope that some folks will be interested enough to try it out, either with Flow or in a system of their own design, and report on how it goes for them.
In reality, such continuous mapreduce jobs lead to unchangeable code and versioning nightmares.
Imagine you want to change part of your pipeline's logic. Now either all data needs to be reprocessed (expensive; depends on you having retained past data; will your low-latency continuous pipeline keep running while the backlog is cleared; is the code really idempotent, or will a rerun lead to half the records failing to be reprocessed?), or you need to not reprocess old data (now there is inconsistency in historical records -- what do you do if you make a bad release which just outputs zeros?).
In any real organisation, you'll need both approaches. And it'll end up a mess with versions of code and versions of data. Now some customer comes along and demands a GDPR deletion of their session records and you have no way to even find all the versions of all the copies of the records let alone delete them and make everything else consistent...
Versioning is indeed an issue, but that's the case for anything with long-lived state. Our current approach relies on JSON schemas, TypeScript, and built-in testing support to help ensure compatibility. Those things actually help quite a bit in practice. But I think we may also want to build some more powerful features for managing versions of datasets, since there's a real need there, regardless of the processing model you use to derive the data.
Did this ever go away? 99% of programs do map/reduce all-day long.
Yes, but when talking about MapReduce we generally talk about distributed frameworks for doing it on big data.
But most people don't have "big data" in the sense of having data that requires more than a single machine to process.
Most people who think they have "big data" still don't have big data (e.g. I've done work on datasets where people insisted on using "big data" solutions when it could all easily fit in a Postgres instance, with or without a columnar store, with most of the working set cached in memory, for a fraction of the cost).
It "went away" in the sense that more people realised they could avoid it with a few simple steps (e.g. pre-processing during ingestion), and/or fit the data they needed on fast-growing individual servers, and so the number of people continuing to use it more closer approximated the set of people who actually work on big data.
For those who actually needed it, it of course never went away.
If all you have is a hummer, everything looks like a nail: when Hadoop first appeared there were almost no other open source systems to process 'big data', and it was widely adopted. Now there are many options to choose from. We don't have to use map-reduce for every task which could be solved using map-reduce. E.g. for some tasks a columnar store like ClickHouse is a better fit.
If all you have is a Hummer everything looks like an enemy vehicle.
Forgive me if this is naive, but could smaller-scale cases be served by a version that uses the MapReduce model as a way to cleanly break up operations across cores instead of machines? Or do the benefits of the model become mostly irrelevant in that case?
I'm sure it wouldn't take the form of a dedicated process; probably just a language-agnostic programming pattern
Most of the benefits go away, but yes, you can do this. MapReduce had a flag to use multiple cores for multiple workers on a machine, and this was often the way to get the greatest throughput.
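For a sketch of what the single-machine version looks like (my own example, not an actual Hadoop flag or API): the same map/shuffle/reduce shape, but spread across cores with Java parallel streams instead of across machines.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class LocalWordCount {
        static Map<String, Long> wordCount(List<String> lines) {
            return lines.parallelStream()                                   // split work across cores
                        .flatMap(line -> Arrays.stream(line.split("\\s+"))) // map: emit words
                        .collect(Collectors.groupingByConcurrent(           // shuffle: group by word
                                 w -> w,
                                 Collectors.counting()));                   // reduce: count per word
        }
    }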
I think this is referring to a specific paradigm within Apache Hadoop but I'm no expert.
Every section in this article is a mistake. For instance:
- Map Reduce was not invented in 2004; it's a technique from the punch card days.
- Map reduce is not solely useful because it allows you to delete source data; it's a trivial method of parallel processing. It's not the most trivial or the most modern, just common.
But they say ‘MapReduce’, not ‘map reduce’. They’re talking about the specific model, not the general idea of map reduce.
Really, they should be calling it "Distributed MapShuffleReduce". I let Sanjay and Jeff know this (in person), but they didn't really seem to care. Neither map nor reduce is a method of parallel processing; the shuffle stage is.