Settings

Theme

Ask HN: How to *really* learn concurrency,parallelism?

10 points by bonobo3000 9 years ago · 8 comments · 1 min read


Hello,

I've been interested in concurrency for a while now and I've been picking up bits and pieces by stumbling across great posts/videos once in a while or working with them at my job. So far I'm sort of comfortable with threads (not the gory JVM details though) and Akka, and a tiny bit of futures/promises and co-operative concurrency type stuff with pythons tornado.

I want to go deeper, really grok all the different models and be able to make informed decisions about which model us appropriate when, understand how they map to lower levels, generally gain a comprehensive understanding. I'm currently going through Java Concurrency in Practice.

My question - what resources will help me get a comprehensive understanding of concurrency & parallelism? what toy project would you recommend where i can implement the same thing with lots of different models to understand the difference?

Thanks in advance!

nostrademons 9 years ago

Book rec:

https://www.amazon.com/Concepts-Techniques-Models-Computer-P...

Language & library recs:

Java is actually a pretty shitty language to learn concurrency on, because the concurrency primitives built into the language & stdlib are stuck in the 1970s. There've been some more recent attempts to bolt more modern concurrency patterns on as libraries (java.concurrent is one; Akka is another; Quasar is a third), but you're still very limited by the language definition. Some other languages to study:

Erlang, for message-passing & distributed system design patterns. Go has a similar concurrency model, but not as pure.

Haskell for STM.

Python3.5/ES2017/C#, for async/await & promises. Actually, for a more pure implementation of promises, check out E or the Cap'n Proto RPC framework.

Rust, for mutable borrowing. Rust's concurrency story is fairly unique; they try to prove that data races can't exist by ensuring that only one reference is mutable at once.

JoCaml for the join calculus. Indeed, learning formal models like CSP, the pi-calculus, or the join-calculus can really help improve your intuitions about concurrency.

Hadoop for MapReduce-style concurrency. In particular, learning how you might represent, say, graph algorithms on a MapReduce system is a great teacher. Also look at real-time generalizations of MapReduce paradigms like Storm or Spark.

Paxos & Raft for the thorny problems in distributed consensus.

Vector clocks, operational transforms and CRDTs. One approach to concurrency is to make it not matter by designing your algorithms so that each stage can be applied in arbitrary order (or can compensate for other operations that have occurred in the meantime). That's the idea behind this, and it perhaps has the most long-term promise.

Project & job recs:

The best way to really learn concurrency is to take a job at a company that has to operate at significant scale. Google or Facebook are the most prominent, but any of the recent fast-growers (AirBnB, Uber, Dropbox, probably even Instacart or Zenefits) will have a lot of problems of this type.

Failing that, I've found that implementing a crawler is one giant rabbithole in learning new concurrency techniques. The interesting thing about crawlers is that you can implement a really simple, sequential one in about 15 minutes using a language's standard library, but then each step brings a new problem that you need to solve with a new concurrency technique. For example:

You don't want to wait on the network I/O, so you create multiple threads to crawl multiple sites at once.

You quickly end up exhausting your memory, because the number of URLs found on pages grows exponentially, and so you transition to a bounded thread pool.

You add support for robots.txt and sitemaps. Now you have immutable data that must be shared across threads.

You discover some URLs are duplicates; now you need shared mutable state between your fetch threads.

You start getting 429 and 403 request codes from the sites, telling you to back off and stop crawling them so quickly. Now you need a feedback mechanism from the crawl threads to the crawl scheduler, probably best implemented by message queues.

You want to process the results of the crawl. Now you need to associate the results of multiple fetches together to run analyses on it; this is what MapReduce is for.

You need to write out the results to disk. This is another source of I/O, but with different latency & locking characteristics. You either need another thread pool, or you want to start looking into promises.

You want to run this continuously and update a data store. Now you need to think about transactions.

  • PaulHoule 9 years ago

    Java's concurrency model may be stuck in the 1970s, but it actually works, unlike a lot of newer frameworks...

    • nostrademons 9 years ago

      Most concurrency models work if you understand both your problem domain and your tools thoroughly, and fail if you don't. Event-driven callback spaghetti, for all of its flaws, has been used successfully in services with billions of simultaneous users.

      OP was asking for ways to broaden his concurrency knowledge, and studying message-passing, transactional models, promises, etc. will certainly do that.

      • rnovak 9 years ago

        Why are you advising him to run before he can crawl? All of those paradigms are built on top of the 70's era threading-model as used by the JVM (as well as every other language, behind the scenes).

        How can you understand these tools thoroughly if you don't understand the foundation that they're built on?

        Yes, they probably have been used in services with billions of users, but I guarantee those engineers understood the basics of threading just as well, if not better, than the tools built on top of them.

        OP: This video is pretty good at explaining some core concepts: https://www.youtube.com/watch?v=MCs5OvhV9S4

  • blain_the_train 9 years ago

    Not an expert here, but erlang uses the actor model, while go uses csp.

    Side note, clojure implements csp in a core library called core.asyc

  • bonobo3000OP 9 years ago

    Awesome post, thank you so much! I'll get started on the crawler soon :)

ruslan_talpa 9 years ago

Parallel and concurrent programming is easy in Haskell (because of the amazing work on lower levels) http://chimera.labs.oreilly.com/books/1230000000929

Switch to haskell and you will no longer view "parallel/concurrent" as a hard problem that you need to solve

dudul 9 years ago

There is a book "7 concurrency models in 7 weeks". I haven't read it (yet) but I assume that it's a good fit for what you want.

Keyboard Shortcuts

j
Next item
k
Previous item
o / Enter
Open selected item
?
Show this help
Esc
Close modal / clear selection