Real time collaboration with Go and RethinkDB: The Good, The Bad, & The Ugly

I recently launched a real-time art collaboration platform called Formgraph, and I spent a lot of time beforehand thinking about what stack I was going to use to build it. After experimenting with a few options, I ultimately settled on Go for the application server, and RethinkDB for storing and publishing user actions in real time. I thought I’d take a minute to reflect on those choices and how I feel about the subsequent outcomes.

Golang: It’s better than Yolang

Go nails developer ergonomics in a way that’s difficult to describe. The language itself is pretty sparse and unremarkable. It’s missing a handful of features that I’d find very helpful (generics, anyone?), but the runtime, ecosystem, and overall philosophy of how a programming language fits into the broader activity of software development and deployment more than make up for any trivial misgivings about features and syntax.

The standout features that caused me to choose Go over the others were channels and goroutines. Go has a concurrency model based on communicating sequential processes, which is a formal language for describing how different processes can communicate with each other. The motto with Go is, share memory by communicating, rather than communicating by sharing memory. Channels help Go achieve this by providing a strongly typed handle where one “goroutine” (basically a lightweight thread) can signal another by queueing an object onto a channel, and having the receiving goroutine block until some object comes over that channel. This is really handy for things like publish and subscribe, which is a core pattern behind Formgraph’s real time collaboration features.
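
To make the publish and subscribe idea concrete, here is a minimal sketch using only channels and goroutines. The names Hub, Subscribe, Publish, and Close are made up for illustration and are not taken from Formgraph's code; a real server would also need to handle slow or disconnected subscribers.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans each published message out to every subscriber channel.
// Hub and its methods are illustrative names, not Formgraph's code.
type Hub struct {
	mu   sync.Mutex
	subs []chan string
}

// Subscribe registers a new buffered channel and returns its receive side.
func (h *Hub) Subscribe(buf int) <-chan string {
	ch := make(chan string, buf)
	h.mu.Lock()
	h.subs = append(h.subs, ch)
	h.mu.Unlock()
	return ch
}

// Publish sends msg to every subscriber. It blocks if a subscriber's
// buffer is full, which a production hub would want to avoid.
func (h *Hub) Publish(msg string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs {
		ch <- msg
	}
}

// Close closes all subscriber channels so that receivers' range loops end.
func (h *Hub) Close() {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs {
		close(ch)
	}
	h.subs = nil
}

func main() {
	hub := &Hub{}
	var wg sync.WaitGroup
	for i := 1; i <= 2; i++ {
		ch := hub.Subscribe(8)
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// The receive blocks until a message arrives or ch is closed.
			for msg := range ch {
				fmt.Printf("subscriber %d got %q\n", id, msg)
			}
		}(i)
	}
	hub.Publish("stroke:start")
	hub.Publish("stroke:end")
	hub.Close()
	wg.Wait()
}
```

Each subscriber sees the messages in the order they were published, but the interleaving between subscribers is up to the scheduler.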

Go’s way of dealing with concurrency made it a pretty natural fit for the problem at hand. Its other positive attributes, like having a really short learning curve and devastatingly simple tools for deployment and dependency management, were just icing on the cake.

RethinkDB: The open-source database for the realtime web

I first heard about RethinkDB on an episode of FLOSS Weekly a few years ago. To be quite honest, the cynic in me was a bit skeptical about how gushing the creators were about their baby. After using it for the last few months, however, I can definitely say they’ve built something to be proud of.

In a way, RethinkDB is kind of like a hybrid of MongoDB and Redis. You can store structured, schemaless data as JSON, and query it with a language called ReQL, which is embedded in your programming language through the client driver rather than written as separate query strings. In that way it’s pretty similar to Mongo. The killer feature with Rethink is something called “changefeeds”, which basically let you write a query that returns a cursor, and that cursor waits for new data matching the query. Using this feature makes it pretty easy to do pub/sub like in Redis, but unlike Redis, you don’t have to go through a separate step to store the data that gets published in a changefeed. You just insert the data into the table, and the changefeed receives the data and passes it to whatever is blocking on it.

As you can probably imagine, Go’s concurrency model plus RethinkDB’s changefeeds make a powerful combination for quickly building a scalable real time web app. Indeed they do, but of course, as these things go, not everything has been blue skies.

RethinkDB: The fine print

The first real pain point with RethinkDB came up with message history. Changefeeds are great at serving real time data, but calling up previous messages leaves a bit to be desired. When you set up the changefeed query, you can specify the “includeInitial” parameter, which makes the feed start with the documents that already match the query, so you get the previous messages before any new ones. The only problem is, the messages aren’t in chronological order! With Formgraph Studio, this is kind of a big deal, since the message stream is used to build the render command buffer that the client uses to draw the images. An out of order message history is basically useless for reconstructing the undo/redo history and command buffer. The good news, however, is that you can add a timestamp to the messages, and tell RethinkDB to sort them chronologically. Relying on timestamps for this makes me a bit nervous, since the possibility of a timestamp collision between two conflicting messages is now technically on the table, but the timestamps have a granularity of 1ms, so unless you and another user start drawing something at the same instant and the two paths overlap, I don’t think this will lead to any issues.
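
The timestamp collision worry can be illustrated with a small sketch. This is not how RethinkDB sorts, and it is not Formgraph's schema: the Msg struct and its Seq field are hypothetical, with Seq standing in for some secondary key (for example a server-assigned counter) that makes the order deterministic when two messages land in the same millisecond.

```go
package main

import (
	"fmt"
	"sort"
)

// Msg is a hypothetical, simplified message. The real schema is not
// shown in this post.
type Msg struct {
	TimestampMs int64  // creation time in milliseconds
	Seq         int    // hypothetical tie-breaker for equal timestamps
	Action      string // e.g. a drawing command
}

// SortChronological orders msgs by timestamp and falls back to Seq
// when two messages share the same millisecond.
func SortChronological(msgs []Msg) {
	sort.SliceStable(msgs, func(i, j int) bool {
		if msgs[i].TimestampMs != msgs[j].TimestampMs {
			return msgs[i].TimestampMs < msgs[j].TimestampMs
		}
		return msgs[i].Seq < msgs[j].Seq
	})
}

func main() {
	msgs := []Msg{
		{TimestampMs: 1002, Seq: 0, Action: "undo"},
		{TimestampMs: 1001, Seq: 1, Action: "lineTo"},
		{TimestampMs: 1001, Seq: 0, Action: "moveTo"},
	}
	SortChronological(msgs)
	for _, m := range msgs {
		fmt.Println(m.TimestampMs, m.Seq, m.Action)
	}
}
```

Without a tie-breaker like this, two messages with the same timestamp could replay in either order, which only matters when they touch the same part of the drawing.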

Even with the timestamps, we’re still not out of the woods yet. It turns out, RethinkDB can’t create changefeeds with an “orderBy” clause, unless it also comes with a “limit” clause. While I could count the number of messages for a given project ID and pass that to the changefeed parameters, that’s likely to have its own non-trivial performance overhead, so for now I’ve just hard-coded it to 1,000,000. Larger projects on Formgraph hover in the tens of thousands of messages range, so while I’m sure that limit will be hit at some point (this is the internet, after all), doing it this way works fine for now.

The next problem with RethinkDB has to do with the remixing feature on Formgraph. One of the core features of Formgraph is, you can remix anyone else’s project. While I could have written some fancy scheme to trace through a project’s inheritance hierarchy and push those messages onto the remix’s action stream, it’s a lot less complicated just to copy the parent project’s messages in order with a new project ID. I was disappointed to see that there wasn’t any direct method to copy a changefeed. I can totally understand why that wouldn’t be a thing, but boy would it be useful.

So, initially populating a remix involves copying all the messages from the source project to the destination project. That is basically a few lines of ReQL, except that doing it in a single query causes the entire action stream from the parent project to be collapsed into one changefeed message. I could have probably worked around this with a little type assertion dance inside the goroutine that writes to the client from the changefeed, but that reeked of extra complexity for the sake of premature optimization, and I prefer to take the obvious route first and optimize from there. So I ended up writing a for loop that takes every message from the source project and dumps it into the new project, one by one. Initially, doing this took way too long. It turns out that the default “RunWrite” method in the RethinkDB driver I was using was waiting for all those thousands of inserts to return a response, so remixing even a modestly sized project would take over a minute. After digging into RethinkDB’s documentation, I discovered the “no reply” and “soft durability” options that can be applied to an insert. No reply keeps the driver from blocking while waiting for a response from RethinkDB, and soft durability causes RethinkDB to acknowledge writes when they’re stored in memory, rather than when they’re written to the disk. This is less reliable, but the performance win is huge (remixes that took over a minute to set up now only take a couple seconds) and there isn’t a great recovery path if the bulk insertion fails, so it’s not like that extra feedback was helpful anyway.

The final bit where Rethink causes issues is with initial project loading. Taking a look at Chrome’s developer tools, I saw that heavy projects would spend a good 30+ seconds being almost totally idle in the browser, followed by a concentrated glut of CPU time spent processing the initial messages to build up the command buffer. While RethinkDB is clearly not solely at fault here, a few experiments have revealed it to be a significant factor.

Running a query from the RethinkDB admin console to grab all the messages for a larger project takes roughly 7 seconds on my 2013-era MacBook Air (rendering the results page takes a bit longer, probably for reasons that have little to do with RethinkDB itself). I wrote a little hack on Formgraph’s server that would dump the messages to an array in-memory after loading a project for the first time. On subsequent loads, instead of querying for the changefeed history, the server-to-client goroutine would send the contents of the in-memory array to the client, and suddenly the 30 seconds of initial client idle time went down to roughly 2!

A lot is going on with that initial history stream. RethinkDB has to resolve the changefeed query, fetch the history from the “picture_messages” table, filtered by project ID, sorted by timestamp, and serialize all that to JSON and send it to Formgraph’s server over TCP. Then the Go RethinkDB driver has to unmarshal that into the PicStreamMsg struct, where the message type gets inspected to see if it’s valid for the current user (e.g. no sending member-only chat messages to non-project members or logged out users), then re-marshal the JSON and write it to the WebSocket if valid. When RethinkDB is out of that equation, the biggest delay in loading a project becomes the client replaying the action stream to reconstruct the project state.

While the crude in-memory cache yielded a great performance improvement, it was far from thread safe, and would take a lot more work to get production ready. Before I even heard about RethinkDB, I thought about using Redis for the real time component of this project (check out the whois entry on formgraph.com; I’ve owned the domain since 2011!). If I went that route, I would have been publishing to a channel and writing messages to a queue. Pulling up the queue would have probably been way faster than querying a table with messages for all the projects munged together (I’m using indexes, which helps, but it’s still got a ways to go). Using Redis would have added some complexity, as publishing the message and enqueueing it would be a two-step process. There’s a project out there called Centrifugo, which is basically a Go WebSocket server that uses Redis to provide exactly that functionality. I could have used that, but I was already using RethinkDB for stuff like session tokens and timelines (a use case for which it is absolutely great, by the way!), so I wanted to stay consistent. I’m already using Postgres for the transaction-y stuff inside of Formgraph, so introducing a third database seemed like overkill. :) However, Postgres has had some interesting developments recently that might obviate the need for RethinkDB in those other areas as well, but that’s another blog post!

Conclusion

This might come off like I’m dumping on RethinkDB, but I’m totally not! It’s got a great, easy-to-use API, changefeeds work like magic for most WebSocket streaming applications, and the admin tools are intuitive and nicely polished. The documentation is thorough and stellar, the authors and community are great, and setting up replication clusters is so easy, a developer can do it. :) For the sort of high volume message application I was using it for, however, it was a serviceable, but not completely stellar fit. That’s my bad. Lesson learned.

There’s still a lot more work to be done to make Formgraph fulfill its mission of reimagining creativity for the real time web. Go and RethinkDB have proven to be huge assets in getting this far. As this mission gets carried out, I’m looking forward to sharing with you the lessons learned, and discoveries made, along the way.