Why Discord is switching from Go to Rust (2020)
I suspect that GC'd languages could mitigate this problem by introducing regions; separate areas of memory that cannot point at each other. Pony actors [0] have them, and Cone [1] and Vale [2] are trying new things with them.
If golang had this, then it might not ever need to run its GC because it could just fire up a new region for every request. The request will likely end and blast away its memory before it needs to collect, or it could choose to collect only when that particular goroutine/region is blocked.
Extra benefit: if there's an error in one region, we can blast it away and the rest of the program continues!
[0] https://tutorial.ponylang.io/types/actors.html#concurrent
[1] https://cone.jondgoodwin.com/fast.html
[2] https://verdagon.dev/blog/seamless-fearless-structured-concu...
Or, you know, just pace the GC mark-and-sweep algorithm, which is what Go is doing now.
Correct me if I'm wrong, but IIRC pacing would still cause a latency spike, it would just be a more strategically-timed latency spike.
Sure, see https://github.com/golang/go/issues/44167. Under the new design, GC CPU utilization only increases when you're actually allocating heavily, which makes sense: you're doing more work. This should completely resolve the problem Discord had, since their system was in a steady state.
You are always going to have some kind of latency spike with a sweeping GC; even if that spike is tiny.
Not really. If you're in a steady state (like Discord was), you wouldn't see any spikes; you'd have consistent utilization. If you start allocating heavily, then you could see an increase, which makes sense: you're increasing your workload, so utilization needs to increase. But that's still not necessarily a 'spike'.
Not discrediting Rust, but I've noticed you rarely hear "we improved performance by rewriting our implementation in the same language"... although that, too, can yield similar performance improvements.
In the article they describe their attempts at tuning and weren't happy with the results
The main issue with Go in this regard is that it generally doesn't have multiple ways of doing the same thing at least idiomatically. I mean sure, there are architectural choices you could make in a rewrite but the actual structure of the code itself is going to be very similar.
It does look like with some GC tuning (e.g. manually triggering GCs at a smaller interval than Go's automatic GC threshold) they might've mitigated the spikes, although I don't think they would have gotten the level of perf improvement they did. Go's generated assembly IME is not very optimized compared to Rust/C++.
edit: reading comprehension skills are lacking, please see comment below for why I'm wrong
I don't understand.... isn't this idea (triggering GC more often) explicitly discussed in the article?
My understanding is they tried to tune the GC percent (GOGC) to make the automatic heuristic collect sooner, but they didn't allocate enough for that to make a difference. However, Go has a way to manually trigger a GC, which they could've called on a timer from a goroutine. If they weren't actually generating that much garbage, then pause times should theoretically be pretty short if you're doing a GC every 5 seconds or something like that.
That being said it's not something that 100% is guaranteed to fix the issue so maybe they did test this and just didn't mention it in the blog.
Okay, I see what you're saying about the timer. That isn't in TFA, agreed.
But I still don't understand, because....
(NB: I'm not a GC expert, just a curious amateur, so my apologies if there are errors in the following, and the opportunity to be corrected in these errors is part of why I'm posting this.)
Regarding "not much garbage => theoretically pause times would be shorter": my understanding is that this is actually not how GC works. GC time is a function of the size of the live heap, because a tracing GC works by walking ("tracing") the graph of live references. So the only way to make GC faster is not to have less garbage, but to have less live data allocated at all.
Multi-generational GC works by dividing the whole pool into smaller pools, so that most GC passes only visit the high-churn nursery, but even then the occasional major pass still needs to trace the entire live set.
TFA mentions this, where they say "the spikes were huge not because of a massive amount of ready-to-free memory, but because the garbage collector needed to scan the entire [thing we were keeping track of]".
That is, they had virtually no garbage to collect, and that wasn't speeding up the GC. Which is consistent with how all tracing GC works, as far as I know.
Comments/corrections/clarifications are requested!!
I remember when this article came out, everybody was pointing out that they had used a Go version that was several releases old.
Perhaps if the intent wasn't to convince their managers to let them write it in Rust, they would have tried using the latest Go version at the time?
Then why publish it at all?
Not to mention, the article made no effort to establish that it was describing the world of two years prior to its being written.
They actually would have mitigated it by upgrading their Go version, because the latest release at the time had a fix in the runtime that would have basically solved this.
Turns out no one on the team actually looked into issues in the Go repo to see if it was being addressed. Looks like they just wanted to write Rust, which is fine, Rust is cool, but let's not deceive ourselves.
That is not what happened, they did not publish the blog post immediately after the transition, and those changes to the GC did not happen until after the port happened. Some people made assumptions about timeline that were incorrect, and then repeated.
The discussion at the time on Reddit [1] mentions this. The general discussion also covered whether the improvements, which were big in many cases, would even have helped this particular case. We'll never truly know.
That said it is important to recognize that Go’s GC has received significant upgrades over the years, and remember that what’s true in the past may not be true today.
1: https://www.reddit.com/r/programming/comments/eyuebc/why_dis...
The point is that they could have found the issue and seen that it was about to be released. That would be good engineering; bad engineering is when you don't find the root cause of your problem and check whether it's being worked on.
Good engineering is when you solve the problems you have. Sometimes there are multiple ways to solve a problem. Just because they did not choose the solution (which again, we're only speculating would actually solve the issue here, we don't have proof of that) you prefer does not make it poor engineering.
I’m told the Go GC has gotten better in recent years. Has anybody run a similar program in Go lately that can confirm that?
(2020)
(Anyone know if they're still using Rust?)
Yes, we are using Rust in a big way. We have multiple teams now working full time on Rust. It is being used on both the client and server: as native modules, WebAssembly, native Rust services, and NIFs that embed themselves in our Elixir services.
It has been an incredible success. I plan to blog more about it in the coming months. Our usage of Rust is continuing to grow, and if you check out our jobs page, you might notice all backend / infra jobs list Rust in them now :)
I think probably 40% of requests are handled directly by Rust services now, with the rest involving one or more Rust services called from our Python API layer.
Bit of a late reply, but how does Elixir fit into the overall strategy? Is it still like how it is described in previous engineering blogs where it acts as a kind of orchestrator for guilds?
I love Rust Elixir NIFs. They give you the best of both worlds, to be honest: highly fault-tolerant code with fast computation. The only downside is that a NIF can't really survive extreme crashes the way a separate native process can.
Erlang/Elixir + Rust is an awesome couple. For the downside you mentioned, depending on the use case, it could be interesting to use Rust as a node: https://github.com/sile/erl_dist
Very nice. Nodes are great if you want a long running system along side your elixir app.
Their blog doesn't list all articles on a single page (so you could Ctrl-F), doesn't have its own search, and doesn't have its own domain (so googling `site:discord.com rust` returns a mix of Discord communities and blog posts).
Makes it pretty hard to find stuff!
You can search "site:discord.com/blog/ rust", it appears to work for me on DuckDuckGo or Google. It seems TFA is the latest article mentioning Rust.
Woah I didn't realize you could filter on paths. I thought `site:` was domain only.
HN site listing is also useless because they host the blog on the same domain as everything else.
Looks like they're still hiring for it: https://www.google.com/search?q=rust+site%3Adiscord.com%2Fjo...
How illuminating. From Cloudflare posts, I had been under the impression that Go's GC was incredibly unintrusive, with near-real-time performance for applications operating in increments of a few hundred milliseconds. For example, Cloudflare uses Go to analyze network traffic.
Yes, Rust provides a more predictable, faster memory management model than Go. At the expense of unpredictable, expensive memory leaks triggering application termination.
Curious how much time and effort was dedicated to improving the GC, which is a useful endeavor in its own right.
Personally, I prefer the choice of simply using a language that doesn't have GC. No GC, no GC tuning problem. Seems ideal!
That said, I've worked on several embedded systems, and the "never allocate memory at runtime" rule that most of them had was critical to maintaining real-time-like performance. One was written in C++, which meant we basically couldn't use most of the STL and Boost; we had to roll our own implementations of plenty of data structures used on the performance-critical threads as a result. I couldn't imagine using a language with GC baked in for such a system. But the results spoke for themselves: microsecond-level latencies and performance that scaled well with increased CPU core counts.
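For what it's worth, the "never allocate at runtime" discipline can be approximated even in a GC'd language by recycling buffers instead of allocating fresh ones. A Go sketch using sync.Pool (my illustration only; the commenter's systems were C++, and a pool is merely the closest stdlib analogue, not true static allocation):

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool hands out reusable byte buffers. New runs only when the pool is
// empty; in steady state, old buffers get recycled instead of becoming garbage.
var bufPool = sync.Pool{
	New: func() any {
		b := make([]byte, 0, 4096)
		return &b
	},
}

// process borrows a buffer, uses it, and returns it for reuse, so the hot
// path creates almost no garbage for the collector to trace.
func process(msg string) string {
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0] // reuse capacity, discard old contents
	buf = append(buf, "processed: "...)
	buf = append(buf, msg...)
	out := string(buf) // the only real allocation: the returned result
	*bp = buf
	bufPool.Put(bp)
	return out
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(process("ping"))
	}
}
```

The pool is stored and returned by pointer (`*[]byte`) so the slice header itself isn't re-boxed on every Put, which is the idiomatic way to pool slices.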
Where do you get that sense that Rust results in memory leaks? Is that just an assumption you’re making about languages without garbage collection, or are there examples you’re aware of Rust applications having to deal with runaway memory consumption?
Rust as a language does not protect against memory leaks: std::mem::forget exists explicitly to cause one. Garbage-collected languages generally do protect against them, so trading Go for Rust increases the risk of having memory leaks.
You are invoking std::mem::forget which is explicitly for circumventing destructor execution and then complaining about leaks.
Okay.
The documentation page for std::mem::forget goes through all the alternatives you should try before resorting to std::mem::forget.
Now, perhaps std::mem::forget should be marked unsafe. However, you don't just "accidentally" run std::mem::forget.
BTW, one of the problems with GC languages is the fact that you never know when your destructor might get run (ie. object gets reclaimed) so your GC thinks life is just fine but ... oops ... you just ran out of file descriptors because they are all waiting to be reclaimed.
> perhaps std::mem::forget should be marked unsafe
It was in the past IIRC. The problem is that there are multiple ways to leak resources in safe Rust (e.g. creating `Rc` cycles), and the compiler cannot prevent them all. So leaking is safe, and because it's also useful sometimes, `std::mem::forget()` and friends (e.g. `Box::leak()`) are also safe. That being said, it is pretty hard to accidentally leak memory.
C# has disposable objects (the IDisposable interface); they require you to call Dispose manually, though the "using" syntax makes this automatic once the variable goes out of scope, so unless you're passing the object with an FD around you don't really notice it. If you miss the "using" on a disposable object, though, you're leaking, so you need tooling to help you remember.
That's how they solve it. C# is my favorite language, but that's probably because I've only really spent time in C++ and C#; the rest is just "scripting". So the syntax is familiar.
Doing std::mem::forget is not unsafe, as it cannot cause any sort of undefined behavior, which is what Rust protects you from [0].
Leaking is perfectly safe: not always desired, but definitely not dangerous like a random pointer deref.
But as OP says, leaking accidentally is uncommon.
[0] https://doc.rust-lang.org/nomicon/safe-unsafe-meaning.html
I'm not complaining - I'm pointing out that leaking memory is considered "safe" in the rust memory model to the point where there is a safe method in the stdlib to do so.
Depends on the GC implementation. With reference counting it's pretty predictable.
Rust is precise with what it guarantees, but don’t take “not an absolute guarantee in every case” as “no guarantee at all”.
Leaking in Rust requires either an explicit leaking function call, leaking in C/FFI code, or a type that is (1) recursive, (2) refcounted, and (3) has interior mutability, combined with (4) a programming error in the use of such a type. If any of these four conditions isn't met, it can't leak.
Cloudflare also uses a significant amount of Rust, in many business-critical places.
Hmm, well, Go itself has gone through a ton of revisions where the GC saw large improvements. And other up-and-coming languages are experimenting with various types of GCs. It's really up to the use case, though.
With Discord, I imagine a big reason Rust was considered as an alternative to Go is that they already have a substantial Elixir codebase. Rust and Elixir have a very easy time communicating with one another via Erlang NIFs (natively implemented functions): you can embed languages like C/C++/Rust into Elixir without much overhead. While I've never personally tried to do such a thing with Go, I can't imagine it's a smooth experience; you'd probably need to use Ports or C Nodes for Go for this reason alone.
I love Go myself, but one of the biggest turn-offs for the language is its FFI support for C and other C-connected languages. Cgo calls are relatively expensive compared to many comparable alternatives, and cgo should be avoided if possible.
We use Elixir + Rust in a few select places. But the majority of our Rust code is services that are purely written in Rust.
We use Rust over Go, not only because of the garbage collection issues, but because it's truly a better language in almost every way (once you learn it!)
I will say, Go is much easier to pick up, but in exchange you pay in the long term having a language that actively works against you when you start working on more advanced programs, and a mountain of code that's accumulated over the years that you have to maintain.
We work on high concurrency systems here, and I very much enjoy not ever having to think about "is this thing thread safe" because the compiler is checking that for you. I love being able to use the type system to offer my co-workers powerful, but difficult to misuse libraries. I like having sensible abstractions around concurrent execution.
Like, for example, if you create a channel in go, and for whatever reason, don't try to read from the channel, or give up (because you're racing a timeout), then the goroutine that tries to write to that channel will block forever and leak. In Rust, if you try to write to a channel where there is no longer a receiver, the write to channel will return an error, which you can then choose to handle, or simply ignore depending on your use-case. Of course, you can be wise and allocate your channels with a capacity of 1, but you can also just completely forget that, and start a steady leak of goroutines for the lifetime of your program that the garbage collector won't save you from!
Want to execute many futures with bounded concurrency in Rust and collect the results back into a Vec, but give up if any of the futures fail, or if a timeout is elapsed, and also make sure that all allocated resources are properly dropped and closed in the event that any errors happen? Just combine a futures::stream::StreamExt::{buffer_unordered, collect}, and a tokio::time::timeout, and in a few lines of code you've done it.
Want to do the same in Go? Spawn a pool of goroutines I guess, distribute two channels, one for sending them work, one for receiving work, and don't forget to throw in a WaitGroup, pass a context along to all the goroutines, make sure you don't forget any defers, if you are using a shared resource, make sure it's thread safe, or make sure you're locking/unlocking the appropriate mutex, make sure you size your result channel appropriately or you might leak goroutines and any allocations they hold if your main goroutine that's spawned all that work timed out waiting for the results to come in. Is there a library that does all this for you in Go? I googled "golang run many goroutines and collect their results" and looked at the first page of results, and it's basically the above...
It is no surprise, then, that we've picked up Rust pretty seriously. When you're looking to build reliable systems with serious speed and massive concurrency, you pick the best tool for the job. For us that's Rust, not Go. And for our real-time distributed systems, we pick Elixir, because BEAM/OTP is just so dang good.
> We use Rust over Go, not only because of the garbage collection issues, but because it's truly a better language in almost every way (once you learn it!)
What about complexity? How does "enterprise Go" code compare to "enterprise Rust" code? And what about the tooling. The other threads here are dwelling on GC, and latency, and threading and so on.
One of Go's selling points is that it tends to force writing simple-to-read code.
My general experience in C++/C#/Java/Kotlin is once a code base gets beyond a certain size and number of developers, without any discipline, it becomes a hot mess.
All the issues you mention for Go plague literally every other language out there except for a very small minority, of which Rust is one. It's a fair criticism, but you're completely disregarding the downside around development speed.
I've used both languages extensively. I like both languages. I still reach for Go first because it's faster to develop with, simply due to the compile times.
+1. Used Go, Python, React, etc extensively for years at Facebook. The massive Go projects were easiest to extend, operate, and on-board people to.
Rust vs Go discussions are pretty silly. More folks should think of Rust + Go. There are a lot of glue-services that can take the trade-off of a GC, be magnitudes better than Python, and be maintained by a short-staffed team.
Plus, Rust is still plagued by Python's (and others') async problem: it's easy to accidentally block the event loop. When you have a ton of tiny glue services, especially if one has a lot of contributors / internal libs, it's an easy mistake. Critical-path things are worth the attention to detail. Others... ¯\_(ツ)_/¯
> At the expense of unpredictable, expensive memory leaks triggering application termination.
Rust's borrow checker and ownership model defend against this as well in safe Rust; unless you are doing something very stupid, this is just false.
Isn't this a function of them being such heavy Erlang users who are writing NIFs (via Rustler) in Rust?
Or they just got bored and wanted to try a shinier toy. I've seen this happen dozens of times; all the bullshit for justifying it is just that, bullshit.
Not saying this is the case here but highly likely.
Well... is this still true? Go's had a lot of perf improvements in the last two years.
It wasn't true even when they wrote the blog post. Realistically this should read 2018, because apparently they waited two years before writing it.
There was a major performance boost in Go GC just after this happened
C with coroutines?
meh. I'm switching from Discord to Guilded.
Isn't the TL;DR on this that Go let them scale up for years before it became the bottleneck? A natural progression for any successful project...