I write a lot about investigations into tricky bugs – CPU defects, kernel bugs, transient 4-GB memory allocations – but most bugs are not that esoteric. Sometimes tracking down a bug is as simple as paying attention to server dashboards, spending a few minutes in a profiler, or looking at compiler warnings.
Here, then, are three significant bugs that I found and fixed, all of which were sitting in the open, just waiting for somebody to notice.
Server CPU Surprise
A few years ago I spent a few weeks doing some memory investigations on live game servers. The servers were running Linux in distant data centers so much of my time was spent getting permissions so I could tunnel to the servers, and learning how to effectively use perf and other Linux diagnostic tools. I found a series of bugs that were causing memory usage to be triple what it needed to be and I fixed those:
- I found map ID mismatches which meant that a new copy of ~20 MB of data was loaded for every game instead of being reused
- I found an unused (!) 50 MB (!!) global variable that was memset to zero (!!!), thus ensuring it consumed per-process physical RAM (there’s a sketch of this pattern right after the list)
- Miscellaneous smaller fixes
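The second item deserves a bit of explanation. A zero-initialized global lands in the .bss section, which the OS backs with demand-zero pages, so it costs essentially no physical memory until somebody writes to it. The memset was that write. Here is a minimal sketch of the pattern, with made-up names rather than the actual game code:

    #include <cstring>

    // Lives in .bss: costs essentially no physical RAM until it is written to.
    static char g_scratchBuffer[50 * 1024 * 1024];  // hypothetical unused buffer

    void InitScratchBuffer() {
        // Redundant – the buffer is already zero – but writing every byte
        // dirties every page, forcing the kernel to commit ~50 MB of real
        // memory in every server process.
        std::memset(g_scratchBuffer, 0, sizeof(g_scratchBuffer));
    }

Since the variable was never read, deleting it (and its memset) gave all of that memory back.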
But that’s not what this story is about.
Having taken the time to learn how to profile our game servers I figured I should poke around a bit more, so I ran perf on the servers for one of our other games. The first server process that I profiled was… odd. The live view of the CPU sampled data showed that a single function was consuming 100% of the CPU time. Within that function it seemed to show that just fourteen instructions were executing. But that didn’t make sense.
My first assumption was that I was using perf incorrectly or misinterpreting the data. I looked at some other server processes and found that roughly half of them were in this strange state. The other half had CPU profiles that looked more normal.
The function in question was traversing a linked list of navigation nodes. I asked around and found a programmer who said that floating-point precision issues might cause the game to generate navigation lists with loops. They’d always meant to cap how many nodes would be traversed but had never got around to it.
So, mystery solved, right? Floating-point instabilities cause loops in the navigation lists, the game traverses them endlessly, the behavior is explained.
But… this explanation meant that whenever this happened the server process would get into an infinite loop, all players would have to disconnect, and the server process would consume an entire CPU core indefinitely. If that was happening wouldn’t we eventually run out of resources on our server machines? Wouldn’t somebody have, you know, noticed?
I tracked down the server monitoring and dug into the CPU-usage graphs.
As far back as the monitoring went (a year or two) I could see the daily and weekly fluctuations of server load, and overlaid on that was a monthly pattern. CPU usage would gradually increase and then drop back down to zero. A bit more asking around revealed that the server machines were rebooted once a month. And finally everything made sense:
- On any particular run of a game there was a small chance that the server process would get stuck in an infinite loop
- When this happened the players would disconnect and the server process would stay in this loop until the machines were rebooted at the end of the month
- The CPU monitoring dashboard clearly showed that this bug was reducing server capacity by about 50% on average
- Nobody ever looked at the monitoring dashboard
The fix was a few lines of code to stop traversing after twenty navigation nodes, presumably saving a few million dollars in server and power costs. I didn’t find this bug by looking at the monitoring graphs, but anybody who looked at them could have.
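The shape of the fix was roughly this – a sketch with hypothetical names rather than the actual engine code – bounding the walk so that a cycle in the list can no longer spin forever:

    struct NavNode {
        NavNode* next;
        // ... navigation data ...
    };

    // A loop in the list (from floating-point precision issues during path
    // generation) used to make this walk run forever and pin a CPU core.
    // Capping the number of nodes visited makes the worst case harmless.
    const int kMaxNavNodes = 20;

    void TraverseNavList(NavNode* node) {
        for (int visited = 0; node != nullptr && visited < kMaxNavNodes; ++visited) {
            // ... process this node ...
            node = node->next;
        }
    }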
I love that the frequency of the bug was perfectly set to maximize the cost without ever quite causing serious enough problems for it to be caught. It’s like a virus which evolves to make people cough, but not kill them.
Slow startup
Software developer productivity is intimately tied to the latency of the edit/compile/link/debug cycle. That is, having made a change to a source file, how long does it take to be running a new binary with that change incorporated? I’ve done a lot of work over the years on reducing compile/link times, but startup times are also important. Some games do a huge amount of work every time they are launched. I’m impatient and I’m often the first person to spend a few hours or days trying to make game startup run a few seconds faster.
In this case I ran my favorite profiler and looked at the CPU usage during the initial load. There was one stage that looked the most promising: about ten seconds spent initializing some lighting data. I was hopeful that there might be some way to speed up those calculations and maybe save five seconds or so from startup time. Before digging in too deeply I consulted with the graphics expert. They said:
“We don’t use that lighting data in this game. Just remove the call.”
Oh. Well. That was easy.
With half an hour of profiling and a one-line change the launch time to the main menu was cut in half, with no extraordinary effort required.
An ill-timed crash
The variadic arguments in printf-style formatting make it easy to get type mismatches. The practical results vary considerably:
- printf("0x%08lx", p); // Printing a pointer as an int – truncation or worse on 64-bit
- printf("%d, %f", f, i); // Swapping float and int – could print nonsense, or might actually work (!)
- printf("%s %d", i, s); // Swapping the order of string and int – will probably crash
The standard says that these mismatches are undefined behavior so technically anything could happen, and some compilers will generate code that intentionally crashes on any of these mismatches, but these are some of the most likely results (aside: understanding why #2 often prints the desired result is a good ABI puzzle).
These mistakes are very easy to make so modern compilers all have ways to warn developers when they have a mismatch. gcc and clang have annotations for printf-style functions and can warn on mismatches (although, sadly, the annotations don’t work on wprintf-style functions). VC++ has (different, unfortunately) annotations that /analyze can use to warn on mismatches, but if you’re not using /analyze then it will only warn on the CRT-defined printf/wprintf-style functions, not your own custom functions.
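For reference, here is roughly what it takes to get both compilers to check a custom printf-style function – the LogPrintf function and the macro names are made up for illustration:

    #include <cstdarg>
    #include <cstdio>

    #ifdef _MSC_VER
    #include <sal.h>  // _Printf_format_string_ is checked by /analyze
    #define PRINTF_FORMAT _Printf_format_string_
    #define PRINTF_ATTR(fmt_index, first_arg_index)
    #else
    #define PRINTF_FORMAT
    // gcc/clang warn when the arguments don't match the format string.
    #define PRINTF_ATTR(fmt_index, first_arg_index) \
        __attribute__((format(printf, fmt_index, first_arg_index)))
    #endif

    void LogPrintf(PRINTF_FORMAT const char* format, ...) PRINTF_ATTR(1, 2);

    void LogPrintf(PRINTF_FORMAT const char* format, ...) {
        va_list args;
        va_start(args, format);
        std::vfprintf(stderr, format, args);
        va_end(args);
    }

With that in place a call like LogPrintf("%s", 42) gets flagged at compile time (assuming the usual format warnings are enabled) instead of misbehaving at runtime.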
The company I was working at had annotated their printf-style functions so that gcc/clang would emit warnings but had then decided to ignore the warnings. This is an odd decision since these warnings are 100% reliable indicators of bugs – the signal-to-noise ratio is infinite.
I decided to start cleaning up these bugs, using VC++’s annotations and /analyze to make sure I found all of them. I’d worked my way through most of the errors and had one final change waiting for code review before I could submit it.
That weekend we had a power outage at the data center and all of the servers went down (there may have been some power-configuration mistakes). The on-call people scrambled to get things back up and running before too much money was lost.
The funny thing about printf bugs is that they usually misbehave 100% of the time that they are executed. That is, if they are going to print incorrect data or crash then they usually do it every time that they run. So, the only way that these bugs stick around is if they are in logging code that is never read, or error handling code that is rarely executed.
It turns out that “restarting all servers simultaneously” hits some code paths that are not normally executed. Servers that are starting up go looking for other servers, can’t find them, and print a message like:
fprintf(log, "Can't find server %s. Error code %d.\n", err, server_name);
Oops. Variadic arguments mismatch. And a crash.
The on-call people now had an additional problem. The servers needed to be restarted, but that couldn’t be done until the crash dumps were examined, the bug was found, the bug was fixed, the server binaries rebuilt, and a new build deployed. This was a pretty quick process – a few hours I believe – but an entirely avoidable one.
It felt like the perfect story to demonstrate why we should take the time to resolve these warnings – why ignore warnings that tell you that code will definitely crash or misbehave when executed? However, nobody seemed to care that fixing this class of warnings would provably have saved us a few hours of downtime. Actually, the company culture didn’t seem interested in any of these fixes. But it wasn’t until this last bug that I realized I needed to move on to a different company.
Lessons?
If everyone on a project spends all of their time heads-down working on the features and known bugs then there are probably some easy bugs hiding in plain sight. Take some time to look through the logs, clean up compiler warnings (although, really, if you have compiler warnings you need to rethink your life choices), and spend a few minutes running a profiler. Extra points if you add custom logging, enable some new warnings, or use a profiler that nobody else does.
And if you make brilliant fixes that improve memory/CPU/stability and nobody cares, maybe find a company where they do.