Facebook’s evolutionary search for crashing software bugs

Ars gets the first look at Facebook’s fancy new dynamic analysis tool.

An arty photo of one of Facebook's data centres. Credit: Facebook

With 1.3 billion daily users, the Facebook site and its apps are the most-used pieces of software in the world. Only a handful of software companies have ascended to a similar echelon of ubiquity, including Microsoft, Google, and Apple. For better or worse, that is the world we now live in, where a large percentage of our waking hours is spent interacting with software—and Facebook leads the pack, with the average user spending 50 minutes per day mostly watching videos and liking photos of babies. Television is the only leisure activity in the world that receives more attention than Facebook. And don’t forget that Facebook now owns Instagram and WhatsApp, too.

Look, it’s really hard to find stock imagery for “evolutionary algorithm.” Credit: Adobe Stock

It is understandable, then, that Facebook cares a lot about the quality of its software. If Facebook pushes out a new version of its Android app with a crashing bug, millions of users could be affected. Those users might be inclined to switch to another social network, or even worse: put down their phone and interact with the real world. The net effect is the same, either way: Facebook’s share of your attention, and thus potential revenue, decreases.

That’s why Facebook has some advanced bug-finding tools—including a devilishly clever dynamic analysis tool, initially devised by researchers at University College London and then acquired and further developed by Facebook’s London office. This is the first time Facebook has shown the inner workings of this new tool, dubbed Sapienz, to the press.

Static vs. dynamic analysis

There are two ways of automatically analysing a piece of software in the hunt for bugs, security vulnerabilities, and other potential issues. Static analysis, as the name implies, is only interested in the source code of the program. Dynamic analysis is the opposite: you run the program, feed it a bunch of inputs, and record how it behaves.

Each technique serves a different purpose, and a big software company would usually use both. Static analysis is perfect for formally verifying that an algorithm works as intended, or for highlighting bad code that might allow for a buffer overflow or other security vulnerability. Dynamic analysis is better at finding the gnarly edge cases that cause crashes. Humans can manually perform both analyses, of course, but computers are obviously a lot quicker when it comes to testing millions of possible inputs.

Facebook’s static analyser is called Infer. The company open-sourced the tool in 2015, and a lot of big names (Uber, Spotify, Mozilla) use it. There isn’t a whole lot to say about it, other than it seems to be very popular and effective; download it today!

Sapienz

There are a lot of dynamic analysers out there, but none like Sapienz, according to Facebook. The biggest problem with dynamic testing is finding the right inputs that cause an app to crash. Facebook says that some dynamic analysers just throw random sequences of inputs at apps, with up to 15,000 input events between crashes. Sapienz, on the other hand, only needs about 100-150 events to find a crashing bug. In practice, that means Facebook finds significantly more crashing bugs in a shorter amount of time.

Sapienz has three main tricks up its sleeve. First, it uses a search-based evolutionary algorithm, rather than a random or model-based approach. Second, the fitness function that guides how the algorithm evolves is incredibly complex: there are multiple objectives, entwined by Pareto optimality, that must be fulfilled for each evolutionary change to be a success. And third, Facebook can run Sapienz on its One World test platform, which lets engineers find crashing bugs on hundreds of different Android devices simultaneously. (Sapienz only supports Android apps currently, though there are plans to expand to other platforms and app types.)

A block diagram of Sapienz. It might make a bit more sense if you finish reading the story, then try to decode this.

Let’s unpack those points in order, as they all play an important part in the efficacy of Sapienz. Most existing dynamic analysers either throw random inputs at an app, or use some kind of model: a set of test cases that are informed by outside knowledge, such as the app’s header or configuration files. Sapienz uses an evolutionary algorithm that, er, evolves the test inputs based on how the app actually responds to earlier inputs, with the hope of more efficiently finding crash cases.
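To make the evolutionary idea concrete, here’s a minimal sketch of such a loop in Python. Everything here is illustrative, not Sapienz’s actual implementation: a “test case” is just a list of UI event IDs, and `run_app` is a stand-in for actually executing those events against the app.

```python
import random

def run_app(events):
    """Stand-in for executing events against the app; returns a toy
    fitness score (here: how many distinct events were exercised)."""
    return len(set(events))

def evolve(population, generations=10, mutation_rate=0.2):
    """Evolve a population of event sequences toward higher fitness."""
    for _ in range(generations):
        # Score every candidate and keep the fitter half as parents.
        scored = sorted(population, key=run_app, reverse=True)
        parents = scored[: len(scored) // 2]
        children = []
        while len(children) < len(population) - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, min(len(a), len(b)))
            child = a[:cut] + b[cut:]          # single-point crossover
            if random.random() < mutation_rate:
                # Mutation: replace one event with a random one.
                child[random.randrange(len(child))] = random.randrange(100)
            children.append(child)
        population = parents + children
    return population
```

Because the fittest parents are carried over unchanged (elitism), the best fitness in the population can only improve or stay the same from one generation to the next.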

The key to a successful evolutionary algorithm is its fitness function. I’m not your college computer science lecturer, so I won’t go into too much detail, but a fitness function essentially looks at the result of a test case, and decides how close that result is to a desired outcome/objective. The results that don’t fulfil the fitness function are tied up in a burlap sack and thrown in the river; the good ones are bred together, to form the basis of the next round of testing.

According to Facebook’s engineers, most of their secret sauce is in Sapienz’s fitness function, which has three objectives: to test as many of the app’s methods and statements as possible, find as many crashes as possible, and minimise the length of the test sequences required to cause crashes. The first two are all about producing better, crash-free software; the third is to improve the efficiency of the system, so that a decent number of crashes can be found in a reasonable amount of time.

These three objectives are assessed by the fitness function for Pareto efficiency. That is, one objective isn’t more important than the others: if the evolutionary algorithm is only producing long test sequences, but they’re providing good coverage and finding lots of crashes, then those long tests will be kept alive. Over time the system tries to hit Pareto optimality: where it’s impossible to improve one objective without negatively impacting another. So, in this case, the algorithm would attempt to reduce the test sequence length without reducing coverage or the fault revelation.
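The dominance test behind Pareto optimality is simple to state in code. Here’s a hedged sketch, assuming each candidate’s objectives are a tuple of (coverage, crashes found, sequence length), where the first two are maximised and the third minimised:

```python
def dominates(a, b):
    """True if candidate a Pareto-dominates candidate b: a is at least
    as good on every objective and strictly better on at least one."""
    cov_a, crash_a, len_a = a
    cov_b, crash_b, len_b = b
    at_least_as_good = cov_a >= cov_b and crash_a >= crash_b and len_a <= len_b
    strictly_better = cov_a > cov_b or crash_a > crash_b or len_a < len_b
    return at_least_as_good and strictly_better

def pareto_front(candidates):
    """Candidates not dominated by any other candidate survive."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other != c)]
```

A long test sequence survives as long as no other candidate beats it on every front at once—which is exactly why good coverage and crash counts can keep a lengthy sequence alive.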

Here are a couple of graphs/plots showing the code coverage of Sapienz versus Monkey and Dynodroid. As you can see the coverage varies a bit depending on program size.

Sapienz does a few other clever things to root out more crashes than other analysers, too. Many automatic analysers use atomic events—that is, just a stream of inputs that aren’t contextually related. Apps are much more complex than that, though: you do one thing (try to upload an image), and then do another (select an image) within the same context. To test more complex interactions, Sapienz uses motif patterns—groupings of low-level atomic events—which are derived from the various UI elements used by the app.
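The motif idea can be sketched as grouping atomic events into named flows. All the names below are hypothetical—real motifs are derived from the app’s actual UI elements:

```python
# A motif bundles atomic (event, target) pairs that belong to one
# realistic UI flow, rather than firing unrelated events at random.
MOTIFS = {
    "upload_photo": [("tap", "upload_button"),
                     ("tap", "gallery_item"),
                     ("tap", "confirm_button")],
    "write_comment": [("tap", "comment_box"),
                      ("type_text", "comment_box"),
                      ("tap", "post_button")],
}

def build_sequence(motif_names):
    """Flatten the chosen motifs into one executable event sequence."""
    return [event for name in motif_names for event in MOTIFS[name]]
```

A test sequence built from motifs exercises multi-step interactions—upload, then comment—that a stream of unrelated atomic events would rarely stumble into.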

Sapienz also strays slightly across the border into static analysis: it attempts to reverse-engineer the app (an Android APK in this case) to pull out some strings, which it then uses as natural-language inputs when testing begins. “We found this seeding to be particularly useful when testing apps that require a lot of user-generated content, e.g., Facebook, because it enables Sapienz to post and comment in an apparently more human-meaningful way,” say the researchers.
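Since an APK is just a zip archive, the string-seeding step can be approximated with a few lines of Python. This is a rough sketch of the idea—pulling printable ASCII runs out of the archive’s files—not Sapienz’s actual extraction, which is more sophisticated:

```python
import re
import zipfile

def seed_strings(apk_path, min_len=4):
    """Pull printable ASCII runs out of an APK (a zip archive) to use
    as seed text inputs for testing."""
    pattern = re.compile(rb"[ -~]{%d,}" % min_len)  # runs of printable ASCII
    found = set()
    with zipfile.ZipFile(apk_path) as apk:
        for name in apk.namelist():
            data = apk.read(name)
            for match in pattern.findall(data):
                found.add(match.decode("ascii"))
    return sorted(found)
```

Feeding the app strings it already knows about—labels, sample text, resource names—is far more likely to drive it into interesting states than random keyboard mashing.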

An example of Sapienz in action.

How good is Sapienz?

The end result, at least according to the engineers—and to a research paper they produced at University College London before being acquired by Facebook—is that Sapienz is orders of magnitude more efficient than other dynamic analysers.

For random-based analysers there can be upwards of 15,000 test inputs between crashes; for Sapienz, that number is closer to 100–150.

The engineers also report that Sapienz discovers more crashing scenarios than other tools. From a standard test suite of 68 Android apps, Sapienz caused 6,866 crashes, versus 1,196 for the “official” Application Exerciser Monkey and 125 for Dynodroid. More importantly, Sapienz found 104 unique crashing scenarios, versus 41 and 13 respectively. Sapienz also found crashes in 14 apps that the other two analysers reported as crash-free.

The most common types of crash, from Android’s top 1,000 apps.

In another test, the researchers ran Sapienz on the top 1,000 most popular Android apps and found 558 unique crashes, and then tried to report those crashes to the developers. At the time of publishing, 14 had been confirmed as real faults. In case you’re curious, of those 558 unique crashes, 161 were native crashes (i.e. they occurred in native code, outside the Android runtime), 149 were of the null pointer variety, and 110 were caused by Android’s activity-not-found exception.

Automated dynamic analysers still have plenty of space for improvement, though: across the suite of 68 apps, Sapienz managed an average code coverage of about 55 percent, versus about 48 percent for Monkey and 43 percent for Dynodroid. Getting into all of those nooks and crannies would obviously generate even more crashes.

Helping engineers find and fix bugs faster

Finding bugs is just the first step. To help engineers locate the problematic bit of code, Sapienz produces a video clip of each test sequence that causes a crash, plus the usual dump of debug text and stack trace. Sadly Facebook didn’t provide one of those video clips, but it did produce the video embedded at the top of this page, which is still pretty cool.

What you see in the video is Sapienz testing the Bookworm app on multiple Android emulators, which are running inside Facebook’s One World system. One World, which is a major component of Facebook’s software quality control processes, began life as the “mobile device lab”—hundreds of different mobile phones, all wired up inside custom server racks to simplify software testing and QA. One World, which Facebook unveiled in May, is the next, more formalised iteration.

Here are some of Facebook's earlier attempts at a mobile device lab, which later matured into One World.

Speaking to Ars in an interview, the researchers’ excitement is palpable. Sapienz was developed by three UCL computer scientists just a few months ago—and now Facebook’s two billion monthly users are reaping the rewards. Historically, most researchers could only dream of having such a large impact so quickly.

But that’s technology for you: in many cases, if you’re competing against another person or company or civilisation, developing or acquiring an advanced technology is one of the best ways to get ahead. And thus, if you’re fortunate enough to be a tech specialist or tech company, then developing technology that assists in the rapid deployment of other new technologies is even savvier. Viva la tech!

Sebastian is the editor of Ars Technica UK. He usually writes about low-level hardware, software, and transport, but it is emerging science and the future of technology that really get him excited.
