Safety in an unsafe world

Joshua Liebow-Feeser took to the stage at RustConf to describe the methodology that his team uses to encode arbitrary constraints in the Rust type system when working on the Fuchsia operating system (slides). The technique is not unknown to the Rust community, but Liebow-Feeser did a good job of both explaining the method and making a case for why it should be used more widely.

He began with the motivation for his presentation, which was netstack3, Fuchsia's new networking stack written entirely in Rust. The project started six years ago, and Liebow-Feeser led the project for four years. Networking stacks are "very serious", he said. They are responsible for almost all traffic, often implement dozens of different protocols, and are the first line of defense against attackers. They're also just plain large pieces of software.

Netstack3 encompasses 63 crates and 60 developer-years of code. It contains more code than the top ten crates on crates.io combined. Over the last year, the code has finally become ready to deploy. But deploying it to production requires care — networking code can be hard to test, and the developers have to assume it has bugs. For the past eleven months, they have been running the new networking stack on 60 devices, full time. In that time, Liebow-Feeser said, most code would have been expected to show "mountains of bugs". Netstack3 had only three; he attributed that low number to the team's approach of encoding as many important invariants in the type system as possible.

The method

This isn't a new idea, Liebow-Feeser said. Many people have tried to make it so that buggy programs simply don't compile. But the netstack3 team has a concrete, general framework for approaching this kind of design. He broke the process into three steps: definition, enforcement, and consumption. For definition, the programmer must take something that Rust can reason about (usually types) and attach the desired property to it. This is usually done via documentation — describing that a particular trait represents a particular property, for example. Then the programmer enforces the property by making sure that all of the code that directly deals with the type upholds the relevant invariant.

Critically, the programmer must at some point use field privacy or unsafe markers to ensure that these invariants cannot be broken elsewhere. For example, a structure with all public fields can be constructed from anywhere, so it isn't possible to rely on invariants tied to that type. If the structure includes a private field, it can only be constructed in that specific module, which means the programmer can audit all of the uses. When adding a private field isn't a solution for some reason, Rust programmers can fall back on marking things unsafe, so that other programmers will hopefully read the documentation before using a structure or trait. Finally, other code can rely on the property as a guarantee, such as by making optimizations that require it.
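
To make this concrete, here is a minimal sketch (the Percentage type and its methods are invented for illustration, not taken from the talk) in which a doc comment defines the invariant, a private field and a checked constructor enforce it, and the rest of the program consumes it without re-checking:

    // Definition: the doc comment attaches the invariant to the type.
    mod percent {
        /// Invariant: the wrapped value is always in the range 0..=100.
        pub struct Percentage(u8); // private field: not constructible outside this module

        impl Percentage {
            /// Enforcement: the only way to build a Percentage checks the range.
            pub fn new(value: u8) -> Option<Percentage> {
                (value <= 100).then_some(Percentage(value))
            }

            /// Consumption: callers may assume the invariant without re-checking it.
            pub fn get(&self) -> u8 {
                self.0
            }
        }
    }

    fn main() {
        let p = percent::Percentage::new(42).unwrap();
        assert_eq!(p.get(), 42);
        // let bogus = percent::Percentage(200); // error: the field is private
    }

Only the percent module needs to be audited in order to trust, everywhere else in the program, that a Percentage is in range.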

In a codebase using this method, each guarantee will have a single Rust module that deals with the internals of the property. Then, the rest of the code can rely on the type checker to ensure that the property holds everywhere. For a more concrete understanding of what that looks like in practice, Liebow-Feeser introduced several examples.

The first example he gave was nearly trivial: a binary tree that needs to enforce ordering invariants between nodes. Of course Rust cannot natively check for this. So how might the programmer ensure that an invalid tree is never produced? Following Liebow-Feeser's process, they must first document the requirement. Then, they ensure that the code which deals directly with creating and modifying nodes in the tree respects the invariant. They must also ensure that relevant details are hidden behind private fields, so that code outside the module cannot interfere. Finally, the rest of the code can now rely on the fact that the tree is always in order.
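
A hedged sketch of what that might look like follows (the names are illustrative, not netstack3 code): the node fields are private, so the insertion code is the only place where the ordering invariant has to be audited.

    mod bst {
        /// Invariant (documented here, enforced below): for every node, all keys
        /// in `left` are smaller than `key` and all keys in `right` are larger.
        pub struct Tree {
            root: Option<Box<Node>>, // private: callers cannot build an unordered tree
        }

        struct Node {
            key: i32,
            left: Option<Box<Node>>,
            right: Option<Box<Node>>,
        }

        impl Tree {
            pub fn new() -> Tree {
                Tree { root: None }
            }

            /// The only mutation path; it preserves the ordering invariant.
            pub fn insert(&mut self, key: i32) {
                fn insert_into(slot: &mut Option<Box<Node>>, key: i32) {
                    match slot {
                        None => *slot = Some(Box::new(Node { key, left: None, right: None })),
                        Some(node) if key < node.key => insert_into(&mut node.left, key),
                        Some(node) if key > node.key => insert_into(&mut node.right, key),
                        Some(_) => {} // the key is already present
                    }
                }
                insert_into(&mut self.root, key);
            }

            /// Consumers rely on the invariant, for example to search in O(height).
            pub fn contains(&self, key: i32) -> bool {
                let mut cur = &self.root;
                while let Some(node) = cur {
                    if key == node.key {
                        return true;
                    }
                    cur = if key < node.key { &node.left } else { &node.right };
                }
                false
            }
        }
    }

    fn main() {
        let mut tree = bst::Tree::new();
        for key in [5, 3, 8] {
            tree.insert(key);
        }
        assert!(tree.contains(3) && !tree.contains(7));
    }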

That approach will be familiar to any programmer who has implemented a data structure. But the approach generalizes to more subtle properties, Liebow-Feeser said. His second example was actually from Rust's standard library. Rust promises thread safety, but the language itself actually knows nothing about thread-safety properties. Those are all implemented in the standard library.

One relevant property that the authors of the standard library would like to guarantee is that non-thread-safe closures are never passed to std::thread::spawn(). To do this, they created an unsafe marker trait (Send) representing values, closures included, that can safely be sent to another thread. Send is unsafe not because it is inherently dangerous, but just because it represents a property that the compiler cannot check. The trait could then be used as a bound to restrict which functions can be passed to spawn(). Finally, while safe Rust code cannot implement the trait directly, the standard library adds various trait-inference rules and derive macros to let other code implement the trait. (In fact, this is a slight simplification because Send is an auto trait — an unstable feature that instructs the compiler to implement traits automatically for compatible types, although Liebow-Feeser did not mention this in his talk.)
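
A small sketch of the mechanism is shown below; run_on_new_thread() is an invented stand-in, but std::thread::spawn() itself places essentially the same Send (and 'static) bounds on its argument.

    use std::sync::Arc;
    use std::thread;

    // The Send bound is what lets the compiler reject non-thread-safe closures.
    fn run_on_new_thread<F>(f: F) -> thread::JoinHandle<()>
    where
        F: FnOnce() + Send + 'static,
    {
        thread::spawn(f)
    }

    fn main() {
        // Arc<i32> is Send, so a closure capturing it is Send as well.
        let shared = Arc::new(42);
        let handle = run_on_new_thread(move || println!("{shared}"));
        handle.join().unwrap();

        // Rc is deliberately not Send; uncommenting these lines fails to
        // compile with "`Rc<i32>` cannot be sent between threads safely".
        // let local = std::rc::Rc::new(42);
        // run_on_new_thread(move || println!("{local}"));
    }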

Automatic deadlock prevention

The final example Liebow-Feeser gave was significantly more involved, and was taken directly from the netstack3 code. The networking stack needs to be multithreaded with fine-grained locking, for performance, he explained. Rust's existing libraries can guarantee that the code is thread-safe, but they can't guarantee that it won't deadlock. With so many developers working on such a large codebase, being able to know that the code won't deadlock is an important property that isn't easy to guarantee. Netstack3 has 77 mutexes spread across thousands of lines of code. Its predecessor, netstack2 (implemented in Go), had a lot of deadlocks.

For the first step, the developers added names to each mutex, in the form of a generic Id type parameter. The relevant IDs are all zero-sized types, so they don't exist at run time, and have no run-time overhead. Then, they defined two unsafe traits: LockAfter and LockBefore. These represent exactly what they sound like — for a specific mutex to be locked after another mutex, LockAfter must be implemented for the relevant types. They added a derive macro so that each ID type needs little boilerplate, and added blanket trait implementations such that a cycle in the graph of locks results in a compile-time error because of overlapping trait implementations.

In order to actually make the existence of the locking traits useful, however, the developers also needed to add methods that use them. In this case, they created a new Context type that carries around one of the ID types. The lock() method on their mutex type takes a mutably borrowed context, so the original context cannot be used until the mutex is unlocked. At the same time, it provides a new context with the ID of the lock, which can only be used to lock mutexes with the correct LockAfter implementation.
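
The sketch below is a compressed, hypothetical rendering of how those pieces can fit together; the OrderedMutex, Locked, and Unlocked names are invented here, and the real netstack3 machinery additionally derives the transitive ordering relation, which is how cycles become overlapping-implementation errors.

    use std::marker::PhantomData;
    use std::sync::{Mutex, MutexGuard};

    // Zero-sized "names" for lock levels; they exist only at compile time.
    pub struct IpState;
    pub struct DeviceState;
    pub struct Unlocked; // the state in which no ordered lock is held

    // Unsafe marker trait: `B: LockAfter<A>` records that it is permitted
    // to acquire B while A is held.
    pub unsafe trait LockAfter<Before> {}

    // The allowed ordering in this sketch: anything may follow Unlocked,
    // and DeviceState may follow IpState.
    unsafe impl<T> LockAfter<Unlocked> for T {}
    unsafe impl LockAfter<IpState> for DeviceState {}

    // A mutex tagged with its lock level.
    pub struct OrderedMutex<Id, T> {
        data: Mutex<T>,
        _id: PhantomData<Id>,
    }

    // A context recording the most recently acquired lock level.
    pub struct Locked<Id>(PhantomData<Id>);

    impl<Id, T> OrderedMutex<Id, T> {
        pub fn new(t: T) -> Self {
            OrderedMutex { data: Mutex::new(t), _id: PhantomData }
        }

        /// Lock this mutex: the caller's context stays mutably borrowed for
        /// as long as the guard lives, and the returned context only allows
        /// locks that are declared to come after `Id`.
        pub fn lock<'a, Prev>(
            &'a self,
            _ctx: &'a mut Locked<Prev>,
        ) -> (MutexGuard<'a, T>, Locked<Id>)
        where
            Id: LockAfter<Prev>,
        {
            (self.data.lock().unwrap(), Locked(PhantomData))
        }
    }

    fn main() {
        let ip = OrderedMutex::<IpState, _>::new(0u32);
        let dev = OrderedMutex::<DeviceState, _>::new(0u32);

        // In real code, only the lock-ordering module could create this root context.
        let mut root = Locked::<Unlocked>(PhantomData);
        let (_ip_guard, mut after_ip) = ip.lock(&mut root);
        // Fine: DeviceState is declared to come after IpState.
        let (_dev_guard, _after_dev) = dev.lock(&mut after_ip);

        // The reverse order does not compile: locking DeviceState first and
        // then calling ip.lock() would require `IpState: LockAfter<DeviceState>`,
        // which is not implemented.
    }

The sketch hard-codes a single allowed edge in the lock graph; the point is only to show how the context's type parameter, the trait bound, and the mutable borrow combine to rule out misordered locking.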

So, from the perspective of all of the code outside the module that implements this, mutexes can only be locked with an appropriate, un-borrowed context object. The context objects impose a global ordering on how mutexes can be locked, and attempting to add an incorrect LockAfter implementation (one that would permit a cycle) is a compile error. Programmers are free to lock whatever mutexes they can get past the compiler, secure in the knowledge that this can't cause a deadlock. In turn, this makes it easier to justify adding more fine-grained locking to the implementation. At run time, all of the type-level machinery associated with this guarantee has been compiled out, so there is no run-time overhead.

Conclusion

In practice, there are actually some problems with the simplified deadlock-prevention example as presented, Liebow-Feeser said. But generally, this ability to take a property that the language does not know about and "teach" it to Rust, so that now it is enforced at compile time, is why he likes to call Rust an "X-safe" language. It's not just memory-safe or thread-safe, but X-safe for any X that one takes the time to implement in the type system.

He concluded his talk by asking people to try this out for themselves, especially in new domains that nobody has tackled yet. He encouraged people to think of functions that panic or return an Option as a code smell — those are all places where the code could encode the necessary invariants at compile time instead. He called on the audience to "make your APIs exactly match the structure of the problem". At the same time, he cautioned people that the appropriate solution may not be the same every time. For deadlock prevention in netstack3, the developers used traits and trait-inference rules to ensure that there are no cycles in the lock ordering. But not every circumstance will warrant the use of traits; the important thing is to follow the method he laid out.
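
As a tiny illustration of the point about Option-returning functions (the NonEmpty type is invented here, not taken from the talk): rather than having every caller handle an Option because a list might be empty, an API can demand a type that cannot be empty, so the question never comes up at run time.

    /// A list that is never empty, so asking for its first element needs no Option.
    pub struct NonEmpty<T> {
        first: T,
        rest: Vec<T>,
    }

    impl<T> NonEmpty<T> {
        pub fn new(first: T, rest: Vec<T>) -> Self {
            NonEmpty { first, rest }
        }

        /// No Option and no panic: the invariant makes both unnecessary.
        pub fn first(&self) -> &T {
            &self.first
        }

        pub fn len(&self) -> usize {
            1 + self.rest.len()
        }
    }

    fn main() {
        let interfaces = NonEmpty::new("lo", vec!["eth0", "wlan0"]);
        println!("{} interfaces; always at least: {}", interfaces.len(), interfaces.first());
    }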

Liebow-Feeser thinks "this can reshape how we do software engineering". There are lots of domains where correctness is important, and successfully scaling software in those areas will require making it possible for tools to verify that correctness, he said. Whether his prediction is right remains to be seen — but in any case, the method he spoke about seems like a nice framework to unify different techniques for ensuring program safety.

