On rewriting Waypipe in Rust

Waypipe is a proxy for Wayland applications, which makes it possible to run an application on a different computer but interact with it locally, as if it were running on the local computer. (Wayland is the slowly-improving window system protocol for Linux and the successor to X11, and most applications now support it. The protocol sends plain data over a Unix socket, along with file descriptors to share less serializable things like window surface image data.)

I wrote it during the summer of 2019, and implemented it in C because libwayland used C, because most libraries provide a C interface, because other programming languages often aren’t available or are hard to install as a user on old, shared systems, and because no complicated data structures or libraries were used for which C++ would be necessary. The core operations (basic protocol parsing and shared memory buffer replication) did not take long to implement, and were done in a week. Most of Waypipe’s code is spent making this practical: making the buffer replication for displayed windows run fast and only when necessary; handling other Wayland “protocols” (read: Wayland object types and associated methods); supporting replication of DMABUFs (GPU-side memory buffers used to transfer image data between applications, typically used by OpenGL and Vulkan in place of CPU-side shared memory file descriptors); and optionally video-encoding DMABUFs.

Making Waypipe reliable, secure, and efficient has been challenging. Waypipe exchanges messages with Wayland applications and compositors, neither of which it should trust to use the various Wayland protocols properly. In addition to the (currently rather theoretical) risk of malicious applications, ordinary mistakes and complicated stacks of libraries can lead to the Wayland protocols being used in unexpected ways. There are several libraries implementing the base wire protocol, a number of compositors and toolkits that use it, libraries that extend or try to “share” a single Wayland connection with an existing program, and clients that people have written which directly use a Wayland library instead of going through a toolkit, similarly to how many people directly used Xlib.

My approach was to try to write reliable code that handles all errors, in some form or another. (Ideally, by cleanly shutting down the connection and sending an error message to the application; this is what libwayland-server also does.) Of course, to make reliable code, I needed to test it. My main strategies were: trying many Wayland clients and subcomponents of Waypipe (worked, but tests take a while to write and still miss things), injecting errors (to check how broken memory allocation failure paths were), using AddressSanitizer and static analysis tools to detect issues, and fuzzing (to see what crashes when a fuzzer controls the Wayland message inputs and the internal protocol used to connect the local and remote Waypipe instances; like testing, this requires some framework code to let the fuzzer provide and manipulate file descriptors, which still doesn’t cover all cases).

Altogether, these testing approaches appear to have worked, but they require a measure of active maintenance over time as the code is updated. New Wayland protocols and protocol revisions continue to be made, and Waypipe has needed and will often need to adapt to them; the wl_drm protocol once used to share DMABUFs has now been entirely replaced by zwp_linux_dmabuf_v1, and new protocols for explicit synchronization, presentation timing, screen capturing, and color management are now done or being designed. There have also been new feature requests and ideas for performance improvements. Implementing all of these required or will require new code, which is not as well tested as the older code and would require a lot of work to bring to the same standard.

Rewriting Waypipe in Rust was expected to have multiple benefits. First, to reduce the cost of making changes and adding new features at the same level of security; Rust provides a framework with which to encapsulate memory-unsafe code, and a safe and comprehensive standard library, which together should significantly reduce the number of places where memory-unsafe bugs could appear in Waypipe. Second, I wanted to change Waypipe’s DMABUF handling backend library from libgbm to Vulkan to improve performance, handle explicit synchronization, and more efficiently do RGB to YCbCr conversion for the optional video encoding feature; in total I expected that this would require changing or adding about half of Waypipe’s lines of non-test code. Third, for me to better learn Rust; and fourth, because I had been hearing about other C or C++ to Rust rewrite projects, and was curious whether a rewrite would be worth it. The best way to determine that was to try it.

In practice

The rewrite went roughly as expected.

Instead of doing an incremental port of Waypipe, converting its various logical parts piece by piece, I redeveloped the Rust version in parallel, roughly following the same development path as the original Waypipe. (Except this time I knew the end goal.) That is, I started with a simplified form of the command line interface, and then developed a basic main proxy loop, Wayland protocol parsing logic, and shared memory buffer replication. The initial step was easier because I already had written a different (local) Wayland proxy program in Rust (windowtolayer). Once that was ready, I iteratively added back the various features of Waypipe, starting with damage tracking, compression support, and multithreaded buffer diff calculation and application; often testing the code by connecting it to the original Waypipe implementation.

Much of my time in the middle of the port was spent implementing DMABUF support, this time using Vulkan instead of libgbm. I started with a simple, single-threaded implementation and once that worked, progressively introduced multi-threading, buffer update calculations, zwp_linux_dmabuf_v1 protocol handling, and stride adjustments to match the weird way the original C implementation adjusted nominal buffer strides when using libgbm. To implement Waypipe’s optional video encoding feature, I started with the possibly tricky case of hardware video encoding and decoding. As Vulkan hardware video extensions had been released in the last few years, I just used ffmpeg’s encoder/decoder based on them, which was recently added but worked with few issues. Software video encoding and decoding were easy to add afterwards.

The second 90% of the work has been spent on all the miscellaneous tasks: bringing the Rust rewrite up to feature parity with the original version, getting it to integrate with Waypipe’s existing build system (using meson), and resolving the issues found after I deemed the Rust port good enough and brought it into the main git repository.

The rewritten code is slightly larger: tokei reports the C implementation had about 12000 lines of code, without comments and tests, and 19000 with comments and tests, while the Rust implementation has about 16000 lines of code without comments and tests, and 23000 when comments and tests are included. (I am ignoring about 5000 lines of auto-generated Wayland protocol handling data and code which are tracked in git for Rust, but auto-generated in C.) The largest chunk of the difference comes from the DMABUF copying and video encoding implementation using Vulkan and libavcodec, which together use about 4000 more lines than the C implementation (which had about 400 lines for libgbm, and 1200 for libswscale, libavcodec, and vaapi interaction); most of these lines would still have been needed, had the library change been done in C.

Test code was generally more efficient to write for the Rust implementation because higher-level constructs are available; for example, making it possible to compare two vectors of bytes with ==, or using closures to efficiently reuse the same generated parse_<msg> and write_<msg> functions in the main protocol replication test framework as were used in the main proxy logic. The C protocol replication tests skipped many checks because they would be awkward and repetitive to write or would need more code generation. Note: these are benefits that would also have been available had I used C++ or some other language for Waypipe instead; I also had the advantage with the rewrite of focusing on “end-to-end” tests running Waypipe’s proxy logic (as exposed through two Unix sockets) against various Wayland protocol transcripts. I expect this approach will require less maintenance with time than the more integrated tests used for the C implementation.
Lifetimes and exclusive references were annoying to work with in some early code, but Waypipe generally either does nothing complicated, or has multiple independent references to objects and needs to use Rc<_> or Arc<_>. They have prevented a few incorrect designs. One suboptimal thing remains: the main proxy loop (loop_inner() in src/mainloop.rs) uses nix::poll::poll(), which takes nix::poll::PollFd values that contain references to OwnedFd objects. These OwnedFds are owned by various structures for Wayland protocol file descriptor replication, called ShadowFds for historical reasons, and the ShadowFd objects are stored under Rc<RefCell<_>>. Making their OwnedFds available to poll() currently requires acquiring and storing a Ref<_> for each one in a separate vector, and also extracting the return events from each PollFd into a separate vector, because the PollFds need to be dropped immediately to drop the Ref<_>s so that the following code can access the ShadowFds. There are ways to avoid these extra vectors, and building them wasn’t expensive to begin with; but ultimately the problem is that I’ve spent too much time thinking about how to refactor this thing.
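The borrow pattern described above can be sketched with only the standard library. This is not Waypipe's code: ShadowFd here is a hypothetical two-field struct, FakePollFd and fake_poll are made-up stand-ins for nix::poll::PollFd and nix::poll::poll(), and a plain i32 stands in for an OwnedFd. It only illustrates why the two extra vectors appear.
```rust
use std::cell::{Ref, RefCell};
use std::rc::Rc;

/// Hypothetical stand-in for Waypipe's ShadowFd; `fd` stands in for an OwnedFd.
struct ShadowFd {
    fd: i32,
    ready_count: u32,
}

/// Hypothetical stand-in for nix::poll::PollFd: it borrows the descriptor.
struct FakePollFd<'a> {
    fd: &'a i32,
    revents: bool,
}

/// Stand-in for poll(): pretends that even-numbered descriptors are ready.
fn fake_poll(fds: &mut [FakePollFd<'_>]) {
    for p in fds.iter_mut() {
        p.revents = *p.fd % 2 == 0;
    }
}

fn poll_once(shadows: &[Rc<RefCell<ShadowFd>>]) {
    // Extra vector 1: a Ref<_> must stay alive for as long as each borrow is used.
    let refs: Vec<Ref<'_, ShadowFd>> = shadows.iter().map(|s| s.borrow()).collect();
    let mut pollfds: Vec<FakePollFd<'_>> = refs
        .iter()
        .map(|r| FakePollFd { fd: &r.fd, revents: false })
        .collect();
    fake_poll(&mut pollfds);
    // Extra vector 2: copy the results out so the borrows can be released.
    let events: Vec<bool> = pollfds.iter().map(|p| p.revents).collect();
    drop(pollfds);
    drop(refs);
    // Only now can the ShadowFds be borrowed mutably without a RefCell panic.
    for (shadow, ready) in shadows.iter().zip(events) {
        if ready {
            shadow.borrow_mut().ready_count += 1;
        }
    }
}
```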
When rewriting code, I sometimes noticed details that I’d missed in the original, like incomplete ssh argument parsing, or a rare edge case when clients construct a wp_presentation_feedback object immediately after binding wp_presentation. I probably would have missed these if I had translated the code instead of writing it from scratch and comparing it with the C implementation later. Other minor improvements (like not precisely replicating DMABUF modifiers) were discovered through the use of Vulkan instead of libgbm.
Much of the work in this rewrite was rather tedious, with little fundamentally new code. (I have used Vulkan before.) The only interesting bugs I have had to track down so far were memory safety+threading issues in libraries I was using, and an unfortunate typo when almost-but-not-quite copy-pasting code. The technically interesting parts (making more efficient buffer difference calculations) have been postponed until after remaining regressions have been discovered, the next release is done, and I start changing Waypipe’s internal protocol.
Rust’s error and string types are much better than C’s; Result, Option, and first-class tuples make detecting and unpacking errors require much less work than C; one no longer needs to check which magic values identify failure, whether errno is set or where else error messages are stored, and which arguments are returned by pointer in what circumstances. As the Wayland wire protocol is binary I did not need to use much C string handling in the original implementation.
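As a small illustration of the point about Result and tuples, here is a sketch of parsing a Wayland wire message header (a 32-bit object ID, followed by a 32-bit word holding the message length in its upper 16 bits and the opcode in its lower 16 bits, in host byte order). The names HeaderError, parse_header, and split_message are invented for this example and are not taken from Waypipe; the length checks are a simplification.
```rust
#[derive(Debug, PartialEq)]
enum HeaderError {
    /// Fewer bytes are available than the header or message needs.
    TooShort { available: usize },
    /// The length field is smaller than a header or not a multiple of 4.
    BadLength { length: usize },
}

/// Returns (object id, opcode, total message length in bytes).
fn parse_header(data: &[u8]) -> Result<(u32, u16, usize), HeaderError> {
    let header = data
        .get(..8)
        .ok_or(HeaderError::TooShort { available: data.len() })?;
    let object_id = u32::from_ne_bytes([header[0], header[1], header[2], header[3]]);
    let word = u32::from_ne_bytes([header[4], header[5], header[6], header[7]]);
    let length = (word >> 16) as usize;
    let opcode = (word & 0xffff) as u16;
    if length < 8 || length % 4 != 0 {
        return Err(HeaderError::BadLength { length });
    }
    Ok((object_id, opcode, length))
}

/// Returns (object id, opcode, message body); `?` forwards any header error.
fn split_message(data: &[u8]) -> Result<(u32, u16, &[u8]), HeaderError> {
    let (object_id, opcode, length) = parse_header(data)?;
    let body = data
        .get(8..length)
        .ok_or(HeaderError::TooShort { available: data.len() })?;
    Ok((object_id, opcode, body))
}
```
The caller gets either all three values or a typed error; there is no sentinel return value, errno, or out-pointer to check.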
Waypipe varies in how well checked its unsafe code is; I’ve tried to document core operations on file descriptors and memory maps in detail; on the other hand, most of the DMABUF and video code is unsafe and FFI-heavy, and may leak memory when failures occur. (Fortunately, most failures are fatal, so the leaks here are not critical.) I’ve been using direct library bindings via bindgen or unsafe crates like ash for external libraries because the current safe bindings generally are missing required features, require statically linking in libraries, or bring in too many other dependencies.
One of the original implementation’s design mistakes, perhaps, was trying to cleanly handle memory allocation failures (and report an error to the application) instead of just exiting when malloc() returns NULL; this made the code more complicated and added many failure paths to the code that are hard to test. While it may be possible to write a Wayland client that can make Waypipe’s calls to malloc() return NULL, normal clients will not do this. Because Waypipe uses one process per Wayland connection, it is safe for it to abort when malloc() fails. On the other hand, Rust has good enough memory and error handling to reliably and safely do a clean shutdown when malloc() fails, but standard library changes to enable this are still unstable or in progress.
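One piece of fallible allocation that is already stable is Vec::try_reserve_exact; many other allocating calls (Box::new, Vec::push, and so on) still abort on failure. A minimal sketch of what a recoverable allocation looks like, with alloc_buffer being a name made up for this example:
```rust
use std::collections::TryReserveError;

/// Allocate a zeroed buffer, reporting failure instead of aborting.
fn alloc_buffer(len: usize) -> Result<Vec<u8>, TryReserveError> {
    let mut buf = Vec::new();
    // Fails with an error value if the allocator refuses or `len` is too large.
    buf.try_reserve_exact(len)?;
    // Cannot reallocate: the capacity was reserved above.
    buf.resize(len, 0);
    Ok(buf)
}
```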
Error handling in longer stretches of unsafe code (ensuring everything is freed on failure) can be more awkward than C, because the standard goto cleanup; trick is not available. Wrapping things with a type that destroys them on Drop generally works instead. (Properly unwinding on panic for FFI wrappers generally is not needed, because C libraries generally do not panic and the FFI wrappers are usually straightforward leaf functions.)
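A minimal sketch of the Drop-based replacement for goto cleanup;. Handle is a hypothetical wrapper, and a shared counter stands in for a C library's create/destroy pair so that the cleanup is observable:
```rust
use std::cell::Cell;

/// Hypothetical wrapper around a C resource; the counter tracks live resources.
struct Handle<'a> {
    live: &'a Cell<u32>,
}

impl<'a> Handle<'a> {
    fn create(live: &'a Cell<u32>) -> Self {
        // A real wrapper would call the library's create function here.
        live.set(live.get() + 1);
        Handle { live }
    }
}

impl Drop for Handle<'_> {
    fn drop(&mut self) {
        // A real wrapper would call the matching destroy/free function here.
        self.live.set(self.live.get() - 1);
    }
}

/// Returns the number of live handles at the point of success.
fn setup(live: &Cell<u32>, fail_midway: bool) -> Result<u32, &'static str> {
    let _a = Handle::create(live);
    let _b = Handle::create(live);
    if fail_midway {
        // Early return: _b and then _a are dropped automatically.
        return Err("second step failed");
    }
    Ok(live.get())
}
```
Both the success and the failure path leave the counter at zero after setup returns, without any explicit cleanup label.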
enums are useful to make the possible states of a structure clear, but doing this often requires that I define more structs for each possible state, and name them all. Picking good names for them all is a major unsolved problem in theory, so in practice I just pick bad names.
Waypipe’s build system is now somewhat of a mess: meson runs cargo through an intermediate script to control the location of the output executable, and I still haven’t fully connected meson’s various build types to cargo’s. (I am continuing to use meson because it is used by Waypipe’s original C implementation, which I’ve moved into a subfolder of the repository, and because Waypipe has a man page that needs to be installed to the right place.) This continues to evolve.
Rust uses much more build space (about 250MB) than C (14 MB) when building with debuginfo; this is mostly caused by a few big dependencies and 4MB compiled build scripts.
bindgen is nice to have, but translates C’s char into i8 or u8 depending on the platform, instead of translating it to std::ffi::c_char. As a result, I used *const i8 a few times in my own code, until discovering the build failures on platforms where c_char = u8. After that, I switched to c_char and started checking the C headers whenever I wanted to know whether a function’s argument actually was * char, * int8_t, or * uint8_t.
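A small sketch of the portable spelling, using std::ffi::c_char rather than i8 or u8. The helper names c_strlen and name_len are made up for this example:
```rust
use std::ffi::{c_char, CStr};

/// Hypothetical FFI-style helper: length of a NUL-terminated C string.
///
/// # Safety
/// `ptr` must point to a valid NUL-terminated string.
unsafe fn c_strlen(ptr: *const c_char) -> usize {
    // SAFETY: guaranteed by the caller, see above.
    unsafe { CStr::from_ptr(ptr) }.to_bytes().len()
}

fn name_len() -> usize {
    let name: &CStr =
        CStr::from_bytes_with_nul(b"waypipe\0").expect("literal ends with NUL");
    // c_char is i8 on some platforms and u8 on others; this line compiles on both,
    // whereas `*const i8` would fail to build where c_char = u8.
    let ptr: *const c_char = name.as_ptr();
    // SAFETY: `ptr` comes from a CStr, so it is valid and NUL-terminated.
    unsafe { c_strlen(ptr) }
}
```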
cargo test is OK, but could be better. There is no convenient way to set per-test timeouts. (Some of Waypipe’s tests should never take more than a millisecond of CPU time; others take a fraction of a second if things go well.) Maybe I should switch to nextest, although I’d prefer configuring test properties in the code instead of in a separate config file. Even with nextest, though, there is the limitation that tests appear to be pass/fail and do not have a way to communicate that they are inconclusive. As Waypipe needs to maintain copies of DMABUFs, some of Waypipe’s tests are run for each render device available on the system. These tests would ideally be considered SKIPPED when no such device is available. I have also observed the video encoding implementation producing odd results (a constant color on a non-constant image); ideally tests observing this could report UNCLEAR as this is not clearly Waypipe’s fault. I’m certainly not the first person to want either of these behaviors.
Rust’s integer support is much better than C’s. In C, implicit conversions are common and can hide mistakes, and because they are so common it is hard to enable warnings for them. Rust also provides useful operations like ilog2, isqrt, leading_zeros, next_power_of_two, and saturating_add; doing these well in C requires intrinsics, carefully written bit manipulation, or writing a function yourself (which the compiler hopefully identifies and replaces with the ideal implementation).
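A few of these integer operations in use, together with an explicit checked conversion. The function names are invented for this sketch:
```rust
/// Number of bits needed to represent `n` (0 for n == 0).
fn bit_width(n: u32) -> u32 {
    32 - n.leading_zeros()
}

/// Floor of log2(n); `n` must be nonzero, otherwise ilog2 panics.
fn log2_floor(n: u32) -> u32 {
    n.ilog2()
}

/// Round a requested capacity up to a power of two.
fn round_up_capacity(n: usize) -> usize {
    n.next_power_of_two()
}

/// Add two byte values, clamping at 255 instead of wrapping.
fn clamp_add(a: u8, b: u8) -> u8 {
    a.saturating_add(b)
}

/// Narrowing conversions must be spelled out; out-of-range values are rejected.
fn narrow(v: i64) -> Option<u32> {
    u32::try_from(v).ok()
}
```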
Because it was easy to do with bindgen and libloading, the rewrite now dynamically loads libavcodec and libavutil at runtime, when necessary. This reduces the time to start the waypipe executable (as measured by timing waypipe --help) from 45 to 5 milliseconds.
I did not use any async/await code under the assumption that it would be too complicated and not worth the benefit. Many of Waypipe’s off-main thread tasks are compute heavy, and these tasks often wait for a specific region of a shared resource (mirror of a buffer) to become available or for the GPU to finish an operation.
There currently does not appear to be a stabilized and universally efficient way for Waypipe to safely interact with shared memory regions, other than by using architecture-specific assembly. Under the C++-like memory model that Rust uses, arbitrary shared memory found through mmap should be considered volatile, since any arbitrary process or DMA device could modify or react to the memory in “ways unknown to the implementation”. However, Waypipe in particular can assume it is connected to a well-behaving Wayland process, and that there are no side effects to memory access and that ordering of its writes does not matter, as long as they all happen before the application reads the contents of Waypipe’s next sendmsg(). Similarly, Waypipe only needs to see memory writes that happen before its last recvmsg() returns, and only needs to be “safe” when reading from the shared memory region: the compiler should never assume that two repeated or overlapping read operations will return the same result.

Using &[u8] would not provide this guarantee, so Waypipe currently treats memory buffers shared with other processes as essentially &[AtomicU8], using Relaxed memory access ordering. This is probably fine in practice on current architectures, as the relaxed atomic operations would be implemented either with plain loads and stores, or with something stronger. There is still the theoretical problem that, as far as I am aware, Atomic types are only guaranteed to work when the memory is updated “within the memory model”. (For example, one could imagine an architecture where the compiler’s preferred atomic operations will crash the program if they overlap with DMA operations, but it has volatile operations which are OK.) As an alternative, Waypipe might be able to use std::ptr::read_volatile and std::ptr::write_volatile on entire 64-byte cache lines and thereby give the compiler more freedom to optimize than if Waypipe were to do volatile operations on a single u8 or u64 at a time.
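
The &[AtomicU8] approach can be sketched as follows. This is an illustration, not Waypipe's actual code: as_atomic_slice, read_shared, and write_shared are names invented here, and the caveat above (atomics are only specified for accesses within the memory model) applies to it unchanged.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// View a raw shared mapping as a slice of atomics.
///
/// # Safety
/// `ptr` must be valid for reads and writes of `len` bytes for the whole
/// lifetime 'a, and must not be accessed through plain references meanwhile.
unsafe fn as_atomic_slice<'a>(ptr: *mut u8, len: usize) -> &'a [AtomicU8] {
    // AtomicU8 is documented to have the same size, alignment, and bit
    // validity as u8, so the pointer cast itself is fine.
    unsafe { std::slice::from_raw_parts(ptr as *const AtomicU8, len) }
}

/// Copy out of shared memory; each byte is read with a relaxed load, so the
/// compiler may not assume that two reads of the same byte agree.
fn read_shared(src: &[AtomicU8], dst: &mut [u8]) {
    assert_eq!(src.len(), dst.len());
    for (d, s) in dst.iter_mut().zip(src) {
        *d = s.load(Ordering::Relaxed);
    }
}

/// Copy into shared memory with relaxed stores.
fn write_shared(dst: &[AtomicU8], src: &[u8]) {
    assert_eq!(src.len(), dst.len());
    for (d, s) in dst.iter().zip(src) {
        d.store(*s, Ordering::Relaxed);
    }
}
```

In a real program the pointer would come from mmap; for experimenting, an ordinary heap buffer can serve as the backing memory.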

Things that I’d like to have

A cross-GPU-platform library for general data compression and decompression on GPU with Vulkan; ideally for lz4 or zstd, but some other CPU-friendly format would be OK.
For bindgen to accept a list of functions so that, if it does not generate bindings for all of them, it should return a failing exit code. bindgen currently can only filter which of the functions (or variables, constants, etc.) it makes bindings for.
A variation on the format!() macro that produces an iterator instead of a String; this would make it possible to (without restructuring the code very much) eliminate many intermediate allocations from dynamically chosen trees of format! operations, like the following:
```
format!("{} is {}",
  if a { format!("{:x}", b) } else { "C".to_string() },
  if z { format!("{:x}", y) } else { "Z".to_string() })
```
I would not be surprised if this already exists.
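
Until something like that exists, one workaround (which does require restructuring the code, so it is not the macro wished for above) is a small Display wrapper, so that only the outermost format! allocates. HexOr and describe are hypothetical names for this sketch:

```rust
use std::fmt;

/// Either a number to print in hex, or a fixed piece of text.
enum HexOr<'a> {
    Hex(u32),
    Text(&'a str),
}

impl fmt::Display for HexOr<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Writes straight into the caller's formatter; no temporary String.
            HexOr::Hex(v) => write!(f, "{:x}", v),
            HexOr::Text(s) => f.write_str(s),
        }
    }
}

fn describe(a: bool, b: u32, z: bool, y: u32) -> String {
    format!(
        "{} is {}",
        if a { HexOr::Hex(b) } else { HexOr::Text("C") },
        if z { HexOr::Hex(y) } else { HexOr::Text("Z") }
    )
}
```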

Possible improvements for Rust

Having learned more Rust recently, it is my irresponsibility to suggest things wiser programmers probably can explain are bad ideas.

I sometimes use key k1 to look up an &mut value x from a BTreeMap, read data from x to determine a key k2 distinct from k1, which I use to look up &mut value y, and then modify both x and y in some fashion. Doing this requires dropping x and then looking it up again in the map. Sometimes there is a third key whose value I’d like to modify, but the total number is always small. The extra lookups could be avoided with RefCell, but that has significant space overhead and is awkward to use when programming. I think this problem could be solved with a sort of split_at_mut()-analogue; a method on BTreeMap that looks something like
```
get_mut_and_remainder(&mut self, key: &K) -> Option<(&mut V, RemainingMap<K,V,1>)>
```
where RemainingMap<K,V,N> is a type referring to the BTreeMap which keeps a list of N references &K and allows mutable lookups (but not insertions or deletions) of keys with a
```
get_mut_and_remainder<N>(&mut self, key: &K) -> Option<(&mut V, RemainingMap<K,V,N + 1>)>
```
signature, failing when key matches any of the N references stored so far. This would be an adaptive version of the currently-unstable HashMap::get_many_mut. One can emulate something like this idea for slices using split_at_mut(), but I don’t see how to soundly and efficiently build it on top of BTreeMap’s current API. Maybe there is a crate that already does this.
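To make the problem concrete, here is a sketch of the repeated-lookup workaround on a BTreeMap, next to the slice emulation built on split_at_mut(). Node, drain_partner, and two_mut are hypothetical names for this example:
```rust
use std::collections::BTreeMap;

struct Node {
    partner: u32,
    value: u64,
}

/// Move the partner's value into the node at `k1`, using three lookups
/// because two `&mut` borrows into the map cannot be held at once.
fn drain_partner(map: &mut BTreeMap<u32, Node>, k1: u32) -> Option<u64> {
    let k2 = map.get(&k1)?.partner; // lookup 1: x, to find k2
    if k2 == k1 {
        return None;
    }
    let taken = std::mem::take(&mut map.get_mut(&k2)?.value); // lookup 2: y
    let x = map.get_mut(&k1)?; // lookup 3: x again
    x.value += taken;
    Some(x.value)
}

/// The slice analogue: two disjoint mutable references via split_at_mut().
fn two_mut<T>(s: &mut [T], i: usize, j: usize) -> Option<(&mut T, &mut T)> {
    if i == j || i >= s.len() || j >= s.len() {
        return None;
    }
    if i < j {
        let (lo, hi) = s.split_at_mut(j);
        Some((&mut lo[i], &mut hi[0]))
    } else {
        let (lo, hi) = s.split_at_mut(i);
        Some((&mut hi[0], &mut lo[j]))
    }
}
```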
As far as I understand it, Rust has a notion of “uninitialized memory”, where, quoting MaybeUninit’s documentation, it is “undefined behavior to have uninitialized data in a variable even if that variable has an integer type”. I don’t think this is necessary, and believe that Rust’s existing rules and mechanisms for making unconditional promises to the compiler are sufficient to enable all practical optimizations.

Currently, the memory provided to Rust by alloc may be uninitialized, and the memory region needs to be manipulated by pointer or through MaybeUninit instead of by &mut [u8] slice, because std::slice::from_raw_parts_mut requires that the data region it operates on be properly initialized for the slice type (in this case, u8). As a result, one often requires two variants of any FFI function that fills a region of memory: one straightforwardly usable one which takes an &mut [u8], and one which uses raw pointers (which is unsafe to use) or MaybeUninit<u8> (safer but complicated). In practice, these two variants would produce the same code, but if a crate provides only the &mut [u8] version, one cannot obtain the raw pointer or MaybeUninit version from it (without laundering the pointer through FFI). I ran into this issue when trying to use nix::sys::uio::readv on a fresh allocation, and when making wrapper functions for lz4 and zstd compression and decompression.
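
The duplication looks roughly like this. The functions fill_init, fill_uninit, and fresh_filled are invented for this sketch, with a simple byte fill standing in for an FFI call such as a read or a decompression routine:

```rust
use std::mem::MaybeUninit;

/// Variant 1: easy to call, but only on memory that is already initialized.
fn fill_init(buf: &mut [u8], value: u8) -> usize {
    for b in buf.iter_mut() {
        *b = value;
    }
    buf.len()
}

/// Variant 2: the same logic again, for memory that may be uninitialized.
fn fill_uninit(buf: &mut [MaybeUninit<u8>], value: u8) -> usize {
    for b in buf.iter_mut() {
        b.write(value);
    }
    buf.len()
}

/// Filling a fresh allocation without zeroing it first needs variant 2.
fn fresh_filled(len: usize, value: u8) -> Vec<u8> {
    let mut v: Vec<u8> = Vec::with_capacity(len);
    let spare = &mut v.spare_capacity_mut()[..len];
    let written = fill_uninit(spare, value);
    // SAFETY: the first `written` bytes of the spare capacity were just written.
    unsafe { v.set_len(written) };
    v
}
```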

Making alloc provide an initialized [u8] (albeit with arbitrary contents) would avoid the above code duplication and reduce the number of uses of unsafe required when making data structures or using external libraries. But I do not think it would inhibit necessary compiler optimizations, because Rust has good mechanisms for introducing undefined behavior (read: unconditional promises to the compiler). If one wants to optimize bounds checks around a partially initialized region of memory, then std::hint::assert_unchecked can be used to instruct the compiler which addresses are actually being read from, or one can access memory through an intermediate slice (with associated undefined behavior if an unchecked access is out of bounds for that slice). Similarly, when allocating memory for a non-plain type T, one does not need an “uninitialized memory” concept to make accessing T undefined behavior; the compiler should already assume that blindly transmuting raw memory ([u8; _]) into T is invalid, because it is not guaranteed that the memory has valid contents for T. Finally, the use of uninitialized variables (e.g. let x: u8; x += 1;) is already a language error in Rust.

Also: I read a document explaining the addition of undef to LLVM; it gives mostly C-specific or internal justifications, like discarding implicit function return values, optimizing global variable initialization, or improving compilation when a variable in an outer scope is not used when a given condition holds. None of these should affect the Rust abstract model.

Also: I read a relevant post from 2019 mentioning an old set data structure which can work when its memory region is arbitrarily initialized; the sparse set reads from “uninitialized” but exclusively owned memory. I should note two other examples which do not need initialization: first, there are catalytic algorithms which use an (arbitrarily initialized) region of memory in their calculation and later return it with the content reset to its initial values; these only require exclusive access to the memory region. Second, my favorite binary tree inversion algorithm, which uses only O(log(tree depth)) words of space (in exchange for awesome and superlative runtime): it uses the algorithm of Savitch’s theorem to identify which addresses in memory correspond to tree nodes, and then swaps the children of each tree node. This algorithm will read from memory that the algorithm does not own (and which may constantly be changing); but only requires exclusive (&mut) access to the set of tree nodes; if one just wanted to count tree nodes, read-only non-exclusive (&) access would suffice.
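
For reference, a sketch of the sparse set mentioned above. Since safe Rust cannot hand out truly uninitialized integers, this version fills both arrays with a caller-chosen garbage value to mimic arbitrary prior contents; the type and method names are my own, not from the post.

```rust
/// Sparse set over 0..universe whose membership test does not depend on
/// the initial contents of its two arrays.
struct SparseSet {
    dense: Vec<u32>,
    sparse: Vec<u32>,
    len: usize,
}

impl SparseSet {
    /// `garbage` stands in for whatever the memory happened to contain.
    fn with_garbage(universe: usize, garbage: u32) -> Self {
        SparseSet {
            dense: vec![garbage; universe],
            sparse: vec![garbage; universe],
            len: 0,
        }
    }

    /// `v` must be below the universe size.
    fn contains(&self, v: u32) -> bool {
        let i = self.sparse[v as usize] as usize;
        // The index is only trusted if it points into the live part of
        // `dense` and that slot points back at `v`.
        i < self.len && self.dense[i] == v
    }

    fn insert(&mut self, v: u32) {
        if !self.contains(v) {
            self.dense[self.len] = v;
            self.sparse[v as usize] = self.len as u32;
            self.len += 1;
        }
    }

    /// Constant-time clear: stale entries simply stop being trusted.
    fn clear(&mut self) {
        self.len = 0;
    }
}
```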

Conclusion

Was the rewrite worth it? I suspect yes: improving the code does seem to be somewhat easier to do in Rust than with the original, where I could never be certain that I was not missing some edge case, and moving DMABUF handling to use Vulkan has significantly improved performance. I will know for certain in a few years when I see what types of bugs I run into. Rewriting the code did take time; I did not precisely measure it but would estimate a month of work so far (spread out over a longer period, since Waypipe was not my sole focus); this is similar to the time needed to develop the program to begin with. Could I have achieved the same effect with a month of work in C? Probably, but I would not have as much confidence that the project quality would remain stable in the future, when I will probably make many changes and spend less time testing them. (For example: I held off on parallelizing buffer diff message application with the C version, because I expected it to be a difficult task to do right.)

Overall, I think Waypipe was appropriate for a Rust rewrite: Waypipe is network facing code, needs to be efficient, does some parsing, and uses multiple threads; and was originally written in C. Interacting with existing libraries’ C APIs was, as expected, more tedious to do than in C, but I think the improvements to Waypipe’s core logic are worth it.

In general, I would pick Rust for new projects that do a lot of parsing or communication with other (untrusted or badly written) processes, are CPU limited and need to be fast or power-efficient, require fast startup, and do not deeply use large and irreplaceable libraries from some other language. I would want to switch from C or C++ to Rust if the project is something that I use and make changes to often enough for the cost of making the change to be worth it; but this is rare. Switching from existing memory safe languages is probably only worth it when performance is at stake, and it is not practical to convert just the hot code.

I would not currently use Rust for glue scripts, basic file conversion, data analysis, game scripting, or exploratory programming; languages with a garbage collector and a more compact syntax (like Python, Scheme, Haskell, Clojure) tend to be better there.

Often the choice of language is controlled by which libraries are available: I’ve used C++ for many things because it was the easiest interface for a major library (Qt, OpenCASCADE, CGAL, or SDL/OpenGL). C is OK for small programs where most of the content is interaction with C APIs, but the language itself is the limiting factor beyond a certain scale, when proper number handling, string operations, or nontrivial data structures are required.

Finally, a reminder: Waypipe has been available for five years, and using it exposes one’s local Wayland compositor to an application running on a different computer. Even though Waypipe makes some sanity checks on the messages it receives, it cannot guard against bugs in a Wayland compositor. As before, do not assume that Waypipe itself possibly being more secure makes it safe to waypipe ssh into a compromised computer and run GUI programs; Wayland compositors are in general not well tested against adversarial clients.
