Searching for Unknown Unknowns - eno PDF Reader Blog


No matter how good you are at writing code, with any sufficiently complex project you will run into bugs. As a web developer, I'd become pretty adept at coding defensively to limit the impact of bugs when they do occur. eno was a whole new beast though: a GPU-rendered desktop app written from scratch in Zig. I never worried that our webapps would crash the entire browser if the user did something we didn't expect, but in compiled desktop applications, the tiniest of errors can cause a segfault. Needless to say, this had me waking up at night in a panic.

In pursuit of a good night's rest I needed to have some baseline confidence that eno wouldn't crash when a user ended up in an unexpected state. It's easy enough to account for and test the golden path of a workflow, but as the feature set matured it quickly became impossible to check every possible user story. Think about all the states a single UI element can be in: hovered, mid-drag, scrolled to some position, with or without a document open, with or without an active search. Multiply that across every element in the application and you end up with a universe of possible states that no amount of manual testing could cover.

Within this universe there are bugs you know about (known knowns), bugs you suspect might exist in tricky areas (known unknowns), and bugs you've never considered. These were the unknown unknowns, and they were the most dangerous because I couldn't write a test for something I hadn't thought of. The question was: could we systematically turn these unknown unknowns into known ones?

let it soak

To find these unknown unknowns, we turned to a time-tested strategy: brute force. The thinking was, if we could simulate enough interactions over a long enough interval, we would eventually stumble into weird states that we never would have thought to test manually.

Luckily there were only so many ways in which a user could interact with eno: clicking, dragging, scrolling, and typing. We could easily issue synthetic versions of these inputs at the point they are ingested. So we built what we call a soak test, a long-running test which processes production-like workloads over long periods of time to see how the system behaves. Our soak test runs a loop that generates random mouse moves, clicks, drags, key presses, scroll events, and text input, firing multiple events per frame for up to 100,000 frames by default.
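The core loop is simple enough to sketch. Here's a minimal, hypothetical version in Python (the real implementation is in Zig; the event shapes, frame counts, and dispatch mechanism are assumptions for illustration):

```python
import random

# Hypothetical event kinds standing in for eno's real input layer.
EVENT_KINDS = ["mouse_move", "click", "drag", "scroll", "key_press", "text_input"]

def random_event(rng, width, height):
    """Generate one synthetic input event at a random screen position."""
    kind = rng.choice(EVENT_KINDS)
    return {"kind": kind, "x": rng.randrange(width), "y": rng.randrange(height)}

def soak(seed, frames=100_000, events_per_frame=3, width=1920, height=1080):
    """Run the soak loop: several random events per frame, fully seeded."""
    rng = random.Random(seed)  # seeded so any crash can be reproduced
    trail = []
    for _ in range(frames):
        for _ in range(events_per_frame):
            ev = random_event(rng, width, height)
            trail.append(ev)  # stand-in for dispatching into the app
    return trail
```

Because the loop is driven by a single seeded RNG, the same seed always produces the same trail of events, which is what makes crash replay possible later on.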

This immediately revealed a couple of low-hanging bugs, which were trivially patched. After those quick wins though, we got stuck. The loop would never really do anything interesting. Some UI interactions required you to click and drag, others required a click on an input field, inserting text, then clicking on a specific button to continue. The chance of each of these events happening in succession with a purely random algorithm was slim to none. A truly random search of the space had hit a local minimum.

a way out of the valley

While worrying about unknown unknowns late one night, I came across this really cool talk from the CEO of Antithesis, Testing a Single-Node, Single Threaded, Distributed System Written in 1985 by Will Wilson. The central idea was that with a few simple heuristics you can get an automated system to string together enough random actions to beat Super Mario Bros! This led us to create the following heuristics for our soak tester:

Action repetition. Each action type has a probability of repeating itself on the next event. Where events frequently occur in sequence (e.g. scroll events) they have a high chance to repeat. Where events usually don't repeat (e.g. right clicks) they have close to zero chance of repetition.

Word list. When the soak test detects that a text input field is focused, instead of hammering random characters it queues up a phrase of 1-3 words from a predefined vocabulary ("contract", "amendment", "quarterly report", etc.) followed by an enter key. The phrase drains character-by-character across frames so it looks like real typing. This means we can actually submit searches, and trigger workflows that depend on textual input.
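The queue-and-drain behavior might look something like this (a minimal sketch; the vocabulary and class shape are illustrative, not eno's actual code):

```python
import random

# A tiny stand-in vocabulary; the real word list is domain-specific.
VOCAB = ["contract", "amendment", "quarterly report", "invoice", "clause"]

class TypingQueue:
    """Queues a 1-3 word phrase and drains it one character per frame,
    ending with an enter key so the input actually submits."""
    def __init__(self, rng):
        phrase = " ".join(rng.choice(VOCAB) for _ in range(rng.randint(1, 3)))
        self.pending = list(phrase) + ["\n"]  # "\n" stands in for enter

    def next_key(self):
        """Pop the next character to type this frame, or None when done."""
        return self.pending.pop(0) if self.pending else None
```

Draining one character per frame means the app sees keystrokes arrive at a plausible cadence, exercising any per-keystroke logic (debounced search, autocomplete) the same way real typing would.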

Gravity. Most of the screen isn't clickable, so we gave clickable elements a gravity-like effect on the cursor. On each cursor move we scan the UI tree from the currently hovered element outward looking for clickable boxes. The nearest one exerts an inverse-linear gravitational pull on the cursor's velocity, pulling it towards its clickable region.
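The pull itself is just a velocity nudge scaled by 1/distance. A sketch of the math (the `strength` constant and tuple shapes are assumptions for illustration):

```python
def apply_gravity(cursor, velocity, target, strength=50.0):
    """Nudge the cursor's velocity toward the nearest clickable target.

    Inverse-linear: the pull magnitude is strength/distance, so nearby
    targets tug harder. `strength` is an assumed tuning constant.
    """
    dx, dy = target[0] - cursor[0], target[1] - cursor[1]
    dist = max((dx * dx + dy * dy) ** 0.5, 1.0)  # avoid divide-by-zero
    pull = strength / dist
    # Add the pull along the unit vector toward the target.
    return (velocity[0] + pull * dx / dist, velocity[1] + pull * dy / dist)
```

Because the pull only biases velocity rather than teleporting the cursor, the motion stays noisy enough to still explore the rest of the screen.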

These three heuristics alone have revolutionized our soak test and led to the discovery of many more unknown unknowns.

soak test in action, you can see how erratically it behaves even after adding heuristics

rewinding the tape

Every soak test is seeded, and every event is written to a trail log as it happens, continuously flushed to disk so the file survives even if eno crashes. If eno does crash, our custom panic handler dumps the last 512 events from a ring buffer to stderr along with the seed. This allows us to replay the exact trail file or just pass the seed back in, giving us a hint as to what was happening in eno right before the fatal crash.
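The two pieces here, an append-only trail plus a fixed-size ring of recent events, compose naturally. A minimal sketch (class shape and field names are assumptions; eno's real panic handler writes to stderr from Zig):

```python
from collections import deque

class TrailLog:
    """Append-only event trail plus a fixed-size ring buffer.

    On a crash, the panic handler would dump the ring (the last 512
    events) along with the seed; the full trail is flushed to disk
    continuously so it survives the process dying.
    """
    def __init__(self, seed, capacity=512):
        self.seed = seed
        self.ring = deque(maxlen=capacity)  # old events fall off the front
        self.count = 0

    def record(self, event):
        self.ring.append(event)
        self.count += 1  # stand-in for the flush-to-disk write

    def crash_dump(self):
        return {"seed": self.seed, "last_events": list(self.ring)}
```

The seed alone is enough to regenerate the whole run, while the ring gives an immediately readable window onto the moments before the crash without replaying anything.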

Beyond replaying trails, we added a record mode in our debug builds that captures user input into scenario files. We launch eno with a scenario name, interact with it normally, and close the window. The recorded trail file captures every mouse move, click, keypress, and scroll event with frame-accurate timing. These scenario files can then be replayed deterministically against new builds, giving us confidence that changes haven't broken our main user flows.
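Deterministic replay reduces to feeding recorded events back in frame order. A sketch, assuming a trail of (frame, event) pairs and a generic dispatch callback (both hypothetical shapes, not eno's file format):

```python
def replay(trail, dispatch):
    """Replay a recorded trail deterministically, frame by frame.

    `trail` is a list of (frame, event) pairs with frame-accurate
    timing; `dispatch` is whatever feeds an event into the app's
    input layer. Sorting by frame keeps replay order stable.
    """
    for frame, event in sorted(trail, key=lambda t: t[0]):
        dispatch(frame, event)
```

The key property is that nothing in replay depends on wall-clock time: events are keyed to frames, so the same scenario file produces the same interaction sequence on any build at any speed.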

a dash of intelligence

Even with heuristics the soak test is still fundamentally random. There are workflows in our application that require a specific sequence of actions, things like opening a folder, running a search, pinning the results, then scrolling through annotations across multiple documents. The chance of the soak test stumbling into that exact sequence is astronomically small.

To bridge this gap we built an MCP (Model Context Protocol) server that lets an LLM (Large Language Model) see and interact with the application. The LLM can take screenshots, read UI trees, and issue inputs, allowing it to navigate menus, click on specific elements, type queries, and generally use the app the same way a user might. We used this to create soak scenarios. Instead of hoping randomness covers our important workflows, we can have the LLM systematically exercise them and record the interactions.

Claude using our MCP to create a soak scenario for repositioning panes in eno (3x speed)

to infinity and beyond

Running our soak scenarios confirms that eno doesn't crash on those paths (i.e. no liveness bugs), but it does not guarantee correctness. The application could render garbage and the soak test would happily report a pass. Our next step is to integrate strategic screenshotting and diffing of screenshots so we can detect visual regressions, not just crashes.
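The simplest version of that check is a per-pixel diff ratio between a baseline screenshot and a fresh one. A sketch, treating screenshots as flat lists of pixel values (a stand-in for real framebuffer data):

```python
def diff_ratio(baseline, current):
    """Fraction of pixels that differ between two equal-size framebuffers.

    A ratio above some small threshold would flag a visual regression
    for human (or LLM) review rather than failing outright, since
    anti-aliasing and animations cause benign pixel noise.
    """
    if len(baseline) != len(current):
        raise ValueError("screenshots must match in size")
    changed = sum(1 for a, b in zip(baseline, current) if a != b)
    return changed / len(baseline)
```

In practice a raw pixel diff is noisy, which is why the plan is strategic screenshots at stable points in a scenario rather than diffing every frame.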

Beyond that, we want to hook up local LLMs to run the application exhaustively, all the time. Not just recording scenarios for later replay, but having an LLM continuously explore the app, trying new things, and flagging anything that looks wrong. The goal is to have a system that is always searching for unknown unknowns, even when we're asleep.