“In an insane world, it was the sanest choice.”
— Sarah Connor, Terminator 2: Judgment Day
We are at the beginning of the second week of the implementation of WebNN. It is a Tuesday, and as I am writing these lines, a coding agent is trying to figure out the reason for a failing test.
Last week saw some frantic initial coding activity — the software equivalent of a sketch — which I will describe in this post. I will also describe what I’m doing this week, which is the first step away from the sketch and towards a more refined implementation — the first layer of paint.
The feature branch I am working on is webnn, and you can see the diff here (against master of my personal fork which is sort-of up to date with upstream Servo).
But first, some introduction on implementing the Web.
For the readers joining only now: the gist of the story is that I am implementing WebNN on a branch of my personal fork of Servo. Because the project has banned AI, this cannot merge into main. I am doing this as my answer to the autonomous coding slop, like fastrender, that has been released in the last months. You can read more on the backstory in the first part.
On the implementation of a Web Standard
When I implement a standard, my work usually goes through various stages. The initial stage — the sketching I mentioned above — involves identifying a complete chunk of the standard — complete enough to pass a significant WPT test — and then implementing the “front-end” part of that chunk and some limited “back-end”. I will explain both terms below.
In most standards, the front-end consists of a public interface that JavaScript (JS) can call into. In Servo, this front-end is implemented as Rust code inside the script component, where the event loop runs. I once tried to write down a kind of workflow for how to write this code, and you can read it in a chapter of the Servo book.
Now, the front-end is usually supported by some kind of conceptual “back-end”, which is where you can find the parts that interface with the surrounding operating system and implement the capabilities offered by the front-end — the Web is an application platform.
This back-end can be defined in the standards in different ways:
- By default, using the concept of running steps in parallel.
- In some standard-specific way: for example, WebGPU uses a concept of timelines.
For example, to implement the Fetch standard, you need a front-end which gives the JS access to a fetch function, and then you need a back-end that actually does the networking. In the spec, the JS-available method is found at #dom-global-fetch, and you can see the parallel work starting at (what currently is, this always changes) Step 11 of #concept-main-fetch, which reads: “If recursive is false, then run the remaining steps in parallel.”
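This front-end/back-end split can be sketched in Rust with a channel between the two halves. Everything below is hypothetical for illustration (names like FetchMsg and spawn_fetch_backend are made up, and the networking is stubbed out); it only shows the shape: the front-end sends a message, and the back-end runs the “in parallel” steps on its own thread.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

// Hypothetical message the script-side "front-end" sends to the back-end.
enum FetchMsg {
    // A request URL plus a channel for delivering the response.
    Fetch(String, Sender<String>),
}

// The back-end: a thread that performs the "in parallel" steps.
fn spawn_fetch_backend() -> Sender<FetchMsg> {
    let (tx, rx) = channel::<FetchMsg>();
    thread::spawn(move || {
        for msg in rx {
            match msg {
                FetchMsg::Fetch(url, reply) => {
                    // Real code would do networking here; stubbed for the sketch.
                    let _ = reply.send(format!("response for {url}"));
                }
            }
        }
    });
    tx
}

// The front-end: roughly what a JS-visible `fetch` would call into.
fn fetch(backend: &Sender<FetchMsg>, url: &str) -> String {
    let (reply_tx, reply_rx) = channel();
    backend
        .send(FetchMsg::Fetch(url.to_string(), reply_tx))
        .expect("backend alive");
    // A real implementation would resolve a Promise asynchronously
    // instead of blocking; blocking here keeps the sketch short.
    reply_rx.recv().expect("reply")
}
```

In a real engine the front-end would of course not block on the reply; the point is only the message-passing boundary between the event loop and the parallel work.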
So what does all this mean for implementing WebNN?
On the Implementation of WebNN
The WebNN standard introduces a spec-specific concept: the context timeline. So while the createContext method of the navigator (hence available to JS) launches the context creation in parallel, all the other back-end work is run by enqueuing steps on the timeline of an existing context. Example: step 8 of #dom-mlcontext-dispatch reads: “Enqueue the following steps to graph.[[context]].[[timeline]]”.
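As a rough mental model (a toy sketch with made-up names, not the actual Servo code), a context timeline can be pictured as a queue drained by a single dedicated thread, so that steps enqueued on it run strictly one at a time, in order:

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

// A "timeline task": a boxed step to run on a context's timeline.
type TimelineTask = Box<dyn FnOnce() + Send>;

// Each context owns the sending half of its timeline queue; one
// dedicated thread drains the queue, so enqueued steps are serialized.
struct MlContext {
    timeline: Sender<TimelineTask>,
}

impl MlContext {
    fn new() -> Self {
        let (tx, rx) = channel::<TimelineTask>();
        thread::spawn(move || {
            // Runs each enqueued step to completion before the next one.
            for task in rx {
                task();
            }
        });
        MlContext { timeline: tx }
    }

    // Spec-flavored: "Enqueue the following steps to context.[[timeline]]".
    fn enqueue(&self, task: TimelineTask) {
        self.timeline.send(task).expect("timeline alive");
    }
}
```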
Other than that, the front-end part is fairly, well, standard.
So for my initial sketch, I had the following goal in mind: implement enough of the front-end and backend so that I could run a significant conformance test, such as add.https.any.js. For this, I needed the following:
- Context creation.
- An operand like add.
- Creating, writing, and reading of tensors.
- Graph building.
- Graph dispatch.
- Backend compute of the dispatched graph (CoreML only). For this I would be using rustnn.
And yes, that is quite a lot. Next we will look at some initial considerations on how to make the Servo codebase agent-ready.
Enter the Agent
I am a beneficiary of a GitHub program granting free Copilot Pro licenses to some open-source maintainers (the irony being that the open-source project I help maintain, Servo, has banned AI).
The premium requests are allocated on a monthly basis, and when I started this, I had already burned all my premium requests on a side project (an agent that writes web apps to perform tasks without risk of prompt injection).
So I thought: why not try to see how far I get with one of the unlimited models, like Raptor mini. I did some initial testing by quizzing it on how it would do certain things in the codebase, and was surprised to find out it appeared capable. So I decided to use that for at least the initial part of the project.
I like the idea of using a cheap model for making what would be a very large and meaningful change to an even larger existing codebase. It would be the antithesis of autonomous AI slop: proof that what matters is not how expensive your LLM compute is, but how good your codebase and guidance are.
Besides this, the only other concept I started with was the idea of building up the guidance to the agent over time, by having the agent itself write any lesson learned to readme files, which would be organized in a kind of nested architecture. And of course, I needed a way to have the agent browse the Web standard to implement.
Let’s take a look at that first week of frantic coding next.
The First Week of Sketching
I started by checking out the spec into a specs folder which itself was ignored by git. The idea is that I wanted the spec as a local artifact in the workspace itself, but this should not become part of the git history.
I added a top-level AGENTS.md file, which contained the information relevant for the codebase as a whole, such as how to look for README.md files for the task. I then added readme files for both the script component and the webnn directory within it, each containing info relevant for that specific level only.
I also settled on using rustnn to access the operating system ML capabilities, and found pywebnn to provide excellent examples of how to integrate it into a runtime.
I then directed the agent through a stepwise implementation of the front-end only, leaving all timeline queuing as TODOs and saving that part for later in the week.
This quickly turned into the usual — well-documented online — dopamine-fueled agentic coding experience. I attempted to review each change made by the agent to guide it along the way. It was quite hard to get the agent to follow the guidance that was building up in the readme files; I had to repeat myself in follow-up prompts, and once, in despair, I even lost my nerve and told it to “READ THE <expletive> GUIDANCE!”.
But despite the occasional fury, it was a good and productive experience. By late Friday, I had the following:
- 24 commits, adding up to almost 7k lines of code.
- a limited subset of add.https.any.js that would pass (I had to manually remove some tests because they would cause hangs and crash the suite), which would execute by way of CoreML.
You can view the diff (against the master branch of my personal fork, which is not that far beyond upstream main) at that point here.
I reviewed the sketch on Saturday while sitting in a coffee shop. (Normally, I avoid weekend work, but agents put you in a certain ‘rage against the machine’ mode.) Despite my constant reviews, slop had slipped through.
In particular, the readme files had grown into large documents full of duplicated and inconsistent content (which probably made it harder for the agent to remain consistent). The code was often not documented the way I wanted, and you could spot all kinds of weird things (for example: this seemed like a clear bug). I made mental notes to address those later, and also resigned myself to keep going and just do a big clean-up later down the road (for example: I’ve stopped trying to tell the agent not to document changes in the code but only the latest state, or not to use fully qualified imports: those things will be dealt with later).
A positive surprise during this week was the discovery of search-bikeshed, which gives you a “simple agent-friendly search CLI for w3c bikeshed files.” This felt like a major upgrade from just having a checked-out version of the spec in a folder ignored by git. Although it proved hard to get the agent to use the tool, and even harder to get it to use it properly to document the code, I think this still points to the way forward: not sci-fi swarms of agents, but simple LLM-aware coding tools that make a difference.
Also, a sketch is just a sketch is just a sketch; so, came Monday, it was time to start putting some paint on the thing.
The Underpaint
What made that first week’s output just a sketch? Among other things, the backend consisted of only a single “manager” thread, which handled all incoming messages and performed both the compilation and dispatch of a graph.
Blocking the main thread of a component is undesirable for various reasons, such as performance or responsiveness. One very simple example of a problem that might arise: since the manager can manage any number of ML contexts across different web apps, if the compute blocks the manager, that means it blocks any operation, even relatively cheap ones like creating a tensor, for all contexts across the user agent.
So it’s a good idea for a component like the webnn backend to have a main thread that doesn’t do any work for a long time, and to instead offload such work to a (pool of) thread(s).
And this change brings us to the first major conceptual bottleneck experienced in the agentic coding flow.
The First Bottleneck: The False Promise of Threading.
The agent tried to “help” by adding a thread but immediately blocked the main thread anyway.
The gist of it is that when I asked the agent to add a thread to do the compute on, it just made the main manager thread block on receiving the result from the compute: you might as well just do everything on the main thread.
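The anti-pattern and its fix can be contrasted in a toy Rust sketch (made-up names, and the “compute” is just a sum). In the first shape, the caller spawns a worker and then immediately waits for it, so nothing is gained; in the second, the worker is handed a channel back to whoever needs the result, and the caller returns at once.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

// What the agent produced (in spirit): spawn a worker, then block on it.
// The manager thread gains nothing; it stalls for the whole compute.
fn compute_blocking(input: Vec<f32>) -> f32 {
    let (tx, rx) = channel();
    thread::spawn(move || {
        let _ = tx.send(input.iter().sum::<f32>());
    });
    rx.recv().expect("worker reply") // the caller blocks right here
}

// Non-blocking shape: hand the worker a channel to the interested party
// and return immediately; the manager keeps servicing other messages.
fn compute_async(input: Vec<f32>, reply: Sender<f32>) {
    thread::spawn(move || {
        let _ = reply.send(input.iter().sum::<f32>());
    });
}
```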
I wanted to test the model a little, so I told it to give me a plan for making this non-blocking, and it gave me what was actually a pretty good plan. What was missing from it, though, was that up to that point we had treated the channel from script to the backend as the timeline concept in the spec: the channel would serialize all requests. But if we started doing the compute on a background thread, while still doing other operations like reading tensors on the main thread, then we would break the serialization property of the timeline.
We had to introduce for the first time a proper timeline concept, which the agent did under my supervision. It summarized the interaction in this gist (which you have to take with a grain of salt), and here is the commit.
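The resulting shape can be sketched like this (a toy with hypothetical names, not the real code): the manager only routes messages, and each context gets its own timeline queue and thread, so the serialization property holds per context while a slow compute on one context cannot stall the manager or another context.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::thread;

type Task = Box<dyn FnOnce() + Send>;

// Hypothetical manager message: route a step to a context's timeline.
enum Msg {
    Enqueue(u32, Task), // (context id, step to run on that timeline)
}

fn spawn_manager() -> Sender<Msg> {
    let (tx, rx) = channel::<Msg>();
    thread::spawn(move || {
        // One timeline (queue + draining thread) per context.
        let mut timelines: HashMap<u32, Sender<Task>> = HashMap::new();
        for msg in rx {
            match msg {
                Msg::Enqueue(ctx, task) => {
                    let timeline = timelines.entry(ctx).or_insert_with(|| {
                        let (t_tx, t_rx) = channel::<Task>();
                        thread::spawn(move || {
                            for task in t_rx {
                                task(); // serialized per context
                            }
                        });
                        t_tx
                    });
                    // The manager never runs the task itself, so a slow
                    // compute on one context cannot stall the others.
                    let _ = timeline.send(task);
                }
            }
        }
    });
    tx
}
```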
The Second Bottleneck: The Data Ownership Crisis.
Excessive cloning was a symptom of a structure that didn’t “own” its data properly.
There was another conceptual bottleneck, one somewhat harder to explain in detail, the next day. I did some sampling and saw that there was a lot of cloning of the graph info on the backend. When I looked at the code in more detail, I saw to my horror that some pretty nonsensical structures had crept into it unnoticed: despite my ongoing review of all changes, I hadn’t paid close enough attention.
This time the fix was not just removing some clones here and there, but re-organizing the compilation and compute workflows so that each owned what it needed, which made almost all cloning unnecessary: the best times are when performance flows naturally from a code structure that makes sense.
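To illustrate the kind of change (with a toy “graph” and made-up functions, not the actual code): compilation moves the data into the structure that needs it, and dispatch merely borrows it, so the per-dispatch clone disappears.

```rust
// A toy "graph" whose data is expensive to clone.
struct CompiledGraph {
    weights: Vec<f32>,
}

// Before: the caller keeps the data elsewhere and clones it per dispatch.
fn dispatch_cloning(graph: &CompiledGraph, input: &[f32]) -> Vec<f32> {
    let copy = graph.weights.clone(); // needless full copy on every call
    input.iter().zip(&copy).map(|(a, b)| a + b).collect()
}

// After: compilation *moves* the weights into the compiled graph...
fn compile(weights: Vec<f32>) -> CompiledGraph {
    CompiledGraph { weights } // ownership transferred, no copy made
}

// ...and dispatch only borrows them; nothing is cloned on the hot path.
fn dispatch(graph: &CompiledGraph, input: &[f32]) -> Vec<f32> {
    input.iter().zip(&graph.weights).map(|(a, b)| a + b).collect()
}
```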
The agent wrote a summary of that episode, available in this gist (again, to be taken with a grain of salt), and here is the first commit, as well as a very important second commit to fix a problem (one that I spotted in the first commit, again a good example of agentic <expletive>).
In my opinion, both episodes, as well as countless other smaller ones encountered in that first long week, show how programming is not just about finding a local optimum. Even if you have a spec and a standardized test suite to test against, each implementation requires identifying, and resolving, problems that are more architectural or conceptual in nature than about just optimizing a loop. Code structures that make sense are the foundation for a system working as intended, both in terms of functionality and performance. And you can only discover the right structures as you go; it’s not a fixed-blueprint type of thing.
And you know what? The AI really sucks at this conceptual kind of work. So instead of worrying about exponentials supposedly getting out of hand, I’d focus on that part of the work.
What next?
The difference between the code as of this mid-week — with the underpaint on — and the initial sketch can be seen over here.
What remains to be done? A lot:
- The whole losing of a context.
- A long list of graph builder operands, and an almost equally long list of input types to support (currently only float32).
- I have no idea how to integrate a context with a WebGPU device.
- Countless other unknown things that will come up as I keep going through the test suite.
- Compute currently does a lot of data shuffling on the CPU; I should look into the device tensor concept of pywebnn and see if I can do something similar.
- At some point one has to start thinking about supporting more than just CoreML.
- Just looking at the diff above now, I see lots of comments that are huge and by now out of date, as well as some nonsense code: classic AI slop that still slips through the net of my reviews, but I think I’ll just leave it and do a big clean-up, or several of those, later (I fixed one glaring and important issue just now).
So, still a long way from varnishing day. I’ll be posting follow-up entries to this diary, and in the meantime you can follow along on the feature branch.
As of today — let’s say one and a half weeks of work into it — I have something that starts to make sense, and, if it weren’t for the AI, that milestone would have taken months of work to achieve.