Guidelines Don’t Scale. Patterns Do.


What I learned from failing to ship a Chrome extension. Twice.

Fayner Brack

[Cover image: a spaceship parked in the middle of nowhere. On the left, mess, destruction, and failure. On the right, a carefully built satellite tower, in one piece, crossing behind the spaceship to meet its fate.]

Large language models are trained on the open web. The open web is full of blog posts, tutorials, and Stack Overflow answers. Very little of that content reflects production-grade engineering. The ratio of decent code to everything else online is small. So when you ask a model to write software, it draws from a distribution where the median is not great.

The natural response is to write more prompts. Add guidelines. Specify rules. Tell the model to prefer Partial<Type> over as Type in TypeScript. Tell it to avoid God classes. Tell it to write tests.
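That first rule is concrete enough to show. A minimal sketch, using a made-up `Tokens` shape rather than anything from the article's repo:

```typescript
// Illustrative only: why a prompt rule might prefer Partial<Type>
// over an `as Type` assertion.
interface Tokens {
  accessToken: string;
  refreshToken: string;
}

// `as` vouches for data the compiler can't verify. This compiles,
// but accessToken is undefined at runtime:
const risky = {} as Tokens;

// Partial<Type> keeps the missing fields visible to the compiler,
// so callers are forced to handle them:
const draft: Partial<Tokens> = {};
const token = draft.accessToken ?? "none";
```

The second form turns a runtime surprise into a compile-time obligation, which is exactly the kind of rule that ends up in a prompt file.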

This works. Until it doesn’t.

Prompts that compensate for training gaps are coupled to the model’s weaknesses.

When Claude Opus makes a specific mistake often enough, you encode a rule to prevent it. That rule is now implicitly tied to that model version. The next release fixes the original problem but introduces a new one. Your prompt still guards against the old mistake. Now you have two layers of correction, one of them stale.

The feedback loop for “how much prompting is enough” is hard to make objective. Trial and error is the honest answer most practitioners give. But trial and error without a stable reference point just produces drift. You write more rules. Some contradict others. The prompt file grows. Nobody trims it.

There’s a harder problem underneath all of this.

Context windows have a ceiling that token counts don’t reveal.

A model with a 200k token window and a model with a 1M token window will both start hallucinating around 60k tokens of active context. The number on the label is not the number that matters. What matters is how much the model can hold in coherent focus at once. That number is smaller than anyone wants to admit.

So if your strategy is to dump a large specification into context and let the model follow it, you hit a wall. Not a hard crash. A slow degradation. The model starts ignoring rules near the middle of the prompt. It invents plausible-sounding patterns that match nothing in your codebase. The output looks right. It isn’t.

This is where the real question lands. If guidelines degrade at scale, and prompts couple to model versions, what do you anchor to?

The codebase is the prompt.

The most reliable way I’ve found to guide an LLM is to build a reproducible architecture for a given class of problem. A browser extension pattern won’t help you build a data pipeline. It’s one pattern per problem shape.

Tests, linting rules, code coverage thresholds, a consistent directory structure, and clear naming conventions baked into the code itself. Not described in a markdown file the model may or may not attend to.
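"Baked into the code" can be as small as one lint rule. A hypothetical ESLint flat-config fragment (the article does not show Hutch's actual lint setup), banning the `as Type` escape hatch so the rule lives in the repo where CI enforces it, not in a prompt the model may skim past:

```javascript
// eslint.config.js — hypothetical fragment, not Hutch's real config.
import tseslint from "typescript-eslint";

export default [
  ...tseslint.configs.recommended,
  {
    rules: {
      // Forbid `as Type` assertions entirely:
      "@typescript-eslint/consistent-type-assertions": [
        "error",
        { assertionStyle: "never" },
      ],
    },
  },
];
```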

When a new feature needs the same tools, you point the model at an existing implementation and say “build X, follow this pattern.” The model does not need a five-page specification for how to write a service layer. It needs one good service layer to reference. Sometimes writing that first implementation by hand is faster than the back-and-forth of prompting a model to get it right.

This is what LLMs do well. Pattern replication. Given a concrete example and a clear target, they produce consistent output. Given abstract rules and long documents, they produce approximations.

What this looks like in practice: building browser extensions for Hutch.

I’m building Hutch in the open. It’s a reading list app, and it runs as both a Firefox and Chrome browser extension. The way those extensions came to exist is a clean example of why patterns beat guidelines.

The Firefox extension came first. I built it with a high degree of human intervention. I wrote the tests, set up 100% coverage enforcement, configured the linter, chose the directory structure. The model helped, but the architecture was mine. That’s the cost of the first feature in any system. You can’t skip it.

The Firefox extension’s background.ts uses the WebExtension browser.* API throughout. Here's the token storage and message listener:

// firefox-extension/src/runtime/background/background.ts

const tokenStorage: TokenStorage = {
  async getTokens(): Promise<OAuthTokens | null> {
    const result = await browser.storage.local.get(STORAGE_KEY);
    const raw = result[STORAGE_KEY];
    if (!raw) return null;
    return raw as OAuthTokens;
  },
  // ...
};

browser.runtime.onMessage.addListener((raw, _sender, sendResponse) => {
  if ((raw as { type: string }).type === "shortcut-pressed") {
    return;
  }

  const message = raw as PopupMessage;

  corePromise.then((core) => {
    switch (message.type) {
      case "login": { /* ... */ }
      case "save-current-tab": { /* ... */ }
      case "remove-item": { /* ... */ }
      // ...
    }
  });

  // ...
});

Then I asked the model to build the Chrome extension by porting the Firefox one. That first attempt produced code that looked similar but went Chrome-native throughout:

// PR #79 — chrome-extension/src/runtime/background/background.ts

const tokenStorage: TokenStorage = {
  async getTokens(): Promise<OAuthTokens | null> {
    const result = await chrome.storage.local.get(STORAGE_KEY); // chrome.*, not browser.*
    const raw = result[STORAGE_KEY];
    if (!raw) return null;
    return raw as OAuthTokens;
  },
  // ...
};

chrome.runtime.onMessage.addListener((raw, _sender, sendResponse) => {
  // ...
  const message = raw as PopupMessage;

  corePromise.then((core) => {
    switch (message.type) {
      case "login": { /* ... */ }
      // ...
    }
  });

  // ...
});

The structure was the same. But every single browser.* call became chrome.*. Over 30 occurrences. The context menus used chrome.contextMenus. The tab management used chrome.tabs. The OAuth flow used chrome.windows. The model picked the Chrome-native API because that's what Chrome extension tutorials on the web use. That's the training distribution talking.

CI broke repeatedly. Chrome handles service workers differently from Firefox’s background scripts. The model couldn’t fix the E2E tests: Chrome headless mode doesn’t support loading unpacked extensions with service workers, and the model kept trying. After multiple rounds of automated CI fixes that all failed, I closed the PR and wrote: “Split into multiple refactoring steps first.”


That’s the key moment. The problem was not that the model needed a guideline saying “use webextension-polyfill instead of chrome.* APIs.” The problem was that no shared architecture existed for the model to follow.

The extraction

So I pulled the browser-agnostic logic out of the Firefox extension and into a new package: browser-extension-core. The core defines a BrowserShell interface that each extension implements:

// browser-extension-core/src/shell.types.ts

export interface BrowserShell {
  onShortcutPressed: (handler: () => void) => void;
  openLoginScreen: (params: { url: string; title: string }) => void;
  focusLoginWindow: () => void;
  getActiveTab: () => Promise<{ id?: number; url: string; title: string } | null>;
  queryActiveTabs: () => Promise<Array<{ id?: number; url?: string; title?: string }>>;
  setIcon: SetIcon;
  createContextMenus: () => void;
  onContextMenuClicked: (handler: (info: { /* ... */ }, tab?: { /* ... */ }) => void) => void;
  onTabActivated: (handler: (tabId: number, url: string) => void) => void;
  onTabUpdated: (handler: (tabId: number, url: string) => void) => void;
  onLoginWindowClosed: (handler: () => void) => void;
}

Everything browser-agnostic moved into the core. OAuth, token management, the reading list, popup logic, build tooling. Each extension became a thin shell that wires browser-specific APIs into this interface. The BrowserExtensionCore function takes a BrowserShell and does the rest:

// browser-extension-core/src/core.ts

export function BrowserExtensionCore(
  shell: BrowserShell,
  deps: { auth: Auth; logger: HutchLogger; readingList: ReadingList }
): Core {
  const eventBus = createEventBus();
  const saveCurrentTab = initSaveCurrentTab({ saveUrl: deps.readingList.saveUrl });
  const { updateIconForTab } = initIconStatus({ /* ... */ });
  // ...
}
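What a shell implementation looks like on the other side of that interface can be sketched in miniature. This is hypothetical and heavily trimmed: a two-method subset of the interface, with a fake `browser.*` object standing in for the real WebExtension global, to show that the shell is pure glue, with each method adapting exactly one browser API call:

```typescript
// Hypothetical, trimmed-down sketch of the shell wiring. Not Hutch code.
interface MiniShell {
  getActiveTab: () => Promise<{ id?: number; url: string; title: string } | null>;
  createContextMenus: () => void;
}

// Fake stand-in for the real `browser` global, so the sketch is runnable:
const fakeBrowser = {
  tabs: {
    query: async (_q: { active: boolean; currentWindow: boolean }) => [
      { id: 1, url: "https://example.com", title: "Example" },
    ],
  },
  menus: {
    create: (_opts: { id: string; title: string }) => {},
  },
};

// Each shell method is one thin adaptation of a browser API:
const shell: MiniShell = {
  getActiveTab: async () => {
    const [tab] = await fakeBrowser.tabs.query({ active: true, currentWindow: true });
    return tab ? { id: tab.id, url: tab.url, title: tab.title } : null;
  },
  createContextMenus: () => {
    fakeBrowser.menus.create({ id: "save-page", title: "Save Page" });
  },
};
```

Nothing in the shell makes decisions; it translates. That is what keeps it safe to leave untested.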

The project structure became explicit about what’s shared and what isn’t:

projects/
  firefox-extension/
    src/runtime/
      background/background.ts            # thin: implements BrowserShell with browser.*
      background/tinted-icon.browser.ts   # Firefox-specific icon tinting
  chrome-extension/
    src/runtime/
      background/background.ts            # thin: implements BrowserShell with polyfill
      background/create-context-menus.ts  # Chrome-specific context menu handling
      offscreen/offscreen.ts              # Chrome-only: service workers can't use canvas
  browser-extension-core/
    src/
      core.ts           # all business logic
      shell.types.ts    # the interface both extensions implement
      auth/             # OAuth, PKCE, token management
      reading-list/     # save, remove, fetch, find
      popup/            # filtering, pagination
      build/            # shared build tooling

The second attempt.

I tried the Chrome extension again, abandoned that PR too, and finished the work locally. When the final commit landed, the Chrome background.ts looked like this:

// chrome-extension/src/runtime/background/background.ts (final)

import browser from "webextension-polyfill"; // ← one import
import { BrowserExtensionCore, /* ... */ } from "browser-extension-core";
import { initCreateContextMenus } from "./create-context-menus";

const CLIENT_ID = "hutch-chrome-extension"; // ← one string

const shell: BrowserShell = {
  // ...
  createContextMenus: initCreateContextMenus(browser.contextMenus), // ← one API difference
  // ...
};

The diff between the final Firefox and Chrome background.ts is 5 meaningful lines. One polyfill import. One client ID string. browser.menus vs browser.contextMenus (Firefox and Chrome name the same API differently). An offscreen message filter (Chrome service workers can't use OffscreenCanvas directly, so icon tinting routes through an offscreen document). Everything else is identical.

I didn’t write a guideline saying “use webextension-polyfill.” I didn’t write a guideline saying “keep your background.ts thin.” The Firefox extension’s code said those things by existing. Both files still use as type assertions at message boundaries. The pattern replicated the architecture. It also replicated the shortcuts. Patterns carry the good and the bad.

The one genuine Chrome-specific divergence, context menu creation, got its own extracted file:

// chrome-extension/src/runtime/background/create-context-menus.ts

export function initCreateContextMenus(contextMenus: ContextMenusApi) {
  return async function createContextMenus() {
    await contextMenus.removeAll();
    contextMenus.create({
      id: MENU_ITEM_SAVE_PAGE,
      title: "Save Page to Hutch",
      contexts: ["page"],
    });
    contextMenus.create({
      id: MENU_ITEM_SAVE_LINK,
      title: "Save Link to Hutch",
      contexts: ["link"],
    });
  };
}

Chrome service workers can be killed and restarted by the browser. Firefox background scripts persist. So Chrome needs removeAll() before creating context menus to avoid duplicates after a restart. That's a real API difference. The model handled it by extracting a small function. No guideline needed. The pattern of "extract browser-specific behavior into its own file" was already established by tinted-icon.browser.ts in the Firefox extension.
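The lifecycle difference is easy to demonstrate. A runnable sketch with a synchronous in-memory stand-in for the contextMenus API (the real API is async), showing why `removeAll()` keeps a restarted worker idempotent:

```typescript
// Hypothetical in-memory stand-in for chrome.contextMenus. Not Hutch code.
const menus: string[] = [];
const contextMenus = {
  removeAll: () => { menus.length = 0; },
  create: (opts: { id: string }) => { menus.push(opts.id); },
};

function createContextMenus() {
  contextMenus.removeAll(); // without this, every worker restart adds duplicates
  contextMenus.create({ id: "save-page" });
  contextMenus.create({ id: "save-link" });
}

// Simulate the browser killing and restarting the service worker:
createContextMenus();
createContextMenus();
// menus holds two entries, not four
```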

The coverage config tells the rest of the story:

// chrome-extension/enforce-coverage.config.js

// All testable business logic moved to browser-extension-core.
// Chrome-extension only contains browser-specific bootstrap code
// (entry points excluded below) and *.browser.ts files (excluded by base config).
const config = {
  thresholds: {
    statements: 100,
    branches: 100,
    functions: 100,
    lines: 100,
  },
  extraExcludePatterns: [
    'src/runtime/background/background.ts',
    'src/runtime/popup/popup.ts',
    'src/runtime/content/shortcut.ts',
    'src/runtime/offscreen/offscreen.ts',
  ],
};

100% coverage. But the chrome-extension project barely contains testable logic. The entry points are excluded because they’re bootstrap code. All the business logic lives in browser-extension-core, which has its own test suite. The Chrome project is a thin shell. That's the shape you want when a model is writing the code: a small surface area of browser-specific glue around a well-tested core. There’s a name for this in the xUnit Patterns book: the Humble Object. Keep the hard-to-test boundary code thin. Move the testable logic somewhere else.
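The Humble Object shape reduces to a few lines. A hypothetical illustration (the names are made up, not taken from Hutch): the decision is a plain function that is trivial to test, and the boundary file only forwards to it.

```typescript
// Hypothetical Humble Object example. The logic is a pure, fully
// testable function:
export function shouldShowSavedIcon(savedUrls: Set<string>, tabUrl: string): boolean {
  return savedUrls.has(tabUrl);
}

// The humble boundary (excluded from coverage) would only wire it up:
//
//   browser.tabs.onActivated.addListener(async ({ tabId }) => {
//     const tab = await browser.tabs.get(tabId);
//     setIcon(shouldShowSavedIcon(saved, tab.url ?? ""));
//   });
```

The boundary has nothing worth testing because it decides nothing; the function has everything worth testing and needs no browser to run.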

The full history is public in the Hutch repo. The failed PR is still there. So is the extraction, the second attempt, and the final commit. You can trace the exact moment when “give the model better instructions” stopped working and “give the model better code” started.

A messy codebase does not become less messy when you add AI. It becomes messier faster.

None of this means prompt engineering is useless. For the first pass on a new system, for one-off scripts, for exploration, prompts carry weight. The problem is when teams treat prompting as the primary interface with AI-assisted development and invest months into perfecting spec documents that will not survive the next model update.

The spec is fragile. The code pattern is durable.

If you need to write a guideline telling the model what good code looks like, the model doesn’t know yet. And if the model doesn’t know, your guideline is fighting the training distribution. That’s a losing position over time.

A better position: make the codebase so clear that the model can’t miss the pattern. The code is the documentation, the tests are the spec, and the linter is the style guide.

Writing better prompts is a reasonable goal.

Eliminating the need for them is the real work.