Audit, Test, Automate: How We Decide What AI Can Own

I run a tiny compiler startup: two people, pre-revenue. Like most companies at this stage, we are always short on staff, money, and time.

When Claude Code and Codex became good enough to use every day, they changed our trajectory. They let us do a lot more with much less.

Making it work was seriously painful, though. We had to rebuild the work itself: break it into smaller tasks, define what could be checked, and decide what was safe to delegate.

This post explains the delegation system we use. It is a field note for founders and small teams asking the same question: what can AI own, what can it assist with, and what should stay human?

Logging

The first step is a log: an exhaustive task inventory. For a month, write down every task you do, how long it takes, and everything you wanted to do but didn’t.

This is tedious, and it is important: the log becomes your automation backlog.

Let me show you how I did it.

This is one of my Google Keep logs:

2/27/26
wrote AI article  0.5 hour
executed unit test + fix (Claude; just reviewed/merged)
.25 cleanup tickets (some "todo" were actually done)
0.5h would love: organize a dinner to discuss our new product

After a month of this, I pasted all the logs into Claude and asked it to group the entries. From the result, I built a spreadsheet of my internal processes and how long each one took:

3 hours writing tests
2 hours reconciling bank accounts
10 hours reading
1 hour cleaning up tickets
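
Once every entry carries a category, the totals are a few lines of code. The sketch below is a minimal illustration, not necessarily how the spreadsheet above was built. It assumes each log entry has already been labeled (by hand or by an LLM) with a category and a duration in hours; the function name `total_hours_by_category` and the sample entries are hypothetical.

```python
from collections import defaultdict


def total_hours_by_category(entries):
    """Sum hours per category and return (category, hours) pairs, largest first."""
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical entries, already labeled with a category and a duration in hours.
entries = [
    ("writing", 0.5),
    ("ticket cleanup", 0.25),
    ("writing", 1.0),
    ("ticket cleanup", 0.75),
]

for category, hours in total_hours_by_category(entries):
    print(f"{hours:g} hours {category}")
```

Sorting by total hours puts the largest time sinks at the top, which is a reasonable place to start looking for automation candidates.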

Once you’ve grouped everything together, you can start rebuilding how your company works.

The 3-Questions Test

The audit lists everything you do. The next step is to test whether each task is a good fit for AI execution. Three questions decide it:

Can the task be learned from reliable public sources?
Is a conventional result good enough?
Can we check the output, or bound the failure radius?
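
The three questions can be written down as a small checklist in code. This is a minimal sketch under stated assumptions: the `Task` structure and the function names are hypothetical, each answer is reduced to a yes/no, and the two sample tasks use the answers given in the examples later in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """Yes/no answers to the three questions for one task (hypothetical structure)."""
    name: str
    publicly_learnable: bool      # Q1: learnable from reliable public sources?
    conventional_is_enough: bool  # Q2: is a conventional result good enough?
    checkable_or_bounded: bool    # Q3: can we check the output or bound the failure radius?


def failed_questions(task: Task) -> List[str]:
    """Return the questions the task fails, in the order they are asked."""
    failed = []
    if not task.publicly_learnable:
        failed.append("publicly documented")
    if not task.conventional_is_enough:
        failed.append("quality of result")
    if not task.checkable_or_bounded:
        failed.append("evaluation or failure radius")
    return failed


def ai_can_own(task: Task) -> bool:
    """AI can own the task only when no question fails."""
    return not failed_questions(task)


# Answers taken from the examples discussed later in the post.
comcast_credit = Task("Comcast credit", True, True, True)
linkedin_post = Task("LinkedIn post", True, False, True)

for task in (comcast_credit, linkedin_post):
    print(task.name, ai_can_own(task), failed_questions(task))
```

Returning the list of failed questions, rather than a bare yes/no, shows which part of the process would need redesigning before the task could be delegated.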

Publicly documented

Frontier labs compete with each other fiercely. They are training their LLMs on everything public they can get their hands on: blogs, textbooks, examples, documentation. So a good rule of thumb: if it’s somewhat public, there is a good chance the LLM knows about it and can act on it.

To illustrate, I’ll use a counter-intuitive example where LLMs do not work well: email deliverability. There are a lot of blog posts and explanations, but most of what is online is wrong, outdated, or useless. AI can help with the basic cases (SPF, DKIM, and DMARC setup), but it is of little help with the real problem: sender reputation. Bookkeeping is different. The rules are public, courses are accessible, and the expected output is conventional.

A deeper version of this question: LLMs are very strong at translation problems, meaning conversions between two well-documented formalisms: code to docs, spec to tests, prose to structured data, engineering ideas to patent claims. The transformer architecture was originally built for machine translation. And translation forces tacit assumptions to surface. An engineer says “the system handles errors gracefully,” but a formal target language can’t accept that: you have to spell out what happens, under which conditions, with which fallback. The vernacular hides assumptions; the formal target can’t. The same thing happens between natural language and math or code: in everyday speech “nothing” and “zero” are basically the same, but in programming they are radically different. Translating forces you to pick, and the LLM usually picks the most conventional answer for you.

“Publicly documented” doesn’t mean “there are blog posts.” It means: an outsider could learn it from public sources (books, standards, accredited courses).

Quality of result

The standard theory says: if a task is not a source of competitive advantage, outsource it. Back when printing was a real business, it was the textbook example: unless you are a printer, you only need to print as well as your competitors, because printing better will not bring you additional business.

With AI, the question is similar but harsher. An LLM out of the box will give you an average result (unless you equip it with strong context, taste, evaluation and iteration). If the result you want is not average, you will need to fight the AI to get it.

But an average result is perfectly fine for a surprising number of tasks, and that is already useful. A routine development status report, for example, is a task where average is fine. Convincing a Google VP to use your product probably is not.

Finally, this looks like a trivial question, but it actually defines what your company is not: you do not compete on the things you accept being average at, and getting that call wrong can tank your business. IBM’s PC history is the canonical warning: pieces that looked like supplier decisions (the operating system and the processor) became industry control points. The framework cannot fully protect you from that. Every delegation still needs one extra question: could this define the company?

Evaluation & failure radius

I left the result evaluation for the end because it’s the most complex and the most important piece. Evaluation splits into two cases: work you can judge directly, and work you cannot.

First, the case where you can evaluate the work yourself (code if you’re a programmer, a design if you’re a designer):

My rule of thumb is to never spend more time reviewing the work of an AI than I would spend reviewing the work of a human.

The rationale is purely economic: you are the main cost, not the AI. If you spend more time verifying AI work than you would spend verifying a human’s, you are increasing your cost instead of decreasing it. (For simplicity, I assume a production cost of zero once the task is delegated to an AI.) Coding is a good example: lead developers already spend a lot of time reviewing PRs, so the review cost stays the same while the “make” cost falls to almost zero.
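
The arithmetic behind this rule is simple enough to write down. The sketch below is an illustration under stated assumptions, not a cost model from our books: the function names and the numbers are hypothetical, AI production time is counted as zero (as above), and all costs are measured in hours of team time.

```python
def within_review_budget(ai_review_hours: float, human_review_hours: float) -> bool:
    """Rule of thumb: reviewing AI work should not take longer than reviewing human work."""
    return ai_review_hours <= human_review_hours


def hours_saved(make_hours: float, human_review_hours: float, ai_review_hours: float) -> float:
    """Team hours saved by delegating, assuming AI production time counts as zero."""
    before = make_hours + human_review_hours  # a person makes it, you review it
    after = 0.0 + ai_review_hours             # the AI makes it, you review it
    return before - after


# Hypothetical numbers for one pull request.
print(within_review_budget(ai_review_hours=1.0, human_review_hours=1.0))
print(hours_saved(make_hours=4.0, human_review_hours=1.0, ai_review_hours=1.0))
print(hours_saved(make_hours=4.0, human_review_hours=1.0, ai_review_hours=6.0))
```

With these made-up numbers, the second call shows a gain and the third shows a loss: once review takes longer than making and reviewing the work the old way, delegation costs more than it saves.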

Now, the hard part. In life, you can’t evaluate everything; with AI it’s the same. How do you judge the work of your CPA? How do you know that your doctor prescribed you the right medicine? You just can’t. Either the failure is too hard to recover from (e.g. picking the wrong surgeon for your open heart surgery) or the feedback loop is too long to be useful (e.g. exercising properly for your health).

The usual solution for unverifiable tasks is to operate from trust: working only with people you know (incidentally, that’s why known brands command a premium). But AI is not “people”, so I don’t trust it.

For each unverifiable task, I define a failure radius: what’s the worst that can happen if it fails? In most cases, either a downstream layer catches it or nothing does. Vibe coding is the canonical example: if you’re not a coder, you can’t really judge the code. You can only bound it: run it locally, test the behavior, and keep it away from production until someone competent reviews it.
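
As one concrete way to "run it locally", the sketch below runs an unreviewed script in a child process, inside a throwaway directory, with a time limit. It is a minimal illustration under loud assumptions: `run_bounded` is a hypothetical helper, and this is not a security sandbox. The child process still has your user’s permissions and network access, so it only bounds run time and stray files in the working directory.

```python
import subprocess
import sys
import tempfile


def run_bounded(script_source: str, timeout_seconds: float = 30.0) -> subprocess.CompletedProcess:
    """Run an unreviewed Python script in a child process with a time limit.

    The script runs in a temporary directory that is deleted afterwards.
    This is NOT a sandbox: it does not restrict permissions or network access.
    """
    with tempfile.TemporaryDirectory() as scratch_dir:
        return subprocess.run(
            [sys.executable, "-c", script_source],
            cwd=scratch_dir,          # relative file writes land in the throwaway directory
            capture_output=True,      # keep stdout and stderr for inspection
            text=True,
            timeout=timeout_seconds,  # raises subprocess.TimeoutExpired if exceeded
        )


result = run_bounded("print(2 + 2)")
print(result.returncode, result.stdout.strip())
```

Checking the return code and the captured output is the "test the behavior" part; keeping the script away from production until someone competent reviews it remains a process decision that no helper function makes for you.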

When verification is structurally impossible, the answer isn’t “verify harder” — it’s “bound the downside.”

Summary

After running your tasks through these three questions, you’ll find that many of them fail the test. That’s a sign that you need to redesign your process. For example, we had to generate a report for our health provider. AI could do all of it, but the failure radius was too large for AI to own the task: a mistake could get us expelled. So we added a human verification step.

Examples

Here is how I applied the test and what results I got.

Comcast Outage Credit task

Comcast at my home wasn’t very reliable: multiple outages per day for a while. (They eventually fixed it. Another story.)

I sent an agent to negotiate a credit on my behalf. I gave it the facts (outage length, the WeWork I was forced to work from). The agent ran a multi-turn negotiation, pushed back on the first offer, cited the concrete costs the outage caused me, and got $25 back.

The framework predicted this would work:

Getting Comcast refunds is documented to death in forums.
I didn’t need an outlier result. This was an experiment.
I could evaluate the outcome (money back).

The interesting part isn’t the amount. It’s that the failure radius was near zero (worst case, Comcast says no).

Write a LinkedIn Post

Without missing a beat, I asked my agent to write me a LinkedIn post about my Comcast adventure. The LLM’s draft was mediocre, and LinkedIn needs above-average content to cut through the noise.

The framework predicted it. LinkedIn writing is commoditized knowledge and I can evaluate what’s good, but the task needs outlier quality. You’re competing for attention against everyone else posting. Average loses.

All three questions need to pass. One structural mismatch kills full automation. The LinkedIn post passes the first and the last question but fails the second.

Filing Patents

Patent filing is one of our core processes. We moved much of the invention-capture pipeline to AI, not to save cost, but to file more patent applications.

Before AI, filing an application cost between $5K and $40K, not counting our time. And our time is the real bottleneck: prior-art search, writing up the invention, explaining patentability, and drafting claims.

Our current process is AI-heavy but with human oversight. Periodically, AI reviews our repositories. It proposes candidate inventions, runs a preliminary prior-art search, and prepares a filing package: invention summary, prior-art notes, claim candidates, and draft specification.

Then humans decide. We approve, reject, amend, or merge candidates, and decide whether to involve counsel.

Patent drafting is exactly the kind of problem LLMs are strong at: translation from engineering ideas to legal claims. LLMs have seen thousands of examples on both sides. They can do about 90% of the routine filing package, which is most of the production work needed to capture an invention.

The framework predicted it (patents are publicly documented and average quality is enough for a company like ours). For us, the failure radius is bounded because the alternative would be to file far fewer patents. To be clear, for the patents we judge very important we still use counsel. It’s a classic quantity vs. quality trade-off.

Tasks covered in this post: Comcast credit, LinkedIn post, unit-test generation, patent filing package, email deliverability.

Conclusion

AI made us a more explicit company: less tacit knowledge, more scripts and “executable prose”.

That is the real leverage. Before delegating work to AI, we first had to design the work: log it, group it, name it, split it into smaller tasks, and decide what success looked like.

Then every task had to pass the same test: is the domain publicly learnable, is an ordinary result good enough, and can we check the output? If we could not fully check it, we had to ask a different question: what is the failure radius?

Most tasks fail this test at first. That is not a reason to give up. It is a signal that the process is still too vague. Add a checklist. Add tests. Add a human review step. Add a narrower scope. Add a permission boundary. Once the work becomes inspectable, some of it becomes delegable. Once it becomes delegable, some of it becomes automatable.

This is the operating lesson for us: AI is a forcing function for better delegation. It rewards clear inputs, bounded tasks, explicit review, and honest judgment about risk.

For a tiny team, that matters more than the model. We do not win by asking AI to “do the work.” We win by redesigning the work so that humans keep what only humans can do, while AI handles everything else.

Start with the audit. Write down everything you do for a month. You will hate it. Do it anyway. That list is where the leverage starts.