Putting Intelligence to Work

Frontier models can theoretically augment over 90% of knowledge work tasks.¹ In practice, even Anthropic’s own employees fully delegate less than 20% of their work to Claude.² OpenAI measures a 6-17x gap between what power users extract from models and what everyone else does.³

This gap has a name, capability overhang, and it’s the historical norm. Electric dynamos were “everywhere but in the economic statistics” in 1900: electric motors supplied less than 5% of the mechanical drive power in American factories. The real productivity gains didn’t arrive until the 1920s, when factories were redesigned around unit-drive systems.⁴ The engine was never the bottleneck. The factory was.

We’re in the same phase with AI. The engine is here. What hasn’t caught up is everything surrounding it — the environment a model operates in, and how we hand it work. Thomas Hughes called this pattern a “reverse salient”: a segment of an advancing front that falls behind while the rest surges ahead.⁵

Most conversations about AI focus on the model. How smart it is, what benchmarks it clears, whether it can reason or code or plan. But the gap between theoretical capability and actual use isn’t a model problem. It’s a systems problem. Two systems, specifically.

The interface shapes what work you can hand off — ChatGPT’s chat window, Claude Code’s terminal, a Slack bot. The harness shapes what the model can actually do with it. If the model is a CPU, the harness is the operating system:⁶ it manages what context the model sees, what tools it can reach, and how its execution is controlled.

Interfaces are visible — you interact with them. Harnesses operate behind the surface. Both evolve as models get smarter, but the harness is the one that determines what’s possible, and it’s the one most people miss.

Chat is where most people still interact with AI. You ask, the model responds, you refine. Everything is synchronous, and the model’s output is bounded by your input rate. The system can’t move forward without you driving it.⁷

The harness constraint: chat systems are read-only with respect to the outside world. They fetch information — search the web, retrieve documents, call tools for data — but they can’t act on anything. No files edited, no messages sent, no state changed anywhere.

Alignment is simple: you’re right there. You monitor every response, steer through your input, correct mistakes in real time. The user is the alignment mechanism. This works because the model navigates almost no ambiguity on its own — but it also means the system’s output is capped by your attention.

Collaboration inverts this: you define the work, and the system drives execution.⁸ You hand over a task (refactor this module, draft this report, investigate this bug) and the system decides how: which files to read, what approach to take, when to ask for clarification. You monitor and intervene for course corrections, but the low-level decisions happen without you.

The harness goes from read-only to read-write. The model edits files, calls APIs, interacts with external services, writes and executes code. Doing work in the outside world requires tools that act in it.
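One way to picture the shift from read-only to read-write is as a flag on each tool that the harness checks before running it. The sketch below is illustrative only: `Tool`, `Harness`, `build`, and the in-memory `files` dict are hypothetical names standing in for a real tool layer, not any product's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[str], str]
    writes: bool  # True if the tool changes state outside the model


class Harness:
    """Hypothetical harness that gates tools by whether writes are allowed."""

    def __init__(self, allow_writes: bool) -> None:
        self.allow_writes = allow_writes
        self.tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, arg: str) -> str:
        tool = self.tools[name]
        if tool.writes and not self.allow_writes:
            raise PermissionError(f"{name} changes state; this harness is read-only")
        return tool.run(arg)


# Stand-in for the outside world (assumption: a single in-memory "file").
files: Dict[str, str] = {"notes.txt": "draft"}


def read_file(path: str) -> str:
    return files[path]


def append_line(path: str) -> str:
    files[path] = files[path] + "\nreviewed"
    return "ok"


def build(allow_writes: bool) -> Harness:
    harness = Harness(allow_writes)
    harness.register(Tool("read_file", read_file, writes=False))
    harness.register(Tool("append_line", append_line, writes=True))
    return harness


chat = build(allow_writes=False)    # chat paradigm: fetch only
collab = build(allow_writes=True)   # collaboration paradigm: fetch and act
```

Both harnesses expose the same tools. `chat.call("append_line", "notes.txt")` raises `PermissionError`, while the same call on `collab` changes the file.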

In chat, you align the model by watching it. In collaboration, you can’t watch every decision — the system makes dozens of small choices between each of your interventions. Those choices need to be good without you in the loop. Context is how you get there. Context requires infrastructure:

Memory persists what the system learns across sessions, so yesterday’s corrections survive into today. Without it, every conversation starts from zero — the system repeats mistakes you’ve already corrected.

Skills package tools, context, and instructions into reusable capabilities, so a skill for code review carries different judgment than a skill for customer outreach. Without them, every task gets the same generic treatment.

Rules codify your intent into persistent instructions — never push to main without tests, ask before deleting files — so the system respects boundaries you shouldn’t have to re-state.

Context engineering⁹ is the discipline of curating all of this, because the model’s judgment is only as good as the context it reasons over. Get the context wrong and a capable model makes confident, well-reasoned, completely wrong decisions.
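To make memory, skills, and rules concrete, here is a minimal sketch of a harness assembling context before a model call. Every name in it (`Skill`, `build_context`, the section headings, the `max_memories` cutoff) is an assumption for illustration; real harnesses differ in how they select and order this material.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Skill:
    name: str
    instructions: str
    tools: List[str] = field(default_factory=list)


def build_context(
    task: str,
    rules: List[str],
    memory: List[str],
    skill: Skill,
    max_memories: int = 3,
) -> str:
    """Assemble one prompt from persistent rules, recent memory, and a skill."""
    # Keep only the most recent memories so the context stays bounded.
    recent = memory[-max_memories:] if max_memories > 0 else []
    sections = [
        "## Rules\n" + "\n".join(f"- {r}" for r in rules),
        "## Memory\n" + "\n".join(f"- {m}" for m in recent),
        f"## Skill: {skill.name}\n{skill.instructions}\nTools: {', '.join(skill.tools)}",
        f"## Task\n{task}",
    ]
    return "\n\n".join(sections)


review = Skill(
    name="code-review",
    instructions="Flag missing tests before style issues.",
    tools=["read_file", "run_tests"],
)
context = build_context(
    task="Review the parser change",
    rules=["Never push to main without tests", "Ask before deleting files"],
    memory=["User prefers small diffs", "Parser tests live in tests/parser"],
    skill=review,
)
```

Swapping `review` for a different skill changes the instructions and tools the model sees, while the rules and memory carry over unchanged.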

Control shifts from real-time to codified. Instead of steering each response, you set permissions, define approval gates, write rules. Configuration scales across sessions and tasks. Conversation doesn’t. This is a profound shift — you’re still the alignment mechanism, but through persistent architecture, not transient dialogue.
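Codified control can be as plain as an ordered list of patterns, each mapped to allow, ask, or deny. The sketch below assumes a harness that sees shell-style commands as strings; the `POLICY` entries, the `decide` function, and the first-match-wins ordering are all hypothetical choices, and a blanket deny stands in for the "never push to main without tests" rule.

```python
from enum import Enum
from fnmatch import fnmatchcase
from typing import List, Tuple


class Decision(Enum):
    ALLOW = "allow"
    ASK = "ask"      # approval gate: pause and wait for the human
    DENY = "deny"


# Ordered from most to least specific; the first matching pattern wins.
POLICY: List[Tuple[str, Decision]] = [
    ("git push origin main", Decision.DENY),
    ("rm *", Decision.ASK),
    ("git *", Decision.ALLOW),
    ("pytest*", Decision.ALLOW),
]


def decide(
    command: str,
    policy: List[Tuple[str, Decision]] = POLICY,
    default: Decision = Decision.ASK,
) -> Decision:
    """Return the decision for a command; unknown commands fall back to asking."""
    for pattern, decision in policy:
        if fnmatchcase(command, pattern):
            return decision
    return default
```

Defaulting to `ASK` for anything unlisted is the conservative choice: the configuration grows as trust does, instead of starting wide open.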

The result buys back enormous amounts of time. But it doesn’t scale infinitely. You can run multiple sessions in parallel, but the cognitive load of context-switching between them becomes the new bottleneck. You’re a collaborator, not a manager. The system still needs you close.

Everyone wants to build management-paradigm systems. Almost nobody knows how.

The premise is seductive: human attention decouples from system execution. The system runs for hours, takes most decisions autonomously, surfaces results when done. You review async. The constraint on throughput shifts from your attention to the system’s capability — an order of magnitude more parallelization becomes possible.

Claude’s background tasks, Codex’s autonomous runs, and Linear’s Agents all gesture toward this. None fully deliver it. The reason isn’t the interface — it’s the harness underneath. Four unsolved problems block the way.

Self-evaluation. In collaboration, you verify output. In management, the harness needs to. Test suites, conformance checks, automated review — the system needs ways to judge its own work before surfacing it.¹¹ Without self-evaluation, a system running autonomously for hours compounds errors instead of catching them.
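A minimal sketch of that loop, assuming checks are plain functions that return an error message or `None`: the harness produces a candidate, runs every check, feeds failures back, and only surfaces work that passes. `run_with_self_evaluation`, `fake_model`, and `has_test_report` are hypothetical names, and the "model" here is a stub rather than a real call.

```python
from typing import Callable, List, Optional, Tuple

# A check returns an error message, or None if the candidate passes.
Check = Callable[[str], Optional[str]]


def run_with_self_evaluation(
    produce: Callable[[List[str]], str],
    checks: List[Check],
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Return (result, []) on success, or (None, remaining_problems) on failure."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        candidate = produce(feedback)
        feedback = [msg for check in checks if (msg := check(candidate)) is not None]
        if not feedback:
            return candidate, []
    # Out of attempts: surface the unresolved problems instead of bad output.
    return None, feedback


def fake_model(feedback: List[str]) -> str:
    # Stand-in for a model call: it only adds the missing section once told.
    return "summary\ntests: passed" if feedback else "summary"


def has_test_report(text: str) -> Optional[str]:
    return None if "tests:" in text else "missing test report"


result, problems = run_with_self_evaluation(fake_model, [has_test_report])
```

The attempt cap matters as much as the checks: a loop that never gives up is just another way to compound errors for hours.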

Escalation judgment. Knowing when to proceed vs. when to surface a decision to the human. Too many escalations and you’re back to collaboration. Too few and you get silent failures. Nobody has solved this boundary — and it’s the core design problem of the entire paradigm.
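To show where the tension sits, here is a deliberately naive escalation rule. It assumes the system can attach a confidence estimate and a reversibility flag to each pending decision, which is itself a hard problem; `PendingDecision`, `should_escalate`, and the 0.8 threshold are hypothetical, and nothing here should be read as a solution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingDecision:
    description: str
    confidence: float  # assumed estimate between 0.0 and 1.0
    reversible: bool   # can the action be undone cheaply?


def should_escalate(decision: PendingDecision, min_confidence: float = 0.8) -> bool:
    """Escalate anything irreversible, and anything the system is unsure about."""
    if not decision.reversible:
        return True
    return decision.confidence < min_confidence
```

Raising `min_confidence` pushes the system back toward collaboration, and lowering it invites silent failures. The single knob makes the tradeoff visible, but it does not tell you where to set it.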

Durable state. Tasks running hours or days need error recovery, checkpointing, the ability to resume. Problems solved decades ago in workflow orchestration, but novel for AI systems that reason their way through execution.
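The workflow-orchestration version of this is simple to sketch: record each finished step to disk, and skip recorded steps on restart. The example below assumes steps are named strings whose results are JSON-serializable; `run_steps` and the checkpoint layout are hypothetical. It does not address the harder part the paragraph points at, which is what to checkpoint when the "steps" are decided by a model mid-run.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def run_steps(steps: List[str], do_step: Callable[[str], str], checkpoint_path: str) -> Dict:
    """Run steps in order, persisting progress so a restart resumes where it stopped."""
    state: Dict = {"done": [], "results": {}}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    for step in steps:
        if step in state["done"]:
            continue  # finished in an earlier run
        state["results"][step] = do_step(step)
        state["done"].append(step)
        # Write to a temp file and rename, so a crash never leaves a half-written checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state


# Demo: the second call reuses the checkpoint and only runs the new step.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "checkpoint.json")
first = run_steps(["fetch", "analyze"], str.upper, path)
second = run_steps(["fetch", "analyze", "report"], str.upper, path)
```

Because the checkpoint is written after every step, a crash loses at most the step that was in flight.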

Deeper context. With most decisions automated, quality depends entirely on alignment. Context is alignment’s most efficient mechanism, so systems get hungrier for it — and today’s harnesses can barely keep up with collaboration-scale context demands, let alone management-scale.

An honest question hides behind these four problems: is management even the right paradigm for most work? Chat, collaboration, and management might be three different tools for three different jobs — a screwdriver, a drill, and a lathe, not a progression from primitive to advanced. Some work may be permanently best served by tight human-system coupling. Management may only work for tasks with clear verification criteria and low ambiguity. The industry assumes a progression. Reality might be a toolkit.

A deeper problem cuts across all three paradigms, and it gets more acute as autonomy increases.

Most context systems capture information — what was decided, what happened, what the data shows. Almost none capture process — how decisions get made, in what order, with what checks, under what constraints.

Process is how a human triages tickets, how an agent sequences a refactor, what gets verified first and what gets deferred, when to escalate and when to proceed. Capturing it is what makes “how work gets done” legible to a system.

A model that knows what you decided can answer questions about the decision. A model that knows how you decide can make the next decision the way you would. The gap between those two is the gap between a search engine and a colleague.

This distinction matters because alignment scales with process-context in a way it doesn’t with information-context. You can dump every document in your organization into a retrieval system and the model still won’t make decisions like you do. Give it ten examples of how you actually work through a decision and something qualitatively different happens. The model starts exercising judgment, not just retrieving facts.
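The difference shows up in the shape of the data. A document store holds outcomes; a process record holds the ordered steps that led to one. The sketch below is one hypothetical way to represent such a trace and render a handful as in-context examples. `DecisionTrace` and `render_traces` are illustrative names, and the code only formats the examples. It does not demonstrate that they change model behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecisionTrace:
    situation: str
    steps: List[str]  # checks in the order they were actually performed
    decision: str


def render_traces(traces: List[DecisionTrace], limit: int = 10) -> str:
    """Format up to `limit` traces as worked examples for a prompt."""
    blocks = []
    for trace in traces[:limit]:
        numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(trace.steps, start=1))
        blocks.append(f"Situation: {trace.situation}\n{numbered}\nDecision: {trace.decision}")
    return "\n\n".join(blocks)


triage = DecisionTrace(
    situation="Ticket: login fails for one user",
    steps=["Check status page", "Reproduce with test account", "Search recent deploys"],
    decision="Escalate to auth team",
)
examples = render_traces([triage])
```

The ordering of `steps` is the point: it records what was checked first and what was deferred, which a summary of the final decision would lose.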

In-context learning and auto-memory are first steps — crude ones. The gap between what process-context capture could deliver and what today’s systems actually capture is where the most value is hiding.

One pattern dominates as harnesses evolve: alignment gets more expensive at every step.

In chat, alignment is free. You’re right there, steering every response. In collaboration, alignment costs real engineering — memory, skills, rules, context engineering, all designed to make the model’s autonomous decisions good enough to trust. In management, alignment becomes the entire game. A system running for hours, making hundreds of decisions, needs to have internalized enough of your intent to act well on your behalf.

The harness carries that intent.

This is why the harness matters more than the interface. Interfaces are what you see. Harnesses are what the system is. A beautiful management interface on top of a chat-paradigm harness is a dashboard for a system that can’t actually do the work.

But even a perfect harness doesn’t close the gap alone. Everything above is supply-side — better systems, more capability unlocked. The demand side is harder and less tractable.

People don’t just need better tools. They need to want to delegate. Trust, liability, skill atrophy, professional identity — these are real barriers that no harness will dissolve. Mica Endsley’s research on the “out-of-the-loop performance problem” frames the tension precisely:¹⁰ as automation increases and system reliability improves, human operators’ situational awareness decreases, making them less able to intervene when the system fails. The management paradigm accepts this tradeoff deliberately. The question is whether today’s harnesses are reliable enough to justify it.

For most tasks, they aren’t. Which is why collaboration matters most right now — the unsexy middle ground where harnesses are good enough to do real work, but humans are still close enough to catch failures. The present that actually ships value, while the future gets built underneath.

[1] Anthropic Economic Index, January 2026. Link

[2] “How AI is Transforming Work at Anthropic.” Link

[3] OpenAI, “Ending the Capability Overhang,” December 2025. Link

[4] Paul David, “The Dynamo and the Computer,” American Economic Review, 1990. Link

[5] Thomas Hughes, Networks of Power, 1983. Link

[6] The model-as-CPU, harness-as-OS analogy was arrived at independently by multiple sources: Andrej Karpathy Link; Cobus Greyling Link; Hugo Nogueira Link

[7] Jakob Nielsen, “AI: First New UI Paradigm in 60 Years,” Nielsen Norman Group. Link

[8] Thomas Sheridan & William Verplank, Human-Computer Interactive Supervisory Control, MIT Man-Machine Systems Laboratory, 1978. Link

[9] Context engineering was popularized by Tobi Lutke (Shopify) and Andrej Karpathy in mid-2025. Anthropic’s Applied AI team published the most rigorous treatment, distinguishing it from prompt engineering. Link

[10] Mica Endsley, “The Out-of-the-Loop Performance Problem and Level of Control in Automation,” Human Factors, 1995. Link

[11] Multiple sources converge on verification as the force multiplier for autonomous agents: Simon Willison Link; Mitchell Hashimoto Link
