711 work results in one night, 180 euros a month, zero programmers awake. A documented practical test - with unanswered questions.
No reporter accompanied this test. There was nothing to observe: just a person sleeping and a machine working. The evidence is screenshots and invoices, not impressions. This is perhaps the most inconspicuous change this AI technology brings with it: not only is the work shifting - the stories about it are also becoming harder to tell.
February 2026, Central Hesse, Germany
On Friday evening, after a normal working week, Christopher Helm opens a computer set up especially for the test - isolated from the company network, as the ISO 27001 officer had insisted despite the extensive AI model system card and sabotage risk report - and hands full control over to an AI system that is the first in its series to offer adaptive thinking.
For the next nine hours, without supervision. For this model, thinking is optional: it evaluates the complexity of each request and decides whether, and how deeply, it should think.
Somewhere between the blast furnaces of Buderus in Wetzlar, the switch cabinets of Rittal in Herborn and the now sold heat pumps of Viessmann in Allendorf - and yet connected to data centers, for whose energy requirements Microsoft is reactivating Three Mile Island, Google is commissioning three new nuclear power plants and Amazon is planning mini-reactors next to its server farms. 58.7 million text modules will flow through this infrastructure that night, triggered by a laptop in a small town.
What happens when you try it without structure is shown by Peter Steinberger, founder of PSPDFKit (today Nutrient), who came out of retirement "to mess with AI", as his GitHub bio puts it. His Moltbot project went viral: a network of autonomous bots that communicate via WhatsApp, Telegram and Signal - social media where only bots post. 134,000 people saw his post on X, in which he himself warned: "This is a bad idea and you should not do it." Steinberger has made his money - PSPDFKit went to Insight Partners for $116 million, followed by a sabbatical. Now, on February 11, he is traveling to the Pragmatic Summit in San Francisco to talk about vibe coding. Whoever has their exit gets to play. Moltbot is reminiscent of ICQ: lots of onlookers, little substance. More metaverse than infrastructure.
Despite higher costs than GPT 5-3, Helm confirms what users on Reddit have reported: Claude is more sensitive, especially for tasks that involve not software code but documents, presentations or contracts.
Helm builds the opposite: a separate control program - an intermediate layer that organizes a virtual team like a department: roles defined, handover points specified, quality criteria formulated, input and output formats structured between the agents. Then a sentence:
„Hi, I have to leave soon. What do you need from me so you can work on your own?“
At the other end: Claude Opus 4.6, which Anthropic published on February 5 at 6:51 p.m. German time - rank 1 on Terminal-Bench 2.0 for autonomous coding, rank 1 on Humanity's Last Exam for interdisciplinary reasoning. The answer comes immediately - not as text, but as a query:
„I need these permissions to work autonomously for the next 9 hours. Can I have all of them?“

Helm confirms. Twenty minutes later he takes another quick look - "Hi, just walked past the office, do you need anything?" - and goes to sleep.
When he wakes up the next morning, he has 711 work results. A complete working day - without any work.

What the figures say

The status display: over eight hours of runtime, 11 percent of the weekly usage limit for the supervisor used up. The API dashboard shows the team's work: 58.7 million incoming and 5 million outgoing text modules (so-called tokens) via the cheaper Sonnet 4 models, plus 656 automated web searches. Opus 4.6 does not appear in the API diagram - it ran as a supervisor via Anthropic's command line tool Claude Code on a separate contingent. The invoices: Helm was using the Claude Max plan with five times the quota for 90 euros a month until the day before. For the test, he switched to the twenty-fold quota at 180 euros per month - the pro rata difference for the current billing period amounted to 93.92 euros.
By comparison, a comparable unit of work would have cost around 25 euros on microtask platforms a year ago. However, there is a decisive disadvantage: if you assign 711 jobs there, you have to check 711 results manually afterwards. In Helm's architecture, this is done by the intermediate layer itself - it automatically validates the agents' work based on defined criteria. Opus, in turn, monitors this control layer. The result is two-stage quality assurance without a human inspector. 711 times 25 equals almost 17,800 euros - in one night. Extrapolated to a month with 20 working nights, that would be 356,000 euros in Microtask equivalent. In a year: over four million. The machine that was responsible for this costs 180 euros a month. Even if you add in the running costs for the Sonnet models - at Anthropic's current prices, 58 million processed text modules are in the low three-digit euro range - the ratio is still beyond any cost calculation.
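The back-of-envelope arithmetic above can be reproduced directly; the per-unit rate of 25 euros and the 20 working nights per month are the article's own assumptions, not market data:

```python
# Back-of-envelope check of the microtask comparison.
# Assumptions taken from the text: 711 units per night,
# 25 euros per microtask unit, 20 working nights per month.
UNITS_PER_NIGHT = 711
EUR_PER_UNIT = 25
NIGHTS_PER_MONTH = 20
SUBSCRIPTION_EUR_PER_MONTH = 180

night_value = UNITS_PER_NIGHT * EUR_PER_UNIT      # 17,775 EUR per night
month_value = night_value * NIGHTS_PER_MONTH      # 355,500 EUR per month
year_value = month_value * 12                     # 4,266,000 EUR per year
ratio = month_value / SUBSCRIPTION_EUR_PER_MONTH  # roughly 1,975 to 1

print(night_value, month_value, year_value, round(ratio))
```

Even before any discounts, the monthly microtask equivalent exceeds the subscription price by more than three orders of magnitude.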
Of course, the comparison is flawed. Not every night delivers 711 units, not every unit corresponds exactly to what a microtask service provider produces in an hour, and the preparatory work for the system is not included in the calculation. But even with generous discounts, there is still an order of magnitude that is difficult to ignore.
Opus thinks, Sonnet works - but who organizes the team?
Despite the new Agent Teams function of Opus 4.6 and correct configuration in the Claude Code settings, there was a problem with the implementation.
How do you write a prompt that orchestrates an entire department?
Defining checks and balances for non-binary, knowledge-heavy decision-making situations is complex - whether in the Constitution of the United States of 1787 or in Claude's Constitution of January 22, 2026.
Checks and balances
Each power (legislative, executive and judicial) must also have the means to defend its own function and thus ensure that the overall political system remains balanced.
The principle has become a basic tenet of the democratic exercise of power as a whole. Wherever political potential arises and power is exercised, there must be an opportunity to form and exercise countervailing power. The principle thus aims to balance different interests, supports the tendency towards equilibrium and promotes cooperation between different poles of power, thus ensuring socio-political stability.
Federal Agency for Civic Education
How do you write prompts for the Claude agent team?
Most instructions in companies - the procedure for due diligence, for example - define guidelines for implementation. There are rarely instructions, however, on how a head of department is to check the output of an individual employee: a task that goes well beyond merely summarizing legal documents.
How does a department head check their employees, and how risk-averse should they be in deciding how highly qualified an employee must be to work on a task of a given complexity? How do you define a goal that does not lead to unwanted behavior (Goodhart's law)? See also "LLM Leaderboards are Bullshit - Goodhart's Law Strikes Again" on r/LocalLLaMA.
Transferred to the use case: which AI model (teammate) receives which task - and which it may not receive.
How do you structure AI teams?
How does the supervisor of the department heads (the chief) check whether (1) all department heads know their own expertise, (2) have researched the current business context, and (3) have reconciled possible differences in the understanding of (1) and (2) at department-head level before assigning tasks to their teammates?
Opus 4.6 was unfortunately unable to solve this problem. As a result, Helm used a trick that he didn't want to reveal to me. Just this much: the final AI team architecture can evaluate itself, has three levels just like Claude, and would take over the work of 30 person-days in a typical consulting project in the DACH region.
The AI chief supervises two AI department heads who have to report to him in a formally defined way. One department head continuously revises the knowledge of the entire organization. The other works on customer projects. Employees of the department heads only receive the context they need for their tasks. The chief and the department heads use a project triangle for AI projects to coordinate which task requires junior and which senior qualities. The AI chief optimizes the margin of the projects, and the department heads have the task of avoiding confusion among employees (see examples of AI hallucinations).
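Helm's actual control program is not public. A minimal sketch of the hierarchy he describes - one chief, two department heads, workers that only see task-scoped context - might look like this; all class names, field names and the selection logic are illustrative assumptions, not his code:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """One AI worker; model and context are scoped per role."""
    name: str
    model: str  # e.g. "sonnet" for workers, "opus" for the chief
    context: dict = field(default_factory=dict)  # only what this role needs

@dataclass
class Department:
    head: Agent
    workers: list  # ordered from junior to senior (illustrative convention)

    def assign(self, task: str, seniority: str) -> Agent:
        # The "project triangle" decides junior vs. senior per task;
        # the real criteria are not public, so this is a stand-in.
        return self.workers[0] if seniority == "junior" else self.workers[-1]

# Chief (Opus) supervises two departments: knowledge upkeep, client projects.
chief = Agent("chief", model="opus")
knowledge = Department(Agent("head-knowledge", "sonnet"),
                       [Agent("jr-k", "sonnet"), Agent("sr-k", "sonnet")])
projects = Department(Agent("head-projects", "sonnet"),
                      [Agent("jr-p", "sonnet"), Agent("sr-p", "sonnet")])

worker = projects.assign("summarize contract clauses", seniority="junior")
print(worker.name)  # jr-p
```

The design point is that context scoping happens at assignment time: a worker never inherits the whole organization's knowledge, only the slice its department head hands over.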

The principle: small, inexpensive work packages are created at the bottom - raw material. This raw material moves up the decision-making hierarchy: it is combined, enriched and validated. Each stage makes the result more complex and more valuable. The 17,800 euros are not generated by one large order, but by hundreds of small units that are refined like raw materials. Some results do not make it into the end product despite processing - because the quality control sorts them out. The result is literally a combine - a merger of companies that work closely together in terms of production under centralized control - with a parent company (Opus), a regulated hierarchy and planning requirements. The only difference is that the general manager wakes up in the morning and the financial reports are already there.
In concrete terms, this means that a research agent searches through sources, evaluates each one for relevance and extracts usable facts. A processing agent takes these facts and integrates them into existing documents - with references, without overwriting existing content. A reviewer validates whether the transfer between the stages meets the defined criteria. Each agent works in its own session with its own context window; the results are transferred via structured file formats, not via shared memory. Errors are therefore detected by the supervisor at the interfaces, not in the end product.
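The handover-by-file principle can be sketched as a validation step between stages. The schema fields here (source, facts, relevance) are invented for illustration; the article only states that handovers use structured file formats and are checked against defined criteria:

```python
import json

# Minimal sketch: each agent writes its result as JSON to a handover file;
# a reviewer validates the structure before the next stage may read it.
# Field names and thresholds are illustrative assumptions.
REQUIRED_FIELDS = {"source", "facts", "relevance"}

def validate_handover(raw: str) -> bool:
    """Structural check only: fields and types, not content quality."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (REQUIRED_FIELDS <= record.keys()
            and isinstance(record["facts"], list)
            and 0.0 <= record["relevance"] <= 1.0)

ok = validate_handover('{"source": "https://example.org", '
                       '"facts": ["fact A"], "relevance": 0.8}')
bad = validate_handover('{"source": "x"}')  # missing fields, rejected
print(ok, bad)
```

This is why errors surface at the interfaces rather than in the end product: a record that fails the structural check never reaches the next agent's context.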
The crucial point: Opus 4.6 not only controls the team - it can also change the team. The supervisor can adapt instructions, restructure workflows and create new task types. „When I come back in the morning, the system is not only ready, but potentially smarter than it was the night before - even if I operate it myself the next time.“
However, this also means that the system that was started in the evening is no longer necessarily running in the morning. Opus may have rewritten work instructions, adapted test criteria or introduced new task types - without consulting the supervisor. Structural quality control still applies, but it checks against rules that the supervisor may have changed themselves. Helm takes a sober view: „That's the price for allowing the system to improve. If you don't want that, you take the self-optimization out - and you have a rigid system that is just as stupid after eight hours as it was at the beginning.“ It was a conscious decision, he says, and he checked in the morning to see what changes Opus had made. But the possibility of an autonomous system overriding its own rules is a risk that grows with running time.
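Checking in the morning what the supervisor changed overnight amounts to snapshotting the instruction files before the run and diffing them afterwards. A minimal sketch of that idea - the file name and contents are invented, and the article does not say how Helm actually performed his morning check:

```python
import difflib
import tempfile
from pathlib import Path

# Sketch: snapshot the supervisor's instruction files before the night run,
# then diff them in the morning to see what Opus rewrote.

def snapshot(paths):
    """Record file contents before handing over control."""
    return {p: Path(p).read_text() for p in paths}

def overnight_changes(before):
    """Unified diff per file that changed since the snapshot."""
    changes = {}
    for path, old in before.items():
        new = Path(path).read_text()
        if new != old:
            changes[path] = list(difflib.unified_diff(
                old.splitlines(), new.splitlines(), lineterm=""))
    return changes

# Demo with a temporary "instruction file":
with tempfile.TemporaryDirectory() as d:
    rules = Path(d) / "review_criteria.md"
    rules.write_text("Reject results without sources.")
    before = snapshot([str(rules)])
    rules.write_text("Reject results without at least two sources.")  # simulated edit
    diff = overnight_changes(before)
    print(len(diff))  # the changed file shows up
```
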
Benchmark reports vs. production operations
The previous reporting on Opus 4.6 - from Heise via Tom's Guide to the DEV Community - tests the model in chat. What's missing: a documented case in which Opus 4.6 produces autonomously for a full working day, with verifiable costs and an architecture that goes beyond "I asked Claude a question".
Helm's architecture measures something that no benchmark does: whether a team delivers a unit of work. Benchmarks are losing their discriminative power anyway - Artificial Analysis has removed three of them from its index because all frontier models achieve over 90 percent. Models are also smarter than test scores suggest thanks to runtime reasoning: OpenAI's o1 jumped from 9.3 to 74.4 percent on a math olympiad without any change in training. For decisions without a clear yes or no - relevance check, prioritization, context assignment - there is no unit test. What was previously considered intangible is at least partially automated and documented in the combine.
The ratio of 58.7 million team tokens to 11 percent supervisor quota reflects this hierarchy: Opus thinks and decides, Sonnet works. Helm has deliberately minimized the consumption of the expensive model - API price for Opus 4.6: 5 dollars per million input tokens, 25 dollars output; Sonnet-4 costs a fifth of that. The architecture also solves a practical problem: Opus 4.6 is throttled by the provider during intensive use; other users report interruptions after 30 minutes. Helm's solution: only use Opus as a supervisor, not as a workhorse. „My session ran for an hour and a half under Opus before a ten-minute break. But in the meantime, Sonnet had carried on working.“
Quality without supervision - and its limits
A legitimate question: anyone who has 711 work results produced overnight cannot have checked every single one in detail. Helm acknowledges this: „I can't say that I checked every unit line by line. What I can check is the structure: Have the agents communicated correctly? Are the handovers correct? Have the defined quality criteria been met?“ This structural check did not reveal any errors - no obviously invented facts, no failed handovers, no agents who got caught in loops.
But that is not the same as a content release. „Subjectively, I will reject some results. Not because they are faulty, but because they don't meet the standards I set for the end product.“ It is a human decision - and that is precisely where the limit of autonomy lies: the system can produce and check itself for formal correctness. The assessment of whether a result is good enough remains with the human being.
The fact that the structural error rate appears to be zero is not solely due to the model. Helm has structured the communication between the agents and made it validatable so that errors in the handover are visible at an early stage. „So at least I know whether the team is communicating correctly in line with my business requirements.“ A characteristic that - as he notes - sometimes has to be worked hard for in human teams: that knowledge is shared. Here it was contrived, but it worked.
This is also the reason why Helm's approach differs from what Anthropic itself advertises as Opus 4.6's headline feature: "Agent Teams" - parallel agents that coordinate autonomously, built directly into Claude Code. Anthropic compares it to a "talented team of humans". It's tempting to switch the feature on and let it run. But those who use Agent Teams without their own control layer get three things: higher costs, because everything runs via the expensive Opus model - the same night via the API instead of a flat rate would cost around 420 dollars per run, over 8,000 dollars a month at 20 working nights instead of 180 euros; no documented coordination between the agents, because individual actors do not log the basis of their decisions; and in the end no AI explainability - the result is a kind of Opus Machinae, where no one can understand how it came about.
If, on the other hand, you model the basis of the team yourself, you can watch how it works - helicopter boss, without anyone getting angry. The documentation of the individual tasks gains value through combination: each stage adds context, each handover is logged. In the end, the combination is comprehensible - not because the result is self-explanatory, but because the path to it is documented.
The "And then?" trick
One special feature: the last command in each task chain was a variant of "Find the next task". Not a sophisticated system, but a deliberately simple principle - comparable to a child who keeps asking: "And then? And then what?" Technically, this is the same mechanism that made Steinberger's Moltbot viral: autonomous agents that nudge each other without human intervention. The difference lies not in the principle but in everything around it that is missing. Moltbot lets bots talk to each other on WhatsApp and Signal - without defined roles, without handover formats, without quality control. The result is entertaining and uncontrollable. In Helm's architecture, the same endless loop runs, but within the combine: every agent has a planned task, every handover is structured, every result is checked against defined criteria before it reaches the next stage.
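Stripped down, the "And then?" principle is a loop that asks for the next task after every completed unit and lets only validated results into the pipeline. Everything in this sketch - the stubbed task queue, the function names, the acceptance criterion - is a schematic assumption standing in for real LLM calls, not Helm's code:

```python
from collections import deque

# Schematic "And then?" loop: after each unit, ask for the next task;
# only results that pass the reviewer's structural check are accepted.
# The task queue stands in for an LLM deciding what to do next.
task_queue = deque(["summarize source 17", "extract pricing facts"])

def next_task() -> str:
    """'And then?' - ask for the next task; empty string ends the chain."""
    return task_queue.popleft() if task_queue else ""

def work_on(task: str) -> str:
    # Stub for a worker-agent call.
    return f"result for: {task}"

def meets_criteria(result: str) -> bool:
    # Stand-in for the reviewer stage (structural check only).
    return result.startswith("result for:")

accepted = []
while (task := next_task()):
    result = work_on(task)
    if meets_criteria(result):
        accepted.append(result)

print(len(accepted))  # 2
```

The night run differed from this sketch in one crucial respect: the chain was open-ended rather than bounded, which is exactly the experiment Helm describes.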
"I didn't know where the model would end up after eight hours. But that was exactly the experiment: can you give an AI model an open task chain and trust it to make sensible decisions?" The answer depends on whether you give the model a structure within which it is allowed to make decisions - or whether you simply let it run.
What has changed - and what hasn't
The quality of individual results was already good with the previous models, says Helm. The difference is the stability over time. "With Sonnet, I originally had the problem that the work results were plentiful rather than concise and usable." Now he can use Sonnet, but under the supervision of Opus. The real improvement: "The model remains stable for hours. With previous versions, you would have found a large amount of unusable, repetitive results the next morning." Reliability over a full working day, without loops, without invented facts - that's new.
Ten years of preparation for a working day without work
At the 2015 Bankathon, four students sat awake for 37 hours to build a prototype that won a special prize. In 2026, one of them sleeps while the machine produces. The difference is not just the technology.
„At one level, the preparation took ten years,“ says Helm. „I've been working in this field for a decade. I know what's important. This experience is in the way I structure the tasks - not in the model.“ In addition, there was a day of infrastructure development and the preparatory work to define work units that can be evaluated uniformly and measured on an ongoing basis so that every result is comparable - regardless of which agent produced it.
„It's a paradox. A working day without work sounds like automation. But in order for me to do nothing for a working day, I had to understand beforehand exactly what needed to be done, in what order, at what granularity and with what quality criteria. That's not a task for an AI model. It's a task for someone who knows their subject.“
No contradiction: the critical side
Anyone who now suspects euphoria should know the other side. On the same day that Helm launched his test, he published two analyses on his blog that convey the exact opposite of enthusiasm.
In "Opus, opium of the people? - Free from competence at last" he calculates what happens when anyone can do the work of a team for 20 dollars a month: displacement of entry-level positions, devaluation of expertise, a labor market that can no longer build experience because learning positions disappear. In "Opus Machinae - the human, the all-too-human" he draws parallels between the voluntary submission to AI-generated results and the mortification practices of Opus Dei - drawing on Nietzsche, who wrote in Human, All Too Human in 1878: "Beliefs are more dangerous enemies of the truth than lies."
He sees no contradiction in this - rather the obligation to say both at the same time.
What this means for banks and insurance companies
The question raised by the test is particularly relevant for the financial sector.
In 2015, Sparkasse Paderborn-Detmold sent a team to the Bankathon - a tentative attempt, as Jonas Jansen wrote at the time. „If you call Bafin as a start-up today, you end up with the gatekeeper and that's it,“ said Figo founder André Bajorat. Back then, the financial sector needed the Bankathon to get in touch with developers at all. The Bankathon article „A night with the nerds“ by Jonas Jansen was published on November 22, 2015 in the Frankfurter Allgemeine Sonntagszeitung (archived at the FIM Core Competence Center, University of Augsburg).
Ten years later, a single person with domain knowledge can set up an autonomous AI team that produces hundreds of structured work units overnight - for less than 200 euros a month. Document processing, compliance checks, customer analysis, regulatory research: the areas in which banks and insurers are already experimenting with AI today are precisely the type of structured, rule-based but high-volume tasks that suit this approach. The EUR 25 per microtask unit from the calculation above is a conservative estimate - in regulated industries with MaRisk requirements, DORA compliance and the dual-control principle, the real costs per comparable work unit are significantly higher. The actual ratio of EUR 180 per month to industry costs is therefore even more drastic than the EUR 17,800 calculation suggests.
This test shows that the technology is capable of doing this. It remains to be seen who in the specialist departments has the necessary experience to put it to good use.
Daniel Kehne, the other student from the factory hall in Offenbach, uses technology to tell refugees in their own language where they need to register and what rights they have. Not an AI agent, not an opus, not an autonomous night run - but a 40-strong team working three days a week for people who would otherwise have no voice. „It's a privilege to have the time and resources to stand up for the vulnerable in our society,“ Kehne wrote on the Integreat website.
The 2015 Bankathon required 63 nerds in a factory hall. In 2026, one person is enough to instruct the machine and go to sleep. Who benefits from that is another question.
Christopher Helm is Managing Director of Helm & Nagel GmbH, an AI provider in the B2B sector for cognitive automation, and operator of the Konfuzio IDP for intelligent document processing. The test described here took place in the night from February 6 to 7, 2026 - around 30 hours after the release of Claude Opus 4.6 on February 5 at 18:51 CET.
All the figures mentioned are documented by screenshots of the Anthropic dashboard, token logs and invoices. Written by Claude Opus 4.6 based on an interview with Christopher Helm, Managing Director of Helm & Nagel GmbH. Guest article.