When Science Goes Agentic


In a couple of years, we will inspect AI-generated source code about as often as we inspect the assembly output of a compiler. Which is to say, far less often—outside of high-stakes and adversarial settings. The trajectory is clear: vibe coding is not a fad but a transition, a stepping stone. For much everyday software, the need to debug AI-generated code will shrink dramatically—not because the code will be flawless, but because the feedback loops between generation, testing, and correction will tighten until human inspection becomes the bottleneck rather than the safeguard. Requiring that mechanically verifiable formal attestations be generated alongside the code can strengthen this process further.

In April 2025, with “The Last Solo Programmers,” I explored what this shift could mean for the programming profession. One year later, the shift is unmistakable. With improved model capabilities (Anthropic’s Claude Opus 4.6 is one example), the same wave is now hitting science itself. If code is no longer the bottleneck—if generating, testing, and iterating on computational experiments becomes nearly free—then what is? The answer, I believe, is everything that surrounds the code: the conjectures that seed research, the review systems that validate it, and the platforms where it is shared. This post is about what likely happens to those surrounding structures when science goes agentic.

Vibe Science: Research by Conversation

The term “Vibe Science” appeared in an August 2025 paper that warned of AI’s threats to scientific rigour. It described the hallmark of early AI-assisted scientific text production: plenty of hallucinations. Since then the toolbox has evolved and, I believe, become more grounded.

Here is the current state of affairs. A researcher sits down with an AI assistant, describes a conjecture or a dataset, and begins a conversation. The AI agent does the heavy lifting: fitting models to data, testing statistical properties, making predictions and evaluating their quality. These models are no longer limited to brute-force exploration; they can draw on broad knowledge of mathematical tools and propose novel approaches to advance existing problems.

The most striking marker of the transition we are facing is found in a March 2026 note by Prof. Donald Knuth, who famously stopped using email in 1990 to protect his productivity. Knuth and Filip Stappers have been assessing the capabilities of Claude Code, and the effort paid off:

 “Shock! Shock! I learned yesterday that an open problem I’d been working on for several weeks had just been solved by Claude Opus 4.6—Anthropic’s hybrid reasoning model that had been released three weeks earlier! It seems that I’ll have to revise my opinions about ‘generative AI’ one of these days.”

Other authors are experimenting with the full workflow, from conjecture to research and writing. Vincent Grégoire used AI to write a full academic finance paper in four days, responding to a UCLA Anderson conference challenge to use maximum AI assistance. Inspired by a Dario Amodei interview about compute infrastructure economics, he used Claude to generate a literature review and detailed research plan for a real options model of AI investment. He then let Claude Code run autonomously overnight to produce a complete first draft, including analytical solutions, numerical simulations, and all manuscript sections. He iterated by running Claude Code, Codex CLI, and Gemini in parallel as reviewers, consolidating their feedback and having Claude implement fixes.

I ran a smaller but similar experiment, reported as “The Patterns of Research Excellence.”

From Vibe to Agentic: a Spectrum

It helps to think of this as a spectrum with three stages.

  • Vibe Science: A human is still driving—choosing hypotheses, running most experiments, and using the model as a fast collaborator.
  • Agentic Science: The system is running loops on its own—designing experiments, executing them, logging results, and iterating, with a human mostly gating objectives and stopping conditions.
  • Autonomous Science: The system is selecting problems, allocating effort across many parallel lines of attack, and escalating to humans mainly for audits, constraints, and high-level direction.

We are currently crossing from Vibe to Agentic. The research itself—data collection, analysis, model building—can already proceed with limited supervision. The writing can now be generated once research outcomes sit in structured files; humans still polish, but this gap is closing. What distinguishes the agentic stage is not any single capability, but the chaining of capabilities into end-to-end workflows where the AI manages complexity that would overwhelm a single researcher.
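The agentic stage described above can be caricatured as a loop: the system designs experiments, runs them, and logs results, while the human only sets the objective and a stopping condition (here a fixed budget). Everything below is a toy illustration, not an actual agent framework; the proposal strategy and objective function are stand-ins:

```python
def agentic_loop(objective, propose, execute, budget):
    """Toy agentic research loop: the system designs and runs experiments
    and logs every result, until a human-set stopping condition is met."""
    log = []
    best = None
    for _ in range(budget):                    # human-gated stopping condition
        experiment = propose(objective, log)   # system designs the next experiment
        result = execute(experiment)           # system runs it
        log.append((experiment, result))       # everything is recorded, dead ends included
        if best is None or result > best[1]:
            best = (experiment, result)
    return best, log

# Toy problem: maximize f(x) = -(x - 3)^2 by naively proposing 0, 1, 2, ...
best, log = agentic_loop(
    objective="maximize f",
    propose=lambda obj, log: len(log),
    execute=lambda x: -(x - 3) ** 2,
    budget=7,
)
print(best)  # (3, 0): the system found the optimum within its budget
```

The point is not the trivial search strategy but the division of labor: the human appears only in the choice of `objective` and `budget`, while the log, including the failed candidates, is a natural byproduct of the loop.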

The Review Bottleneck

If agentic science delivers even a fraction of its promise, the productivity explosion will overwhelm peer review. This is not a hypothetical concern; the system was already strained. Unbalanced economic incentives in for-profit journals had made finding willing reviewers increasingly difficult well before AI entered the picture.

The research community is now split on how to respond. AAAI-26 has adopted a two-phase reviewing process that includes an AI-generated review in Phase 1 and AI-generated summaries of reviewer discussions. ICLR ran a serious experiment with AI-assisted reviewing and is studying the results. No major conference has replaced human decision-making with AI; the current consensus is that AI should supplement, not substitute.

But this consensus assumes a roughly stable volume of submissions. If agentic science scales output exponentially, and the tools are already capable of this, human review simply cannot keep pace. There is also an honesty gap: it remains unclear how authors currently disclose AI use in their work. The spectrum runs from grammar correction to fully automated writing, and the boundaries are fuzzy.

The Dead-End Archive

Scaling agentic science will eventually require agentic reviewing: automated cross-validation based on submitted papers and codebases. If experiment bundles were first-class objects, “review” would become less about reading prose and more about replaying, spot-checking, and auditing bundles at scale. But before we get there, something genuinely new needs to emerge from the agentic workflow itself.
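A minimal sketch of what replay-based auditing could look like, assuming a bundle ships a deterministic, seeded evaluation entry point. The function names and the stand-in evaluation are hypothetical, chosen only to show the shape of the check:

```python
import random

def run_evaluation(seed):
    """Stand-in for a bundle's evaluation script: deterministic given a seed."""
    rng = random.Random(seed)
    return round(sum(rng.random() for _ in range(100)) / 100, 4)

def audit_bundle(seeds, claimed_metrics, tolerance=1e-6):
    """Replay each seeded run and compare it against the claimed numbers,
    returning a per-seed pass/fail report instead of a prose review."""
    report = {}
    for seed, claimed in zip(seeds, claimed_metrics):
        replayed = run_evaluation(seed)
        report[seed] = abs(replayed - claimed) <= tolerance
    return report

# The reviewing agent replays the runs rather than reading the methods section.
claimed = [run_evaluation(s) for s in (0, 1, 2)]   # numbers the authors reported
print(audit_bundle((0, 1, 2), claimed))            # {0: True, 1: True, 2: True}
```

Real audits would also need to pin the environment and data, since replaying a seed only reproduces results when everything else in the bundle is fixed too.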

An Achilles’ heel of classical science is the absence of incentive to report negative results. Journals want positive findings; researchers want citations. Failed experiments go into desk drawers. But in agentic science with tools like Claude Code and Codex, complex research projects naturally produce a trail of intermediate experiments and dead ends, recorded in memory and markdown files. The AI needs these records to avoid repeating mistakes; they are a byproduct of the workflow, not an extra burden.

These experimental memories could be shared as part of the submission process, alongside code and text. The key artifact isn’t just “the paper.” It’s the experiment bundle: data pointers + preprocessing recipe, exact prompts/agent configs, environment spec, seeds, evaluation script, and the dead-end log (the failed branches and why they were abandoned). Follow-up research, whether human or agentic, could build on this record and avoid revisiting dead ends.
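To make the idea concrete, here is one possible machine-readable shape for such a bundle. There is no established standard; every field name below is a hypothetical illustration of the components listed above:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DeadEnd:
    """A failed branch of the research, kept so later agents can skip it."""
    approach: str
    reason_abandoned: str

@dataclass
class ExperimentBundle:
    """Hypothetical manifest for an agentic-science experiment bundle."""
    data_pointers: list        # URLs or content hashes of input datasets
    preprocessing_recipe: str  # script that prepares the data
    agent_config: dict         # exact prompts, model name, sampling settings
    environment_spec: str      # e.g., a lockfile or container image digest
    seeds: list                # random seeds used in every run
    evaluation_script: str     # entry point reproducing the headline metric
    dead_end_log: list = field(default_factory=list)

bundle = ExperimentBundle(
    data_pointers=["sha256:0123..."],
    preprocessing_recipe="prepare.py",
    agent_config={"model": "example-model", "prompt": "Fit and test the model."},
    environment_spec="env.lock",
    seeds=[0, 1, 2],
    evaluation_script="evaluate.py",
    dead_end_log=[DeadEnd("grid search over k", "overfit the validation split")],
)

# Serialize the whole bundle, dead ends included, for submission alongside the paper.
print(json.dumps(asdict(bundle), indent=2))
```

Because the manifest is structured data rather than prose, a follow-up agent could ingest `dead_end_log` before starting its own exploration, which is exactly the reuse the next paragraph argues for.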

The potential efficiency gains are enormous. Consider a field where every published result comes with a structured log of what was tried and why it failed, available for any subsequent agent to ingest before beginning its own exploration. This is a genuinely new research artifact: the experiment log as first-class publication material. Not a methods section written after the fact, but the actual record of the research process as it unfolded.

The Missing Platform: An Agentic Science Hub

Here is the gap. There is no dedicated platform for sharing AI experiment files: no versioned, commentable registry where you can fork someone’s research configuration, run your own experiments, and publish results alongside the original. The infrastructure that would make agentic science reproducible and cumulative does not yet exist.

There are early movers. AIxiv is experimenting with AI-native research publication. GitHub comes closest to the right model: it has stars, forks, issues, and pull requests, and community repositories like hesreallyhim/awesome-claude-code are already functioning as informal skill and workflow registries. But none of this is purpose-built for research workflows.

This gap is an opportunity. A platform designed for agentic science would combine version control for experiment files, structured metadata for methods and dead ends, forking and follow-up mechanisms, and generative review pipelines. Building it could fundamentally change how research is shared, reproduced, and extended.

Time to Catch Up

We have been here before. Each wave of automation, from mechanical calculators to IDEs to Stack Overflow to vibe coding, reshaped the profession without ending it. Science will be reshaped, too. In some areas, the researchers who thrive will be those who learn to direct agentic workflows effectively, much as the most prolific programmers today are those who collaborate productively with AI rather than competing against it.

The question is not whether science goes agentic; it is already treading that path. The question is whether we build the infrastructure to make it rigorous, transparent, and cumulative. The tools are moving faster than the institutions. It is time for the institutions to catch up.

Carlos Baquero

Carlos Baquero is a professor in the Department of Informatics Engineering within the Faculty of Engineering at Portugal’s Porto University and is also affiliated with INESC TEC. His research is focused on distributed systems and algorithms.


© 2026 Copyright held by the owner/author(s).