Pull Requests Are a Poor Fit For Agentic AI



2025-07-29

Generative AI-assisted software engineering is here to stay, with all the consequences that brings. But the form it takes and the workflows we use to wield it? We are still in the spaghetti stage of product development, throwing everything at the wall to see what sticks.

One workflow product teams love is AI showing up with ready-to-go pull requests (PRs). PR comments prompt the AI to update the solution, and the developer’s only responsibility is code review. That’s their dream.

Note

I’m not anti-AI. I use it for auto-complete, for research, and for agentic coding with Claude Code and Junie. The terminal is home for me, and I most often interact via Simon Willison’s llm CLI tool. Like everyone else, I’m learning what it’s good for and where it falls down.

Are Pull Requests a Good Fit?

Even assuming agent-created PRs eventually get Good Now Actually (they are not good yet, see the appendix), PR-centric workflows stack the deck against AI. This is for two reasons - context and a broken feedback loop. I won’t dwell on the former because it’s the easier problem to solve; the broken feedback loop is the bigger weakness.

Problems with Feedback Loops

Even if agents create PRs that are 80% good, for some measure of good, there’s still a social human brittleness around code reviews.

When coding with an AI agent - Claude, Cursor, Windsurf - the human in the loop has a resident mental model of the system and is positioned to guide the agent and correct errors. Any time the agent gets off track, the human can step in, redirect, or trash the whole session and reset. Crucially, the human must be following along to make this work.

But put that same human in the role of reviewer, and the feedback quality plummets. Humans are bad at code reviews. Hilariously bad. Really bad mistakes are missed all the time because humans generally don’t take the time to do a thorough review, and a thorough review takes time.

But humans approving an at-first-glance plausible solution is just one failure mode.

When a human PR author gets questioned by a human PR reviewer, it’s frequently because the reviewer doesn’t understand important context. The author explains the context and they move on.

The AI in a PR workflow isn’t set up to do that. You’ve seen this before: “Oh, you are so right, what a powerful and wise developer you are for catching my mistake. Let me change that,” and then it proceeds to generate a production outage. LLM system prompts are tuned toward deference for the exact reason that makes LLMs “creative.” LLMs don’t self-detect hallucinations; they rely on a human-provided feedback loop to keep from going off the rails.

Generative PRs make the innate social human brittleness around PRs load-bearing.

Problems with Context

I won’t dwell on this because it is (maybe) the most fixable issue with agentic PR workflows. As models get more powerful, handle larger context windows, and draw on more sources of context, we might have a prayer of improving the low success rate of current models’ fix suggestions.

The context window for distributed computer systems is huge. Most-complicated-systems-created-by-humankind huge. And that’s just the code readily available in the average monorepo. There’s an iceberg of hidden dependencies, third-party source code, and stale documentation that goes into a production bug fix.

Humans aren’t good computers. Humans who can kind of, sort of reconstruct machine state in their heads while reading code do well in this field. But even the best engineers need observability tools and live execution context to troubleshoot these systems. Static analysis isn’t enough for humans, and it won’t be enough for AI either.

Even if we could feed the entire live CPU state of 100 production servers into our LLM context, good luck getting an LLM to play the scenario forward to validate a fix. Claude 4 still can’t reliably add numbers together; it’s not going to execute the ARM ISA in its “head” anytime soon. That validation will fall to another external tool - solvable, but more sweat equity.
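To make “validation falls to an external tool” concrete, here is a minimal sketch, assuming we have recorded production traffic and a hypothetical handle_request entry point: instead of asking the model to simulate the system, replay the recorded failure scenario against the patched code and let the real runtime give the verdict.

```python
# Hypothetical sketch: validate an agent-proposed fix by replaying recorded
# production requests through the real code, rather than asking an LLM to
# "play the scenario forward" in its head. `myservice.handle_request` and the
# trace file are stand-ins, not a real API.
import json

from myservice import handle_request  # hypothetical module under test


def replay_scenario(trace_path: str) -> list[str]:
    """Re-run recorded requests and return the IDs of any that regress."""
    failures = []
    with open(trace_path) as f:
        for line in f:
            event = json.loads(line)
            actual = handle_request(event["request"])
            if actual != event["expected_response"]:
                failures.append(event["request_id"])
    return failures


if __name__ == "__main__":
    bad = replay_scenario("traces/incident.jsonl")
    print("fix validated" if not bad else f"regressions: {bad}")
```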

Where from here?

It’s not the AI’s fault. The human social brittleness that PR-workflow products are built on is the real weak link. I’m not against AI code tools. But as a product, PR-based AI code tools are at a disadvantage compared to their AI code-companion counterparts.

To make the AI/PR workflow work, it’s going to take a more adversarial AI and a lot of sweat equity in automated verification before humans ever see the PR.
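As a sketch of what “automated verification before humans see the PR” could look like (the commands, file names, and ordering are assumptions, not a product): a gate that bounces the agent’s branch back until it survives the test suite, the linter, and a replay harness like the one sketched above, and only then asks a human to review.

```python
# Hypothetical pre-review gate for agent-created PRs: run the cheap,
# adversarial checks first; only hand the PR to a human when they all pass.
# Commands and paths are illustrative assumptions.
import subprocess
import sys

CHECKS = [
    ["pytest", "-q"],                  # unit and integration tests
    ["ruff", "check", "."],            # lint / static checks
    ["python", "replay_scenario.py"],  # production-trace replay (see sketch above)
]


def run_gate() -> int:
    for cmd in CHECKS:
        if subprocess.run(cmd).returncode != 0:
            print(f"gate failed at '{' '.join(cmd)}' - bouncing the PR back to the agent")
            return 1
    print("all automated checks passed - requesting human review")
    return 0


if __name__ == "__main__":
    sys.exit(run_gate())
```

The point is ordering: the human only becomes the feedback loop after all the machine-checkable feedback has already been applied.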

Appendix A: Background

Two months ago, this Reddit post took a bemused swipe at Microsoft’s Shiny New Agentic Copilot and the time it was wasting for Microsoft developers.

[Screenshot of the “my new hobby” Reddit post]

The most experienced .NET developers in the universe, iteratively coaxing Copilot to cough up working fixes.

It wasn’t going well.

But any pity you may feel for these Principal+ MSFT engineers is deeply misplaced.

We’re always on the lookout for tools to help increase our effficiency [sic]. This has the potential to be a massive one, and we’re taking advantage. That requires understanding the tools and their current and future limits, hence all the experimentation. It is my opinion that anyone not at least thinking about benefiting from such tools will be left behind. – Stephen Toub

Appendix B: AI Maximalist Counterpoints

Quote: “The models are good now, actually.”
Commentary: An evergreen quote dusted off every few months.

Quote: “Don’t believe your lying eyes. Your experience is invalid, there’s a steep learning curve, you’re holding it wrong.”
Commentary: Much agentic, so AI, very religious cult.

Quote: “You may [express your misgivings about agentic PRs], but nobody should accept your opinion as fact.”
Commentary: I was actually told this by a human, but it’s possible an LLM had taken possession of their body.


email comments to paul@bauer.codes

