I spent the last 2 years building AI agents. What I learned surprised me: these tools are incredibly powerful for certain things, but when complexity increases, they fall apart in ways that are hard to predict and even harder to fix.
When Everything Looks Right But Nothing Works
I was testing an agent I’d built for an enterprise client. The task seemed straightforward: create a flat file that would trigger a specific BizTalk map. The agent had full access to the schema, the map definition, and all the code it could need.
It generated the file. Everything looked perfect: the structure was right, the data was there, all the required fields were populated. I ran it. Nothing happened.
I fed the error back to the agent. It made adjustments. Still nothing. Three rounds of back and forth, and we were nowhere. The agent kept focusing on the data structure, tweaking field names, adjusting formats. All reasonable changes, but none of them addressed the actual problem.
Finally, I looked at the file myself. The issue jumped out immediately: the schema behind the BizTalk map left elementFormDefault undefined, which meant every local data element needed xmlns="" in the XML. The agent had the schema file open (traced using Langfuse). It had been looking at it the whole time. But it never saw the problem.
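For illustration (the schema and element names here are invented, not from the actual project): when elementFormDefault is omitted, it defaults to "unqualified", so locally declared elements live in no namespace, and any instance document that uses a default namespace has to opt each of them back out with xmlns="":

```xml
<!-- Schema sketch: elementFormDefault omitted, so it defaults to "unqualified" -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com/flatfile"
           xmlns="http://example.com/flatfile">
  <xs:element name="Order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="OrderId" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

<!-- Matching instance: the root is in the target namespace, but every
     local element must reset the default namespace with xmlns="" -->
<Order xmlns="http://example.com/flatfile">
  <OrderId xmlns="">12345</OrderId>
</Order>
```

Miss that one attribute and the file looks structurally perfect while the map silently refuses to fire.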
When I asked why it missed this, the agent gave me an answer I’d never heard before: “mental jumping.” I still don’t know what that means, but it perfectly captures what was happening: the agent was jumping between possibilities without actually seeing what was in front of it.
This wasn’t the only time. Another case: a clear type mismatch error. The kind of thing you’d spot in two seconds if you were reading the code.
The Rocket That Wouldn’t Launch
I work as an AI engineer, building complex agents as my day job. On the side, I’ve been developing lynecode, my own AI coding assistant. I open sourced it specifically so I could test these systems in the wild and see where they break.
One test case seemed simple enough: build a basic rocket simulator that starts on Earth and flies to the moon. The agent (using Gemini 2.5 Pro) built it on the first try. Great! Except the rocket looked terrible: just a basic shape on the screen.
“Make it look better,” I said.
The agent updated the code. Now the rocket wouldn’t launch at all. Just sat there on the launch pad. I asked it to fix the error. It tried several approaches, adjusting physics calculations, tweaking animation timings, modifying state management. Nothing worked.
I looked at the code myself. There was a missing brace, a single character. That was it.
Here’s the thing: this agent had a linting tool. It had Semgrep checking the code. These tools exist specifically to catch syntax errors like missing braces. They were running. They were reporting issues. But the agent couldn’t connect the dots between the error reports and the problem it was trying to solve.
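There’s a cheap guard that would have caught this class of failure: run a pure syntax check first, and refuse to let the agent iterate on logic until it passes. A minimal sketch in Python (the agent loop itself is assumed, not shown):

```python
import ast
from typing import Optional

def syntax_gate(source: str) -> Optional[str]:
    """Return a parse-error message, or None if the source is syntactically valid.

    Run this before any logic-level debugging so that a missing brace or
    paren surfaces as the first and only issue the agent sees.
    """
    try:
        ast.parse(source)
        return None
    except SyntaxError as e:
        return f"syntax error at line {e.lineno}: {e.msg}"

# A deliberately malformed snippet: the gate reports the parse error
# instead of letting an agent chase phantom physics bugs.
print(syntax_gate("def launch(:\n    pass\n"))
```

The point isn’t the three lines of Python; it’s the ordering. Syntax first, semantics second, so the agent never gets the chance to “fix” physics that was never broken.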
Why This Keeps Happening
After months of watching agents fail in these ways, I started seeing patterns. Research shows that weak verification mechanisms contribute significantly to multi-agent system failures: when agents can’t properly verify their work, small errors cascade through the entire workflow.
Studies have found that roughly 32% of agent failures come from not understanding their assigned tasks, while another 28% fail due to coordination problems. That’s 60% of failures before you even get to the actual coding work.
But there’s something deeper going on. Research examining incorrect code from models like GPT-4 and CodeGen found that models consistently make non-syntactic mistakes: errors that don’t cause crashes but produce wrong outputs. The code runs. It just doesn’t do what you asked for.
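A toy example of the category (invented for illustration, not from any of the papers): this runs without any error, passes a linter, and is simply wrong.

```python
def average(xs):
    # Non-syntactic bug: divides by len(xs) - 1 instead of len(xs).
    # Nothing crashes; the function just returns the wrong number.
    return sum(xs) / (len(xs) - 1)

print(average([2, 4, 6]))  # prints 6.0, but the true mean is 4.0
```

No tool in a standard pipeline flags this. Only something that knows what the function was *supposed* to compute can catch it.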
Developers report that models get stuck in repetitive loops, repeatedly trying the same fix when it doesn’t work, a behavior researchers call “pattern locking”. Once an LLM decides it’s seen a particular kind of bug before, it keeps proposing the same solution. There’s no moment where it thinks, “wait, maybe I’m wrong about this.”
The Context Problem Nobody Talks About
Here’s what makes this especially frustrating: these agents often have everything they need. The schema files are there. The error messages are clear. The linting tools are running. But research on agent performance shows that flooding models with information actively degrades their reasoning: they have to sift through noise while trying to solve the original problem.
When models process large amounts of data in fragmented windows, they miss critical connections between different parts of the information, leading to gaps in reasoning. It’s not that they can’t see the missing brace or the type mismatch. It’s that they’re looking at so much context that they can’t figure out what matters.
One developer testing AI coding assistants found that providing exhaustive context about numerous parameters resulted in unclear and overly complicated code, while limiting scope to just a few clearly outlined examples made the pattern immediately clear to the model.
What Actually Works (And What Doesn’t)
People will tell you that LLM hallucinations are the big problem: the model invents methods that don’t exist, creates fake libraries, confidently states things that aren’t true. But hallucinated methods get caught immediately when you run the code, making them one of the least harmful types of errors. The real danger is code that compiles fine but silently does the wrong thing.
After working with these systems every day, I’ve found a few things that help:
Break everything down into tiny pieces. Don’t ask for a whole feature. Ask for one small function. Test it. Then ask for the next piece. The agent that tried to fix my rocket in one shot failed. When I asked it to focus solely on the launch sequence, ignoring everything else, it found the missing brace.
Make the agent explain before coding. Before writing any code, have it describe what it’s going to do and why. This catches a lot of misunderstandings early. It’s like rubber duck debugging, except the duck talks back.
Don’t trust the first solution. Even if it works. Especially if it works quickly. Developers at companies building AI agents report that while these tools are great at brainstorming and handling repetition, there’s no replacement for human intuition when something really matters. Run the code. Read it carefully. Look for the things that seem too easy.
Set up hard boundaries. If an agent has tried the same fix twice, stop it. Make it check different files. Make it look at different assumptions. This prevents the pattern locking behavior where agents repeatedly apply fixes that didn’t work the first time.
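That last rule is easy to mechanize. A hypothetical sketch (the harness around it is assumed): hash each patch the agent proposes and refuse the loop the moment an identical patch comes back around.

```python
import hashlib

class RepeatGuard:
    """Blocks an agent loop once it re-proposes a patch it already tried.

    Hypothetical sketch: `patch` is whatever diff or replacement text
    the agent emits on each repair attempt.
    """

    def __init__(self, max_attempts_per_patch: int = 1):
        self.max_attempts = max_attempts_per_patch
        self.counts: dict = {}

    def allow(self, patch: str) -> bool:
        # Normalize whitespace so a trivially reworded patch still matches.
        key = hashlib.sha256(patch.strip().encode()).hexdigest()
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.max_attempts
```

When allow returns False, the harness can force a different move: open another file, re-read the error from scratch, or hand control back to a human, instead of letting the agent grind on the same dead end.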
Where We Actually Are
The tools are improving. Models released this year are noticeably better than what we had six months ago. But we’re nowhere near the “just describe what you want and the AI builds it” future that gets promised in demos and tweets.
Analysis of code generation across multiple models found that logical errors, incomplete code, and misunderstood context accounted for most failures, even in advanced models. These aren’t edge cases. They’re the normal operating mode.
The most successful approach I’ve found is treating LLMs as very knowledgeable junior developers who can’t quite see the full picture. Give them small, specific tasks. Check their work carefully. Don’t let them make architectural decisions. Don’t trust them with anything that touches production without thorough review.
And definitely don’t assume that because an agent has all the files open and all the tools running, it actually knows what’s wrong with your code.
MY TAKE
We’re in a weird transition period. LLMs are powerful enough that you’d be foolish not to use them, but not reliable enough to trust them with anything complex. This creates a strange workflow where you’re constantly switching between “let the AI do this” and “I need to check every line.”
The real bottleneck isn’t the model’s knowledge or its ability to write code. It’s the gap between having information and actually knowing what to do with it. An agent can have your entire codebase in context and still miss a missing brace. It can have linting errors spelled out clearly and not understand what they mean. It can make the same mistake five times in a row because it’s convinced it’s seen this pattern before.
I think we’re going to be in this in-between state for a while. The tools will get better at narrow tasks. But complex, multistep problems that require understanding how different systems interact? Where you need to hold several levels of context in your head at once? Where the solution isn’t obvious from the error message?
Those still need human developers. Not because humans are magically better at everything, but because we have intuition about what matters and what doesn’t. We can look at a thousand lines of code and immediately focus on the three lines that matter. We can see a missing brace and know it’s the problem, rather than trying seventeen different physics calculations first.
The hype says AI will replace programmers. The reality is that it’s changed programming into a different kind of work: more orchestration, more verification, more cleanup of almost-right solutions. You still need to understand the code. You still need to debug the weird failures. You’re just doing it faster and with a very confident assistant.
That’s not bad. But it’s not revolutionary either. It’s just different.