Can AI Do Intelligence Analysis? Apparently Not.

Structured Analytical Techniques (SAT) are a set of methods used in intelligence analysis to minimize cognitive shortcuts and biases. When working on cases like attribution analysis, geopolitical CTI or forecasting, applying these is a must for accurate analysis.

But as is also well known, applying these techniques requires people who are trained in SAT and who work through these methods as a group. Meaning analysis done this way takes a very long time and is costly. In the private sector, there is rarely enough time to devote to this kind of structured analysis.

To solve this problem, I wondered: could I build an analyst team made up of AI agents? If it worked, I would have an automation that could apply these time consuming, costly structured techniques on its own and produce analysis rather quickly. This post is about the path I took, the things I tried and my thoughts on why it didn’t work.

The idea was a pipeline of specialized agents, each owning a distinct role in the intelligence cycle. From intake all the way to finished reporting. Here's what I designed:

01 - Stakeholder interviewer

The entry point. This agent chats with the customer to capture the intelligence requirement in an unstructured, conversational format. It asks follow-up questions to understand not just what the customer wants to know, but why they need it and what decisions the intelligence will inform. From that conversation, it distills a primary intelligence question and breaks it down into a set of focused sub-questions that can actually drive collection.

02 - Collection planner

Has access to a curated database of trusted information sources, each carrying an assigned trust score that reflects its reliability and accuracy. This agent takes the sub-questions from the interviewer, figures out what kind of data is needed to answer each one and maps them to the best-fit sources available.

It deliberately avoids over-relying on any single source. Diversity of sourcing is a hard requirement. The trust scores it assigns here follow the information all the way through to the final analysis.

Additionally, the collection planner does not pass the intelligence question itself to the collector. It only specifies what to look for and where. The collector must not know what conclusion the analysis is working toward to avoid confirmation bias.

03 - Intelligence collector

Executes the collection plan by looking up the specified sources and retrieving anything potentially relevant. Critically, this agent does not interpret or summarize what it finds. It surfaces information verbatim, exactly as it appears in the source. Its job is to bring raw material to the table, not to answer the question. When the plan's sources come up empty, it can fall back to open-source internet search, but must explicitly flag those results as coming from an unverified source.

04 - Analysis coordinator

The orchestrator of the analytical layer. This agent looks at the intelligence question and the collected material, then decides which analytical techniques are most appropriate for the problem at hand. Each technique is its own independent agent. Think of them as individual analysts with different methodological specializations. The coordinator selects which ones to engage and manages the flow of information between them.

05 - Analyst agents

A multi-agent layer that runs the actual structured analysis. Rather than a single "analyze this" prompt, these agents work through a deliberate reasoning loop. Scenario generation, refutation, evidence weighting, assumption checking. This process mirrors how a real SAT session would operate. They also communicate with each other and can loop back to collection when they identify gaps. More on this below.

06 - Reporter

Takes the analytical output and translates it into something a real stakeholder can act on. This agent looks at the use-case established in the intake stage, builds a mental model of the audience, and structures the report accordingly. The governing principle is BLUF (Bottom Line Up Front). The key assessment and its implications should be readable in under 30 seconds, without having to wade through caveats and methodology. Alternative scenarios are included, but the report takes a clear position and explains why.

The analyst layer deserves a closer look. I wanted the agents to run through a structured reasoning loop, one that encodes the actual analytical steps that make SAT valuable in the first place:

Scenario generation
Start by generating a broad list of possible scenarios using a creative SAT. Brainstorming without pre-filtering. The goal here is to avoid anchoring on the obvious explanation too early.
First refutation pass
Go through the scenario list and eliminate any that the evidence already in hand directly contradicts. This is an active pruning step.
Collection gap identification
For each scenario that survived, articulate what evidence or conditions would refute it. Then send that list back to the collection planner as new intelligence requirements. This circular loop intends to make the architecture more reliable.
Second refutation pass
When the new collection comes back, revisit the remaining scenarios. Does any of the new information rule something out? If so, eliminate it. Repeat until no further elimination is possible.
ACH pass
For surviving scenarios, flip the question: what in the evidence supports each one, and how strongly? This is Analysis of Competing Hypotheses; systematically evaluating how well the available information fits each remaining explanation.
Evidence weighting
Rank the remaining scenarios by the combined quality of the evidence supporting each. This is where the trust scores assigned during collection finally come into play.
Key assumptions check
Identify the baseline assumptions underpinning each scenario. If any of those assumptions turned out to be wrong, would the scenario collapse? If multiple scenarios share the same critical assumption, that assumption itself becomes a collection target. Loop back to the planner, collect new information and try to verify or refute it.
Scenario selection
Converge on 2–3 non-overlapping, mutually alternative scenarios that have survived both refutation rounds and hold up under the assumptions check. These become the candidates for the final assessment.
Devil’s advocate
Deliberately stress-test each of the remaining scenarios. What are its weakest points? Then apply alternative thinking: offer a plausible alternative explanation for each one that the evidence could also support. This is the check against overconfidence.
Final assessment
Choose the most likely scenario based on evidence quality and resilience to assumption failure. Rank the others below it. Document supporting evidence, key weaknesses, and plausible alternatives for all of them.

This architecture attempts to operationalize the structured analysis process. The design was also made modular to make adding new SATs easy in the future. But, how did the user experience look like?

I tested it with the following question: “if China moves on Taiwan, what cyber operations should we expect, directed at whom, and with what objectives?”

The intake experience was exactly what I'd hoped for. Rather than dumping a requirement into a form, the tool opened a conversation. Two short exchanges and it had a focused requirement.

Looking at the session logs told a different story. I checked out the primary question and the sub-questions it had generated and the problems were visible immediately.

First, these questions are redundant and heavily overlapping. Good sub-questions should be mutually exclusive and collectively exhaustive: answer each one independently, and together they should give you everything needed to answer the primary. These don’t do that. They all require the same information to answer.

Second, all the questions are anticipatory. “what is China likely to do,” “what are the anticipated effects.” And that is the wrong question type for collection. You cannot answer anticipatory questions by collection. You might get a rough answer through analysis, by reasoning from known doctrine, history, and capabilities. But collection gathers facts about what exists, what has happened, what has been observed. Sending an anticipatory question to a collection planner is like asking a researcher to go find evidence for something that hasn’t happened yet. The framing is wrong at the source.

The collection planner's output was next. I'm not going to walk through every single one. Instead, you can go through them and try to spot the mistakes yourself. But let me point out the first few that immediately caught my eye.

First thing that jumps out is the “without specific attribution to China” instruction, which appears on several of the technical collection tasks. This prompt doesn’t meaningfully blind the collector; it just makes the phrasing awkward while leaving the direction of inquiry completely transparent.

The second problem is scope. The technical collection tasks ask for TTPs, malware and attack vectors associated with state-sponsored operations as a general category. No collection effort can meaningfully execute against a requirement that wide. Good collection tasks are specific enough that a researcher knows when they’ve found what they’re looking for and when they haven’t.

Now, let’s have a look at the final report together.

Looking at the session logs, the collected intelligence contained nothing about China, Taiwan, APT41, PLA doctrine or any of the actors and targets that appear in the final report. These came from the model’s own training data. Which is a problem, because training data is internet data. And the internet’s coverage of cyber warfare is dominated by the hype and FUD’s. Breathless threat reports, vendor marketing dressed as intelligence, apocalyptic scenario planning that gets clicks.

That’s the substrate the model is drawing from. So the scenarios read exactly like that: APT41 proxies activating on PLA SSF signal, simultaneous ransomware across Taiwanese banks and media, wiper attacks against Chunghwa Telecom’s submarine cables. All choreographed in neat phases with named actors, named targets, named objectives. Dramatically coherent. Completely ungrounded. Garbage in, garbage out. Just with a structured analytical framework wrapped around it.

This is a core problem with language models. You cannot reliably force a model to ground its judgements exclusively in the provided material. It will infer with the same confidence it cites real sources and there is no reliable mechanism to prevent that. The collection pipeline, the trust scores, the verbatim retrieval requirements; these are instructions to the model, not constraints on it.

I built something that looked like it should work, ran it against a real intelligence question, and got a report that was structurally correct and analytically hollow. The problems showed up at every stage: the sub-questions were wrong, the collection plan was flawed, the collected intelligence was thin, and the analyst agents filled the gaps with their own training data rather than stopping to say they didn’t have enough to work with.

But I want to be careful about how firmly I land this. There are too many variables I didn’t control for. Maybe my prompts were not optimized . Perhaps my scaffolding was not sufficient. Maybe my model of choice (gemini-flash) was incorrect. A more capable model running the same architecture might behave differently. Or it might not. I genuinely don’t know.

What I do think is that the core problems are hard in a way that better prompts alone won’t fix. But I could be wrong, and I’d rather find out than assume. I’d genuinely like to know what you find. Better prompts, a different model, a different scaffolding approach. If you get something that actually works, share it.

Thanks for reading