State of the Agent: A Census of 375 Claude Code Agent Primitives

By the Numbers

The Census

375 primitives scanned

119 agents

69 plugins

18 domains evaluated

57,565 pairwise overlaps

23% significant overlap

72% define boundaries

<9% address uncertainty

10,000 LLM calls

0.30 avg boundary score

Chapter 1

What Are Agent Primitives?

Taxonomy

Three Primitives

Agent

Full persona in AGENT.md. Own context window, tools, model.

Skill

Reusable capability in SKILL.md. Procedural, slash-invoked.

Command

Bounded action in commands/. Legacy, now merged into skills.

Convention-driven, not contract-driven. No schema, no types — just natural language.

Claude Code is Anthropic's CLI for AI-assisted software development. Its plugin system lets developers extend Claude's capabilities by defining specialized configurations in Markdown and YAML files. These configurations come in three types, referred to here as agent primitives.

  • Agent (defined in AGENT.md). A full persona with its own system prompt, tool access, and model configuration. Runs in an isolated context window. Think of it as a job description for the AI: scope of responsibility, areas of expertise, rules of engagement.
  • Skill (defined in SKILL.md). A reusable capability or workflow that extends what Claude can do. Can be invoked as a slash command or loaded automatically when relevant. More procedural than persona-driven.
  • Command (defined in .md files under commands/). A slash-command shortcut that triggers a specific, bounded action. The older convention, now functionally merged into skills but still widely used.

A plugin bundles all three. A single plugin directory might contain 2 agents, 5 skills, and 3 commands, all working together. The wshobson/agents repository organizes its contributions this way: 69 plugin directories, each contributed by a different author, containing a total of 119 agents, 153 skills, and 81 commands.

What makes this interesting is that Claude Code plugins augment behavior purely through configuration. The system is convention-driven, not contract-driven: there is no schema validation, type checking, or enforced interface. A plugin is just Markdown and YAML files in a directory that follows naming conventions. Claude interprets the natural language in those files and adapts its behavior accordingly. This means the quality, clarity, and structure of that text directly shape what the agent actually does.
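
To make the convention concrete, here is a minimal sketch of how a scanner might classify a plugin's files purely by path and filename. The layout and names below are assumptions based on the description above, not the exact rules agent-evals applies.

```python
from pathlib import Path

def classify_primitives(plugin_dir: Path) -> dict[str, list[Path]]:
    """Classify a plugin's Markdown files by naming convention alone.

    Illustrative sketch: the layout (AGENT.md, SKILL.md, commands/) follows the
    description in the text, not any enforced schema.
    """
    found = {"agent": [], "skill": [], "command": []}
    for path in plugin_dir.rglob("*.md"):
        parts = [p.lower() for p in path.parts]
        if path.name.lower() == "agent.md":
            found["agent"].append(path)
        elif path.name.lower() == "skill.md":
            found["skill"].append(path)
        elif "commands" in parts:
            found["command"].append(path)
    return found
```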

This taxonomy matters for the analysis that follows. Agents with full personas need boundaries and uncertainty guidance more than simple commands do. When measuring "does this entity define what it doesn't know?", the answer carries different weight depending on whether it's an autonomous agent or a one-line command shortcut.

Chapter 2

Methodology

Data source. The wshobson/agents repository, one of the largest collections of Claude Code agent definitions on GitHub with nearly 30,000 stars. All 69 plugin directories were scanned recursively, processing 496 total files and deduplicating by content hash to arrive at 375 unique primitives.

Tool. agent-evals v0.3.0 is an open-source analysis tool for agent configurations. It reads system prompts in YAML, JSON, Markdown, and other formats. The tool has two modes: deterministic static analysis (keyword matching, regex scoring, Jaccard overlap) and a live probe mode that uses LLM-driven testing to evaluate agent behavior at runtime. This census uses both.

Runtime probes. In addition to static analysis, a behavioral probe was run using agent-evals' LLM harness mode. Each agent's system prompt was tested with calibrated questions designed to probe domain boundaries — questions deliberately outside the agent's claimed expertise. The LLM (Llama 3.3 70B) was given each agent's system prompt and asked to answer, reporting its confidence. Each probe ran once deterministically (temperature 0) and three times stochastically (temperature 0.7) to measure consistency. 420 of 428 agents were probed across 2,500 questions and 10,000 API calls. Use the Static / Runtime tabs on charts in Chapters 7 and 8 to compare the two perspectives.

What Gets Measured

Domain coverage. Each system prompt is checked for keywords across 18 evaluation domains (backend, frontend, security, databases, testing, etc.). An agent matching half the keywords in a domain scores 1.0 for that domain. This is keyword-based pattern matching, not semantic understanding.
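
A minimal sketch of that scoring rule in Python. The keyword lists here are placeholders, and the behavior below the half-matched threshold is an assumption; the text only specifies that matching half the keywords yields 1.0.

```python
DOMAIN_KEYWORDS = {
    # Placeholder keyword lists; agent-evals defines its own 18 domains.
    "backend": {"api", "rest", "microservice", "endpoint", "server", "queue"},
    "security": {"vulnerability", "owasp", "encryption", "auth", "threat", "audit"},
}

def domain_scores(prompt: str) -> dict[str, float]:
    """Keyword-based domain coverage: pattern matching, not semantic understanding."""
    text = prompt.lower()
    scores = {}
    for domain, keywords in DOMAIN_KEYWORDS.items():
        hit_rate = sum(kw in text for kw in keywords) / len(keywords)
        # Matching at least half the keywords counts as full coverage (1.0);
        # scaling linearly below that threshold is an assumption.
        scores[domain] = 1.0 if hit_rate >= 0.5 else round(hit_rate * 2, 2)
    return scores
```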

Pairwise overlap. For every pair of primitives, Jaccard similarity is computed on their sets of strong domains (score > 0.3). This produces 57,565 overlap scores. The analysis also checks for direct contradictions: cases where one agent says "always use X" and another says "never use X."
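
The overlap metric is plain Jaccard similarity over strong-domain sets. A sketch, assuming the 0.3 strong-domain threshold described above and treating two empty sets as zero overlap:

```python
def jaccard_overlap(scores_a: dict[str, float], scores_b: dict[str, float],
                    strong: float = 0.3) -> float:
    """Jaccard similarity of the two primitives' sets of strong domains."""
    a = {d for d, s in scores_a.items() if s > strong}
    b = {d for d, s in scores_b.items() if s > strong}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```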

Boundary language. A regex check for words like "don't," "avoid," "outside," "limit," "boundary," or "refer to" in the system prompt. If present, the primitive scores 0.7 for boundary definition; if absent, 0.3.

Uncertainty guidance. A regex check for words like "uncertain," "unsure," "don't know," "not sure," or "confidence." If present, the primitive scores 0.8; if absent, 0.3.
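
Both checks reduce to a single regex lookup over the prompt. A sketch using the trigger words listed above; the actual patterns in agent-evals may be broader:

```python
import re

BOUNDARY_RE = re.compile(r"don'?t|avoid|outside|limit|boundary|refer to", re.I)
UNCERTAINTY_RE = re.compile(r"uncertain|unsure|don'?t know|not sure|confidence", re.I)

def boundary_score(prompt: str) -> float:
    """0.7 when boundary language is present, 0.3 when absent."""
    return 0.7 if BOUNDARY_RE.search(prompt) else 0.3

def uncertainty_score(prompt: str) -> float:
    """0.8 when uncertainty guidance is present, 0.3 when absent."""
    return 0.8 if UNCERTAINTY_RE.search(prompt) else 0.3
```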

The static analysis layer is fully deterministic — two runs on the same input produce identical results. The runtime probe layer introduces controlled randomness (three stochastic runs per probe) to measure behavioral variance. Together, they reveal where an agent's configuration (what it says) diverges from its behavior (what it does).

Chapter 3

Ecosystem Flow

How does the ecosystem organize itself? When developers create agent plugins, which domains do they target, and how do plugins distribute their focus? The Sankey diagram below maps the flow from plugins (left) to domains (right). Link width is proportional to how many primitives in a plugin claim a given domain. Small plugins with fewer than 4 total links are grouped under "Other" to keep things readable.

Figure 1. Plugin-to-domain Sankey diagram. Left nodes are plugins (blue), right nodes are the 18 evaluation domains (amber). Hover for details. Zoom and pan with scroll and drag.

The flow reveals a heavily concentrated ecosystem. A handful of domains attract the vast majority of plugin attention, while others see relatively little coverage. Backend development, security, and testing are the thickest arteries, with most plugins claiming at least some presence in these areas.

Notice the pattern of fan-out: many plugins connect to the same popular domains. This isn't surprising since these domains represent core software engineering work, but it does mean that users installing multiple plugins are likely to end up with significant redundancy in those areas.

Network Insight

Domain Concentration

Backend development and security dominate the ecosystem, with 40% of agent claims concentrated in just 5 domains.

This concentration suggests opportunity for differentiation in emerging domains like cloud infrastructure and database design.

Chapter 4

The Domain Landscape

Domain Insight

18 Domains

Top 5 claim 40% of coverage

Broad: backend, security, testing, frontend, DevOps

Niche: mobile, data science, distributed systems

Non-tech: legal, medical, financial

If every agent declares what it knows, where does the collective knowledge cluster? The treemap below sizes each of the 18 evaluation domains by how many primitives claim them. The selectivity slider filters out domains that match too broadly: a domain with high selectivity matches few agents (more meaningful signal), while one with low selectivity matches most agents (potentially just noise from generic language).
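
Selectivity can be read as the share of agents a domain does not match. The exact formula is not given in the text, so the version below is an assumption that captures the described behavior (high selectivity means few matching agents).

```python
def selectivity(matching_agents: int, total_primitives: int = 375) -> float:
    """Assumed definition: the fraction of primitives that do NOT match the domain."""
    return 1.0 - matching_agents / total_primitives

# Under this assumption, a broad domain matched by 300 of 375 primitives scores 0.2,
# while a niche domain matched by 30 scores 0.92.
```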

Figure 2. Domain distribution treemap. Tile area encodes agent count. Darker tiles indicate higher selectivity (fewer agents match). Drag the slider to filter out overly broad domains.

At the default threshold, several broad domains dominate the map. As you lower the selectivity cutoff, the generalist domains drop away and the more specialized ones come into focus. This is where the interesting structure lives: domains that are claimed by a meaningful subset of agents rather than nearly all of them.

The domains that survive aggressive filtering tend to be the most technically specific (mobile, distributed systems, data science) and the non-technical domains (legal, medical, financial). These represent genuine specialization rather than incidental keyword matching.

Chapter 5

Inside a Plugin

Plugin Insight

Internal Cohesion

0.39 avg cohesion

High cohesion = skills designed to support their agents

Low cohesion = grab-bag of unrelated capabilities

A plugin is more than a collection of agents. The best-designed plugins have internal coherence: their agents, skills, and commands share enough functional ground to work together, without so much overlap that they become redundant. The radial graph below shows the internal structure of each plugin. Nodes are colored by type and lines connect entities that share functional overlap.

Figure 3. Plugin anatomy graph. Nodes are colored by type: agents, skills, commands. Line thickness encodes overlap score. Dashed red lines indicate conflicts. Cohesion is the average agent-to-skill overlap within the plugin.

Cohesion is the average overlap score between agent-type and skill-type entities within a plugin. Higher cohesion suggests the skills are designed to support the agents (they share functional ground). Very low cohesion might indicate a grab-bag of unrelated capabilities. The ecosystem average is 0.39.
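
A sketch of the cohesion calculation, reusing the strong-domain sets from the overlap metric; averaging over every agent/skill pair is an assumption about the exact aggregation.

```python
from itertools import product
from statistics import mean

def plugin_cohesion(agent_domains: list[set[str]], skill_domains: list[set[str]]) -> float:
    """Average Jaccard overlap between each agent and each skill in a plugin."""
    pairs = list(product(agent_domains, skill_domains))
    if not pairs:
        return 0.0  # plugins with no agents or no skills get no cohesion score
    return mean(len(a & s) / len(a | s) if (a | s) else 0.0 for a, s in pairs)
```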

Browse through several plugins to see the variety: some are tightly focused with a single agent supported by several skills, others are broad toolkits spanning many domains.

Chapter 6

The Overlap Problem

When two plugins independently define agents for the same domain, what happens? The chord diagram below visualizes cross-plugin functional overlap. Each arc represents a plugin; chords connect plugins whose agents overlap above the threshold. Thicker chords mean more overlapping pairs. Red chords indicate detected conflicts: cases where one agent's system prompt says “always use X” or “prefer X” while another says “avoid X” or “never use X.” These aren't domain mismatches — they're related agents giving opposite advice about the same tools or practices.
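
Conflict detection boils down to finding a tool or practice that one prompt endorses and another forbids. The sketch below is deliberately simplified; the actual directive extraction in agent-evals is not documented here, so the patterns are assumptions.

```python
import re

ENDORSE_RE = re.compile(r"\b(?:always use|prefer)\s+([A-Za-z0-9_.+-]+)", re.I)
FORBID_RE = re.compile(r"\b(?:never use|avoid)\s+([A-Za-z0-9_.+-]+)", re.I)

def find_conflicts(prompt_a: str, prompt_b: str) -> set[str]:
    """Return subjects that one prompt endorses while the other forbids them."""
    endorse_a = {m.lower() for m in ENDORSE_RE.findall(prompt_a)}
    forbid_a = {m.lower() for m in FORBID_RE.findall(prompt_a)}
    endorse_b = {m.lower() for m in ENDORSE_RE.findall(prompt_b)}
    forbid_b = {m.lower() for m in FORBID_RE.findall(prompt_b)}
    return (endorse_a & forbid_b) | (endorse_b & forbid_a)
```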

Figure 4. Cross-plugin overlap chord diagram. Arcs are plugins, chords connect plugins with overlapping agents above the threshold. Red indicates conflicts. Hover for detail.

23% of agent pairs show significant functional overlap. That number is worth sitting with. In an ecosystem of independently authored plugins, some overlap is natural and even healthy: it means popular domains are well-served. But the density of connections at even moderate thresholds suggests users installing several plugins will encounter real redundancy.

The more concerning signal is the red chords: direct contradictions between agents in different plugins. These are cases where one agent says "always use X" and another says "avoid X." A developer using both plugins gets conflicting guidance with no warning.

Complexity Insight

Redundant Agents

23% of agent pairs show significant functional overlap, indicating the potential for redundancy in the ecosystem.

This redundancy creates decision fatigue for users and maintenance overhead for plugin authors, without necessarily increasing coverage of unique capabilities.

Chapter 7

The Boundary Gap

Key Finding

The 8:1 Ratio

72% define boundaries

<9% address uncertainty

Agents know what they do, but rarely say what they don't know.

This is the most actionable finding in the census. Boundary language means the system prompt explicitly states what the agent should not do, or where its expertise ends ("I focus on backend development and should not be used for frontend work"). Uncertainty guidance means the prompt instructs the agent how to handle things it doesn't know ("If I'm unsure about a security implication, I'll flag it rather than guess").

The chart below breaks down both metrics by primitive type.

Figure 5. Static view: boundary definition vs. uncertainty guidance by primitive type (green bars show the percentage with boundary language; red bars show uncertainty guidance). Runtime view: boundary respect and refusal health by primitive type (blue bars show mean boundary score from live probes; red bars show refusal health).

72% of primitives include some form of boundary language, but fewer than 9% address uncertainty. That's an 8:1 ratio. Agents are the best of the three types at both, but even among agents the uncertainty gap is stark.

This matters because an agent that confidently defines its scope ("I am a security expert") but never acknowledges limits ("If I encounter an unfamiliar vulnerability class...") may produce authoritative-sounding answers in areas where it should hedge. The boundary gap is the distance between claiming competence and acknowledging limits.

Switch to Runtime to see the say-do gap. When 2,500 calibrated probes were sent to these same agents through an LLM harness, the picture changed dramatically. The mean boundary respect score dropped to 0.30 — most agents confidently answered questions outside their claimed domain, regardless of what their system prompt said. Refusal health averaged just 0.04: almost no agent refused when it should have. Writing "I only handle backend tasks" in a prompt and actually declining frontend questions are very different things.

Chapter 8

Score Distributions

The previous chapters presented aggregate findings. Below are the full distributions of the three key metrics across all 375 primitives, broken down by type. This reveals the shape of the data behind the averages.

Figure 6. Static view: score distributions for scope clarity, boundary definition, and uncertainty guidance, grouped by primitive type (agents, skills, commands). Runtime view: score distributions for boundary respect, calibration, consistency, and refusal health across 420 probed agents.

Distribution

Three Metrics

Scope clusters high — most agents say what they do

Boundary bimodal — present or absent, no middle ground

Uncertainty overwhelmingly absent across all types

Scope clarity clusters high. Most primitives clearly state what they do, which makes sense since a system prompt that doesn't describe its purpose wouldn't be very useful.

Boundary definition shows a bimodal split: a cluster near 0.3 (no boundary language detected) and another near 0.7 (boundary language present). There's little middle ground, which reflects the binary nature of the regex check.

Uncertainty guidance clusters overwhelmingly near 0.3 (absent) for all types. The handful of primitives that do address uncertainty are scattered without a clear pattern by type.

Chapter 9

Limitations

This analysis has meaningful constraints that should inform how you interpret the results.

  • Keyword matching has false positives and negatives. A prompt saying "I avoid long pauses" would trigger the boundary regex on "avoid," even though it has nothing to do with scope boundaries. Conversely, sophisticated boundary instructions using novel phrasing could be missed entirely.
  • The 18 domains are one taxonomy, not the only possible one. A different set of domains would produce different coverage and overlap numbers. Some real specializations (e.g., accessibility, performance optimization) don't have their own domain in this framework.
  • The say-do gap has its own limits. The runtime probes used a single model (Llama 3.3 70B) — different models may respect boundary language differently with the same system prompts. Probe questions are synthetic, not real user queries. Three stochastic runs is sufficient to detect variance but not achieve statistical significance per individual probe. And confidence self-reports may not reflect actual accuracy. Recent research from Gloaguen et al. ("Evaluating AGENTS.md", ETH Zurich, Feb. 2026) found that the relationship between configuration content and actual task performance is surprisingly complex: agents generally follow the instructions in context files, but that doesn't always translate to better outcomes.
  • One ecosystem, not all Claude Code usage. The wshobson/agents repository is large and active, but it's one community collection. Developers writing private agent configurations, corporate teams with internal plugins, and other public repositories are not represented.
  • Overlap scoring is coarse. Jaccard similarity on domain sets treats all shared domains equally. Two agents sharing "backend" and "databases" might actually have very different specializations within those broad categories.
  • Boundary and uncertainty scores are binary. A prompt either contains the trigger words (score 0.7/0.8) or it doesn't (score 0.3). There is no gradient for how well the boundaries are defined or how thoughtful the uncertainty guidance is.

Chapter 10

What This Means

The agent ecosystem is growing fast and organically. Thousands of developers are independently writing configurations that define how AI collaborators should behave, and a few patterns are worth noting.

The say-do gap is the headline finding. 72% of agents include boundary language in their system prompts, but when tested with out-of-scope questions, the mean runtime boundary score is just 0.30. Most agents confidently answer outside their domain regardless of what their prompt says. Refusal health averages 0.04 — almost no agent refuses when it should. Writing boundaries is easy; making them work is the harder problem.

The boundary gap is still the lowest-hanging fruit. Adding explicit scope limits and uncertainty instructions to a system prompt is straightforward. Plugin authors could meaningfully improve their agents by adding a few lines about what the agent should not attempt and how it should handle the edges of its knowledge. But authors should test those boundaries at runtime, not just assume the language will stick.

Concentration creates both opportunity and risk. The heavy clustering around backend, security, and testing domains means those areas are well-served but potentially oversaturated. Developers building new plugins might find more impact in underserved domains like cloud infrastructure, observability, or data science.

Overlap isn't inherently bad, but conflicts are. Multiple plugins covering the same domain give users choice. But contradictory guidance between plugins is a real problem that currently has no systematic detection mechanism beyond tools like agent-evals.

Skills need boundaries too. Skills and commands have even lower boundary and uncertainty rates than agents. As these primitives take on more autonomous roles with features like forked execution contexts, their need for guardrails grows.

Future editions of this census will track how the ecosystem evolves — whether the boundary gap narrows as awareness grows, and how new Claude Code features change the patterns in community-authored configurations.

Related research. While this census examines the structure of agent configurations, a complementary question is whether those configurations actually help. Gloaguen et al. ("Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?", ETH Zurich, Feb. 2026) evaluated context files (AGENTS.md, CLAUDE.md) across four coding agents and two benchmarks. Their findings are nuanced: LLM-generated context files actually decreased task success rates by 2-3% while increasing cost by over 20%, while human-written files provided a modest ~4% improvement. Agents do follow the instructions in context files (tools mentioned get used 1.6-2.5x more), but more instructions also mean more exploration, more testing, and more reasoning tokens spent. The authors conclude that "unnecessary requirements make tasks harder" and that context files "should describe only minimal requirements." This parallels the boundary gap finding in this census: the challenge isn't just whether agents have guardrails, but whether the right guardrails are expressed concisely enough to actually help. Their work evaluates effectiveness at the repository level; this census maps the structural patterns across an entire ecosystem. Together, they suggest that the quality and minimalism of agent configurations matters more than their quantity.

Appendix

How This Was Built

This report runs entirely in your browser. The data is stored in Parquet files served over HTTP, and every chart, filter, and mini-tool queries those files using SQL executed client-side in WebAssembly. There is no backend API processing your interactions.

Analysis

Static analysis. agent-evals v0.3.0 scans the wshobson/agents repository recursively, reading 496 configuration files across 69 plugin directories. Each file is parsed (YAML, JSON, Markdown) and scored against 18 evaluation domains using keyword matching. Boundary language and uncertainty guidance are detected via regex patterns. Pairwise overlap is computed as Jaccard similarity on domain sets. The output is a single JSON file containing every primitive with its scores, domains, and metadata. Content-hashing deduplicates the 496 files down to 375 unique primitives.

Runtime probes. agent-evals' LLM harness mode sent 2,500 calibrated probes across 420 agents via Llama 3.3 70B, totaling 10,000 API calls. For each primitive, the tool generates domain-relevant boundary, calibration, overlap, and refusal probes. Each probe runs once deterministically (temperature 0) and three times stochastically (temperature 0.7). The harness measures confidence calibration, hedging behavior, refusal rates, and response consistency. This produces four behavioral scores per agent — boundary respect, calibration, consistency, and refusal health — that complement the three static scores.
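
The probe loop itself is simple: one deterministic call and three stochastic repeats per question. In the sketch below, call_llm is a hypothetical stand-in for the harness's model client, and the derived metrics are simplified versions of what agent-evals reports.

```python
import statistics

def run_probe(call_llm, system_prompt: str, question: str) -> dict:
    """One probe: a temperature-0 call plus three temperature-0.7 repeats.

    call_llm(system, user, temperature) is a hypothetical helper returning
    (answer_text, self_reported_confidence).
    """
    det_answer, det_conf = call_llm(system_prompt, question, temperature=0.0)
    repeats = [call_llm(system_prompt, question, temperature=0.7) for _ in range(3)]
    confidences = [det_conf] + [conf for _, conf in repeats]
    return {
        "answered": bool(det_answer.strip()),          # answered instead of refusing?
        "mean_confidence": statistics.mean(confidences),
        # A small spread across repeats indicates consistent behavior on this probe.
        "confidence_spread": max(confidences) - min(confidences),
    }
```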

Data Pipeline

JSON to Parquet. A Go program reads the analysis JSON and produces seven Parquet files: agents, overlaps, agent_domains, domains, plugins, issues_summary, and meta. Parquet was chosen specifically because DuckDB-WASM can read it directly over HTTP with no server-side processing. The columnar format also compresses well: the entire dataset is seven files totaling under 300KB.

Schema design. The schema is normalized for analytical queries. The overlaps table alone contains 57,565 rows (one per agent pair) with precomputed Jaccard scores and conflict flags. The agent_domains table is a many-to-many bridge with 3,594 rows. This structure lets the frontend run joins and aggregations in SQL rather than doing client-side data wrangling in JavaScript.

Client-Side OLAP

DuckDB in the browser. When the page loads, it initializes DuckDB-WASM and registers each Parquet file as an HTTP-backed table. From that point on, every chart, slider interaction, dropdown change, and mini-tool query executes real SQL in your browser's WebAssembly runtime. The Sankey diagram is a GROUP BY with HAVING. The overlap chord diagram is a self-join with a threshold filter. The Agent Lookup is a LIKE query. These are not pre-rendered visualizations — they are live queries against a columnar database running locally.
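
The same queries run fine outside the browser with DuckDB's Python bindings, which is a convenient way to illustrate the pattern. Table and column names below are assumptions based on the schema description; the production queries live in queries.js.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE VIEW overlaps AS SELECT * FROM read_parquet('overlaps.parquet')")

# Chord-diagram style query: count overlapping agent pairs per plugin pair
# above a similarity threshold (column names are assumed, not confirmed).
rows = con.execute("""
    SELECT plugin_a, plugin_b, COUNT(*) AS overlapping_pairs
    FROM overlaps
    WHERE jaccard > ? AND plugin_a <> plugin_b
    GROUP BY plugin_a, plugin_b
    HAVING COUNT(*) >= 1
    ORDER BY overlapping_pairs DESC
""", [0.7]).fetchall()
```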

Query architecture. All SQL is defined in a central queries.js module. Chart modules import query templates, substitute parameters (slider values, dropdown selections), and call a shared query() helper that executes against the DuckDB connection. Results come back as typed JavaScript arrays. This means adding a new visualization is: write a SQL query, write a D3 render function, wire them together.

Why this is nice. Client-side OLAP means the page works on any static file host — no server processing queries, just immutable Parquet files and your browser doing the compute. DuckDB-WASM is genuinely one of the most impressive pieces of engineering in the data tooling space right now. Loading a full analytical database in the browser that runs real SQL against columnar data at near-native speed, all in under 300KB of data? It's hard not to get excited about that.

Visualization

Six interactive charts built with D3.js v7: Sankey diagram (plugin-to-domain flow), treemap (domain landscape), chord diagram (cross-plugin overlap), force-directed network (plugin anatomy), grouped bar chart (boundary comparison), and small multiples histograms (score distributions). Each chart subscribes to DuckDB query results and re-renders when control inputs change. Slider and dropdown interactions re-query DuckDB and update charts in real time.

Component | Tool | Role
Analysis | agent-evals v0.3.0 | Static + live probe analysis of agent configurations
Data pipeline | Go + Apache Parquet | JSON to columnar Parquet conversion
Client-side SQL | DuckDB-WASM 1.29 | In-browser analytical queries over HTTP-backed Parquet
Visualization | D3.js v7 | Interactive SVG charts driven by live SQL results
Web server | Go + go:embed | Single-binary static file server
Source | agent-census | Data pipeline, queries, and frontend code