Agentic workflows for software development


The promise and the reality: Notes from the field

QuantumBlack, AI by McKinsey

Over the past two years, many teams have adopted generative AI copilots, and more recently AI agents, within their software development workflows. The pitch seemed straightforward: developers would use copilots as intelligent assistants, pair programming with AI to move faster and build better software.

And it works, up to a point. As we’ve observed from McKinsey engagements, while the “developer with AI assistant” model makes individual practitioners faster, in an enterprise context, the efficiency improvement from idea to live feature is typically less significant. The handoff from requirements to design to implementation is where context goes to die. Decisions buried in Slack threads. Assumptions in someone’s head. Rationale re-litigated because no one can find the original reasoning. AI assistants can accelerate the work within a phase of the SDLC as long as you don’t expect them to fix the boundaries between them.

AI agents also introduce problems of their own:

  • Unpredictable outcomes: Different developers prompting the same model get different results. Quality depends on individual skill, not systematic process. You can’t staff around it or plan for it.
  • No audit trail: Decisions live in chat windows. When an auditor asks why the system was built this way, or a new team member asks why we chose Redis over SQS, the reasoning is either lost or scattered across dozens of conversations.

These limitations stem from workflow design, not from the models themselves.

The value comes when agents operate inside conventions, structured specifications, and deterministic processes. A growing number of teams are adopting this principle under the label spec-driven development (SDD), in which structured specifications drive what agents produce and ad hoc prompting is eliminated.

Our successful implementations followed a specific pattern: deterministic orchestration for workflow control, paired with bounded agent execution and automated evaluation at each step.

This article covers that pattern: the two-layer model (orchestration vs execution), how folder structures and naming conventions create machine-readable workflows, and what this looks like end-to-end.


This article describes a pattern that uses deterministic orchestration for workflow control and bounded agent execution with automated evaluation at each step.

Orchestration layer: deterministic workflows based on conventions

The orchestration layer stays deterministic. Agents shouldn’t decide what comes next or where artifacts should live. Instead, we use a deterministic workflow engine that follows predefined rules to move work through stages.

This is a critical design choice. Early on, we experimented with letting agents orchestrate themselves, deciding when to move from requirements to design, or which task to work on next. On smaller projects this worked. On larger codebases with cross-cutting concerns and multiple in-flight features, agents routinely skipped steps, created circular dependencies, or got stuck in analysis loops. Agents are good at generating content within a bounded problem; they struggle with meta-level decisions about workflow sequencing.

What works is a conventional, rule-based workflow engine that:

  • Enforces phase transitions: Requirements must be complete before tasks can be generated; architecture must be reviewed before implementation starts
  • Manages dependencies: Tasks can only execute when their dependencies are satisfied
  • Tracks artifact state: Each artifact (requirement, task, etc.) has a state machine (draft → in-review → approved → complete) stored in frontmatter. The engine reads this to determine which tasks are ready to execute, which are blocked, and which are done
  • Triggers agents at the right time: “When REQ-001 is approved, generate technical tasks” is deterministic; “figure out what to do next” is not

The orchestration runs around the agents. Agents don’t decide what phase we’re in or what comes next; they execute tasks given to them by the workflow engine. “If all tasks for REQ-001 are complete, mark REQ-001 complete. If all requirements are complete, trigger deployment.” No intelligence, no judgment calls. Just deterministic state transitions.
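To make the "no intelligence, just state transitions" idea concrete, here is a minimal sketch of such a rule. The `Artifact` shape and function names are hypothetical, not taken from any particular workflow engine; the point is that completion is a mechanical check over frontmatter state, with no judgment involved:

```typescript
// Hypothetical sketch of the deterministic transition rule described above.
type Status = "draft" | "in-review" | "approved" | "complete";

interface Artifact {
  id: string;      // e.g. "REQ-001" or "TASK-001"
  parent?: string; // tasks point at their requirement
  status: Status;  // stored in the artifact's frontmatter
}

// A requirement may be marked complete only when every child task is complete.
function canComplete(req: Artifact, all: Artifact[]): boolean {
  const children = all.filter(a => a.parent === req.id);
  return children.length > 0 && children.every(t => t.status === "complete");
}

// The engine applies rules mechanically: no judgment calls, only state checks.
function advance(req: Artifact, all: Artifact[]): Artifact {
  if (req.status === "approved" && canComplete(req, all)) {
    return { ...req, status: "complete" };
  }
  return req; // blocked: a child task is still in flight
}
```

An agent never calls `advance`; the engine runs it whenever an artifact's state changes, which is what keeps sequencing out of the agents' hands.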

Execution layer: agents + evaluations

Within each phase, agents do the creative work:

  • Analyzing requirements and breaking them down into technical tasks
  • Proposing technical architectures
  • Writing code and tests
  • Creating documentation

We use specialized agents for different tasks rather than one general-purpose agent: a requirements agent for breaking down features, an architecture agent for design decisions, a coding agent for implementation, and a knowledge agent that other agents call to query project context and track assumptions.

Each has a clear responsibility, similar to microservices architecture. Like microservices, this trades one complex agent for multiple simpler agents plus the orchestration cost of managing interfaces between them. Bounded agents produce more predictable outputs.

These agents operate within guidelines, much like a new team member would, but with zero implicit understanding. Unlike even the most junior developer, they bring no background knowledge, no intuition from past projects, no ability to pick up conventions by osmosis. Every guideline must be explicit and machine-readable. Modern agent platforms are converging on this through agent skills: reusable, modular instructions (often SKILL.md files) that encode domain-specific expertise. Each of our specialized agents is essentially a skill, a bounded set of instructions, templates, and evaluation criteria for a specific type of work.

Each artifact type has a structured template and a definition of done. Requirements, designs, and tasks aren’t freeform text but artifacts with consistent structure. Traceability and sequencing live in machine-readable metadata so the workflow engine can move work forward deterministically and the next agent always gets complete inputs.

Every agent output goes through evaluation before the workflow proceeds: a combination of deterministic checks and agentic validation, without a human reviewer in the loop.

Deterministic checks run first: linters, test suites, structural validation (is required frontmatter and section structure present? do cross-references resolve?). These are fast, reliable, and catch obvious issues.

For checks that require judgment (“Are these acceptance criteria actually testable?” or “Does this architecture follow established patterns?”), we use a dedicated critic agent that runs inline at the end of each phase. The critic validates the producing agent’s output against the definition of done, returning pass/fail with explanation.

If either layer rejects the output, the producing agent iterates (still within the same phase) until the artifact passes both. Only then does the workflow engine advance to the next phase. We cap iterations at 3–5 attempts to prevent infinite loops; if the agent can’t pass evals within the limit, the workflow fails and rolls back for human intervention.
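The two-layer gate with a retry cap can be sketched in a few lines. The function names and the stubbed producer are illustrative assumptions; what matters is the ordering (cheap deterministic checks before the critic) and the bounded loop:

```typescript
// Hypothetical eval gate: deterministic checks run first, then a critic agent
// for judgment calls; the producing agent retries until both layers pass or
// the attempt cap is hit, at which point the workflow fails and escalates.
interface EvalResult { pass: boolean; reason?: string }
type Check = (artifact: string) => EvalResult;

function evalGate(
  produce: (feedback?: string) => string, // producing agent (stubbed here)
  deterministicChecks: Check[],           // linters, tests, structural checks
  critic: Check,                          // critic agent's pass/fail verdict
  maxAttempts = 3,
): string {
  let feedback: string | undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const artifact = produce(feedback);
    // Layer 1: fast deterministic checks
    const failed = deterministicChecks.map(c => c(artifact)).find(r => !r.pass);
    if (failed) { feedback = failed.reason; continue; }
    // Layer 2: critic agent handles judgment calls
    const verdict = critic(artifact);
    if (verdict.pass) return artifact; // gate passed; workflow advances
    feedback = verdict.reason;
  }
  // Cap reached: fail the workflow and roll back for human intervention
  throw new Error("eval gate failed after max attempts; escalating to a human");
}
```

Feeding the failure reason back into the next attempt is what makes the loop converge instead of blindly regenerating.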

Putting it together

Each phase follows the same lifecycle: deterministic pre-events set up context, the agent does creative work, and deterministic post-events validate the output (fast structural checks first, then a critic agent for judgment calls). The fail loop sends the agent back for another attempt; the only way forward is through the post gate.


Putting it together to demonstrate how the layers combine across the full workflow.

Now that we’ve covered the two layers conceptually, here’s how the orchestration layer works in practice. The key design decision, shared across spec-driven tools like Spec Kit and Kiro, is separating persistent project context from per-feature specifications, and co-locating both within a repo alongside application source code:

.sdlc/
  context/                      # Persistent project context
    project-overview.md         # What the system does, tech stack, scope
    architecture.md             # Architecture decisions and patterns
    conventions.md              # Naming, structure, coding standards
  templates/                    # Reusable artifact templates
    requirement-template.md
    task-template.md
  specs/                        # Per-feature specifications
    REQ-001-notification-system/
      requirement.md            # The spec
      tasks/
        TASK-001-implement-notification-service.md
        TASK-002-create-email-channel.md
  knowledge/                    # Accumulated project knowledge & answered questions

src/                            # Source code (normal location)
tests/                          # Tests (normal location)
AGENTS.md                       # Root-level agent context

These conventions are more than organizational preferences; they’re part of the workflow engine’s contract. The folder hierarchy and naming conventions tell the system:

  • What’s persistent vs per-feature: Files in context/ apply to all features; files in specs/ belong to a specific requirement
  • What’s related: A requirement and its tasks live together in one folder. Everything about REQ-001 is in one place
  • Where agents should read and write: A requirements agent writes to specs/REQ-001-*/ and reads from context/
  • Artifact relationships: The REQ-* prefix and folder structure let the engine parse parent-child relationships programmatically
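As a sketch of that last point, here is roughly how an engine could recover parent-child relationships from paths alone. The regex and return type are illustrative assumptions, not a specific tool's implementation:

```typescript
// Hypothetical path parser: the REQ-*/TASK-* naming lets the engine recover
// parent-child links from the folder layout alone, with no database needed.
interface ParsedArtifact { req: string; task?: string }

const SPEC_PATH =
  /^\.sdlc\/specs\/(REQ-\d+)[^/]*(?:\/tasks\/(TASK-\d+)[^/]*\.md)?/;

function parseSpecPath(path: string): ParsedArtifact | null {
  const m = SPEC_PATH.exec(path);
  if (!m) return null;              // not a spec artifact (e.g. src/ or context/)
  return { req: m[1], task: m[2] }; // task is undefined for the requirement itself
}
```

Because the relationship lives in the path, any tool that can list files can reconstruct the traceability graph.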

Each artifact also carries explicit traceability in its metadata, which enables the workflow engine to:

  • Validate completeness (are all acceptance criteria from REQ-001 covered by tasks?)
  • Compute task order (what can run in parallel vs sequentially?)
  • Block invalid transitions (can’t mark REQ-001 complete if TASK-001 is still in progress)
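A completeness check of the first kind can be sketched as a simple set difference. The `AC-n` labels and the per-task `covers` field are assumed metadata shapes for illustration; the article only specifies that such traceability exists:

```typescript
// Hypothetical completeness check: every acceptance criterion in the
// requirement must be claimed by at least one task's coverage metadata.
interface TaskMeta { id: string; covers: string[] } // e.g. covers: ["AC-1"]

function uncoveredCriteria(criteria: string[], tasks: TaskMeta[]): string[] {
  const covered = new Set(tasks.flatMap(t => t.covers));
  return criteria.filter(ac => !covered.has(ac)); // non-empty = REQ incomplete
}
```

The engine blocks the transition out of the tasks phase whenever this returns a non-empty list.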

The combination of deterministic workflow engine, strict folder conventions, and naming patterns makes the system predictable, auditable, and debuggable. When something goes wrong (and it does), you can trace back through the decision tree. Spec evolution is handled through frontmatter (each artifact carries its status: draft, in-review, approved, complete) and branching. When a requirement changes after tasks have been generated, the workflow engine can invalidate dependent tasks based on their status and trigger regeneration on a new branch.

End-to-end example: building a notification system

The following composite example is simplified to illustrate the convention-based approach end-to-end. The patterns are real; the feature is illustrative.

Branch creation

Before any work begins, the workflow engine creates a feature branch (agent/REQ-001-notification-system) for the entire workflow. All phases execute on this branch, with commits after each step. Git is the state store: the branch represents the feature workflow; commits represent completed phases. The repository itself is the source of truth.

The workflow engine handles deterministic Git operations (clone/branch/commit/push/open PR), keeping them out of the agent’s scope. Agents produce artifacts; the engine moves them through the repo. Humans only enter at the end when the PR is opened (or earlier only if the workflow fails its eval gates and escalates).

Phase 1: Requirements

We start by validating .sdlc/context/project-overview.md (the grounding context for all agent interactions) and updating it if the project scope has changed. Then we give an agent this prompt: "Create a requirement for a notification system supporting email, in-app, and push channels with user preferences". The agent produces .sdlc/specs/REQ-001-notification-system/requirement.md:

---
id: REQ-001
title: "Notification System"
status: draft
---

## Description

System must provide multi-channel notifications (email, in-app, push) with
user-configurable preferences and reliable delivery tracking.

## Acceptance criteria

- [ ] Users can receive notifications via email
- [ ] Users can receive in-app notifications
- [ ] Users can receive push notifications (mobile/web)
- [ ] Users can configure channel preferences per notification type
- [ ] Failed deliveries are retried with exponential backoff
- [ ] Notification history is queryable for 90 days
- [ ] Notifications can be marked as read/unread

## External dependencies

- Email service (SendGrid/SES)
- Push notification service (Firebase Cloud Messaging)
- PostgreSQL database

The agent has done creative work (analyzing the requirement, identifying edge cases, researching best practices) but within a strict structure. Deterministic checks validate that required frontmatter and sections are present; the critic agent confirms that acceptance criteria are testable.

During creation, the requirements agent called the knowledge agent multiple times. Because these are structured tool calls with assumptions logged within the repository, they appear as discrete, reviewable items in the pull request:

  • “Should notifications be sent synchronously or via a queue?” No existing convention found. Assumption logged: Using async queue-based delivery (not synchronous).
  • “What’s our email provider?” Found in architecture.md: SendGrid for transactional email.
  • “Do we need notification templates or freeform content?” No existing convention found. Question logged for human review.

The agent commits this requirement to the branch with a message like feat(REQ-001): add notification system requirement. Assumptions are logged as structured data, ready for review later.

Phase 2: Architecture

The architecture agent reads REQ-001, which includes the assumptions logged during requirements, including “Using async queue-based delivery.” It checks architecture.md for existing patterns (finds SendGrid is the established email provider), and queries the knowledge agent for areas not yet covered. It encounters gaps: no established convention for queue infrastructure or notification storage. Rather than just logging assumptions, the agent proposes architectural decisions with rationale:

## Message Queue (proposed addition to architecture.md)

Use Redis-backed queue (Bull/BullMQ) for async job processing.
Rationale: Handles retry logic, backoff, and dead-letter queues out of the box;
Redis already in stack for caching.

## Event Storage (proposed addition to architecture.md)

Store notifications in PostgreSQL with 90-day retention, partitioned by created_at.
Rationale: Queryable history for user-facing inbox; partitioning enables efficient cleanup.

The agent commits the architecture proposal: feat(REQ-001): add architecture for notification system. These proposed conventions will be reviewed alongside everything else when the PR is opened.

Phase 3: Technical tasks

The agent generates concrete implementation tasks inside REQ-001-notification-system/tasks/:

  • TASK-001-implement-notification-service.md
  • TASK-002-create-email-channel.md
  • TASK-003-create-push-channel.md
  • TASK-004-add-user-preferences.md
  • TASK-005-implement-notification-api.md

Each task specifies:

  • Parent requirement (REQ-001)
  • Files to create/modify
  • Acceptance criteria
  • Dependencies on other tasks

The eval checks that tasks form a valid dependency graph (no cycles, all files accounted for). The agent commits: feat(REQ-001): add technical tasks for notification system.
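A cycle check and execution ordering fall out of the same traversal. This depth-first topological sort is a standard technique, sketched here with hypothetical names rather than taken from any specific engine:

```typescript
// Hypothetical dependency-graph check: tasks must form a DAG, and a
// topological sort gives the engine a valid execution order.
type DepGraph = Record<string, string[]>; // task -> tasks it depends on

function topoOrder(graph: DepGraph): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();
  const visit = (task: string) => {
    if (state.get(task) === "done") return;
    if (state.get(task) === "visiting") throw new Error(`cycle at ${task}`);
    state.set(task, "visiting");
    for (const dep of graph[task] ?? []) visit(dep); // dependencies first
    state.set(task, "done");
    order.push(task);
  };
  Object.keys(graph).forEach(visit);
  return order; // dependencies always precede dependents
}
```

Tasks that share no path in the graph can be dispatched in parallel; a thrown cycle error fails the eval before implementation starts.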

Phase 4: Implementation

The agent continues on the same agent/REQ-001-notification-system branch, now writing code. For each task:

  1. Agent writes code following the technical task spec
  2. Automated evals run: linting, tests, coverage checks
  3. Workflow engine commits with a message referencing the task ID

When all tasks are complete, the workflow engine pushes to remote and creates a pull request. This is the first point where a human enters the loop, reviewing the complete feature rather than individual phases. The PR diff shows everything: the requirement spec, architecture decisions, task breakdowns, and implementation code, all on one branch with a clear commit history:

# Files changed in agent/REQ-001-notification-system vs main
+ .sdlc/specs/REQ-001-notification-system/requirement.md
+ .sdlc/specs/REQ-001-notification-system/architecture.md
+ .sdlc/specs/REQ-001-notification-system/tasks/TASK-001-implement-notification-service.md
+ .sdlc/specs/REQ-001-notification-system/tasks/TASK-002-create-email-channel.md
+ ... (other task files)
+ src/notifications/notification.service.ts
+ src/notifications/channels/email.channel.ts
+ src/notifications/channels/push.channel.ts
+ src/notifications/preferences.service.ts
+ src/notifications/notification.controller.ts
+ tests/notifications/notification.service.spec.ts

The human reviews the complete feature (specs, architecture, tasks, and code) in one place. They can see the assumptions logged during requirements, the architectural decisions proposed, and how tasks map to implementation. They can approve and merge, request changes (the agent reworks on the branch), or add threaded comments on any artifact. The agent executes the full workflow autonomously, but humans review the complete output before it merges.

Traceability in action

Now imagine six months later, a new team member asks: “Why are we using Redis queues instead of a managed service like SQS?”

You can trace back:

  • src/notifications/notification.service.ts → Created by TASK-001
  • TASK-001 → Implements REQ-001 (notification system)
  • Architecture commit → Agent proposed Redis-backed queue with rationale: “Redis already in stack for caching; Bull/BullMQ handles retry logic out of the box”
  • PR approved → Decision documented in architecture.md

The decision trail is preserved in the folder structure and explicit links. This is especially valuable in regulated industries. When auditors ask why a particular choice was made, the chain of reasoning from business requirement through architecture to implementation is already there, with documented assumptions and review approvals at each step.

“Isn’t this just waterfall?”

Yes. That’s the point.

Waterfall got a bad reputation not because sequential phases are inherently wrong, but because the economics didn’t work: writing specs took months, requirements changed mid-flight, and by the time you reached implementation the documents were already stale. Teams responded by abandoning structure altogether, shipping faster but losing traceability.

Agents change the economics. When an agent can execute the full requirements → architecture → tasks → implementation cycle in hours, not months, you can afford the structure. Teams run multiple complete cycles per day. A product manager kicks off three competing feature experiments on Monday morning and reviews working implementations by afternoon. If requirements change, you don’t update stale documents. You run a fresh cycle with updated inputs.

The phased structure gives you what waterfall promised (traceability, architectural consistency, clear decision trails) without the cost that made it impractical. We’re not avoiding waterfall’s shape. We’re solving the problems that made it fail. And, ironically, by delivering multiple iterations a day, we can deliver on Agile’s original promise better than ever before.

Technical details that matter

Evaluations validate each artifact

As described earlier, each phase runs both deterministic checks and critic agent validation. Crucially, these evaluations run on the output artifacts (the files agents produce) not on conversational responses. Each template defines a Definition of Done that maps directly to eval checks. Here’s what that looks like concretely.

For requirements, deterministic checks include:

  • Required frontmatter present (for example: id, title, status)
  • All mandatory sections present (Description, Acceptance Criteria, External dependencies)
  • No orphan artifacts (every child references a valid parent)
  • Cross-references resolve (dependencies point to artifacts that exist)
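The first two of those checks are a few lines of string inspection over the raw markdown. This sketch assumes the section names shown in the REQ-001 example earlier; the regexes are illustrative, not a production frontmatter parser:

```typescript
// Hypothetical structural check over the raw markdown: frontmatter keys and
// mandatory sections must be present before the critic agent even runs.
const REQUIRED_FRONTMATTER = ["id", "title", "status"];
const REQUIRED_SECTIONS = ["Description", "Acceptance criteria", "External dependencies"];

function structuralErrors(markdown: string): string[] {
  const errors: string[] = [];
  const fm = /^---\n([\s\S]*?)\n---/.exec(markdown);
  if (!fm) return ["missing frontmatter block"];
  for (const key of REQUIRED_FRONTMATTER) {
    if (!new RegExp(`^${key}:`, "m").test(fm[1])) {
      errors.push(`missing frontmatter key: ${key}`);
    }
  }
  for (const section of REQUIRED_SECTIONS) {
    if (!markdown.includes(`## ${section}`)) {
      errors.push(`missing section: ${section}`);
    }
  }
  return errors; // empty array = structurally valid
}
```

Checks like this run in milliseconds, which is why they gate the artifact before any model is invoked.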

The critic agent handles judgment calls: “Can each acceptance criterion be verified with a test? Are any ambiguous?” It returns pass/fail with explanation. This is cheaper and faster than human review for catching obvious issues, and it means human reviewers see only artifacts that have already met a baseline quality threshold.

For code, the balance shifts toward deterministic checks:

  • Linters and formatters
  • Unit tests that the agent wrote alongside the code
  • Coverage thresholds (configurable per project)
  • Architecture compliance checks (for example, ensuring that notification dispatch logic lives in src/notifications/ and doesn't leak into controllers, verified by import analysis or AST checks)
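An import-analysis compliance check of that last kind might look like the following. The forbidden-import rule (controllers must not reach into channel internals) matches the example given above, but the rule table and regexes are assumptions for illustration:

```typescript
// Hypothetical architecture-compliance check via import analysis: controllers
// must not import channel internals directly (dispatch stays in the service).
const FORBIDDEN: Array<{ filePattern: RegExp; importPattern: RegExp }> = [
  // controllers may use the service, but not the channel implementations
  { filePattern: /\.controller\.ts$/, importPattern: /\/channels\// },
];

function complianceViolations(file: string, source: string): string[] {
  const imports = [...source.matchAll(/^import .* from ["'](.+)["'];?$/gm)]
    .map(m => m[1]);
  return FORBIDDEN
    .filter(rule => rule.filePattern.test(file))
    .flatMap(rule => imports.filter(i => rule.importPattern.test(i)))
    .map(i => `${file}: forbidden import "${i}"`);
}
```

A real project would likely use AST-based tooling rather than regexes, but even this crude version turns an architectural convention into a deterministic gate.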

The critic agent still runs on code, but focuses on higher-level concerns: does the implementation match the task spec? Are there obvious security issues the linter wouldn’t catch?

Workspace access and knowledge management

Agents need access to the full workspace, not just the current artifact, but context windows are finite. Memory systems sound appealing (“just remember everything”) but retrieved memories consume the same tokens needed for actual work.

Rather than each agent independently searching the workspace and managing its own context, we implemented a dedicated knowledge agent that other agents call as a tool. When a requirements agent, architecture agent, or coding agent encounters uncertainty (“What’s our email provider?” or “Do we support multi-tenancy?”), it calls the knowledge agent via a structured tool call.

The knowledge agent searches the project’s knowledge base (architecture docs, conventions, previous decisions) and returns one of two things:

  1. An answer: The knowledge base contains the information. The calling agent continues with a grounded decision.
  2. An assumption: The knowledge base doesn’t cover this. The knowledge agent logs the unanswered question alongside the assumption the calling agent will make in order to continue.
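The answer-or-assumption contract can be captured in a small type. This is a hypothetical shape of the tool interface, with an exact-match knowledge base standing in for real retrieval:

```typescript
// Hypothetical shape of the knowledge agent's tool interface: every call
// returns either a grounded answer or a logged assumption, never silence.
type KnowledgeResult =
  | { kind: "answer"; value: string; source: string }
  | { kind: "assumption"; question: string; assumed: string };

const assumptionLog: KnowledgeResult[] = []; // surfaces later in the PR

function askKnowledge(
  question: string,
  knowledgeBase: Record<string, { value: string; source: string }>,
  fallback: string, // what the calling agent will assume to keep moving
): KnowledgeResult {
  const hit = knowledgeBase[question];
  if (hit) return { kind: "answer", value: hit.value, source: hit.source };
  const assumption: KnowledgeResult = { kind: "assumption", question, assumed: fallback };
  assumptionLog.push(assumption); // becomes a discrete, reviewable PR item
  return assumption;
}
```

Because the return type forces one of two explicit outcomes, the calling agent can never silently proceed on an unrecorded guess.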

Because these interactions happen through structured tool calls, every question asked and every assumption made appears as structured data in the agent’s output. When the agent opens a pull request after completing all phases, reviewers see exactly which assumptions were made and why. Not buried in freeform notes, but as discrete, reviewable items in the diff.

This creates a feedback loop. When a reviewer approves a PR, they’re approving the assumptions alongside the code. When they reject an assumption, the agent reworks with corrected information. Either way, the answer feeds back into the knowledge base: the reviewer’s correction becomes a documented decision that future agents can find. Over time, the knowledge base grows organically from real questions agents encountered, filling gaps that no one thought to document upfront.

This approach also keeps individual agents focused: they don’t fill their context window on search and retrieval.

For simpler applications (single repository, straightforward architecture, small team), a file-based alternative can work well: agents write assumptions directly to files in the repository (e.g., .sdlc/knowledge/assumptions/) using a knowledge skill. The assumptions still get reviewed in PRs and improve the knowledge base over time.

The structured tool call approach becomes valuable at scale: multiple repositories, complex cross-cutting concerns, or when you need programmatic tracking of assumption patterns across features. Start simple; add the dedicated knowledge agent when the file-based approach breaks down. Typically that means assumption files becoming unwieldy, manually correlating assumptions across repos, or wanting to analyze assumption patterns programmatically.

What’s worked in practice?

For leaders

  • Build the end-to-end factory from day one. From the outset, ensure that agents execute the full workflow autonomously and humans only enter the loop when the agent opens a pull request with the complete feature. Interrupting mid-workflow destroys the speed advantage, and reviewing partial work without full context leads to worse decisions.
  • Optimize for throughput. Lean cross-functional teams kick off features, agents execute the full cycle, and the team reviews together. This turns weeks of sequential work into days of parallel exploration.
  • Rewire for agent-first. This is an organizational change, not just a technical one. If decisions live informally in Slack threads or hallway conversations, agents can’t scale your output. Write it down or accept that it won’t be part of the process.

For engineers

  • Create tight feedback loops with evals. The faster an agent gets feedback on whether its output is acceptable, the faster it iterates. Waiting for human review to discover an agent misunderstood the requirement is painful. Encoding that check in an eval catches it earlier.
  • Package domain expertise as portable skills. Define what an agent knows and can do as reusable modules, instead of hard-coding behavior in prompts. Many platforms now support this natively through agent skills (SKILL.md files): modular instructions that encode domain expertise. Workflows don't get rewritten for every team; agents stay stable and skills are the layer that adapts to each use case. Pair skills with evals and traces to monitor quality and roll forward/back like software.

Bottom line

The convention-based, two-layer approach is proving reliable across the teams we’ve worked with, from small teams building MVPs to enterprise teams in regulated industries with compliance requirements.

The less obvious lesson: this requires organizational change that goes beyond technical tooling. You can’t bolt agents onto existing processes. But agents produce structured artifacts cheaply enough that the overhead is minimal compared with the traceability you gain.

The teams getting the most value have shifted from asking “How do we make developers faster?” to “How do we run ten feature experiments before lunch?” Agents don’t just accelerate development; they change what’s economically viable to attempt.