Building an ECS Golden Path with Claude Code and AWS CDK

How AI-assisted infrastructure development requires a different mindset than typical application coding

At ZAR, we recently embarked on creating a Golden Path for ECS deployments: a set of opinionated, reusable AWS CDK constructs that encode our best practices for running containerized services. What made this project unique wasn’t just the technical challenge, but the approach. Pair-programming with Claude Code, Anthropic’s AI coding assistant, marked a departure from previous iterations on this theme.

This article shares our experience, the surprises we encountered, and why infrastructure-as-code presents unique challenges for AI-assisted development.

Our CDK Library: A Home for L3 Constructs

Before diving into the Golden Path, it’s worth explaining where these constructs live. We maintain an internal CDK library that serves as our repository for Layer 3 (L3) constructs.

Why a dedicated library?
AWS CDK provides three levels of constructs:

L1 (CloudFormation Resources): Direct mappings to CloudFormation resources
L2 (Curated Constructs): AWS-provided abstractions with sensible defaults
L3 (Patterns): Opinionated, higher-level abstractions combining multiple resources

L3 constructs encode organizational knowledge including our security policies, naming conventions, tagging standards, and operational patterns. Storing them in a versioned, shared library means:

Consistency: Every team uses the same patterns
Governance: Security and compliance requirements are baked in
Velocity: New services get production-ready infrastructure in minutes
Evolution: Improvements benefit all consumers automatically

Our library includes constructs for VPCs, Route53 Hosted Zones, shared ALBs, and now the ECS Golden Path.

What is a Golden Path?

A Golden Path (or “paved road”) is a pre-configured, opinionated way to accomplish a common task. For ECS, our Golden Path includes:

EcsFargateCluster: Creates ECS clusters with Container Insights V2, ECS Exec logging, and Fargate capacity providers
EcsFargateTaskDefinition: Task definitions with automatic logging, KMS permissions for secrets, and init process handling
EcsFargateService: Base Fargate service with auto-scaling and native Blue/Green deployment support
SharedAlbEcsFargateService: ALB-fronted service extending EcsFargateService, adding DNS and health checks

The goal? Allow teams to deploy a production-ready ECS service with ~20 lines of CDK code instead of 200+.

How the Constructs Connect

Each layer builds on the one below, adding capabilities while hiding complexity. Teams can use SharedAlbEcsFargateService for most web services, or drop down to EcsFargateService for internal services without ALB exposure.

The AI-Assisted Development Experience

What Worked Well

Claude excelled at several aspects of CDK development:

1. Boilerplate Generation

CDK constructs involve significant boilerplate — interfaces, props validation, JSDoc comments, type exports. Claude generated these consistently and correctly.

export interface EcsFargateTaskDefinitionProps extends EcsAppProps, VpcProps {
  /**
   * Container image to deploy.
   */
  readonly image: ecs.ContainerImage;

  /**
   * CPU units for the task.
   * @default 1024 (1 vCPU)
   */
  readonly cpu?: FargateCpuValue;
  // ... 20+ more props with defaults and documentation
}
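To make the "props validation" part concrete, here is a minimal sketch of how a default could be resolved and a CPU/memory pair checked before synthesis. The `FargateCpuValue` definition, the `memoryLimitMiB` prop, the `resolveTaskSize` helper, and the default memory of twice the CPU value are all assumptions for illustration; the article does not show the library's real definitions. The table covers only the Fargate sizes up to 4 vCPU.

```typescript
// Hypothetical definition; the library's real FargateCpuValue type is not shown in the article.
type FargateCpuValue = 256 | 512 | 1024 | 2048 | 4096;

interface TaskSizeProps {
  readonly cpu?: FargateCpuValue;
  readonly memoryLimitMiB?: number; // assumed prop name
}

// Builds an inclusive list of memory sizes in MiB.
function range(min: number, max: number, step: number): number[] {
  const values: number[] = [];
  for (let v = min; v <= max; v += step) values.push(v);
  return values;
}

// Memory sizes (MiB) that Fargate accepts for each CPU size up to 4 vCPU.
const VALID_MEMORY: Record<FargateCpuValue, readonly number[]> = {
  256: [512, 1024, 2048],
  512: range(1024, 4096, 1024),
  1024: range(2048, 8192, 1024),
  2048: range(4096, 16384, 1024),
  4096: range(8192, 30720, 1024),
};

function resolveTaskSize(props: TaskSizeProps = {}): { cpu: FargateCpuValue; memoryLimitMiB: number } {
  const cpu: FargateCpuValue = props.cpu ?? 1024; // documented default: 1 vCPU
  const memoryLimitMiB = props.memoryLimitMiB ?? cpu * 2; // assumed default: smallest valid size
  if (VALID_MEMORY[cpu].indexOf(memoryLimitMiB) === -1) {
    throw new Error(
      `memoryLimitMiB ${memoryLimitMiB} is not valid for cpu ${cpu}; expected one of ${VALID_MEMORY[cpu].join(', ')}`,
    );
  }
  return { cpu, memoryLimitMiB };
}

const halfVcpu = resolveTaskSize({ cpu: 512 }); // { cpu: 512, memoryLimitMiB: 1024 }
```

Failing in the constructor like this surfaces an invalid combination during `cdk synth` instead of partway through a deployment.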

2. Test-Driven Development

We followed a TDD approach, and Claude proved adept at writing CDK assertions:

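import * as cdk from 'aws-cdk-lib';
import { Match, Template } from 'aws-cdk-lib/assertions';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
// EcsFargateTaskDefinition is imported from our internal CDK library.

it('should add KMS decrypt permission when secrets are provided', () => {
  const stack = new cdk.Stack();
  // Secret referenced by the container, created here so the test is self-contained.
  const secret = new secretsmanager.Secret(stack, 'DbSecret');

  new EcsFargateTaskDefinition(stack, 'TaskDef', {
    appName: 'TestApp',
    deployEnvironment: 'staging',
    image: ecs.ContainerImage.fromRegistry('nginx'),
    secrets: {
      DB_PASSWORD: ecs.Secret.fromSecretsManager(secret, 'password'),
    },
  });

  // The task's IAM policy must contain an Allow statement for kms:Decrypt.
  Template.fromStack(stack).hasResourceProperties('AWS::IAM::Policy', {
    PolicyDocument: {
      Statement: Match.arrayWith([
        Match.objectLike({
          Action: 'kms:Decrypt',
          Effect: 'Allow',
        }),
      ]),
    },
  });
});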

3. AWS CLI for Resource Comparison

When migrating existing infrastructure to use our new constructs, Claude used AWS CLI commands to compare current vs. synthesized resources:

# Check current ECS cluster configuration
aws ecs describe-clusters --clusters ZarECSCluster

# Compare with CDK synth output
cdk synth --quiet && jq '.Resources | to_entries[] | select(.value.Type == "AWS::ECS::Cluster")' cdk.out/ZarCoreStack.template.json

This was invaluable for ensuring migrations wouldn’t cause unexpected resource replacements.
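The same comparison can be scripted once both JSON documents are on disk. The sketch below is illustrative and not the tooling used in the project. It assumes the live cluster was fetched with `aws ecs describe-clusters --include SETTINGS`, that the synthesized `ClusterName` (if set) is a literal string, and `compareCluster` is a hypothetical helper name.

```typescript
// Subset of one entry in the `clusters` array returned by describe-clusters.
interface LiveCluster {
  clusterName: string;
  settings?: { name: string; value: string }[];
}

// Subset of the Properties of a synthesized AWS::ECS::Cluster resource.
// Assumes ClusterName, when present, is a literal string and not an intrinsic function.
interface SynthesizedClusterProps {
  ClusterName?: string;
  ClusterSettings?: { Name: string; Value: string }[];
}

// Returns one human-readable line per difference between live and synthesized state.
function compareCluster(live: LiveCluster, synthesized: SynthesizedClusterProps): string[] {
  const mismatches: string[] = [];
  if (synthesized.ClusterName !== undefined && synthesized.ClusterName !== live.clusterName) {
    mismatches.push(`ClusterName: live=${live.clusterName}, synthesized=${synthesized.ClusterName}`);
  }

  const liveSettings: Record<string, string> = {};
  for (const s of live.settings ?? []) liveSettings[s.name] = s.value;
  const synthSettings: Record<string, string> = {};
  for (const s of synthesized.ClusterSettings ?? []) synthSettings[s.Name] = s.Value;

  // Check every setting name that appears on either side.
  const names = Object.keys(liveSettings).concat(Object.keys(synthSettings));
  const unique = names.filter((name, i) => names.indexOf(name) === i);
  for (const name of unique) {
    if (liveSettings[name] !== synthSettings[name]) {
      mismatches.push(`${name}: live=${liveSettings[name]}, synthesized=${synthSettings[name]}`);
    }
  }
  return mismatches;
}

const clusterMismatches = compareCluster(
  { clusterName: 'ZarECSCluster', settings: [{ name: 'containerInsights', value: 'enabled' }] },
  { ClusterName: 'ZarECSCluster', ClusterSettings: [{ Name: 'containerInsights', Value: 'enhanced' }] },
);
// clusterMismatches is ['containerInsights: live=enabled, synthesized=enhanced']
```

An empty result does not prove a migration is safe, but a non-empty one is a cheap signal to stop and read the diff.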

The Hard Parts: Why Infrastructure is Different

Here’s where things got interesting. Writing infrastructure code with AI assistance is fundamentally different from application code, and we learned this the hard way.

1. You Can’t Just “Rename Things”

In a Rails app, renaming a class is a refactor. In CDK, renaming a Construct ID can be catastrophic:

// Before
new ecs.Cluster(this, 'EcsFargateCluster', { ... });

// After - DANGER!
new ecs.Cluster(this, 'Cluster', { ... });

This simple change alters the cluster’s CloudFormation logical ID, so CloudFormation will:

Create a new cluster under the new logical ID
Delete the old cluster
Take down the services and tasks running on the old cluster, causing downtime

Claude initially suggested these kinds of “cleanup” refactors. We had to explicitly establish rules: never rename Construct IDs for existing infrastructure without understanding the replacement implications.
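One way to back that rule with a check is to compare the logical IDs of the template synthesized before a refactor with the one synthesized after it. The sketch below works on plain template JSON and does not use the CDK itself. `findLikelyRenames` is a hypothetical helper, and the hash suffixes in the sample IDs are made up.

```typescript
// Minimal shape of a synthesized CloudFormation template (only what this sketch needs).
interface CfnTemplate {
  Resources: Record<string, { Type: string }>;
}

interface LikelyRename {
  type: string;
  from: string;
  to: string;
}

// Pairs each logical ID that disappeared with a new logical ID of the same resource type.
// Every pair is a resource that CloudFormation would create anew and then delete.
function findLikelyRenames(before: CfnTemplate, after: CfnTemplate): LikelyRename[] {
  const beforeIds = new Set(Object.keys(before.Resources));
  const afterIds = new Set(Object.keys(after.Resources));
  const removed = Object.keys(before.Resources).filter((id) => !afterIds.has(id));
  const added = Object.keys(after.Resources).filter((id) => !beforeIds.has(id));

  const renames: LikelyRename[] = [];
  for (const from of removed) {
    const type = before.Resources[from].Type;
    const index = added.findIndex((id) => after.Resources[id].Type === type);
    if (index !== -1) {
      renames.push({ type, from, to: added[index] });
      added.splice(index, 1); // each new ID can explain only one removal
    }
  }
  return renames;
}

// Illustrative templates for the Construct ID rename shown above.
const beforeTemplate: CfnTemplate = {
  Resources: {
    EcsFargateCluster1A2B3C4D: { Type: 'AWS::ECS::Cluster' },
    AppLogGroup9F8E7D6C: { Type: 'AWS::Logs::LogGroup' },
  },
};
const afterTemplate: CfnTemplate = {
  Resources: {
    Cluster5E6F7A8B: { Type: 'AWS::ECS::Cluster' },
    AppLogGroup9F8E7D6C: { Type: 'AWS::Logs::LogGroup' },
  },
};

const renames = findLikelyRenames(beforeTemplate, afterTemplate);
// renames[0] is { type: 'AWS::ECS::Cluster', from: 'EcsFargateCluster1A2B3C4D', to: 'Cluster5E6F7A8B' }
```

Matching by resource type is a heuristic: it cannot tell a rename from an unrelated removal and addition of the same type, so any hit still needs a human look.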

2. Testing Requires Real AWS Environments

Unlike unit tests that run in milliseconds, validating ECS constructs requires:

Actual cdk deploy operations (5–15 minutes)
Blue/Green deployments that fail at runtime, not during cdk synth
Load balancer health checks that time out
Secrets Manager permissions that only manifest at container startup

We couldn’t just run npm test and ship. Every significant change required deployment to a staging environment.

3. The Blast Radius is Different

A bug in application code might affect one user’s request. A bug in infrastructure code can:

Take down an entire service
Orphan resources costing money
Create security vulnerabilities across all environments
Cause data loss if databases are accidentally replaced

This required a different review process. We had Claude generate cdk diff outputs and analyzed them carefully before any deployment.
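Reading `cdk diff` by eye was the process we relied on. As a complementary sketch, the same question ("will anything be removed or replaced?") can be asked of a CloudFormation change set, whose JSON marks each resource change with an `Action` and a `Replacement` value. This assumes a change set has already been created for the stack and fetched with `aws cloudformation describe-change-set`; `findRiskyChanges` and the sample logical IDs are hypothetical.

```typescript
// Subset of the fields CloudFormation returns for each resource change.
interface ResourceChange {
  Action: 'Add' | 'Modify' | 'Remove' | 'Import' | 'Dynamic';
  LogicalResourceId: string;
  ResourceType: string;
  Replacement?: 'True' | 'False' | 'Conditional';
}

interface Change {
  ResourceChange?: ResourceChange;
}

// Keeps removals and any modification that will or might replace the resource.
function findRiskyChanges(changes: Change[]): ResourceChange[] {
  const risky: ResourceChange[] = [];
  for (const change of changes) {
    const rc = change.ResourceChange;
    if (rc === undefined) continue;
    const mayReplace =
      rc.Action === 'Modify' && (rc.Replacement === 'True' || rc.Replacement === 'Conditional');
    if (rc.Action === 'Remove' || mayReplace) risky.push(rc);
  }
  return risky;
}

// Illustrative change set contents.
const sampleChanges: Change[] = [
  { ResourceChange: { Action: 'Add', LogicalResourceId: 'NewAlarm', ResourceType: 'AWS::CloudWatch::Alarm' } },
  { ResourceChange: { Action: 'Modify', LogicalResourceId: 'Service', ResourceType: 'AWS::ECS::Service', Replacement: 'False' } },
  { ResourceChange: { Action: 'Modify', LogicalResourceId: 'TaskDef', ResourceType: 'AWS::ECS::TaskDefinition', Replacement: 'True' } },
  { ResourceChange: { Action: 'Remove', LogicalResourceId: 'OldCluster', ResourceType: 'AWS::ECS::Cluster' } },
];

const riskyIds = findRiskyChanges(sampleChanges).map((rc) => rc.LogicalResourceId);
// riskyIds is ['TaskDef', 'OldCluster']
```

Not every flagged change is a problem (task definitions are replaced on every revision, for example), so the output is a prompt for review and not a verdict.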

4. Native ECS Blue/Green Deployment Complexity

We chose native ECS Blue/Green deployments over CodeDeploy-backed deployments for simplicity. But even “native” comes with complexity:

deploymentController: {
  type: ecs.DeploymentControllerType.ECS,
},
circuitBreaker: { enable: true, rollback: true },

Native ECS Blue/Green still has constraints:

Deployment circuit breaker behavior affects rollback timing
Health check grace periods interact with deployment timeouts
Minimum/maximum healthy percent settings can cause deployment stalls
ALB target group draining impacts deployment speed

Claude could write the code, but understanding the operational implications required human judgment and real-world testing.
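The healthy-percent stall is easier to see with numbers. The sketch below applies the rounding rules ECS documents for rolling updates: `minimumHealthyPercent` is rounded up to a task count and `maximumPercent` is rounded down. It is worked arithmetic only. The `deploymentBounds` helper is hypothetical, and how these settings interact with a Blue/Green configuration is something to confirm in staging.

```typescript
interface DeploymentBounds {
  minRunning: number; // tasks that must stay running during the deployment
  maxRunning: number; // tasks allowed to be running or pending at once
  canStop: number;    // old tasks that may be stopped before new ones are healthy
  canStart: number;   // extra tasks that may be started above the desired count
  stalls: boolean;    // true when the scheduler can neither stop nor start a task
}

function deploymentBounds(
  desiredCount: number,
  minimumHealthyPercent: number,
  maximumPercent: number,
): DeploymentBounds {
  const minRunning = Math.ceil((desiredCount * minimumHealthyPercent) / 100);
  const maxRunning = Math.floor((desiredCount * maximumPercent) / 100);
  const canStop = Math.max(0, desiredCount - minRunning);
  const canStart = Math.max(0, maxRunning - desiredCount);
  return { minRunning, maxRunning, canStop, canStart, stalls: canStop === 0 && canStart === 0 };
}

// A single task with 100% / 100% leaves no room to stop or start anything.
const pinned = deploymentBounds(1, 100, 100);
// pinned.stalls is true

// Four tasks with 50% / 150% can stop two old tasks and start two new ones.
const roomy = deploymentBounds(4, 50, 150);
// roomy.canStop is 2, roomy.canStart is 2, roomy.stalls is false
```

Small services are the usual victims: with a desired count of 1, a minimum of 50% still rounds up to one running task, so headroom has to come from `maximumPercent`.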

Migration Strategy

Migrating existing infrastructure to Golden Path constructs required careful planning:

Inventory Current Resources: Used AWS CLI to document all existing resource names and configurations
Match Logical IDs: Ensured our constructs generated the same CloudFormation logical IDs where resource replacement was not an option
Use `cdk diff` Religiously: Every change was diffed before deployment
Incremental Migration: Migrated one service at a time, not everything at once
Keep Escape Hatches: Some legacy services needed to deviate from the Golden Path temporarily

Codifying the Learnings: A Claude Code Skill

After encountering the “you can’t just rename things” problem multiple times, we decided to codify these learnings into a Claude Code Skill — a reusable set of instructions that guides Claude when working on CDK infrastructure migrations.

The skill lives in our CDK library at `.claude/skills/cdk-infrastructure-migration/` and includes:

.claude/skills/cdk-infrastructure-migration/
├── SKILL.md                    # Main skill with rules and patterns
└── references/
    └── migration-patterns.md   # Detailed migration scenarios

What the skill covers:

Core Principles: Construct ID safety, resource naming best practices, the blast radius problem
Pre-Deployment Validation: Multi-layer strategy with cdk-nag, synthesis checks, and AWS CLI comparison
Migration Patterns: Migrating to L3 constructs, keeping logical IDs during refactoring, blue/green infrastructure
Common Pitfalls: Renaming stacks, changing construct hierarchy, updating L3 libraries
ECS-Specific Guidance: Native Blue/Green deployments, task definition changes, IAM role updates
Checklists: Pre-deployment checklist, testing strategy

Key rules encoded:

Construct IDs are sacred: Never rename without understanding CloudFormation implications
Always run `cdk diff` first: Look for replace operations before deploying
Compare with live resources: Use AWS CLI to verify expected vs. actual state
Understand the blast radius: Infrastructure bugs affect entire services, not single requests

The skill also includes a pre-deployment hook that warns before running cdk deploy:

hooks:
  PreToolUse:
    - matcher: Bash(cdk deploy*)
      command: echo "⚠️ STOP - Have you reviewed 'cdk diff' output for resource replacements?"

This turns our hard-learned lessons into guardrails that protect future development sessions — whether with Claude or human developers following the same patterns.

Lessons Learned

Establish guardrails early: Create explicit rules about what AI can and cannot modify (Construct IDs, resource names, etc.)
Always verify with real deployments: npm test passing means nothing if cdk deploy fails
Leverage AI for documentation: Claude wrote excellent JSDoc comments and README content
Have AI generate comparison commands: Use AWS CLI output to verify expected vs. actual state
Encode opinions, but allow escape hatches: Strong defaults, optional overrides
Test with real services: Our test suite includes actual service deployments, not just unit tests
Document the “why”: Every default should have a documented rationale
Version carefully: Breaking changes in shared constructs affect all consumers

Conclusion

Building ECS Golden Path constructs with Claude Code was productive — we shipped faster than we would have alone. But it also highlighted that infrastructure-as-code requires a fundamentally different approach to AI-assisted development.

The code runs in production, not in isolation. Many mistakes aren’t caught by compilers or linters; they’re caught by failed deployments, downtime, or worse. AI can accelerate the journey, but humans must remain firmly in the driver’s seat for infrastructure decisions.

Our Golden Path is now in production, powering multiple services at ZAR. The constructs encode months of operational learning into reusable patterns.