Building an ECS Golden Path with Claude Code and AWS CDK


Adrian Taut

How AI-assisted infrastructure development requires a different mindset than typical application coding

At ZAR, we recently embarked on creating a Golden Path for ECS deployments: a set of opinionated, reusable AWS CDK constructs that encode our best practices for running containerized services. What made this project unique wasn’t just the technical challenge, but the approach. Pair-programming with Claude Code, Anthropic’s AI coding assistant, marked a departure from previous iterations on this theme.

This article shares our experience, the surprises we encountered, and why infrastructure-as-code presents unique challenges for AI-assisted development.

Our CDK Library: A Home for L3 Constructs

Before diving into the Golden Path, it’s worth explaining where these constructs live. We maintain an internal CDK library that serves as our repository for Layer 3 (L3) constructs.

Why a dedicated library?

AWS CDK provides three levels of constructs:

  • L1 (CloudFormation Resources): Direct mappings to CloudFormation resources
  • L2 (Curated Constructs): AWS-provided abstractions with sensible defaults
  • L3 (Patterns): Opinionated, higher-level abstractions combining multiple resources

L3 constructs encode organizational knowledge including our security policies, naming conventions, tagging standards, and operational patterns. Storing them in a versioned, shared library means:

  1. Consistency: Every team uses the same patterns
  2. Governance: Security and compliance requirements are baked in
  3. Velocity: New services get production-ready infrastructure in minutes
  4. Evolution: Improvements benefit all consumers automatically

Our library includes constructs for VPCs, Route53 Hosted Zones, shared ALBs, and now the ECS Golden Path.

What is a Golden Path?

A Golden Path (or “paved road”) is a pre-configured, opinionated way to accomplish a common task. For ECS, our Golden Path includes:

  • EcsFargateCluster: Creates ECS clusters with Container Insights V2, ECS Exec logging, and Fargate capacity providers
  • EcsFargateTaskDefinition: Task definitions with automatic logging, KMS permissions for secrets, and init process handling
  • EcsFargateService: Base Fargate service with auto-scaling and native Blue/Green deployment support
  • SharedAlbEcsFargateService: ALB-fronted service extending EcsFargateService, adding DNS and health checks

The goal? Allow teams to deploy a production-ready ECS service with ~20 lines of CDK code instead of 200+.

How the Constructs Connect

Each layer builds on the one below, adding capabilities while hiding complexity. Teams can use SharedAlbEcsFargateService for most web services, or drop down to EcsFargateService for internal services without ALB exposure.

The AI-Assisted Development Experience

What Worked Well

Claude excelled at several aspects of CDK development:

1. Boilerplate Generation

CDK constructs involve significant boilerplate — interfaces, props validation, JSDoc comments, type exports. Claude generated these consistently and correctly.

export interface EcsFargateTaskDefinitionProps extends EcsAppProps, VpcProps {
  /**
   * Container image to deploy.
   */
  readonly image: ecs.ContainerImage;

  /**
   * CPU units for the task.
   * @default 1024 (1 vCPU)
   */
  readonly cpu?: FargateCpuValue;
  // ... 20+ more props with defaults and documentation
}
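The `@default` tags in an interface like this imply a step where omitted props are filled in before any resource is created. Below is a minimal sketch of that step, not the library's actual code. `TaskSizingProps`, `resolveTaskSizing`, and the `memoryLimitMiB` prop with its default are assumptions made for illustration; only the 1024 CPU default comes from the interface above.

```typescript
// Illustrative only: these names are hypothetical, not the library's real API.
type FargateCpu = 256 | 512 | 1024 | 2048 | 4096;

interface TaskSizingProps {
  readonly cpu?: FargateCpu;
  // Assumed prop; the article does not show how memory is configured.
  readonly memoryLimitMiB?: number;
}

interface ResolvedTaskSizing {
  readonly cpu: FargateCpu;
  readonly memoryLimitMiB: number;
}

function resolveTaskSizing(props: TaskSizingProps = {}): ResolvedTaskSizing {
  // 1024 (1 vCPU) is the documented default from the interface above.
  const cpu: FargateCpu = props.cpu ?? 1024;

  // Assumed default: the smallest memory Fargate accepts for these CPU sizes,
  // which is twice the CPU units, expressed in MiB.
  const minimumMemoryMiB = cpu * 2;
  const memoryLimitMiB = props.memoryLimitMiB ?? minimumMemoryMiB;

  // Fail when the construct is created rather than at deploy time.
  if (memoryLimitMiB < minimumMemoryMiB) {
    throw new Error(
      `memoryLimitMiB ${memoryLimitMiB} is below the ${minimumMemoryMiB} MiB minimum for cpu ${cpu}`,
    );
  }
  return { cpu, memoryLimitMiB };
}
```

Resolving defaults in one function keeps the documented `@default` values and the runtime behavior in a single place, and lets a unit test catch an invalid size long before a deployment would.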

2. Test-Driven Development

We followed a TDD approach, and Claude proved adept at writing CDK assertions:

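import * as cdk from 'aws-cdk-lib';
import { Match, Template } from 'aws-cdk-lib/assertions';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';
// EcsFargateTaskDefinition comes from our internal CDK library (import omitted)

it('should add KMS decrypt permission when secrets are provided', () => {
  const stack = new cdk.Stack();
  // The secret has to exist in the stack before the task definition can reference it
  const secret = new secretsmanager.Secret(stack, 'Secret');

  new EcsFargateTaskDefinition(stack, 'TaskDef', {
    appName: 'TestApp',
    deployEnvironment: 'staging',
    image: ecs.ContainerImage.fromRegistry('nginx'),
    secrets: {
      DB_PASSWORD: ecs.Secret.fromSecretsManager(secret, 'password'),
    },
  });

  // Assert on the synthesized template: some IAM policy must allow kms:Decrypt
  Template.fromStack(stack).hasResourceProperties('AWS::IAM::Policy', {
    PolicyDocument: {
      Statement: Match.arrayWith([
        Match.objectLike({
          Action: 'kms:Decrypt',
          Effect: 'Allow',
        }),
      ]),
    },
  });
});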

3. AWS CLI for Resource Comparison

When migrating existing infrastructure to use our new constructs, Claude used AWS CLI commands to compare current vs. synthesized resources:

# Check current ECS cluster configuration
aws ecs describe-clusters --clusters ZarECSCluster

# Compare with CDK synth output
cdk synth --quiet && cat cdk.out/ZarCoreStack.template.json | jq '.Resources | to_entries[] | select(.value.Type == "AWS::ECS::Cluster")'

This was invaluable for ensuring migrations wouldn’t cause unexpected resource replacements.
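Once both sides are available as JSON, the comparison itself can be scripted. The sketch below is one possible shape for it, assuming the live settings come from the `settings` array in `aws ecs describe-clusters` output (requested with `--include SETTINGS`) and the synthesized ones from the cluster's `ClusterSettings` property in the template. The function name `diffClusterSettings` is hypothetical.

```typescript
// One entry under `clusters[].settings` in `aws ecs describe-clusters` output.
interface LiveClusterSetting {
  name: string;
  value: string;
}

// One entry under `Properties.ClusterSettings` in the synthesized template.
interface SynthClusterSetting {
  Name: string;
  Value: string;
}

// Returns one human-readable line per setting whose value differs.
function diffClusterSettings(
  live: LiveClusterSetting[],
  synth: SynthClusterSetting[],
): string[] {
  const liveByName = new Map<string, string>();
  live.forEach((s) => liveByName.set(s.name, s.value));
  const synthByName = new Map<string, string>();
  synth.forEach((s) => synthByName.set(s.Name, s.Value));

  // Union of setting names from both sides, sorted for stable output.
  const names = Array.from(
    new Set(live.map((s) => s.name).concat(synth.map((s) => s.Name))),
  ).sort();

  const differences: string[] = [];
  for (const name of names) {
    const liveValue = liveByName.get(name) ?? '(unset)';
    const synthValue = synthByName.get(name) ?? '(unset)';
    if (liveValue !== synthValue) {
      differences.push(`${name}: live=${liveValue} synthesized=${synthValue}`);
    }
  }
  return differences;
}
```

An empty result means the synthesized cluster settings match what is running; any line in the result is a change the next deployment would make.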

The Hard Parts: Why Infrastructure is Different

Here’s where things got interesting. Writing infrastructure code with AI assistance is fundamentally different from application code, and we learned this the hard way.

1. You Can’t Just “Rename Things”

In a Rails app, renaming a class is a refactor. In CDK, renaming a Construct ID can be catastrophic:

// Before
new ecs.Cluster(this, 'EcsFargateCluster', { ... });

// After - DANGER!
new ecs.Cluster(this, 'Cluster', { ... });

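CDK derives each resource's CloudFormation logical ID from its construct ID and path, so CloudFormation sees this one-word change as a different resource. It will:

  1. Create a NEW cluster with a different name
  2. Delete the OLD cluster
  3. Take down every service and task on the old cluster, causing downtime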

Claude initially suggested these kinds of “cleanup” refactors. We had to explicitly establish rules: never rename Construct IDs for existing infrastructure without understanding the replacement implications.
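A rule like this can be backed by an automated check. The sketch below compares the `Resources` section of two synthesized templates (before and after a refactor) and reports resource types where one logical ID disappeared and another appeared, which is the signature of a renamed Construct ID. The function name and the sample IDs in the usage note are hypothetical.

```typescript
// The `Resources` section of a CloudFormation template, reduced to what we need.
type TemplateResources = Record<string, { Type: string }>;

interface LogicalIdChurn {
  type: string;
  removed: string[]; // logical IDs present before, gone after
  added: string[];   // logical IDs absent before, present after
}

function findLogicalIdChurn(
  before: TemplateResources,
  after: TemplateResources,
): LogicalIdChurn[] {
  const has = (resources: TemplateResources, id: string): boolean =>
    Object.prototype.hasOwnProperty.call(resources, id);

  const byType = new Map<string, LogicalIdChurn>();
  const entryFor = (type: string): LogicalIdChurn => {
    let entry = byType.get(type);
    if (!entry) {
      entry = { type, removed: [], added: [] };
      byType.set(type, entry);
    }
    return entry;
  };

  for (const id of Object.keys(before)) {
    if (!has(after, id)) entryFor(before[id].Type).removed.push(id);
  }
  for (const id of Object.keys(after)) {
    if (!has(before, id)) entryFor(after[id].Type).added.push(id);
  }

  // A type with both removals and additions likely means a rename, i.e. a replacement.
  return Array.from(byType.values()).filter(
    (entry) => entry.removed.length > 0 && entry.added.length > 0,
  );
}
```

Given a "before" template containing a cluster under `EcsFargateClusterA1B2C3D4` and an "after" template containing one under `Cluster9F8E7D6C`, the function returns a single entry for `AWS::ECS::Cluster` listing both IDs. Purely new resources are ignored, since adding a resource does not replace anything.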

2. Testing Requires Real AWS Environments

Unlike unit tests that run in milliseconds, validating ECS constructs requires:

  • Actual cdk deploy operations (5–15 minutes)
  • Blue/Green deployments that fail at runtime, not during cdk synth
  • Load balancer health checks that time out
  • Secrets Manager permissions that only manifest at container startup

We couldn’t just run npm test and ship. Every significant change required deployment to a staging environment.

3. The Blast Radius is Different

A bug in application code might affect one user’s request. A bug in infrastructure code can:

  • Take down an entire service
  • Orphan resources costing money
  • Create security vulnerabilities across all environments
  • Cause data loss if databases are accidentally replaced

This required a different review process. We had Claude generate cdk diff outputs and analyzed them carefully before any deployment.
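Part of that review can be mechanized so the dangerous lines are never buried in a long diff. The sketch below filters captured `cdk diff` text down to the lines worth a second look. It assumes the diff flags replacements with the words "replace", "replaced", or "replacement" and marks removed resources with a leading `[-]`; treat that as an assumption about the output format to verify against your CDK version. The function name is hypothetical.

```typescript
// Returns the lines of a captured `cdk diff` that mention a replacement or a removal.
function findRiskyDiffLines(diffOutput: string): string[] {
  // Strip terminal color codes so matching works on raw captured output.
  const ansiColors = /\x1b\[[0-9;]*m/g;
  // Assumed markers: "replace", "may be replaced", "requires replacement", etc.
  const replacementWords = /\b(replace|replaced|replacement)\b/i;

  return diffOutput
    .split('\n')
    .map((line) => line.replace(ansiColors, '').trim())
    .filter((line) => replacementWords.test(line) || line.startsWith('[-]'));
}
```

An empty result is not proof that a deployment is safe, only that none of the assumed markers appeared. It works best as a first pass before a human reads the full diff.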

4. Native ECS Blue/Green Deployment Complexity

We chose native ECS Blue/Green deployments over CodeDeploy-backed deployments for simplicity. But even “native” comes with complexity:

deploymentController: {
  type: ecs.DeploymentControllerType.ECS,
},
circuitBreaker: { enable: true, rollback: true },

Native ECS Blue/Green still has constraints:

  • Deployment circuit breaker behavior affects rollback timing
  • Health check grace periods interact with deployment timeouts
  • Minimum/maximum healthy percent settings can cause deployment stalls
  • ALB target group draining impacts deployment speed

Claude could write the code, but understanding the operational implications required human judgment and real-world testing.
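The healthy-percent bullet above is easiest to see with numbers. This sketch works through the arithmetic, assuming the rounding ECS documents for these two settings: the minimum healthy count is rounded up and the maximum count is rounded down. The function and field names are hypothetical, and how each deployment strategy applies these limits is not modeled here.

```typescript
interface DeploymentHeadroom {
  minRunning: number; // fewest tasks that must stay running
  maxRunning: number; // most tasks allowed to run at once
  canStop: number;    // old tasks that may be stopped right now
  canStart: number;   // extra new tasks that may be started right now
  stalls: boolean;    // true when the scheduler can neither stop nor start a task
}

function deploymentHeadroom(
  desiredCount: number,
  minimumHealthyPercent: number,
  maximumPercent: number,
): DeploymentHeadroom {
  // Assumed rounding: minimum rounds up, maximum rounds down.
  const minRunning = Math.ceil((desiredCount * minimumHealthyPercent) / 100);
  const maxRunning = Math.floor((desiredCount * maximumPercent) / 100);

  const canStop = Math.max(0, desiredCount - minRunning);
  const canStart = Math.max(0, maxRunning - desiredCount);

  return {
    minRunning,
    maxRunning,
    canStop,
    canStart,
    stalls: canStop === 0 && canStart === 0,
  };
}
```

With a single task, `deploymentHeadroom(1, 100, 150)` reports a stall: 150% of one task rounds down to one, so no new task can start, and 100% minimum means the old one cannot stop. Raising the maximum to 200, or running four tasks at 50/200, leaves room to make progress.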

Migration Strategy

Migrating existing infrastructure to Golden Path constructs required careful planning:

  1. Inventory Current Resources: Used AWS CLI to document all existing resource names and configurations
  2. Match Logical IDs: Ensured our constructs generated the same CloudFormation logical IDs where resource replacement was not an option
  3. Use `cdk diff` Religiously: Every change was diffed before deployment
  4. Incremental Migration: Migrated one service at a time, not everything at once
  5. Keep Escape Hatches: Some legacy services needed to deviate from the Golden Path temporarily

Codifying the Learnings: A Claude Code Skill

After encountering the “you can’t just rename things” problem multiple times, we decided to codify these learnings into a Claude Code Skill — a reusable set of instructions that guides Claude when working on CDK infrastructure migrations.

The skill lives in our CDK library at `.claude/skills/cdk-infrastructure-migration/` and includes:

.claude/skills/cdk-infrastructure-migration/
├── SKILL.md                      # Main skill with rules and patterns
└── references/
    └── migration-patterns.md     # Detailed migration scenarios

What the skill covers:

  • Core Principles: Construct ID safety, resource naming best practices, the blast radius problem
  • Pre-Deployment Validation: Multi-layer strategy with cdk-nag, synthesis checks, and AWS CLI comparison
  • Migration Patterns: Migrating to L3 constructs, keeping logical IDs during refactoring, blue/green infrastructure
  • Common Pitfalls: Renaming stacks, changing construct hierarchy, updating L3 libraries
  • ECS-Specific Guidance: Native Blue/Green deployments, task definition changes, IAM role updates
  • Checklists: Pre-deployment checklist, testing strategy

Key rules encoded:

  • Construct IDs are sacred: Never rename without understanding CloudFormation implications
  • Always run `cdk diff` first: Look for replace operations before deploying
  • Compare with live resources: Use AWS CLI to verify expected vs. actual state
  • Understand the blast radius: Infrastructure bugs affect entire services, not single requests

The skill also includes a pre-deployment hook that warns before running cdk deploy:

hooks:
  PreToolUse:
    - matcher: Bash(cdk deploy*)
      command: echo "⚠️ STOP - Have you reviewed 'cdk diff' output for resource replacements?"

This turns our hard-learned lessons into guardrails that protect future development sessions — whether with Claude or human developers following the same patterns.

Lessons Learned

  1. Establish guardrails early: Create explicit rules about what AI can and cannot modify (Construct IDs, resource names, etc.)
  2. Always verify with real deployments: npm test passing means nothing if cdk deploy fails
  3. Leverage AI for documentation: Claude wrote excellent JSDoc comments and README content
  4. Have AI generate comparison commands: Using AWS CLI to verify expected vs. actual state
  5. Encode opinions, but allow escape hatches: Strong defaults, optional overrides
  6. Test with real services: Our test suite includes actual service deployments, not just unit tests
  7. Document the “why”: Every default should have a documented rationale
  8. Version carefully: Breaking changes in shared constructs affect all consumers

Conclusion

Building ECS Golden Path constructs with Claude Code was productive — we shipped faster than we would have alone. But it also highlighted that infrastructure-as-code requires a fundamentally different approach to AI-assisted development.

The code runs in production, not in isolation. The costliest mistakes aren’t caught by compilers or linters; they’re caught by failed deployments, downtime, or worse. AI can accelerate the journey, but humans must remain firmly in the driver’s seat for infrastructure decisions.

Our Golden Path is now in production, powering multiple services at ZAR. The constructs encode months of operational learning into reusable patterns.