Why Blue/Green, Not Rolling
Rolling deployments are seductive. They’re the default in most platforms and they feel safe — you’re incrementally replacing instances, so surely something will catch a bad deploy. In practice, rolling updates mean your users are hitting two versions simultaneously. For stateful services or APIs with schema changes, this window is a minefield.
Blue/green changes the contract: you build the entire new environment, verify it, and then flip a single traffic switch. Either the deploy succeeds completely or you’re back to the previous version in seconds. No half-states, no mixed-version sessions.
Blue/green is most valuable when your service is stateful, has external consumers with SLAs, or deploys database migrations alongside code. If you’re deploying a simple stateless job, rolling is usually fine.
The CDK Architecture
We’ll build on top of three AWS primitives: ECS for compute, CodeDeploy for the traffic switching logic, and CloudWatch Alarms as the gate that triggers automatic rollback.
Stack Structure
Keep blue/green infrastructure in its own CDK stack, separate from your core network stack. This lets you destroy and recreate the deployment machinery without touching your VPC or shared resources.
from aws_cdk import (
    Stack, Duration,
    aws_ecs as ecs,
    aws_codedeploy as codedeploy,
    aws_cloudwatch as cw,
)


class BlueGreenStack(Stack):
    def __init__(self, scope, construct_id, *, vpc, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # A dedicated cluster for the blue/green service. Container
        # Insights adds per-service CPU/memory metrics in CloudWatch.
        cluster = ecs.Cluster(
            self, "AppCluster",
            vpc=vpc,
            container_insights=True,
        )
Task Definition and Container
One pattern worth establishing early: keep your task definition separate from your service definition. ECS then tracks task definition revisions cleanly, and a rollback that pins a specific revision is much easier to reason about.
        task_def = ecs.FargateTaskDefinition(
            self, "TaskDef",
            cpu=512,
            memory_limit_mib=1024,
        )

        # "repo" is the ecr.IRepository holding the application image,
        # looked up or created elsewhere in the stack.
        container = task_def.add_container(
            "AppContainer",
            image=ecs.ContainerImage.from_ecr_repository(repo),
            # Container-level health check; ECS uses this to decide
            # whether a green task is healthy at all.
            health_check=ecs.HealthCheck(
                command=["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
                interval=Duration.seconds(30),
                timeout=Duration.seconds(5),
            ),
        )
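The deployment group in the next section also needs the ECS service itself, created with the CODE_DEPLOY deployment controller so that CodeDeploy, not the ECS scheduler, owns deployments. A minimal sketch, assuming networking and target-group attachment are handled elsewhere in the stack:

        # Hand deployment orchestration to CodeDeploy instead of the
        # default ECS rolling-update scheduler.
        service = ecs.FargateService(
            self, "Service",
            cluster=cluster,
            task_definition=task_def,
            desired_count=2,
            deployment_controller=ecs.DeploymentController(
                type=ecs.DeploymentControllerType.CODE_DEPLOY,
            ),
        )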
CodeDeploy Configuration
The secret to reliable blue/green is treating your CodeDeploy lifecycle hooks seriously. Most tutorials skip past them, but they're where you validate your new environment before it receives any real traffic; a sketch of such a hook follows the deployment group below.
        # "service" is the Fargate service above; the target groups and
        # listeners come from the ALB defined elsewhere in the stack.
        deployment_group = codedeploy.EcsDeploymentGroup(
            self, "DeploymentGroup",
            service=service,
            blue_green_deployment_config=codedeploy.EcsBlueGreenDeploymentConfig(
                blue_target_group=blue_tg,
                green_target_group=green_tg,
                listener=production_listener,
                test_listener=test_listener,
                # How long blue stays alive after the cutover: the window
                # in which an alarm-triggered rollback is instant.
                termination_wait_time=Duration.minutes(10),
            ),
            alarms=[p95_alarm, error_rate_alarm],
            auto_rollback=codedeploy.AutoRollbackConfig(
                failed_deployment=True,
                deployment_in_alarm=True,
            ),
        )
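The hooks themselves are Lambda functions referenced from your AppSpec, typically at AfterAllowTestTraffic, when green is live behind the test listener but production traffic has not moved. A rough sketch of such a handler; the TEST_ENDPOINT variable and the bare /health probe are assumptions, not part of the stack above:

import os
import urllib.request

import boto3

codedeploy_client = boto3.client("codedeploy")

def handler(event, context):
    """AfterAllowTestTraffic hook: probe the green tasks through the
    test listener, then report the verdict back to CodeDeploy."""
    status = "Succeeded"
    try:
        url = os.environ["TEST_ENDPOINT"]  # e.g. the test listener's /health URL
        with urllib.request.urlopen(url, timeout=5) as resp:
            if resp.status != 200:
                status = "Failed"
    except Exception:
        status = "Failed"

    # CodeDeploy pauses the deployment until the hook reports a status.
    codedeploy_client.put_lifecycle_event_hook_execution_status(
        deploymentId=event["DeploymentId"],
        lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
        status=status,
    )

If the hook reports Failed, CodeDeploy aborts the deployment before the production listener ever points at green.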
The Alarm Gate
This is the most underrated part of a good blue/green setup. Your CloudWatch alarms are the canary. If error rates or p95 latency spike within your bake time window, CodeDeploy automatically rolls back without human intervention.
Target two alarms at minimum: one on HTTP 5xx error rate from the ALB, and one on p95 latency. Wire them to the deployment group and set a bake time of at least 10 minutes.
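In CDK terms that looks something like the sketch below; alb is assumed to be the ApplicationLoadBalancer in front of the service, and the thresholds are placeholders to tune against your own baseline:

        error_rate_alarm = cw.Alarm(
            self, "Alb5xxAlarm",
            metric=cw.Metric(
                namespace="AWS/ApplicationELB",
                metric_name="HTTPCode_Target_5XX_Count",
                dimensions_map={"LoadBalancer": alb.load_balancer_full_name},
                statistic="Sum",
                period=Duration.minutes(1),
            ),
            threshold=10,  # placeholder: tune to your traffic volume
            evaluation_periods=3,
            # An absence of 5xx datapoints should not block the deploy.
            treat_missing_data=cw.TreatMissingData.NOT_BREACHING,
        )

        p95_alarm = cw.Alarm(
            self, "P95LatencyAlarm",
            metric=cw.Metric(
                namespace="AWS/ApplicationELB",
                metric_name="TargetResponseTime",
                dimensions_map={"LoadBalancer": alb.load_balancer_full_name},
                statistic="p95",
                period=Duration.minutes(1),
            ),
            threshold=1.5,  # seconds; placeholder
            evaluation_periods=3,
        )

Both alarms go into the alarms list on the deployment group, which is what arms deployment_in_alarm rollback.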
Rollback Testing
The worst time to discover your rollback is broken is during an incident. Build rollback verification into your CI pipeline: deploy a deliberately broken version, verify that alarms fire and CodeDeploy rolls back within your SLA window, then confirm production traffic returned to the blue environment.
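A sketch of what that CI check might look like, polling the deliberately broken deployment with boto3; the function name, timeout, and poll interval are arbitrary, and deployment_id comes from whatever triggered the bad deploy:

import time

import boto3

cd = boto3.client("codedeploy")

def assert_rolled_back(deployment_id: str, timeout_s: int = 1800) -> None:
    """Fail the pipeline unless the broken deployment stops and
    CodeDeploy launches a rollback within the window."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        info = cd.get_deployment(deploymentId=deployment_id)["deploymentInfo"]
        if info["status"] in ("Stopped", "Failed"):
            # rollbackInfo appears once CodeDeploy starts the rollback.
            if "rollbackDeploymentId" not in info.get("rollbackInfo", {}):
                raise AssertionError("deployment failed but never rolled back")
            return
        time.sleep(30)
    raise TimeoutError("deployment neither failed nor rolled back in time")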
An untested rollback is not a rollback. It’s a plan that might work.
Wrapping Up
Blue/green on ECS is not simple to set up, but once it’s running you’ll wonder how you shipped without it. The full CDK code from this article is in the awsomecloud/examples repo under blue-green-ecs/.