feat(ecs_service): native ECS deployment strategies (rolling/blue_green/linear/canary)#5
feat(ecs_service): native ECS deployment strategies (rolling/blue_green/linear/canary)#5mabadir wants to merge 3 commits into
Conversation
…en/linear/canary) Replace the CodeDeploy/external-controller blue/green wiring with the ECS deployment controller's built-in strategies: - deployment_type now accepts rolling | blue_green | linear | canary; the deployment controller is always ECS (CODE_DEPLOY mapping removed) - new deployment_strategy_config seeds bake time + canary/linear tuning at create time; deployment_configuration is in ignore_changes because the Flightcontrol deploy manager passes the authoritative config (including pause lifecycle hooks) on every UpdateService call - native strategies wire load_balancer.advanced_configuration: alternate target group (tg-2), production listener rule (first ALB rule or the NLB listener), optional test_listener_rule_arn, and a new ECS infrastructure role (AmazonECSInfrastructureRolePolicyForLoadBalancers) - target groups tg-1/tg-2 are now production/alternate and gate on any native traffic-shift strategy, not just blue_green; outputs renamed accordingly (production_/alternate_target_group_*) and ecs_infrastructure_role_arn added - provider floor bumped to aws >= 6.21 (linear/canary configuration) Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Always provision the production/alternate target-group pair, the ECS infrastructure role, and the service's load_balancer advanced_configuration whenever a load balancer is attached — not just for native traffic-shift deployment_types. Rolling deployments serve from the production target group only, so any service can now switch between rolling / blue_green / linear / canary on a single UpdateService call with zero Terraform changes; deployment_type is purely the create-time seed. Also fixes the module test suite: mock the iam_policy_document / partition / region / vpc data sources and computed ARNs so all 11 runs execute (basic_service was failing on main and skipping the rest). Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
| production_listener_rule = ( | ||
| local.enable_nlb_listener | ||
| ? aws_lb_listener.nlb[0].arn | ||
| : aws_lb_listener_rule.alb["0"].arn | ||
| ) |
There was a problem hiding this comment.
Only the first ALB listener rule participates in traffic shifting
production_listener_rule is hardcoded to alb["0"].arn. The AWS ECS advanced_configuration API only accepts a single production listener rule ARN, so during blue_green/linear/canary deployments ECS rewrites only this one rule. Any additional listener rules (alb["1"], alb["2"], …) continue forwarding to tg_1 for the entire deployment — traffic on those rules never shifts to the new revision. Consider adding a validation block that requires length(var.load_balancer_attachment.listener_rules) <= 1 when using a native traffic-shift strategy, or document this constraint prominently so callers with multi-rule services aren't caught off guard.
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ecs_service.tf
Line: 102-106
Comment:
**Only the first ALB listener rule participates in traffic shifting**
`production_listener_rule` is hardcoded to `alb["0"].arn`. The AWS ECS `advanced_configuration` API only accepts a single production listener rule ARN, so during blue_green/linear/canary deployments ECS rewrites only this one rule. Any additional listener rules (`alb["1"]`, `alb["2"]`, …) continue forwarding to tg_1 for the entire deployment — traffic on those rules never shifts to the new revision. Consider adding a validation block that requires `length(var.load_balancer_attachment.listener_rules) <= 1` when using a native traffic-shift strategy, or document this constraint prominently so callers with multi-rule services aren't caught off guard.
How can I resolve this? If you propose a fix, please make it concise.| dynamic "load_balancer" { | ||
| for_each = local.enable_load_balancer && var.deployment_type == "blue_green" ? [1] : [] | ||
| for_each = local.enable_load_balancer ? [1] : [] | ||
| content { | ||
| target_group_arn = aws_lb_target_group.tg_1[0].arn | ||
| container_name = local.lb_container_name | ||
| container_port = local.lb_container_port | ||
|
|
||
| advanced_configuration { | ||
| alternate_target_group_arn = aws_lb_target_group.tg_2[0].arn | ||
| production_listener_rule = ( | ||
| local.enable_nlb_listener | ||
| ? aws_lb_listener.nlb[0].arn | ||
| : aws_lb_listener_rule.alb["0"].arn | ||
| ) | ||
| test_listener_rule = var.test_listener_rule_arn | ||
| role_arn = aws_iam_role.ecs_infrastructure[0].arn | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
advanced_configuration wired unconditionally for rolling deployments — not validated against real AWS API
The whole design bet is that AWS silently accepts advanced_configuration (alternate TG, production listener rule, infrastructure role) when deployment_configuration is absent (the rolling default). All tests run with command = plan against mocked providers, so the AWS API's acceptance of this combination has not been exercised. If AWS rejects advanced_configuration on a CreateService call that carries no deployment_configuration, every rolling service with a load balancer (the module default) would fail to provision. A real tofu apply against AWS — even in a scratch account — should be added to the test suite to confirm this assumption before merging.
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ecs_service.tf
Line: 93-111
Comment:
**`advanced_configuration` wired unconditionally for rolling deployments — not validated against real AWS API**
The whole design bet is that AWS silently accepts `advanced_configuration` (alternate TG, production listener rule, infrastructure role) when `deployment_configuration` is absent (the rolling default). All tests run with `command = plan` against mocked providers, so the AWS API's acceptance of this combination has not been exercised. If AWS rejects `advanced_configuration` on a `CreateService` call that carries no `deployment_configuration`, every rolling service with a load balancer (the module default) would fail to provision. A real `tofu apply` against AWS — even in a scratch account — should be added to the test suite to confirm this assumption before merging.
How can I resolve this? If you propose a fix, please make it concise.| output "target_group_arn" { | ||
| description = "The ARN of the target group (null if load balancer disabled or blue/green deployment)." | ||
| value = local.enable_load_balancer && var.deployment_type == "rolling" ? aws_lb_target_group.this[0].arn : null | ||
| description = "The ARN of the production target group the service serves from (alias of production_target_group_arn; null if load balancer disabled)." | ||
| value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].arn : null | ||
| } | ||
|
|
||
| output "target_group_arn_suffix" { | ||
| description = "The ARN suffix of the target group for CloudWatch metrics." | ||
| value = local.enable_load_balancer && var.deployment_type == "rolling" ? aws_lb_target_group.this[0].arn_suffix : null | ||
| description = "The ARN suffix of the production target group for CloudWatch metrics." | ||
| value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].arn_suffix : null | ||
| } |
There was a problem hiding this comment.
target_group_name dropped without a backward-compatible alias
target_group_arn was deliberately kept as an alias for production_target_group_arn, but target_group_name — the natural companion — was removed entirely. Callers that reference module.xxx.target_group_name (e.g., in CloudWatch dashboards or external scripts) will get a Terraform error on upgrade. This is not listed as a named breaking change in the PR description. Adding output "target_group_name" { value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].name : null } would match the alias pattern used for the ARN.
Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/outputs.tf
Line: 96-104
Comment:
**`target_group_name` dropped without a backward-compatible alias**
`target_group_arn` was deliberately kept as an alias for `production_target_group_arn`, but `target_group_name` — the natural companion — was removed entirely. Callers that reference `module.xxx.target_group_name` (e.g., in CloudWatch dashboards or external scripts) will get a Terraform error on upgrade. This is not listed as a named breaking change in the PR description. Adding `output "target_group_name" { value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].name : null }` would match the alias pattern used for the ARN.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
- Reject >1 ALB listener rule for native traffic-shift strategies: advanced_configuration only rewrites a single production listener rule, so extra rules would keep serving the old revision for the whole deployment. Documented the constraint on listener_rules since the strategy is a per-deployment decision the deploy manager can change without Terraform. - Restore the target_group_name output as an alias of production_target_group_name, matching the target_group_arn alias. - Make TestEcsServiceWithAlb assert the deployed rolling service carries advanced_configuration (alternate TG, production listener rule, infrastructure role), validating that the real AWS API accepts it on CreateService without a deployment_configuration. Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Summary
Replaces the dual-service / CodeDeploy-era blue/green design with the native ECS deployment controller strategies:
rolling,blue_green,linear, andcanary— and makes the strategy a per-deployment decision that requires no Terraform changes to switch.Native strategies (b4ee71a)
deployment_controller.type = "ECS"always — the CodeDeploy/EXTERNAL mapping is removeddeployment_typevalueslinear/canaryalongsiderolling/blue_greendeployment_strategy_configvariable (bake time, canary %, linear step %) that seedsdeployment_configurationat create timetest_listener_rule_arnvariable for blue/green test-traffic validation (TEST_TRAFFIC_SHIFTstages)AmazonECSInfrastructureRolePolicyForLoadBalancers— ECS assumes it to rewrite listener rules and (de)register targets during traffic shiftsload_balancer.advanced_configurationwiring (production/alternate target group, production + optional test listener rule, role)deployment_configurationadded toignore_changes: the Flightcontrol deploy manager passes the authoritative configuration (strategy, bake times, pause lifecycle hooks) on everyUpdateServicecall, so Terraform must not fight itlinear_configuration/canary_configuration)Strategy is per-deployment (3c72a84)
Whenever a load balancer is attached, the module now always provisions:
tg-1) + alternate (tg-2) target-group pair (the rolling-onlyaws_lb_target_group.thisis removed)advanced_configurationRolling deployments serve from the production target group only and never touch the alternate, so any service can switch between all four strategies on a single
UpdateServicecall.deployment_typeis purely the create-time seed.Trade-off: every load-balanced rolling service carries one idle target group and one unused IAM role.
Outputs
production_target_group_arn/alternate_target_group_arn(+ names),ecs_infrastructure_role_arntarget_group_arnretained as an alias of the production target groupTests
Fixed the module test suite so it actually runs:
basic_servicewas failing onmain(mock provider emits invalid JSON policies / ARNs), which skipped every other run. Addedmock_data/mock_resourcedefaults and fixed two stale assertions.tofu validate✅ —tofu test✅ 11/11 passedBreaking changes
deployment_controlleris alwaysECS; services created with the old CodeDeploy blue/green mapping are not migrated (pre-GA, clean removal)aws_lb_target_group.thisremoved — existing load-balanced services will see target-group recreation on upgrade🤖 Generated with Claude Code
Greptile Summary
This PR replaces the CodeDeploy/dual-service blue-green design with native ECS deployment controller strategies (rolling, blue_green, linear, canary) and restructures target-group provisioning so the strategy is a per-deployment decision rather than a Terraform-level choice.
load_balancer.advanced_configurationare provisioned for every LB-attached service, regardless ofdeployment_type;deployment_configurationandload_balancerare placed inignore_changesso the Flightcontrol deploy manager owns the authoritative configuration.deployment_strategy_config,test_listener_rule_arn, and new named outputs (production_target_group_arn,alternate_target_group_arn,ecs_infrastructure_role_arn) alongside backward-compatible aliases for existing callers;deployment_typenow acceptslinearandcanaryin addition torollingandblue_green.canary_deploymentandlinear_deploymenttest cases added (all plan-only against mocked providers).Confidence Score: 3/5
Two unresolved questions about real-world AWS API behavior make this risky to merge without additional validation.
The core architectural bet — that AWS accepts
advanced_configurationon aCreateServicecall wheredeployment_configurationis absent (rolling default) — is not exercised by any real apply; all 11 tests run against mocked providers withcommand = plan. If AWS rejects this combination, every load-balanced rolling service (the module default) would fail to provision. Separately,production_listener_ruleis hardcoded to the first ALB rule, so services with multiple listener rules silently get incomplete traffic shifting during blue_green/linear/canary deployments without any validation or warning. The infrastructure and IAM changes themselves are clean and the structural refactoring is well-thought-out.compute/ecs_service/ecs_service.tf — the
advanced_configurationblock and the hardcodedalb["0"]reference are the two areas that need the most scrutiny before this lands in production.Important Files Changed
advanced_configurationnow always wired for every LB-attached service;production_listener_rulehardcoded to first ALB rule (alb["0"]), which silently excludes additional rules from traffic shiftingaws_lb_target_group.this; tg_1/tg_2 now always provisioned when LB is attached — clean and consistentAmazonECSInfrastructureRolePolicyForLoadBalancers; correctly gated onenable_load_balancer; trust policy and attachment look correcttarget_group_arnretained as alias;target_group_namedropped without a backward-compatible alias despite being the matching pair to the retained ARN outputdeployment_strategy_config,test_listener_rule_arnvariables and expandeddeployment_typevalidation look correct;optional()defaults are saneplanonly with mocked providers, so no real AWS API validation is exercisedFlowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[terraform apply\necs_service module] --> B{enable_load_balancer?} B -- No --> C[No TGs, no\nadvanced_configuration] B -- Yes --> D[Create tg-1 production\n+ tg-2 alternate] D --> E[Create ECS\ninfrastructure IAM role] E --> F[Wire advanced_configuration\nalways — all strategies] F --> G{deployment_type seed} G -- rolling --> H[No deployment_configuration block\nignore_changes covers it] G -- blue_green / linear / canary --> I[deployment_configuration block\nstrategy + bake/canary/linear tuning] H --> J[ECS service created\nserves from tg-1 only] I --> J J --> K[Deploy Manager calls UpdateService\nwith authoritative strategy\n+ pause lifecycle hooks] K -- rolling --> L[ECS: task replacement\nvia min/max healthy pct\ntg-1 only] K -- blue_green/linear/canary --> M[ECS: rewrites production_listener_rule\nbetween tg-1 and tg-2\nvia infrastructure role]Prompt To Fix All With AI
Reviews (1): Last reviewed commit: "feat(ecs_service): make deployment strat..." | Re-trigger Greptile