Skip to content

feat(ecs_service): native ECS deployment strategies (rolling/blue_green/linear/canary)#5

Open
mabadir wants to merge 3 commits into
mainfrom
native-ecs-deployment-strategies
Open

feat(ecs_service): native ECS deployment strategies (rolling/blue_green/linear/canary)#5
mabadir wants to merge 3 commits into
mainfrom
native-ecs-deployment-strategies

Conversation

@mabadir
Copy link
Copy Markdown
Contributor

@mabadir mabadir commented Jun 4, 2026

Summary

Replaces the dual-service / CodeDeploy-era blue/green design with the native ECS deployment controller strategies: rolling, blue_green, linear, and canary — and makes the strategy a per-deployment decision that requires no Terraform changes to switch.

Native strategies (b4ee71a)

  • deployment_controller.type = "ECS" always — the CodeDeploy/EXTERNAL mapping is removed
  • New deployment_type values linear / canary alongside rolling / blue_green
  • New deployment_strategy_config variable (bake time, canary %, linear step %) that seeds deployment_configuration at create time
  • New test_listener_rule_arn variable for blue/green test-traffic validation (TEST_TRAFFIC_SHIFT stages)
  • ECS infrastructure IAM role with AmazonECSInfrastructureRolePolicyForLoadBalancers — ECS assumes it to rewrite listener rules and (de)register targets during traffic shifts
  • load_balancer.advanced_configuration wiring (production/alternate target group, production + optional test listener rule, role)
  • deployment_configuration added to ignore_changes: the Flightcontrol deploy manager passes the authoritative configuration (strategy, bake times, pause lifecycle hooks) on every UpdateService call, so Terraform must not fight it
  • Requires AWS provider >= 6.21 (linear_configuration / canary_configuration)

Strategy is per-deployment (3c72a84)

Whenever a load balancer is attached, the module now always provisions:

  • the production (tg-1) + alternate (tg-2) target-group pair (the rolling-only aws_lb_target_group.this is removed)
  • the ECS infrastructure role
  • the service's advanced_configuration

Rolling deployments serve from the production target group only and never touch the alternate, so any service can switch between all four strategies on a single UpdateService call. deployment_type is purely the create-time seed.

Trade-off: every load-balanced rolling service carries one idle target group and one unused IAM role.

Outputs

  • production_target_group_arn / alternate_target_group_arn (+ names), ecs_infrastructure_role_arn
  • target_group_arn retained as an alias of the production target group
  • Removed: blue/green dual-service outputs

Tests

Fixed the module test suite so it actually runs: basic_service was failing on main (mock provider emits invalid JSON policies / ARNs), which skipped every other run. Added mock_data / mock_resource defaults and fixed two stale assertions.

tofu validate ✅ — tofu test ✅ 11/11 passed

Breaking changes

  • deployment_controller is always ECS; services created with the old CodeDeploy blue/green mapping are not migrated (pre-GA, clean removal)
  • aws_lb_target_group.this removed — existing load-balanced services will see target-group recreation on upgrade
  • Blue/green dual-service variables/outputs removed

🤖 Generated with Claude Code

Greptile Summary

This PR replaces the CodeDeploy/dual-service blue-green design with native ECS deployment controller strategies (rolling, blue_green, linear, canary) and restructures target-group provisioning so the strategy is a per-deployment decision rather than a Terraform-level choice.

  • Always-on infra: production (tg-1) + alternate (tg-2) target groups, an ECS infrastructure IAM role, and load_balancer.advanced_configuration are provisioned for every LB-attached service, regardless of deployment_type; deployment_configuration and load_balancer are placed in ignore_changes so the Flightcontrol deploy manager owns the authoritative configuration.
  • New variables and outputs: deployment_strategy_config, test_listener_rule_arn, and new named outputs (production_target_group_arn, alternate_target_group_arn, ecs_infrastructure_role_arn) alongside backward-compatible aliases for existing callers; deployment_type now accepts linear and canary in addition to rolling and blue_green.
  • Test suite fixed: mock provider defaults and two stale assertions corrected; new canary_deployment and linear_deployment test cases added (all plan-only against mocked providers).

Confidence Score: 3/5

Two unresolved questions about real-world AWS API behavior make this risky to merge without additional validation.

The core architectural bet — that AWS accepts advanced_configuration on a CreateService call where deployment_configuration is absent (rolling default) — is not exercised by any real apply; all 11 tests run against mocked providers with command = plan. If AWS rejects this combination, every load-balanced rolling service (the module default) would fail to provision. Separately, production_listener_rule is hardcoded to the first ALB rule, so services with multiple listener rules silently get incomplete traffic shifting during blue_green/linear/canary deployments without any validation or warning. The infrastructure and IAM changes themselves are clean and the structural refactoring is well-thought-out.

compute/ecs_service/ecs_service.tf — the advanced_configuration block and the hardcoded alb["0"] reference are the two areas that need the most scrutiny before this lands in production.

Important Files Changed

Filename Overview
compute/ecs_service/ecs_service.tf Core ECS service resource: advanced_configuration now always wired for every LB-attached service; production_listener_rule hardcoded to first ALB rule (alb["0"]), which silently excludes additional rules from traffic shifting
compute/ecs_service/target_groups.tf Removed aws_lb_target_group.this; tg_1/tg_2 now always provisioned when LB is attached — clean and consistent
compute/ecs_service/iam_infrastructure.tf New ECS infrastructure role using AmazonECSInfrastructureRolePolicyForLoadBalancers; correctly gated on enable_load_balancer; trust policy and attachment look correct
compute/ecs_service/outputs.tf target_group_arn retained as alias; target_group_name dropped without a backward-compatible alias despite being the matching pair to the retained ARN output
compute/ecs_service/variables.tf New deployment_strategy_config, test_listener_rule_arn variables and expanded deployment_type validation look correct; optional() defaults are sane
compute/ecs_service/tests/basic.tftest.hcl Mock provider defaults fixed; new canary/linear test cases added; all tests run as plan only with mocked providers, so no real AWS API validation is exercised

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[terraform apply\necs_service module] --> B{enable_load_balancer?}
    B -- No --> C[No TGs, no\nadvanced_configuration]
    B -- Yes --> D[Create tg-1 production\n+ tg-2 alternate]
    D --> E[Create ECS\ninfrastructure IAM role]
    E --> F[Wire advanced_configuration\nalways — all strategies]
    F --> G{deployment_type seed}
    G -- rolling --> H[No deployment_configuration block\nignore_changes covers it]
    G -- blue_green / linear / canary --> I[deployment_configuration block\nstrategy + bake/canary/linear tuning]
    H --> J[ECS service created\nserves from tg-1 only]
    I --> J
    J --> K[Deploy Manager calls UpdateService\nwith authoritative strategy\n+ pause lifecycle hooks]
    K -- rolling --> L[ECS: task replacement\nvia min/max healthy pct\ntg-1 only]
    K -- blue_green/linear/canary --> M[ECS: rewrites production_listener_rule\nbetween tg-1 and tg-2\nvia infrastructure role]
Loading
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
compute/ecs_service/ecs_service.tf:102-106
**Only the first ALB listener rule participates in traffic shifting**

`production_listener_rule` is hardcoded to `alb["0"].arn`. The AWS ECS `advanced_configuration` API only accepts a single production listener rule ARN, so during blue_green/linear/canary deployments ECS rewrites only this one rule. Any additional listener rules (`alb["1"]`, `alb["2"]`, …) continue forwarding to tg_1 for the entire deployment — traffic on those rules never shifts to the new revision. Consider adding a validation block that requires `length(var.load_balancer_attachment.listener_rules) <= 1` when using a native traffic-shift strategy, or document this constraint prominently so callers with multi-rule services aren't caught off guard.

### Issue 2 of 3
compute/ecs_service/ecs_service.tf:93-111
**`advanced_configuration` wired unconditionally for rolling deployments — not validated against real AWS API**

The whole design bet is that AWS silently accepts `advanced_configuration` (alternate TG, production listener rule, infrastructure role) when `deployment_configuration` is absent (the rolling default). All tests run with `command = plan` against mocked providers, so the AWS API's acceptance of this combination has not been exercised. If AWS rejects `advanced_configuration` on a `CreateService` call that carries no `deployment_configuration`, every rolling service with a load balancer (the module default) would fail to provision. A real `tofu apply` against AWS — even in a scratch account — should be added to the test suite to confirm this assumption before merging.

### Issue 3 of 3
compute/ecs_service/outputs.tf:96-104
**`target_group_name` dropped without a backward-compatible alias**

`target_group_arn` was deliberately kept as an alias for `production_target_group_arn`, but `target_group_name` — the natural companion — was removed entirely. Callers that reference `module.xxx.target_group_name` (e.g., in CloudWatch dashboards or external scripts) will get a Terraform error on upgrade. This is not listed as a named breaking change in the PR description. Adding `output "target_group_name" { value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].name : null }` would match the alias pattern used for the ARN.

Reviews (1): Last reviewed commit: "feat(ecs_service): make deployment strat..." | Re-trigger Greptile

Greptile also left 3 inline comments on this PR.

mabadir and others added 2 commits June 4, 2026 00:25
…en/linear/canary)

Replace the CodeDeploy/external-controller blue/green wiring with the
ECS deployment controller's built-in strategies:

- deployment_type now accepts rolling | blue_green | linear | canary;
  the deployment controller is always ECS (CODE_DEPLOY mapping removed)
- new deployment_strategy_config seeds bake time + canary/linear tuning
  at create time; deployment_configuration is in ignore_changes because
  the Flightcontrol deploy manager passes the authoritative config
  (including pause lifecycle hooks) on every UpdateService call
- native strategies wire load_balancer.advanced_configuration:
  alternate target group (tg-2), production listener rule (first ALB
  rule or the NLB listener), optional test_listener_rule_arn, and a new
  ECS infrastructure role (AmazonECSInfrastructureRolePolicyForLoadBalancers)
- target groups tg-1/tg-2 are now production/alternate and gate on any
  native traffic-shift strategy, not just blue_green; outputs renamed
  accordingly (production_/alternate_target_group_*) and
  ecs_infrastructure_role_arn added
- provider floor bumped to aws >= 6.21 (linear/canary configuration)

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Always provision the production/alternate target-group pair, the ECS
infrastructure role, and the service's load_balancer
advanced_configuration whenever a load balancer is attached — not just
for native traffic-shift deployment_types. Rolling deployments serve
from the production target group only, so any service can now switch
between rolling / blue_green / linear / canary on a single
UpdateService call with zero Terraform changes; deployment_type is
purely the create-time seed.

Also fixes the module test suite: mock the iam_policy_document /
partition / region / vpc data sources and computed ARNs so all 11 runs
execute (basic_service was failing on main and skipping the rest).

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
Comment on lines +102 to +106
production_listener_rule = (
local.enable_nlb_listener
? aws_lb_listener.nlb[0].arn
: aws_lb_listener_rule.alb["0"].arn
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Only the first ALB listener rule participates in traffic shifting

production_listener_rule is hardcoded to alb["0"].arn. The AWS ECS advanced_configuration API only accepts a single production listener rule ARN, so during blue_green/linear/canary deployments ECS rewrites only this one rule. Any additional listener rules (alb["1"], alb["2"], …) continue forwarding to tg_1 for the entire deployment — traffic on those rules never shifts to the new revision. Consider adding a validation block that requires length(var.load_balancer_attachment.listener_rules) <= 1 when using a native traffic-shift strategy, or document this constraint prominently so callers with multi-rule services aren't caught off guard.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ecs_service.tf
Line: 102-106

Comment:
**Only the first ALB listener rule participates in traffic shifting**

`production_listener_rule` is hardcoded to `alb["0"].arn`. The AWS ECS `advanced_configuration` API only accepts a single production listener rule ARN, so during blue_green/linear/canary deployments ECS rewrites only this one rule. Any additional listener rules (`alb["1"]`, `alb["2"]`, …) continue forwarding to tg_1 for the entire deployment — traffic on those rules never shifts to the new revision. Consider adding a validation block that requires `length(var.load_balancer_attachment.listener_rules) <= 1` when using a native traffic-shift strategy, or document this constraint prominently so callers with multi-rule services aren't caught off guard.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 93 to 111
dynamic "load_balancer" {
for_each = local.enable_load_balancer && var.deployment_type == "blue_green" ? [1] : []
for_each = local.enable_load_balancer ? [1] : []
content {
target_group_arn = aws_lb_target_group.tg_1[0].arn
container_name = local.lb_container_name
container_port = local.lb_container_port

advanced_configuration {
alternate_target_group_arn = aws_lb_target_group.tg_2[0].arn
production_listener_rule = (
local.enable_nlb_listener
? aws_lb_listener.nlb[0].arn
: aws_lb_listener_rule.alb["0"].arn
)
test_listener_rule = var.test_listener_rule_arn
role_arn = aws_iam_role.ecs_infrastructure[0].arn
}
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 advanced_configuration wired unconditionally for rolling deployments — not validated against real AWS API

The whole design bet is that AWS silently accepts advanced_configuration (alternate TG, production listener rule, infrastructure role) when deployment_configuration is absent (the rolling default). All tests run with command = plan against mocked providers, so the AWS API's acceptance of this combination has not been exercised. If AWS rejects advanced_configuration on a CreateService call that carries no deployment_configuration, every rolling service with a load balancer (the module default) would fail to provision. A real tofu apply against AWS — even in a scratch account — should be added to the test suite to confirm this assumption before merging.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/ecs_service.tf
Line: 93-111

Comment:
**`advanced_configuration` wired unconditionally for rolling deployments — not validated against real AWS API**

The whole design bet is that AWS silently accepts `advanced_configuration` (alternate TG, production listener rule, infrastructure role) when `deployment_configuration` is absent (the rolling default). All tests run with `command = plan` against mocked providers, so the AWS API's acceptance of this combination has not been exercised. If AWS rejects `advanced_configuration` on a `CreateService` call that carries no `deployment_configuration`, every rolling service with a load balancer (the module default) would fail to provision. A real `tofu apply` against AWS — even in a scratch account — should be added to the test suite to confirm this assumption before merging.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines 96 to 104
output "target_group_arn" {
description = "The ARN of the target group (null if load balancer disabled or blue/green deployment)."
value = local.enable_load_balancer && var.deployment_type == "rolling" ? aws_lb_target_group.this[0].arn : null
description = "The ARN of the production target group the service serves from (alias of production_target_group_arn; null if load balancer disabled)."
value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].arn : null
}

output "target_group_arn_suffix" {
description = "The ARN suffix of the target group for CloudWatch metrics."
value = local.enable_load_balancer && var.deployment_type == "rolling" ? aws_lb_target_group.this[0].arn_suffix : null
description = "The ARN suffix of the production target group for CloudWatch metrics."
value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].arn_suffix : null
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 target_group_name dropped without a backward-compatible alias

target_group_arn was deliberately kept as an alias for production_target_group_arn, but target_group_name — the natural companion — was removed entirely. Callers that reference module.xxx.target_group_name (e.g., in CloudWatch dashboards or external scripts) will get a Terraform error on upgrade. This is not listed as a named breaking change in the PR description. Adding output "target_group_name" { value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].name : null } would match the alias pattern used for the ARN.

Prompt To Fix With AI
This is a comment left during a code review.
Path: compute/ecs_service/outputs.tf
Line: 96-104

Comment:
**`target_group_name` dropped without a backward-compatible alias**

`target_group_arn` was deliberately kept as an alias for `production_target_group_arn`, but `target_group_name` — the natural companion — was removed entirely. Callers that reference `module.xxx.target_group_name` (e.g., in CloudWatch dashboards or external scripts) will get a Terraform error on upgrade. This is not listed as a named breaking change in the PR description. Adding `output "target_group_name" { value = local.enable_load_balancer ? aws_lb_target_group.tg_1[0].name : null }` would match the alias pattern used for the ARN.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

- Reject >1 ALB listener rule for native traffic-shift strategies:
  advanced_configuration only rewrites a single production listener
  rule, so extra rules would keep serving the old revision for the
  whole deployment. Documented the constraint on listener_rules since
  the strategy is a per-deployment decision the deploy manager can
  change without Terraform.
- Restore the target_group_name output as an alias of
  production_target_group_name, matching the target_group_arn alias.
- Make TestEcsServiceWithAlb assert the deployed rolling service
  carries advanced_configuration (alternate TG, production listener
  rule, infrastructure role), validating that the real AWS API accepts
  it on CreateService without a deployment_configuration.

Co-Authored-By: Claude Opus 4.8 (1M context) <[email protected]>
@mabadir mabadir added the run-terratest Run Terratest integration tests (real AWS apply) on this PR label Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-terratest Run Terratest integration tests (real AWS apply) on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant