
fix: add task-level CPU and memory limits#773

Open
igor-soldev wants to merge 1 commit into
code4romania:mainfrom
igor-soldev:fix/infrascan-ecs-task-limits

@igor-soldev

What does this PR do?

This PR introduces explicit, task-level CPU and memory limits to our ECS task definitions.

This addresses the "ECS Task Definition Without CPU/Memory Limits" issue that InfraScan reported during the setup in #772.

Previously, we assigned memory = var.container_memory_hard_limit directly at the task level and omitted the task-level cpu entirely. This worked for single-container tasks but was inflexible: if we ever add a sidecar container, for example, the task memory has to cover all containers in the task, not just the main one. Omitting the task-level cpu also makes it harder for the ECS scheduler to place tasks efficiently and can lead to resource overprovisioning.

To fix this and make the module more robust, two new variables were introduced:

  • task_cpu
  • task_memory

Default values (Action Required)

To ensure backward compatibility and prevent production outages during deployment, safe default values have been applied:

  • task_cpu = 1024 (1 vCPU) – Prevents unbounded CPU bursts, but will throttle the task if the app suddenly demands more power.
  • task_memory = container_memory_hard_limit (fallback when not set) – Ensures existing setups keep the memory they have today (currently 3072 MiB for the API).
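
For reviewers, the defaults described above could be wired up roughly as follows. This is a minimal sketch, not the actual diff: the variable names task_cpu, task_memory and container_memory_hard_limit come from this PR, while the null default, the coalesce() fallback, the resource label and the container values are assumptions made for illustration.

```hcl
variable "container_memory_hard_limit" {
  description = "Hard memory limit for the main container, in MiB."
  type        = number
}

variable "task_cpu" {
  description = "Task-level CPU limit in CPU units (1024 = 1 vCPU)."
  type        = number
  default     = 1024
}

variable "task_memory" {
  description = "Task-level memory limit in MiB. Falls back to container_memory_hard_limit when null."
  type        = number
  default     = null # assumption: null is used as the 'not set' marker
}

locals {
  # Use the explicit task memory if given, otherwise keep the previous behaviour.
  task_memory = coalesce(var.task_memory, var.container_memory_hard_limit)
}

# Hypothetical resource label and container values, for illustration only.
resource "aws_ecs_task_definition" "this" {
  family = "example"
  cpu    = var.task_cpu
  memory = local.task_memory

  container_definitions = jsonencode([
    {
      name      = "app"
      image     = "example/app:latest"
      essential = true
      memory    = var.container_memory_hard_limit
    }
  ])
}
```

With this shape, callers that set nothing keep their current task memory, and a caller adding a sidecar can raise task_memory without touching the main container's limit.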

Request for metrics review

Before merging, we should verify our actual resource usage in CloudWatch Container Insights to avoid deployment blockages or production starvation:

  1. CPU Usage: Are 1024 CPU units sufficient for our historical traffic spikes? If CPU usage regularly exceeds this limit, we need to override task_cpu explicitly when calling the module (e.g., set it to 2048).
  2. EC2 Capacity: Will our current Auto Scaling Group instances have enough free CPU capacity to accommodate new tasks during a rolling deployment, now that 1024 units are strictly reserved per task?
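
To make the two questions above concrete, here is a rough sketch of an override and of the capacity arithmetic. The module label, the source path and the instance size are hypothetical and only serve the example; task_cpu is the variable introduced in this PR.

```hcl
# Hypothetical module call: label and source path are placeholders.
module "api" {
  source = "./modules/ecs-service"

  # ...other module arguments omitted...

  task_cpu = 2048 # override the 1024 default if metrics show sustained higher usage
}

# Back-of-the-envelope capacity check, assuming a 2 vCPU instance (2048 CPU units).
locals {
  instance_cpu_units = 2048
  reserved_per_task  = 1024

  # Upper bound on tasks per instance by CPU alone; other tasks on the
  # instance reduce this further.
  max_tasks_per_instance = floor(local.instance_cpu_units / local.reserved_per_task) # 2
}
```

If a rolling deployment starts new tasks before stopping old ones, the cluster needs enough spare CPU units for the extra tasks at that moment, which is the number to compare against the Container Insights data.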

Let's discuss the metrics and adjust the defaults if necessary before merging.
