How To Fix AWS ECS Fargate MinimumHealthyPercent Violation

If you’re using AWS Fargate and your service suddenly drops to half capacity during a platform update, you’re not alone. Some users recently reported exactly this: 4 out of 8 tasks became unhealthy simultaneously, causing a temporary drop in capacity. Here’s how to understand what’s happening and what you can do to prevent it.

When AWS updates the Fargate platform, the main concern is maintaining service availability. During a recent update, the process didn’t follow the usual pattern seen with manual deployments. Typically, when you trigger a manual deployment with the force new deployment option, new tasks are started first, and the old ones are stopped only once the new ones are healthy. This keeps the number of healthy tasks steady and respects the minimum healthy percentage.

However, during the automatic platform update, the existing tasks became unhealthy before their replacements were started. Only half the tasks were running while the new ones came up, so capacity dropped temporarily. Notably, the deployment configuration required 100% of tasks to be healthy at all times (minimumHealthyPercent of 100), and that setting wasn’t honored during this process, which raises the question of whether this behavior is expected.

Why exactly 4 tasks went down at once probably relates to how the platform processes updates. Platform updates appear to be handled in batches, with the batch size possibly influenced by the maximum percentage setting. In this case all four tasks were affected at the same time, which corresponds to a batch of 50% of the 8-task service. Batch processing like this causes a temporary capacity dip whenever the unhealthy tasks aren’t replaced quickly enough.
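To make the numbers concrete, here is a minimal sketch of how the two deployment percentages translate into task counts for a regular rolling deployment. It assumes the rounding described in the ECS documentation (minimum healthy rounded up, maximum rounded down); the function name `deployment_bounds` is made up for illustration, and whether platform-driven task replacement follows the same arithmetic is exactly the open question here.

```python
def deployment_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (min_running, max_running, surge) task counts.

    Assumes the documented rolling-deployment rounding:
    minimumHealthyPercent rounds up, maximumPercent rounds down.
    """
    # Ceiling division using integers only, to avoid float rounding surprises.
    min_running = -(-desired_count * minimum_healthy_percent // 100)
    # Floor division for the upper limit.
    max_running = desired_count * maximum_percent // 100
    # Extra tasks that may run on top of the desired count during a deployment.
    surge = max(max_running - desired_count, 0)
    return min_running, max_running, surge


print(deployment_bounds(8, 100, 200))  # (8, 16, 8)
print(deployment_bounds(8, 100, 125))  # (8, 10, 2)
print(deployment_bounds(8, 50, 200))   # (4, 16, 8)
```

With 8 desired tasks and a minimum healthy percentage of 100, a normal rolling deployment should never go below 8 running tasks. The observed drop to 4 is what a minimum healthy percentage of 50 would allow, which is why the behavior looks like a violation of the configured setting.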

To avoid these issues, consider implementing some mitigation strategies:

Adjust the deployment settings: Lower the maximumPercent from 200% to a smaller value, like 125%, so that fewer tasks are replaced at once. With 8 tasks, 125% caps the service at 10 running tasks, which ideally limits replacement to 2 at a time.
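Use a Circuit Breaker: Enable the ECS deployment circuit breaker so that a deployment whose new tasks fail to become healthy is stopped, and optionally rolled back, instead of causing cascading failures during update periods.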
Manual deployment during critical updates: When the AWS Health Dashboard alerts you to an upcoming platform update, run a manual deployment yourself before it takes effect. This way, you have more control over how many tasks are replaced at once.
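The settings from the list above can be applied together in one call. The following is a rough sketch using boto3’s `update_service`; the cluster and service names are placeholders, the helper functions `build_update_request` and `apply_update` are hypothetical, and the values (125% maximum, circuit breaker with rollback) are only the examples discussed here, not a recommendation for every workload.

```python
def build_update_request(cluster, service,
                         maximum_percent=125, minimum_healthy_percent=100):
    """Build keyword arguments for the ECS UpdateService API call."""
    return {
        "cluster": cluster,
        "service": service,
        # Start a fresh rolling deployment even if the task definition is unchanged.
        "forceNewDeployment": True,
        "deploymentConfiguration": {
            "maximumPercent": maximum_percent,
            "minimumHealthyPercent": minimum_healthy_percent,
            # Stop (and roll back) a deployment whose tasks never become healthy.
            "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        },
    }


def apply_update(request, region_name="ap-northeast-2"):
    """Send the request to ECS. Requires boto3 and valid AWS credentials."""
    import boto3  # imported here so the request can be built without boto3 installed

    ecs = boto3.client("ecs", region_name=region_name)
    return ecs.update_service(**request)


# "my-cluster" and "my-service" are placeholder names.
request = build_update_request("my-cluster", "my-service")
print(request["deploymentConfiguration"]["maximumPercent"])  # 125
# apply_update(request)  # uncomment to run against a real service
```

Keeping the request construction separate from the API call makes it easy to review the exact configuration before anything is changed in the account.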

While AWS provides documentation on task maintenance and platform versions, it’s less clear how deployment configurations behave during platform updates. Many users are still figuring out how to guarantee minimum healthy tasks during such updates.

If you’re managing a high-availability environment in any region (the case described here was in Asia Pacific (Seoul), ap-northeast-2), it’s smart to review these strategies. Reducing the batch size during updates and performing manual deployments when needed can help maintain your service’s reliability.

Has anyone else encountered similar behavior? Exploring these options and sharing your experiences can help everyone better understand how to keep their services running smoothly during AWS platform updates.