When troubleshooting EC2 instance issues, it’s important to understand the different types of status failures, as they come from different parts of the system and require different approaches.
First, distinguish system status checks from instance status checks. A failed system status check points to a problem in the underlying AWS infrastructure, such as host hardware or hypervisor failure. These failures are rare and outside your control; the usual remedy is to stop and start the instance so it migrates to healthy hardware. A failed instance status check, by contrast, indicates a problem inside the guest operating system, such as a kernel panic, an out-of-memory condition, or a broken network stack. If the Kubernetes kubelet stops reporting status while the OS is otherwise running, resource starvation is a likely reason the kubelet has become unresponsive.
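The two check results can drive triage directly. Below is a minimal sketch: the `classify_status` helper and the instance ID are illustrative, and the commented `aws ec2 describe-instance-status` query needs AWS CLI credentials to run:

```shell
# classify_status: route triage based on the two EC2 status check results.
classify_status() {
  local system="$1" instance="$2"
  if [ "$system" != "ok" ]; then
    echo "infrastructure"   # AWS-side problem: stop/start to move hosts
  elif [ "$instance" != "ok" ]; then
    echo "guest-os"         # look inside the OS: console output, dmesg, OOM
  else
    echo "healthy"
  fi
}

# Fetch live values (placeholder instance ID; requires credentials):
# aws ec2 describe-instance-status --instance-ids i-0123456789abcdef0 \
#   --query 'InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]' \
#   --output text
classify_status impaired ok   # -> infrastructure
```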
You can learn more about these checks in AWS’s official monitoring guide on system and instance status checks.
Sometimes the problem isn’t hardware or the OS itself but resource allocation. If CloudWatch shows no out-of-memory kill event, suspect CPU steal (common on oversubscribed multi-tenant hosts) or memory fragmentation instead. Without proper reservations in place, a runaway pod can starve the kubelet of CPU or memory, preventing it from sending regular heartbeats to the control plane.
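A quick way to spot CPU steal from inside the node is to read the aggregate `cpu` line of `/proc/stat` (Linux only; the steal counter is the ninth field). Persistently high steal relative to total ticks suggests the hypervisor is withholding CPU from the guest:

```shell
# Read the per-boot CPU tick counters from the first line of /proc/stat.
read -r _ user nice system idle iowait irq softirq steal _ < /proc/stat
total=$((user + nice + system + idle + iowait + irq + softirq + steal))
echo "steal ticks: $steal of $total total"
```

On a healthy dedicated instance, steal should stay near zero; sample the counters twice and diff them for a rate rather than a lifetime total.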
For deeper analysis, especially after an instance has been terminated, check AWS-specific artifacts. First, retrieve the instance console screenshot: AWS captures the VGA output, which can show a kernel panic or crash screen that happened too quickly for CloudWatch logs to record. Second, if you’re using gp2 (or st1/sc1) volumes, review the EBS BurstBalance metric; if the burst balance hits zero, the volume has exhausted its I/O credits, and the resulting I/O stalls can make processes like the kubelet unresponsive.
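One way to act on that metric is to treat any window where the minimum BurstBalance reached 0% as exhaustion. The `burst_exhausted` helper is illustrative, and the commented CloudWatch query (with a placeholder volume ID) needs AWS credentials and GNU `date`:

```shell
# burst_exhausted: exit 0 if the minimum BurstBalance (%) in $1 hit zero.
burst_exhausted() {
  awk -v b="$1" 'BEGIN { exit (b > 0) }'
}

# min=$(aws cloudwatch get-metric-statistics --namespace AWS/EBS \
#   --metric-name BurstBalance \
#   --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
#   --start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
#   --end-time "$(date -u +%FT%TZ)" --period 300 --statistics Minimum \
#   --query 'min_by(Datapoints,&Minimum).Minimum' --output text)
# burst_exhausted "$min" && echo "volume ran out of I/O credits"
burst_exhausted 0 && echo "exhausted"   # prints "exhausted"
```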
To prevent resource starvation, consider implementing Node Allocatable constraints. This reserves a slice of CPU and memory specifically for the OS and the kubelet, so both keep running even when workloads saturate the node.
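On a self-managed node this is set via the kubelet configuration file’s `kubeReserved`, `systemReserved`, and `evictionHard` fields. The values below are illustrative placeholders, not sizing recommendations; tune them to your instance type:

```yaml
# Illustrative Node Allocatable reservations; adjust per instance size.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
kubeReserved:
  cpu: "100m"
  memory: "256Mi"
  ephemeral-storage: "1Gi"
systemReserved:
  cpu: "100m"
  memory: "256Mi"
evictionHard:
  memory.available: "200Mi"
```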
Finally, an often-overlooked cause of an “Unknown” node status in EKS clusters is certificate expiration or clock skew on worker nodes. If a node’s clock drifts or its kubelet client certificate expires, the kubelet can no longer authenticate to the API server, and the node drops out of Ready. Keeping NTP in sync and monitoring certificate expiry avoids both failure modes.
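Both conditions can be checked on the node itself. This is a sketch: the kubelet certificate path assumes a standard kubeadm/EKS-style layout, so verify it on your AMI before relying on it:

```shell
# cert_expired: exit 0 if the certificate file in $1 has already expired.
cert_expired() {
  ! openssl x509 -noout -checkend 0 -in "$1" > /dev/null
}

# On a worker node (assumed paths):
# timedatectl show -p NTPSynchronized --value   # "yes" when the clock is synced
# cert_expired /var/lib/kubelet/pki/kubelet-client-current.pem \
#   && echo "kubelet client cert expired: renew before the node can rejoin"
```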