Kubernetes nodes can enter a NotReady state due to issues like network outages, an unresponsive kubelet, or hardware problems.
When a node becomes NotReady, its Pods may be evicted or stuck terminating, and new Pods won’t schedule there. Strategies to detect and remediate these NotReady
nodes are thus extremely important for maintaining cluster health and application availability. For ML deployments, where nodes may carry large and therefore expensive GPUs, timely handling of unhealthy NotReady nodes can also mean significant savings on the cloud bill.
This post explores three approaches to handling NotReady nodes in an Amazon EKS cluster, especially when using Karpenter for autoscaling: email notifications from a CloudWatch alarm, automated node deletion with a Lambda function, and Karpenter's native node repair.
The first approach builds on CloudWatch Container Insights, which exposes the metric node_status_condition_ready: its value is 1 when a node is Ready and 0 when it is NotReady.
You can set up a CloudWatch alarm to trigger when the node_status_condition_ready
metric drops to 0 for any node — indicating it has entered a NotReady state.
CloudWatch Metrics Insights simplifies monitoring across dynamic clusters and can be used to aggregate this metric across all nodes.
For instance, taking the minimum value across all nodes detects when even a single node becomes NotReady.
This approach eliminates the need to manually manage alarms as nodes scale in and out.
When the alarm fires, it can publish to an SNS topic, which then sends an email alert to your team.
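As an illustration, here is a minimal sketch of creating such an alarm with boto3. It assumes Container Insights publishes node_status_condition_ready with the ClusterName, InstanceId, and NodeName dimensions; the cluster name, alarm name, and SNS topic ARN are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

CLUSTER_NAME = "my-eks-cluster"  # placeholder
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:notready-node-alerts"  # placeholder

# Metrics Insights query: the minimum of node_status_condition_ready across
# all nodes drops to 0 as soon as any single node reports NotReady.
QUERY = (
    "SELECT MIN(node_status_condition_ready) "
    'FROM SCHEMA("ContainerInsights", ClusterName, InstanceId, NodeName) '
    f"WHERE ClusterName = '{CLUSTER_NAME}'"
)

cloudwatch.put_metric_alarm(
    AlarmName="eks-any-node-not-ready",
    AlarmDescription="Fires when any node in the cluster reports NotReady",
    Metrics=[{
        "Id": "minReady",
        "Expression": QUERY,
        "Period": 60,          # evaluate once per minute
        "ReturnData": True,
    }],
    ComparisonOperator="LessThanThreshold",
    Threshold=1,               # MIN < 1 means at least one node is NotReady
    EvaluationPeriods=3,       # require three consecutive breaching periods
    TreatMissingData="notBreaching",
    AlarmActions=[SNS_TOPIC_ARN],   # SNS topic that emails the team
)
```

Because the query aggregates over whatever nodes exist at evaluation time, the alarm keeps working as nodes scale in and out.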
The second approach uses the same node_status_condition_ready metric and an alarm through CloudWatch.
However, instead of (or in addition to) sending an email, the alarm’s action targets a Lambda via the SNS trigger.
The Lambda function contains logic to delete the NotReady node. This in turn triggers replacement capacity.
For example, the Lambda can call the Kubernetes API to cordon and delete the node. Deleting the Node object evicts
any remaining pods and frees up the name so that the cluster autoscaler can provision a new node if needed.
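As a rough illustration, here is a minimal handler sketch using the official Kubernetes Python client. It assumes the client and working credentials (for example, a packaged kubeconfig or token-based auth) ship with the function, that the Lambda can reach the EKS API endpoint, and that its IAM role maps to a Kubernetes identity allowed to list, patch, and delete Node objects; error handling is omitted:

```python
from kubernetes import client, config


def handler(event, context):
    """Cordon and delete any node whose Ready condition is not True."""
    # In a real deployment you would build credentials from an EKS token;
    # load_kube_config() here assumes a kubeconfig shipped with the function.
    config.load_kube_config()
    v1 = client.CoreV1Api()

    deleted = []
    for node in v1.list_node().items:
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"), None
        )
        if ready == "True":
            continue  # healthy node, leave it alone

        name = node.metadata.name

        # Cordon first so no new pods get scheduled while the node is removed.
        v1.patch_node(name, {"spec": {"unschedulable": True}})

        # Deleting the Node object evicts remaining pods and lets Karpenter
        # (or the cluster autoscaler) provision replacement capacity.
        v1.delete_node(name)
        deleted.append(name)

    return {"deleted": deleted}
```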
This approach minimizes how long a NotReady node affects the cluster by replacing it programmatically, removing any dependency on human intervention and keeping NotReady nodes from piling up.
The third approach relies on Karpenter's node repair feature. Introduced in v1.1.0, it allows Karpenter's controller to monitor node health conditions and automatically terminate and recreate nodes that remain unhealthy beyond a certain time window.
In essence, Karpenter acts as the “self-healing” agent, so you don’t need external Lambdas or custom alarms for node health.
Under the hood, Karpenter watches Kubernetes Node conditions; it requires a node monitoring agent to be installed so that it can observe node statuses.
If a node's Ready condition turns to False (NotReady) or Unknown and stays that way for more than 30 minutes, Karpenter will initiate node repair.
“Repair” in this context means Karpenter will forcefully terminate the node’s instance and remove the Node object, bypassing normal graceful termination.
This is done to ensure the unhealthy node is quickly taken out of service and that scheduling can happen on new nodes.
When Karpenter deletes the Node (and the associated NodeClaim/instance), any Pods that were on it will be rescheduled according to their disruption policies.
If those Pods require a node (and the cluster is under-provisioned), Karpenter’s provisioning logic will launch a new replacement instance to satisfy the
pending Pods. In many cases, Karpenter can even anticipate this by seeing the node is unschedulable and starting a replacement before the old one is fully gone,
ensuring a smoother transition of workloads.
To use this feature, you need to be running a Karpenter version >=v1.1.0
and enable the NodeRepair feature gate. This typically means setting the corresponding Helm chart value on the Karpenter controller (for example, --set settings.featureGates.nodeRepair=true).
You also need to deploy the AWS Node Monitoring Agent (or Node Problem Detector) if you want Karpenter to react to node conditions beyond the basic Ready status.
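If you take the managed add-on route, a sketch like the following can install the agent via boto3; it assumes the agent is offered in your region as the EKS add-on named eks-node-monitoring-agent, and the cluster name is a placeholder:

```python
import boto3

eks = boto3.client("eks")

# Install the AWS Node Monitoring Agent as a managed EKS add-on so that
# detailed node health conditions are reported for Karpenter to act on.
# The add-on name is assumed; verify availability with describe_addon_versions.
eks.create_addon(
    clusterName="my-eks-cluster",           # placeholder
    addonName="eks-node-monitoring-agent",  # assumed add-on name
)
```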
Once enabled, no further action is needed – Karpenter continuously checks node health and will log and take action when a node is deemed unhealthy.
To prevent cascading failures, Karpenter will halt auto-repairs if more than 20% of the nodes in the cluster (or in the specific provisioning group) are unhealthy.
| Aspect | Email Notification | Lambda Deletion | Karpenter Node Repair |
|---|---|---|---|
| Mechanism | CloudWatch alarm notifies humans via SNS email. | CloudWatch alarm triggers Lambda to programmatically fix the node. | Karpenter controller detects and replaces unhealthy nodes in-cluster. |
| Automation | Notification only (manual fix). | Automatic remediation (node removal). | Automatic remediation (node removal and replacement). |
| Response Time | Human-dependent (minutes to hours). | Near real-time (Lambda runs within seconds of alarm). | Semi real-time (reacts after ~10–30 mins of unhealthy status; thresholds are unconfigurable right now). |
| Implementation Effort | Very low (configure alarm & SNS topic). | Medium (write & deploy Lambda + alarm). | Medium (install/upgrade Karpenter + enable feature gate + install health monitoring agent). |
| Scope of Health Signals | Basic Ready status only. | Basic Ready status only. | Wide node conditions (e.g., Ready, GPU health, network). |
| Alerting Visibility | Direct alert to on-call (via email/SNS). | Can alert/log action via SNS or logs. | Relies on cluster logs/events (add alerts manually if needed). |