Nodes in the cluster can fail and enter a NotReady state, which may prevent them from scaling down. For GPU nodes, this can cause your cloud costs to balloon, so it is important to have checks in place for such situations. Tensorfuse creates a stack of monitoring resources to handle node failures and nodes stuck in the NotReady state.

Tensorfuse uses the following two mechanisms to deal with node failures:

1. Node Auto Repair

Tensorfuse uses the AWS Node Monitoring Agent alongside Karpenter’s Node Auto Repair feature to track node health and identify when a node fails. Depending on the reason for the failure, Karpenter can wait up to 30 minutes before deleting the node. If more than 20% of the nodes in a NodePool have failed, node repair is blocked entirely; this avoids a circular create-and-fail cycle in the case of a faulty deployment. Neither the 30-minute delay nor the 20% threshold is configurable.
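
To make the two limits concrete, here is a minimal sketch of the repair decision described above. This is illustrative only, not Karpenter’s actual implementation; the node names, data structures, and helper functions are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

REPAIR_DELAY = timedelta(minutes=30)   # per-node wait before deletion
UNHEALTHY_BLOCK_THRESHOLD = 0.20       # repair is blocked above this fraction of failed nodes

@dataclass
class Node:
    name: str
    ready: bool
    not_ready_since: datetime | None = None  # when the node entered NotReady

def nodes_to_repair(nodepool_nodes: list[Node], now: datetime) -> list[Node]:
    """Return the NotReady nodes that would be repaired (deleted) right now."""
    unhealthy = [n for n in nodepool_nodes if not n.ready]
    # Repair is blocked entirely when too much of the NodePool is unhealthy,
    # which avoids a create-and-fail loop after a faulty deployment.
    if len(unhealthy) > UNHEALTHY_BLOCK_THRESHOLD * len(nodepool_nodes):
        return []
    # Otherwise, each failed node becomes eligible once it has been NotReady long enough.
    return [n for n in unhealthy
            if n.not_ready_since and now - n.not_ready_since >= REPAIR_DELAY]

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    pool = [Node("gpu-node-1", ready=False, not_ready_since=now - timedelta(minutes=45)),
            Node("gpu-node-2", ready=True),
            Node("gpu-node-3", ready=True),
            Node("gpu-node-4", ready=True),
            Node("gpu-node-5", ready=True),
            Node("gpu-node-6", ready=True)]
    print([n.name for n in nodes_to_repair(pool, now)])  # ['gpu-node-1']
```

In this example one of six nodes (about 17%) is unhealthy, so repair is not blocked, and the node is deleted because it has been NotReady for longer than 30 minutes. If two or more of the six nodes were unhealthy, nothing would be repaired.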

2. Custom CloudWatch Alarm and State Machine

While blocking deletion makes sense to break a circular create-and-fail cycle, in practice failed GPU nodes that are never deleted can run up a huge cloud bill. To avoid this, we configure a custom CloudWatch alarm. If failed nodes have not gone down within 45 minutes, an email is sent to your configured alert address and a Lambda function runs that automatically deletes any NotReady nodes. At the same time, a state machine is started that re-checks the cluster, re-sends the email, and re-triggers the Lambda every 15 minutes until there are no failed nodes left in the cluster.
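
For intuition, a cleanup Lambda like the one described above could look roughly like the sketch below. This is an assumption-laden illustration, not Tensorfuse’s actual Lambda: the handler name, the kubeconfig loading, and the deletion policy are placeholders, and a real Lambda would authenticate to the EKS API rather than read a local kubeconfig.

```python
# Minimal sketch of a cleanup Lambda using the official `kubernetes` Python client.
from kubernetes import client, config

def is_not_ready(node) -> bool:
    """A node is NotReady when its Ready condition is not 'True'."""
    for cond in node.status.conditions or []:
        if cond.type == "Ready":
            return cond.status != "True"
    return True  # no Ready condition reported at all

def handler(event, context):
    # In a real Lambda this would be replaced with EKS token-based auth.
    config.load_kube_config()
    v1 = client.CoreV1Api()
    deleted = []
    for node in v1.list_node().items:
        if is_not_ready(node):
            # Deleting the Node object lets the autoscaler reclaim the underlying instance.
            v1.delete_node(node.metadata.name)
            deleted.append(node.metadata.name)
    return {"deleted": deleted}
```

The state machine simply re-invokes this kind of handler (and the alert email) on a 15-minute cadence until the cluster reports no NotReady nodes.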

TL;DR

When do I receive an alert?

You receive an alert if failed nodes persist for 45 minutes, and every 15 minutes after that if the nodes still haven’t gone down.

When do nodes start getting cleaned up?

If fewer than 20% of the nodes in a NodePool have failed, each failed node is deleted 30 minutes after it fails. If more than 20% of the nodes in the NodePool have failed, all NotReady nodes are deleted 45 minutes after the first node failed.