Handling Node Failures
Understanding how Tensorfuse deals with node failures
Nodes in the cluster can occasionally fail and enter a NotReady state, which may prevent them from being scaled down. For GPU nodes,
this can quickly drive up your cloud costs, so it is important to have checks in place for such situations.
Tensorfuse creates a stack of monitoring resources to deal with node failures and nodes entering the NotReady state.
Tensorfuse uses the following two mechanisms to deal with node failures:

