Troubleshooting
Guidelines and best practices for deploying applications on Tensorkube
Here are the main guidelines to follow when deploying apps on Tensorkube.
The Readiness Probe
When deploying applications to Tensorkube, the readiness probe is essential to ensure your services remain stable and provide a consistent user experience.
What is a Readiness Probe?
A readiness probe is a health check that determines when your application is ready to start accepting traffic. A container that is merely running may still be initializing; a "ready" container can properly handle incoming requests. Without a readiness probe, there is no way to tell whether traffic can safely be routed to your app. Routing traffic to it as soon as it starts running can result in failed requests, errors, and a poor user experience during deployments or scaling events, because the app might still be loading configuration files, establishing database connections, warming up caches, or initializing dependencies on other services.
Configuring a Readiness Probe
You can create a readiness endpoint in a FastAPI app as follows:
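The sketch below is a minimal illustration; the /readiness path and the readiness flag are placeholders you should adapt to your app's actual initialization logic.

```python
from fastapi import FastAPI, Response

app = FastAPI()

# Set this to True once your startup work (model download, DB connections,
# cache warm-up) has finished; until then the probe reports "not ready".
app.state.ready = False

@app.get("/readiness")
async def readiness():
    if app.state.ready:
        return {"status": "ready"}
    # 503 tells the platform not to route traffic to this replica yet.
    return Response(status_code=503)
```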
If you want to define a custom readiness endpoint for your deployment, you can specify it in your deployment configuration file as follows:
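The exact schema depends on Tensorkube's deployment configuration format, so treat the snippet below as a hypothetical illustration (the readiness key and HTTP-probe fields are assumptions) and consult the Tensorkube configuration reference for the authoritative field names.

```yaml
# Hypothetical illustration: point the readiness probe at the endpoint defined above.
readiness:
  httpGet:
    path: /readiness
    port: 80
```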
Then deploy your app using the following command:
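This assumes the standard Tensorkube CLI workflow; the flags for GPUs, config files, and so on vary by CLI version, so confirm them with `tensorkube deploy --help`.

```bash
# Deploy from your project directory.
tensorkube deploy
```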
hf_transfer
When deploying ML models on Tensorkube, Hugging Face's hf_transfer library provides an efficient way to handle model transfers from the Hugging Face Hub to your deployment environment. Leveraging hf_transfer allows you to optimize your ML model deployments and ensure faster startup times.
We recommend downloading your model during app startup instead of baking it into your Docker image: the speedup from hf_transfer, combined with a smaller Docker image, easily offsets any slowdown caused by downloading the model at startup.
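As a minimal sketch of this pattern (the model ID is illustrative, and this assumes huggingface_hub and hf-transfer are installed in your image):

```python
import os

# Enable hf_transfer before importing huggingface_hub so the setting is picked up.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Illustrative model ID; replace with the model your app serves.
MODEL_ID = "openai-community/gpt2"

def download_model() -> str:
    # Downloads (or reuses a cached copy of) the model and returns its local path.
    return snapshot_download(repo_id=MODEL_ID)
```

You can call a helper like this from your app's startup logic so the model is in place before the readiness probe reports ready.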
What is hf_transfer?
hf_transfer is a specialized Rust-based library that optimizes the download and transfer of models from the Hugging Face Hub to your deployment environment. It’s designed to improve transfer speeds, reduce deployment times, and ensure reliable model downloads, especially for large language models and other transformer-based architectures. It works by optimizing the download process through several key mechanisms:
- Parallel Processing: The library implements efficient multi-threading to download multiple chunks of a model simultaneously, significantly increasing throughput compared to sequential downloads.
- Optimized Network Utilization: The library bypasses the throughput ceiling of the standard Python-based download path, which typically tops out around 10.4 MB/s, and uses the full bandwidth available. This allows it to achieve speeds exceeding 1 GB/s on high-bandwidth connections.
Using hf_transfer
Switching to hf_transfer is extremely easy. All you need to do is install the hf-transfer Python package and set the HF_HUB_ENABLE_HF_TRANSFER environment variable to 1 in your deployment.
This can be achieved with the following commands (the environment variable must be set in the shell or process that runs your app):
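```bash
# Install the Rust-based transfer backend and enable it for huggingface_hub.
pip install hf-transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
```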
This can also be achieved by adding the following lines to your Dockerfile:
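```dockerfile
# Install hf_transfer in the image and enable it for every container started from it.
RUN pip install hf-transfer
ENV HF_HUB_ENABLE_HF_TRANSFER=1
```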
Root Access / The nvidia-smi Command
Tensorkube enforces strict security policies that prevent containers from running as root users. This is a critical security measure that significantly reduces the attack surface and protects your applications and infrastructure.
Why Root Access is Restricted
Running containers as root is a serious security risk that has been demonstrated repeatedly through container escape vulnerabilities. When containers run as root, attackers can potentially break out of container isolation and gain unrestricted access to the host, meaning a single vulnerability in your application can compromise your entire infrastructure. A compromised container with root privileges can read sensitive information from other containers on the node, and attackers could potentially steal cloud credentials and use your resources for malicious purposes.
Impact on GPU Operations
One common issue that arises from non-root restrictions is the inability to directly access GPU devices with commands like nvidia-smi. This happens because GPU device files typically belong to the root user and a specific group. The NVIDIA Management Library (NVML) requires specific permissions to initialize properly, and without those permissions, commands like nvidia-smi will fail.
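If you only need GPU information at runtime, one option is to query it through your ML framework instead of shelling out to nvidia-smi. The sketch below assumes PyTorch is installed in your image; it is an illustration, not a Tensorkube-specific API.

```python
import torch

# Query GPU availability and properties through CUDA APIs instead of nvidia-smi,
# which may fail in non-root containers.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA devices visible to this process.")
```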
How This Affects Your Deployments
If your deployment attempts to run GPU commands that require root privileges, it will fail, and your nodes might become unresponsive or stuck as hung GPU processes prevent the node from being scaled down automatically.