Here are the main guidelines to follow when deploying apps on Tensorkube.

The Readiness Probe

When deploying applications to Tensorkube, the readiness probe is essential to ensure your services remain stable and provide a consistent user experience.

What is a Readiness Probe?

A readiness probe is a health check mechanism that determines when your application is ready to start accepting traffic. An app that is running may still be initializing; a "ready" container is one that can properly handle incoming requests. Without a readiness probe, Tensorkube has no way to tell whether traffic can safely be routed to your app. Routing traffic to a container as soon as it starts running can result in failed requests, errors, and a poor user experience during deployments or scaling events, because the app might still be loading configuration files, establishing database connections, warming up caches, or initializing dependencies on other services.

Configuring a Readiness Probe

You can create a readiness endpoint in a FastAPI app as follows:

app.py
from fastapi import FastAPI

app = FastAPI()

@app.get("/readiness")
def readiness():
    return {"status": "ready"}

@app.get("/")
def read_root():
    return {"message": "Hello, World!"}

If you want to define a custom readiness endpoint for your deployment, you can specify it in your deployment configuration file as follows:

config.yaml
gpus: 1
gpu_type: v100
readiness:
  httpGet:
    path: /custom-readiness
    port: 8123

Then deploy your app using the command:

tensorkube deploy --config-file ./config.yaml
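
For the configuration above, your app must serve the custom path on the custom port. Here is a minimal sketch; the path and port simply mirror config.yaml, and uvicorn is assumed as the server:

from fastapi import FastAPI

app = FastAPI()

@app.get("/custom-readiness")
def custom_readiness():
    return {"status": "ready"}

if __name__ == "__main__":
    import uvicorn
    # listen on the same port that the readiness probe targets in config.yaml
    uvicorn.run(app, host="0.0.0.0", port=8123)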

hf_transfer

When deploying ML models on Tensorkube, Hugging Face's hf_transfer library provides an efficient way to move models from the Hugging Face Hub into your deployment environment. Leveraging hf_transfer lets you optimize your ML model deployments and achieve faster startup times. We recommend downloading your model during app startup instead of baking it into your Docker image: the speedup from hf_transfer and the smaller Docker image easily offset any slowdown caused by downloading the model. A sketch of this pattern appears at the end of the Using hf_transfer section below.

What is hf_transfer?

hf_transfer is a specialized Rust-based library that optimizes the download and transfer of models from the Hugging Face Hub to your deployment environment. It’s designed to improve transfer speeds, reduce deployment times, and ensure reliable model downloads, especially for large language models and other transformer-based architectures. It works by optimizing the download process through several key mechanisms:

  • Parallel Processing: The library implements efficient multi-threading to download multiple chunks of a model simultaneously, significantly increasing throughput compared to sequential downloads.
  • Optimized Network Utilization: The library removes bandwidth caps that typically limit standard downloads to around 10.4 MB/s and uses the full bandwidth available, allowing it to achieve speeds exceeding 1 GB/s on high-bandwidth connections.

Using hf_transfer

Switching to hf_transfer is straightforward. All you need to do is install the hf-transfer Python package and set the HF_HUB_ENABLE_HF_TRANSFER environment variable to 1 in your deployment.

This can be achieved with the following commands:

pip install --no-cache-dir hf-transfer
export HF_HUB_ENABLE_HF_TRANSFER=1

This can also be achieved by adding the following lines to your Dockerfile:

RUN pip install --no-cache-dir hf-transfer
ENV HF_HUB_ENABLE_HF_TRANSFER=1
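
Putting this together with the startup-download recommendation above, here is a minimal sketch of fetching a model when the app boots, with hf_transfer enabled. MODEL_ID is a placeholder, and snapshot_download comes from the huggingface_hub package, which picks up hf-transfer automatically once the environment variable is set:

import os

# must be set before huggingface_hub is imported for hf_transfer to take effect
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

MODEL_ID = "your-org/your-model"  # placeholder: the repo you actually deploy

def fetch_model() -> str:
    # downloads the model (or reuses the local cache) and returns its path;
    # call this from your app's startup hook before marking it ready
    return snapshot_download(repo_id=MODEL_ID)

if __name__ == "__main__":
    print(fetch_model())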

Root Access and the nvidia-smi Command

Tensorkube enforces strict security policies that prevent containers from running as root users. This is a critical security measure that significantly reduces the attack surface and protects your applications and infrastructure.

Why Root Access is Restricted

Running containers as root is a serious security risk, one that has been demonstrated repeatedly through container escape vulnerabilities. When containers run as root, attackers can potentially escape container isolation and gain unfettered access to the host, meaning a vulnerability in your application can compromise your entire infrastructure. A compromised container with root privileges can read sensitive information from every other container on the node, and attackers could also obtain cloud credentials and use your resources for malicious purposes.

Impact on GPU Operations

One common issue that arises from non-root restrictions is the inability to access GPU devices directly with commands like nvidia-smi. This happens because GPU device files typically belong to the root user and a specific group. The NVIDIA Management Library (NVML) requires specific permissions to initialize properly; without them, commands like nvidia-smi will fail.
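
If your application needs GPU information at runtime, query it through your ML framework instead of shelling out to nvidia-smi. A minimal sketch, assuming PyTorch with CUDA support is installed in your image:

import torch

# check GPU visibility without invoking nvidia-smi
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        free_bytes, total_bytes = torch.cuda.mem_get_info(i)
        print(f"GPU {i}: {name}, "
              f"{free_bytes / 1e9:.1f} GB free of {total_bytes / 1e9:.1f} GB")
else:
    print("No CUDA device is visible to this process")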

How This Affects Your Deployments

If your deployment tries to run GPU commands that require root privileges, it will fail, and your nodes might become unresponsive or stuck, as hanging GPU processes can prevent the node from being scaled down automatically.