Built with developer experience in mind, Tensorkube simplifies the process of deploying serverless GPU apps. In this guide, we will walk you through deploying Pixtral-12B (4-bit quantized) on your private cloud.

Prerequisites

Before you begin, ensure you have configured Tensorkube on your AWS account. If you haven’t done that yet, follow the Getting Started guide.

Deploying Pixtral-12B with Tensorfuse

Each tensorkube deployment requires two things - your code and your environment (as a Dockerfile). When deploying machine learning models, it is beneficial to bake the model weights into your container image, as this reduces cold-start times by a significant margin. To enable this, in addition to a FastAPI app and a Dockerfile, we will also write a script that downloads the model and places it inside our container image.
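By the end of this guide, your project directory should look roughly like this:

├── Dockerfile
├── download_model.py
└── main.py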

Download the model

We will write a small script that downloads the Pixtral model from the Hugging Face model hub and saves it in a local models directory.

download_model.py
import os

from huggingface_hub import snapshot_download
from transformers.utils import move_cache

# Replace with your Hugging Face access token
access_token = '<YOUR-HUGGINGFACE_TOKEN>'


if __name__ == "__main__":
    # download the 4-bit quantized Pixtral-12B model into the local models directory
    os.makedirs("./models", exist_ok=True)
    snapshot_download(
        repo_id="SeanScripts/pixtral-12b-nf4",
        local_dir="models",
        ignore_patterns=["*.pt", "*.bin"],
        token=access_token,
    )
    move_cache()
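Hardcoding the token is fine for a quick test, but you may prefer to read it from an environment variable instead. Below is a minimal sketch, assuming you expose the token to the build as an environment variable named HF_TOKEN (for example via a Docker build argument):

import os

# HF_TOKEN is an assumed variable name; pass it into the build however your pipeline allows.
access_token = os.environ.get("HF_TOKEN")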

Code files

We will write a small FastAPI app that loads the model and serves predictions. The FastAPI app will have three endpoints - /readiness, /, and /generate. Remember that the /readiness endpoint is used by Tensorkube to check the health of your deployments.

main.py
import torch
from transformers import (
    LlavaForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
from fastapi import FastAPI
import os
from pydantic import BaseModel
from typing import List
import time

app = FastAPI()
model_dir = "models"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Load the pre-quantized (nf4) model from the local directory baked into the image
model_id = "SeanScripts/pixtral-12b-nf4"
model = LlavaForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype=torch.float16, use_safetensors=True, low_cpu_mem_usage=True
)
# Load the processor (tokenizer + image processor)
processor = AutoProcessor.from_pretrained(model_id)

@app.get("/")
async def root():
    is_cuda_available = torch.cuda.is_available()
    return {
        "message": "Hello World",
        "cuda_available": is_cuda_available,
    }

@app.get("/readiness")
async def readiness():
    return {"status": "ready"}

class GenerateRequest(BaseModel):
    prompt: str
    images: List[str] = []

@app.post("/generate")
async def generate_text(request: GenerateRequest):
    prompt = request.prompt
    IMG_URLS = request.images

    if not prompt:
        return {"error": "prompt field is required"}

    formatted_prompt = f"<s>[INST]{prompt}\n" + "[IMG]" * len(IMG_URLS) + "[/INST]"
    inputs = processor(images=IMG_URLS, text=formatted_prompt, return_tensors="pt").to(device) if IMG_URLS else processor(text=formatted_prompt, return_tensors="pt").to(device)

    prompt_tokens = len(inputs["input_ids"][0])
    print(f"Prompt tokens: {prompt_tokens}")

    t0 = time.time()
    generate_ids = model.generate(**inputs, max_new_tokens=512)
    t1 = time.time()

    total_time = t1 - t0
    generated_tokens = len(generate_ids[0]) - prompt_tokens
    tokens_per_second = generated_tokens / total_time
    print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({tokens_per_second:.3f} tok/s)")

    output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return {"generated_text": output}

Environment files (Dockerfile)

Next, create a Dockerfile for your FastAPI app. Given below is a simple Dockerfile that you can use:

Dockerfile
# Use the nvidia cuda base image
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

# Update and install required packages
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3.10-dev \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Set Python 3.10 as the default Python version
RUN ln -s /usr/bin/python3.10 /usr/bin/python

# Upgrade pip and install Python dependencies
RUN pip3 install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir transformers torch fastapi uvicorn pydantic bitsandbytes accelerate pillow

# Set working directory
WORKDIR /code

# Copy the code files
COPY main.py /code/main.py
COPY download_model.py /code/download_model.py

# Run the downloader script to download the model
RUN python download_model.py

EXPOSE 80

# Start a uvicorn server on port 80
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]

Deploying the app

Pixtral-12B is now ready to be deployed on Tensorkube. Navigate to your project root and run the following command:

tensorkube deploy --gpus 1 --gpu-type a10g

Pixtral-12B is now deployed on your AWS account. You can access your app at the URL provided in the output or by using the following command:

tensorkube list deployments

And that’s it! You have successfully deployed Pixtral-12B on serverless GPUs using Tensorkube. 🚀

To test it out, you can run the following command, replacing the URL with the one provided in the output:

curl -X POST <YOUR_APP_URL_HERE>/generate -H "Content-Type: application/json" -d '{"prompt": "Describe the image.", "images": ["https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"]}'
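If you prefer to call the endpoint from Python, here is a minimal sketch using the requests library (assuming requests is installed locally and that you substitute your deployment URL):

import requests

APP_URL = "<YOUR_APP_URL_HERE>"  # replace with the URL from `tensorkube list deployments`

payload = {
    "prompt": "Describe the image.",
    "images": [
        "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"
    ],
}

# Send the prompt and the image URL to the /generate endpoint
response = requests.post(f"{APP_URL}/generate", json=payload)
print(response.json()["generated_text"])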

You can also use the readiness endpoint to wake up your nodes if you are expecting incoming traffic:

curl <YOUR_APP_URL_HERE>/readiness
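For example, a small script like the one below (again assuming the requests library) can poll the readiness endpoint and block until the deployment reports that it is ready:

import time

import requests

APP_URL = "<YOUR_APP_URL_HERE>"  # replace with your deployment URL

# Poll the readiness endpoint until the deployment reports that it is ready.
while True:
    try:
        response = requests.get(f"{APP_URL}/readiness", timeout=10)
        if response.status_code == 200:
            print("Deployment is ready:", response.json())
            break
    except requests.RequestException:
        pass  # the node may still be scaling up
    time.sleep(5)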