Built with developer experience in mind, Tensorkube simplifies the process of deploying serverless GPU apps. In this guide, we will walk you through deploying Pixtral-12B (4-bit quantized) on your private cloud.
Prerequisites
Before you begin, ensure you have configured Tensorkube on your AWS account. If you haven’t done that yet, follow the Getting Started guide.
Deploying Pixtral-12B with Tensorfuse
Each Tensorkube deployment requires two things: your code and your environment (as a Dockerfile).
When deploying machine learning models, it helps to make the model itself part of your container image, as this reduces cold-start times by a significant margin.
To enable this, in addition to a FastAPI app and a Dockerfile, we will also write a script that downloads the model so it can be baked into the image.
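By the end of the guide, your project directory should look roughly like this (file names match the Dockerfile later in this guide; the models directory is created during the image build):

├── Dockerfile
├── download_model.py
├── main.py
└── models/        # created by download_model.py at build time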
Download the model
We will write a small script that downloads the Pixtral model from the Hugging Face model hub and saves it in the ./models directory. Remember to replace <YOUR-HUGGINGFACE_TOKEN> with your own Hugging Face access token.
import os
from huggingface_hub import snapshot_download
from transformers.utils.hub import move_cache

access_token = '<YOUR-HUGGINGFACE_TOKEN>'

if __name__ == "__main__":
    os.makedirs("./models", exist_ok=True)
    snapshot_download(
        repo_id="SeanScripts/pixtral-12b-nf4",
        local_dir="models",
        ignore_patterns=["*.pt", "*.bin"],
        token=access_token,
    )
    move_cache()
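If you want to sanity-check the download before building the image, an optional check like the sketch below can confirm that the snapshot landed in ./models. The exact file list depends on the repository, so treat the assertions as illustrative.

import sys
from pathlib import Path

# Illustrative check only: the exact files depend on the Hugging Face repo.
files = sorted(p.name for p in Path("models").iterdir())
if "config.json" not in files or not any(f.endswith(".safetensors") for f in files):
    sys.exit("Download looks incomplete - re-run download_model.py")
print(f"Found {len(files)} files in ./models")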
Code files
We will write a small FastAPI app that loads the model and serves predictions. The app has three endpoints: /readiness, /, and /generate. Remember that the /readiness endpoint is used by Tensorkube to check the health of your deployments.
import torch
from transformers import (
    LlavaForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
from fastapi import FastAPI
import os
from pydantic import BaseModel
from typing import List
import time

app = FastAPI()

model_dir = "models"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "SeanScripts/pixtral-12b-nf4"

# Load the pre-quantized (nf4) checkpoint that was baked into the image at build time.
model = LlavaForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype=torch.float16, use_safetensors=True, low_cpu_mem_usage=True
)
processor = AutoProcessor.from_pretrained(model_id)


@app.get("/")
async def root():
    is_cuda_available = torch.cuda.is_available()
    return {
        "message": "Hello World",
        "cuda_available": is_cuda_available,
    }


@app.get("/readiness")
async def readiness():
    return {"status": "ready"}


class GenerateRequest(BaseModel):
    prompt: str
    images: List[str] = []


@app.post("/generate")
async def generate_text(request: GenerateRequest):
    prompt = request.prompt
    IMG_URLS = request.images
    if not prompt:
        return {"error": "prompt field is required"}

    # Pixtral instruction format: one [IMG] placeholder per image.
    formatted_prompt = f"<s>[INST]{prompt}\n" + "[IMG]" * len(IMG_URLS) + "[/INST]"
    inputs = (
        processor(images=IMG_URLS, text=formatted_prompt, return_tensors="pt").to(device)
        if IMG_URLS
        else processor(text=formatted_prompt, return_tensors="pt").to(device)
    )
    prompt_tokens = len(inputs["input_ids"][0])
    print(f"Prompt tokens: {prompt_tokens}")

    t0 = time.time()
    generate_ids = model.generate(**inputs, max_new_tokens=512)
    t1 = time.time()
    total_time = t1 - t0
    generated_tokens = len(generate_ids[0]) - prompt_tokens
    tokens_per_second = generated_tokens / total_time
    print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({tokens_per_second:.3f} tok/s)")

    output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return {"generated_text": output}
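Once the service is running, the /generate endpoint expects a JSON body matching GenerateRequest: a prompt string plus an optional list of image URLs. Below is a minimal client sketch, assuming the requests package is installed and using a placeholder URL that you should swap for your deployment URL.

import requests

# Placeholder URL - replace with your deployment URL (or a local uvicorn address).
url = "http://localhost:80/generate"
payload = {
    "prompt": "Describe the image.",
    "images": ["https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"],
}
resp = requests.post(url, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["generated_text"])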
Environment files (Dockerfile)
Next, create a Dockerfile for your FastAPI app. Given below is a simple Dockerfile that you can use:
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
RUN apt-get update && apt-get install -y \
python3.10 \
python3.10-dev \
python3-pip \
&& rm -rf /var/lib/apt/lists/*
RUN ln -s /usr/bin/python3.10 /usr/bin/python
RUN pip3 install --no-cache-dir --upgrade pip && pip3 install --no-cache-dir transformers torch fastapi uvicorn pydantic bitsandbytes accelerate pillow
WORKDIR /code
COPY main.py /code/main.py
COPY download_model.py /code/download_model.py
RUN python download_model.py
EXPOSE 80
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
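Note that the RUN python download_model.py step executes at build time, so the model weights are baked into the image. The build machine therefore needs network access to Hugging Face, and the resulting image will be several gigabytes larger than the base image; the trade-off, as discussed above, is much faster cold starts.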
Deploying the app
Pixtral-12B is now ready to be deployed on Tensorkube. Navigate to your project root and run the following command:
tensorkube deploy --gpus 1 --gpu-type a10g
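The a10g GPU type provides 24 GB of VRAM, which comfortably fits the 4-bit-quantized 12B weights with room left over for activations.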
Pixtral-12B is now deployed on your AWS account. You can access your app at the URL provided in the output or by using the following command:
tensorkube list deployments
And that’s it! You have successfully deployed Pixtral-12B on serverless GPUs using Tensorkube. 🚀
To test it out, run the following command, replacing the URL with the one provided in the output:
curl -X POST <YOUR_APP_URL_HERE>/generate -H "Content-Type: application/json" -d '{"prompt": "Describe the image.", "images": ["https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"]}'
You can also use the readiness endpoint to wake up your nodes when you are expecting incoming traffic:
curl <YOUR_APP_URL_HERE>/readiness