Built with developer experience in mind, Tensorkube simplifies the process of deploying serverless GPU apps. In this guide, we will walk you through the process of deploying Meta-Llama-3.2-11B-Instruct on your private cloud.

Prerequisites

Before you begin, ensure you have configured Tensorkube on your AWS account. If you haven’t done that yet, follow the Getting Started guide.

Deploying Meta-Llama-3.2-11B-Instruct with Tensorfuse

Each Tensorkube deployment requires two things - your code and your environment (as a Dockerfile). When deploying machine learning models, it is beneficial to make the model weights part of your container image, as this reduces cold-start times by a significant margin. To enable this, in addition to a FastAPI app and a Dockerfile, we will also write a script that downloads the model and places it inside the image. Learn more about Llama 3.2 and Gradio by visiting their docs. We serve model requests with a FastAPI server that has a Gradio app mounted on it; for more information, refer to the Gradio docs on mounting apps. The files we will create are listed below.
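By the end of this guide, your project root will contain just these three files:

.
├── Dockerfile
├── download_model.py
└── main.py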

Download the model

We will write a small script that downloads the Llama-3.2-11B-Vision-Instruct model from the Hugging Face model hub and saves it in the ./models directory. Note: since Llama 3.2 is a gated repo, you will need to request access from the repo authors on Hugging Face before the download will work.

download_model.py
import os

from huggingface_hub import snapshot_download
access_token = '<YOUR-HUGGINGFACE_TOKEN>'


if __name__ == '__main__':
    # Download the Llama 3.2 11B Vision Instruct weights into ./models,
    # skipping the *.pt/*.bin checkpoints to keep the image smaller
    os.makedirs('./models', exist_ok=True)
    snapshot_download(
        repo_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
        local_dir="models",
        ignore_patterns=["*.pt", "*.bin"],
        token=access_token,
    )
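If you want to confirm that your token has actually been granted access to the gated repo before baking the download into a Docker build, a quick check along these lines can save a failed build. This is a minimal, optional sketch; check_access.py is a hypothetical helper and the token value is a placeholder.

check_access.py
# Optional helper for verifying gated-repo access before building the image
from huggingface_hub import HfApi

access_token = '<YOUR-HUGGINGFACE_TOKEN>'

# Raises an error (e.g. a gated-repo/authorization error) if access has not been granted yet
info = HfApi().model_info("meta-llama/Llama-3.2-11B-Vision-Instruct", token=access_token)
print(f"Access OK: {info.id}")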

Code files

We will write a small FastAPI app that loads the model and serves predictions through a mounted Gradio app. The app will have three endpoints - /readiness, /, and /gradio. Remember that the /readiness endpoint is used by Tensorkube to check the health of your deployments.

main.py
import torch
from fastapi import FastAPI
from transformers import MllamaForConditionalGeneration, AutoProcessor
import gradio as gr

app = FastAPI()

model_dir = "models"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the weights that were baked into the image at build time
model = MllamaForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The processor handles chat templating, tokenization, and image preprocessing
processor = AutoProcessor.from_pretrained(
    model_dir,
    local_files_only=True,
)

@app.get("/")
async def root():
    is_cuda_available = torch.cuda.is_available()
    return {
        "message": "Hello World",
        "cuda_available": is_cuda_available,
    }

@app.get("/readiness")
async def readiness():
    return {"status": "ready"}


def generate_response(prompt, image):
    # Build the chat message; include the image placeholder only when an image is provided
    if image is not None:
        messages = [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": prompt}
            ]}
        ]
    else:
        messages = [
            {"role": "user", "content": [
                {"type": "text", "text": prompt}
            ]}
        ]
    input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
    if image is not None:
        inputs = processor(
            image, input_text, add_special_tokens=False, return_tensors="pt"
        ).to(model.device)
    else:
        inputs = processor(
            text=input_text, add_special_tokens=False, return_tensors="pt"
        ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output[0], skip_special_tokens=True)

# Create the Gradio interface for the Llama 3.2 Vision Instruct model
interface = gr.Interface(
    fn=generate_response,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Image(type="pil", label="Input Image")
    ],
    outputs="text",
    title="Run Llama 3.2 Vision Instruct using Tensorfuse",
    description="Chat with the Llama 3.2 11B Vision Instruct model. Provide a prompt and optionally an input image.",
)

app = gr.mount_gradio_app(app, interface, path="/gradio")
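Before containerizing, you can optionally smoke-test the app on a GPU machine by starting it with uvicorn main:app --host 0.0.0.0 --port 80 and hitting the JSON endpoints. The snippet below is a minimal sketch, assuming the requests package is installed and the server is reachable at localhost:80; smoke_test.py is not part of the deployment.

smoke_test.py
# Optional local check for the two JSON endpoints
import requests

BASE_URL = "http://localhost:80"

# /readiness is the endpoint Tensorkube probes to decide whether the pod is healthy
print(requests.get(f"{BASE_URL}/readiness").json())  # expected: {'status': 'ready'}

# / reports whether CUDA is visible, confirming the GPU is wired up correctly
print(requests.get(f"{BASE_URL}/").json())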

Environment files (Dockerfile)

Next, create a Dockerfile for your FastAPI app. Given below is a simple Dockerfile that you can use:

Dockerfile
# Use the nvidia cuda base image
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

# Update and install required packages
RUN apt-get update && apt-get install -y \
    ffmpeg \
    python3.10 \
    python3.10-dev \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Set Python 3.10 as the default Python version
RUN ln -s /usr/bin/python3.10 /usr/bin/python

# Upgrade pip and install the Python dependencies
RUN pip3 install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir transformers torch accelerate fastapi uvicorn "huggingface_hub[cli]" gradio hf_transfer

# Set working directory
WORKDIR /code

# Copy the code files
COPY main.py /code/main.py
COPY download_model.py /code/download_model.py

# Download the model at build time so the weights are baked into the image (faster cold starts)
RUN HF_HUB_ENABLE_HF_TRANSFER=1 python download_model.py

EXPOSE 80

# Start a uvicorn server on port 80
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]

Deploying the app

Llama-3.2-11B-Vision-Instruct is now ready to be deployed on Tensorkube. Navigate to your project root and run the following command:

tensorkube deploy --gpus 1 --gpu-type a10g

Llama-3.2-11B-Vision-Instruct is now deployed on your AWS account. You can access your app at the URL provided in the output, or list your deployments with the following command:

tensorkube list deployments

And that’s it! You have successfully deployed Llama-3.2-11B-Vision-Instruct on serverless GPUs using Tensorkube. 🚀

To test the app, open the Gradio UI in your browser at the /gradio path of your deployment URL:

<YOUR_APP_URL_HERE>/gradio
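If you would rather exercise the Gradio endpoint programmatically, the gradio_client package can call the mounted interface. The sketch below is an assumption-laden example: install gradio_client separately, the URL, prompt, and image path are placeholders, and the exact client API may vary with your gradio_client version.

query_gradio.py
# Hypothetical client-side test; requires: pip install gradio_client
from gradio_client import Client, handle_file

client = Client("<YOUR_APP_URL_HERE>/gradio")

# The interface takes a text prompt and an optional image (passed as a file)
result = client.predict(
    "Describe this image in one sentence.",
    handle_file("path/to/your/image.jpg"),
    api_name="/predict",
)
print(result)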

You can also hit the readiness endpoint to wake up your nodes when you are expecting incoming traffic:

curl <YOUR_APP_URL_HERE>/readiness