Built with developer experience in mind, Tensorkube simplifies the process of deploying serverless GPU apps. In this guide,
we will walk you through deploying a 4-bit quantized version of Llama-3.1-70B-Instruct on your private cloud.
Prerequisites
Before you begin, ensure you have configured Tensorkube on your AWS account. If you haven’t done that yet, follow the Getting Started guide.
Deploying Int4 Quantized Llama-3.1-70B-Instruct with Tensorfuse
Each Tensorkube deployment requires two things: your code and your environment (as a Dockerfile).
When deploying machine learning models, it is beneficial to bake the model weights into your container image, as this reduces cold-start times by a significant margin.
To enable this, in addition to a FastAPI app and a Dockerfile, we will also write a script that downloads the model so it can be baked into the image.
Download the model
We will write a small script that downloads the hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 model from the Hugging Face model hub and saves it in the models directory. Save it as download_model.py, since the Dockerfile below runs it at build time.
import os
from huggingface_hub import snapshot_download

access_token = '<YOUR-HUGGINGFACE_TOKEN>'

if __name__ == '__main__':
    os.makedirs('./models', exist_ok=True)
    snapshot_download(
        repo_id="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        local_dir="models",
        ignore_patterns=["*.pt", "*.bin"],
        token=access_token,
    )
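If you prefer not to hardcode your token in the script, a minimal alternative sketch is to read it from an environment variable. HF_TOKEN below is an assumed variable name, not something Tensorkube provides, and if you go this route you will also need to make the variable available at image build time (for example via a Docker build argument).

import os
from huggingface_hub import snapshot_download

if __name__ == '__main__':
    # Assumed variable name; set HF_TOKEN in the environment that runs this script
    access_token = os.environ.get("HF_TOKEN")
    os.makedirs('./models', exist_ok=True)
    snapshot_download(
        repo_id="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
        local_dir="models",
        ignore_patterns=["*.pt", "*.bin"],
        token=access_token,
    )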
Code files
We will write a small FastAPI app that loads the model and serves predictions. The app has three endpoints: /readiness, /, and /generate. Remember that the /readiness endpoint is used by Tensorkube to check the health of your deployments. Save the app as main.py.
from fastapi import FastAPI
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

app = FastAPI()

model_dir = "models"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Fuse the AWQ modules for faster inference
quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    do_fuse=True,
)

# Load the quantized model and tokenizer from the local models directory
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
    quantization_config=quantization_config,
)


@app.get("/")
async def root():
    is_cuda_available = torch.cuda.is_available()
    return {
        "message": "Hello World",
        "cuda_available": is_cuda_available,
    }


@app.get("/readiness")
async def readiness():
    # Used by Tensorkube to check the health of the deployment
    return {"status": "ready"}


@app.post("/generate")
async def generate_text(data: dict):
    text = data.get("text")
    if not text:
        return {"error": "text field is required"}

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": text},
    ]

    # Build the chat prompt and move the input tensors to the GPU
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to("cuda")

    outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.batch_decode(
        outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
    return {"generated_text": response}
Environment files (Dockerfile)
Next, create a Dockerfile for your FastAPI app. Given below is a simple Dockerfile that you can use:
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

RUN apt-get update && apt-get install -y \
    python3.10 \
    python3.10-dev \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

RUN ln -s /usr/bin/python3.10 /usr/bin/python

RUN pip3 install --no-cache-dir --upgrade pip && \
    pip install --no-cache-dir torch fastapi uvicorn pydantic accelerate && \
    pip install --no-cache-dir --upgrade transformers autoawq

WORKDIR /code

COPY main.py /code/main.py
COPY download_model.py /code/download_model.py

# Download the model weights at build time so they are baked into the image
RUN python download_model.py

EXPOSE 80

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
Deploying the app
Our model is now ready to be deployed on AWS. Navigate to your project root and run the following command:
tensorkube deploy --gpus 1 --gpu-type a100
Int4 Quantized Llama-3.1-70B-Instruct is now deployed on your AWS account. You can access your app at the URL shown in the deployment output, or retrieve it with the following command:
tensorkube list deployments
And that’s it! You have successfully deployed the quantized Llama-3.1-70B-Instruct on serverless GPUs using Tensorkube. 🚀
To test it out, replace the URL with the one provided in the output and run the following command:
curl -X POST <YOUR_APP_URL_HERE>/generate -H "Content-Type: application/json" -d '{"text":"Name the intern at CrowdStrike who pushed the buggy update"}'
You can also hit the readiness endpoint to wake up your nodes if you are expecting incoming traffic:
curl <YOUR_APP_URL_HERE>/readiness
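If you would rather wait programmatically until the deployment is warm before sending traffic, a minimal polling sketch is given below. APP_URL is a placeholder for the URL returned by tensorkube list deployments, and the timeout values are arbitrary assumptions.

# Minimal readiness-polling sketch; APP_URL is a placeholder, and the timeout
# and interval values are arbitrary assumptions.
import time
import requests

APP_URL = "<YOUR_APP_URL_HERE>"

def wait_until_ready(url: str, timeout_s: int = 600, interval_s: int = 10) -> bool:
    """Poll /readiness until it returns HTTP 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{url}/readiness", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # the node may still be scaling up
        time.sleep(interval_s)
    return False

if __name__ == "__main__":
    print("ready" if wait_until_ready(APP_URL) else "timed out")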