Deploying Llama-3.3-70B-Instruct on Serverless GPUs
Deploy serverless GPU applications on your AWS account
Built with developer experience in mind, Tensorfuse simplifies the process of deploying serverless GPU apps. In this guide, we will walk you through the process of deploying the Llama 3.3 70B Instruct model using Tensorfuse on your cloud account.
We will be using L40S GPUs. Each L40S has 48 GB of GPU memory, and we need around 140 GB of GPU memory to hold the model weights in float16 (70B parameters × 2 bytes per parameter), so we will run Llama 3.3 across 4 L40S GPUs.
We will also add token-based authentication to our service, which is compatible with OpenAI client libraries.
The vLLM server is essentially a FastAPI app, so it can be extended with middleware and other FastAPI features. In this guide, we will see how to support authentication using a Bearer token. If you need more information on how to add features to vLLM, feel free to ask us in our Slack Community.
Prerequisites
Before you begin, ensure you have configured Tensorkube on your AWS account. If you haven’t done that yet, follow the Getting Started guide.
Deploying Llama-3.3-70B-Instruct with Tensorfuse
Each Tensorkube deployment requires three things: your code, your environment (as a Dockerfile), and a deployment configuration.
We also need to provide a Hugging Face token to download the model from the Hugging Face Hub, and an authentication token that our vLLM service will use to authenticate incoming requests. vLLM provides a straightforward way to add authentication to your service via its --api-key flag.
We will store both of these tokens as Tensorfuse secrets.
Step 1: Setting up the secrets
Access to Llama 3.3
Llama 3.3 requires a license agreement. Visit the Llama 3.3 Hugging Face repo to ensure that you have signed the agreement and have access to the model.
Set the Hugging Face token
Get a READ token from your Hugging Face profile and store it as a secret in Tensorfuse using the command below.
Ensure that the key for your secret is HUGGING_FACE_HUB_TOKEN, as vLLM expects that variable name.
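A sketch of the secret-creation command is shown below. The exact `tensorkube secret create` syntax, including the secret name (here `huggingface-secret`) and the KEY=VALUE form, is an assumption on our part; verify it against the Tensorfuse secrets documentation or `tensorkube secret create --help`.

```bash
# Store the Hugging Face READ token as a Tensorfuse secret (sketch).
# The key must be HUGGING_FACE_HUB_TOKEN, since vLLM reads that variable;
# the secret name and KEY=VALUE syntax below are assumptions.
tensorkube secret create huggingface-secret HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```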
Set your API authentication token
Generate a random string to use as your API authentication token and store it as a secret in Tensorfuse using the command below.
For the purposes of this demo, we will be using vllm-key as the API key.
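Again, the exact command syntax below is an assumption. We store the token under the key VLLM_API_KEY (a name we chose) so that the Dockerfile in Step 2 can pass it to vLLM.

```bash
# Store the API authentication token as a Tensorfuse secret (sketch).
# The secret name "vllm-token" and the key VLLM_API_KEY are our own choices;
# adjust the syntax to match the Tensorfuse CLI.
tensorkube secret create vllm-token VLLM_API_KEY=vllm-key
```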
Ensure that in production you use a randomly generated token; you can quickly generate one using openssl rand -base64 32. Remember to keep it safe, as Tensorfuse secrets are opaque.
Step 2: Prepare the Dockerfile
We will use the official vLLM OpenAI image as our base image. This image comes with all the necessary dependencies to run vLLM and is available on Docker Hub as vllm/vllm-openai.
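Below is a minimal sketch of such a Dockerfile. The flag values (dtype, context length, port) are illustrative assumptions, and VLLM_API_KEY is the secret key we chose in Step 1; tune everything to your use case.

```dockerfile
# Sketch of a Dockerfile for serving Llama 3.3 70B with vLLM; flag values
# are illustrative and should be tuned for your workload.
FROM vllm/vllm-openai:latest

# Port the OpenAI-compatible server listens on (matches the readiness probe).
EXPOSE 80

# HUGGING_FACE_HUB_TOKEN and VLLM_API_KEY are injected at runtime from the
# Tensorfuse secrets created in Step 1; never bake tokens into the image.
# Shell-form ENTRYPOINT so that $VLLM_API_KEY is expanded at container start.
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --max-model-len 4096 \
    --port 80 \
    --api-key "$VLLM_API_KEY"
```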
We have used a number of CLI flags to tailor the vLLM server to our use case. All the other vLLM flags are listed here. If you are unsure which flags to use for your production deployment, please ask in the Tensorfuse Community.
Step 3: Deployment config
Although you can deploy Tensorfuse apps from the command line, we recommend using a config file so that you can follow a GitOps approach to deployment.
We set up the basic infra configuration, such as the number and type of GPUs, in deployment.yaml; an illustrative sketch of the file follows the note below. You can go through all the configurable options in the config file guide.
Remember to always include a readiness endpoint in your deployment config. It is used by the Tensorkube controller to check whether your app is ready to serve traffic.
If no readiness endpoint is configured, Tensorfuse tries the /readiness path on port 80 by default, which can cause issues if your app is not listening on that path.
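Below is an illustrative sketch of what deployment.yaml might look like for this setup. The field names are assumptions based on the options described above (GPU count and type, the two secrets, and a readiness probe); treat the config file guide as the authoritative schema. The readiness probe points at vLLM's /health route, which the server exposes on the same port it serves on.

```yaml
# Sketch of deployment.yaml; verify field names against the config file guide.
gpus: 4                    # shard the 70B model across 4 GPUs
gpu_type: l40s             # 48 GB of GPU memory per L40S
secret:
  - huggingface-secret     # HUGGING_FACE_HUB_TOKEN from Step 1
  - vllm-token             # VLLM_API_KEY from Step 1
min_scale: 0               # scale to zero when idle (assumed field name)
readiness:
  httpGet:
    path: /health          # vLLM's health route
    port: 80
```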
We are now all set to deploy Llama 3.3 70B Instruct on serverless GPUs using Tensorfuse. Run the command below to start the build and wait for your deployment to become ready.
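A sketch of the deploy command is below; the `--config-file` flag name is an assumption, so check `tensorkube deploy --help` for the exact form.

```bash
# Build the image and deploy it using the config file (flag name assumed).
tensorkube deploy --config-file ./deployment.yaml
```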
Step 4: Accessing the deployed app
Voila! Your autoscaling production LLM service is ready. Only authenticated requests will be served by your endpoint. You can list your deployments using the command below.
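The listing command below is a sketch; the exact subcommand may differ, so verify with `tensorkube --help`.

```bash
# List deployments and their endpoint URLs (subcommand name assumed).
tensorkube deployment list
```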
And that’s it! You have successfully deployed Llama-3.3-70B-Instruct on serverless GPUs using Tensorkube. 🚀
Remember to configure a TLS endpoint with a custom domain before going to production.
To test it out, you can run the following command, replacing the URL with the one provided in the output:
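For example, a chat-completions request against the OpenAI-compatible route looks like the following. Here <YOUR_APP_URL> is a placeholder for the endpoint from the deployment output, and vllm-key is the demo token from Step 1.

```bash
# Query the OpenAI-compatible chat completions endpoint.
# Replace <YOUR_APP_URL> with the URL from the deployment output.
curl -X POST "<YOUR_APP_URL>/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer vllm-key" \
  -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Describe the Milky Way galaxy in one sentence."}
        ],
        "max_tokens": 128
      }'
```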
Since vLLM is compatible with the OpenAI API, you can also query the other endpoints listed here.
You can also use the OpenAI Python SDK to query your deployment as shown below:
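A minimal sketch using the OpenAI Python SDK (openai>=1.0) is shown below; the base URL placeholder and the vllm-key token are the same assumptions as in the curl example.

```python
# Query the vLLM deployment through the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI(
    base_url="<YOUR_APP_URL>/v1",  # replace with your deployment URL
    api_key="vllm-key",            # the demo token stored in Step 1
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about serverless GPUs."},
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```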
You can also use the readiness endpoint to wake up your model when you are expecting incoming traffic. In this case, you need to hit the /health URL.
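For example, something like the following should be enough to warm the service up. vLLM's API-key middleware normally guards only the /v1 routes, so no token should be required here, but add the Authorization header if your setup enforces it.

```bash
# Warm up the deployment ahead of expected traffic by hitting the health route.
curl "<YOUR_APP_URL>/health"
```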