L40S GPUs, as each L40S GPU has 48GB of GPU memory and we need around 140GB of GPU memory to run the model in float16. Therefore, we will be running Llama 3.3 on 4 L40S GPUs.
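As a quick back-of-the-envelope check: 70B parameters × 2 bytes per parameter in float16 ≈ 140GB for the weights alone, while 4 × 48GB = 192GB of total GPU memory leaves headroom for the KV cache and activations.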
We will also add token-based authentication to our service, which is compatible with OpenAI client libraries.
The vLLM server is essentially a FastAPI app, and it can be extended with middlewares and other FastAPI features. In this guide, we will see how to support authentication using a Bearer token. If you need more information on how to add more features to vLLM, feel free to ask us in our Slack Community.
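As a sketch of what an authenticated request will look like once the service is deployed (the deployment URL placeholder below is hypothetical, and vllm-key is the demo token we set up in Step 1):

```bash
# Only requests carrying the Bearer token are served; anything else
# gets a 401 from vLLM's --api-key check.
curl https://<your-deployment-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer vllm-key" \
  -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```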
Prerequisites
Before you begin, ensure you have configured Tensorkube on your AWS account. If you haven't done that yet, follow the Getting Started guide.
Deploying Llama-3.3-70B-Instruct with Tensorfuse
Each tensorkube deployment requires three things: your code, your environment (as a Dockerfile), and a deployment configuration. We also need to provide a huggingface-token to download the model from the huggingface hub, and an authentication-token that our vLLM service will use to authenticate incoming requests. vLLM provides a straightforward way to add authentication to your service via its --api-key flag. We need to store both of these tokens as Tensorfuse secrets.
Step 1: Setting up the secrets
1. Access to Llama 3.3
Llama-3.3 requires a license agreement. Visit the Llama 3.3 huggingface repo to ensure that you have signed the agreement and have access to the model.
2. Set huggingface token
Get a READ token from your huggingface profile and store it as a secret in Tensorfuse using the command below. Ensure that the key for your secret is HUGGING_FACE_HUB_TOKEN, as vLLM assumes the same.
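A sketch of the secret-creation command, assuming a tensorkube secret create <name> <value> --env keyname=<KEY> syntax; the secret name here is illustrative, so check the Tensorfuse secrets docs for the exact flags:

```bash
# The env key must be HUGGING_FACE_HUB_TOKEN so vLLM picks it up automatically
tensorkube secret create hugging-face-secret hf_xxxxxxxxxxxxxxxx \
  --env keyname=HUGGING_FACE_HUB_TOKEN
```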
3. Set your API authentication token
Generate a random string that will be used as your API authentication token. Store it as a secret in Tensorfuse using the command below. For the purpose of this demo, we will be using vllm-key as your api-key. Ensure that in production you use a randomly generated token. You can quickly generate one using openssl rand -base64 32, and remember to keep it safe, as tensorfuse secrets are opaque.
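A sketch of how you might generate and store the token; the secret name vllm-token and the VLLM_API_KEY key are illustrative, and the exact tensorkube secret create flags may differ on your CLI version:

```bash
# Generate a strong random token for production use
openssl rand -base64 32

# For this demo we simply store vllm-key as the value
tensorkube secret create vllm-token vllm-key --env keyname=VLLM_API_KEY
```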
Step 2: Prepare the Dockerfile
We will use the official vLLM OpenAI image as our base image. This image comes with all the necessary dependencies to run vLLM. The image is available on DockerHub as vllm/vllm-openai.
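A minimal sketch of what the Dockerfile could look like; the model id, port, and server flags mirror the setup described above, and the demo api-key is hardcoded only because exec-form ENTRYPOINT does not expand environment variables:

Dockerfile

```dockerfile
# Base image with vLLM and its OpenAI-compatible server preinstalled
FROM vllm/vllm-openai:latest

# Expose the port the server listens on
EXPOSE 80

# Serve Llama 3.3 70B across 4 GPUs; --api-key turns on Bearer-token auth
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
            "--model", "meta-llama/Llama-3.3-70B-Instruct", \
            "--dtype", "float16", \
            "--tensor-parallel-size", "4", \
            "--port", "80", \
            "--api-key", "vllm-key"]
```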
Step 3: Deployment config
Although you can deploy tensorfuse apps using the command line, it is always recommended to have a config file so that you can follow a GitOps approach to deployment. We set up the basic infra configuration, such as the number of GPUs and the type of GPU, in deployment.yaml. You can go through all the configurable options in the config file guide.
deployment.yaml
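A sketch of what the config could contain, with illustrative field names (gpus, gpu_type, secret, readiness); consult the config file guide for the exact keys your Tensorkube version expects:

```yaml
# Illustrative Tensorkube deployment config for Llama 3.3 70B on 4xL40S
gpus: 4
gpu_type: l40s
secret:
  - hugging-face-secret   # provides HUGGING_FACE_HUB_TOKEN
  - vllm-token            # provides the API authentication token
readiness:
  httpGet:
    path: /health         # vLLM's built-in health check endpoint
    port: 80
```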
Note the readiness endpoint in your deployment config. This is used by the Tensorkube controller to check if your app is ready to serve traffic. If no readiness endpoint is configured, Tensorfuse tries the /readiness path on port 80 by default, which can cause issues if your app is not listening on that path.
Step 4: Accessing the deployed app
Voila! Your autoscaling production LLM service is ready. Only authenticated requests will be served by your endpoint. You can list your deployments using the command below.
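Assuming the standard Tensorkube CLI, listing deployments looks roughly like this; the exact output columns may vary:

```bash
# Shows your deployments along with their endpoint URLs and status
tensorkube deployment list
```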
Remember to configure a TLS endpoint with a custom domain before going to production. You can also use the readiness endpoint (the /health url) to wake up your models in case you are expecting incoming traffic.
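Since the endpoint is OpenAI-compatible, here is a minimal sketch of querying it with the official openai Python client; the base_url placeholder is hypothetical, and the api_key must match the token you stored in Step 1:

```python
from openai import OpenAI

# Point the client at your Tensorfuse deployment; vllm-key is the demo token
client = OpenAI(
    base_url="https://<your-deployment-url>/v1",
    api_key="vllm-key",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about autoscaling GPUs."}],
)
print(response.choices[0].message.content)
```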