Deploy serverless GPU applications on your AWS account
In this guide, we will run Llama 3.3 70B on L40S GPUs. Each L40S GPU has 48GB of GPU memory, and we need around 140GB of GPU memory (70B parameters at 2 bytes each) to run the model in float16. Therefore we will be running Llama 3.3 on 4 L40S GPUs.
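As a back-of-envelope check (weights only; the KV cache and activations need extra headroom on top of this), the sizing arithmetic looks like this:

```bash
# Weights-only float16 footprint: 70e9 parameters x 2 bytes = 140 GB.
# Each L40S has 48 GB, so ceil(140 / 48) = 3 GPUs is the bare minimum;
# we use 4 so the model shards evenly across GPUs with tensor parallelism.
python3 -c "print(70e9 * 2 / 1e9, 'GB')"   # -> 140.0 GB
```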
We will also add token-based authentication to our service, compatible with OpenAI client libraries.
We will need a `huggingface-token` to download the model from the Hugging Face hub, and an `authentication-token` that our vLLM service will use to authenticate incoming requests. vLLM provides a straightforward way to add authentication to your service via its `--api-key` flag. We need to store both of these tokens as Tensorfuse secrets.
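For reference, the `--api-key` flag sits on the server command that runs inside your container. This is a minimal sketch, not the exact entrypoint from this guide: the model id is the public `meta-llama/Llama-3.3-70B-Instruct` repository, and `$VLLM_API_KEY` is an assumed environment variable name.

```bash
# vLLM rejects any request whose Authorization header does not carry this key.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --dtype float16 \
  --tensor-parallel-size 4 \
  --port 80 \
  --api-key "$VLLM_API_KEY"
```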
Access to Llama 3.3

Llama 3.3 is a gated model, so make sure your Hugging Face account has been granted access to it before you start.
Set huggingface token
Generate a `READ` token from your Hugging Face profile and store it as a secret in Tensorfuse using the command below. Make sure you name it `HUGGING_FACE_HUB_TOKEN`, as vLLM assumes the same.
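A sketch of the secret command, assuming the CLI follows a `tensorkube secret create <name> KEY=VALUE` shape; check `tensorkube secret create --help` for the exact syntax:

```bash
# hf_xxx is your READ token; the key name must be HUGGING_FACE_HUB_TOKEN.
tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=hf_xxx
```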
Set your API authentication token

In this guide, we will use `vllm-key` as your api-key. For production, generate a secure token with `openssl rand -base64 32` and remember to keep it safe, as Tensorfuse secrets are opaque.
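Storing the key follows the same pattern as the Hugging Face secret above; the secret and key names here (`vllm-token`, `VLLM_API_KEY`) are assumptions for illustration, not fixed names from the platform:

```bash
openssl rand -base64 32                                    # production-grade token
tensorkube secret create vllm-token VLLM_API_KEY=vllm-key  # demo key from this guide
```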
Once the secrets are set, define your service in a `deployment.yaml` file. You can go through all the configurable options in the config file guide.
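Below is a hypothetical sketch of what the file could contain; every field name here is an assumption, so treat the config file guide as the source of truth:

```yaml
# deployment.yaml -- illustrative only; verify keys against the config file guide.
gpus: 4          # Llama 3.3 needs ~140 GB in float16
gpu_type: l40s   # 48 GB per GPU
port: 80
readiness:
  httpGet:
    path: /health   # vLLM's built-in health endpoint (see below)
    port: 80
```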
Remember to configure a `readiness` endpoint in your deployment config. This is used by the Tensorkube controller to check if your app is ready to serve traffic. If no `readiness` endpoint is configured, Tensorfuse tries the `/readiness` path on port 80 by default, which can cause issues if your app is not listening on that path. Luckily, vLLM exposes a `/health` endpoint, so we will use that as the readiness url.
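Once deployed, you can verify both the readiness path and the token authentication from the command line. `<your-app-url>` is a placeholder for the endpoint Tensorfuse gives you:

```bash
# Readiness: should return 200 once the model has loaded.
curl https://<your-app-url>/health

# OpenAI-compatible chat completion, authenticated with the vllm-key token.
curl https://<your-app-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer vllm-key" \
  -d '{
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```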