You can deploy other GGUF quant models by modifying the entrypoint.sh script below. You can also tinker with the number of GPUs and the GPU type to deploy on other GPU combinations. If you have more than 192 GB of GPU memory, we also recommend experimenting with the --ctx-size parameter.

Prerequisites
Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven’t done that yet, follow the Getting Started guide.

Deploying DeepSeek-R1-671B with Tensorfuse
Each Tensorkube deployment requires:

- Your environment (as a Dockerfile).
- Your code (in this example, the entrypoint.sh script).
- A deployment configuration (deployment.yaml).
Step 1: Prepare the Dockerfile
We will use the official llama.cpp image as our base image. This image comes with all the dependencies needed to run llama.cpp and is published on the GitHub Container Registry as ghcr.io/ggerganov/llama.cpp. We then set the environment variables required to run the model, install the Hugging Face dependencies needed to download it, copy our code, and set the permissions for the entrypoint script.

Dockerfile
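The original Dockerfile is not reproduced here, so below is a minimal sketch of what it could look like. The base image tag, the HF_HOME cache path, and the file locations are assumptions; adjust them to your setup.

```dockerfile
# Sketch only: the exact base tag and environment variables are assumptions;
# pick a CUDA-enabled tag that matches your cluster.
FROM ghcr.io/ggerganov/llama.cpp:full-cuda

# Illustrative environment variable for the Hugging Face download cache
ENV HF_HOME=/models/.hf_cache

# Install huggingface_hub so entrypoint.sh can call snapshot_download
RUN python3 -m pip install --no-cache-dir huggingface_hub

# Copy the entrypoint script and make it executable
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

# llama-server will listen on 8080 (referenced again in deployment.yaml)
EXPOSE 8080

ENTRYPOINT ["/entrypoint.sh"]
```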
Step 2: Prepare the entrypoint script
In this step, we first download the GGUF model using snapshot_download from huggingface_hub. We then start the llama server with the flags needed to run the model.
You can deploy other GGUF quant models by modifying the entrypoint.sh script below. You will have to change the repo_id and local_dir arguments passed to snapshot_download and update the --model flag in the llama-server command.

entrypoint.sh
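Since the original script is not shown here, the following is a rough sketch of what entrypoint.sh could look like. The repo_id, allow_patterns, local_dir, model path, binary location, and server flag values are all assumptions; substitute the values for the quant you actually want to serve.

```bash
#!/bin/bash
set -e

# Download the GGUF shards. repo_id, allow_patterns and local_dir below are
# illustrative values, not the guide's exact ones.
python3 - <<'PY'
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",      # assumption: any GGUF repo works here
    allow_patterns=["*UD-IQ1_S*"],           # assumption: pull only one quant's shards
    local_dir="/models/deepseek-r1",
)
PY

# Start llama-server. The binary path is an assumption for the full-cuda base
# image; tune --n-gpu-layers and --ctx-size for your GPU count and memory.
exec /app/llama-server \
  --model /models/deepseek-r1/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 999 \
  --ctx-size 8192
```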
The complete list of llama-server flags is available for further reference, and if you have questions about selecting flags for production, the Tensorfuse Community is an excellent place to seek guidance.
Step 3: Deployment config
Although you can deploy Tensorfuse apps using the command line, we recommend keeping a config file so that you can follow a GitOps approach to deployment.

deployment.yaml
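The config file itself is not reproduced here, so the snippet below is only a rough sketch of the ideas discussed in this guide (GPU count, GPU type, and the readiness endpoint). The exact Tensorkube schema keys and the GPU values are assumptions; refer to the Tensorfuse docs for the authoritative format.

```yaml
# Sketch only: key names and values are assumptions, not the guide's exact config.
gpus: 8            # example GPU count, adjust to your chosen combination
gpu_type: l40s     # example GPU type
readiness:
  httpGet:
    path: /health  # llama-server serves its health check here
    port: 8080     # matches the port exposed in the Dockerfile
```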
Always remember to add the readiness endpoint in your config. Tensorfuse uses this endpoint to ensure that your service is healthy. llama-server exposes readiness by default on the /health endpoint. Also remember that we have set the port to 8080 in deployment.yaml, since llama-server runs on that port and we have exposed 8080 in the Dockerfile.
If no readiness endpoint is configured, Tensorfuse tries the /readiness path on port 80 by default, which can cause issues if your app is not listening on that path.

Step 4: Accessing the deployed app
Voila! Your autoscaling production llama.cpp service is ready. Once the deployment is successful, you can see the status of your app by running the command shown below. Remember to configure a TLS endpoint with a custom domain before going to production.
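Assuming the Tensorkube CLI is installed, the status check looks roughly like the following; the exact subcommand is an assumption, so consult `tensorkube --help` if it differs.

```bash
tensorkube deployment list
```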
To test the service, replace YOUR_APP_URL with the endpoint shown in the output of the above command and run:
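A request against llama-server's OpenAI-compatible chat completions endpoint might look like the sketch below; the prompt and max_tokens value are just placeholders.

```bash
curl -X POST YOUR_APP_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Why is the sky blue?"}
        ],
        "max_tokens": 128
      }'
```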
Since llama-server is compatible with the OpenAI API, you can use OpenAI’s client libraries as well. Here’s a sample snippet using Python:
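The snippet below is a minimal sketch of calling the deployed endpoint with the OpenAI Python client; the base_url placeholder, api_key value, and model label are assumptions.

```python
from openai import OpenAI

# Point the client at your deployment (replace YOUR_APP_URL with your endpoint).
client = OpenAI(
    base_url="YOUR_APP_URL/v1",
    api_key="sk-no-key-required",  # llama-server does not require a real key by default
)

response = client.chat.completions.create(
    model="deepseek-r1",  # arbitrary label for a single-model llama-server deployment
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=128,
)

print(response.choices[0].message.content)
```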