Although we are using L4s here as they are the cheapest option for deploying Qwen QwQ 32B, you can easily adapt these instructions to deploy on other GPUs such as A10G, L40S, or A100 simply by modifying the `gpu_type` option below.

Why Build with Qwen QwQ 32B?
Qwen QwQ 32B offers:

- Comparable Performance to Larger Models: Matches or beats DeepSeek-R1 on key benchmarks while using only ~5% of the parameters.
- Cost Efficiency: A lower parameter count means reduced computation costs without sacrificing quality.
- Strong Reasoning Capabilities: Excels at complex reasoning tasks, coding, and mathematical problems.
- Open-Source Access: Fully available for deployment on your own infrastructure.
| Benchmark | Qwen QwQ (32B) | DeepSeek-R1 (671B) | Remarks |
|---|---|---|---|
| AIME 2024 (Pass@1) | 79.5% | 79.8% | Mathematical and reasoning abilities |
| LiveCodeBench (Pass@1-COT) | 63.4% | 65.9% | Excels at multi-step reasoning |
Prerequisites
Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven't done that yet, follow the Getting Started guide.

Deploying Qwen QwQ 32B with Tensorfuse
Each Tensorkube deployment requires:

- Your code (in this example, the vLLM API server code from the Docker image).
- Your environment (as a Dockerfile).
- A deployment configuration (`deployment.yaml`).
In addition, we will store our API authentication token (`VLLM_API_KEY`) as a Tensorfuse secret. Unlike some other models, Qwen QwQ 32B does not require a separate Hugging Face token, so we can skip that step.
Step 1: Set your API authentication token
Generate a random string that will be used as your API authentication token, and store it as a secret in Tensorfuse using the command below. For the purpose of this demo, we will be using `vllm-key` as the API key.
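A minimal sketch of the secret-creation step is shown below. The exact `tensorkube secret create` syntax is an assumption based on typical Tensorkube CLI usage, so verify it against the Tensorfuse documentation for your version:

```bash
# Store the API key as a Tensorfuse secret (the name vllm-token is illustrative).
# The secret is later exposed to the container as the VLLM_API_KEY env var.
tensorkube secret create vllm-token VLLM_API_KEY=vllm-key
```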
If you prefer to generate your own key, you can use `openssl rand -base64 32`; remember to keep it safe, as Tensorfuse secrets are opaque.
Step 2: Prepare the Dockerfile
We will use the official vLLM OpenAI-compatible image as our base image. This image comes with all the necessary dependencies to run vLLM and is available on Docker Hub as `vllm/vllm-openai`.

Dockerfile
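The Dockerfile contents are not reproduced in this excerpt, so the following is a sketch assuming the `vllm/vllm-openai` base image; the model flags and parallelism values are illustrative and should be tuned to your GPU setup:

```dockerfile
# Sketch: serve Qwen/QwQ-32B with vLLM's OpenAI-compatible server.
# Values for --max-model-len and --tensor-parallel-size are assumptions.
FROM vllm/vllm-openai:latest

# vLLM reads the VLLM_API_KEY environment variable (injected from the
# Tensorfuse secret) and requires it as a Bearer token on every request.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
            "--model", "Qwen/QwQ-32B", \
            "--dtype", "bfloat16", \
            "--max-model-len", "8192", \
            "--tensor-parallel-size", "4", \
            "--port", "80"]
```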
Step 3: Deployment config
Although you can deploy Tensorfuse apps from the command line, it is recommended to keep a config file so that you can follow a GitOps approach to deployment.

deployment.yaml
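The configuration below is a sketch: the field names follow common Tensorkube conventions, but the exact schema is an assumption, so verify it against the Tensorfuse docs:

```yaml
# Sketch of a Tensorkube deployment config (field names are assumptions).
gpus: 4            # number of GPUs per replica
gpu_type: l4       # swap for a10g, l40s, or a100 as needed
secret:
  - vllm-token     # injects VLLM_API_KEY into the container
port: 80
readiness:
  httpGet:
    path: /health  # vLLM's built-in health endpoint
    port: 80
```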
Remember to configure a `readiness` endpoint in your config. Tensorfuse uses this endpoint to ensure that your service is healthy. If no `readiness` endpoint is configured, Tensorfuse tries the `/readiness` path on port 80 by default, which can cause issues if your app is not listening on that path.

Step 4: Accessing the deployed app
Voila! Your autoscaling production LLM service is ready. Only authenticated requests will be served by your endpoint. Once the deployment is successful, you can check the status of your app by running the command below. Remember to configure a TLS endpoint with a custom domain before going to production.
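A likely status command, assuming the standard Tensorkube CLI (the exact subcommand name may differ; check the Tensorfuse docs):

```bash
# List deployments and their current status.
tensorkube deployment list
```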
To test the endpoint, replace `YOUR_APP_URL` with the endpoint shown in the output of the above command and run the request sketched below:
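A sample request under the assumptions of this guide (the model name `Qwen/QwQ-32B` and the `vllm-key` token from Step 1):

```bash
# Illustrative completion request against the OpenAI-compatible endpoint.
curl --request POST \
  --url YOUR_APP_URL/v1/completions \
  --header "Content-Type: application/json" \
  --header "Authorization: Bearer vllm-key" \
  --data '{
    "model": "Qwen/QwQ-32B",
    "prompt": "How many r letters are in the word strawberry?",
    "max_tokens": 256
  }'
```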
Technical Specifications and Optimization
Memory vs Context Size
Context length has a roughly quadratic relationship with memory usage: doubling the context requires approximately four times more memory. For example:

- 2048 tokens of context: ~0.7 GB of memory
- 16384 tokens of context: ~68 GB of memory

In other words, the 8x jump in context here costs close to 100x the memory, so size your GPU memory for the longest context you actually expect to serve.
GPU Comparison: L4 vs L40S vs A10G
| Specification | L4 | L40S | A10G |
|---|---|---|---|
| VRAM | 24 GB | 48 GB | 24 GB |
| Performance Score | 13.44 | 42.25 | - |
| TFLOPS (FP32) | 30.29 | 91.6 | 31.2 |
| Power Consumption (TDP) | 72 W | 350 W | 150 W |
| Cost-Efficiency | High | Medium | Medium |
Optimal Configuration Settings
Based on official recommendations and community testing, here are the optimal parameters for Qwen QwQ 32B.

Recommended Inference Settings (applied in the sample request after this list):

- Temperature: 0.6 (0.7-0.75 for creative tasks)
- Top_p: 0.95
- Top_k: 40 (range 20-40)
- Min_p: 0.01 (optional, helps prevent language switching)
- Repetition_penalty: 1.0-1.1
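As a sketch, these settings can be passed directly in the JSON body of a request to the vLLM OpenAI-compatible server, which accepts extra sampling fields such as `top_k`, `min_p`, and `repetition_penalty` (the URL and key are the placeholders used earlier in this guide):

```bash
# Illustrative chat request applying the recommended sampling settings.
curl --request POST \
  --url YOUR_APP_URL/v1/chat/completions \
  --header "Content-Type: application/json" \
  --header "Authorization: Bearer vllm-key" \
  --data '{
    "model": "Qwen/QwQ-32B",
    "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 40,
    "min_p": 0.01,
    "repetition_penalty": 1.05
  }'
```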
Setting Up a Frontend with Open WebUI
For a user-friendly interface to interact with your deployed model, you can set up Open WebUI:

- Launch Open WebUI locally (a typical Docker invocation is sketched after this list)
- Open your browser and navigate to http://localhost:3000
- Set up your administrator account
- Configure the model by adding your Tensorfuse endpoint and API key
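A common way to run Open WebUI, following its own documentation (the `3000:8080` port mapping matches the URL above):

```bash
# Run Open WebUI in Docker; the UI becomes available at http://localhost:3000.
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```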
Troubleshooting Common Issues
Infinite Generations or Repetitions
If you experience infinite loops or repetitive output, try:

- Increasing repetition_penalty to 1.1
- Setting min_p to 0.01
- Using Q4_K_M quantization instead of Q6_K
Out of Memory Errors
If you encounter OOM errors:

- Reduce the context window size
- Decrease the batch size

Both can be adjusted via vLLM server flags, as sketched below.
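A sketch of the relevant vLLM flags (the values are illustrative, not tuned recommendations):

```bash
# Reduce memory pressure: cap the context window, the number of concurrent
# sequences, and the fraction of GPU memory vLLM reserves for weights + KV cache.
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/QwQ-32B \
  --max-model-len 4096 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.90
```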