Deploy Qwen QwQ 32B on Serverless GPUs
Deploy Qwen QwQ 32B using Tensorfuse
Qwen QwQ 32B is a groundbreaking large language model that delivers exceptional reasoning capabilities while being significantly smaller than other high-performance models. Released on March 5th, 2025, this model has shocked the AI community by matching or even outperforming DeepSeek-R1 (671B parameters) across several benchmarks despite being 20 times smaller. In this guide, we’ll walk through deploying Qwen QwQ 32B on L4 GPUs using Tensorfuse.
Although we use L4s here, as they are the cheapest option for deploying Qwen QwQ 32B, you can easily adapt these instructions to other GPUs such as the A10G, L40S, or A100 by modifying the gpu_type option below.
Why Build with Qwen QwQ 32B?
Qwen QwQ 32B offers:
- Comparable Performance to Larger Models: Matches or beats DeepSeek-R1 on key benchmarks while using only ~5% of the parameters.
- Cost Efficiency: Lower parameter count means reduced computation costs without sacrificing quality.
- Strong Reasoning Capabilities: Excels at complex reasoning tasks, coding, and mathematical problems.
- Open-Source Access: Fully available for deployment on your own infrastructure.
Below is a quick snapshot of benchmark scores for QwQ 32B:
| Benchmark | Qwen QwQ (32B) | DeepSeek-R1 (671B) | Remarks |
|---|---|---|---|
| AIME 2024 (Pass@1) | 79.5% | 79.8% | Mathematical and reasoning abilities |
| LiveCodeBench (Pass@1-CoT) | 63.4% | 65.9% | Excels at multi-step reasoning |
The combination of these strengths makes Qwen QwQ 32B an excellent choice for production-ready applications, from chatbots to enterprise-level data analytics.
Prerequisites
Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven’t done that yet, follow the Getting Started guide.
Deploying Qwen QwQ 32B with Tensorfuse
Each Tensorkube deployment requires:
- Your code (in this example, we use the vLLM API server code bundled in the Docker image).
- Your environment (as a Dockerfile).
- A deployment configuration (deployment.yaml).
We will also add token-based authentication to our service, compatible with OpenAI client libraries, and store the authentication token (VLLM_API_KEY) as a Tensorfuse secret. Unlike some other models, Qwen QwQ 32B does not require a separate Hugging Face token, so we can skip that step.
Step 1: Set your API authentication token
Generate a random string to use as your API authentication token and store it as a secret in Tensorfuse using the command below. For this demo, we will use vllm-key as the API key.
Ensure that in production you use a randomly generated token. You can quickly generate one with openssl rand -base64 32, and remember to keep it safe, as Tensorfuse secrets are opaque.
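A minimal sketch of both steps is shown below. The secret name vllm-token and the exact tensorkube secret create syntax are assumptions; check the Tensorfuse CLI reference for the current form.

```bash
# Generate a random token for production use (this demo uses "vllm-key" instead)
openssl rand -base64 32

# Store the token as a Tensorfuse secret exposing the VLLM_API_KEY environment variable
# (secret name "vllm-token" and the exact flag syntax are assumptions; see the Tensorfuse docs)
tensorkube secret create vllm-token VLLM_API_KEY=vllm-key
```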
Step 2: Prepare the Dockerfile
We will use the official vLLM OpenAI image as our base image. This image comes with all the necessary dependencies to run vLLM and is available on Docker Hub as vllm/vllm-openai.
We’ve configured the vLLM server with numerous CLI flags tailored to our specific use case. A comprehensive list of all other vLLM flags is available for further reference, and if you have questions about selecting flags for production, the Tensorfuse Community is an excellent place to seek guidance.
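Below is a sketch of what such a Dockerfile could look like. The flag values (tensor parallel size, context length, GPU memory utilization, port) are illustrative assumptions to tune for your hardware, and it assumes vLLM reads the API key from the VLLM_API_KEY environment variable injected from the secret created in Step 1.

```dockerfile
# Start from the official vLLM OpenAI-compatible server image
FROM vllm/vllm-openai:latest

# Launch the OpenAI-compatible API server for Qwen QwQ 32B.
# Flag values below are illustrative; tune them for your GPU setup.
# The API key is picked up from the VLLM_API_KEY environment variable
# injected by the Tensorfuse secret created in Step 1.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
            "--model", "Qwen/QwQ-32B", \
            "--tensor-parallel-size", "4", \
            "--max-model-len", "8192", \
            "--gpu-memory-utilization", "0.9", \
            "--port", "80"]
```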
Step 3: Deployment config
Although you can deploy Tensorfuse apps from the command line, we recommend using a config file so that you can follow a GitOps approach to deployment.
Don’t forget the readiness endpoint in your config. Tensorfuse uses this endpoint to verify that your service is healthy. If no readiness endpoint is configured, Tensorfuse probes the /readiness path on port 80 by default, which can cause issues if your app is not listening on that path.
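A sketch of such a deployment.yaml is shown below. The field names (gpus, gpu_type, secret, min_scale, max_scale, readiness) follow the options discussed in this guide, but the exact schema is an assumption, so verify it against the Tensorfuse configuration reference.

```yaml
# deployment.yaml -- illustrative sketch; verify field names against the Tensorfuse docs
gpus: 4
gpu_type: l4
secret:
  - vllm-token
min_scale: 1
max_scale: 3
readiness:
  httpGet:
    path: /health   # health endpoint exposed by the vLLM OpenAI server
    port: 80
```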
Now you can deploy your service using the following command:
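For example (the exact tensorkube CLI flags are an assumption; see the Tensorfuse docs):

```bash
# Deploy the service described by deployment.yaml
tensorkube deploy --config-file ./deployment.yaml
```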
Step 4: Accessing the deployed app
Voila! Your autoscaling production LLM service is ready. Only authenticated requests will be served by your endpoint.
Once the deployment is successful, you can see the status of your app by running:
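For example (the command name is an assumption based on the tensorkube CLI; see the Tensorfuse docs):

```bash
# List deployments along with their status and endpoint URLs
tensorkube deployment list
```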
And that’s it! You have successfully deployed one of the strongest open-source reasoning models available.
Remember to configure a TLS endpoint with a custom domain before going to production.
To test it out, replace YOUR_APP_URL with the endpoint shown in the output of the above command and run:
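A sketch of such a request (YOUR_APP_URL is a placeholder and vllm-key is the demo token from Step 1):

```bash
curl --request POST \
  --url YOUR_APP_URL/v1/chat/completions \
  --header "Content-Type: application/json" \
  --header "Authorization: Bearer vllm-key" \
  --data '{
    "model": "Qwen/QwQ-32B",
    "messages": [
      {"role": "user", "content": "Explain the importance of reasoning models in one paragraph."}
    ],
    "max_tokens": 512
  }'
```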
Because vLLM is compatible with the OpenAI API, you can use OpenAI’s client libraries as well. Here’s a sample snippet using Python:
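This is a minimal sketch with the official openai Python package; YOUR_APP_URL is a placeholder and vllm-key is the demo token from Step 1.

```python
from openai import OpenAI

# Point the client at your Tensorfuse endpoint instead of api.openai.com
client = OpenAI(
    base_url="YOUR_APP_URL/v1",
    api_key="vllm-key",
)

response = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[
        {"role": "user", "content": "Explain the importance of reasoning models in one paragraph."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```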
Technical Specifications and Optimization
Memory vs Context Size
Memory usage grows rapidly, roughly quadratically, with context length: doubling the context can require around four times as much memory. For example:
- 2048 tokens of context: ~0.7 GB of memory
- 16384 tokens of context: ~68 GB of memory
This relationship is crucial when choosing your deployment configuration. For L4 GPUs with 24GB VRAM each, a 4-GPU setup offers up to 96GB total VRAM, allowing for contexts of 32-64K tokens depending on quantization.
GPU Comparison: L4 vs L40S vs A10G
| Specification | L4 | L40S | A10G |
|---|---|---|---|
| VRAM | 24 GB | 48 GB | 24 GB |
| Performance Score | 13.44 | 42.25 | - |
| TFLOPS (FP32) | 30.29 | 91.6 | 31.2 |
| Power Consumption | 72 W | 350 W | 150 W |
| Cost-Efficiency | High | Medium | Medium |
L40S offers 214% higher performance than L4, but at significantly higher power consumption. For Qwen QwQ 32B, 4×L4 GPUs offer an excellent balance of cost and performance.
Optimal Configuration Settings
Based on official recommendations and community testing, here are the optimal parameters for Qwen QwQ 32B:
Recommended Inference Settings:
- Temperature: 0.6 (0.7-0.75 for creative tasks)
- Top_p: 0.95
- Top_k: 40 (range 20-40)
- Min_p: 0.01 (optional, helps prevent language switching)
- Repetition_penalty: 1.0-1.1
System Prompt: For best results, use a concise system prompt like: “You are a helpful assistant developed by Qwen. You should think step-by-step.”
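To illustrate, here is a hedged sketch of passing these settings through the OpenAI client. temperature and top_p are standard parameters, while top_k, min_p, and repetition_penalty are vLLM-specific sampling parameters passed via extra_body; support for them can vary by vLLM version.

```python
from openai import OpenAI

client = OpenAI(base_url="YOUR_APP_URL/v1", api_key="vllm-key")

response = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant developed by Qwen. You should think step-by-step."},
        {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
    ],
    temperature=0.6,
    top_p=0.95,
    # vLLM-specific sampling parameters; availability may vary by vLLM version.
    extra_body={
        "top_k": 40,
        "min_p": 0.01,
        "repetition_penalty": 1.05,
    },
)
print(response.choices[0].message.content)
```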
Setting Up a Frontend with Open WebUI
For a user-friendly interface to interact with your deployed model, you can set up Open WebUI:
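For example, here is a sketch of running Open WebUI with Docker and pointing it at your deployment. The OPENAI_API_BASE_URL and OPENAI_API_KEY environment variables follow Open WebUI’s settings for OpenAI-compatible backends; verify them against the current Open WebUI documentation.

```bash
# Run Open WebUI locally, connected to the deployed vLLM endpoint
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=YOUR_APP_URL/v1 \
  -e OPENAI_API_KEY=vllm-key \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```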
After running this command:
- Open your browser and navigate to http://localhost:3000
- Set up your administrator account
- Configure the model by adding your Tensorfuse endpoint and API key
Troubleshooting Common Issues
Infinite Generations or Repetitions
If you experience infinite loops or repetitive output, try:
- Increasing repetition_penalty to 1.1
- Setting min_p to 0.01
- Using Q4_K_M quantization instead of Q6_K
Out of Memory Errors
If you encounter out-of-memory (OOM) errors, try the following (see the example flags after this list):
- Reduce context window size
- Decrease the batch size
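Both adjustments map to vLLM server flags; here is a sketch with illustrative values to tune for your GPU memory budget.

```bash
# Smaller context window and fewer concurrent sequences reduce KV-cache memory
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/QwQ-32B \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.9
```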
Click here to get started with Tensorfuse.