Qwen QwQ 32B is a groundbreaking large language model that delivers exceptional reasoning capabilities while being significantly smaller
than other high-performance models. Released on March 5th, 2025, this model has shocked the AI community by matching or even outperforming
DeepSeek-R1 (671B parameters) across several benchmarks despite being 20 times smaller. In this guide, we’ll walk through deploying
Qwen QwQ 32B on L4 GPUs using Tensorfuse.
We use L4s here because they are the cheapest option for deploying Qwen QwQ 32B, but you can easily adapt these instructions to other GPUs
such as A10G, L40S, or A100 simply by changing the gpu_type option in the deployment config below.
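For example, an illustrative variation of the deployment config shown later in this guide swaps only the GPU type. The exact gpu_type strings accepted by Tensorfuse (such as l40s below) are an assumption here, so check the supported names before using them:

# Illustrative only: run the same deployment on L40S GPUs instead of L4s
gpus: 4
gpu_type: l40s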
- Comparable Performance to Larger Models: Matches or beats DeepSeek-R1 on key benchmarks while using only ~5% of the parameters.
- Cost Efficiency: Lower parameter count means reduced computation costs without sacrificing quality.
- Strong Reasoning Capabilities: Excels at complex reasoning tasks, coding, and mathematical problems.
- Open-Source Access: Fully available for deployment on your own infrastructure.
Below is a quick snapshot of benchmark scores for QwQ 32B:
| Benchmark | Qwen QwQ (32B) | DeepSeek-R1 (671B) | Remarks |
| --- | --- | --- | --- |
| AIME 2024 (Pass@1) | 79.5% | 79.8% | Mathematical and reasoning abilities |
| LiveCodeBench (Pass@1-COT) | 63.4% | 65.9% | Excels at multi-step reasoning |
The combination of these strengths makes Qwen QwQ 32B an excellent choice for production-ready applications, from chatbots to enterprise-level data analytics.
Each Tensorfuse deployment requires three things:
- Your code (in this example, the vLLM API server code bundled in the Docker image).
- Your environment (as a Dockerfile).
- A deployment configuration (deployment.yaml).
We will also add token-based authentication to our service, compatible with OpenAI client libraries. We will store the authentication token (VLLM_API_KEY) as a Tensorfuse secret. Unlike some other models, Qwen QwQ 32B does not require a separate Hugging Face token, so we can skip that step.
Generate a random string to serve as your API authentication token and store it as a secret in Tensorfuse using the command below. For the purpose of this demo, we will use vllm-key as the API key.
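For reference, a sketch of the secret-creation step, assuming the tensorkube CLI accepts KEY=VALUE pairs and that the secret name vllm-token matches the one referenced in deployment.yaml below (confirm the exact syntax with tensorkube secret create --help):

# Store the API key as the Tensorfuse secret referenced by the deployment config
tensorkube secret create vllm-token VLLM_API_KEY=vllm-key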
Ensure that in production you use a randomly generated token. You can quickly generate one
using openssl rand -base64 32, and remember to keep it safe, as Tensorfuse secrets are opaque.
We will use the official vLLM OpenAI-compatible server image as our base image. It ships with all the dependencies
needed to run vLLM and is available on Docker Hub as vllm/vllm-openai; the Dockerfile below uses Tensorfuse's patched build of that image.
Dockerfile
# Dockerfile for Qwen QwQ 32B
FROM tensorfuse/vllm-openai:v0.8.4-patched

# Enable HF Hub Transfer for faster downloads
ENV HF_HUB_ENABLE_HF_TRANSFER 1

# Configure PyTorch memory allocation to avoid fragmentation
ENV PYTORCH_CUDA_ALLOC_CONF expandable_segments:True

# Expose port 80
EXPOSE 80

# Entrypoint with API key
ENTRYPOINT python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/QwQ-32B \
    --dtype bfloat16 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --port 80 \
    --gpu-memory-utilization 0.95 \
    --api-key ${VLLM_API_KEY}

# Qwen QwQ-32B model configuration:
# - Qwen/QwQ-32B model with bfloat16 dtype
# - Tensor parallelism across 4 GPUs
# - Max 8192 tokens to avoid OOM errors
# - 95% GPU memory utilization
# - Server runs on port 80
# - API key read from an environment variable for authentication
We’ve configured the vLLM server with several CLI flags tailored to our use case. A comprehensive list of the
other vLLM flags is available in the vLLM documentation for further reference, and if you have questions about selecting flags for production, the Tensorfuse Community is an excellent place to seek guidance.
Although you can deploy Tensorfuse apps from the command line, it is recommended to use a config file so
that you can follow a GitOps approach to deployment.
deployment.yaml
# deployment.yaml for Qwen QwQ 32B on L4 GPUs
gpus: 4
gpu_type: l4
secret:
  - vllm-token
min_scale: 1
readiness:
  httpGet:
    path: /health
    port: 80
Don’t forget the readiness endpoint in your config. Tensorfuse uses this endpoint to check that your service is healthy.
If no readiness endpoint is configured, Tensorfuse falls back to probing the /readiness path on port 80 by default, which can cause issues if your app is not listening on that path.
Now you can deploy your service using the following command:
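A minimal sketch of the deploy step, assuming the standard tensorkube workflow (confirm the exact flag name with tensorkube deploy --help):

# Deploy the service described by deployment.yaml
tensorkube deploy --config-file ./deployment.yaml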
Voila! Your autoscaling production LLM service is ready, and only authenticated requests will be served by your endpoint. Once the deployment is successful, you can check the status of your app by running:
tensorkube deployment list
And that’s it! You have successfully deployed the world’s strongest open-source reasoning model.
Remember to configure a TLS endpoint with a custom domain before going to production.
To test it out, replace YOUR_APP_URL with the endpoint shown in the output of the above command and run:
curl --request POST \
  --url YOUR_APP_URL/v1/completions \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer vllm-key' \
  --data '{
    "model": "Qwen/QwQ-32B",
    "prompt": "Explain the relationship between quantum mechanics and general relativity.",
    "max_tokens": 200,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 40
  }'
Because vLLM is compatible with the OpenAI API, you can use OpenAI’s client libraries
as well. Here’s a sample snippet using Python:
import openai

# Replace with your actual URL and token
base_url = "YOUR_APP_URL/v1"
api_key = "vllm-key"

openai.api_base = base_url
openai.api_key = api_key

response = openai.Completion.create(
    model="Qwen/QwQ-32B",
    prompt="Hello, Qwen! How are you today?",
    max_tokens=200
)
print(response)
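The snippet above uses the legacy module-level interface from openai versions prior to 1.0. If you are on a newer client (openai>=1.0), an equivalent call, sketched here under the assumption that your endpoint and key are unchanged, looks like this:

from openai import OpenAI

# Point the client at your Tensorfuse endpoint; replace YOUR_APP_URL and the key with your own
client = OpenAI(base_url="YOUR_APP_URL/v1", api_key="vllm-key")

response = client.completions.create(
    model="Qwen/QwQ-32B",
    prompt="Hello, Qwen! How are you today?",
    max_tokens=200,
)
print(response.choices[0].text)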
Context length has a quadratic relationship with memory usage: doubling the context requires approximately 4 times more memory. For example:
- 2048 tokens of context: ~0.7 GB of memory
- 16384 tokens of context: ~68 GB of memory
This relationship is crucial when choosing your deployment configuration. For L4 GPUs with 24 GB of VRAM each, a 4-GPU setup offers up to 96 GB of total
VRAM, allowing contexts of 32-64K tokens depending on quantization.
The L40S offers 214% higher performance than the L4, but at significantly higher power consumption. For Qwen QwQ 32B,
4×L4 GPUs offer an excellent balance of cost and performance.