Qwen QwQ 32B is a groundbreaking large language model that delivers exceptional reasoning capabilities while being significantly smaller than other high-performance models. Released on March 5th, 2025, this model has shocked the AI community by matching or even outperforming DeepSeek-R1 (671B parameters) across several benchmarks despite being 20 times smaller. In this guide, we’ll walk through deploying Qwen QwQ 32B on L4 GPUs using Tensorfuse.

We use L4s here because they are the cheapest option for deploying Qwen QwQ 32B, but you can easily adapt these instructions to other GPUs such as the A10G, L40S, or A100 by changing the gpu_type option below.

Why Build with Qwen QwQ 32B?

Qwen QwQ 32B offers:

  • Comparable Performance to Larger Models: Matches or beats DeepSeek-R1 on key benchmarks while using only ~5% of the parameters.
  • Cost Efficiency: Lower parameter count means reduced computation costs without sacrificing quality.
  • Strong Reasoning Capabilities: Excels at complex reasoning tasks, coding, and mathematical problems.
  • Open-Source Access: Fully available for deployment on your own infrastructure.

Below is a quick snapshot of benchmark scores for QwQ 32B:

| Benchmark | Qwen QwQ (32B) | DeepSeek-R1 (671B) | Remarks |
| --- | --- | --- | --- |
| AIME 2024 (Pass@1) | 79.5% | 79.8% | Mathematical and reasoning abilities |
| LiveCodeBench (Pass@1-COT) | 63.4% | 65.9% | Excels at multi-step reasoning |

The combination of these strengths makes Qwen QwQ 32B an excellent choice for production-ready applications, from chatbots to enterprise-level data analytics.


Prerequisites

Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven’t done that yet, follow the Getting Started guide.

Deploying Qwen QwQ 32B with Tensorfuse

Each Tensorkube deployment requires:

  1. Your code (in this example, vLLM API server code is used from the Docker image).
  2. Your environment (as a Dockerfile).
  3. A deployment configuration (deployment.yaml).

We will also add token-based authentication to our service, compatible with OpenAI client libraries. We will store the authentication token (VLLM_API_KEY) as a Tensorfuse secret. Unlike some other models, Qwen QwQ 32B does not require a separate Hugging Face token, so we can skip that step.

Step 1: Set your API authentication token

Generate a random string to use as your API authentication token and store it as a secret in Tensorfuse using the command below. For the purposes of this demo, we will use vllm-key as the API key.

tensorkube secret create vllm-token VLLM_API_KEY=vllm-key --env default

Ensure that in production you use a randomly generated token. You can quickly generate one with openssl rand -base64 32; keep a copy somewhere safe, as Tensorfuse secrets are opaque.
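
For production, you can generate and store the token in one step. A minimal sketch combining the two commands above (the secret name vllm-token and the VLLM_API_KEY variable match those used throughout this guide):

# Generate a random token and store it as a Tensorfuse secret
export VLLM_API_KEY=$(openssl rand -base64 32)
tensorkube secret create vllm-token VLLM_API_KEY=$VLLM_API_KEY --env default
# Keep a copy for your clients, since the secret cannot be read back later
echo "API token: $VLLM_API_KEY"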

Step 2: Prepare the Dockerfile

We will use the official vLLM OpenAI image as our base image. This image comes with all the necessary dependencies to run vLLM and is available on Docker Hub as vllm/vllm-openai.

Dockerfile

# Dockerfile for Qwen QwQ 32B

FROM vllm/vllm-openai:latest

# Enable HF Hub Transfer for faster downloads
ENV HF_HUB_ENABLE_HF_TRANSFER=1

# Expose port 80
EXPOSE 80

# Entrypoint with API key
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
            # name of the model
           "--model", "Qwen/QwQ-32B", \
           # set the data type to bfloat16 - the 32B weights alone need ~65GB of GPU memory
           "--dtype", "bfloat16", \
           "--trust-remote-code", \
           # below runs the model on 4 GPUs
           "--tensor-parallel-size","4", \
           # Maximum number of tokens, can lead to OOM if overestimated
           "--max-model-len", "8192", \
           # Port on which to run the vLLM server
           "--port", "80", \
           # CPU offload in GB. Offloads part of the model to CPU RAM so the 4 L4s (96GB total) keep room for the KV cache
           "--cpu-offload-gb", "80", \
           "--gpu-memory-utilization", "0.95", \
           # API key for authentication to the server stored in Tensorfuse secrets
           "--api-key", "${VLLM_API_KEY}"]

We’ve configured the vLLM server with numerous CLI flags tailored to our specific use case. A comprehensive list of all other vLLM flags is available for further reference, and if you have questions about selecting flags for production, the Tensorfuse Community is an excellent place to seek guidance.
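
If you want to see every flag supported by the exact vLLM version baked into the image, you can print the server's help text from the same base image. A quick local check (assuming Docker is installed; the entrypoint module is the same one used in the Dockerfile above):

# List all CLI flags supported by the vLLM OpenAI-compatible server
docker run --rm --entrypoint python3 vllm/vllm-openai:latest \
  -m vllm.entrypoints.openai.api_server --help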

Step 3: Deployment config

Although you can deploy Tensorfuse apps from the command line, we recommend using a config file so that you can follow a GitOps approach to deployment.

deployment.yaml
# deployment.yaml for Qwen QwQ 32B on L4 GPUs

gpus: 4
gpu_type: l4
secret:
  - vllm-token
min-scale: 1
readiness:
  httpGet:
    path: /health
    port: 80

Don’t forget the readiness endpoint in your config. Tensorfuse uses this endpoint to ensure that your service is healthy.

If no readiness endpoint is configured, Tensorfuse defaults to the /readiness path on port 80, which can cause issues if your app is not listening on that path.
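
The vLLM OpenAI server exposes a /health route, which is why the config above points the readiness probe there. Once your service is deployed (see Step 4), you can hit the same route yourself to confirm the server is up; a sketch, with YOUR_APP_URL standing in for your actual endpoint:

# Returns HTTP 200 once the model has finished loading
curl -i YOUR_APP_URL/health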

Now you can deploy your service using the following command:

tensorkube deploy --config-file ./deployment.yaml

Step 4: Accessing the deployed app

Voila! Your autoscaling production LLM service is ready. Only authenticated requests will be served by your endpoint.

Once the deployment is successful, you can see the status of your app by running:

tensorkube deployment list

And that’s it! You have successfully deployed one of the strongest open-source reasoning models available today.

Remember to configure a TLS endpoint with a custom domain before going to production.

To test it out, replace YOUR_APP_URL with the endpoint shown in the output of the above command and run:

curl --request POST \
  --url YOUR_APP_URL/v1/completions \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer vllm-key' \
  --data '{
    "model": "Qwen/QwQ-32B",
    "prompt": "Explain the relationship between quantum mechanics and general relativity.",
    "max_tokens": 200,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 40
}'
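
Since QwQ 32B is a chat-tuned reasoning model, you will usually want the chat completions endpoint, which applies the model’s chat template for you. A sketch using the same endpoint and API key as above:

curl --request POST \
  --url YOUR_APP_URL/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer vllm-key' \
  --data '{
    "model": "Qwen/QwQ-32B",
    "messages": [
      {"role": "user", "content": "Which is larger, 9.9 or 9.11? Think step by step."}
    ],
    "max_tokens": 512,
    "temperature": 0.6,
    "top_p": 0.95
}'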

Because vLLM is compatible with the OpenAI API, you can use OpenAI’s client libraries as well. Here’s a sample snippet using Python (openai >= 1.0):

from openai import OpenAI

# Replace with your actual URL and token
client = OpenAI(
    base_url="YOUR_APP_URL/v1",
    api_key="vllm-key",
)

response = client.completions.create(
    model="Qwen/QwQ-32B",
    prompt="Hello, Qwen! How are you today?",
    max_tokens=200,
)

print(response.choices[0].text)

Technical Specifications and Optimization

Memory vs Context Size

Attention compute grows quadratically with context length, but with vLLM's paged attention the context-dependent memory cost is dominated by the KV cache, which grows linearly: roughly 0.25 MB per token for QwQ 32B in bfloat16 (64 layers, 8 KV heads, head dimension 128). For example:

  • 2048 tokens of context: ~0.5 GB of KV cache per sequence
  • 16384 tokens of context: ~4 GB of KV cache per sequence

These figures come on top of the ~65 GB of bf16 weights, so the context window you allow (--max-model-len) directly determines how many concurrent sequences fit in memory. For L4 GPUs with 24GB VRAM each, a 4-GPU setup offers 96GB of total VRAM; depending on quantization and how much of the model you offload to CPU, this can support contexts of 32-64K tokens.
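
As a sanity check, here is a small back-of-the-envelope calculation of the KV cache size. The layer count, KV head count, and head dimension are assumptions based on the published Qwen2.5-32B architecture that QwQ 32B derives from; substitute the values from your model's config if they differ:

# Rough KV-cache sizing for QwQ 32B in bf16 (architecture values are assumptions)
num_layers = 64        # transformer layers
num_kv_heads = 8       # KV heads (grouped-query attention)
head_dim = 128         # dimension per head
bytes_per_value = 2    # bfloat16

# 2x accounts for storing both keys and values
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"KV cache per token: {kv_bytes_per_token / 1024**2:.2f} MiB")

for context in (2048, 8192, 16384):
    gib = context * kv_bytes_per_token / 1024**3
    print(f"{context:>6} tokens -> ~{gib:.1f} GiB per sequence")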

GPU Comparison: L4 vs L40S vs A10G

| Specification | L4 | L40S | A10G |
| --- | --- | --- | --- |
| VRAM | 24 GB | 48 GB | 24 GB |
| Performance Score | 13.44 | 42.25 | - |
| TFLOPS (FP32) | 30.29 | 91.6 | 31.2 |
| Power Consumption | 72W | 350W | 150W |
| Cost-Efficiency | High | Medium | Medium |

L40S offers 214% higher performance than L4, but at significantly higher power consumption. For Qwen QwQ 32B, 4×L4 GPUs offer an excellent balance of cost and performance.

Optimal Configuration Settings

Based on official recommendations and community testing, here are the optimal parameters for Qwen QwQ 32B:

Recommended Inference Settings:

  • Temperature: 0.6 (0.7-0.75 for creative tasks)
  • Top_p: 0.95
  • Top_k: 40 (range 20-40)
  • Min_p: 0.01 (optional, helps prevent language switching)
  • Repetition_penalty: 1.0-1.1

System Prompt: For best results, use a concise system prompt like: “You are a helpful assistant developed by Qwen. You should think step-by-step.”
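
To apply these settings through the OpenAI client, pass the standard parameters directly and the vLLM-specific ones (top_k, min_p, repetition_penalty) via extra_body. A sketch reusing the client setup from Step 4 and the system prompt recommended above:

from openai import OpenAI

client = OpenAI(base_url="YOUR_APP_URL/v1", api_key="vllm-key")

response = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant developed by Qwen. You should think step-by-step."},
        {"role": "user", "content": "Which is larger, 9.9 or 9.11?"},
    ],
    max_tokens=1024,
    temperature=0.6,
    top_p=0.95,
    # vLLM accepts extra sampling parameters via extra_body
    extra_body={"top_k": 40, "min_p": 0.01, "repetition_penalty": 1.05},
)

print(response.choices[0].message.content)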

Setting Up a Frontend with Open WebUI

For a user-friendly interface to interact with your deployed model, you can set up Open WebUI:

docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URLS="<YOUR_TENSORFUSE_URL>/v1" \
  -e OPENAI_API_KEYS="EMPTY" \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

After running this command:

  1. Open your browser and navigate to http://localhost:3000
  2. Set up your administrator account
  3. Configure the model by adding your Tensorfuse endpoint and API key

Troubleshooting Common Issues

Infinite Generations or Repetitions

If you experience infinite loops or repetitive output, try:

  • Increasing repetition_penalty to 1.1
  • Setting min_p to 0.01
  • Using Q4_K_M quantization instead of Q6_K (applies to GGUF builds of the model)

Out of Memory Errors

If encountering OOM errors:

  • Reduce the context window size (lower --max-model-len)
  • Decrease the batch size (lower vLLM's --max-num-seqs); see the sketch below
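
For example, a lower-memory variant of the server flags from Step 2, written here as a plain command for readability (a sketch; the exact values depend on your traffic and GPU setup):

# Smaller context window, a cap on concurrent sequences, and more GPU memory headroom
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/QwQ-32B \
    --tensor-parallel-size 4 \
    --max-model-len 4096 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.90 \
    --cpu-offload-gb 80 \
    --port 80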

Click here to get started with Tensorfuse.