Deploy Qwen QwQ 32B on Serverless GPUs
Deploy Qwen QwQ 32B using Tensorfuse
Qwen QwQ 32B is a groundbreaking large language model that delivers exceptional reasoning capabilities while being significantly smaller than other high-performance models. Released on March 5th, 2025, this model has shocked the AI community by matching or even outperforming DeepSeek-R1 (671B parameters) across several benchmarks despite being 20 times smaller. In this guide, we’ll walk through deploying Qwen QwQ 32B on L4 GPUs using Tensorfuse.
Although we use L4s here, as they are the cheapest option for deploying Qwen QwQ 32B, you can easily adapt these instructions to other GPUs such as the A10G, L40S, or A100 by modifying the gpu_type option below.
Why Build with Qwen QwQ 32B?
Qwen QwQ 32B offers:
- Comparable Performance to Larger Models: Matches or beats DeepSeek-R1 on key benchmarks while using only ~5% of the parameters.
- Cost Efficiency: Lower parameter count means reduced computation costs without sacrificing quality.
- Strong Reasoning Capabilities: Excels at complex reasoning tasks, coding, and mathematical problems.
- Open-Source Access: Fully available for deployment on your own infrastructure.
Below is a quick snapshot of benchmark scores for QwQ 32B:
| Benchmark | Qwen QwQ (32B) | DeepSeek-R1 (671B) | Remarks |
|---|---|---|---|
| AIME 2024 (Pass@1) | 79.5% | 79.8% | Mathematical and reasoning abilities |
| LiveCodeBench (Pass@1-CoT) | 63.4% | 65.9% | Excels at multi-step reasoning |
The combination of these strengths makes Qwen QwQ 32B an excellent choice for production-ready applications, from chatbots to enterprise-level data analytics.
Prerequisites
Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven’t done that yet, follow the Getting Started guide.
Deploying Qwen QwQ 32B with Tensorfuse
Each Tensorkube deployment requires:
- Your code (in this example, we use the vLLM API server code bundled in the Docker image).
- Your environment (as a Dockerfile).
- A deployment configuration (deployment.yaml).
We will also add token-based authentication to our service, compatible with OpenAI client libraries, and store the authentication token (VLLM_API_KEY) as a Tensorfuse secret. Unlike some other models, Qwen QwQ 32B does not require a separate Hugging Face token, so we can skip that step.
Step 1: Set your API authentication token
Generate a random string to use as your API authentication token and store it as a secret in Tensorfuse using the command below. For this demo, we will use vllm-key as the API key.
Ensure that in production you use a randomly generated token. You can quickly generate one with openssl rand -base64 32, and remember to keep it safe, as Tensorfuse secrets are opaque.
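A minimal sketch of both steps is shown below. The secret name vllm-token and the exact tensorkube secret create syntax are assumptions; check the Tensorfuse CLI reference for the current form.

```bash
# Generate a random token for production use (this demo uses "vllm-key" instead)
openssl rand -base64 32

# Store the token as a Tensorfuse secret exposing the VLLM_API_KEY environment variable
# (secret name "vllm-token" and the exact flag syntax are assumptions; see the Tensorfuse docs)
tensorkube secret create vllm-token VLLM_API_KEY=vllm-key
```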
Step 2: Prepare the Dockerfile
We will use the official vLLM OpenAI image as our base image. This image comes with all the necessary dependencies to run vLLM and is available on Docker Hub as vllm/vllm-openai.
We’ve configured the vLLM server with numerous CLI flags tailored to our specific use case. A comprehensive list of all other vLLM flags is available for further reference, and if you have questions about selecting flags for production, the Tensorfuse Community is an excellent place to seek guidance.
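Below is a sketch of what such a Dockerfile could look like. The flag values (tensor parallel size, context length, GPU memory utilization, port) are illustrative assumptions to tune for your hardware, and it assumes vLLM reads the API key from the VLLM_API_KEY environment variable injected from the secret created in Step 1.

```dockerfile
# Start from the official vLLM OpenAI-compatible server image
FROM vllm/vllm-openai:latest

# Launch the OpenAI-compatible API server for Qwen QwQ 32B.
# Flag values below are illustrative; tune them for your GPU setup.
# The API key is picked up from the VLLM_API_KEY environment variable
# injected by the Tensorfuse secret created in Step 1.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", \
            "--model", "Qwen/QwQ-32B", \
            "--tensor-parallel-size", "4", \
            "--max-model-len", "8192", \
            "--gpu-memory-utilization", "0.9", \
            "--port", "80"]
```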
Step 3: Deployment config
Although you can deploy Tensorfuse apps from the command line, we recommend using a config file so that you can follow a GitOps approach to deployment.
Don’t forget the readiness endpoint in your config. Tensorfuse uses this endpoint to verify that your service is healthy. If no readiness endpoint is configured, Tensorfuse probes the /readiness path on port 80 by default, which can cause issues if your app is not listening on that path.
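A sketch of such a deployment.yaml is shown below. The field names (gpus, gpu_type, secret, min_scale, max_scale, readiness) follow the options discussed in this guide, but the exact schema is an assumption, so verify it against the Tensorfuse configuration reference.

```yaml
# deployment.yaml -- illustrative sketch; verify field names against the Tensorfuse docs
gpus: 4
gpu_type: l4
secret:
  - vllm-token
min_scale: 1
max_scale: 3
readiness:
  httpGet:
    path: /health   # health endpoint exposed by the vLLM OpenAI server
    port: 80
```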
Now you can deploy your service using the following command:
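For example (the exact tensorkube CLI flags are an assumption; see the Tensorfuse docs):

```bash
# Deploy the service described by deployment.yaml
tensorkube deploy --config-file ./deployment.yaml
```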
Step 4: Accessing the deployed app
Voila! Your autoscaling production LLM service is ready. Only authenticated requests will be served by your endpoint.
Once the deployment is successful, you can see the status of your app by running:
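For example (the command name is an assumption based on the tensorkube CLI; see the Tensorfuse docs):

```bash
# List deployments along with their status and endpoint URLs
tensorkube deployment list
```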
And that’s it! You have successfully deployed one of the strongest open-source reasoning models available.
Remember to configure a TLS endpoint with a custom domain before going to production.
To test it out, replace YOUR_APP_URL with the endpoint shown in the output of the above command and run:
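A sketch of such a request (YOUR_APP_URL is a placeholder and vllm-key is the demo token from Step 1):

```bash
curl --request POST \
  --url YOUR_APP_URL/v1/chat/completions \
  --header "Content-Type: application/json" \
  --header "Authorization: Bearer vllm-key" \
  --data '{
    "model": "Qwen/QwQ-32B",
    "messages": [
      {"role": "user", "content": "Explain the importance of reasoning models in one paragraph."}
    ],
    "max_tokens": 512
  }'
```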
Because vLLM is compatible with the OpenAI API, you can use OpenAI’s client libraries as well. Here’s a sample snippet using Python:
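This is a minimal sketch with the official openai Python package; YOUR_APP_URL is a placeholder and vllm-key is the demo token from Step 1.

```python
from openai import OpenAI

# Point the client at your Tensorfuse endpoint instead of api.openai.com
client = OpenAI(
    base_url="YOUR_APP_URL/v1",
    api_key="vllm-key",
)

response = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[
        {"role": "user", "content": "Explain the importance of reasoning models in one paragraph."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```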
Technical Specifications and Optimization
Memory vs Context Size
Memory usage grows rapidly, roughly quadratically, with context length: doubling the context can require around four times as much memory. For example:
- 2048 tokens of context: ~0.7 GB of memory
- 16384 tokens of context: ~68 GB of memory
This relationship is crucial when choosing your deployment configuration. For L4 GPUs with 24GB VRAM each, a 4-GPU setup offers up to 96GB total VRAM, allowing for contexts of 32-64K tokens depending on quantization.
GPU Comparison: L4 vs L40S vs A10G
| Specification | L4 | L40S | A10G |
|---|---|---|---|
| VRAM | 24 GB | 48 GB | 24 GB |
| Performance Score | 13.44 | 42.25 | - |
| TFLOPS (FP32) | 30.29 | 91.6 | 31.2 |
| Power Consumption | 72 W | 350 W | 150 W |
| Cost-Efficiency | High | Medium | Medium |
L40S offers 214% higher performance than L4, but at significantly higher power consumption. For Qwen QwQ 32B, 4×L4 GPUs offer an excellent balance of cost and performance.
Optimal Configuration Settings
Based on official recommendations and community testing, here are the optimal parameters for Qwen QwQ 32B:
Recommended Inference Settings:
- Temperature: 0.6 (0.7-0.75 for creative tasks)
- Top_p: 0.95
- Top_k: 40 (range 20-40)
- Min_p: 0.01 (optional, helps prevent language switching)
- Repetition_penalty: 1.0-1.1
System Prompt: For best results, use a concise system prompt like: “You are a helpful assistant developed by Qwen. You should think step-by-step.”
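To illustrate, here is a hedged sketch of passing these settings through the OpenAI client. temperature and top_p are standard parameters, while top_k, min_p, and repetition_penalty are vLLM-specific sampling parameters passed via extra_body; support for them can vary by vLLM version.

```python
from openai import OpenAI

client = OpenAI(base_url="YOUR_APP_URL/v1", api_key="vllm-key")

response = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant developed by Qwen. You should think step-by-step."},
        {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
    ],
    temperature=0.6,
    top_p=0.95,
    # vLLM-specific sampling parameters; availability may vary by vLLM version.
    extra_body={
        "top_k": 40,
        "min_p": 0.01,
        "repetition_penalty": 1.05,
    },
)
print(response.choices[0].message.content)
```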
Setting Up a Frontend with Open WebUI
For a user-friendly interface to interact with your deployed model, you can set up Open WebUI:
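For example, here is a sketch of running Open WebUI with Docker and pointing it at your deployment. The OPENAI_API_BASE_URL and OPENAI_API_KEY environment variables follow Open WebUI’s settings for OpenAI-compatible backends; verify them against the current Open WebUI documentation.

```bash
# Run Open WebUI locally, connected to the deployed vLLM endpoint
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=YOUR_APP_URL/v1 \
  -e OPENAI_API_KEY=vllm-key \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```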
After running this command:
- Open your browser and navigate to http://localhost:3000
- Set up your administrator account
- Configure the model by adding your Tensorfuse endpoint and API key
Troubleshooting Common Issues
Infinite Generations or Repetitions
If you experience infinite loops or repetitive output, try:
- Increasing repetition_penalty to 1.1
- Setting min_p to 0.01
- Using Q4_K_M quantization instead of Q6_K
Out of Memory Errors
If you encounter out-of-memory (OOM) errors, try the following (see the example flags after this list):
- Reduce context window size
- Decrease the batch size
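Both adjustments map to vLLM server flags; here is a sketch with illustrative values to tune for your GPU memory budget.

```bash
# Smaller context window and fewer concurrent sequences reduce KV-cache memory
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/QwQ-32B \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.9
```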
Click here to get started with Tensorfuse.