
Boost LLM Throughput: vLLM vs. SGLang and Other Serving Frameworks

Simran Verma

Feb 13, 2025

6 mins

Serving open-source Large Language Models (LLMs) efficiently requires optimizing across hardware, software, and inference techniques. In this blog, we’ll explore the best strategies for improving LLM inference speed, covering real-world optimizations like continuous batching, quantization, and memory-efficient caching. We’ll also compare different inference frameworks, debunk common myths, and provide a structured guide to choosing the right optimizations.

Framework Benchmarking: Which Inference Engine is Fastest?

When serving an LLM, throughput (measured in tokens per second, TPS) is a key metric. Below is a comparative analysis of top inference frameworks running 7B models on an Nvidia A100 (FP16):

| Framework | TPS (7B Model) | Key Features & Strengths |
| --- | --- | --- |
| vLLM | 130-1800 | PagedAttention, dynamic batching |
| SGLang | ~180-5000 | RadixAttention, prefix-sharing |
| TensorRT-LLM | 220-743 | Nvidia-optimized, FP8 support |
| Triton Server | 160-200 | Dynamic batching, multi-framework |
| Llama.cpp | 20-90 | CPU support, lightweight |
| mistral.rs | ~150-200 | Rust-based, CPU/GPU efficiency |
| TGI (HF) | 180-220 | Hugging Face integration, multi-GPU |

  • High throughput needs? vLLM, SGLang, and TensorRT-LLM are top choices.
  • CPU-based inference? Llama.cpp or mistral.rs are strong contenders.
  • Seamless Hugging Face model serving? TGI is well-integrated.
  • Enterprise-scale NVIDIA deployments? TensorRT-LLM + Triton offers peak performance.

Key Optimization Techniques for LLM Inference

1. In-Flight Continuous Batching

In-flight (continuous) batching merges new requests into the running batch mid-generation, keeping the GPU busy instead of letting it sit idle between requests. This can deliver up to a 3.5x throughput increase over naive one-request-at-a-time processing.
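
Below is a minimal sketch of letting vLLM’s scheduler handle continuous batching for you; the model name, prompts, and sampling settings are illustrative, and details may vary between vLLM versions.

```python
# Sketch: continuous batching via vLLM's scheduler (illustrative model/prompts).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # any 7B-class model
params = SamplingParams(temperature=0.7, max_tokens=128)

# Submit many prompts at once; the engine interleaves them and admits new
# sequences into the running batch as others finish, instead of waiting
# for a fixed-size batch to fill up.
prompts = [f"Summarize item {i} in one sentence." for i in range(64)]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)
```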

2. FastAPI Alone Won’t Cut It

FastAPI is great for building APIs, but on its own it struggles with high-throughput inference: Python’s GIL adds latency, manual batching is inefficient, and there is no built-in GPU scheduling.

vLLM, TGI, or Triton Server optimize inference with native batching, efficient memory management, and lower latency.
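
As a rough example, the snippet below queries a vLLM OpenAI-compatible server from application code instead of wrapping the model in a hand-rolled FastAPI endpoint. It assumes a server is already running locally (e.g. started with `python -m vllm.entrypoints.openai.api_server --model <your-model>`; the exact command varies by vLLM version), and the port and model name are assumptions.

```python
# Sketch: client-side call to a locally running vLLM OpenAI-compatible server.
# The server handles batching, scheduling, and GPU memory; the client just sends requests.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # must match the model the server loaded
    prompt="Explain continuous batching in one sentence.",
    max_tokens=64,
)
print(resp.choices[0].text)
```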

3. KV Cache Optimization

If your use case involves repetitive prompts or multi-turn interactions with shared context, consider prefix-caching strategies.

If multiple queries have the same prefix, you can do the heavy computation for that prefix once, and reuse it for subsequent requests. This is all about the KV cache – the key/value tensors that store the model’s attention history.
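
In vLLM this is exposed as a prefix-caching flag; a minimal sketch is below (flag name per recent vLLM releases, so check your installed version; the shared system prompt is illustrative).

```python
# Sketch: prefix caching in vLLM so a shared prompt prefix is computed once.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)
params = SamplingParams(max_tokens=64)

system = "You are a support assistant for ACME Corp. Answer concisely.\n\n"
questions = ["How do I reset my password?", "How do I cancel my plan?"]

# The KV cache for the shared prefix is reused across requests, so only the
# per-question suffix has to be recomputed.
outputs = llm.generate([system + q for q in questions], params)
for out in outputs:
    print(out.outputs[0].text)
```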

4. Quantization: More Speed, Minimal Accuracy Loss

If you are deploying a model and are concerned about latency, consider using a quantized 4-bit version (if one is available) or running the AWQ tooling on it yourself.

Reducing precision (e.g., 4-bit instead of 16-bit) shrinks memory usage and boosts speed.

AWQ (Activation-Aware Weight Quantization) achieves up to 1.7x speedup vs GPTQ, with <1% accuracy loss.
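
For instance, serving a community AWQ checkpoint with vLLM can look like the sketch below; the repo id is just an example of a 4-bit AWQ quantization, so substitute one you trust.

```python
# Sketch: loading a 4-bit AWQ-quantized checkpoint in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")  # example repo
params = SamplingParams(max_tokens=128)

outputs = llm.generate(["What is activation-aware weight quantization?"], params)
print(outputs[0].outputs[0].text)
```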

5. hf-transfer: Speed Up Model Downloads (3-5x Faster)

By default, Hugging Face Hub downloads go through Python’s single-threaded HTTP requests, which becomes a bottleneck on fast networks.

hf-transfer leverages Rust-based parallel downloads, achieving 500MB/s+ speeds on high-bandwidth networks. This doesn’t directly improve inference speed, but it improves deployment agility – spinning up new instances or switching models becomes less painful. For anyone frequently downloading from the Hub, it’s a useful trick.
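
Enabling it is mostly a one-liner; the sketch below assumes `pip install huggingface_hub hf-transfer` and uses an illustrative repo id (gated models also need authentication).

```python
# Sketch: enable Rust-based parallel downloads for the Hugging Face Hub.
# The env var must be set before huggingface_hub reads its configuration.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="meta-llama/Llama-2-7b-hf")  # example repo
print("Model files downloaded to:", local_dir)
```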

Debunking Common Inference Optimization Myths

Myth 1: “Higher Precision = Higher Accuracy”

  • FP16 matches FP32 accuracy for LLMs.
  • AWQ 4-bit quantization achieves ~99% of FP16 quality.

Myth 2: “Batching = Higher Latency”

  • Modern frameworks batch without increasing latency via continuous in-flight batching.

Myth 3: “Adding More GPUs = Faster Inference”

  • More GPUs only help for large models (e.g., Llama2-70B, GPT-3).
  • Smaller models run best on a single optimized GPU.

Choosing the Right Optimization Strategy

Based on Model Size & Hardware Constraints

  • ≤7B: Can run on a single GPU (or even CPU with llama.cpp).
  • 13B-30B: Requires 8-bit or 4-bit quantization for single GPU.
  • 65B+: Requires multi-GPU or aggressive quantization (AWQ 4-bit).

Based on Use Case

| Use case | Strategy |
| --- | --- |
| High Throughput (API, chatbot serving) | Use vLLM/TGI with batching and quantization |
| Long Contexts (Document QA, Code analysis) | Use frameworks like vLLM that handle a large KV cache efficiently |
| Latency-Critical (Autocompletion, low-latency responses) | Prioritize single-GPU optimization; avoid excessive batching |

Final Thoughts

Optimizing LLM inference is a balancing act between speed, memory, and accuracy. The best approach depends on your use case, hardware, and performance needs.

Key Takeaways:

  • Use in-flight batching (vLLM, TGI) for maximum TPS.
  • Quantize models (AWQ, GPTQ) for 4-bit speed and memory gains with minimal accuracy loss.
  • Optimize KV cache (cut latency by ~40%).
  • Avoid naive FastAPI deployment – use inference-optimized servers.
  • Choose the right strategy based on model size and use case.

By applying these strategies, you can achieve blazing-fast inference speeds while keeping your costs low. 🚀

Deploy in minutes, scale in seconds

Get started for free or contact us to get a custom demo tailored to your needs.
