Learning
Boost LLM Throughput: vLLM vs. SGLang and Other Serving Frameworks
Simran Verma
Feb 13, 2025
6 mins
Serving open-source Large Language Models (LLMs) efficiently requires optimizing across hardware, software, and inference techniques. In this blog, we’ll explore the best strategies for improving LLM inference speed, covering real-world optimizations like continuous batching, quantization, and memory-efficient caching. We’ll also compare different inference frameworks, debunk common myths, and provide a crisp, structured guide to choosing the right optimizations.

Framework Benchmarking: Which Inference Engine is Fastest?
When serving an LLM, throughput (measured in tokens per second, TPS) is a key metric. Here is how the top inference frameworks compare when serving 7B models on an NVIDIA A100 (FP16):
- High throughput needs? vLLM, SGLang, and TensorRT-LLM are top choices.
- CPU-based inference? Llama.cpp or mistral.rs are strong contenders.
- Seamless Hugging Face model serving? TGI is well-integrated.
- Enterprise-scale NVIDIA deployments? TensorRT-LLM + Triton offers peak performance.
Key Optimization Techniques for LLM Inference
1. In-Flight Continuous Batching
In-flight continuous batching merges newly arriving requests into the running batch mid-generation, so the GPU never sits idle between requests. This can yield up to a 3.5x throughput increase over naive one-request-at-a-time processing.
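To see the effect from the client side, here is a minimal sketch that fires many concurrent requests at a vLLM OpenAI-compatible server. The launch command, model id, and port are assumptions for illustration; the point is that the engine’s scheduler folds concurrent requests into shared forward passes instead of queuing them.

```python
# Sketch: many concurrent client requests against a vLLM OpenAI-compatible server.
# Assumes a server was started separately, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model <your-model>
# The model id, port, and prompts are placeholders.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(prompt: str) -> str:
    resp = await client.completions.create(
        model="your-model",  # placeholder; must match the served model name
        prompt=prompt,
        max_tokens=64,
    )
    return resp.choices[0].text

async def main() -> None:
    prompts = [f"Write a haiku about request {i}." for i in range(32)]
    # All 32 requests are in flight at once; the server's continuous batching
    # scheduler merges newly arriving requests into the running batch rather
    # than making them wait for the current batch to finish.
    results = await asyncio.gather(*(one_request(p) for p in prompts))
    print(results[0])

asyncio.run(main())
```

TGI and TensorRT-LLM expose the same behaviour through their own servers, so this client pattern carries over.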
2. FastAPI Alone Won’t Cut It
FastAPI is great for building APIs, but on its own it struggles with high-throughput inference: Python’s GIL introduces latency, manual batching is inefficient, and there is no built-in GPU scheduling.
vLLM, TGI, and Triton Inference Server optimize inference with native batching, efficient memory management, and lower latency.
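If you do want FastAPI as the front end, keep it thin and let the engine do the heavy lifting. Below is a minimal sketch that puts FastAPI in front of vLLM’s AsyncLLMEngine so that batching, scheduling, and KV-cache management happen inside the engine rather than in Python request handlers; exact class names and signatures vary across vLLM versions, and the model id is a placeholder.

```python
# Sketch: FastAPI as a thin front end over vLLM's AsyncLLMEngine.
# API details vary by vLLM version; the model id is a placeholder.
# Run with uvicorn, e.g. `uvicorn app:app`.
import uuid

from fastapi import FastAPI
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="your-model"))

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 128):
    # Each HTTP request becomes one engine request; the engine's scheduler
    # batches all in-flight requests together across forward passes.
    params = SamplingParams(max_tokens=max_tokens)
    final = None
    async for output in engine.generate(prompt, params, str(uuid.uuid4())):
        final = output  # the engine streams partial outputs; keep the last one
    return {"text": final.outputs[0].text}
```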
3. KV Cache Optimization
The KV cache holds the key/value attention tensors the model has already computed for previous tokens. If your use case involves repetitive prompts or multi-turn interactions with shared context, consider prefix-caching strategies.
When multiple queries share the same prefix, the heavy computation for that prefix is done once and its KV cache is reused for subsequent requests.
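As a concrete sketch, recent vLLM releases expose this as automatic prefix caching behind a single engine flag; the flag name below matches current documentation but may differ in older versions, and the model id, shared system prompt, and questions are placeholders.

```python
# Sketch: automatic prefix caching in vLLM (check your version's docs).
# The model id, system prompt, and questions are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="your-model", enable_prefix_caching=True)

shared_prefix = "You are a support assistant for ExampleCorp. Policies: ...\n\n"
prompts = [shared_prefix + q for q in (
    "How do I reset my password?",
    "What is the refund policy?",
    "How do I contact a human agent?",
)]

# The KV cache for `shared_prefix` is computed once and reused, so only each
# question's unique suffix pays the prefill cost on later requests.
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```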
4. Quantization: More Speed, Minimal Accuracy Loss
If you are deploying a model and are concerned about latency, consider using a quantized 4-bit version (if one is available) or running the AWQ tooling on it yourself.
Reducing precision (e.g., 4-bit instead of 16-bit) shrinks memory usage and boosts speed.
AWQ (Activation-Aware Weight Quantization) achieves up to 1.7x speedup vs GPTQ, with <1% accuracy loss.
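Serving a pre-quantized AWQ checkpoint is typically a one-line change. The sketch below assumes a vLLM version with AWQ support; the checkpoint name is a placeholder that should point at an actual AWQ export of your model.

```python
# Sketch: serving a pre-quantized 4-bit AWQ checkpoint with vLLM.
# The checkpoint name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-model-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",               # weights are 4-bit AWQ
    dtype="float16",                  # activations stay in FP16
)

out = llm.generate(["Explain KV caching in one sentence."],
                   SamplingParams(max_tokens=48))
print(out[0].outputs[0].text)
```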
5. hf-transfer: Speed Up Model Downloads (3-5x Faster)
By default, Hugging Face Hub downloads go through Python’s single-threaded requests library, which becomes a bottleneck for multi-gigabyte model files.
hf-transfer leverages Rust-based parallel downloads, achieving 500MB/s+ on high-bandwidth networks.
This doesn’t directly improve inference speed, but it improves deployment agility – spinning up new instances or switching models becomes less painful. For anyone frequently downloading from the Hub, it’s a useful trick.
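Enabling it is essentially an environment-variable switch. The sketch below assumes hf_transfer is installed alongside huggingface_hub; the repo id is a placeholder.

```python
# Sketch: enabling the Rust-based hf_transfer backend for Hub downloads.
# Requires `pip install hf_transfer huggingface_hub`; the repo id is a placeholder.
import os

# Must be set before huggingface_hub reads its settings (or export it in your shell).
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# snapshot_download now fetches files via hf_transfer's parallel downloader.
local_path = snapshot_download("your-org/your-model")
print("Model downloaded to:", local_path)
```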
Debunking Common Inference Optimization Myths
Myth 1: “Higher Precision = Higher Accuracy”
- FP16 matches FP32 accuracy for LLMs.
- AWQ 4-bit quantization achieves ~99% of FP16 quality.
Myth 2: “Batching = Higher Latency”
- Modern frameworks batch without increasing latency via continuous in-flight batching.
Myth 3: “Adding More GPUs = Faster Inference”
- More GPUs only help for large models (e.g., Llama2-70B, GPT-3).
- Smaller models run best on a single optimised GPU.
Choosing the Right Optimization Strategy
Based on Model Size & Hardware Constraints
- ≤7B: Can run on a single GPU (or even CPU with llama.cpp).
- 13B-30B: Requires 8-bit or 4-bit quantization to fit on a single GPU.
- 65B+: Requires multi-GPU tensor parallelism or aggressive quantization (AWQ 4-bit); see the sketch below.
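For the multi-GPU case, tensor parallelism in vLLM is a single argument. The sketch below assumes one node with four visible GPUs; the model id is a placeholder.

```python
# Sketch: tensor parallelism for a large model in vLLM.
# Assumes 4 visible GPUs on one node; the model id is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-70b-model",  # placeholder large checkpoint
    tensor_parallel_size=4,           # shard the model across 4 GPUs
)

out = llm.generate(["Summarize tensor parallelism in one sentence."],
                   SamplingParams(max_tokens=48))
print(out[0].outputs[0].text)
```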
Based on Use Case
Final Thoughts
Optimizing LLM inference is a balancing act between speed, memory, and accuracy. The best approach depends on your use case, hardware, and performance needs.
Key Takeaways:
- Use in-flight batching (vLLM, TGI) for maximum TPS.
- Quantize models (AWQ, GPTQ) for 4-bit gains with minimal accuracy loss.
- Optimize KV cache (cut latency by ~40%).
- Avoid naive FastAPI deployment – use inference-optimized servers.
- Choose the right strategy based on model size and use case.
By applying these strategies, you can achieve blazing-fast inference speeds while keeping your costs low. 🚀