March 14, 2025
Samagra Sharma
Founder
Serving open-source Large Language Models (LLMs) efficiently requires optimizing across hardware, software, and inference techniques. In this blog, we'll explore the best strategies for improving LLM inference speed, covering real-world optimizations like continuous batching, quantization, and memory-efficient caching. We'll also compare different inference frameworks, debunk common myths, and provide a crisp, structured guide to optimization choices.

Framework Benchmarking: Which Inference Engine is Fastest?

When serving an LLM, throughput (measured in tokens per second, TPS) is a key metric. Below is a comparative analysis of top inference frameworks running 7B models on an Nvidia A100 (FP16):
| Framework | TPS (7B model) | Key features & strengths |
|---|---|---|
| vLLM | 130-1800 | PagedAttention, dynamic batching |
| SGLang | ~180-5000 | RadixAttention, prefix sharing |
| TensorRT-LLM | 220-743 | NVIDIA-optimized, FP8 support |
| Triton Server | 160-200 | Dynamic batching, multi-framework |
| Llama.cpp | 20-90 | CPU support, lightweight |
| mistral.rs | ~150-200 | Rust-based, CPU/GPU efficiency |
| TGI (HF) | 180-220 | Hugging Face integration, multi-GPU |
  • High throughput needs? vLLM, SGLang, and TensorRT-LLM are top choices.
  • CPU-based inference? Llama.cpp or mistral.rs are strong contenders.
  • Seamless Hugging Face model serving? TGI is well-integrated.
  • Enterprise-scale NVIDIA deployments? TensorRT-LLM + Triton offers peak performance.
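As a concrete starting point, the fragment below shows one common way to stand up a high-throughput server with vLLM's OpenAI-compatible endpoint. The model name, port, and flags are illustrative examples, not a fixed recipe; check the vLLM docs for the exact CLI options in your version:

```shell
# Launch an OpenAI-compatible inference server (model, port, and flags are examples).
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
    --dtype float16 \
    --max-num-seqs 256 \
    --port 8000

# Query it like any OpenAI-style endpoint.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2",
         "prompt": "Explain continuous batching in one sentence.",
         "max_tokens": 64}'
```

TGI and Triton expose equivalent HTTP endpoints, so a client written against one server is usually easy to port to another.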

Key Optimization Techniques for LLM Inference

  1. In-Flight Continuous Batching. In-flight continuous batching merges new requests into a running batch mid-generation, so the GPU never sits idle between requests. This can yield up to a 3.5x throughput increase over naive sequential request processing.
  2. FastAPI Alone Won't Cut It. FastAPI is great for building APIs, but it struggles with high-throughput inference: Python's GIL introduces latency, manual batching is inefficient, and there is no built-in GPU scheduling. Dedicated engines like vLLM, TGI, or Triton Server optimize inference with native batching, efficient memory management, and lower latency.
  3. KV Cache Optimization. If your use case involves repetitive prompts or multi-turn interactions with shared context, consider prefix-caching strategies. When multiple queries share the same prefix, you can do the heavy prefill computation for that prefix once and reuse it for subsequent requests. This is what the KV cache is for: the key/value tensors that store the model's attention history.
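To make in-flight batching concrete, here is a minimal, framework-free scheduler sketch. All names are illustrative; real engines like vLLM schedule at the token level with far more bookkeeping (paged KV memory, preemption, and so on):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    remaining: int              # tokens still to generate
    output: list = field(default_factory=list)

def continuous_batching(requests, max_batch=4):
    """Each step generates one token for every active request; freed
    batch slots are back-filled from the queue mid-generation."""
    queue, active, finished, steps = deque(requests), [], [], 0
    while queue or active:
        # Admit waiting requests into the running batch (the "in-flight" part).
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        steps += 1
        for r in active:
            r.output.append(f"tok{steps}")
            r.remaining -= 1
        finished += [r for r in active if r.remaining == 0]
        active = [r for r in active if r.remaining > 0]
    return finished, steps

# One long request plus five short ones: static batching would need
# 6 decode steps (5 for the first batch, then 1 for a leftover batch);
# back-filling freed slots finishes in 5.
reqs = [Request(0, 5)] + [Request(i, 1) for i in range(1, 6)]
done, steps = continuous_batching(reqs)
print(steps)  # 5
```

The gap versus static batching grows as request lengths become more uneven, which is exactly the regime real LLM traffic lives in.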
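Prefix reuse can be illustrated with a toy cache. Everything here is simulated (real KV tensors live on the GPU, and engines like vLLM and SGLang manage them block-wise), but the accounting shows why shared prefixes are cheap:

```python
class PrefixKVCache:
    """Toy prefix cache: remembers (simulated) key/value entries for every
    prompt prefix seen so far, so a new prompt only pays prefill cost for
    the suffix that has not been computed before."""
    def __init__(self):
        self.store = {}          # tuple(tokens) -> list of KV entries
        self.prefill_tokens = 0  # tokens actually (re)computed

    def _prefill(self, tokens):
        # Stand-in for running the model over `tokens` to build KV tensors.
        self.prefill_tokens += len(tokens)
        return [f"kv:{t}" for t in tokens]

    def get_kv(self, tokens):
        # Find the longest already-cached prefix of this prompt.
        best = 0
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self.store:
                best = end
                break
        kv = list(self.store.get(tuple(tokens[:best]), []))
        kv += self._prefill(tokens[best:])        # compute only the new suffix
        for end in range(best + 1, len(tokens) + 1):
            self.store[tuple(tokens[:end])] = kv[:end]
        return kv

cache = PrefixKVCache()
system = ["<sys>", "You", "are", "a", "helpful", "assistant"]
cache.get_kv(system + ["Q1"])   # cold: prefills all 7 tokens
cache.get_kv(system + ["Q2"])   # warm: reuses the 6-token system prefix
print(cache.prefill_tokens)     # 8, versus 14 without prefix sharing
```

With a long shared system prompt and many short user turns, most of the prefill work disappears, which is the effect RadixAttention-style prefix sharing exploits at scale.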

The right space at the right time

Mintlify caught my attention because of its enormous potential in this new world of AI. Documentation sits right in the middle of where humans and AI interact.