Throughout this guide, we use the `meta-llama/Llama-3.1-8B-Instruct` model running on L40S GPUs to provide concrete performance numbers. Storage costs $0.30/GB-month and the data transfer cost is $0.03/GB. For example, caching a 16GB model that cold starts 40 times a month would cost approximately $24: $4.80 for storage plus $19.20 for 40 transfers of 16GB each.

vLLM lets you choose the weight-loading backend via the `load_format` parameter, including extensions like `fastsafetensors` or `run-ai`. You can find the list of supported formats in the vLLM documentation.
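As a sketch of how this looks in practice, the loading backend is selected with the `--load-format` flag on the server command. Backend availability depends on your vLLM version and installed extras, so treat the value below as an example:

```bash
# Example only: serve the model with the fastsafetensors loading backend.
# Requires the fastsafetensors extra to be installed in this environment.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --load-format fastsafetensors
```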
`torch.compile` is a just-in-time (JIT) compiler that dramatically speeds up model execution at runtime. However, this performance comes at the cost of an initial compilation step, which takes about 52 seconds for our example model.

Fortunately, `torch.compile` includes a built-in caching system. In a Kubernetes environment, you can persist this cache by using a shared volume: the first pod performs the compilation and saves the cache, making it instantly available to all subsequent pods.
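A minimal sketch of that shared-volume setup, assuming a ReadWriteMany PersistentVolumeClaim named `vllm-cache` already exists (all names here are hypothetical):

```yaml
# Pod fragment: mount a shared PVC over vLLM's cache directory so the
# first pod's torch.compile output is reused by every later pod.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-server
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      volumeMounts:
        - name: cache
          mountPath: /root/.cache   # torch.compile and model caches live here
  volumes:
    - name: cache
      persistentVolumeClaim:
        claimName: vllm-cache       # hypothetical RWX claim
```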
The same shared-volume approach also persists the `flashinfer` cache.
Without any caching, a cold start breaks down as follows:

| Step | Time Taken |
|---|---|
| Model download | 61 seconds |
| Weight loading | 33 seconds |
| Dynamo bytecode transformation | 10 seconds |
| Graph compilation | 42 seconds |
| Graph capture | 54 seconds |
| Init engine and imports | 94 seconds |
Mount a persistent volume at `/root/.cache`. This will cache the model download, `torch.compile` results, and other initialization artifacts (in our setup, graphs are compiled and captured for batch sizes 1, 2, 4, 8, 16, 24, 32, 64).

With a warm cache, the same startup steps look like this:

| Step | Time Taken |
|---|---|
| Model download | 0 seconds |
| Weight loading | 18 seconds |
| Dynamo bytecode transformation | 10 seconds |
| Graph compilation | 13 seconds |
| Graph capture | 7 seconds |
| Init engine and imports | 34 seconds |
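Summing the two tables shows the payoff; a quick shell check of the totals:

```shell
# Totals from the tables above: cold (no cache) vs. warm (shared cache).
cold=$((61 + 33 + 10 + 42 + 54 + 94))
warm=$((0 + 18 + 10 + 13 + 7 + 34))
echo "cold=${cold}s warm=${warm}s"   # cold=294s warm=82s, roughly 3.6x faster
```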
The cache location is `/root/.cache` in both of these cases. Additionally, set the `TORCH_CUDA_ARCH_LIST` environment variable to match the compute capability of your target GPU. This ensures `torch.compile` generates the most optimized code. Here's a quick reference for supported AWS instances:
| GPU type | AWS Instance Type | `TORCH_CUDA_ARCH_LIST` | AWS Link |
|---|---|---|---|
| V100 | p3 | 7.0 | Link |
| A10G | g5 | 8.6 | Link |
| T4 | g4 | 7.5 | Link |
| L4 | g6 | 8.9 | Link |
| L40S | g6e | 8.9 | Link |
| A100 | p4 | 8.0 | Link |
| H100 | p5 | 9.0 | Link |
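For example, on the g6e (L40S) instances used in this guide, you would set the value from the table above:

```shell
# L40S has compute capability 8.9; restricting the arch list keeps
# torch.compile from building kernels for architectures you will not run.
export TORCH_CUDA_ARCH_LIST="8.9"
```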
To find the Network Bandwidth for your GPU, open the corresponding AWS link, scroll down until you find the Product details table, and scroll to the right.