Deploy Meta's Llama 4 Scout and Maverick models using Tensorfuse
Here is how the two models stack up against other leading models on common benchmarks:

Benchmark | Llama 4 Scout | Llama 4 Maverick | Industry Leader | Remarks |
---|---|---|---|---|
MMLU Pro | 74.3% | 80.5% | 86.1% (GPT-4) | Reasoning and knowledge benchmark |
GPQA Diamond | 57.2% | 69.8% | 73.5% (Claude 3) | Scientific reasoning capabilities |
ChartQA | 82.3% | 90.0% | 92.3% (GPT-4V) | Visual understanding of charts |
MT-Bench | 7.89 | 8.84 | 8.95 (Claude 3) | Conversational abilities |
Before you deploy, make sure the following are in place:

1. Access to Llama 4: Llama 4 is a gated model, so request access on its Hugging Face page and accept Meta's license terms.
2. Set your huggingface token: Generate a READ token from your huggingface profile and store it as a secret in Tensorfuse. The secret key must be named HUGGING_FACE_HUB_TOKEN, as vLLM assumes the same.
3. Set your API authentication token: This guide uses vllm-key as the api-key. For production, generate a random token (for example with openssl rand -base64 32, as shown below) and keep it secure, as Tensorfuse secrets are opaque. Your API service will be publicly accessible, so a strong authentication mechanism is essential.
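For example, a strong random key can be generated like this:

```bash
# Generate a 32-byte random key, base64-encoded, to use as the API token
openssl rand -base64 32
```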
Also specify a readiness endpoint in your deployment.yaml config. Tensorfuse uses this to ensure your service is healthy before routing traffic to it. If not specified, Tensorfuse will default to checking /readiness on port 80.
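As a rough sketch, the readiness block might look like the Kubernetes-style probe below. The field names and the /health path are assumptions (vLLM's OpenAI-compatible server does expose a /health endpoint); check the Tensorfuse configuration reference for the exact schema.

```yaml
# Sketch only: field names assumed, modeled on Kubernetes readiness probes.
readiness:
  httpGet:
    path: /health   # vLLM's OpenAI-compatible server serves /health
    port: 80
```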
Once the deployment is live, replace YOUR_APP_URL with the endpoint from the command output and run:
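A minimal request against the OpenAI-compatible API that vLLM exposes, assuming a Scout deployment and the vllm-key token from above:

```bash
# YOUR_APP_URL comes from the deploy output; vllm-key is the api-key set earlier.
curl -X POST YOUR_APP_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer vllm-key" \
  -d '{
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [{"role": "user", "content": "Summarize the Llama 4 architecture in one paragraph."}],
        "max_tokens": 256
      }'
```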
Pick a GPU configuration based on the context length and throughput you need:

Model | GPU Configuration | Context Length | Tokens/sec (batch=32) |
---|---|---|---|
Scout | 8x H100 | Up to 1M tokens | ~180 |
Scout | 8x H200 | Up to 3.6M tokens | ~260 |
Scout | Multi-node setup | Up to 10M tokens | Varies by setup |
Maverick | 8x H100 | Up to 430K tokens | ~150 |
Maverick | 8x H200 | Up to 1M tokens | ~210 |
Two vLLM flags are worth adding on Hopper-class GPUs (both appear in the sketch below):

- Use `--kv-cache-dtype fp8` to potentially double the usable context window and gain a performance boost with minimal accuracy impact.
- Use `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy on long-context requests.
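Putting the two flags together, a launch command for Scout on 8x H100 might look like this sketch (the model id, parallelism degree, and context length are illustrative; size --max-model-len to your hardware):

```bash
# Illustrative vLLM launch: 8-way tensor parallelism, fp8 KV cache for a
# longer usable context, and attention temperature tuning for long inputs.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000 \
  --kv-cache-dtype fp8 \
  --override-generation-config='{"attn_temperature_tuning": true}' \
  --api-key vllm-key
```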
If Hopper GPUs are unavailable, the models can also be served on A100s, with shorter practical context lengths:

Model | A100 Configuration | Practical Context Length |
---|---|---|
Scout | 8x A100 (80GB) | Up to 160K tokens |
Maverick | 8x A100 (80GB) | Up to 90K tokens |
Llama 4's Mixture-of-Experts Architecture