Deploy Llama 4 Models on your AWS account
Deploy Meta’s Llama 4 Scout and Maverick models using Tensorfuse
The Llama 4 herd is Meta’s newest generation of large language models, featuring the Scout and Maverick variants.
These models introduce architectural innovations like Mixture of Experts (MoE) and Interleaved RoPE (iRoPE) that enable exceptional performance with massive context lengths while maintaining reasonable inference costs.
In this guide, we’ll walk you through deploying these state-of-the-art models on your cloud account using Tensorfuse and vLLM v0.8.3.
Why Build with Llama 4
Llama 4 offers several compelling advantages that make it an excellent choice for production applications:
- Native Multimodality: Early fusion architecture seamlessly integrates text and images (up to 10 images per request)
- Massive Context Windows: Up to 10 million tokens for Scout, enabling multi-document summarization and reasoning over vast codebases
- Mixture of Experts (MoE) Architecture: More compute-efficient models that activate only a subset of parameters per token
- Interleaved RoPE (iRoPE): Novel attention mechanism that efficiently handles long sequences by alternating between global and local attention
- State-of-the-Art Performance: Competitive with or exceeding proprietary models like GPT-4o and Gemini 2.0
- Multilingual Support: Pre-trained on 200 languages, with over 100 languages having more than 1 billion tokens each
- Responsible AI: Meta’s advanced safety training and protections to prevent harmful, unsafe, and unethical outputs
Here’s a snapshot of benchmark scores for Llama 4:
| Benchmark | Llama 4 Scout | Llama 4 Maverick | Industry Leader | Remarks |
|---|---|---|---|---|
| MMLU Pro | 74.3% | 80.5% | 86.1% (GPT-4) | Reasoning and knowledge benchmark |
| GPQA Diamond | 57.2% | 69.8% | 73.5% (Claude 3) | Scientific reasoning capabilities |
| ChartQA | 82.3% | 90% | 92.3% (GPT-4V) | Visual understanding of charts |
| MT-Bench | 7.89 | 8.84 | 8.95 (Claude 3) | Conversational abilities |
Prerequisites
Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven’t done that yet, follow the Getting Started guide.
Deploying Llama 4 Models with Tensorfuse
Each Tensorkube deployment requires:
- Your code (in this example, the vLLM OpenAI-compatible API server baked into a Docker image)
- Your environment (as a Dockerfile)
- A deployment configuration (deployment.yaml)
We will also add token-based authentication to our service, compatible with OpenAI client libraries.
Step 1: Setting up the secrets
Access to Llama 4
Llama 4 is gated behind a license agreement. Visit the Llama 4 Hugging Face repo to ensure that you have signed the agreement and have access to the model.
Set Hugging Face token
Get a READ token from your Hugging Face profile and store it as a secret in Tensorfuse using the command below. Ensure that the key for your secret is HUGGING_FACE_HUB_TOKEN, as vLLM expects it under that exact name.
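A minimal sketch of the command, assuming the tensorkube secret syntax shown in the Getting Started guide (the secret name hugging-face-secret is illustrative):

```bash
# Store your Hugging Face READ token; the key must be HUGGING_FACE_HUB_TOKEN
tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=hf_XXXXXXXXXXXX
```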
Set your API authentication token
Generate a random string that will be used as your API authentication token. Store it as a secret in Tensorfuse using the command below.
For the purpose of this demo, we will be using vllm-key as the API key.
In production, use a randomly generated token (e.g., created with openssl rand -base64 32) and keep it secure, as Tensorfuse secrets are opaque. Your API service will be publicly accessible, so a strong authentication mechanism is essential.
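For example, using the same assumed tensorkube syntax (the secret name vllm-token and the VLLM_API_KEY key are illustrative choices that we reference again in the Dockerfile below):

```bash
# Store the API key that clients will present as a Bearer token ("vllm-key" is for this demo only)
tensorkube secret create vllm-token VLLM_API_KEY=vllm-key

# In production, generate a strong key instead, e.g.:
# openssl rand -base64 32
```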
Step 2: Prepare the Dockerfiles
Let’s create separate Dockerfiles for Scout and Maverick models:
Remember that in the Dockerfiles below we deploy Scout with a context length of 1 million tokens and Maverick with a context length of 430K tokens. This is because 8x H100 GPUs have limited memory, and we need to ensure that the model fits in GPU memory. If you are using H200s or other GPUs with more than 80 GB of memory per card, you can experiment with longer context lengths.
We’ve configured the vLLM server with various CLI flags tailored to each model. For a comprehensive list of vLLM flags, refer to the vLLM documentation.
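Below is a minimal sketch of what the Scout Dockerfile could look like; the Maverick one would swap in the Maverick FP8 model ID and a 430K --max-model-len. The base image tag, model ID, and the VLLM_API_KEY environment variable name are assumptions on our part, so double-check the flags against the vLLM v0.8.3 documentation before building.

```dockerfile
# Dockerfile (Scout) — a sketch, not a drop-in production image
FROM vllm/vllm-openai:v0.8.3

EXPOSE 80

# Shell form so the env vars injected from Tensorfuse secrets
# (HUGGING_FACE_HUB_TOKEN, VLLM_API_KEY) expand at runtime
ENTRYPOINT vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 1000000 \
    --override-generation-config '{"attn_temperature_tuning": true}' \
    --api-key "$VLLM_API_KEY" \
    --port 80
```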
Step 3: Deployment Configuration
Create a deployment configuration file that matches the hardware requirements. The same configuration works for both Scout and Maverick.
Don’t forget the readiness endpoint in your config. Tensorfuse uses this to ensure your service is healthy before routing traffic to it. If not specified, Tensorfuse will default to checking /readiness on port 80.
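Here is a sketch of what the config might look like. The field names below are assumptions based on typical Tensorkube configs, and the readiness block simply points at vLLM’s health endpoint on port 80; consult the Tensorfuse docs for the exact schema.

```yaml
# deployment.yaml — a sketch; verify field names against the Tensorfuse docs
gpus: 8
gpu_type: h100
secret:
  - hugging-face-secret
  - vllm-token
min_scale: 1
max_scale: 3
readiness:
  httpGet:
    path: /health   # vLLM's health-check endpoint
    port: 80
```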
Step 4: Deploy your models
Deploy your services using these commands. Make sure there is only one Dockerfile in the directory (either Maverick or Scout).
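Assuming the tensorkube deploy syntax from the Getting Started guide, the deploy step would look roughly like this, run once from the Scout directory and once from the Maverick directory:

```bash
# Build the image from the Dockerfile in the current directory and deploy it
tensorkube deploy --config-file ./deployment.yaml
```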
Step 5: Accessing the deployed app
Voila! Your autoscaling production Llama 4 service is ready. Only authenticated requests will be served.
Once deployment is successful, check the status:
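For example (command name assumed from the Getting Started guide; the output should include the endpoint URL for your app):

```bash
tensorkube deployment list
```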
To test your deployment, replace YOUR_APP_URL with the endpoint from the command output and run:
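A sample request against the OpenAI-compatible chat completions endpoint (the model name must match the one baked into your Dockerfile, and vllm-key is the demo token from Step 1):

```bash
curl -X POST YOUR_APP_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer vllm-key" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{"role": "user", "content": "Explain Mixture of Experts in two sentences."}],
    "max_tokens": 128
  }'
```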
Since vLLM is compatible with the OpenAI API, you can also query the other OpenAI-compatible endpoints the server exposes (see the vLLM documentation for the full list).
You can also use the OpenAI Python SDK to query your deployment as shown below:
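A minimal sketch using the OpenAI Python SDK, assuming your endpoint URL and the demo vllm-key token:

```python
from openai import OpenAI

# Point the SDK at your Tensorfuse endpoint instead of api.openai.com
client = OpenAI(base_url="YOUR_APP_URL/v1", api_key="vllm-key")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Give me a one-line summary of Llama 4 Scout."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```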
Remember to configure a TLS endpoint with a custom domain before going to production for security and compatibility with modern clients.
Multimodal Capabilities
Llama 4 shines with its early fusion multimodal architecture, allowing it to process text and images simultaneously. Here’s how to use multimodal capabilities with your deployment:
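For example, here is a sketch of a mixed text-and-image request via the OpenAI-compatible API (the image URL is a placeholder; vLLM fetches it server-side before running inference):

```python
from openai import OpenAI

client = OpenAI(base_url="YOUR_APP_URL/v1", api_key="vllm-key")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```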
Llama 4 models work best with up to 8-10 images per request. For optimal performance, keep image sizes under 2048x2048 pixels. The models can interpret charts, diagrams, screenshots, photos, and even complex visual information including mathematical equations and code screenshots.
Context Length Capabilities
Llama 4 models offer impressive context length capabilities across different hardware configurations:
| Model | GPU Configuration | Context Length | Tokens/sec (batch=32) |
|---|---|---|---|
| Scout | 8x H100 | Up to 1M tokens | ~180 |
| Scout | 8x H200 | Up to 3.6M tokens | ~260 |
| Scout | Multi-node setup | Up to 10M tokens | Varies by setup |
| Maverick | 8x H100 | Up to 430K tokens | ~150 |
| Maverick | 8x H200 | Up to 1M tokens | ~210 |
These massive context windows enable entirely new use cases:
- Document Analysis: Process and reason across hundreds of pages of legal documents, technical manuals, or research papers in a single request
- Code Repository Understanding: Analyze entire codebases to debug complex issues or generate comprehensive documentation
- Long-Form Writing: Generate or edit lengthy content like novels, technical reports, or academic papers
- Multi-Document Synthesis: Summarize and synthesize information across multiple documents, such as research papers or business reports
To reach Scout’s maximum 10M context window, you’ll need to use distributed inference across multiple nodes with tensor parallelism or pipeline parallelism. Join our Slack community to learn more about this feature.
Advanced Configurations and Optimization Tips
Performance Optimization
- FP8 KV Cache: Add --kv-cache-dtype fp8 to potentially double the usable context window and gain a performance boost with minimal accuracy impact:
  - Before optimization: ~90 tokens/sec on 8x H100
  - After optimization: ~180 tokens/sec on 8x H100
- Long Context Accuracy: For contexts longer than 32K tokens, include --override-generation-config='{"attn_temperature_tuning": true}' to improve accuracy.
- Continuous Batching: vLLM already implements continuous batching by default, maximizing throughput for multiple concurrent users.
- Quantization: For Maverick, use the FP8 model variant, which provides excellent performance with minimal accuracy drop.
Hardware Compatibility
- A100 GPUs: The BF16 versions of both models work well on A100 GPUs but with reduced context lengths:
| Model | A100 Configuration | Practical Context Length |
|---|---|---|
| Scout | 8x A100 (80GB) | Up to 160K tokens |
| Maverick | 8x A100 (80GB) | Up to 90K tokens |
- INT4 Quantization: For Scout, an INT4 quantization that allows the model to fit on a single H100 GPU is in development.
- AMD MI300X: You can run Llama 4 on AMD MI300X GPUs by building vLLM from source, with nearly identical accuracy.
Key Architectural Innovations
Llama 4's Mixture-of-Experts Architecture
Llama 4’s architecture enables efficient long-context inference through several innovations:
Mixture of Experts (MoE): Instead of activating all parameters for every token, Llama 4 models use a “router” to select which expert(s) should process each token:
- Scout has 16 experts (109B total parameters)
- Maverick has 128 experts (400B total parameters)
- Only 1-2 experts are activated per token (17B active parameters)
- This approach dramatically reduces computational costs while maintaining quality
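To make the routing idea concrete, here is a toy sketch of a top-k MoE forward pass in Python. It is illustrative only, not Meta’s implementation: real routers are learned, run batched on GPU, and balance load across experts.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def moe_layer(x, router_w, experts, shared_expert, top_k=1):
    """Toy MoE forward pass: every token goes through a shared expert,
    plus only its top-k routed experts, so most expert weights stay idle."""
    scores = x @ router_w                           # (n_tokens, n_experts) router logits
    chosen = np.argsort(scores, axis=-1)[:, -top_k:]
    out = shared_expert(x)                          # shared expert sees all tokens
    for t in range(x.shape[0]):                     # routed experts see only their tokens
        for e in chosen[t]:
            out[t] += sigmoid(scores[t, e]) * experts[e](x[t])
    return out

# Tiny usage example: 4 "experts" that are just random linear maps
rng = np.random.default_rng(0)
d, n_experts, n_tokens = 8, 4, 3
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(n_experts)]
shared_expert = lambda X, W=rng.normal(size=(d, d)): X @ W
tokens = rng.normal(size=(n_tokens, d))
router_w = rng.normal(size=(d, n_experts))
print(moe_layer(tokens, router_w, experts, shared_expert).shape)  # (3, 8)
```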
Interleaved RoPE (iRoPE): Llama 4 alternates between global attention (without RoPE) and chunked local attention (with RoPE) in a 1:3 ratio:
- Global layers capture document-level patterns
- Local layers process detailed information within chunks
- This combination significantly reduces the quadratic complexity of attention
Early Fusion Multimodality: Rather than using separate encoders, Llama 4 integrates text and vision tokens directly into its core architecture:
- Images are processed through a vision encoder and then projected into the same embedding space as text tokens
- The model can attend to both text and image tokens simultaneously
- This enables deeper multimodal reasoning than late-fusion approaches
Llama 4 was trained on more than 30 trillion tokens across text, image, and video datasets. This is more than double the Llama 3 pre-training mixture and includes data from over 200 languages, with 100+ languages having substantial representation (1B+ tokens each).
For optimal storage and faster downloads, Llama 4 models on Hugging Face use the Xet storage backend, achieving ~25% deduplication for the main models and ~40% for derivative models, saving time and bandwidth.
Conclusion
With this guide, you’ve successfully deployed Llama 4 models on serverless GPUs using Tensorfuse. These models represent the cutting edge of open-source AI, offering capabilities that rival or exceed proprietary alternatives at a fraction of the cost.
Whether you’re building a sophisticated chatbot, a multimodal content creation tool, or an enterprise knowledge system, Llama 4 provides the foundation for building AI applications that were previously only possible with proprietary models.
Click here to get started with Tensorfuse.
You can also explore the Tensorfuse examples repository for more deployment configurations and use cases.