Deploy Llama 4 Models on your AWS account
Deploy Meta’s Llama 4 Scout and Maverick models using Tensorfuse
The Llama 4 herd is Meta’s newest generation of large language models, featuring the Scout and Maverick variants.
These models introduce architectural innovations like Mixture of Experts (MoE) and Interleaved RoPE (iRoPE) that enable exceptional performance with massive context lengths while maintaining reasonable inference costs.
In this guide, we’ll walk you through deploying these state-of-the-art models on your cloud account using Tensorfuse and vLLM v0.8.3.
Why Build with Llama 4
Llama 4 offers several compelling advantages that make it an excellent choice for production applications:
- Native Multimodality: Early fusion architecture seamlessly integrates text and images (up to 10 images per request)
- Massive Context Windows: Up to 10 million tokens for Scout, enabling multi-document summarization and reasoning over vast codebases
- Mixture of Experts (MoE) Architecture: More compute-efficient models that activate only a subset of parameters per token
- Interleaved RoPE (iRoPE): Novel attention mechanism that efficiently handles long sequences by alternating between global and local attention
- State-of-the-Art Performance: Competitive with or exceeding proprietary models like GPT-4o and Gemini 2.0
- Multilingual Support: Pre-trained on 200 languages, with over 100 languages having more than 1 billion tokens each
- Responsible AI: Meta’s advanced safety training and protections to prevent harmful, unsafe, and unethical outputs
Here’s a snapshot of benchmark scores for Llama 4:
| Benchmark | Llama 4 Scout | Llama 4 Maverick | Industry Leader | Remarks |
|---|---|---|---|---|
| MMLU Pro | 74.3% | 80.5% | 86.1% (GPT-4) | Reasoning and knowledge benchmark |
| GPQA Diamond | 57.2% | 69.8% | 73.5% (Claude 3) | Scientific reasoning capabilities |
| ChartQA | 82.3% | 90% | 92.3% (GPT-4V) | Visual understanding of charts |
| MT-Bench | 7.89 | 8.84 | 8.95 (Claude 3) | Conversational abilities |
Prerequisites
Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven’t done that yet, follow the Getting Started guide.
Deploying Llama 4 Models with Tensorfuse
Each Tensorkube deployment requires:
- Your code (in this example, the vLLM OpenAI-compatible API server baked into a Docker image)
- Your environment (as a Dockerfile)
- A deployment configuration (deployment.yaml)
We will also add token-based authentication to our service, compatible with OpenAI client libraries.
Step 1: Setting up the secrets
Access to Llama 4
Llama 4 is gated behind a license agreement. Visit the Llama 4 Hugging Face repo to ensure that you have signed the agreement and have access to the model.
Set Hugging Face token
Get a READ token from your Hugging Face profile and store it as a secret in Tensorfuse using the command below. Ensure that the key for your secret is HUGGING_FACE_HUB_TOKEN, as vLLM expects it under that exact name.
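A minimal sketch of the command, assuming the tensorkube secret syntax shown in the Getting Started guide (the secret name hugging-face-secret is illustrative):

```bash
# Store your Hugging Face READ token; the key must be HUGGING_FACE_HUB_TOKEN
tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=hf_XXXXXXXXXXXX
```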
Set your API authentication token
Generate a random string that will be used as your API authentication token. Store it as a secret in Tensorfuse using the command below.
For the purpose of this demo, we will be using vllm-key as the API key.
In production, use a randomly generated token (e.g., created with openssl rand -base64 32) and keep it secure, as Tensorfuse secrets are opaque. Your API service will be publicly accessible, so a strong authentication mechanism is essential.
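For example, using the same assumed tensorkube syntax (the secret name vllm-token and the VLLM_API_KEY key are illustrative choices that we reference again in the Dockerfile below):

```bash
# Store the API key that clients will present as a Bearer token ("vllm-key" is for this demo only)
tensorkube secret create vllm-token VLLM_API_KEY=vllm-key

# In production, generate a strong key instead, e.g.:
# openssl rand -base64 32
```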
Step 2: Prepare the Dockerfiles
Let’s create separate Dockerfiles for Scout and Maverick models:
Remember that in the Dockerfiles below we deploy Scout with a context length of 1 million tokens and Maverick with a context length of 430K tokens. This is because 8x H100 GPUs have limited memory, and we need to ensure that the model fits in GPU memory. If you are using H200s or other GPUs with more than 80 GB of memory per card, you can experiment with longer context lengths.
We’ve configured the vLLM server with various CLI flags tailored to each model. For a comprehensive list of vLLM flags, refer to the vLLM documentation.
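Below is a minimal sketch of what the Scout Dockerfile could look like; the Maverick one would swap in the Maverick FP8 model ID and a 430K --max-model-len. The base image tag, model ID, and the VLLM_API_KEY environment variable name are assumptions on our part, so double-check the flags against the vLLM v0.8.3 documentation before building.

```dockerfile
# Dockerfile (Scout) — a sketch, not a drop-in production image
FROM vllm/vllm-openai:v0.8.3

EXPOSE 80

# Shell form so the env vars injected from Tensorfuse secrets
# (HUGGING_FACE_HUB_TOKEN, VLLM_API_KEY) expand at runtime
ENTRYPOINT vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 8 \
    --max-model-len 1000000 \
    --override-generation-config '{"attn_temperature_tuning": true}' \
    --api-key "$VLLM_API_KEY" \
    --port 80
```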
Step 3: Deployment Configuration
Create a deployment configuration file that matches the hardware requirements. The same configuration works for both Scout and Maverick.
Don’t forget the readiness endpoint in your config. Tensorfuse uses this to ensure your service is healthy before routing traffic to it. If not specified, Tensorfuse will default to checking /readiness on port 80.
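Here is a sketch of what the config might look like. The field names below are assumptions based on typical Tensorkube configs, and the readiness block simply points at vLLM’s health endpoint on port 80; consult the Tensorfuse docs for the exact schema.

```yaml
# deployment.yaml — a sketch; verify field names against the Tensorfuse docs
gpus: 8
gpu_type: h100
secret:
  - hugging-face-secret
  - vllm-token
min_scale: 1
max_scale: 3
readiness:
  httpGet:
    path: /health   # vLLM's health-check endpoint
    port: 80
```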
Step 4: Deploy your models
Deploy your services using these commands. Make sure there is only one Dockerfile in the directory (either Maverick or Scout).
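Assuming the tensorkube deploy syntax from the Getting Started guide, the deploy step would look roughly like this, run once from the Scout directory and once from the Maverick directory:

```bash
# Build the image from the Dockerfile in the current directory and deploy it
tensorkube deploy --config-file ./deployment.yaml
```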
Step 5: Accessing the deployed app
Voila! Your autoscaling production Llama 4 service is ready. Only authenticated requests will be served.
Once deployment is successful, check the status:
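For example (command name assumed from the Getting Started guide; the output should include the endpoint URL for your app):

```bash
tensorkube deployment list
```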
To test your deployment, replace YOUR_APP_URL with the endpoint from the command output and run:
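A sample request against the OpenAI-compatible chat completions endpoint (the model name must match the one baked into your Dockerfile, and vllm-key is the demo token from Step 1):

```bash
curl -X POST YOUR_APP_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer vllm-key" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{"role": "user", "content": "Explain Mixture of Experts in two sentences."}],
    "max_tokens": 128
  }'
```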
Since vLLM is compatible with the OpenAI API, you can also query the other OpenAI-compatible endpoints the server exposes (see the vLLM documentation for the full list).
You can also use the OpenAI Python SDK to query your deployment as shown below:
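A minimal sketch using the OpenAI Python SDK, assuming your endpoint URL and the demo vllm-key token:

```python
from openai import OpenAI

# Point the SDK at your Tensorfuse endpoint instead of api.openai.com
client = OpenAI(base_url="YOUR_APP_URL/v1", api_key="vllm-key")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Give me a one-line summary of Llama 4 Scout."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```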
Remember to configure a TLS endpoint with a custom domain before going to production for security and compatibility with modern clients.
Multimodal Capabilities
Llama 4 shines with its early fusion multimodal architecture, allowing it to process text and images simultaneously. Here’s how to use multimodal capabilities with your deployment:
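For example, here is a sketch of a mixed text-and-image request via the OpenAI-compatible API (the image URL is a placeholder; vLLM fetches it server-side before running inference):

```python
from openai import OpenAI

client = OpenAI(base_url="YOUR_APP_URL/v1", api_key="vllm-key")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```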
Llama 4 models work best with up to 8-10 images per request. For optimal performance, keep image sizes under 2048x2048 pixels. The models can interpret charts, diagrams, screenshots, photos, and even complex visual information including mathematical equations and code screenshots.
Context Length Capabilities
Llama 4 models offer impressive context length capabilities across different hardware configurations:
| Model | GPU Configuration | Context Length | Tokens/sec (batch=32) |
|---|---|---|---|
| Scout | 8x H100 | Up to 1M tokens | ~180 |
| Scout | 8x H200 | Up to 3.6M tokens | ~260 |
| Scout | Multi-node setup | Up to 10M tokens | Varies by setup |
| Maverick | 8x H100 | Up to 430K tokens | ~150 |
| Maverick | 8x H200 | Up to 1M tokens | ~210 |
These massive context windows enable entirely new use cases:
- Document Analysis: Process and reason across hundreds of pages of legal documents, technical manuals, or research papers in a single request
- Code Repository Understanding: Analyze entire codebases to debug complex issues or generate comprehensive documentation
- Long-Form Writing: Generate or edit lengthy content like novels, technical reports, or academic papers
- Multi-Document Synthesis: Summarize and synthesize information across multiple documents, such as research papers or business reports
To reach Scout’s maximum 10M context window, you’ll need to use distributed inference across multiple nodes with tensor parallelism or pipeline parallelism. Join our Slack community to learn more about this feature.
Advanced Configurations and Optimization Tips
Performance Optimization
- FP8 KV Cache: Add --kv-cache-dtype fp8 to potentially double the usable context window and gain a performance boost with minimal accuracy impact:
  - Before optimization: ~90 tokens/sec on 8x H100
  - After optimization: ~180 tokens/sec on 8x H100
- Long Context Accuracy: For contexts longer than 32K tokens, include --override-generation-config='{"attn_temperature_tuning": true}' to improve accuracy.
- Continuous Batching: vLLM already implements continuous batching by default, maximizing throughput for multiple concurrent users.
- Quantization: For Maverick, use the FP8 model variant, which provides excellent performance with minimal accuracy drop.
Hardware Compatibility
- A100 GPUs: The BF16 versions of both models work well on A100 GPUs but with reduced context lengths:
| Model | A100 Configuration | Practical Context Length |
|---|---|---|
| Scout | 8x A100 (80GB) | Up to 160K tokens |
| Maverick | 8x A100 (80GB) | Up to 90K tokens |
- INT4 Quantization: For Scout, an INT4 quantization that allows the model to fit on a single H100 GPU is in development.
- AMD MI300X: You can run Llama 4 on AMD MI300X GPUs by building vLLM from source, with nearly identical accuracy.
Key Architectural Innovations
Llama 4's Mixture-of-Experts Architecture
Llama 4’s architecture enables efficient long-context inference through several innovations:
Mixture of Experts (MoE): Instead of activating all parameters for every token, Llama 4 models use a “router” to select which expert(s) should process each token:
- Scout has 16 experts (109B total parameters)
- Maverick has 128 experts (400B total parameters)
- Only 1-2 experts are activated per token (17B active parameters)
- This approach dramatically reduces computational costs while maintaining quality
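To make the routing idea concrete, here is a toy sketch of a top-k MoE forward pass in Python. It is illustrative only, not Meta’s implementation: real routers are learned, run batched on GPU, and balance load across experts.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def moe_layer(x, router_w, experts, shared_expert, top_k=1):
    """Toy MoE forward pass: every token goes through a shared expert,
    plus only its top-k routed experts, so most expert weights stay idle."""
    scores = x @ router_w                           # (n_tokens, n_experts) router logits
    chosen = np.argsort(scores, axis=-1)[:, -top_k:]
    out = shared_expert(x)                          # shared expert sees all tokens
    for t in range(x.shape[0]):                     # routed experts see only their tokens
        for e in chosen[t]:
            out[t] += sigmoid(scores[t, e]) * experts[e](x[t])
    return out

# Tiny usage example: 4 "experts" that are just random linear maps
rng = np.random.default_rng(0)
d, n_experts, n_tokens = 8, 4, 3
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(n_experts)]
shared_expert = lambda X, W=rng.normal(size=(d, d)): X @ W
tokens = rng.normal(size=(n_tokens, d))
router_w = rng.normal(size=(d, n_experts))
print(moe_layer(tokens, router_w, experts, shared_expert).shape)  # (3, 8)
```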
Interleaved RoPE (iRoPE): Llama 4 alternates between global attention (without RoPE) and chunked local attention (with RoPE) in a 1:3 ratio:
- Global layers capture document-level patterns
- Local layers process detailed information within chunks
- This combination significantly reduces the quadratic complexity of attention
Early Fusion Multimodality: Rather than using separate encoders, Llama 4 integrates text and vision tokens directly into its core architecture:
- Images are processed through a vision encoder and then projected into the same embedding space as text tokens
- The model can attend to both text and image tokens simultaneously
- This enables deeper multimodal reasoning than late-fusion approaches
Llama 4 was trained on more than 30 trillion tokens across text, image, and video datasets. This is more than double the Llama 3 pre-training mixture and includes data from over 200 languages, with 100+ languages having substantial representation (1B+ tokens each).
For optimal storage and faster downloads, Llama 4 models on Hugging Face use the Xet storage backend, achieving ~25% deduplication for the main models and ~40% for derivative models, saving time and bandwidth.
Conclusion
With this guide, you’ve successfully deployed Llama 4 models on serverless GPUs using Tensorfuse. These models represent the cutting edge of open-source AI, offering capabilities that rival or exceed proprietary alternatives at a fraction of the cost.
Whether you’re building a sophisticated chatbot, a multimodal content creation tool, or an enterprise knowledge system, Llama 4 provides the foundation for building AI applications that were previously only possible with proprietary models.
Click here to get started with Tensorfuse.
You can also explore the Tensorfuse examples repository for more deployment configurations and use cases.