FLUX.1 [dev] is a 12 billion parameter rectified flow transformer that generates images from text. In this guide, we will show you how to deploy the FLUX.1-dev model on your cloud account using Tensorfuse. We will use a single L40S GPU for this deployment.

We will use the NVIDIA Triton Inference Server to serve the model. We will also add token-based authentication to our service and store the authentication token (FLUX_API_KEY) as a Tensorfuse secret.

Prerequisites

Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven’t done that yet, follow the Getting Started guide.

Deploying FLUX.1-dev with Tensorfuse

Each Tensorkube deployment requires:

  1. Your environment (as a Dockerfile).
  2. Your code (in this example, the models directory).
  3. A deployment configuration (deployment.yaml).
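
Putting those together, the project layout used in this guide looks like the following:

.
├── Dockerfile
├── deployment.yaml
└── models
    └── flux
        ├── config.pbtxt
        └── 1
            └── model.py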

Step 1: Prepare the Dockerfile

We will use the official NVIDIA Triton Inference Server image as our base image. This image comes with all the necessary dependencies to run the model. The image tag can be found in the NVIDIA container catalog.

In addition to the base image, we will install a couple of Python packages, set an environment variable, and copy the models directory into the Docker image.

Dockerfile
# Use NVIDIA Triton Inference Server as base image
FROM nvcr.io/nvidia/tritonserver:25.01-pyt-python-py3

RUN pip install --no-cache-dir \
    torch \
    diffusers \
    transformers \
    accelerate \
    safetensors \
    Pillow \
    hf_transfer \
    protobuf \
    bitsandbytes \
    sentencepiece \
    numpy


RUN mkdir -p /models/flux/1

COPY models/flux/1/model.py /models/flux/1
COPY models/flux/config.pbtxt /models/flux/config.pbtxt


# Set environment variables
ENV HF_HUB_ENABLE_HF_TRANSFER=1

# Expose Triton gRPC and HTTP ports
EXPOSE 8000
EXPOSE 8001
EXPOSE 8002

# Start Triton Server
CMD ["tritonserver", "--model-repository=/models", "--allow-gpu-metrics=false", "--allow-metrics=false", "--metrics-port=0", "--http-restricted-api=inference:API_KEY=r2JmQNuD" ] 

We’ve configured the Triton server with a couple of CLI flags tailored to our use case: metrics are disabled, and an authentication key is required for inference requests. For more details on authentication, refer to the Triton docs. If you have questions about selecting flags for production, reach out to the Tensorfuse Community.
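
As a quick sanity check, here is a sketch of what an authenticated request looks like once the flux model (defined in the next step) is being served; it assumes the container is reachable on localhost port 8000. The API_KEY header must match the value passed to --http-restricted-api, while unrestricted endpoints such as the health API need no header.

# Hypothetical local test: inference calls must carry the API_KEY header set in the CMD above
curl -s http://localhost:8000/v2/models/flux/versions/1/infer \
  -H "Content-Type: application/json" \
  -H "API_KEY: r2JmQNuD" \
  -d '{"inputs": [{"name": "PROMPT", "shape": [1], "datatype": "BYTES", "data": ["a test prompt"]}]}'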

Step 2: Prepare the models directory

We will use the Triton Python backend to serve the model. We will create a models directory and add the model.py and config.pbtxt files to it. For more details about the Triton Python backend, refer to the Triton docs.

mkdir -p models/flux/1
models/flux/1/model.py
import triton_python_backend_utils as pb_utils
import numpy as np
import torch
from diffusers import FluxPipeline
from io import BytesIO

class TritonPythonModel:
    def initialize(self, args):
        """Load the Stable Diffusion model"""
        self.logger = pb_utils.Logger
        self.model_id = "black-forest-labs/FLUX.1-dev"

        try:
            # Load pipeline in bfloat16 and move it to the GPU
            self.pipeline = FluxPipeline.from_pretrained(
                self.model_id,
                torch_dtype=torch.bfloat16,
            ).to("cuda")
            self.logger.log_info("Successfully loaded FLUX.1-dev model")

        except Exception as e:
            self.logger.log_error(f"Error initializing model: {str(e)}")
            raise

    def execute(self, requests):
        """Process requests and generate images"""
        responses = []

        for request in requests:
            try:
                # Get input prompt
                prompt = pb_utils.get_input_tensor_by_name(request, "PROMPT")
                prompt_str = prompt.as_numpy()[0].decode()

                
                # Generate image
                image = self.pipeline(
                     prompt=prompt_str,
                     num_inference_steps=25,
                     guidance_scale=7.5,
                     height=512,
                     width=512
                ).images[0]

                # Convert image to byte array
                img_byte_arr = BytesIO()
                image.save(img_byte_arr, format="PNG")
                img_np = np.frombuffer(img_byte_arr.getvalue(), dtype=np.uint8)

                # Create output tensor
                output_tensor = pb_utils.Tensor(
                    "GENERATED_IMAGE",
                    img_np
                )

                responses.append(pb_utils.InferenceResponse([output_tensor]))
                self.logger.log_info("Successfully generated image")

            except Exception as e:
                self.logger.log_error(f"Error processing request: {str(e)}")
                responses.append(pb_utils.InferenceResponse(error=str(e)))

        return responses

    def finalize(self):
        """Cleanup resources"""
        self.pipeline = None
        torch.cuda.empty_cache()
models/flux/config.pbtxt
name: "flux"
backend: "python"
max_batch_size: 0

input [
  {
    name: "PROMPT"
    data_type: TYPE_STRING
    dims: [1]
  }
]

output [
  {
    name: "GENERATED_IMAGE"
    data_type: TYPE_UINT8
    dims: [-1]
  }
]

Step 3: Create Secrets

We will create a secret to store the authentication token. We will use this token to authenticate the inference requests.

tensorkube secret create flux-secret FLUX_API_KEY=r2JmQNuD # this token should be the same as the one used in the Dockerfile

We also need to create a Hugging Face secret to download the model from the Hugging Face Hub.

tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=your_token

Step 4: Deployment config

Although you can deploy Tensorfuse apps from the command line, we recommend keeping a config file so that you can follow a GitOps approach to deployment.

deployment.yaml
# deployment.yaml for FLUX.1-dev
gpus: 1 # Number of GPUs
gpu_type: l40s # GPU Type
port: 8000 # Port to expose the service
min_scale: 0
max_scale: 1
secret:
  - hugging-face-secret
  - flux-secret
readiness:
  httpGet:
    path: /v2/health/ready # readiness endpoint for triton server
    port: 8000
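
The readiness probe points at Triton's standard health endpoint; since only the inference API is restricted in our CMD, this endpoint needs no API_KEY header. If you want to probe it by hand (a sketch, assuming you substitute your own host for localhost), a plain GET is enough:

# Returns HTTP 200 once Triton has loaded the model repository
curl -i http://localhost:8000/v2/health/ready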

Now you can deploy your service using the following command:

tensorkube deploy --config deployment.yaml

Step 5: Accessing the deployed app

Voila! Your autoscaling, production-ready text-to-image service using FLUX.1-dev is ready.

Once the deployment is successful, you can see the status of your app by running:

tensorkube deployment list

And that’s it! You have successfully deployed the FLUX.1-dev model.

Remember to configure a TLS endpoint with a custom domain before going to production.

To test it out, we have a sample client.py Python file. Add your deployment URL (DEPLOYMENT_URL) in the code and set FLUX_API_KEY as an environment variable before running the client.py file.

client.py
import requests
import json
from io import BytesIO
from PIL import Image
import numpy as np
import os
deployment_url = "<DEPLOYMENT_URL>" # replace with your deployment url, remove trailing slash
api_key = os.getenv("FLUX_API_KEY")
inference_endpoint = f"{deployment_url}/v2/models/flux/versions/1/infer"

request_data = {
    "inputs": [
      {
        "name": "PROMPT",
        "shape": [1],
        "datatype": "BYTES",
        "data": ["Generate a golden retriever with a sunset background"]
      }
    ]
}

headers = {"Content-Type": "application/json", "API_KEY": api_key}

# Send POST request
response = requests.post(inference_endpoint, headers=headers, json=request_data)
if response.status_code != 200:
    print(f"Failed to send request to {inference_endpoint}")
    print(f"Response: {response.text}")
    exit()
response_data = response.json()
image_data = response_data["outputs"][0]["data"]
img_np = np.array(image_data, dtype=np.uint8)
byte_data = img_np.tobytes()
# Wrap the bytes in a BytesIO stream
byte_io = BytesIO(byte_data)

# Save the generated image
generated_image = Image.open(byte_io)
generated_image.save("generated_image.png")

Don't forget to install the required Python packages before running the client.py file.

requirements.txt
pillow
numpy
requests
pip install -r requirements.txt
python client.py

Once you run the client.py file, you will see a generated_image.png file in your directory. That's it, you have successfully generated an image using the FLUX.1-dev model.

To get started with Tensorfuse, click here.

You can also directly use the Tensorfuse GitHub repository for more details and updates on these Dockerfiles.