CSM (Conversational Speech Model) by Sesame is a speech generation model that produces RVQ audio codes from text and audio inputs. It uses a Llama backbone together with a compact audio decoder that outputs Mimi audio codes. Follow this guide to deploy the Sesame CSM-1B model on your cloud account using Tensorfuse. We will use a single A10G GPU for this model.

We will use the NVIDIA Triton Inference Server to serve the model.

Prerequisites

Before you begin, ensure you have configured Tensorfuse on your AWS account. If you haven’t done that yet, follow the Getting Started guide.

You will also need access to the gated Sesame CSM-1B and Llama-3.2-1B models on Hugging Face.

Deploying Sesame CSM 1B with Tensorfuse

Each Tensorkube deployment requires:

  1. Your environment (as a Dockerfile).
  2. Your code (in this example, the model_repository directory).
  3. A deployment configuration (config.yaml); see the layout sketch below.
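
By the end of this guide, your project directory should look roughly like the sketch below. Note that streamlit_app.py and requirements.txt are only needed for the test client in the last section, and everything outside the model_repository structure is our own naming convention rather than something Tensorfuse enforces.

.
├── Dockerfile
├── config.yaml
├── model_repository
│   └── csm_1b
│       ├── config.pbtxt
│       └── 1
│           └── model.py
├── streamlit_app.py
└── requirements.txt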

Step 1: Prepare the Dockerfile

We will use the official NVIDIA Triton Inference Server image as our base image. It ships with all the dependencies needed to run the model. The image tag can be found in the NVIDIA NGC container catalog.

We clone the CSM GitHub repository to make deploying the model easier. We will also need the numpy and hf_transfer Python packages.

Dockerfile
# Use the NVIDIA Triton Inference Server base image
FROM nvcr.io/nvidia/tritonserver:25.02-pyt-python-py3

# Clone the model repository from GitHub
RUN mkdir -p /model_repository/csm_1b/1
RUN git clone https://github.com/SesameAILabs/csm.git /model_repository/csm_1b/1

# Install Python dependencies
RUN pip install --no-cache-dir --ignore-installed -r /model_repository/csm_1b/1/requirements.txt \
    && pip install --no-cache-dir --ignore-installed numpy hf_transfer

# Copy the code files
COPY model_repository/csm_1b/1/model.py /model_repository/csm_1b/1
COPY model_repository/csm_1b/config.pbtxt /model_repository/csm_1b/config.pbtxt

# Set environment variables
ENV HF_HUB_ENABLE_HF_TRANSFER=1

# Expose Triton gRPC and HTTP ports
EXPOSE 8000
EXPOSE 8001
EXPOSE 8002

# Start Triton Server
CMD ["tritonserver", "--model-repository=/model_repository", "--allow-gpu-metrics=false", "--allow-metrics=false", "--metrics-port=0" ]

We’ve configured the Triton server with a couple of CLI flags tailored to our use case; in particular, metrics collection for inference requests is disabled. For details on other server options, such as authentication, refer to the Triton docs. If you have questions about selecting flags for production, reach out to the Tensorfuse Community.
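
If you want to sanity-check the image before deploying, you can optionally build and run it locally with Docker. This is only a local debugging aid and is not required for the Tensorfuse deployment; the csm-triton tag and the token placeholder below are our own choices, and running the container needs a local NVIDIA GPU with the NVIDIA Container Toolkit installed.

docker build -t csm-triton .
docker run --rm --gpus all -p 8000:8000 -e HUGGING_FACE_HUB_TOKEN=<your_hf_token> csm-triton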

Step 2: Prepare the model_repository directory

We will use the Triton Python backend to serve the model. We will create a model_repository directory and add the model.py and config.pbtxt files to it. For more details about the Triton Python backend, refer to the Triton docs.

mkdir -p model_repository/csm_1b/1
model_repository/csm_1b/1/model.py
import json
import os
import triton_python_backend_utils as pb_utils
import torch
import numpy as np
import sys
import base64
import torchaudio

class TritonPythonModel:
    def initialize(self, args):
        self.logger = pb_utils.Logger

        self.model_dir = args['model_repository']
        self.model_id = "sesame/csm-1b"

        
        # Import from the CSM repo cloned alongside this file in the model directory
        from generator import load_csm_1b
        from huggingface_hub._login import _login
        _login(token=os.environ.get("HUGGING_FACE_HUB_TOKEN"), add_to_git_credential=True)

        
        # Load the model
        try:
            self.generator = load_csm_1b(device="cuda" if torch.cuda.is_available() else "cpu")
            self.logger.log_info("Successfully loaded Sesame 1B")
        except Exception as e:
            self.logger.log_error(f"Error initializing model: {str(e)}")
            raise        
        print("CSM-1B model loaded successfully", file=sys.stderr)


    def execute(self, requests):
        responses = []

        from generator import Segment
        
        for request in requests:
            # Extract inputs
            text_tensor = pb_utils.get_input_tensor_by_name(request, "text")
            voice_tensor = pb_utils.get_input_tensor_by_name(request, "speaker")
            context_tensor = pb_utils.get_input_tensor_by_name(request, "context")
            max_size_tensor = pb_utils.get_input_tensor_by_name(request, "max_audio_length_ms")

            text_data = text_tensor.as_numpy()      # shape: (1,)
            voice_data = voice_tensor.as_numpy()    # shape: (1,)
            context_data = context_tensor.as_numpy() # shape: (N,)
            max_size_data = max_size_tensor.as_numpy() # shape: (1,)

            text_value = text_data[0]
            if isinstance(text_value, bytes):
                text_value = text_value.decode("utf-8")

            voice_value = int(voice_data[0])

            context_list = []
            for c in context_data:
                context_item = json.loads(c.decode("utf-8") if isinstance(c, bytes) else str(c))
                audio_tensor = context_item['audio']
                sample_rate = context_item['original_sample_rate']
                
                
                audio_bytes = base64.b64decode(audio_tensor)
                audio_array = np.frombuffer(audio_bytes, dtype=np.float32)
                audio = torch.from_numpy(audio_array)
                audio = torchaudio.functional.resample(audio.squeeze(0), orig_freq=sample_rate, new_freq=self.generator.sample_rate)
                
                segment = Segment(
                    text=context_item['text'],
                    speaker=context_item['speaker'],
                    audio=audio
                )
                context_list.append(segment)

            max_size_value = int(max_size_data[0])

            
            # Generate audio
            audio = self.generator.generate(
                text=text_value,
                speaker=voice_value,
                context=context_list,
                max_audio_length_ms=max_size_value
            )
            
            # Create output tensors
            audio_tensor = pb_utils.Tensor("audio", audio.cpu().numpy())
            sample_rate_tensor = pb_utils.Tensor("sample_rate", 
                                            np.array([self.generator.sample_rate], dtype=np.int32))
            
            # Create and append response
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[audio_tensor, sample_rate_tensor]
            )
            responses.append(inference_response)
        
        return responses

    def finalize(self):
        print("Unloading CSM-1B model", file=sys.stderr)   
        self.generator = None
        torch.cuda.empty_cache() 
model_repository/csm_1b/config.pbtxt
name: "csm_1b"
backend: "python"
max_batch_size: 0
input [
  {
    name: "text"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "speaker"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "context"
    data_type: TYPE_STRING
    dims: [ -1 ]
    optional: true
  },
  {
    name: "max_audio_length_ms"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  }
]
output [
  {
    name: "audio"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "sample_rate"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
dynamic_batching {
  max_queue_delay_microseconds: 100
}
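
Once the server is up (after the deployment below, or when running the image locally), a request against this configuration can be sketched with Triton's KServe v2 HTTP API as follows. The URL, text, and speaker values are placeholders, and the optional context input is sent as an empty list here; the Streamlit client at the end of this guide shows how to populate it with reference audio.

curl -X POST "<DEPLOYMENT_URL>/v2/models/csm_1b/infer" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {"name": "text", "datatype": "BYTES", "shape": [1], "data": ["Hello from Sesame."]},
      {"name": "speaker", "datatype": "INT32", "shape": [1], "data": [0]},
      {"name": "context", "datatype": "BYTES", "shape": [0], "data": []},
      {"name": "max_audio_length_ms", "datatype": "INT32", "shape": [1], "data": [10000]}
    ]
  }'

The response JSON contains the generated audio as a flat array of FP32 samples ("audio") and the sample rate ("sample_rate"), which is exactly what the Streamlit client further below decodes.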

Step 3: Create Secrets

We need to create a Hugging Face secret so that the model can be downloaded from the Hugging Face Hub. The HUGGING_FACE_HUB_TOKEN key matches the environment variable that model.py reads at startup.

tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=your_token

Step 4: Deployment config

Although you can deploy Tensorfuse apps from the command line, it is recommended to keep a config file so that you can follow a GitOps approach to deployment.

config.yaml
# config.yaml for sesame-csm-1b
gpus: 1 # Number of GPUs
gpu_type: a10g # GPU Type
port: 8000 # Port to expose the service
min_scale: 0
max_scale: 3
secret:
  - hugging-face-secret
readiness:
  httpGet:
    path: /v2/health/ready # readiness endpoint for triton server
    port: 8000

Now you can deploy your service using the following command:

tensorkube deploy --config config.yaml

Step 5: Accessing the deployed app

Voila! Your autoscaling, production-ready text-to-speech service using sesame-csm-1b is up and running.

Once the deployment is successful, you can see the status of your app by running:

tensorkube deployment list
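
You can also hit Triton's readiness endpoints directly to confirm that the server and the model are up. This is a quick sanity check against the same path configured as the readiness probe in config.yaml; replace the placeholder with your deployment URL.

# Server-level readiness (same path as the readiness probe in config.yaml)
curl <DEPLOYMENT_URL>/v2/health/ready
# Model-level readiness for csm_1b
curl <DEPLOYMENT_URL>/v2/models/csm_1b/ready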

And that’s it! You have successfully deployed the sesame-csm-1b model.

Remember to configure a TLS endpoint with a custom domain before going to production.

Testing the model

To test it out, we have a sample streamlit_app.py file. Replace the <DEPLOYMENT_LINK> placeholder in the code with your deployment URL before launching the app with streamlit run streamlit_app.py.

streamlit_app.py
import streamlit as st
import requests
import json
import numpy as np
import base64
import soundfile as sf
import torchaudio
import torch
import os
from typing import List, Dict, Any
import tempfile
import time

st.set_page_config(page_title="Sesame TTS Interface", layout="wide")

def encode_audio(audio_tensor: torch.Tensor) -> str:
    audio_list = audio_tensor.tolist()
    audio_bytes = torch.tensor(audio_list).numpy().tobytes()
    return base64.b64encode(audio_bytes).decode('utf-8')

def prepare_context(speakers: List[int], transcripts: List[str], audio_files):
    context = []
    for transcript, speaker, audio_file in zip(transcripts, speakers, audio_files):
        # Create a temporary file to save the uploaded file
        with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as tmp_file:
            tmp_file.write(audio_file.getvalue())
            tmp_path = tmp_file.name
        
        audio_tensor, sample_rate = torchaudio.load(tmp_path)
        audio_base64 = encode_audio(audio_tensor)
        
        # Clean up the temporary file
        os.unlink(tmp_path)

        context.append({"text": transcript, "speaker": speaker, "audio": audio_base64, "original_sample_rate": sample_rate})
    
    return context

def prepare_payload(text: str, speaker: int, context: List[Dict[str, Any]], max_audio_length_ms: int) -> Dict[str, Any]:
    return {
        "inputs": [
            {"name": "text", "datatype": "BYTES", "shape": [1], "data": [text]},
            {"name": "speaker", "datatype": "INT32", "shape": [1], "data": [speaker]},
            {"name": "context", "datatype": "BYTES", "shape": [len(context)], "data": [json.dumps(item) for item in context]},
            {"name": "max_audio_length_ms", "datatype": "INT32", "shape": [1], "data": [max_audio_length_ms]}
        ]
    }

def send_request(url: str, payload: Dict[str, Any]) -> requests.Response:
    return requests.post(url, headers={"Content-Type": "application/json"}, json=payload)

# App title and description
st.title("Sesame TTS Model Interface")
st.markdown("Upload audio samples, provide transcripts, and generate speech using the Sesame TTS model.")

# Triton server URL input
triton_url = st.text_input("Triton Server URL", value="<DEPLOYMENT_LINK>/v2/models/csm_1b/infer")

# Context inputs section
st.header("Context Inputs")

# Dynamic context inputs
num_contexts = st.number_input("Number of context examples", min_value=0, max_value=10, value=2)

speakers = []
transcripts = []
audio_files = []

for i in range(num_contexts):
    col1, col2, col3 = st.columns([1, 2, 2])
    
    with col1:
        speaker = st.number_input(f"Speaker ID #{i+1}", min_value=0, value=i % 2)
        speakers.append(speaker)
    
    with col2:
        transcript = st.text_area(f"Transcript #{i+1}", value=f"Example text for speaker {speaker}.", height=100)
        transcripts.append(transcript)
    
    with col3:
        audio_file = st.file_uploader(f"Audio file #{i+1}", type=["wav"])
        if audio_file:
            audio_files.append(audio_file)
            st.audio(audio_file)

# Generation parameters
st.header("Generation Parameters")

col1, col2 = st.columns(2)
with col1:
    gen_text = st.text_area("Text to generate", value="This is the text I want to convert to speech.", height=150)
with col2:
    gen_speaker = st.number_input("Speaker ID for generation", min_value=0, value=0)
    max_audio_length = st.number_input("Max audio length (ms)", min_value=1000, value=5000, step=1000)

# Generate button
generate_button = st.button("Generate Speech")

# Results section
st.header("Results")
result_placeholder = st.empty()
audio_placeholder = st.empty()

if generate_button:
    # Check if all audio files are uploaded
    if len(audio_files) != num_contexts:
        st.error(f"Please upload all {num_contexts} audio files.")
    else:
        with st.spinner("Preparing context and generating speech..."):
            # Prepare context
            context = prepare_context(speakers, transcripts, audio_files)
            
            # Prepare payload
            payload = prepare_payload(gen_text, gen_speaker, context, max_audio_length)
            
            # Send request
            try:
                start_time = time.time()
                response = send_request(triton_url, payload)
                inference_time = time.time() - start_time
                
                # Process response
                if response.status_code == 200:
                    result = response.json()
                    result_placeholder.success(f"Inference successful! (Time: {inference_time:.2f}s)")
                    
                    audio = np.array(result['outputs'][0]['data'], dtype=np.float32)
                    sample_rate = result['outputs'][1]['data'][0]
                    
                    # Create a temporary file for the audio
                    with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as tmp_file:
                        sf.write(tmp_file.name, audio, sample_rate)
                        
                        # Display audio player
                        with open(tmp_file.name, "rb") as f:
                            audio_bytes = f.read()
                            audio_placeholder.audio(audio_bytes, format="audio/wav")
                        
                        # Offer download button
                        st.download_button(
                            label="Download generated audio",
                            data=audio_bytes,
                            file_name="generated_speech.wav",
                            mime="audio/wav"
                        )
                        
                        # Clean up
                        os.unlink(tmp_file.name)
                        
                    # Display audio details
                    st.text(f"Sample rate: {sample_rate} Hz")
                    st.text(f"Audio length: {len(audio)/sample_rate:.2f} seconds ({len(audio)} samples)")
                else:
                    result_placeholder.error(f"Request failed with status code {response.status_code}")
                    st.code(response.text)
            except Exception as e:
                result_placeholder.error(f"Error: {str(e)}")

# Add some helpful information at the bottom
st.markdown("---")
st.markdown("""
### Tips:
- Make sure the Triton Server URL is correct and accessible
- Upload WAV files for best compatibility
- Speaker IDs should match between context examples and generation if you want to maintain the same voice
""")

Don't forget to install the required Python packages before running the streamlit_app.py file:

requirements.txt
streamlit
requests
numpy
soundfile
torchaudio
torch

pip install -r requirements.txt
streamlit run streamlit_app.py

Once you run the streamlit app, you will be able to use it to interface with your deployment.

To get started with Tensorfuse, click here.

You can also refer to the Tensorfuse GitHub repository for more details and updates on these Dockerfiles.