Deploy a serverless Sesame CSM 1B model on your AWS account
The CSM (Conversational Speech Model) by Sesame is a speech generation model that generates RVQ audio codes from text and audio inputs. It uses a Llama backbone for its architecture, along with a compact audio decoder that outputs Mimi audio codes.
Follow this guide to deploy the Sesame-CSM-1B model on your cloud account using Tensorfuse. We will be using 1 A10G GPU for this model.
We will use NVIDIA Triton Inference Server to serve the model.
We will use the official NVIDIA Triton server image as our base image. This image comes with all the necessary dependencies to run the model. The image tag can be found in the NVIDIA container catalog.
We clone the CSM GitHub repository to make deploying the model easier. We will also need the hf_transfer and numpy Python packages.
Dockerfile
# Use the NVIDIA Triton Inference Server base image
FROM nvcr.io/nvidia/tritonserver:25.02-pyt-python-py3

# Clone the model repository from GitHub
RUN mkdir -p /model_repository/csm_1b/1
RUN git clone https://github.com/SesameAILabs/csm.git /model_repository/csm_1b/1

# Install Python dependencies
RUN pip install --no-cache-dir --ignore-installed -r /model_repository/csm_1b/1/requirements.txt \
    && pip install --no-cache-dir --ignore-installed numpy hf_transfer

# Copy the code files
COPY model_repository/csm_1b/1/model.py /model_repository/csm_1b/1
COPY model_repository/csm_1b/config.pbtxt /model_repository/csm_1b/config.pbtxt

# Set environment variables
ENV HF_HUB_ENABLE_HF_TRANSFER=1

# Expose Triton gRPC and HTTP ports
EXPOSE 8000
EXPOSE 8001
EXPOSE 8002

# Start Triton Server
CMD ["tritonserver", "--model-repository=/model_repository", "--allow-gpu-metrics=false", "--allow-metrics=false", "--metrics-port=0"]
We’ve configured the Triton server with a couple of CLI flags tailored to our specific use case. We have disabled metrics for inference requests. For more details on server configuration, refer to the Triton docs.
If you have questions about selecting flags for production, reach out to the Tensorfuse Community.
We will use the Python backend for Triton to serve the model. We will create a model_repository directory and add the model.py and config.pbtxt files to it, as sketched below. For more details about the Triton Python backend, refer to the Triton docs.
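model.py implements Triton's Python backend interface: an initialize method that loads the model once, and an execute method that handles inference requests. The sketch below is illustrative only and is not the exact model.py used in this guide; the load_csm_1b import from the cloned CSM repository, the tensor names, and the omitted context handling are assumptions.
model.py (sketch)
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the CSM generator once when Triton loads this model instance.
        # Assumption: load_csm_1b() comes from the cloned csm repository (generator.py).
        from generator import load_csm_1b
        self.generator = load_csm_1b(device="cuda")

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the request tensors defined in config.pbtxt.
            text = pb_utils.get_input_tensor_by_name(request, "text").as_numpy()[0].decode("utf-8")
            speaker = int(pb_utils.get_input_tensor_by_name(request, "speaker").as_numpy()[0])
            max_ms = int(pb_utils.get_input_tensor_by_name(request, "max_audio_length_ms").as_numpy()[0])

            # Generate speech (context handling omitted for brevity).
            audio = self.generator.generate(text=text, speaker=speaker, context=[], max_audio_length_ms=max_ms)

            # Return the waveform and its sample rate (tensor names assumed to match config.pbtxt).
            out_audio = pb_utils.Tensor("audio", audio.cpu().numpy().astype(np.float32))
            out_sr = pb_utils.Tensor("sample_rate", np.array([self.generator.sample_rate], dtype=np.int32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_audio, out_sr]))
        return responses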
Although you can deploy Tensorfuse apps using the command line, it is recommended to use a config file so that you can follow a GitOps approach to deployment.
config.yaml
# config.yaml for Sesame CSM 1B
gpus: 1 # Number of GPUs
gpu_type: a10g # GPU Type
port: 8000 # Port to expose the service
min_scale: 0
max_scale: 3
secret:
  - hugging-face-secret
readiness:
  httpGet:
    path: /v2/health/ready # readiness endpoint for triton server
    port: 8000
Now you can deploy your service using the following command:
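This assumes the standard tensorkube CLI and the config.yaml shown above; adjust the path and flags for your setup.
tensorkube deploy --config-file ./config.yaml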
To test it out, we have a sample streamlit_app.py Python file. Add your deployment URL in place of <DEPLOYMENT_LINK> in the code before running the file with the command streamlit run streamlit_app.py.
streamlit_app.py
import streamlit as st
import requests
import json
import numpy as np
import base64
import soundfile as sf
import torchaudio
import torch
import os
from typing import List, Dict, Any
import tempfile
import time

st.set_page_config(page_title="Sesame TTS Interface", layout="wide")

def encode_audio(audio_tensor: torch.Tensor) -> str:
    audio_list = audio_tensor.tolist()
    audio_bytes = torch.tensor(audio_list).numpy().tobytes()
    return base64.b64encode(audio_bytes).decode('utf-8')

def prepare_context(speakers: List[int], transcripts: List[str], audio_files):
    context = []
    for transcript, speaker, audio_file in zip(transcripts, speakers, audio_files):
        # Create a temporary file to save the uploaded file
        with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as tmp_file:
            tmp_file.write(audio_file.getvalue())
            tmp_path = tmp_file.name
        audio_tensor, sample_rate = torchaudio.load(tmp_path)
        audio_base64 = encode_audio(audio_tensor)
        # Clean up the temporary file
        os.unlink(tmp_path)
        context.append({
            "text": transcript,
            "speaker": speaker,
            "audio": audio_base64,
            "original_sample_rate": sample_rate
        })
    return context

def prepare_payload(text: str, speaker: int, context: List[Dict[str, Any]], max_audio_length_ms: int) -> Dict[str, Any]:
    return {
        "inputs": [
            {"name": "text", "datatype": "BYTES", "shape": [1], "data": [text]},
            {"name": "speaker", "datatype": "INT32", "shape": [1], "data": [speaker]},
            {"name": "context", "datatype": "BYTES", "shape": [len(context)], "data": [json.dumps(item) for item in context]},
            {"name": "max_audio_length_ms", "datatype": "INT32", "shape": [1], "data": [max_audio_length_ms]}
        ]
    }

def send_request(url: str, payload: Dict[str, Any]) -> requests.Response:
    return requests.post(url, headers={"Content-Type": "application/json"}, json=payload)

# App title and description
st.title("Sesame TTS Model Interface")
st.markdown("Upload audio samples, provide transcripts, and generate speech using the Sesame TTS model.")

# Triton server URL input
triton_url = st.text_input("Triton Server URL", value="<DEPLOYMENT_LINK>/v2/models/csm_1b/infer")

# Context inputs section
st.header("Context Inputs")

# Dynamic context inputs
num_contexts = st.number_input("Number of context examples", min_value=0, max_value=10, value=2)

speakers = []
transcripts = []
audio_files = []

for i in range(num_contexts):
    col1, col2, col3 = st.columns([1, 2, 2])
    with col1:
        speaker = st.number_input(f"Speaker ID #{i+1}", min_value=0, value=i % 2)
        speakers.append(speaker)
    with col2:
        transcript = st.text_area(f"Transcript #{i+1}", value=f"Example text for speaker {speaker}.", height=100)
        transcripts.append(transcript)
    with col3:
        audio_file = st.file_uploader(f"Audio file #{i+1}", type=["wav"])
        if audio_file:
            audio_files.append(audio_file)
            st.audio(audio_file)

# Generation parameters
st.header("Generation Parameters")
col1, col2 = st.columns(2)
with col1:
    gen_text = st.text_area("Text to generate", value="This is the text I want to convert to speech.", height=150)
with col2:
    gen_speaker = st.number_input("Speaker ID for generation", min_value=0, value=0)
    max_audio_length = st.number_input("Max audio length (ms)", min_value=1000, value=5000, step=1000)

# Generate button
generate_button = st.button("Generate Speech")

# Results section
st.header("Results")
result_placeholder = st.empty()
audio_placeholder = st.empty()

if generate_button:
    # Check if all audio files are uploaded
    if len(audio_files) != num_contexts:
        st.error(f"Please upload all {num_contexts} audio files.")
    else:
        with st.spinner("Preparing context and generating speech..."):
            # Prepare context
            context = prepare_context(speakers, transcripts, audio_files)

            # Prepare payload
            payload = prepare_payload(gen_text, gen_speaker, context, max_audio_length)

            # Send request
            try:
                start_time = time.time()
                response = send_request(triton_url, payload)
                inference_time = time.time() - start_time

                # Process response
                if response.status_code == 200:
                    result = response.json()
                    result_placeholder.success(f"Inference successful! (Time: {inference_time:.2f}s)")

                    audio = np.array(result['outputs'][0]['data'], dtype=np.float32)
                    sample_rate = result['outputs'][1]['data'][0]

                    # Create a temporary file for the audio
                    with tempfile.NamedTemporaryFile(delete=False, suffix='.wav') as tmp_file:
                        sf.write(tmp_file.name, audio, sample_rate)

                    # Display audio player
                    with open(tmp_file.name, "rb") as f:
                        audio_bytes = f.read()
                    audio_placeholder.audio(audio_bytes, format="audio/wav")

                    # Offer download button
                    st.download_button(
                        label="Download generated audio",
                        data=audio_bytes,
                        file_name="generated_speech.wav",
                        mime="audio/wav"
                    )

                    # Clean up
                    os.unlink(tmp_file.name)

                    # Display audio details
                    st.text(f"Sample rate: {sample_rate} Hz")
                    st.text(f"Audio length: {len(audio)/sample_rate:.2f} seconds ({len(audio)} samples)")
                else:
                    result_placeholder.error(f"Request failed with status code {response.status_code}")
                    st.code(response.text)
            except Exception as e:
                result_placeholder.error(f"Error: {str(e)}")

# Add some helpful information at the bottom
st.markdown("---")
st.markdown("""
### Tips:
- Make sure the Triton Server URL is correct and accessible
- Upload WAV files for best compatibility
- Speaker IDs should match between context examples and generation if you want to maintain the same voice
""")
Don't forget to install the required Python packages before running the streamlit_app.py file:
requirements.txt
streamlit
numpy
soundfile
torchaudio
torch
pip install -r requirements.txt
streamlit run streamlit_app.py
Once you run the Streamlit app, you can use it to interact with your deployment.