Deploying OpenAI's gpt-oss Models with Tensorfuse
Before we deploy, here’s a quick snapshot of inference benchmark scores for GPT-OSS models:
| Model | GPU Configuration | Context Length | Tokens/sec |
|---|---|---|---|
| gpt-oss-20b | 1xH100 | 130k tokens | 240 |
| gpt-oss-120b | 8xH100 | 130k tokens | 200 |
Prerequisites
Before you begin, make sure you sign up on the Tensorfuse app and configure a Tensorkube cluster in your AWS account. With a Tensorkube cluster, you can deploy any custom or open-source model, and even host your own AI gateway that connects to hundreds of inference providers through a single unified API.
Each Tensorkube deployment requires:

- Your code (in this example, the vLLM API server from a Docker image)
- Your environment (as a Dockerfile)
- A deployment configuration (`deployment.yaml`)
Step 1: Set the Hugging Face token

Get a READ token from your Hugging Face profile and store it as a secret in Tensorfuse using the command below. Name the key `HUGGING_FACE_HUB_TOKEN`, as vLLM expects that exact variable.
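As a sketch, storing the token might look like the following. The exact subcommand syntax is an assumption; check `tensorkube secret create --help` for the form your CLI version expects. What matters is that the key is named `HUGGING_FACE_HUB_TOKEN`.

```shell
# Store your Hugging Face READ token as a Tensorfuse secret.
# NOTE: the subcommand shape below is an assumption; verify with
# `tensorkube secret create --help`. The key name must be
# HUGGING_FACE_HUB_TOKEN, since vLLM reads that environment variable.
tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=hf_your_token_here
```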
Step 2: Prepare the Dockerfiles
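For gpt-oss-20b, a minimal Dockerfile might look like the sketch below. The image tag and flag values are assumptions; pin exact versions for production builds.

```dockerfile
# Sketch: serve gpt-oss-20b with vLLM's OpenAI-compatible server.
# Image tag and flags are assumptions; check the vLLM docs and pin versions.
FROM vllm/vllm-openai:gptoss

# Serve on port 80 to match the default readiness check.
EXPOSE 80

ENTRYPOINT ["vllm", "serve", "openai/gpt-oss-20b", \
            "--host", "0.0.0.0", \
            "--port", "80"]
```

A gpt-oss-120b Dockerfile would swap in the `openai/gpt-oss-120b` model name and add a tensor-parallelism flag (e.g. `--tensor-parallel-size 8`) to spread the model across the 8xH100 configuration.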
Create separate Dockerfiles for the gpt-oss-20b and gpt-oss-120b models; the 120b variant differs mainly in the model name and GPU parallelism settings.

Step 3: Deployment Configuration
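As an illustration, a `deployment.yaml` for gpt-oss-20b might look like the sketch below. The field names are assumptions modeled on typical Tensorkube configs; verify them against the Tensorfuse configuration reference.

```yaml
# Sketch of a deployment.yaml for gpt-oss-20b.
# Field names are assumptions; check the Tensorkube config reference.
gpus: 1
gpu_type: h100
min_scale: 1
max_scale: 3
secret:
  - hugging-face-secret
readiness:
  httpGet:
    path: /health   # vLLM's OpenAI-compatible server exposes /health
    port: 80
```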
Create model-specific configuration files to optimize for each model's requirements. Don't forget the `readiness` endpoint in your config: Tensorfuse uses it to confirm your service is healthy before routing traffic to it. If left unspecified, Tensorfuse defaults to checking `/readiness` on port 80.

Step 4: Deploy your models
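A deployment command might look like the following; the flag name is an assumption, so confirm it with `tensorkube deploy --help`.

```shell
# Run from the directory containing the model's Dockerfile.
# The --config-file flag is an assumption; verify with `tensorkube deploy --help`.
tensorkube deploy --config-file ./deployment.yaml
```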
Deploy each service with the Tensorkube CLI. Make sure there is only one Dockerfile in the build directory (either the 20b or the 120b one).

Step 5: Accessing the deployed app
Voila! Your autoscaling production OpenAI service is ready, and only authenticated requests will be served. Once the deployment succeeds, check its status, then replace YOUR_APP_URL with the endpoint from the command output and send a test request:
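Since vLLM exposes an OpenAI-compatible API, a standard chat-completion request works. The auth header and key are assumptions based on your gateway setup.

```shell
# Replace YOUR_APP_URL with the endpoint from the deployment output.
# The Authorization header is an assumption; use whatever auth your
# Tensorfuse setup issues for the endpoint.
curl -X POST YOUR_APP_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```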
Remember to configure a TLS endpoint with a custom domain before going to production for security and compatibility with modern clients.