OpenAI recently released two open-source models, gpt-oss-20b and gpt-oss-120b. These open-weight models are designed for reasoning, agentic tasks, and improved function calling, making them ideal for building:
  1. Long-running AI agents
  2. Self-hosted voice AI agents with low latency and improved accuracy
In this guide, we’ll walk you through deploying these state-of-the-art models in your AWS account using Tensorfuse and the vllm/vllm-openai:gptoss image. Before we deploy, here’s a quick snapshot of inference benchmark scores for the gpt-oss models:
| Model | GPU Configuration | Context Length | Tokens/sec |
| --- | --- | --- | --- |
| gpt-oss-20b | 1x H100 | 130k tokens | 240 |
| gpt-oss-120b | 8x H100 | 130k tokens | 200 |

Prerequisites

Before you begin, make sure you sign up on the Tensorfuse app and configure a Tensorkube cluster in your AWS account. Using the Tensorkube cluster, you can deploy any custom or open-source model and even host your own AI gateway, allowing you to connect to hundreds of inference providers via a single unified API.
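
If you haven’t set this up yet, the flow looks roughly like the sketch below. The command names are our assumption about the Tensorfuse CLI and may differ in your version, so verify them against the official getting-started docs:

# Install the Tensorkube CLI, authenticate, and provision the cluster in your AWS account
# (command names are an assumption; check the Tensorfuse docs)
pip install tensorkube
tensorkube login
tensorkube configure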

Deploying OpenAI’s gpt-oss Models with Tensorfuse

Each Tensorkube deployment requires:
  1. Your code (in this example, the vLLM API server code from the Docker image)
  2. Your environment (as a Dockerfile)
  3. A deployment configuration (deployment.yaml)

Step 1: Set the Hugging Face token

Get a READ token from your Hugging Face profile and store it as a secret in Tensorfuse using the command below.
tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=hf_EkXXrzzZsuoZubXhDQ --env default
Ensure that the key for your secret is HUGGING_FACE_HUB_TOKEN, as vLLM expects that exact name.

Step 2: Prepare the Dockerfiles

Let’s create separate Dockerfiles for the gpt-oss-20b and gpt-oss-120b models. Here is the one for gpt-oss-20b:
FROM vllm/vllm-openai:gptoss

# Enable HF Hub Transfer for faster model downloads
ENV HF_HUB_ENABLE_HF_TRANSFER=1
ENV VLLM_USE_V1=1

# Add NCCL environment variables
ENV NCCL_CUMEM_ENABLE=0

# Expose port 80 so the readiness probe and routing (which default to port 80) can reach the server
EXPOSE 80

ENTRYPOINT ["vllm", "serve", "openai/gpt-oss-20b", "--port", "80"]
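
The Dockerfile above targets gpt-oss-20b on a single H100. For gpt-oss-120b, a minimal sketch of the corresponding Dockerfile, assuming the 8x H100 configuration from the benchmark table above (the tensor-parallel setting is our assumption, not something prescribed by the model):

FROM vllm/vllm-openai:gptoss

# Enable HF Hub Transfer for faster model downloads
ENV HF_HUB_ENABLE_HF_TRANSFER=1
ENV VLLM_USE_V1=1

# Add NCCL environment variables
ENV NCCL_CUMEM_ENABLE=0

# Expose port 80 to match the readiness probe in deployment.yaml
EXPOSE 80

# Shard the 120b model across 8 H100s with tensor parallelism (assumption based on the table above)
ENTRYPOINT ["vllm", "serve", "openai/gpt-oss-120b", "--port", "80", "--tensor-parallel-size", "8"]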

We’ve configured the vLLM server with various CLI flags tailored to each model. For a comprehensive list of vLLM flags, refer to the vLLM documentation.

Step 3: Deployment Configuration

Create model-specific configuration files to optimize for each model’s requirements. Here is the deployment.yaml for gpt-oss-20b:
gpus: 1
gpu_type: h100
secret:
  - hugging-face-secret
min-scale: 0
max-scale: 3
readiness:
    httpGet:
        path: /health
        port: 80
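
For gpt-oss-120b, a sketch of the corresponding deployment.yaml, assuming the 8x H100 setup from the benchmark table (only the GPU count changes):

gpus: 8
gpu_type: h100
secret:
  - hugging-face-secret
min-scale: 0
max-scale: 3
readiness:
    httpGet:
        path: /health
        port: 80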

Don’t forget the readiness endpoint in your config. Tensorfuse uses this to ensure your service is healthy before routing traffic to it. If not specified, Tensorfuse will default to checking /readiness on port 80.

Step 4: Deploy your models

Deploy each service using the command below. Make sure there is only one Dockerfile in the deployment directory (either the 20b or the 120b version); see the layout sketch after the command.
tensorkube deploy --config-file ./deployment.yaml
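
One simple way to keep the two variants separate is a directory per model (the directory names below are purely illustrative), each holding its own Dockerfile and deployment.yaml:

# Hypothetical layout: one directory per model
cd gpt-oss-20b
tensorkube deploy --config-file ./deployment.yaml

cd ../gpt-oss-120b
tensorkube deploy --config-file ./deployment.yaml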

Step 5: Accessing the deployed app

Voila! Your autoscaling, production-grade gpt-oss service is ready, and only authenticated requests will be served. Once the deployment is successful, check its status:
tensorkube deployment list
To test your deployment, replace YOUR_APP_URL with the endpoint from the command output and run:
curl --request POST \
  --url YOUR_APP_URL/v1/completions \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "openai/gpt-oss-20b",
    "prompt": "Earth to gpt-oss. What can you do?",
    "max_tokens": 5000
  }'
Since vLLM is compatible with the OpenAI API, you can query the other OpenAI-compatible endpoints it exposes, such as /v1/chat/completions. You can also use the OpenAI Python SDK to query your deployment, as shown below:
import openai

# Replace with your actual deployment URL and auth token
base_url = "YOUR_APP_URL/v1"
api_key = "YOUR_AUTH_TOKEN"  # the OpenAI SDK requires a key, so pass your deployment's auth token here

client = openai.OpenAI(
    base_url=base_url,
    api_key=api_key
)

response = client.completions.create(
    model="openai/gpt-oss-120b",
    prompt="Hello, gpt-oss! What can you do today?",
    max_tokens=200
)

print(response)
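Chat completions work the same way; here is a minimal sketch against the /v1/chat/completions endpoint, reusing the same placeholder URL and token:

import openai

# Placeholders: replace with your deployment URL and auth token
client = openai.OpenAI(base_url="YOUR_APP_URL/v1", api_key="YOUR_AUTH_TOKEN")

# vLLM serves the OpenAI-compatible chat endpoint as well
chat = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize what you can do in one sentence."}],
    max_tokens=100,
)

print(chat.choices[0].message.content)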
Remember to configure a TLS endpoint with a custom domain before going to production, both for security and for compatibility with modern clients.

Conclusion

With this guide, you’ve successfully deployed OpenAI’s gpt-oss models on serverless GPUs using Tensorfuse. These models represent the cutting edge of open-source AI, offering capabilities that rival or exceed proprietary alternatives at a fraction of the cost. Click here to get started with Tensorfuse, and explore the Tensorfuse examples repository for more deployment configurations and use cases.