Deploying OpenAI's gpt-oss Models with Tensorfuse
Before we deploy, here’s a quick snapshot of inference benchmark scores for GPT-OSS models:
| Model | GPU Configuration | Context Length | Tokens/sec |
|---|---|---|---|
| gpt-oss-20b | 1xH100 | 130k tokens | 240 |
| gpt-oss-120b | 8xH100 | 130k tokens | 200 |
Prerequisites
Before you begin, make sure you sign up on the Tensorfuse app and configure a Tensorkube cluster in your AWS account. With a Tensorkube cluster, you can deploy any custom or open-source model, and even host your own AI gateway that connects to hundreds of inference providers through a single unified API.
Each Tensorkube deployment requires:

- Your code (in this example, the vLLM API server from a Docker image)
- Your environment (as a Dockerfile)
- A deployment configuration (`deployment.yaml`)
Step 1: Set the Hugging Face token

Get a READ token from your Hugging Face profile and store it as a secret in Tensorfuse using the command below. Name the key `HUGGING_FACE_HUB_TOKEN`, as vLLM expects that exact variable.
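As a sketch, storing the token might look like the following. The exact subcommand syntax is an assumption; check `tensorkube secret create --help` for the form your CLI version expects. What matters is that the key is named `HUGGING_FACE_HUB_TOKEN`.

```shell
# Store your Hugging Face READ token as a Tensorfuse secret.
# NOTE: the subcommand shape below is an assumption; verify with
# `tensorkube secret create --help`. The key name must be
# HUGGING_FACE_HUB_TOKEN, since vLLM reads that environment variable.
tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=hf_your_token_here
```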
Step 2: Prepare the Dockerfiles
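For gpt-oss-20b, a minimal Dockerfile might look like the sketch below. The image tag and flag values are assumptions; pin exact versions for production builds.

```dockerfile
# Sketch: serve gpt-oss-20b with vLLM's OpenAI-compatible server.
# Image tag and flags are assumptions; check the vLLM docs and pin versions.
FROM vllm/vllm-openai:gptoss

# Serve on port 80 to match the default readiness check.
EXPOSE 80

ENTRYPOINT ["vllm", "serve", "openai/gpt-oss-20b", \
            "--host", "0.0.0.0", \
            "--port", "80"]
```

A gpt-oss-120b Dockerfile would swap in the `openai/gpt-oss-120b` model name and add a tensor-parallelism flag (e.g. `--tensor-parallel-size 8`) to spread the model across the 8xH100 configuration.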
Create separate Dockerfiles for the gpt-oss-20b and gpt-oss-120b models; the 120b variant differs mainly in the model name and GPU parallelism settings.

Step 3: Deployment Configuration
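As an illustration, a `deployment.yaml` for gpt-oss-20b might look like the sketch below. The field names are assumptions modeled on typical Tensorkube configs; verify them against the Tensorfuse configuration reference.

```yaml
# Sketch of a deployment.yaml for gpt-oss-20b.
# Field names are assumptions; check the Tensorkube config reference.
gpus: 1
gpu_type: h100
min_scale: 1
max_scale: 3
secret:
  - hugging-face-secret
readiness:
  httpGet:
    path: /health   # vLLM's OpenAI-compatible server exposes /health
    port: 80
```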
Create model-specific configuration files to optimize for each model's requirements. Don't forget the `readiness` endpoint in your config: Tensorfuse uses it to confirm your service is healthy before routing traffic to it. If left unspecified, Tensorfuse defaults to checking `/readiness` on port 80.

Step 4: Deploy your models
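A deployment command might look like the following; the flag name is an assumption, so confirm it with `tensorkube deploy --help`.

```shell
# Run from the directory containing the model's Dockerfile.
# The --config-file flag is an assumption; verify with `tensorkube deploy --help`.
tensorkube deploy --config-file ./deployment.yaml
```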
Deploy each service with the Tensorkube CLI. Make sure there is only one Dockerfile in the build directory (either the 20b or the 120b one).

Step 5: Accessing the deployed app
Voila! Your autoscaling production OpenAI service is ready, and only authenticated requests will be served. Once the deployment succeeds, check its status, then replace YOUR_APP_URL with the endpoint from the command output and send a test request:
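Since vLLM exposes an OpenAI-compatible API, a standard chat-completion request works. The auth header and key are assumptions based on your gateway setup.

```shell
# Replace YOUR_APP_URL with the endpoint from the deployment output.
# The Authorization header is an assumption; use whatever auth your
# Tensorfuse setup issues for the endpoint.
curl -X POST YOUR_APP_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```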
Remember to configure a TLS endpoint with a custom domain before going to production for security and compatibility with modern clients.