> ## Documentation Index
> Fetch the complete documentation index at: https://tensorfuse.io/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Deploy OpenAI OSS Models in your AWS account

> Deploy GPT-OSS models from OpenAI in your AWS using Tensorfuse

OpenAI recently released two open source models, [gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) and [gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b).
These openwieght models are designed for reasoning, agentic tasks and improved function calling making it ideal for use in building:

1. Long running AI Agents
2. Building self-hosted Voice AI agents for low latency and improved accuracy

In this guide, we'll walk you through deploying these state-of-the-art models in your AWS account using Tensorfuse and vLLM `openai:gptoss`image.

Before we deploy, here's a quick snapshot of inference benchmark scores for GPT-OSS models:

| **Model**    | **GPU Configuration** | **Context Length** | **Tokens/sec** |
| ------------ | --------------------- | ------------------ | -------------- |
| gpt-oss-20b  | 1xH100                | 130k tokens        | 240            |
| gpt-oss-120b | 8xH100                | 130k tokens        | 200            |

## Prerequisites

Before you begin, make sure you [sign up](https://app.tensorfuse.io/) on the Tensorfuse app and configure the Tensorkube cluster in your AWS account.

Using the Tensorkube cluster, you can deploy any custom or open-source model and even host your own AI gateway allowing you to connect to 100s of inference providers via single unified API.

## Deploying OpenAIs gpt-oss Models with Tensorfuse

Each Tensorkube deployment requires:

1. **Your code** (in this example, vLLM API server code from Docker image)
2. **Your environment** (as a Dockerfile)
3. **A deployment configuration** (`deployment.yaml`)

### Step 1: Set huggingface token

Get a `READ` token from your [huggingface profile](https://huggingface.co/settings/tokens) and store it as a secret in Tensorfuse using the command below.

```bash theme={null}
tensorkube secret create hugging-face-secret HUGGING_FACE_HUB_TOKEN=hf_EkXXrzzZsuoZubXhDQ --env default
```

Ensure that the key for your secret is `HUGGING_FACE_HUB_TOKEN` as vLLM assumes the same.

### Step 2: Prepare the Dockerfiles

Let's create separate Dockerfiles for gpt-oss-20b and gpt-oss-120b models:

<CodeGroup>
  ```dockerfile Dockerfile (gpt-oss-20b) theme={null}
  FROM vllm/vllm-openai:gptoss

  # Enable HF Hub Transfer for faster model downloads
  ENV HF_HUB_ENABLE_HF_TRANSFER=1
  ENV VLLM_USE_V1=1

  # Add NCCL environment variables
  ENV NCCL_CUMEM_ENABLE=0

  # Expose port 8000
  EXPOSE 8000

  ENTRYPOINT ["vllm", "serve", "openai/gpt-oss-20b"]

  ```

  ```dockerfile Dockerfile (gpt-oss-120b) theme={null}
  FROM vllm/vllm-openai:gptoss

  # Enable HF Hub Transfer for faster model downloads
  ENV HF_HUB_ENABLE_HF_TRANSFER=1
  ENV HUGGING_FACE_HUB_TOKEN=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
  ENV VLLM_USE_V1=1

  # Add NCCL environment variables
  ENV NCCL_CUMEM_ENABLE=0

  # Expose port 8000
  EXPOSE 8000

  # 8-GPU tensor parallel configuration for gpt-oss-120b
  ENTRYPOINT ["vllm", "serve", "openai/gpt-oss-120b", "--tensor-parallel-size", "8"]
  ```
</CodeGroup>

We've configured the vLLM server with various CLI flags tailored to each model. For a comprehensive list
of vLLM flags, refer to the [vLLM documentation](https://docs.vllm.ai/en/v0.8.3/serving/openai_compatible_server.html).

### Step 3: Deployment Configuration

Create model-specific configuration files to optimize for each model's requirements.

<CodeGroup>
  ```yaml deployment.yaml (gpt-oss-20b) theme={null}
  gpus: 1
  gpu_type: h100
  secret:
    - hugging-face-secret
  min_scale: 0
  max_scale: 3
  readiness:
      httpGet:
          path: /health
          port: 80

  ```

  ```yaml deployment.yaml (gpt-oss-120b) theme={null}
  gpus: 8
  gpu_type: h100
  secret:
    - hugging-face-secret
  min_scale: 0
  max_scale: 3
  readiness:
      httpGet:
          path: /health
          port: 80

  ```
</CodeGroup>

<Note>
  Don't forget the `readiness` endpoint in your config. Tensorfuse uses this to ensure your service is healthy before routing traffic to it. If not specified, Tensorfuse will default to checking `/readiness` on port 80.
</Note>

### Step 4: Deploy your models

Deploy your services using these commands. Make sure there is only one Dockerfile in the directory (either 20b or 120b).

```
tensorkube deploy --config-file ./deployment.yaml
```

### Step 5: Accessing the deployed app

<Icon icon="rocket" /> Voila! Your **autoscaling** production OpenAI service is ready. Only authenticated requests will be served.

Once deployment is successful, check the status:

```
tensorkube deployment list
```

To test your deployment, replace `YOUR_APP_URL` with the endpoint from the command output and run:

<CodeGroup>
  ```bash gpt-oss-20b theme={null}
  curl --request POST
  --url YOUR_APP_URL/v1/completions
  --header 'Content-Type: application/json'
  --data '{
  "model": "openai/gpt-oss-20b",
  "prompt": "Earth to gpt-oss. What can you do?",
  "max_tokens": 5000
  }'
  ```

  ```bash gpt-oss-120b theme={null}
  curl --request POST \
  --url YOUR_APP_URL/v1/completions \
  --header 'Content-Type: application/json' \
  --data '{
  "model": "openai/gpt-oss-120b",
  "prompt": "Earth to gpt-oss. What can you do?",
  "max_tokens": 5000
  }'
  ```
</CodeGroup>

Since vLLM is compatible with the OpenAI API you can query the other endpoints [present here](https://platform.openai.com/docs/api-reference/completions/create).

You can also use the OpenAI Python SDK to query your deployment as shown below:

```python theme={null}
import openai

# Replace with your actual URL and token
base_url = "YOUR_APP_URL/v1"

client = openai.OpenAI(
    base_url=base_url
)

response = client.completions.create(
    model="openai/gpt-oss-120b",
    prompt="Hello, gpt-oss! What can you do today?",
    max_tokens=200
)

print(response)
```

<Note>
  Remember to configure a TLS endpoint with a [custom domain](/concepts/custom_domains_with_tls) before going to production for security and compatibility with modern clients.
</Note>

## Conclusion

With this guide, you've successfully deployed OpenAI's oss models on serverless GPUs using Tensorfuse. These models represent the cutting edge of open-source AI, offering capabilities that rival or exceed proprietary alternatives at a fraction of the cost.

[Click here](https://app.tensorfuse.io/) to get started with Tensorfuse.

You can also explore the [Tensorfuse examples repository](https://github.com/tensorfuse/tensorfuse-examples) for more deployment configurations and use cases.
