Commit 14aa7de — Merge pull request #309403 from craigshoemaker/aca/ollama

[Container Apps] New: Deploy OpenAI gpt-oss models with Ollama

2 parents 44c1e7f + bec4c55 · 2 files changed · 226 additions, 0 deletions
File tree

articles/container-apps/TOC.yml (2 additions, 0 deletions)

```diff
@@ -195,6 +195,8 @@ items:
       href: gpu-image-generation.md
     - name: Deploy an NVIDIA Llama3 NIM
       href: serverless-gpu-nim.md
+    - name: Deploy OpenAI GPT with OSS Ollama
+      href: deploy-openai-gpt-oss-ollama.md
     - name: Dynamic sessions
       items:
       - name: Overview
```
articles/container-apps/deploy-openai-gpt-oss-ollama.md (new file, 224 additions, 0 deletions)
---
title: Deploy OpenAI gpt-oss models with Ollama on Azure Container Apps serverless GPUs
description: "Learn how to deploy and run OpenAI's open-source gpt-oss-120b and gpt-oss-20b language models using Ollama on Azure Container Apps with serverless GPU support."
#customer intent: As a developer, I want to deploy OpenAI's gpt-oss models on Azure Container Apps so that I can leverage serverless GPUs for scalable AI workloads.
author: craigshoemaker
ms.author: cshoe
ms.reviewer: cshoe
ms.service: azure-container-apps
ms.collection: ce-skilling-ai-copilot
ms.topic: tutorial
ms.date: 12/12/2025
---

# Deploy OpenAI gpt-oss models with Ollama on Azure Container Apps serverless GPUs

OpenAI recently announced the release of [gpt-oss-120b and gpt-oss-20b](https://openai.com/index/introducing-gpt-oss/), two open-weight language models designed to run on lighter-weight GPU resources. These models make powerful language capabilities accessible to developers who want to self-host language models in their own environments.

This article shows you how to deploy these models by using [Azure Container Apps serverless GPUs](./gpu-serverless-overview.md) with Ollama, which provides a cost-efficient, scalable platform with minimal infrastructure overhead.

By the end of this article, you'll be able to:

> [!div class="checklist"]
> * Use Azure Container Apps serverless GPUs for AI workloads
> * Choose the right gpt-oss model for your needs
> * Deploy an Ollama container on Azure Container Apps with GPU support
> * Configure and interact with deployed models
> * Call model APIs from external applications

## Prerequisites

* **An Azure subscription**: If you don't have one, [create a free account](https://azure.microsoft.com/pricing/purchase-options/azure-account?cid=msft_learn).
* **Quota for serverless GPUs**: If you don't have quota, [request a GPU quota](gpu-serverless-overview.md#request-serverless-gpu-quota).

## What are Azure Container Apps serverless GPUs?

Azure Container Apps is a fully managed, serverless container platform that simplifies the deployment and operation of containerized applications. By using serverless GPU support, you can bring your own containers and deploy them to GPU-backed environments that automatically scale based on demand.

### Benefits of using serverless GPUs

Azure Container Apps serverless GPUs provide the following advantages for deploying AI models:

* **Autoscaling**: Scale to zero when idle, scale out based on demand.

* **Pay-per-second billing**: Pay only for the compute you use.

* **Ease of use**: Accelerate developer velocity and easily bring any container to run on GPUs in the cloud.

* **No infrastructure management**: Focus on your model and application.

* **Enterprise-grade features**: Built-in support for virtual networks, managed identity, private endpoints, and full data governance.

## Choose the right gpt-oss model

The [gpt-oss models](https://openai.com/index/introducing-gpt-oss/) deliver strong performance across common language benchmarks and are optimized for different use cases:

| Model | Performance | Use cases | Recommended GPU |
|-------|-------------|-----------|-----------------|
| `gpt-oss-120b` | Comparable to OpenAI's gpt-4o-mini | High-performance reasoning workloads | A100 |
| `gpt-oss-20b` | Comparable to OpenAI's o3-mini | Lightweight applications, cost-effective small language model (SLM) apps | T4 or A100 |

### Regional availability

Choose your deployment region based on the model you want to use and GPU availability:

| Region | A100 | T4 |
| --- | --- | --- |
| West US | ✔ | |
| West US 3 | ✔ | ✔ |
| Sweden Central | ✔ | ✔ |
| Australia East | ✔ | ✔ |
| West Europe | | ✔ |

> [!NOTE]
> To run the 120 billion parameter model, select one of the A100 regions. To run the 20 billion parameter model, select either a T4 or A100 region.

## Deploy your container app

### Step 1: Create the container app resource

1. Go to the [Azure portal](https://portal.azure.com/).

1. Select **Create a resource**.

1. Search for **Container Apps**.

1. Select **Container App**, and then select **Create**.

1. On the **Basics** tab, configure the following settings:

    * Keep most default values.
    * For **Region**, select a region that supports your chosen model based on the regional availability table.

### Step 2: Configure container settings

1. Select the **Container** tab.

1. Configure the Ollama container settings:

    | Field | Value |
    | --- | --- |
    | **Image source** | Select **Docker Hub or other registries** |
    | **Image type** | Select **Public** |
    | **Registry login server** | Enter **docker.io** |
    | **Image and tag** | Enter **ollama/ollama:latest** |
    | **Workload profile** | Select **Consumption** |
    | **GPU** | Select the **GPU** checkbox |
    | **GPU type** | Select **A100** for gpt-oss:120b. Select **T4** or **A100** for gpt-oss:20b. |

> [!IMPORTANT]
> By default, pay-as-you-go and enterprise agreement customers have quota. If you don't have quota for serverless GPUs in Azure Container Apps, [request a GPU quota](gpu-serverless-overview.md#request-serverless-gpu-quota).

### Step 3: Configure ingress

Configure ingress to allow external access to your Ollama container and enable API calls to your deployed models.

1. Select the **Ingress** tab.

1. Configure the following settings:

    | Field | Value |
    | --- | --- |
    | **Ingress** | Enabled |
    | **Ingress traffic** | Accepting traffic from anywhere |
    | **Target port** | 11434 |

1. Select **Review + Create** at the bottom of the page.

1. Select **Create** to deploy your container app.

## Deploy and use your gpt-oss model

After you create your container app with GPU support and ingress, you're ready to pull and run the gpt-oss model.

### Step 1: Access your deployed application

1. Once your deployment is complete, select **Go to resource**.

1. Note the **Application URL** for your container app. You use this URL later for API calls.

### Step 2: Pull and run the model

> [!TIP]
> Console commands don't count as traffic to your container app, so the app might scale in after a period of inactivity. If you want the container app to remain active for a longer duration, go to **Application** > **Scaling** and set the minimum replica count to 1, or increase the cooldown period. Remember to reset the minimum replica count to 0 when the app isn't in use to avoid ongoing billing.
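If you prefer the command line, the scaling change described in the tip can also be made with the Azure CLI. This is a sketch rather than a verified procedure; the resource names in angle brackets are placeholders you replace with your own values:

```shell
# Keep one replica warm so the container app doesn't scale to zero (placeholder names).
az containerapp update \
  --name <APP_NAME> \
  --resource-group <RESOURCE_GROUP> \
  --min-replicas 1

# Reset to zero when you're done to avoid ongoing GPU billing.
az containerapp update \
  --name <APP_NAME> \
  --resource-group <RESOURCE_GROUP> \
  --min-replicas 0
```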
1. In the Azure portal, select the **Monitoring** dropdown, and then select **Console**.

1. Under **Choose start up command**, select **Connect**.

1. Pull the gpt-oss model by running the following command. Use `120b` or `20b` depending on which model you want to run:

    ```bash
    ollama pull gpt-oss:120b
    ```

1. Run the gpt-oss model:

    ```bash
    ollama run gpt-oss:120b
    ```

1. Test the model with a sample prompt:

    ```text
    Can you explain LLMs and recent developments in AI over the last few years?
    ```

You successfully deployed and ran an OpenAI gpt-oss model on Azure Container Apps serverless GPUs.

## (Optional) Call the API from external applications

You can interact with your deployed model by using REST API calls from your local machine or other applications.

### Set up the environment

1. Open your local command line or terminal.

1. Copy your container app URL from the Azure portal.

1. Set the `OLLAMA_URL` environment variable. Make sure to replace the placeholder surrounded by `<>` with your value before running the following command:

    ```bash
    export OLLAMA_URL="<YOUR_APPLICATION_URL>"
    ```

### Make API calls

Use the following curl command to prompt the gpt-oss model:

```bash
curl -X POST "$OLLAMA_URL/api/generate" -H "Content-Type: application/json" -d '{
  "model": "gpt-oss:120b",
  "prompt": "Can you explain LLMs and recent developments in AI over the last few years?",
  "stream": false
}'
```

This request sets `stream` to `false`, so the API returns the fully generated response in a single reply.
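You can make the same call from code. The following is a minimal Python sketch using only the standard library; the endpoint path and the `model`, `prompt`, `stream`, and `response` fields match the curl request shown earlier, and `OLLAMA_URL` is the environment variable you set in the previous step:

```python
import json
import os
import urllib.request


def build_generate_payload(model: str, prompt: str, stream: bool = False) -> bytes:
    """Build the JSON request body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()


def generate(base_url: str, model: str, prompt: str) -> str:
    """Send a prompt to the deployed Ollama container and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=build_generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # A non-streaming reply is a single JSON object with a "response" field.
        return json.loads(response.read())["response"]


if __name__ == "__main__" and "OLLAMA_URL" in os.environ:
    base_url = os.environ["OLLAMA_URL"]  # set in the previous step
    print(generate(base_url, "gpt-oss:120b", "Explain LLMs in one paragraph."))
```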
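When `stream` is set to `true` instead, Ollama returns one JSON object per line, each carrying a fragment of the reply in its `response` field, with a final object whose `done` flag is `true`. The following hypothetical helper reassembles those fragments, assuming that line-per-object format:

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Reassemble a full reply from Ollama's streaming, line-delimited JSON output."""
    fragments = []
    for line in lines:
        if not line.strip():
            continue  # skip any blank lines between JSON objects
        chunk = json.loads(line)
        fragments.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object signals the end of the stream
    return "".join(fragments)
```

Pair this with an HTTP client that iterates over the response body line by line to display tokens as they arrive.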
## Clean up resources

To avoid charges on your Azure subscription, clean up the resources you created in this article.

1. In the Azure portal, go to your resource group.
1. Select **Delete resource group**.
1. To confirm the delete operation, enter your resource group name.
1. Select **Delete**.

## Next steps

Now that you successfully deployed a gpt-oss model, consider the following ways to further develop your application:

* **Add persistent storage**: Azure Container Apps is fully ephemeral and doesn't mount storage by default. To persist your data and conversations, [add a volume mount to your container app](storage-mounts.md).

* **Explore other models**: Follow these same steps to run any model available in [Ollama's library](https://ollama.com/search).

* **Learn more about serverless GPUs**: Review the [Azure Container Apps serverless GPU documentation](gpu-serverless-overview.md) for advanced configuration options.

## Related content

* [Azure Container Apps serverless GPU overview](gpu-serverless-overview.md)
* [Storage mounts in Azure Container Apps](storage-mounts.md)
* [Scale rules in Azure Container Apps](scale-app.md)
