ms.reviewer: cshoe
ms.service: azure-container-apps
ms.collection: ce-skilling-ai-copilot
ms.topic: tutorial
ms.date: 12/12/2025
---

# Deploy OpenAI gpt-oss models with Ollama on Azure Container Apps serverless GPUs

OpenAI recently announced the release of [gpt-oss-120b and gpt-oss-20b](https://openai.com/index/introducing-gpt-oss/), two new open-weight language models designed to run on lighter weight GPU resources. These models make powerful language capabilities highly accessible for developers who want to self-host language models within their own environments.

This article shows you how to deploy these models by using [Azure Container Apps serverless GPUs](./gpu-serverless-overview.md) with Ollama, providing a cost-efficient and scalable platform with minimal infrastructure overhead.

By the end of this article, you can:

> [!div class="checklist"]
> * Use Azure Container Apps serverless GPUs for AI workloads
> * Choose the right gpt-oss model for your needs
> * Deploy an Ollama container on Azure Container Apps with GPU support
> * Configure and interact with deployed models
> * Call model APIs from external applications

## Prerequisites

* **An Azure subscription**: If you don't have one, [create a free account](https://azure.microsoft.com/pricing/purchase-options/azure-account?cid=msft_learn).
* **Quota for serverless GPUs**: If you don't have quota, [request a GPU quota](gpu-serverless-overview.md#request-serverless-gpu-quota).

## What are Azure Container Apps serverless GPUs?

Azure Container Apps is a fully managed, serverless container platform that simplifies the deployment and operation of containerized applications. By using serverless GPU support, you can bring your own containers and deploy them to GPU-backed environments that automatically scale based on demand.

### Benefits of using serverless GPUs

Azure Container Apps serverless GPUs provide the following advantages for deploying AI models:

* **Autoscaling**: Scale to zero when idle, scale out based on demand.
* **Pay-per-second billing**: Pay only for the compute you use.
* **Ease of use**: Accelerate developer velocity and easily bring any container to run on GPUs in the cloud.
* **No infrastructure management**: Focus on your model and application.
* **Enterprise-grade features**: Built-in support for virtual networks, managed identity, private endpoints, and full data governance.

## Choose the right gpt-oss model

The [gpt-oss models](https://openai.com/index/introducing-gpt-oss/) deliver strong performance across common language benchmarks and are optimized for different use cases:

| Model | Performance | Use cases | Recommended GPU |
| --- | --- | --- | --- |
| `gpt-oss-20b` | Comparable to o3-mini | Lightweight applications, cost-effective small language model (SLM) apps | T4 or A100 |

### Regional availability

Choose your deployment region based on the model you want to use and GPU availability.

1. Select **Container App** and then select **Create**.

1. On the **Basics** tab, configure the following settings:

    * Keep most default values.
    * For **Region**, select a region that supports your chosen model based on the regional availability table.

### Step 2: Configure container settings

| Field | Value |
| --- | --- |
| **Image source** | Select **Docker Hub or other registries**. |
| **Image type** | Select **Public**. |
| **Registry login server** | Enter **docker.io**. |
| **Image and tag** | Enter **ollama/ollama:latest**. |
| **Workload profile** | Select **Consumption**. |
| **GPU** | Select the **GPU** checkbox. |
| **GPU type** | Select **A100** for gpt-oss:120b. Select **T4** or **A100** for gpt-oss:20b. |

> [!IMPORTANT]
> By default, pay-as-you-go and enterprise agreement customers have quota. If you don't have quota for serverless GPUs in Azure Container Apps, [request a GPU quota](gpu-serverless-overview.md#request-serverless-gpu-quota).

### Step 3: Configure ingress

Configure ingress to allow external access to your Ollama container and enable API calls to your deployed models.

1. Select the **Ingress** tab.

1. Configure the following settings:
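If you prefer scripting, the ingress setup in this step has a rough Azure CLI equivalent. The following is a hypothetical sketch, not the article's method: the app and resource group names are placeholders, and 11434 is the port the Ollama container listens on by default.

```shell
# Hypothetical CLI equivalent of enabling external ingress in the portal.
# Replace the placeholder names with your values before running.
APP_NAME="<YOUR_APP_NAME>"
RESOURCE_GROUP="<YOUR_RESOURCE_GROUP>"
TARGET_PORT=11434   # Ollama's default listening port

# Guarded so the sketch is a no-op until the placeholders are filled in.
if [ "$APP_NAME" != "<YOUR_APP_NAME>" ]; then
  az containerapp ingress enable \
    --name "$APP_NAME" \
    --resource-group "$RESOURCE_GROUP" \
    --type external \
    --target-port "$TARGET_PORT"
fi
```

External ingress exposes the app publicly, which is what lets you call the model API from outside Azure later in this article.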
## Deploy and use your gpt-oss model

After creating your container app with GPU support and ingress, you're ready to pull and run the gpt-oss model.

### Step 1: Access your deployed application

1. Once your deployment is complete, select **Go to resource**.
### Step 2: Pull and run the model

> [!TIP]
> Console commands in the container app aren't counted as traffic for the container app to stay scaled out, so your application might scale in after a set period. If you want the container app to remain active for a longer duration, go to **Application** > **Scaling** and set the minimum replica count to 1, or increase the cooldown period duration. Remember to reset the minimum replica count to 0 when not in use to avoid ongoing billing.

1. In the Azure portal, select the **Monitoring** dropdown, and then select **Console**.
1. Test the model with a sample prompt:

    ```text
    Can you explain LLMs and recent developments in AI over the last few years?
    ```
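Taken together, the console steps in this section amount to pulling the model weights and starting an interactive session. The following is a sketch under the assumption that you chose the 20B model; substitute `gpt-oss:120b` if you deployed on an A100 profile.

```shell
# Sketch of the console session inside the container: download the model
# weights, then start an interactive chat with it. Assumes gpt-oss:20b.
MODEL="gpt-oss:20b"

# Guarded so the sketch only runs where the Ollama CLI is available.
if command -v ollama >/dev/null 2>&1; then
  ollama pull "$MODEL"   # download the model weights
  ollama run "$MODEL"    # start an interactive prompt session
fi
```

The first pull downloads several gigabytes of weights, so expect it to take a few minutes on the first run.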

You can interact with your deployed model by using REST API calls from your local machine.

1. Set the OLLAMA_URL environment variable:

    Make sure to replace the placeholder surrounded by `<>` with your value before running the following command.

    ```bash
    export OLLAMA_URL="<YOUR_APPLICATION_URL>"
    ```
### Make API calls
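As a sketch of such a call (assuming the `gpt-oss:20b` model and Ollama's `/api/generate` endpoint), a non-streaming request from your terminal looks like the following; `OLLAMA_URL` comes from the previous step.

```shell
# Non-streaming generate request against the deployed app. The endpoint and
# payload fields follow Ollama's REST API; gpt-oss:20b is assumed.
BODY='{"model": "gpt-oss:20b", "prompt": "Explain LLMs in one sentence.", "stream": false}'

# Guarded so the sketch is a no-op unless OLLAMA_URL is set.
if [ -n "${OLLAMA_URL:-}" ]; then
  curl -s "$OLLAMA_URL/api/generate" \
    -H "Content-Type: application/json" \
    -d "$BODY"
fi
```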

This curl request has streaming set to false, so it returns the fully generated response.

## Clean up resources

To avoid charges on your Azure subscription, clean up the resources you created in this article.

1. In the Azure portal, go to your resource group.

1. Select **Delete resource group**.

1. To confirm the delete operation, enter your resource group name.

1. Select **Delete**.

## Next steps

Now that you've successfully deployed a gpt-oss model, consider the following ways to further develop your application:

* **Add persistent storage**: Azure Container Apps is fully ephemeral and doesn't feature mounted storage by default. To persist your data and conversations, [add a volume mount to your container app](storage-mounts.md).
* **Explore other models**: Follow these same steps to run any model available in [Ollama's library](https://ollama.com/search).
* **Learn more about serverless GPUs**: Review the [Azure Container Apps serverless GPU documentation](gpu-serverless-overview.md) for advanced configuration options.