---
title: Deploy OpenAI gpt-oss models with Ollama on Azure Container Apps serverless GPUs
description: "Learn how to deploy and run OpenAI's open-weight gpt-oss-120b and gpt-oss-20b language models using Ollama on Azure Container Apps with serverless GPU support."
#customer intent: As a developer, I want to deploy OpenAI's gpt-oss models on Azure Container Apps so that I can leverage serverless GPUs for scalable AI workloads.
author: craigshoemaker
ms.author: cshoe
ms.reviewer: cshoe
ms.service: azure-container-apps
ms.collection: ce-skilling-ai-copilot
ms.topic: tutorial
ms.date: 12/12/2025
---

# Deploy OpenAI gpt-oss models with Ollama on Azure Container Apps serverless GPUs

OpenAI recently announced the release of [gpt-oss-120b and gpt-oss-20b](https://openai.com/index/introducing-gpt-oss/), two open-weight language models designed to run on lighter-weight GPU resources. These models make powerful language capabilities accessible to developers who want to self-host language models in their own environments.

This article shows you how to deploy these models by using [Azure Container Apps serverless GPUs](./gpu-serverless-overview.md) with Ollama, providing a cost-efficient and scalable platform with minimal infrastructure overhead.

By the end of this article, you can:

> [!div class="checklist"]
> * Use Azure Container Apps serverless GPUs for AI workloads
> * Choose the right gpt-oss model for your needs
> * Deploy an Ollama container on Azure Container Apps with GPU support
> * Configure and interact with deployed models
> * Call model APIs from external applications

## Prerequisites

* **An Azure subscription**: If you don't have one, [create a free account](https://azure.microsoft.com/pricing/purchase-options/azure-account?cid=msft_learn).
* **Quota for serverless GPUs**: If you don't have quota, [request a GPU quota](gpu-serverless-overview.md#request-serverless-gpu-quota).

## What are Azure Container Apps serverless GPUs?

Azure Container Apps is a fully managed, serverless container platform that simplifies the deployment and operation of containerized applications. By using serverless GPU support, you can bring your own containers and deploy them to GPU-backed environments that automatically scale based on demand.

### Benefits of using serverless GPUs

Azure Container Apps serverless GPUs provide the following advantages for deploying AI models:

* **Autoscaling**: Scale to zero when idle, and scale out based on demand.

* **Pay-per-second billing**: Pay only for the compute you use.

* **Ease of use**: Accelerate developer velocity and easily bring any container to run on GPUs in the cloud.

* **No infrastructure management**: Focus on your model and application.

* **Enterprise-grade features**: Built-in support for virtual networks, managed identity, private endpoints, and full data governance.

## Choose the right gpt-oss model

The [gpt-oss models](https://openai.com/index/introducing-gpt-oss/) deliver strong performance across common language benchmarks and are optimized for different use cases:

| Model | Performance | Use cases | Recommended GPU |
|-------|-------------|-----------|-----------------|
| `gpt-oss-120b` | Comparable to OpenAI's o4-mini | High-performance reasoning workloads | A100 |
| `gpt-oss-20b` | Comparable to OpenAI's o3-mini | Lightweight applications, cost-effective small language model (SLM) apps | T4 or A100 |

### Regional availability

Choose your deployment region based on the model you want to use and GPU availability:

| Region | A100 | T4 |
| --- | --- | --- |
| West US | ✅ | |
| West US 3 | ✅ | ✅ |
| Sweden Central | ✅ | ✅ |
| Australia East | ✅ | ✅ |
| West Europe | | ✅ |

> [!NOTE]
> To run the 120-billion-parameter model, select one of the A100 regions. To run the 20-billion-parameter model, select either a T4 or an A100 region.

## Deploy your container app

### Step 1: Create the container app resource

1. Go to the [Azure portal](https://portal.azure.com/).

1. Select **Create a resource**.

1. Search for **Container Apps**.

1. Select **Container App** and then select **Create**.

1. On the **Basics** tab, configure the following settings:

    * Keep most default values.
    * For **Region**, select a region that supports your chosen model based on the regional availability table.

### Step 2: Configure container settings

1. Select the **Container** tab.

1. Configure the Ollama container settings:

    | Field | Value |
    | --- | --- |
    | **Image source** | Select **Docker Hub or other registries** |
    | **Image type** | Select **Public** |
    | **Registry login server** | Enter **docker.io** |
    | **Image and tag** | Enter **ollama/ollama:latest** |
    | **Workload profile** | Select **Consumption** |
    | **GPU** | Select the **GPU** checkbox |
    | **GPU type** | For gpt-oss:120b, select **A100**. For gpt-oss:20b, select **T4** or **A100**. |

    > [!IMPORTANT]
    > By default, pay-as-you-go and enterprise agreement customers have quota. If you don't have quota for serverless GPUs in Azure Container Apps, [request a GPU quota](gpu-serverless-overview.md#request-serverless-gpu-quota).

### Step 3: Configure ingress

Configure ingress to allow external access to your Ollama container and enable API calls to your deployed models.

1. Select the **Ingress** tab.

1. Configure the following settings:

    | Field | Value |
    | --- | --- |
    | **Ingress** | Enabled |
    | **Ingress traffic** | Accepting traffic from anywhere |
    | **Target port** | 11434 |

1. Select **Review + Create** at the bottom of the page.

1. Select **Create** to deploy your container app.

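Once the app is created, you can confirm that ingress is reachable before opening the console. The Ollama server responds to a plain GET on its root path with a short status message, so a reachability check needs only the standard library. This is a minimal sketch; the application URL shown is a hypothetical placeholder for the one your deployment produces.

```python
import urllib.request

def health_url(base_url: str) -> str:
    # Ollama answers GET on its root path with a short plain-text status,
    # so the application URL itself doubles as a reachability check.
    return base_url.rstrip("/") + "/"

def check_ollama(base_url: str) -> str:
    # Returns the status text reported by the server.
    with urllib.request.urlopen(health_url(base_url)) as resp:
        return resp.read().decode()

# Example (requires a deployed app; the URL is a hypothetical placeholder):
#   check_ollama("https://my-ollama-app.example.azurecontainerapps.io")
```

If the check raises a connection error, verify that ingress is enabled and the target port is 11434.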
## Deploy and use your gpt-oss model

After creating your container app with GPU support and ingress, you're ready to pull and run the gpt-oss model.

### Step 1: Access your deployed application

1. When your deployment is complete, select **Go to resource**.

1. Note the **Application URL** for your container app. You use this URL later for API calls.

### Step 2: Pull and run the model

> [!TIP]
> Console commands don't count as traffic, so the container app might scale in after a set period even while you work in the console. If you want the container app to remain active for a longer duration, go to **Application** > **Scaling** and set the minimum replica count to 1, or increase the cooldown period. Remember to reset the minimum replica count to 0 when the app isn't in use to avoid ongoing billing.

1. In the Azure portal, select the **Monitoring** dropdown, and then select **Console**.

1. Under **Choose start up command**, select **Connect**.

1. Pull the gpt-oss model by running the following command. Use `120b` or `20b` depending on which model you want to run:

    ```bash
    ollama pull gpt-oss:120b
    ```

1. Run the gpt-oss model:

    ```bash
    ollama run gpt-oss:120b
    ```

1. Test the model with a sample prompt:

    ```text
    Can you explain LLMs and recent developments in AI over the last few years?
    ```

You successfully deployed and ran an OpenAI gpt-oss model on Azure Container Apps serverless GPUs.

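You can also confirm the pull succeeded from outside the console. Ollama exposes a `/api/tags` endpoint that lists the models available on the instance. The following sketch queries it with only the Python standard library; the application URL in the example comment is a hypothetical placeholder.

```python
import json
import urllib.request

def tags_url(base_url: str) -> str:
    # Ollama's /api/tags endpoint lists the models pulled into this instance.
    return base_url.rstrip("/") + "/api/tags"

def list_models(base_url: str) -> list[str]:
    with urllib.request.urlopen(tags_url(base_url)) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]

# Example (requires a deployed app; the URL is a hypothetical placeholder):
#   list_models("https://my-ollama-app.example.azurecontainerapps.io")
#   After a successful pull, the list includes an entry such as "gpt-oss:120b".
```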
## (Optional) Call the API from external applications

You can interact with your deployed model by using REST API calls from your local machine or other applications.

### Set up the environment

1. Open your local command line or terminal.

1. Copy your container app URL from the Azure portal.

1. Set the `OLLAMA_URL` environment variable:

    Make sure to replace the placeholder surrounded by `<>` with your value before running the following command.

    ```bash
    export OLLAMA_URL="<YOUR_APPLICATION_URL>"
    ```

### Make API calls

Use the following curl command to prompt the gpt-oss model:

```bash
curl -X POST "$OLLAMA_URL/api/generate" -H "Content-Type: application/json" -d '{
  "model": "gpt-oss:120b",
  "prompt": "Can you explain LLMs and recent developments in AI over the last few years?",
  "stream": false
}'
```

This request sets `stream` to `false`, so the API returns the complete response in a single JSON object instead of streaming tokens as they're generated.

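If you'd rather call the model from application code than from curl, the same request can be sketched in Python with only the standard library. This is a minimal, illustrative example rather than an official SDK; it assumes the `OLLAMA_URL` environment variable set in the previous section.

```python
import json
import os
import urllib.request

def build_payload(prompt: str, model: str = "gpt-oss:120b", stream: bool = False) -> dict:
    # Mirrors the body of the curl example; stream=False asks for one complete response.
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(base_url: str, prompt: str, model: str = "gpt-oss:120b") -> str:
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(build_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The non-streaming response is a single JSON object with a "response" field.
        return json.load(resp)["response"]

# Example (requires the OLLAMA_URL environment variable from the previous section):
#   generate(os.environ["OLLAMA_URL"], "Can you explain LLMs?")
```

For long generations, setting `stream` to `true` instead returns newline-delimited JSON chunks that you read line by line as they arrive.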
## Clean up resources

To avoid charges on your Azure subscription, clean up the resources you created in this article.

1. In the Azure portal, go to your resource group.
1. Select **Delete resource group**.
1. To confirm the delete operation, enter your resource group name.
1. Select **Delete**.

## Next steps

Now that you successfully deployed a gpt-oss model, consider the following ways to further develop your application:

* **Add persistent storage**: Container Apps storage is ephemeral by default, so data doesn't survive restarts. To persist your data and conversations, [add a volume mount to your container app](storage-mounts.md).

* **Explore other models**: Follow these same steps to run any model available in [Ollama's library](https://ollama.com/search).

* **Learn more about serverless GPUs**: Review the [Azure Container Apps serverless GPU documentation](gpu-serverless-overview.md) for advanced configuration options.

## Related content

* [Azure Container Apps serverless GPU overview](gpu-serverless-overview.md)
* [Storage mounts in Azure Container Apps](storage-mounts.md)
* [Scale rules in Azure Container Apps](scale-app.md)