---
title: Deploy OpenAI gpt-oss models with Ollama on Azure Container Apps serverless GPUs
description: "Learn how to deploy and run OpenAI's open-source gpt-oss-120b and gpt-oss-20b language models using Ollama on Azure Container Apps with serverless GPU support."
#customer intent: As a developer, I want to deploy OpenAI's gpt-oss models on Azure Container Apps so that I can leverage serverless GPUs for scalable AI workloads.
author: craigshoemaker
ms.author: cshoe
ms.reviewer: cshoe
ms.service: azure-container-apps
ms.collection: ce-skilling-ai-copilot
ms.topic: tutorial
ms.date: 12/11/2025
---

# Deploy OpenAI gpt-oss models with Ollama on Azure Container Apps serverless GPUs

OpenAI recently announced the release of [gpt-oss-120b and gpt-oss-20b](https://openai.com/index/introducing-gpt-oss/), two state-of-the-art open-weight language models designed to run on lighter-weight GPU resources. These models make powerful language capabilities accessible to developers who want to self-host language models in their own environments.

This article shows you how to deploy these models with Ollama on [Azure Container Apps serverless GPUs](./gpu-serverless-overview.md), a cost-efficient and scalable platform with minimal infrastructure overhead.

## Learning objectives

By the end of this article, you'll be able to:

- Use Azure Container Apps serverless GPUs for AI workloads
- Choose the right gpt-oss model for your needs
- Deploy an Ollama container on Azure Container Apps with GPU support
- Configure and interact with deployed models
- Call model APIs from external applications

## Prerequisites

- An Azure subscription. If you don't have one, [create a free account](https://azure.microsoft.com/pricing/purchase-options/azure-account?cid=msft_learn).
- Quota for serverless GPUs in Azure Container Apps. If you don't have quota, [request a GPU quota](gpu-serverless-overview.md#request-serverless-gpu-quota).
- A basic understanding of containers and Azure services.
- Familiarity with a command-line interface.

## What are Azure Container Apps serverless GPUs?

Azure Container Apps is a fully managed, serverless container platform that simplifies the deployment and operation of containerized applications. With serverless GPU support, you can bring your own containers and deploy them to GPU-backed environments that automatically scale based on demand.

### Key benefits

- **Autoscaling**: Scale to zero when idle, and scale out based on demand.
- **Pay-per-second billing**: Pay only for the compute you use.
- **Ease of use**: Accelerate developer velocity and bring any container to run on GPUs in the cloud.
- **No infrastructure management**: Focus on your model and application.
- **Enterprise-grade features**: Built-in support for virtual networks, managed identity, private endpoints, and full data governance.

## Choose the right gpt-oss model

The [gpt-oss models](https://openai.com/index/introducing-gpt-oss/) deliver strong performance across common language benchmarks and are optimized for different use cases:

| Model | Performance | Use cases | Recommended GPU |
|-------|-------------|-----------|-----------------|
| gpt-oss-120b | Comparable to OpenAI's o4-mini | High-performance reasoning workloads | A100 |
| gpt-oss-20b | Comparable to OpenAI's o3-mini | Lightweight applications, cost-effective small language model (SLM) apps | T4 or A100 |

### Regional availability

Choose your deployment region based on the model you want to use and GPU availability:

| Region | A100 | T4 |
| --- | --- | --- |
| West US | ✅ | |
| West US 3 | ✅ | ✅ |
| Sweden Central | ✅ | ✅ |
| Australia East | ✅ | ✅ |
| West Europe | | ✅ |

> [!NOTE]
> To run the 120-billion-parameter model, select one of the A100 regions. To run the 20-billion-parameter model, select either a T4 or A100 region.

## Deploy your container app

### Step 1: Create the container app resource

1. Go to the [Azure portal](https://portal.azure.com/).

1. Select **Create a resource**.

1. Search for **Container Apps**.

1. Select **Container App**, and then select **Create**.

1. On the **Basics** tab, configure the following settings:
    - Keep most default values.
    - For **Region**, select a region that supports your chosen model based on the regional availability table.

### Step 2: Configure container settings

1. Select the **Container** tab.

1. Configure the Ollama container settings:

    | Field | Value |
    | --- | --- |
    | **Image source** | Docker Hub or other registries |
    | **Image type** | Public |
    | **Registry login server** | `docker.io` |
    | **Image and tag** | `ollama/ollama:latest` |
    | **Workload profile** | Consumption |
    | **GPU** | ✅ (select the checkbox) |
    | **GPU type** | A100 for gpt-oss:120b<br>T4 or A100 for gpt-oss:20b |

    > [!IMPORTANT]
    > By default, pay-as-you-go and Enterprise Agreement (EA) customers have quota. If you don't have quota for serverless GPUs in Azure Container Apps, [request a GPU quota](gpu-serverless-overview.md#request-serverless-gpu-quota).

### Step 3: Configure ingress

1. Select the **Ingress** tab.

1. Configure the following settings:

    | Field | Value |
    | --- | --- |
    | **Ingress** | Enabled |
    | **Ingress traffic** | Accepting traffic from anywhere |
    | **Target port** | 11434 |

1. Select **Review + create** at the bottom of the page.

1. Select **Create** to deploy your container app.

## Deploy and use your gpt-oss model

### Step 1: Access your deployed application

1. When your deployment is complete, select **Go to resource**.

1. Note the **Application URL** for your container app. You use this URL later for API calls.

### Step 2: Pull and run the model

> [!TIP]
> Console commands don't register as ingress traffic, so your container app might scale in after its idle period. To keep the app active longer, go to **Application** > **Scaling**, and either set the minimum replica count to 1 or increase the cooldown period. Remember to reset the minimum replica count to 0 when the app isn't in use to avoid ongoing billing.

1. In the Azure portal, select the **Monitoring** dropdown, and then select **Console**.

1. Under **Choose start up command**, select **Connect**.

1. Pull the gpt-oss model by running the following command. Use `120b` or `20b` depending on which model you want to run:

    ```bash
    ollama pull gpt-oss:120b
    ```

1. Run the gpt-oss model:

    ```bash
    ollama run gpt-oss:120b
    ```

1. Test the model with a sample prompt:

    ```
    Can you explain LLMs and recent developments in AI over the last few years?
    ```

You successfully deployed and ran an OpenAI gpt-oss model on Azure Container Apps serverless GPUs.

## (Optional) Call the API from external applications

You can interact with your deployed model by using REST API calls from your local machine or other applications.

### Set up the environment

1. Open your local command line or terminal.

1. Copy your container app URL from the Azure portal.

1. Set the `OLLAMA_URL` environment variable:

    ```bash
    export OLLAMA_URL="{Your application URL}"
    ```
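
Before sending prompts, you can optionally confirm that the server is reachable. This is a quick sanity check, not part of the required steps; the URL below is a hypothetical placeholder for your own Application URL. Ollama's `/api/tags` endpoint lists the models the server has pulled:

```bash
# Hypothetical example value; replace with your own Application URL.
export OLLAMA_URL="https://my-ollama-app.example.azurecontainerapps.io"

# List the models the server has pulled. If the app scaled to zero,
# the first request can take a while (or time out) while a replica starts.
curl -s --max-time 10 "$OLLAMA_URL/api/tags" \
  || echo "No response yet; the app might still be scaling out from zero."
```

If the model you pulled earlier appears in the response, the server is ready to accept generate requests.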

### Make API calls

Use the following curl command to prompt the gpt-oss model:

```bash
curl -X POST "$OLLAMA_URL/api/generate" -H "Content-Type: application/json" -d '{
  "model": "gpt-oss:120b",
  "prompt": "Can you explain LLMs and recent developments in AI over the last few years?",
  "stream": false
}'
```

Because `stream` is set to `false`, the request waits for generation to finish and returns the complete response in a single JSON object.
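
The same call works from application code. The following Python sketch uses only the standard library; the helper function names are illustrative, not part of any SDK. Ollama's `/api/generate` endpoint returns a JSON object whose `response` field holds the generated text:

```python
import json
import os
import urllib.request


def build_generate_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}


def generate(base_url: str, model: str, prompt: str) -> str:
    """Send a non-streaming generate request and return the model's text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # Reads the OLLAMA_URL you exported earlier; skips the call if it isn't set.
    url = os.environ.get("OLLAMA_URL")
    if url:
        print(generate(url, "gpt-oss:120b", "Can you explain LLMs in one paragraph?"))
```

Setting `stream` to `true` instead makes the server return newline-delimited JSON chunks, which suits interactive UIs that display tokens as they arrive.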

## Clean up resources

To avoid incurring charges on your Azure subscription, clean up the resources you created in this article.

1. In the Azure portal, go to your resource group.
1. Select **Delete resource group**.
1. Enter your resource group name to confirm deletion.
1. Select **Delete**.

## Next steps

Now that you successfully deployed a gpt-oss model, consider these next steps:

- **Add persistent storage**: Azure Container Apps containers are ephemeral and don't include mounted storage by default. To persist your data and conversations, [add a volume mount to your container app](storage-mounts.md).
- **Explore other models**: Follow these same steps to run any model available in [Ollama's library](https://ollama.com/search).
- **Learn more about serverless GPUs**: Review the [Azure Container Apps serverless GPU documentation](gpu-serverless-overview.md) for advanced configuration options.

## Related content

- [Azure Container Apps serverless GPU overview](gpu-serverless-overview.md)
- [Storage mounts in Azure Container Apps](storage-mounts.md)
- [Scale rules in Azure Container Apps](scale-app.md)