ms.reviewer: cshoe
ms.service: azure-container-apps
ms.collection: ce-skilling-ai-copilot
ms.topic: tutorial
ms.date: 12/12/2025
---

# Deploy OpenAI gpt-oss models with Ollama on Azure Container Apps serverless GPUs

OpenAI recently announced the release of [gpt-oss-120b and gpt-oss-20b](https://openai.com/index/introducing-gpt-oss/), two new open-weight language models designed to run on lighter weight GPU resources. These models make powerful language capabilities highly accessible for developers who want to self-host language models within their own environments.

This article shows you how to deploy these models by using [Azure Container Apps serverless GPUs](./gpu-serverless-overview.md) with Ollama, providing a cost-efficient and scalable platform with minimal infrastructure overhead.

By the end of this article, you can:

> [!div class="checklist"]
> * Use Azure Container Apps serverless GPUs for AI workloads
> * Choose the right gpt-oss model for your needs
> * Deploy an Ollama container on Azure Container Apps with GPU support
> * Configure and interact with deployed models
> * Call model APIs from external applications

## Prerequisites

* **An Azure subscription**: If you don't have one, [create a free account](https://azure.microsoft.com/pricing/purchase-options/azure-account?cid=msft_learn).
* **Quota for serverless GPUs**: If you don't have quota, [request a GPU quota](gpu-serverless-overview.md#request-serverless-gpu-quota).

## What are Azure Container Apps serverless GPUs?

Azure Container Apps is a fully managed, serverless container platform that simplifies the deployment and operation of containerized applications. By using serverless GPU support, you can bring your own containers and deploy them to GPU-backed environments that automatically scale based on demand.

### Benefits of using serverless GPUs

Azure Container Apps serverless GPUs provide the following advantages for deploying AI models:

* **Autoscaling**: Scale to zero when idle, scale out based on demand.
* **Pay-per-second billing**: Pay only for the compute you use.
* **Ease of use**: Accelerate developer velocity and easily bring any container to run on GPUs in the cloud.
* **No infrastructure management**: Focus on your model and application.
* **Enterprise-grade features**: Built-in support for virtual networks, managed identity, private endpoints, and full data governance.

## Choose the right gpt-oss model

The [gpt-oss models](https://openai.com/index/introducing-gpt-oss/) deliver strong performance across common language benchmarks and are optimized for different use cases:

| Model | Performance | Use cases | Recommended GPU |
| --- | --- | --- | --- |
| `gpt-oss-20b` | Comparable to o3-mini | Lightweight applications, cost-effective small language model (SLM) apps | T4 or A100 |

### Regional availability

Choose your deployment region based on the model you want to use and GPU availability.

1. Select **Container App** and then select **Create**.

1. On the **Basics** tab, configure the following settings:

    * Keep most default values.
    * For **Region**, select a region that supports your chosen model based on the regional availability table.

### Step 2: Configure container settings

| Field | Value |
| --- | --- |
| **Image source** | Select **Docker Hub or other registries**. |
| **Image type** | Select **Public**. |
| **Registry login server** | Enter **docker.io**. |
| **Image and tag** | Enter **ollama/ollama:latest**. |
| **Workload profile** | Select **Consumption**. |
| **GPU** | Select the **GPU** checkbox. |
| **GPU type** | Select **A100** for gpt-oss:120b. Select **T4** or **A100** for gpt-oss:20b. |

> [!IMPORTANT]
> By default, pay-as-you-go and enterprise agreement customers have quota. If you don't have quota for serverless GPUs in Azure Container Apps, [request a GPU quota](gpu-serverless-overview.md#request-serverless-gpu-quota).

### Step 3: Configure ingress

Configure ingress to allow external access to your Ollama container and enable API calls to your deployed models.

1. Select the **Ingress** tab.

1. Configure the following settings:
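If you prefer scripting, the ingress setup in this step has a rough Azure CLI equivalent. The following is a hypothetical sketch, not the article's method: the app and resource group names are placeholders, and 11434 is the port the Ollama container listens on by default.

```shell
# Hypothetical CLI equivalent of enabling external ingress in the portal.
# Replace the placeholder names with your values before running.
APP_NAME="<YOUR_APP_NAME>"
RESOURCE_GROUP="<YOUR_RESOURCE_GROUP>"
TARGET_PORT=11434   # Ollama's default listening port

# Guarded so the sketch is a no-op until the placeholders are filled in.
if [ "$APP_NAME" != "<YOUR_APP_NAME>" ]; then
  az containerapp ingress enable \
    --name "$APP_NAME" \
    --resource-group "$RESOURCE_GROUP" \
    --type external \
    --target-port "$TARGET_PORT"
fi
```

External ingress exposes the app publicly, which is what lets you call the model API from outside Azure later in this article.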
## Deploy and use your gpt-oss model

After creating your container app with GPU support and ingress, you're ready to pull and run the gpt-oss model.

### Step 1: Access your deployed application

1. Once your deployment is complete, select **Go to resource**.
### Step 2: Pull and run the model

> [!TIP]
> Console commands in the container app aren't counted as traffic for the container app to stay scaled out, so your application might scale in after a set period. If you want the container app to remain active for a longer duration, go to **Application** > **Scaling** and set the minimum replica count to 1, or increase the cooldown period duration. Remember to reset the minimum replica count to 0 when not in use to avoid ongoing billing.

1. In the Azure portal, select the **Monitoring** dropdown, and then select **Console**.
1. Test the model with a sample prompt:

    ```text
    Can you explain LLMs and recent developments in AI over the last few years?
    ```
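Taken together, the console steps in this section amount to pulling the model weights and starting an interactive session. The following is a sketch under the assumption that you chose the 20B model; substitute `gpt-oss:120b` if you deployed on an A100 profile.

```shell
# Sketch of the console session inside the container: download the model
# weights, then start an interactive chat with it. Assumes gpt-oss:20b.
MODEL="gpt-oss:20b"

# Guarded so the sketch only runs where the Ollama CLI is available.
if command -v ollama >/dev/null 2>&1; then
  ollama pull "$MODEL"   # download the model weights
  ollama run "$MODEL"    # start an interactive prompt session
fi
```

The first pull downloads several gigabytes of weights, so expect it to take a few minutes on the first run.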

You can interact with your deployed model by using REST API calls from your local machine.

1. Set the OLLAMA_URL environment variable:

    Make sure to replace the placeholder surrounded by `<>` with your value before running the following command.

    ```bash
    export OLLAMA_URL="<YOUR_APPLICATION_URL>"
    ```
### Make API calls
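As a sketch of such a call (assuming the `gpt-oss:20b` model and Ollama's `/api/generate` endpoint), a non-streaming request from your terminal looks like the following; `OLLAMA_URL` comes from the previous step.

```shell
# Non-streaming generate request against the deployed app. The endpoint and
# payload fields follow Ollama's REST API; gpt-oss:20b is assumed.
BODY='{"model": "gpt-oss:20b", "prompt": "Explain LLMs in one sentence.", "stream": false}'

# Guarded so the sketch is a no-op unless OLLAMA_URL is set.
if [ -n "${OLLAMA_URL:-}" ]; then
  curl -s "$OLLAMA_URL/api/generate" \
    -H "Content-Type: application/json" \
    -d "$BODY"
fi
```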

This curl request has streaming set to false, so it returns the fully generated response.

## Clean up resources

To avoid charges on your Azure subscription, clean up the resources you created in this article.

1. In the Azure portal, go to your resource group.

1. Select **Delete resource group**.

1. To confirm the delete operation, enter your resource group name.

1. Select **Delete**.

## Next steps

Now that you've successfully deployed a gpt-oss model, consider the following ways to further develop your application:

* **Add persistent storage**: Azure Container Apps is fully ephemeral and doesn't feature mounted storage by default. To persist your data and conversations, [add a volume mount to your container app](storage-mounts.md).
* **Explore other models**: Follow these same steps to run any model available in [Ollama's library](https://ollama.com/search).
* **Learn more about serverless GPUs**: Review the [Azure Container Apps serverless GPU documentation](gpu-serverless-overview.md) for advanced configuration options.