Commit 0e0aab5

Merge pull request #53796 from GraemeMalcolm/main
Updated GenAI Speech module
2 parents d83e287 + b6e0413 commit 0e0aab5

17 files changed: 177 additions & 145 deletions

learn-pr/wwl-data-ai/develop-generative-ai-audio-apps/1-introduction.yml

Lines changed: 4 additions & 4 deletions
@@ -3,10 +3,10 @@ uid: learn.wwl.develop-generative-ai-audio-apps.introduction
 title: Introduction
 metadata:
   title: Introduction
-  description: "Get started with audio-enabled generative AI models."
-  ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  description: "Get started with speech-capable generative AI models."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
 durationInMinutes: 1
 content: |

learn-pr/wwl-data-ai/develop-generative-ai-audio-apps/2-deploy-multimodal-model.yml

Lines changed: 6 additions & 6 deletions
@@ -1,12 +1,12 @@
 ### YamlMime:ModuleUnit
 uid: learn.wwl.develop-generative-ai-audio-apps.deploy-multimodal-models
-title: Deploy a multimodal model
+title: Choose a speech-capable model
 metadata:
-  title: Deploy a multimodal model
-  description: "Deploy a multimodal model that can respond to audio-based prompts."
-  ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  title: Choose a speech-capable model
+  description: "Choose a speech-capable model for your application requirements."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
 durationInMinutes: 3
 content: |
Lines changed: 7 additions & 8 deletions
@@ -1,14 +1,13 @@
 ### YamlMime:ModuleUnit
 uid: learn.wwl.develop-generative-ai-audio-apps.develop-audio-chat-apps
-title: Develop an audio-based chat app
+title: Transcribe speech
 metadata:
-  title: Develop an audio-based chat app
-  description: "Use Microsoft Foundry, Azure AI Model Inference, and Azure OpenAI SDKs to develop an audio-based chat app."
-  ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  title: Transcribe speech
+  description: "Use a generative AI model in Microsoft Foundry to transcribe speech."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
-durationInMinutes: 5
+durationInMinutes: 3
 content: |
   [!include[](includes/3-develop-audio-chat-app.md)]
-
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+### YamlMime:ModuleUnit
+uid: learn.wwl.develop-generative-ai-audio-apps.develop-speech-apps
+title: Synthesize speech
+metadata:
+  title: Synthesize speech
+  description: "Use a generative AI model in Microsoft Foundry to synthesize speech."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
+ms.topic: unit
+durationInMinutes: 3
+content: |
+  [!include[](includes/3b-develop-speech-app.md)]

learn-pr/wwl-data-ai/develop-generative-ai-audio-apps/4-exercise.yml

Lines changed: 2 additions & 2 deletions
@@ -5,8 +5,8 @@ metadata:
   title: Exercise - Develop an audio-enabled chat app
   description: "Get practical experience of deploying a multimodal model and creating an audio-enabled chat app."
   ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
 durationInMinutes: 30
 content: |

learn-pr/wwl-data-ai/develop-generative-ai-audio-apps/5-knowledge-check.yml

Lines changed: 27 additions & 39 deletions
@@ -3,46 +3,34 @@ uid: learn.wwl.develop-generative-ai-audio-apps.knowledge-check
 title: Module assessment
 metadata:
   title: Module assessment
-  description: "Check your learning on audio-enabled generative AI."
-  ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  description: "Check your learning on speech-capable generative AI."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
-durationInMinutes: 3
+durationInMinutes: 2
 content: |
 quiz:
   questions:
-  - content: "Which kind of model can you use to respond to audio input?"
-    choices:
-    - content: "Only OpenAI GPT models"
-      isCorrect: false
-      explanation: "Incorrect."
-    - content: "Embedding models"
-      isCorrect: false
-      explanation: "Incorrect."
-    - content: "Multimodal models"
-      isCorrect: true
-      explanation: "Correct."
-  - content: "How can you submit a prompt that asks a model to analyze an audio file?"
-    choices:
-    - content: "Submit one prompt with an audio-based message followed by another prompt with a text-based message."
-      isCorrect: false
-      explanation: "Incorrect."
-    - content: "Submit a prompt that contains a multi-part user message, containing both text content and audio content."
-      isCorrect: true
-      explanation: "Correct."
-    - content: "Submit the audio file as the system message and the instruction or question as the user message."
-      isCorrect: false
-      explanation: "Incorrect."
-  - content: "How can you include audio in a message?"
-    choices:
-    - content: "As a URL or as binary data"
-      isCorrect: true
-      explanation: "Correct."
-    - content: "Only as a URL"
-      isCorrect: false
-      explanation: "Incorrect."
-    - content: "Only as binary data"
-      isCorrect: false
-      explanation: "Incorrect."
-
+  - content: "Which model can you use to generate text from speech?"
+    choices:
+    - content: "gpt-4o-mini"
+      isCorrect: false
+      explanation: "Incorrect."
+    - content: "gpt-4o-mini-tts"
+      isCorrect: false
+      explanation: "Incorrect."
+    - content: "gpt-4o-mini-transcribe"
+      isCorrect: true
+      explanation: "Correct."
+  - content: "Which model can you use to synthesize speech from text?"
+    choices:
+    - content: "gpt-4o-mini"
+      isCorrect: false
+      explanation: "Incorrect."
+    - content: "gpt-4o-mini-tts"
+      isCorrect: true
+      explanation: "Correct."
+    - content: "gpt-4o-mini-transcribe"
+      isCorrect: false
+      explanation: "Incorrect."

learn-pr/wwl-data-ai/develop-generative-ai-audio-apps/6-summary.yml

Lines changed: 4 additions & 4 deletions
@@ -3,10 +3,10 @@ uid: learn.wwl.develop-generative-ai-audio-apps.summary
 title: Summary
 metadata:
   title: Summary
-  description: "Reflect on what you've learned about audio-enabled generative AI models."
-  ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  description: "Reflect on what you've learned about speech-capable generative AI models."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
 durationInMinutes: 1
 content: |
Lines changed: 6 additions & 2 deletions
@@ -1,4 +1,8 @@
-Generative AI models make it possible to build intelligent chat-based applications that can understand and reason over input. Traditionally, text input is the primary mode of interaction with AI models, but multimodal models are increasingly becoming available. These models make it possible for chat applications to respond to audio input as well as text.
+Speech transcription and synthesis are useful capabilities in many scenarios, including:
 
-In this module, we'll discuss audio-enabled generative AI and explore how you can use Microsoft Foundry to create generative AI solutions that respond to prompts that include a mix of text and audio data.
+- Documenting spoken conversations in calls and meetings.
+- Generating captions for videos or presentations.
+- Creating audible user interfaces to improve application accessibility.
+- Developing hands-free AI assistants that read text messages or emails aloud.
 
+In this module, we'll explore how to use speech-capable generative AI models in Microsoft Foundry to convert speech to text and text to speech.
Lines changed: 9 additions & 11 deletions
@@ -1,17 +1,15 @@
-To handle prompts that include audio, you need to deploy a *multimodal* generative AI model - in other words, a model that supports not only text-based input, but audio-based input as well. Multimodal models available in Microsoft Foundry include (among others):
+Microsoft Foundry Models is a model catalog that includes generative AI models from multiple providers. Different models have different capabilities and are optimized for different use cases.
 
-- Microsoft **Phi-4-multimodal-instruct**
-- OpenAI **gpt-4o**
-- OpenAI **gpt-4o-mini**
+To find a suitable model, you can use the filter and search features in the Microsoft Foundry portal.
 
-> [!TIP]
-> To learn more about available models in Microsoft Foundry, see the **[Model catalog and collections in Microsoft Foundry portal](/azure/ai-foundry/how-to/model-catalog-overview)** article in the Microsoft Foundry documentation.
-
-## Testing multimodal models with audio-based prompts
+![Screenshot of the model catalog in the Foundry portal.](../media/model-catalog.png)
 
-After deploying a multimodal model, you can test it in the chat playground in the Microsoft Foundry portal. Some models allow you to include audio attachments in the playground, either by uploading a file or recording a message.
+When it comes to speech-capable models, there are two common use cases to consider:
 
-![Screenshot of the chat playground with an audio-based prompt.](../media/audio-prompt.png)
+- Generative AI models that can transcribe speech to text.
+- Generative AI models that can synthesize speech from text.
 
-In the chat playground, you can upload a local audio file and add text to the message to elicit a response from a multimodal model.
+Microsoft Foundry provides models that support both of these use cases, including specialized speech-capable models from the **gpt-4o** family of OpenAI models.
 
+> [!TIP]
+> To learn more about available models in Microsoft Foundry, see the **[Microsoft Foundry Models overview](/azure/foundry/concepts/foundry-models-overview?azure-portal=true)** article in the Microsoft Foundry documentation.
Lines changed: 33 additions & 41 deletions
@@ -1,47 +1,39 @@
-To develop a client app that engages in audio-based chats with a multimodal model, you can use the same basic techniques used for text-based chats. You require a connection to the endpoint where the model is deployed, and you use that endpoint to submit prompts that consist of messages to the model and process the responses.
-
-The key difference is that prompts for an audio-based chat include multi-part user messages that contain both a *text* content item and an *audio* content item.
-
-![Diagram of a multi-part prompt being submitted to a model.](../media/multi-part-prompt.png)
-
-The JSON representation of a prompt that includes a multi-part user message looks something like this:
-
-```json
-{
-    "messages": [
-        { "role": "system", "content": "You are a helpful assistant." },
-        { "role": "user", "content": [
-            {
-                "type": "text",
-                "text": "Transcribe this audio:"
-            },
-            {
-                "type": "audio_url",
-                "audio_url": {
-                    "url": "https://....."
-                }
-            }
-        ] }
-    ]
-}
-```
+Speech transcription, or *speech-to-text*, involves submitting audio content to a model, which responds with a text-based transcript of the speech in the audio source.
 
-The audio content item can be:
+Models that support speech-to-text operations include:
 
-- A URL to an audio file in a web site.
-- Binary audio data
+- **gpt-4o-transcribe**
+- **gpt-4o-mini-transcribe**
+- **gpt-4o-transcribe-diarize**
 
-When using binary data to submit a local audio file, the **audio_url** content takes the form of a base64 encoded value in a data URL format:
+> [!NOTE]
+> Model availability varies by region. Review the **[model regional availability](/azure/foundry/foundry-models/concepts/models-sold-directly-by-azure?pivots=azure-openai#model-summary-table-and-region-availability&azure-portal=true)** table in the Microsoft Foundry documentation.
 
-```json
-{
-    "type": "audio_url",
-    "audio_url": {
-        "url": "data:audio/mp3;base64,<binary_audio_data>"
-    }
-}
-```
+## Using a speech-to-text model
+
+To use a speech-to-text model in your own application, you can use the **AzureOpenAI** client in the OpenAI SDK to connect to the endpoint for your Microsoft Foundry resource and upload the contents of an audio file to the model for transcription.
+
+```python
+from openai import AzureOpenAI
+from pathlib import Path
 
-Depending on the model type, and where you deployed it, you can use Microsoft Azure AI Model Inference or OpenAI APIs to submit audio-based prompts. These libraries also provide language-specific SDKs that abstract the underlying REST APIs.
+# Create an AzureOpenAI client
+client = AzureOpenAI(
+    azure_endpoint=YOUR_FOUNDRY_ENDPOINT,
+    api_key=YOUR_FOUNDRY_KEY,
+    api_version="2025-03-01-preview"
+)
 
-In the exercise that follows in this module, you can use the Python or .NET SDK for the Azure AI Model Inference API and the OpenAI API to develop an audio-enabled chat application.
+# Get the audio file
+file_path = Path("speech.mp3")
+audio_file = open(file_path, "rb")
+
+# Use the model to transcribe the audio file
+transcription = client.audio.transcriptions.create(
+    model=YOUR_MODEL_DEPLOYMENT,
+    file=audio_file,
+    response_format="text"
+)
+
+print(transcription)
+```
