Commit 0e0aab5

Merge pull request #53796 from GraemeMalcolm/main
Updated GenAI Speech module
2 parents d83e287 + b6e0413 commit 0e0aab5

17 files changed: 177 additions & 145 deletions

learn-pr/wwl-data-ai/develop-generative-ai-audio-apps/1-introduction.yml

Lines changed: 4 additions & 4 deletions
@@ -3,10 +3,10 @@ uid: learn.wwl.develop-generative-ai-audio-apps.introduction
 title: Introduction
 metadata:
   title: Introduction
-  description: "Get started with audio-enabled generative AI models."
-  ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  description: "Get started with speech-capable generative AI models."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
 durationInMinutes: 1
 content: |

learn-pr/wwl-data-ai/develop-generative-ai-audio-apps/2-deploy-multimodal-model.yml

Lines changed: 6 additions & 6 deletions
@@ -1,12 +1,12 @@
 ### YamlMime:ModuleUnit
 uid: learn.wwl.develop-generative-ai-audio-apps.deploy-multimodal-models
-title: Deploy a multimodal model
+title: Choose a speech-capable model
 metadata:
-  title: Deploy a multimodal model
-  description: "Deploy a multimodal model that can respond to audio-based prompts."
-  ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  title: Choose a speech-capable model
+  description: "Choose a speech-capable model for your application requirements."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
 durationInMinutes: 3
 content: |
Lines changed: 7 additions & 8 deletions
@@ -1,14 +1,13 @@
 ### YamlMime:ModuleUnit
 uid: learn.wwl.develop-generative-ai-audio-apps.develop-audio-chat-apps
-title: Develop an audio-based chat app
+title: Transcribe speech
 metadata:
-  title: Develop an audio-based chat app
-  description: "Use Microsoft Foundry, Azure AI Model Inference, and Azure OpenAI SDKs to develop an audio-based chat app."
-  ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  title: Transcribe speech
+  description: "Use a generative AI model in Microsoft Foundry to transcribe speech."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
-durationInMinutes: 5
+durationInMinutes: 3
 content: |
   [!include[](includes/3-develop-audio-chat-app.md)]
-
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+### YamlMime:ModuleUnit
+uid: learn.wwl.develop-generative-ai-audio-apps.develop-speech-apps
+title: Synthesize speech
+metadata:
+  title: Synthesize speech
+  description: "Use a generative AI model in Microsoft Foundry to synthesize speech."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
+ms.topic: unit
+durationInMinutes: 3
+content: |
+  [!include[](includes/3b-develop-speech-app.md)]

learn-pr/wwl-data-ai/develop-generative-ai-audio-apps/4-exercise.yml

Lines changed: 2 additions & 2 deletions
@@ -5,8 +5,8 @@ metadata:
   title: Exercise - Develop an audio-enabled chat app
   description: "Get practical experience of deploying a multimodal model and creating an audio-enabled chat app."
   ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
 durationInMinutes: 30
 content: |

learn-pr/wwl-data-ai/develop-generative-ai-audio-apps/5-knowledge-check.yml

Lines changed: 27 additions & 39 deletions
@@ -3,46 +3,34 @@ uid: learn.wwl.develop-generative-ai-audio-apps.knowledge-check
 title: Module assessment
 metadata:
   title: Module assessment
-  description: "Check your learning on audio-enabled generative AI."
-  ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  description: "Check your learning on speech-capable generative AI."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
-durationInMinutes: 3
+durationInMinutes: 2
 content: |
 quiz:
   questions:
-  - content: "Which kind of model can you use to respond to audio input?"
-    choices:
-    - content: "Only OpenAI GPT models"
-      isCorrect: false
-      explanation: "Incorrect."
-    - content: "Embedding models"
-      isCorrect: false
-      explanation: "Incorrect."
-    - content: "Multimodal models"
-      isCorrect: true
-      explanation: "Correct."
-  - content: "How can you submit a prompt that asks a model to analyze an audio file?"
-    choices:
-    - content: "Submit one prompt with an audio-based message followed by another prompt with a text-based message."
-      isCorrect: false
-      explanation: "Incorrect."
-    - content: "Submit a prompt that contains a multi-part user message, containing both text content and audio content."
-      isCorrect: true
-      explanation: "Correct."
-    - content: "Submit the audio file as the system message and the instruction or question as the user message."
-      isCorrect: false
-      explanation: "Incorrect."
-  - content: "How can you include audio in a message?"
-    choices:
-    - content: "As a URL or as binary data"
-      isCorrect: true
-      explanation: "Correct."
-    - content: "Only as a URL"
-      isCorrect: false
-      explanation: "Incorrect."
-    - content: "Only as binary data"
-      isCorrect: false
-      explanation: "Incorrect."
-
+  - content: "Which model can you use to generate text from speech?"
+    choices:
+    - content: "gpt-4o-mini"
+      isCorrect: false
+      explanation: "Incorrect."
+    - content: "gpt-4o-mini-tts"
+      isCorrect: false
+      explanation: "Incorrect."
+    - content: "gpt-4o-mini-transcribe"
+      isCorrect: true
+      explanation: "Correct."
+  - content: "Which model can you use to synthesize speech from text?"
+    choices:
+    - content: "gpt-4o-mini"
+      isCorrect: false
+      explanation: "Incorrect."
+    - content: "gpt-4o-mini-tts"
+      isCorrect: true
+      explanation: "Correct."
+    - content: "gpt-4o-mini-transcribe"
+      isCorrect: false
+      explanation: "Incorrect."

learn-pr/wwl-data-ai/develop-generative-ai-audio-apps/6-summary.yml

Lines changed: 4 additions & 4 deletions
@@ -3,10 +3,10 @@ uid: learn.wwl.develop-generative-ai-audio-apps.summary
 title: Summary
 metadata:
   title: Summary
-  description: "Reflect on what you've learned about audio-enabled generative AI models."
-  ms.date: 05/6/2025
-  author: buzahid
-  ms.author: buzahid
+  description: "Reflect on what you've learned about speech-capable generative AI models."
+  ms.date: 03/11/2026
+  author: graememalcolm
+  ms.author: gmalc
 ms.topic: unit
 durationInMinutes: 1
 content: |
Lines changed: 6 additions & 2 deletions
@@ -1,4 +1,8 @@
-Generative AI models make it possible to build intelligent chat-based applications that can understand and reason over input. Traditionally, text input is the primary mode of interaction with AI models, but multimodal models are increasingly becoming available. These models make it possible for chat applications to respond to audio input as well as text.
+Speech transcription and synthesis are useful capabilities in many scenarios, including:
 
-In this module, we'll discuss audio-enabled generative AI and explore how you can use Microsoft Foundry to create generative AI solutions that respond to prompts that include a mix of text and audio data.
+- Documenting spoken conversations in calls and meetings.
+- Generating captions for videos or presentations.
+- Creating audible user interfaces to improve application accessibility.
+- Developing hands-free AI assistants that read text messages or emails aloud.
 
+In this module, we'll explore how to use speech-capable generative AI models in Microsoft Foundry to convert speech to text and text to speech.
Lines changed: 9 additions & 11 deletions
@@ -1,17 +1,15 @@
-To handle prompts that include audio, you need to deploy a *multimodal* generative AI model - in other words, a model that supports not only text-based input, but audio-based input as well. Multimodal models available in Microsoft Foundry include (among others):
+Microsoft Foundry Models is a model catalog that includes generative AI models from multiple providers. Different models have different capabilities and are optimized for different use cases.
 
-- Microsoft **Phi-4-multimodal-instruct**
-- OpenAI **gpt-4o**
-- OpenAI **gpt-4o-mini**
+To find a suitable model, you can use the filter and search features in the Microsoft Foundry portal.
 
-> [!TIP]
-> To learn more about available models in Microsoft Foundry, see the **[Model catalog and collections in Microsoft Foundry portal](/azure/ai-foundry/how-to/model-catalog-overview)** article in the Microsoft Foundry documentation.
-
-## Testing multimodal models with audio-based prompts
+![Screenshot of the model catalog in the Foundry portal.](../media/model-catalog.png)
 
-After deploying a multimodal model, you can test it in the chat playground in the Microsoft Foundry portal. Some models allow you to include audio attachments in the playground, either by uploading a file or recording a message.
+When it comes to speech-capable models, there are two common use cases to consider:
 
-![Screenshot of the chat playground with an audio-based prompt.](../media/audio-prompt.png)
+- Generative AI models that can transcribe speech to text.
+- Generative AI models that can synthesize speech from text.
 
-In the chat playground, you can upload a local audio file and add text to the message to elicit a response from a multimodal model.
+Microsoft Foundry provides models that support both of these use cases, including specialized speech-capable models from the **gpt-4o** family of OpenAI models.
 
+> [!TIP]
+> To learn more about available models in Microsoft Foundry, see the **[Microsoft Foundry Models overview](/azure/foundry/concepts/foundry-models-overview?azure-portal=true)** article in the Microsoft Foundry documentation.
Lines changed: 33 additions & 41 deletions
@@ -1,47 +1,39 @@
-To develop a client app that engages in audio-based chats with a multimodal model, you can use the same basic techniques used for text-based chats. You require a connection to the endpoint where the model is deployed, and you use that endpoint to submit prompts that consist of messages to the model and process the responses.
-
-The key difference is that prompts for an audio-based chat include multi-part user messages that contain both a *text* content item and an *audio* content item.
-
-![Diagram of a multi-part prompt being submitted to a model.](../media/multi-part-prompt.png)
-
-The JSON representation of a prompt that includes a multi-part user message looks something like this:
-
-```json
-{
-    "messages": [
-        { "role": "system", "content": "You are a helpful assistant." },
-        { "role": "user", "content": [
-            {
-                "type": "text",
-                "text": "Transcribe this audio:"
-            },
-            {
-                "type": "audio_url",
-                "audio_url": {
-                    "url": "https://....."
-                }
-            }
-        ] }
-    ]
-}
-```
+Speech transcription, or *speech-to-text*, involves submitting audio content to a model, which responds with a text-based transcript of the speech in the audio source.
 
-The audio content item can be:
+Models that support speech-to-text operations include:
 
-- A URL to an audio file in a web site.
-- Binary audio data
+- **gpt-4o-transcribe**
+- **gpt-4o-mini-transcribe**
+- **gpt-4o-transcribe-diarize**
 
-When using binary data to submit a local audio file, the **audio_url** content takes the form of a base64 encoded value in a data URL format:
+> [!NOTE]
+> Model availability varies by region. Review the **[model regional availability](/azure/foundry/foundry-models/concepts/models-sold-directly-by-azure?pivots=azure-openai#model-summary-table-and-region-availability&azure-portal=true)** table in the Microsoft Foundry documentation.
 
-```json
-{
-    "type": "audio_url",
-    "audio_url": {
-        "url": "data:audio/mp3;base64,<binary_audio_data>"
-    }
-}
-```
+## Using a speech-to-text model
+
+To use a speech-to-text model in your own application, you can use the **AzureOpenAI** client in the OpenAI SDK to connect to the endpoint for your Microsoft Foundry resource and upload the contents of an audio file to the model for transcription.
+
+```python
+from openai import AzureOpenAI
+from pathlib import Path
 
-Depending on the model type, and where you deployed it, you can use Microsoft Azure AI Model Inference or OpenAI APIs to submit audio-based prompts. These libraries also provide language-specific SDKs that abstract the underlying REST APIs.
+# Create an AzureOpenAI client
+client = AzureOpenAI(
+    azure_endpoint=YOUR_FOUNDRY_ENDPOINT,
+    api_key=YOUR_FOUNDRY_KEY,
+    api_version="2025-03-01-preview"
+)
 
-In the exercise that follows in this module, you can use the Python or .NET SDK for the Azure AI Model Inference API and the OpenAI API to develop an audio-enabled chat application.
+# Get the audio file
+file_path = Path("speech.mp3")
+audio_file = open(file_path, "rb")
+
+# Use the model to transcribe the audio file
+transcription = client.audio.transcriptions.create(
+    model=YOUR_MODEL_DEPLOYMENT,
+    file=audio_file,
+    response_format="text"
+)
+
+print(transcription)
+```
