
Commit 47bf42f

Merge pull request #53509 from sherzyang/NEW-get-started-speech-azure
New get started speech azure
2 parents 14116fe + 739250c commit 47bf42f

26 files changed

Lines changed: 559 additions & 0 deletions
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-speech-in-azure.introduction
title: Introduction
metadata:
  title: Introduction
  description: "Introduction"
  ms.date: 02/13/2026
  author: wwlpublish
  ms.author: sheryang
  ms.topic: unit
  zone_pivot_groups: video-or-text
durationInMinutes: 1
content: |
  [!include[](includes/1-introduction.md)]

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-speech-in-azure.speech-recognition
title: Speech recognition
metadata:
  title: Speech recognition
  description: "Speech recognition"
  ms.date: 02/13/2026
  author: wwlpublish
  ms.author: sheryang
  ms.topic: unit
  zone_pivot_groups: video-or-text
durationInMinutes: 8
content: |
  [!include[](includes/2-speech-recognition.md)]

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-speech-in-azure.speech-synthesis
title: Speech synthesis
metadata:
  title: Speech synthesis
  description: "Speech synthesis"
  ms.date: 02/13/2026
  author: wwlpublish
  ms.author: sheryang
  ms.topic: unit
  zone_pivot_groups: video-or-text
durationInMinutes: 5
content: |
  [!include[](includes/3-speech-synthesis.md)]

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-speech-in-azure.voice-live
title: Creating a speech-capable agent
metadata:
  title: Creating a speech-capable agent
  description: "Creating a speech-capable agent."
  ms.date: 02/13/2026
  author: wwlpublish
  ms.author: sheryang
  ms.topic: unit
  zone_pivot_groups: video-or-text
durationInMinutes: 4
content: |
  [!include[](includes/4-voice-live.md)]

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-speech-in-azure.exercise
title: Exercise - Get started with speech in Microsoft Foundry
metadata:
  title: Exercise - Get started with speech in Microsoft Foundry
  description: "Exercise - Get started with speech in Microsoft Foundry"
  ms.date: 02/13/2026
  author: sherzyang
  ms.author: sheryang
  ms.topic: unit
  ms.custom:
  - N/A
durationInMinutes: 30
content: |
  [!include[](includes/5-exercise.md)]

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-speech-in-azure.knowledge-check
title: Module assessment
metadata:
  title: Module assessment
  description: "Knowledge check"
  ms.date: 02/13/2026
  author: wwlpublish
  ms.author: sheryang
  ms.topic: unit
  module_assessment: true
durationInMinutes: 3
quiz:
  title: ""
  questions:
  - content: "Why would a developer use the Azure Speech-to-Text SDK instead of only using the Foundry playground?"
    choices:
    - content: "The SDK replaces the need for Azure Speech models."
      isCorrect: false
      explanation: "The SDK does not replace Azure Speech models; it acts as a bridge between application code and the Azure Speech service."
    - content: "The SDK is required to upload audio files to the Foundry portal."
      isCorrect: false
      explanation: "The Foundry playground already allows uploading or recording audio without writing code."
    - content: "The SDK allows speech recognition to be added directly into application code."
      isCorrect: true
      explanation: "The Speech-to-Text SDK is designed for use in applications, handling tasks like audio streaming, authentication, and receiving transcription results in code."
  - content: "What does the Azure Text-to-Speech SDK handle for developers?"
    choices:
    - content: "Only selecting the voice and writing audio files manually"
      isCorrect: false
      explanation: "The SDK does more than voice selection; it manages the full process of sending text and receiving audio."
    - content: "Authentication, network communication, and audio generation"
      isCorrect: true
      explanation: "The Text-to-Speech SDK handles authentication, network communication, audio formatting, and playback, allowing developers to focus on application logic."
    - content: "Storing synthesized audio permanently in Azure Storage"
      isCorrect: false
      explanation: "The SDK can play or save audio, but it does not automatically store audio permanently in Azure Storage."
  - content: "What role does the Voice Live Python SDK (azure-ai-voicelive) play in a voice-enabled agent?"
    choices:
    - content: "It stores audio recordings permanently in Azure Storage"
      isCorrect: false
      explanation: "The SDK focuses on real-time interaction, not long-term audio storage."
    - content: "It replaces the need for microphones and speakers on the user's device"
      isCorrect: false
      explanation: "The SDK still relies on the user's audio hardware, such as microphones and speakers."
    - content: "It opens a real-time connection, streams audio, and handles spoken responses and interruptions"
      isCorrect: true
      explanation: "The Voice Live SDK opens a WebSocket session, streams microphone audio to the service, receives spoken responses, and supports interruptions for natural conversations."

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-speech-in-azure.summary
title: Summary
metadata:
  title: Summary
  description: "Summary"
  ms.date: 02/13/2026
  author: wwlpublish
  ms.author: sheryang
  ms.topic: unit
  zone_pivot_groups: video-or-text
durationInMinutes: 1
content: |
  [!include[](includes/7-summary.md)]

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
::: zone pivot="video"

>[!VIDEO https://learn-video.azurefd.net/vod/player?id=b39c7047-52b8-41ef-bf80-64fee2785023]

::: zone-end

::: zone pivot="text"

**AI speech** capabilities enable us to manage systems with voice instructions, get answers to spoken questions, generate captions from audio, and much more. Voice-based interfaces provide a more natural way to engage with AI software. The ability to interact through spoken language can increase the accessibility and inclusiveness of applications and agents.

To enable this kind of interaction, the AI system must support at least two capabilities:

- **Speech recognition**: the ability to detect and interpret spoken input
- **Speech synthesis**: the ability to generate spoken output

Examples of these capabilities include:

- **Clinical dictation and note-taking in healthcare**: Doctors can dictate patient notes aloud during or after appointments. An AI speech app converts the audio into accurate medical text, reducing manual typing and saving time.

- **Call transcription in customer support**: Contact centers transcribe customer calls in real time, making it easier to review conversations, detect issues, and analyze sentiment.

- **Automated captioning in media and entertainment**: Video platforms generate live or recorded captions for shows and streams, improving accessibility and supporting multilingual audiences.

- **Language learning and pronunciation feedback in education**: Learning apps use AI speech capabilities to listen to students speak and provide pronunciation feedback, helping learners practice and improve spoken language skills.

- **Voice-enabled assistants in retail and e-commerce**: Virtual shopping assistants use speech recognition to understand spoken customer requests and text-to-speech to respond with product information or order status.

**Azure Speech in Microsoft Foundry Tools** provides speech-to-text, text-to-speech, and speech translation capabilities through speech recognition and synthesis. You can use prebuilt and custom Speech service models for a variety of tasks, from transcribing audio to text with high accuracy, to identifying speakers in conversations, creating custom voices, and more. Next, learn how to incorporate speech recognition into an application with Azure Speech.

::: zone-end

> [!NOTE]
> We recognize that different people like to learn in different ways. You can choose to complete this module in video-based format or you can read the content as text and images. The text contains greater detail than the videos, so in some cases you might want to refer to it as supplemental material to the video presentation.

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
::: zone pivot="video"

>[!VIDEO https://learn-video.azurefd.net/vod/player?id=70c30087-2c80-45bd-a8e8-17f691bd1c95]

::: zone-end

::: zone pivot="text"

**Speech recognition**, often called **speech-to-text (STT)**, is an AI capability that enables apps and agents to respond to spoken input. Speech recognition takes the spoken word and converts it into data, usually text. Speech-to-text software typically uses multiple models, including:

- An *acoustic* model that converts the audio into phonemes (representations of specific sounds).
- A *language* model that maps phonemes to words.

The words AI speech recognizes are converted to text. You can use the text for various purposes, such as providing closed captions, creating call transcripts, automating note dictation, and much more.

## Azure Speech - Speech to Text

**Azure Speech** includes a **speech-to-text API** that you can use to process voice input from a microphone or audio file.

>[!NOTE]
>An *API* (Application Programming Interface) is a set of rules and endpoints that allows one software application to communicate with and use the functionality or data of another application.

**Microsoft Foundry** is a Microsoft platform that helps developers build, test, and deploy AI applications and agents by bringing together models, tools, data, and services in one place.

In the *new Microsoft Foundry portal*, you can explore Azure Speech's speech-to-text capabilities in the *Foundry playground*. To get to the playground, navigate to the *Build* page, then to *Models*, then to the *AI services* tab. There you can find a selection of AI services available for testing, including *Azure Speech - Speech to Text*.

In the playground, you can either upload an audio file or record yourself speaking. Azure Speech transcribes what is said, giving you a feel for how your own application would respond to audio input.

:::image type="content" source="../media/speech-to-text-playground.png" alt-text="Screenshot of speech-to-text in the Foundry playground." lightbox="../media/speech-to-text-playground.png":::

The playground in the Foundry portal is a great place to experiment with Azure Speech, but to use speech-to-text in an application, we need to write some code.

## Using the Azure Speech-to-Text SDK

The **Azure Speech - Speech-to-Text SDK** is a client library that lets applications convert spoken audio into written text. The speech-to-text SDK is designed to make speech recognition easy to add to applications.

>[!NOTE]
>A client library is a set of ready-made code that developers can use in their application to easily talk to a service or API.

The SDK enables your application to:

- Capture or send audio from a microphone, audio file, or audio stream
- Send that audio to Azure Speech securely
- Receive transcribed text in near real time or after processing completes

The SDK handles networking, authentication, audio streaming, and response parsing so developers can focus on application logic.

## Developing an application

The Speech-to-Text SDK is typically used in the client or service layer of an application. The SDK acts as the bridge between your application code and the Azure Speech service.

To use the Azure Speech Python SDK, you need a compatible version of Python and the Azure Speech Python SDK installed.

The Python SDK can be installed in the Visual Studio Code *terminal* using:

```bash
pip install azure-cognitiveservices-speech
```

>[!NOTE]
> Application code is written in *code editors*, such as Visual Studio Code. A code editor's *terminal* is a built-in command-line window inside the editor where you can run commands without leaving your development environment.

To use Azure Speech, you also need to create a Foundry resource. The Foundry resource endpoint and key are used in your code to authenticate your connection.

After you install the Python SDK and create a Foundry resource, you can create and run your program. Consider the following Python code. When you run it:

1. **Your app initializes the Speech SDK**: Provides an endpoint and authentication (key or Microsoft Entra ID)
2. **Audio is captured or loaded**: Microphone input or an audio file/stream
3. **Audio is sent to Azure Speech**: The SDK streams or uploads audio securely
4. **Speech recognition runs in the cloud**: Azure's speech models analyze the audio
5. **Text results are returned**: Your app receives recognized text and optional metadata

```python
import azure.cognitiveservices.speech as speechsdk

# Set up the speech config using the resource endpoint and key
endpoint_url = "ENDPOINT"
speech_key = "FOUNDRY_KEY"

speech_config = speechsdk.SpeechConfig(
    subscription=speech_key,
    endpoint=endpoint_url
)

# Create a recognizer with microphone input
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

# Event handlers
def recognized_handler(evt):
    print(f"Recognized: {evt.result.text}")

def recognizing_handler(evt):
    print(f"Recognizing: {evt.result.text}")

# Connect event handlers
speech_recognizer.recognized.connect(recognized_handler)
speech_recognizer.recognizing.connect(recognizing_handler)

# Start continuous recognition
speech_recognizer.start_continuous_recognition()
print("Say something...")

# Keep the program running
input("Press Enter to stop...")
speech_recognizer.stop_continuous_recognition()
```

#### Client app example

For example, let's say you want to develop a lightweight app that automatically transcribes voicemail messages. In the code editor, we have one audio file and one Python file that contains the application code.

![Screenshot of Visual Studio Code with an audio file open.](../media/audio-file-in-vs-code.png)

To transcribe the voicemail message, start by specifying the endpoint, the key, and the audio source you want to transcribe. Then use a `SpeechRecognizer` object to perform the transcription before displaying the results.

:::image type="content" source="../media/speech-to-text-python.png" alt-text="Screenshot of speech-to-text Python code in Visual Studio Code." lightbox="../media/speech-to-text-python.png":::

When you run the code, you can see the transcription text.

![Screenshot of Visual Studio Code with the terminal open and the results of speech-to-text.](../media/language-client-1.png)

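A file-based version of this pattern can be sketched as follows. This is a minimal, hypothetical sketch, not the exact code in the screenshots: the file name `voicemail.wav`, the `ENDPOINT` and `FOUNDRY_KEY` placeholders, and the `describe_result` helper are assumptions. It uses `recognize_once`, which returns after the first recognized utterance and so suits a short clip like a voicemail.

```python
def describe_result(reason_name, text, error_details=None):
    """Map a recognition outcome to a display string.
    reason_name mirrors the SDK's ResultReason member names."""
    if reason_name == "RecognizedSpeech":
        return f"Transcription: {text}"
    if reason_name == "NoMatch":
        return "No speech could be recognized."
    return f"Recognition canceled: {error_details or 'unknown error'}"


def transcribe_file(audio_path, endpoint_url, speech_key):
    """Transcribe a single short audio file with the Azure Speech SDK."""
    # Imported here so describe_result stays usable without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(
        subscription=speech_key,
        endpoint=endpoint_url
    )
    # Read audio from a file instead of the default microphone
    audio_config = speechsdk.audio.AudioConfig(filename=audio_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config,
        audio_config=audio_config
    )

    # recognize_once returns after the first utterance completes
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return describe_result("RecognizedSpeech", result.text)
    if result.reason == speechsdk.ResultReason.NoMatch:
        return describe_result("NoMatch", "")
    return describe_result("Canceled", "", result.cancellation_details.error_details)


if __name__ == "__main__":
    print(transcribe_file("voicemail.wav", "ENDPOINT", "FOUNDRY_KEY"))
```

Separating the outcome handling into `describe_result` keeps the SDK call itself small and makes the NoMatch and Canceled cases explicit, which is easy to miss when you only handle the happy path.
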
#### Audio processing options

You can use Azure Speech's speech-to-text API to perform real-time or batch transcription of audio into a text format. The audio source for transcription can be a real-time audio stream from a microphone or an audio file.

**Real-time transcription**: Real-time speech to text allows you to transcribe audio streams to text. You can use real-time transcription for presentations, demos, or any other scenario where a person is speaking.

In order for real-time transcription to work, your application needs to be listening for incoming audio from a microphone, or another audio input source such as an audio file. Your application code streams the audio to the service, which returns the transcribed text.

**Batch transcription**: Not all speech to text scenarios are real time. You might have audio recordings stored on a file share, a remote server, or even in Azure Storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results.

Batch transcription should be run asynchronously because the batch jobs are scheduled on a *best-effort basis*. Normally a job starts executing within minutes of the request, but there's no estimate for when a job changes into the running state.

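One way to submit such a job is through the Speech service's batch transcription REST API. The sketch below is an assumption-laden illustration, not code from this module: the `speechtotext/v3.1/transcriptions` path, the endpoint, key, and SAS URL are placeholders, and only the job-creation request is shown; a real application would then poll the returned job URL until the transcription succeeds.

```python
import json
import urllib.request


def build_transcription_request(endpoint, key, content_urls,
                                locale="en-US", display_name="voicemail batch"):
    """Build the POST request that creates a batch transcription job.
    content_urls are SAS URIs pointing at the audio files to transcribe."""
    body = {
        "contentUrls": content_urls,
        "locale": locale,
        "displayName": display_name,
    }
    return urllib.request.Request(
        url=f"{endpoint}/speechtotext/v3.1/transcriptions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder endpoint, key, and SAS URL; replace with your own values.
    req = build_transcription_request(
        "https://YOUR_REGION.api.cognitive.microsoft.com",
        "FOUNDRY_KEY",
        ["https://yourstorage.blob.core.windows.net/audio/voicemail.wav?SAS_TOKEN"],
    )
    with urllib.request.urlopen(req) as resp:
        job = json.load(resp)
        # The job runs asynchronously; poll the job URL until it reports "Succeeded".
        print(job.get("self"), job.get("status"))
```

Because the job is queued on a best-effort basis, the create call returns immediately with a job resource; the results become available only after polling shows the job has finished.
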
Speech recognition in Azure Speech is a great way to build solutions that transcribe recorded audio or automate speech captioning. Next, learn how to incorporate speech synthesis into an application.

::: zone-end
