::: zone pivot="video"

>[!VIDEO https://learn-video.azurefd.net/vod/player?id=70c30087-2c80-45bd-a8e8-17f691bd1c95]

::: zone-end

::: zone pivot="text"

**Speech recognition**, often called **speech-to-text (STT)**, is an AI capability that enables apps and agents to respond to spoken input. Speech recognition takes the spoken word and converts it into data, usually text. Speech-to-text software typically uses multiple models, including:

- An *acoustic* model that converts the audio into phonemes (representations of specific sounds).
- A *language* model that maps phonemes to words.

The words AI speech recognizes are converted to text. You can use the text for various purposes, such as providing closed captions, creating call transcripts, automating note dictation, and much more.
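To make the two-model pipeline concrete, here's a toy sketch. This is purely illustrative: real acoustic and language models are statistical or neural models, and the phoneme symbols and word table below are invented for this example. The "acoustic" step is assumed to have already turned audio into phonemes; the tiny "language model" then maps phoneme groups to words.

```python
# Toy illustration of the two-model pipeline described above.
# The phoneme symbols and lookup table are invented for illustration.

# "Acoustic model" output: pretend audio was already converted to phonemes,
# with " " marking a pause between words.
phonemes = ["HH", "EH", "L", "OW", " ", "W", "ER", "L", "D"]

# Tiny "language model": maps phoneme sequences to the most likely word.
phoneme_to_word = {
    ("HH", "EH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

def transcribe(phonemes):
    """Group phonemes at pauses, then look each group up as a word."""
    words, group = [], []
    for p in phonemes + [" "]:  # trailing pause flushes the last group
        if p == " ":
            if group:
                words.append(phoneme_to_word.get(tuple(group), "?"))
                group = []
        else:
            group.append(p)
    return " ".join(words)

print(transcribe(phonemes))  # hello world
```

A real system scores many candidate words per phoneme sequence and picks the most probable one in context, but the shape of the pipeline (sounds to phonemes to words) is the same.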
## Azure Speech - Speech to Text

**Azure Speech** includes a **speech-to-text API** that you can use to process voice input from a microphone or audio file.

>[!NOTE]
>An *API* (Application Programming Interface) is a set of rules and endpoints that allows one software application to communicate with and use the functionality or data of another application.

**Microsoft Foundry** is a Microsoft platform that helps developers build, test, and deploy AI applications and agents by bringing together models, tools, data, and services in one place.

In the *new Microsoft Foundry portal*, you can explore Azure Speech's speech-to-text capabilities in the *Foundry playground*. To get to the playground, navigate to the *Build* page, then to *Models*, then to the *AI services* tab. On the tab, you can find a selection of AI services available for testing, including *Azure Speech - Speech to Text*.

In the playground, you can either upload an audio file or record yourself speaking. Azure Speech transcribes what is said, giving you a feel for how your own application would respond to audio input.

:::image type="content" source="../media/speech-to-text-playground.png" alt-text="Screenshot of speech-to-text in the Foundry playground." lightbox="../media/speech-to-text-playground.png":::

The playground in the Foundry portal is a great place to experiment with Azure Speech, but to use speech-to-text in an application, you need to write some code.
## Using the Azure speech-to-text SDK

The **Azure Speech – Speech-to-Text SDK** is a client library that lets applications convert spoken audio into written text. The speech-to-text SDK is designed to make speech recognition easy to add to applications.

>[!NOTE]
>A client library is a set of ready-made code that developers can use in their application to easily talk to a service or API.

The SDK enables your application to:

- Capture or send audio from a microphone, audio file, or audio stream
- Send that audio to Azure Speech securely
- Receive transcribed text in near real time or after processing completes

The SDK handles networking, authentication, audio streaming, and response parsing so developers can focus on application logic.
| 47 | + |
| 48 | +## Developing an application |
| 49 | + |
| 50 | +The Speech-to-Text SDK is typically used in the client or service layer of an application. The SDK acts as the bridge between your application code and the Azure Speech service. |
| 51 | + |
| 52 | +To use the Azure Speech Python SDK, you need to have compatible version of Python and the Azure Speech Python SDK installed. |
| 53 | + |
| 54 | +The Python SDK can be installed in the Visual Studio Code *terminal* using: |
| 55 | + |
| 56 | +```bash |
| 57 | +pip install azure-cognitiveservices-speech |
| 58 | +``` |
| 59 | + |
>[!NOTE]
> Application code is written in *code editors*, such as Visual Studio Code. A code editor's *terminal* is a built-in command-line window inside the editor where you can run commands without leaving your development environment.

To use Azure Speech, you also need to create a Foundry resource. The Foundry resource endpoint and key are used in your code to authenticate your connection.

After you install the Python SDK and create a Foundry resource, you can create and run your program. Consider the following Python code. When you run it:

1. **Your app initializes the Speech SDK**: Provides an endpoint and authentication (key or Microsoft Entra ID)
2. **Audio is captured or loaded**: Microphone input or an audio file/stream
3. **Audio is sent to Azure Speech**: The SDK streams or uploads audio securely
4. **Speech recognition runs in the cloud**: Azure's speech models analyze the audio
5. **Text results are returned**: Your app receives recognized text and optional metadata
```python
import azure.cognitiveservices.speech as speechsdk

# Set up the speech config using resource endpoint
endpoint_url = "ENDPOINT"
speech_key = "FOUNDRY_KEY"

speech_config = speechsdk.SpeechConfig(
    subscription=speech_key,
    endpoint=endpoint_url
)

# Create a recognizer with microphone input
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

# Event handlers
def recognized_handler(evt):
    print(f"Recognized: {evt.result.text}")

def recognizing_handler(evt):
    print(f"Recognizing: {evt.result.text}")

# Connect event handlers
speech_recognizer.recognized.connect(recognized_handler)
speech_recognizer.recognizing.connect(recognizing_handler)

# Start continuous recognition
speech_recognizer.start_continuous_recognition()
print("Say something...")

# Keep the program running
input("Press Enter to stop...")
speech_recognizer.stop_continuous_recognition()
```
#### Client app example

For example, let's say you want to develop a lightweight app that automatically transcribes voicemail messages. In the code editor, you have one audio file containing a voicemail recording and one Python file containing the application code.

To transcribe the message, start by specifying the endpoint, the key, and the audio source you want to transcribe. Then use a `SpeechRecognizer` object to perform the transcription before displaying the results.
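A minimal sketch of that voicemail transcriber might look like the following. The file name `voicemail.wav` and the `ENDPOINT` and `FOUNDRY_KEY` placeholders are assumptions for this example; the only changes from the microphone version are that `AudioConfig` points at a file and `recognize_once()` is used, which transcribes a single short utterance and suits a brief voicemail clip.

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder values; replace with your Foundry resource details.
endpoint_url = "ENDPOINT"
speech_key = "FOUNDRY_KEY"

speech_config = speechsdk.SpeechConfig(
    subscription=speech_key,
    endpoint=endpoint_url
)

# Use an audio file (the voicemail recording) instead of the microphone.
audio_config = speechsdk.audio.AudioConfig(filename="voicemail.wav")
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

# recognize_once() returns after the first utterance is transcribed.
result = speech_recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Transcription: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized.")
else:
    print(f"Recognition ended: {result.reason}")
```

For longer recordings with multiple utterances, you would switch back to continuous recognition as shown earlier.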
:::image type="content" source="../media/speech-to-text-python.png" alt-text="Screenshot of speech-to-text python code in Visual Studio Code." lightbox="../media/speech-to-text-python.png":::

Once you run the code, you can see the transcription text.

#### Audio processing options

You can use Azure Speech's speech-to-text API to perform real-time or batch transcription of audio into a text format. The audio source for transcription can be a real-time audio stream from a microphone or an audio file.

**Real-time transcription**: Real-time speech to text allows you to transcribe audio streams to text. You can use real-time transcription for presentations, demos, or any other scenario where a person is speaking.

For real-time transcription to work, your application needs to listen for incoming audio from a microphone or another audio input source, such as an audio file. Your application code streams the audio to the service, which returns the transcribed text.

**Batch transcription**: Not all speech to text scenarios are real time. You might have audio recordings stored on a file share, a remote server, or even on Azure storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results.

Batch transcription should be run in an asynchronous manner because batch jobs are scheduled on a *best-effort basis*. Normally a job starts executing within minutes of the request, but there's no estimate for when a job changes into the running state.
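To sketch what submitting a batch job involves: your code sends a JSON request body to the Speech to text REST API's transcriptions endpoint, then polls the returned job URL until the results are ready. The helper below only builds that request body; the SAS URI is a placeholder, and the exact endpoint version and optional fields should be checked against the current REST API reference.

```python
# Sketch of building a batch transcription request body.
# The job is submitted with a POST to an endpoint of the form
#   https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions
# (verify the version and region against the Speech REST API reference).

def build_batch_request(sas_uris, locale="en-US", name="voicemail-batch"):
    """Build the JSON body for a batch transcription job."""
    return {
        "contentUrls": list(sas_uris),  # SAS URIs of the audio files
        "locale": locale,               # language of the recordings
        "displayName": name,            # friendly name for the job
    }

# Placeholder SAS URI for illustration only.
body = build_batch_request(
    ["https://example.blob.core.windows.net/audio/msg1.wav?sv=..."]
)
print(sorted(body.keys()))  # ['contentUrls', 'displayName', 'locale']
```

Because the job runs asynchronously, the response to the POST contains a transcription URL that your code polls until the job's status reaches `Succeeded`, at which point the result files can be downloaded.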
Speech recognition in Azure Speech is a great way to build solutions that transcribe recorded audio or automate speech captioning. Next, learn how to incorporate speech synthesis into an application.

::: zone-end