
Commit a8668df

Merge pull request #53571 from sherzyang/NEW-get-started-vision-azure
Add new module.
2 parents c39b22a + 66c0136 commit a8668df

24 files changed

Lines changed: 595 additions & 0 deletions
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-computer-vision-in-azure.introduction
title: Introduction
metadata:
  title: Introduction
  description: Introduction
  author: wwlpublish
  ms.author: sheryang
  ms.date: 02/17/2026
  ms.topic: unit
  zone_pivot_groups: video-or-text
durationInMinutes: 1
content: |
  [!include[](includes/1-introduction.md)]

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-computer-vision-in-azure.vision-enabled-models
title: Multimodal models for image analysis
metadata:
  title: Multimodal models for image analysis
  description: Multimodal models for image analysis
  author: wwlpublish
  ms.author: sheryang
  ms.date: 02/17/2026
  ms.topic: unit
  zone_pivot_groups: video-or-text
durationInMinutes: 7
content: |
  [!include[](includes/2-vision-enabled-models.md)]

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-computer-vision-in-azure.image-generation
title: Image generation models
metadata:
  title: Image generation models
  description: Understand image generation capabilities
  author: wwlpublish
  ms.author: sheryang
  ms.date: 02/17/2026
  ms.topic: unit
  zone_pivot_groups: video-or-text
durationInMinutes: 4
content: |
  [!include[](includes/3-image-generation.md)]

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-computer-vision-in-azure.video-generation
title: Video generation models
metadata:
  title: Video generation models
  description: Understand video generation capabilities
  author: wwlpublish
  ms.author: sheryang
  ms.date: 02/17/2026
  ms.topic: unit
  zone_pivot_groups: video-or-text
durationInMinutes: 3
content: |
  [!include[](includes/4-video-generation.md)]

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-computer-vision-in-azure.exercise
title: Exercise - Get started with computer vision in Microsoft Foundry
metadata:
  title: Exercise - Get started with computer vision in Microsoft Foundry
  description: Exercise - Get started with computer vision in Microsoft Foundry
  author: graememalcolm
  ms.author: gmalc
  ms.date: 02/17/2026
  ms.topic: unit
durationInMinutes: 30
content: |
  [!include[](includes/5-exercise.md)]

Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-computer-vision-in-azure.knowledge-check
title: Module assessment
metadata:
  title: Module assessment
  description: Knowledge check
  author: wwlpublish
  ms.author: sheryang
  ms.date: 02/17/2026
  ms.topic: unit
durationInMinutes: 4
quiz:
  title: ""
  questions:
  - content: "What is a multimodal model?"
    choices:
    - content: "A model that can only process images but not text."
      isCorrect: false
      explanation: "Multimodal models are designed to handle multiple types of input at the same time, such as text and images."
    - content: "A model that can understand and work with more than one type of data, such as text and images."
      isCorrect: true
      explanation: "Multimodal models are designed to handle multiple types of input at the same time, such as text and images."
    - content: "A model that generates video content only."
      isCorrect: false
      explanation: "Multimodal models are designed to handle multiple types of input at the same time, such as text and images."
  - content: "How can developers programmatically generate images using Foundry image generation models?"
    choices:
    - content: "By sending text prompts through the OpenAI Responses API using a deployed image model."
      isCorrect: true
      explanation: "Developers can submit text prompts and retrieve generated images from Foundry models using the OpenAI Responses API."
    - content: "By uploading images through the Foundry Playground UI."
      isCorrect: false
      explanation: "The Foundry Playground UI is for interacting with models, not for programmatically generating images."
    - content: "By calling the GPT-4.1 model endpoint."
      isCorrect: false
      explanation: "GPT-4.1 is a multimodal model that can process text and images together. It can't generate images."
  - content: "When you generate images programmatically using the OpenAI Python SDK with Microsoft Foundry, which value should you pass as the model parameter in the request?"
    choices:
    - content: "The original base model name (for example, gpt-image-1.5)."
      isCorrect: false
      explanation: "In Microsoft Foundry, API calls reference the deployment name, not the underlying base model name."
    - content: "The deployment name you gave the image generation model in your Foundry resource."
      isCorrect: true
      explanation: "In Microsoft Foundry, API calls reference the deployment name, which is the name you gave the model."
    - content: "The name you gave your Foundry resource."
      isCorrect: false
      explanation: "You need to provide the deployment name of the model, not the Foundry resource name, when making API calls to generate images."
  - content: "Why is video generation with Sora models in Microsoft Foundry handled as an asynchronous job?"
    choices:
    - content: "Because video generation requires user interaction during rendering."
      isCorrect: false
      explanation: "Video generation doesn't require user interaction during rendering."
    - content: "Because the REST API doesn't support synchronous requests."
      isCorrect: false
      explanation: "REST APIs can support synchronous requests. Video generation runs asynchronously because of its duration and compute cost, not because synchronous processing is impossible."
    - content: "Because video generation is resource‑intensive and takes time to complete."
      isCorrect: true
      explanation: "Video generation is computationally intensive and can take several minutes to complete. For this reason, Foundry runs video generation as an asynchronous process where you create a job, poll for its status, and download the video once it's finished."

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
### YamlMime:ModuleUnit
uid: learn.wwl.get-started-with-computer-vision-in-azure.summary
title: Summary
metadata:
  title: Summary
  description: Summary
  author: wwlpublish
  ms.author: sheryang
  ms.date: 02/17/2026
  ms.topic: unit
  zone_pivot_groups: video-or-text
durationInMinutes: 1
content: |
  [!include[](includes/7-summary.md)]

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
::: zone pivot="video"

>[!VIDEO https://learn-video.azurefd.net/vod/player?id=c3f07882-8a43-4671-9f1a-da926192e784]

::: zone-end

::: zone pivot="text"

**Computer vision** is a field of AI that enables machines to interpret and understand visual information from the world, such as images, videos, and live camera feeds. Computer vision capabilities are powered by AI models and support the automation of all kinds of time-intensive tasks.

This module discusses AI models that can identify and analyze objects, recognize patterns, read text within images, and interpret scenes much like a human would. The module also covers visual AI models that go beyond image analysis to generate new visual content. Together, these capabilities enable a wide range of applications, from image search and document analysis to creative tools and interactive AI experiences, by allowing systems to both see and create visual information.

Consider these applications of computer vision:

- **Defect detection in manufacturing**: AI vision systems inspect products on assembly lines in real time. They detect surface defects, misalignments, or missing components using object detection and image segmentation, reducing waste and improving quality control.

- **Medical imaging analysis**: Computer vision helps radiologists analyze X-rays, MRIs, and CT scans. AI models can highlight anomalies like tumors or fractures, assist in early diagnosis, and reduce human error.

- **Shelf monitoring in retail**: Retailers use AI vision to monitor store shelves. Cameras detect when products are out of stock or misplaced, enabling real-time inventory updates and improving customer experience.

- **Autonomous vehicles**: Self-driving cars rely on computer vision to recognize road signs, lane markings, pedestrians, and other vehicles. This enables safe navigation and decision-making in dynamic environments.

Next, explore multimodal models in **Microsoft Foundry**, Microsoft's unified platform-as-a-service offering on Azure for enterprise AI operations and application development.

::: zone-end

> [!NOTE]
> We recognize that different people like to learn in different ways. You can choose to complete this module in video-based format, or you can read the content as text and images. The text contains greater detail than the videos, so in some cases you might want to refer to it as supplemental material to the video presentation.

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
::: zone pivot="video"

>[!VIDEO https://learn-video.azurefd.net/vod/player?id=5bfdd223-9358-439d-8814-56a006aafa76]

::: zone-end

::: zone pivot="text"

Increasingly, new AI models are multimodal. In other words, they support multiple kinds of input data, including images and text. **Multimodal models** are AI models that can understand and work with more than one type of data at the same time, such as text, images, audio, or video. For instance, a multimodal model can describe an image in natural language or answer a question about a photo.

Multimodal models are commonly used as part of:

- **AI applications**, where image understanding enhances user workflows
- **AI agents**, where visual input helps the agent make better decisions

Examples include:

- An agent that reviews uploaded documents and screenshots
- A support app that analyzes photos submitted by customers
- A learning tool that explains diagrams or charts in plain language

Because multimodal models accept both text and images, they reduce the need for separate vision pipelines and make it easier to build end‑to‑end intelligent experiences.

Models that combine visual understanding with natural language responses are referred to as **vision‑enabled GPT models**, or GPT with vision. Vision‑enabled models are designed for flexible, general‑purpose visual reasoning. They can analyze visual input and respond in natural language, making it easy to build intelligent applications without needing deep computer vision expertise.

## Multimodal models in Microsoft Foundry

Microsoft Foundry includes many models that accept image-based input, enabling you to create intelligent, vision-based solutions. Multimodal models in Microsoft Foundry allow applications and agents to understand, analyze, and reason over images and visual content.

For example, vision‑enabled GPT models in Foundry can:

- Describe the contents of an image in natural language
- Answer questions about objects, text, or scenes in an image
- Extract meaning from charts, screenshots, documents, or photos
- Combine image understanding with text instructions in a single prompt

Foundry's model catalog contains many multimodal models, including:

- **GPT‑4.1 / GPT‑4.1‑mini / GPT‑4.1‑nano**: These general‑purpose multimodal GPT models can process text and images together. They're commonly used for image description and visual question answering, document and screenshot analysis, and chart and diagram interpretation.

- **GPT‑5 series (for example, GPT‑5.1, GPT‑5.2)**: The GPT‑5 family available in Foundry includes advanced multimodal models designed for enterprise and agentic scenarios. These models support multimodal inputs (including text and images), structured outputs, tool use, and large‑context reasoning across modalities. The GPT‑5 series models are typically used in production‑grade AI agents and complex multimodal applications.

Foundry also hosts partner‑provided multimodal models in its model catalog, including models from providers such as Anthropic that support text and image understanding.

### Image analysis in the Foundry playground

> [!NOTE]
> The Foundry portal has a *classic* user interface (UI) and a *new* user interface.

In the *new Microsoft Foundry portal*, you can use the model playground to chat with a deployed model. You can select a vision‑enabled model, upload images, and test prompts interactively to understand how the model interprets visual information.

:::image type="content" source="../media/playground-upload-image.png" alt-text="Screenshot of Foundry Playground with a gpt-4.1 mini model deployed and the user uploading an image of an animal." lightbox="../media/playground-upload-image.png":::

For example, you can attach an image file and have a multimodal model (such as gpt-4.1 mini) analyze and describe it.

:::image type="content" source="../media/image-analysis-result-playground.png" alt-text="Screenshot of Foundry Playground with a prompt asking the model to describe what is in an image and a response with a description." lightbox="../media/image-analysis-result-playground.png":::

Once validated, the same capabilities can be accessed programmatically using APIs, allowing images to be submitted alongside text prompts in application code.

## Using the Azure OpenAI API for image analysis

To develop an application, you need to move from the Foundry playground to code. In a code editor, you can write your application code using the **OpenAI Responses API** in Foundry. The OpenAI Responses API is designed for agentic apps and supports native multimodal inputs, including images.

At a high level:

- A single request can include text input and image input together
- Images can be provided as URLs or as base64‑encoded image data
- The model processes both inputs simultaneously to generate a response

Conceptually, the prompt structure looks like:

- A text instruction (for example, *What objects are visible in this image?*)
- One or more image inputs attached to the same request

This approach allows developers to build applications where users upload images and ask questions about them in real time.
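
As a rough sketch, the `input` payload for such a request might be shaped as follows. The prompt text, URL, and variable names here are illustrative placeholders, not values from this module:

```python
# Illustrative shape of a multimodal Responses API payload.
# An image can be referenced by URL, or embedded as a base64 data URL
# when the file lives on disk rather than at a public address.
input_payload = [
    {
        "role": "user",
        "content": [
            {"type": "input_text", "text": "What objects are visible in this image?"},
            {"type": "input_image", "image_url": "https://example.com/photo.jpg"},
            # Alternatively, embed the image bytes directly:
            # {"type": "input_image", "image_url": f"data:image/jpeg;base64,{encoded}"},
        ],
    }
]
```

The next section shows this structure inside a complete request.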

## Using the Azure OpenAI Python SDK

You can use a Microsoft Foundry resource with the OpenAI API to perform image analysis, sending images in prompts and getting text responses, by using the Responses API with a vision‑capable model deployment.

The Python SDK can be installed in the Visual Studio Code *terminal* using:

```bash
pip install openai
```

In the code editor, create a Python file to contain your application code. Importantly, you need your **Foundry resource** *key* and *endpoint*, and the *name of your deployed model*.

>[!NOTE]
>When you deploy a model in Foundry, it has a *base* or *original* name and a **deployment name** that you give it. Foundry hosts the deployed model (for example, GPT‑class models with vision) and provides you with an endpoint.

In the code example, you create the *client*, point it to your endpoint, and pass your *model deployment name* (the name you gave the model) as the `MODEL_NAME`.

```python
import os
from openai import OpenAI

# Values from your Foundry resource. In production, read these from
# environment variables or a secret store instead of hard-coding them.
FOUNDRY_KEY = "... your key ..."
ENDPOINT = "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/"
MODEL_NAME = "your-model-deployment-name"  # e.g., "gpt-4.1-mini" deployed as "my-vision-deploy"

# Create the client, pointing it at your Foundry endpoint.
client = OpenAI(
    api_key=os.getenv("FOUNDRY_KEY", FOUNDRY_KEY),
    base_url=os.getenv("ENDPOINT", ENDPOINT),
)

# Set this to the URL of the image you want to analyze.
image_url = ""

# Send a multi-part message: a text instruction plus the image.
response = client.responses.create(
    model=os.getenv("MODEL_NAME", MODEL_NAME),  # your deployment name
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What is in this image? Provide 3 bullet points."},
                {"type": "input_image", "image_url": image_url},
            ],
        }
    ],
)

print(response.output_text)
```

### Client app example

You can build a custom application that uses a vision-enabled model to analyze an image with the OpenAI Python SDK. For example, suppose you want to build an app that can identify animals photographed on safari. You can upload your photos and create a Python file in your code editor.

![Screenshot of the image used for image analysis.](../media/image-example-vs-code.png)

Then you can write application code that uses the OpenAI API to connect to your model's endpoint in Foundry.

:::image type="content" source="../media/vision-analysis-python.png" alt-text="Screenshot of Visual Studio Code with a Python file containing application code for image analysis." lightbox="../media/vision-analysis-python.png":::

The application code needs to load the image data and get a natural language prompt from a user. To submit the input to the model, you create a multi-part message that includes both the image and text data, as in the sketch that follows. The model can then respond with an appropriate output based on both the text and image in the prompt.
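
Here's a minimal sketch of what that client code might look like, assuming the photo is a local file. The file name `animal-photo.jpg` and the environment variable names are hypothetical placeholders, not values from this module:

```python
import base64
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["FOUNDRY_KEY"],  # your Foundry resource key
    base_url=os.environ["ENDPOINT"],    # your Foundry endpoint
)

# Load the local photo and base64-encode it, so the image data travels
# inside the request body instead of being fetched from a URL.
with open("animal-photo.jpg", "rb") as f:  # hypothetical file name
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

# Get a natural language prompt from the user.
user_prompt = input("Ask a question about the photo: ")

# Submit a multi-part message containing both the text and the image.
response = client.responses.create(
    model=os.environ["MODEL_NAME"],  # your deployment name
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": user_prompt},
                {"type": "input_image", "image_url": f"data:image/jpeg;base64,{encoded_image}"},
            ],
        }
    ],
)

print(response.output_text)
```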

:::image type="content" source="../media/image-analysis-result-vs-code.png" alt-text="Screenshot of Visual Studio Code with the result of the image analysis." lightbox="../media/image-analysis-result-vs-code.png":::

Next, learn how to use Foundry models and the Azure OpenAI SDK for image generation.

::: zone-end
