::: zone pivot="video"

>[!VIDEO https://learn-video.azurefd.net/vod/player?id=5bfdd223-9358-439d-8814-56a006aafa76]

::: zone-end

::: zone pivot="text"

Increasingly, new AI models are multimodal. In other words, they support multiple kinds of input data, including images and text. **Multimodal models** are AI models that can understand and work with more than one type of data at the same time, such as text, images, audio, or video. For instance, a multimodal model can describe an image in natural language or answer a question about a photo.

Multimodal models are commonly used as part of:

- **AI applications**, where image understanding enhances user workflows
- **AI agents**, where visual input helps the agent make better decisions

Examples include:

- An agent that reviews uploaded documents and screenshots
- A support app that analyzes photos submitted by customers
- A learning tool that explains diagrams or charts in plain language

Because multimodal models accept both text and images, they reduce the need for separate vision pipelines and make it easier to build end‑to‑end intelligent experiences.

GPT models that combine visual understanding with natural language responses are referred to as **vision‑enabled GPT models**, or GPT with vision. Vision‑enabled models are designed for flexible, general‑purpose visual reasoning. They can analyze visual input and respond in natural language, making it easy to build intelligent applications without needing deep computer vision expertise.

## Multimodal models in Microsoft Foundry

Microsoft Foundry includes many models that accept image-based input, enabling you to create intelligent, vision-based solutions. Multimodal models in Microsoft Foundry allow applications and agents to understand, analyze, and reason over images and visual content.

For example, vision‑enabled GPT models in Foundry can:

- Describe the contents of an image in natural language
- Answer questions about objects, text, or scenes in an image
- Extract meaning from charts, screenshots, documents, or photos
- Combine image understanding with text instructions in a single prompt

Foundry's model catalog contains many multimodal models, including:

- **GPT‑4.1 / GPT‑4.1‑mini / GPT‑4.1‑nano**: These general‑purpose multimodal GPT models can process text and images together. They're commonly used for image description and visual question answering, document and screenshot analysis, and chart and diagram interpretation.

- **GPT‑5 series (for example, GPT‑5.1, GPT‑5.2)**: The GPT‑5 family available in Foundry includes advanced multimodal models designed for enterprise and agentic scenarios. These models support multimodal inputs (including text and images), structured outputs, tool use, and large‑context reasoning across modalities. The GPT-5 series models are typically used in production‑grade AI agents and complex multimodal applications.

Foundry also hosts partner‑provided multimodal models in its model catalog, including models from providers such as Anthropic and others that support text and image understanding.

#### Image analysis in the Foundry playground

> [!NOTE]
> The Foundry portal has a *classic* user interface (UI) and a *new* user interface.

In the *new Microsoft Foundry portal*, you can use the model playground to chat with a deployed model. You can select a vision‑enabled model, upload images, and test prompts interactively to understand how the model interprets visual information.

:::image type="content" source="../media/playground-upload-image.png" alt-text="Screenshot of Foundry Playground with a gpt-4.1 mini model deployed and the user uploading an image of an animal." lightbox="../media/playground-upload-image.png":::

For example, you can attach an image file and get the multimodal model (such as gpt-4.1 mini) to analyze and describe it.

:::image type="content" source="../media/image-analysis-result-playground.png" alt-text="Screenshot of Foundry Playground with a prompt asking the model to describe what is in an image and a response with a description." lightbox="../media/image-analysis-result-playground.png":::

Once validated, the same capabilities can be accessed programmatically using APIs, allowing images to be submitted alongside text prompts in application code.

## Using the Azure OpenAI API for image analysis

To develop an application, you need to move from the Foundry playground to code. In a code editor, you can write your application code using the **OpenAI Responses API** in Foundry. The OpenAI Responses API is designed for agentic apps and supports native multimodal inputs, including images.

At a high level:

- A single request can include text input and image input together
- Images can be provided as URLs or as base64‑encoded image data
- The model processes both inputs simultaneously to generate a response

Conceptually, the prompt structure looks like:

- A text instruction (for example, *What objects are visible in this image?*)
- One or more image inputs attached to the same request

This approach allows developers to build applications where users upload images and ask questions about them in real time.
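
To make that concrete, here's a minimal sketch of the multi‑part input structure. The field names match the Responses API content parts shown later in this unit; the URL and the `image_base64` variable are placeholders:

```python
# A single user message carrying both a text part and an image part.
multimodal_input = [
    {
        "role": "user",
        "content": [
            {"type": "input_text", "text": "What objects are visible in this image?"},
            # Option 1: reference a hosted image by URL (placeholder URL)
            {"type": "input_image", "image_url": "https://example.com/photo.jpg"},
            # Option 2: embed the image bytes as a base64 data URL instead:
            # {"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_base64}"},
        ],
    }
]
```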

## Using the Azure OpenAI Python SDK

You can use a Microsoft Foundry resource with the OpenAI API to perform image analysis (including sending images in prompts and getting text responses) by using the Responses API with a vision‑capable model deployment.

The Python SDK can be installed in the Visual Studio Code *terminal* using:

```bash
pip install openai
```

In the code editor, create a Python file to contain your application code. Importantly, you need your **Foundry resource** *key* and *endpoint*, and the *name of your deployed model*.

>[!NOTE]
>When you deploy a model in Foundry, it has a *base* or *original* name, and a **deployment name** that you give it. Foundry hosts the deployed model (for example, GPT‑class models with vision) and provides you with an endpoint.

In the code example, you create the *client*, point it to your endpoint, and pass your *model deployment name* (the name you gave the model) as the `MODEL_NAME`.

```python
import os
from openai import OpenAI

# Configuration values, read from environment variables you set
# locally or in your app service:
#   FOUNDRY_KEY: your Foundry resource key
#   ENDPOINT: for example, "https://YOUR-RESOURCE-NAME.openai.azure.com/openai/v1/"
#   MODEL_NAME: your model deployment name
#     (e.g., "gpt-4.1-mini" deployed as "my-vision-deploy")

client = OpenAI(
    api_key=os.getenv("FOUNDRY_KEY"),
    base_url=os.getenv("ENDPOINT"),
)

image_url = ""  # URL of the image you want the model to analyze

response = client.responses.create(
    model=os.getenv("MODEL_NAME"),  # your deployment name
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": "What is in this image? Provide 3 bullet points."},
                {"type": "input_image", "image_url": image_url},
            ],
        }
    ],
)

print(response.output_text)
```
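
The previous example passes the image by URL. If the image is a local file, you can send it as base64‑encoded data instead. Here's a minimal sketch that builds on the client above; `photo.jpg` is a placeholder file name:

```python
import base64

# Read the local image file and encode its bytes as base64 text.
with open("photo.jpg", "rb") as image_file:
    image_base64 = base64.b64encode(image_file.read()).decode("utf-8")

# Use a data URL in place of the web URL in the input_image part.
image_url = f"data:image/jpeg;base64,{image_base64}"
```

The data URL then takes the place of the web URL in the `input_image` content part; the rest of the request is unchanged.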

#### Client app example

You can build a custom application that uses a vision-enabled model to analyze an image with the OpenAI Python SDK. For example, suppose you want to build an app that can identify animals photographed on safari. You can upload your photos and create a Python file in your code editor.

Then you can write application code that uses the OpenAI API to connect to your model's endpoint in Foundry.

:::image type="content" source="../media/vision-analysis-python.png" alt-text="Screenshot of Visual Studio Code with a python file containing application code for image analysis." lightbox="../media/vision-analysis-python.png":::

The application code needs to load the image data and get a natural language prompt from a user. To submit the input to the model, you need to create a multi-part message that includes both the image and text data. The model can respond with an appropriate output based on both the text and image in the prompt, as in the sketch that follows.
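
As a minimal sketch (assuming the same `client` and environment variables as the earlier example, with a hypothetical photo file name and a prompt typed by the user), the multi-part message might look like this:

```python
import base64
import os

# Get a natural language prompt from the user.
user_prompt = input("Ask a question about the photo: ")

# Load the image data and encode it for the request.
with open("safari-photo.jpg", "rb") as image_file:  # hypothetical file name
    image_base64 = base64.b64encode(image_file.read()).decode("utf-8")

# Submit a multi-part message that includes both the text and image data.
response = client.responses.create(
    model=os.getenv("MODEL_NAME"),  # your deployment name
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": user_prompt},
                {"type": "input_image", "image_url": f"data:image/jpeg;base64,{image_base64}"},
            ],
        }
    ],
)

print(response.output_text)
```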

:::image type="content" source="../media/image-analysis-result-vs-code.png" alt-text="Screenshot of Visual Studio Code with the result of the image analysis." lightbox="../media/image-analysis-result-vs-code.png":::

Next, learn how to use Foundry models and the Azure OpenAI SDK for image generation.

::: zone-end