
Commit 8f24eb5

Merge pull request #53930 from GraemeMalcolm/main
Updated module
2 parents 1b1cf84 + 5ed4489 commit 8f24eb5

13 files changed: 120 additions & 133 deletions

learn-pr/wwl-data-ai/model-catalog-evaluate/7-knowledge-check.yml

Lines changed: 33 additions & 33 deletions
@@ -13,36 +13,36 @@ durationInMinutes: 3
  quiz:
    title: Check your knowledge
    questions:
-   - content: "Which deployment type in Microsoft Foundry portal requires an Azure Marketplace subscription for models from partners and the community?"
-     choices:
-     - content: "Managed compute"
-       isCorrect: false
-       explanation: "Incorrect. Managed compute doesn't necessarily require Azure Marketplace subscriptions. It uses Azure virtual machines and is billed for hosting and inference costs."
-     - content: "Serverless API"
-       isCorrect: true
-       explanation: "Correct. Serverless API deployments for models from partners and community require Azure Marketplace subscriptions, while models sold directly by Azure don't require this subscription."
-     - content: "Provisioned"
-       isCorrect: false
-       explanation: "Incorrect. Provisioned deployments reserve dedicated capacity but don't specifically require Azure Marketplace subscriptions."
-   - content: "Which evaluation metric measures whether model responses are based on provided context rather than speculation?"
-     choices:
-     - content: "Fluency"
-       isCorrect: false
-       explanation: "Incorrect. Fluency evaluates linguistic correctness and natural language quality, not whether responses are grounded in context."
-     - content: "Groundedness"
-       isCorrect: true
-       explanation: "Correct. Groundedness determines whether responses are based on provided context rather than speculation or the model's general knowledge."
-     - content: "Coherence"
-       isCorrect: false
-       explanation: "Incorrect. Coherence assesses whether responses flow logically and maintain consistent ideas, not whether they're based on provided context."
-   - content: "What type of model should you select if your application needs to process both text and images?"
-     choices:
-     - content: "Small Language Model (SLM)"
-       isCorrect: false
-       explanation: "Incorrect. SLMs are efficient text-focused models but don't inherently process images alongside text."
-     - content: "Embedding model"
-       isCorrect: false
-       explanation: "Incorrect. Embedding models convert text into numerical representations for semantic search and similarity tasks, not for processing images."
-     - content: "Multimodal model"
-       isCorrect: true
-       explanation: "Correct. Multimodal models like GPT-4o and Phi-3-vision can process multiple data types including both text and images."
+   - content: "Which model benchmark indicates the model's ability to process prompts and return comprehensive responses quickly?"
+     choices:
+     - content: "Quality index"
+       isCorrect: false
+       explanation: "Incorrect. Quality index evaluates the overall quality of the model's responses, not its speed or efficiency."
+     - content: "Cost"
+       isCorrect: false
+       explanation: "Incorrect. Cost evaluates the expense of using the model, not its speed or efficiency."
+     - content: "Throughput"
+       isCorrect: true
+       explanation: "Correct. Throughput indicates the model's ability to process prompts and return comprehensive responses quickly."
+   - content: "Which deployment type in Microsoft Foundry is best for general use while offering the largest quota?"
+     choices:
+     - content: "Data Zone Batch"
+       isCorrect: false
+       explanation: "Incorrect. Data Zone Batch deployments are designed for batch processing and may not offer the largest quota for general use."
+     - content: "Global Standard"
+       isCorrect: true
+       explanation: "Correct. Global Standard deployments offer the largest quota and are suitable for general use."
+     - content: "Developer"
+       isCorrect: false
+       explanation: "Incorrect. Developer deployments are intended for development and testing of fine-tuned models, not for general use with the largest quota."
+   - content: "Which evaluation metric measures linguistic correctness and natural language quality?"
+     choices:
+     - content: "Fluency"
+       isCorrect: true
+       explanation: "Correct. Fluency evaluates linguistic correctness and natural language quality."
+     - content: "Groundedness"
+       isCorrect: false
+       explanation: "Incorrect. Groundedness determines whether responses are based on provided context rather than speculation or the model's general knowledge."
+     - content: "Relevance"
+       isCorrect: false
+       explanation: "Incorrect. Relevance assesses whether responses are pertinent to the given context or query, not linguistic correctness or natural language quality."
Lines changed: 29 additions & 27 deletions
@@ -1,57 +1,59 @@
- The model catalog in Microsoft Foundry portal serves as your central hub for discovering and comparing AI models. With over 1,900 models available from various providers, you need effective ways to filter and find models that match your specific requirements.
+ The Foundry Models catalog serves as your central hub for discovering and comparing AI models. With over 1,900 models available from various providers, you need effective ways to filter and find models that match your specific requirements.
  
- ## Access the model catalog
+ The model catalog includes two broad categories of model:
  
- You access the model catalog from the Microsoft Foundry portal at [ai.azure.com](https://ai.azure.com). After signing in and selecting your project, choose **Discover** from the top navigation. The catalog displays model cards showing key information about each model, including the provider, capabilities, and deployment options.
+ - **Foundry Models sold directly by Azure**
  
- :::image type="content" source="../media/model-catalog.png" alt-text="Screenshot of the model catalog in Microsoft Foundry portal.":::
- 
- ## Filter models by key attributes
+   These models are billed directly through your Azure subscription, and include Azure OpenAI models as well as models from Microsoft and other providers.
  
- The model catalog provides several filters to help you narrow your search:
+ - **Foundry Models from partners and community**
  
- **Collection** filters let you browse models by provider, such as Azure OpenAI, Meta, Mistral, Cohere, or Hugging Face. This helps when you have preferences or requirements for specific model families.
+   These models are provided by trusted partners and the community, each with its own licensing and pricing.
  
- **Industry** filters show models trained on industry-specific datasets. These specialized models often outperform general-purpose models in their respective domains.
+ ## Finding models in the model catalog
  
- **Capabilities** filters highlight unique model features. You can filter for reasoning capabilities (complex problem-solving), tool calling (API and function integration), or multimodal processing (text, images, audio).
+ The model catalog user interface in the Foundry portal provides an easy way to search for the right model for your needs. Each model has a *model card* showing its key information, including the provider, capabilities, benchmark metrics, responsible AI considerations, and deployment options.
  
- **Inference tasks** and **Fine-tune tasks** filters let you find models suited for specific activities like text generation, summarization, translation, or entity extraction.
+ :::image type="content" source="../media/model-catalog.png" alt-text="Screenshot of the model catalog in Microsoft Foundry portal.":::
  
- ## Understand model types
+ You can search for models by keyword, and you can filter based on the following attributes:
  
- As you explore the catalog, you encounter different categories of models designed for various use cases.
+ - **Collection**: Models are organized into collections, such as models that are provided directly in Azure, or models in the Hugging Face repository.
+ - **Capabilities**: Specific model abilities, including *reasoning* (complex problem-solving), *tool calling* (API and function integration), or *multimodal processing* (text, images, audio).
+ - **Source**: The model provider, including Azure OpenAI, Microsoft, Cohere, Mistral, Meta, Anthropic, and others.
+ - **Inference tasks**: Specific tasks like text generation, summarization, translation, image generation, speech synthesis, or other common AI tasks.
+ - **Fine-tuning methods**: Supported techniques for fine-tuning a model.
+ - **Industry**: Models trained on industry-specific datasets. These specialized models often outperform general-purpose models in their respective domains.
  
- ### Large Language Models and Small Language Models
+ ## Understand generative AI model types
  
- **Large Language Models (LLMs)** like GPT-4, Mistral Large, and Llama 3 70B are powerful models designed for tasks requiring deep reasoning, complex content generation, and extensive context understanding. These models excel at sophisticated applications but require more computational resources.
+ As you explore the catalog, you encounter different categories of models designed for various use cases. In broad terms, you can categorize language models as:
  
- **Small Language Models (SLMs)** like Phi-3, Mistral OSS models, and Llama 3 8B offer efficiency and cost-effectiveness while handling common natural language processing tasks. They're ideal for scenarios where speed and cost matter more than handling the most complex reasoning tasks. SLMs can run on lower-end hardware or edge devices.
+ - **Large Language Models (LLMs)** like GPT-5, Mistral Large, and Llama 3 70B that are designed for tasks requiring deep reasoning, complex content generation, and extensive context understanding. These models excel at sophisticated applications but require more computational resources.
+ - **Small Language Models (SLMs)** like Phi-4, Mistral OSS models, and Llama 3 8B that offer efficiency and cost-effectiveness while handling common natural language processing tasks. They're ideal for scenarios where speed and cost matter more than handling the most complex reasoning tasks. SLMs can run on lower-end hardware or edge devices.
  
  ### Chat completion and reasoning models
  
  Most language models in the catalog are **chat completion** models designed to generate coherent, contextually appropriate text responses. These models power conversational interfaces and content generation applications.
  
  For scenarios requiring higher performance in complex tasks like mathematics, coding, science, strategy, and logistics, **reasoning models** like Claude Opus 4.6 provide enhanced problem-solving capabilities. These models can break down complex problems and show their reasoning process.
  
- ### Multimodal models
- 
- Beyond text-only processing, **multimodal models** like GPT-4o and Phi-3-vision can handle multiple data types including images, audio, and text. Use these models when your application needs to analyze visual content, such as document understanding, image description, or chart explanation.
- 
  ### Specialized models
  
  The catalog also includes task-specific models:
  
- **Image generation models** like DALL·E 3 create visual content from text descriptions. Use these for generating marketing materials, illustrations, or design mockups.
- 
  **Embedding models** like Ada and Cohere convert text into numerical representations. These models enable semantic search, recommendation systems, and Retrieval Augmented Generation (RAG) scenarios where you need to find relevant information based on meaning rather than exact keyword matches.
  
- ### Regional and domain-specific models
+ **Image generation models** like GPT-image-1 create images from text descriptions. Use these for generating marketing materials, illustrations, or design mockups.
  
- Some models are optimized for specific languages, regions, or industries. When you need specialized performance in a particular domain or language, these models often outperform general-purpose alternatives. Examples include models trained on medical literature, legal documents, or specific language corpora.
+ **Video generation models** like Sora 2 create video content from text descriptions.
+ 
+ **Image analysis models** like GPT-4.1 can accept *multimodal* input, including text and images, and generate natural language output based on prompts that include images for analysis.
  
- ## Use the search and compare features
+ **Text to speech models** like GPT-4o-tts can convert text-based input to synthesized speech.
  
- Beyond filters, the model catalog offers search functionality to find models by name or keywords. You can open multiple model cards to compare their specifications, benchmarks, and capabilities side by side. This comparison helps you make informed decisions about which model best fits your use case, budget, and performance requirements.
+ **Speech to text models** like GPT-4o-transcribe can convert audio data containing speech into text transcriptions.
  
- When you identify promising candidates, you can view detailed benchmark results, test models in the playground, or proceed directly to deployment. The structured approach of filtering, comparing, and testing helps ensure you select the right model for your generative AI application.
+ ### Regional and domain-specific models
+ 
+ Some models are optimized for specific languages, regions, or industries. When you need specialized performance in a particular domain or language, these models often outperform general-purpose alternatives. Examples include models trained on medical literature, legal documents, or specific language corpora.
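
For illustration only (this sketch isn't part of the changed files): if you want to reason about the filter attributes above programmatically, you can treat each model card as a simple record and filter on its advertised capabilities and inference tasks. The field names and catalog entries below are hypothetical stand-ins, not the Foundry SDK or REST API.

```python
# Hypothetical, in-memory stand-in for catalog filtering; field names are assumptions.
from typing import Iterable, Optional

model_cards = [
    {"name": "gpt-5", "collection": "Azure OpenAI",
     "capabilities": {"reasoning", "tool calling"}, "inference_tasks": {"text generation"}},
    {"name": "phi-4", "collection": "Microsoft",
     "capabilities": {"tool calling"}, "inference_tasks": {"text generation", "summarization"}},
    {"name": "gpt-image-1", "collection": "Azure OpenAI",
     "capabilities": set(), "inference_tasks": {"image generation"}},
]

def filter_models(cards: Iterable[dict],
                  capability: Optional[str] = None,
                  task: Optional[str] = None) -> list:
    """Return model cards that advertise the requested capability and inference task."""
    matches = []
    for card in cards:
        if capability and capability not in card["capabilities"]:
            continue
        if task and task not in card["inference_tasks"]:
            continue
        matches.append(card)
    return matches

# Example: find tool-calling models suited to text generation.
for card in filter_models(model_cards, capability="tool calling", task="text generation"):
    print(card["name"], "-", card["collection"])
```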

learn-pr/wwl-data-ai/model-catalog-evaluate/includes/3-select-models-benchmarks.md

Lines changed: 7 additions & 10 deletions
@@ -4,7 +4,7 @@ Before deploying a model, you want to understand how it performs across differen
  
  You can explore benchmarks in two ways within the Microsoft Foundry portal:
  
- From the **model catalog**, select **Go to leaderboard** to see comparative rankings across all available models. This view helps you identify top-performing models for specific metrics or scenarios. The leaderboard displays top models ranked by quality, safety, estimated cost, and throughput.
+ In the **model catalog**, view the **Model leaderboard** to see comparative rankings across all available models. This view helps you identify top-performing models for specific metrics or scenarios. The leaderboard displays top models ranked by quality, safety, estimated cost, and throughput.
  
  For detailed benchmarks on a specific model, open its model card and select the **Benchmarks** tab. This view shows how the individual model performs across various metrics and datasets, with comparison charts placing it relative to similar models.
  
@@ -15,6 +15,7 @@ Quality benchmarks assess how well a model generates accurate, coherent, and con
  The **Quality index** provides a high-level overview by averaging accuracy scores across multiple benchmark datasets that measure reasoning, knowledge, question answering, mathematical capabilities, and coding skills. Higher quality index values indicate stronger overall performance across general-purpose language tasks.
  
  Quality benchmarks use datasets such as:
+ 
  - **Arena-Hard** - adversarial question answering
  - **BIG-Bench Hard** - reasoning capabilities
  - **GPQA** - graduate-level multi-discipline questions
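
For illustration only (not part of the changed file): the quality index described above averages normalized accuracy scores across benchmark datasets. A minimal sketch of that idea, with invented scores and a simple unweighted mean (the leaderboard's actual aggregation may differ):

```python
# Hypothetical, unweighted quality index over normalized (0-1) benchmark scores.
# The scores below are invented for the example.
benchmark_scores = {
    "Arena-Hard": 0.81,      # adversarial question answering
    "BIG-Bench Hard": 0.74,  # reasoning
    "GPQA": 0.62,            # graduate-level questions
}

quality_index = sum(benchmark_scores.values()) / len(benchmark_scores)
print(f"Quality index (simple mean): {quality_index:.2f}")  # higher is better
```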
@@ -27,13 +28,14 @@ Benchmark scores are normalized indexes ranging from zero to one, where higher v
  
  :::image type="content" source="../media/model-leaderboard.png" alt-text="Screenshot of model leaderboard in Microsoft Foundry portal." lightbox="../media/model-leaderboard.png":::
  
- ## Safety and risk benchmarks
+ ## Safety benchmarks
  
  Safety metrics ensure models don't generate harmful, biased, or inappropriate content. These benchmarks are crucial for applications exposed to end users, especially in regulated industries or customer-facing scenarios.
  
  Microsoft Foundry evaluates models across multiple safety dimensions:
  
  **Harmful behavior detection** uses the HarmBench benchmark to measure how well models resist generating unsafe content. The evaluation calculates **Attack Success Rate (ASR)**, where lower values indicate safer, more robust models. HarmBench tests three functional areas:
+ 
  - **Standard harmful behaviors** - cybercrime, illegal activities, general harm
  - **Contextually harmful behaviors** - misinformation, harassment, bullying
  - **Copyright violations** - reproducing copyrighted material
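
For illustration only (not part of the changed file): Attack Success Rate is the fraction of adversarial prompts that elicit the harmful behavior being probed, so lower is better. A toy calculation with invented results (not the actual HarmBench harness):

```python
# One boolean per adversarial test prompt; True means the attack succeeded.
attack_succeeded = [False, False, True, False, False, False, False, True, False, False]

asr = sum(attack_succeeded) / len(attack_succeeded)
print(f"Attack Success Rate: {asr:.0%}")  # 20% here; lower indicates a safer, more robust model
```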
@@ -61,6 +63,7 @@ Cost benchmarks help you identify models that deliver the quality you need at a
  Performance metrics measure how quickly and efficiently models respond to requests. These benchmarks matter for real-time applications where user experience depends on responsiveness.
  
  **Latency** measurements include:
+ 
  - **Latency mean** - average time in seconds to process a request
  - **Latency P50** (median) - 50% of requests complete faster than this time
  - **Latency P90** - 90% of requests complete faster than this time
@@ -69,6 +72,7 @@ Performance metrics measure how quickly and efficiently models respond to reques
  - **Time to first token (TTFT)** - time until the first token arrives when using streaming
  
  **Throughput** measurements include:
+ 
  - **Generated tokens per second (GTPS)** - output tokens generated per second
  - **Total tokens per second (TTPS)** - combined input and output tokens processed per second
  - **Time between tokens** - interval between receiving consecutive tokens
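
For illustration only (not part of the changed file): if you log your own request timings and token counts, the latency and throughput figures above are easy to reproduce. A sketch over invented measurements:

```python
# Latency and throughput math over invented measurements.
from statistics import mean, quantiles

latencies_s = [0.8, 1.1, 0.9, 2.4, 1.0, 1.3, 0.7, 1.9, 1.2, 1.0]  # seconds per request
generated_tokens = 4200            # total output tokens across all requests
total_time_s = sum(latencies_s)    # simplistic: assumes requests ran back to back

cuts = quantiles(latencies_s, n=100)   # 99 cut points; index 49 ~ P50, index 89 ~ P90
p50, p90 = cuts[49], cuts[89]
print(f"Latency mean: {mean(latencies_s):.2f}s, P50: {p50:.2f}s, P90: {p90:.2f}s")
print(f"Generated tokens per second (GTPS): {generated_tokens / total_time_s:.1f}")
```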
@@ -84,17 +88,10 @@ The model leaderboard lets you view top models for specific metrics. You can sor
  **Trade-off charts** display two metrics simultaneously, such as quality versus cost or quality versus throughput. These visualizations help you find the optimal balance for your requirements. Use the dropdown to compare quality against cost, throughput, or safety. Models closer to the top-right corner of the chart perform well on both metrics. A model that's slightly less accurate but significantly faster or cheaper might better serve your needs.
  
  **Side-by-side comparison** lets you select two or three models from the leaderboard and compare them across multiple dimensions:
+ 
  - Performance benchmarks (quality, safety, throughput)
  - Model details (context window, training data, supported languages)
  - Supported endpoints (deployment options)
  - Feature support (function calling, structured output, vision)
  
  Select models by checking boxes next to their names, then choose **Compare** to open the detailed comparison view.
- 
- ## Evaluate with your own data
- 
- While public benchmark results provide valuable guidance, you can also evaluate models using your own test data. From a model's **Benchmarks** tab, select **Try with your own data** to run evaluations on scenarios specific to your application.
- 
- This custom evaluation uses your own prompts, expected responses, and evaluation criteria. The results show how the model performs on your actual use case, complementing the public benchmark data with application-specific insights.
- 
- By combining public benchmarks with custom testing, you gather the evidence needed to select a model confidently. You understand not only how a model performs generally, but specifically how well it addresses your unique requirements for quality, safety, cost, and performance.
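
For illustration only (not part of the changed file): the quality-versus-cost trade-off described above can also be checked numerically by discarding any model that another model beats on both axes, which is roughly what the chart lets you eyeball. All numbers below are invented.

```python
# Keep models that no other model dominates (cheaper or equal AND at least as high quality,
# with a strict improvement on at least one axis). All figures are invented.
models = [
    {"name": "model-a", "quality": 0.91, "cost_per_1m_tokens": 30.0},
    {"name": "model-b", "quality": 0.88, "cost_per_1m_tokens": 6.0},
    {"name": "model-c", "quality": 0.80, "cost_per_1m_tokens": 9.0},   # dominated by model-b
    {"name": "model-d", "quality": 0.75, "cost_per_1m_tokens": 1.5},
]

def pareto_frontier(candidates):
    frontier = []
    for m in candidates:
        dominated = any(
            other is not m
            and other["cost_per_1m_tokens"] <= m["cost_per_1m_tokens"]
            and other["quality"] >= m["quality"]
            and (other["cost_per_1m_tokens"] < m["cost_per_1m_tokens"]
                 or other["quality"] > m["quality"])
            for other in candidates
        )
        if not dominated:
            frontier.append(m)
    return frontier

for m in pareto_frontier(models):
    print(f'{m["name"]}: quality={m["quality"]:.2f}, ${m["cost_per_1m_tokens"]:.2f}/1M tokens')
```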
