+ description: Learn how to develop generative AI apps in Azure. (AI-3016)
+ ms.date: 02/17/2026
+ author: ivorb
+ ms.author: berryivor
+ ms.topic: learning-path
+ ms.collection: wwl-ai-copilot
+ ms.custom: [copilot-learning-hub]
+ title: Develop generative AI apps in Azure
+ prerequisites: |
+   Before starting this module, you should be familiar with fundamental AI concepts and services in Azure. You should also have programming experience.
+ summary: |
+   Generative Artificial Intelligence (AI) is becoming more accessible through comprehensive development platforms like Microsoft Foundry. Learn how to build generative AI applications that use language models to interact with your users.
File: learn-pr/wwl-data-ai/model-catalog-evaluate/includes/2-explore-model-catalog.md (2 additions, 10 deletions)
@@ -2,7 +2,7 @@ The model catalog in Microsoft Foundry portal serves as your central hub for dis

## Access the model catalog

- You access the model catalog from the Microsoft Foundry portal at [ai.azure.com](https://ai.azure.com). After signing in and selecting your project, choose **Model catalog** from the left navigation pane. The catalog displays model cards showing key information about each model, including the provider, capabilities, and deployment options.
+ You access the model catalog from the Microsoft Foundry portal at [ai.azure.com](https://ai.azure.com). After signing in and selecting your project, choose **Discover** from the top navigation. The catalog displays model cards showing key information about each model, including the provider, capabilities, and deployment options.

:::image type="content" source="../media/model-catalog.png" alt-text="Screenshot of the model catalog in Microsoft Foundry portal.":::
@@ -16,16 +16,8 @@ The model catalog provides several filters to help you narrow your search:

**Capabilities** filters highlight unique model features. You can filter for reasoning capabilities (complex problem-solving), tool calling (API and function integration), or multimodal processing (text, images, audio).

- **Deployment options** filters help you find models that support your preferred deployment type:
- - Serverless API for pay-per-call flexibility
- - Provisioned deployment for consistent, high-volume workloads
- - Managed compute for virtual machine-based hosting
- - Batch processing for cost-optimized, non-latency-sensitive jobs

**Inference tasks** and **Fine-tune tasks** filters let you find models suited for specific activities like text generation, summarization, translation, or entity extraction.

- **License** filters help you identify models that align with your organization's licensing requirements and usage policies.

## Understand model types

As you explore the catalog, you encounter different categories of models designed for various use cases.
@@ -40,7 +32,7 @@ As you explore the catalog, you encounter different categories of models designe

Most language models in the catalog are **chat completion** models designed to generate coherent, contextually appropriate text responses. These models power conversational interfaces and content generation applications.

- For scenarios requiring higher performance in complex tasks like mathematics, coding, science, strategy, and logistics, **reasoning models** like DeepSeek-R1 provide enhanced problem-solving capabilities. These models can break down complex problems and show their reasoning process.
+ For scenarios requiring higher performance in complex tasks like mathematics, coding, science, strategy, and logistics, **reasoning models** like Claude Opus 4.6 provide enhanced problem-solving capabilities. These models can break down complex problems and show their reasoning process.
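Once you choose a chat completion model from the catalog and deploy it to your project, you can call it from code. The following is a minimal sketch using the `azure-ai-inference` Python package; the endpoint URL, API key, and deployment name are placeholders you would replace with your own values, and your deployment might use Microsoft Entra ID authentication instead of a key.

```python
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key for a model deployed from the catalog.
client = ChatCompletionsClient(
    endpoint="https://<your-resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential("<your-api-key>"),
)

response = client.complete(
    model="<your-deployment-name>",  # the deployment created from the model card
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Summarize what a model catalog is in two sentences."),
    ],
)

print(response.choices[0].message.content)
```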
File: learn-pr/wwl-data-ai/model-catalog-evaluate/includes/3-select-models-benchmarks.md (33 additions, 25 deletions)
@@ -4,45 +4,45 @@ Before deploying a model, you want to understand how it performs across differen

You can explore benchmarks in two ways within the Microsoft Foundry portal:

- From the **model catalog**, select **Browse leaderboards** to see comparative rankings across all available models. This view helps you identify top-performing models for specific metrics or scenarios.
+ From the **model catalog**, select **Go to leaderboard** to see comparative rankings across all available models. This view helps you identify top-performing models for specific metrics or scenarios. The leaderboard displays top models ranked by quality, safety, estimated cost, and throughput.

For detailed benchmarks on a specific model, open its model card and select the **Benchmarks** tab. This view shows how the individual model performs across various metrics and datasets, with comparison charts placing it relative to similar models.

## Quality benchmarks

Quality benchmarks assess how well a model generates accurate, coherent, and contextually appropriate responses. These metrics use public datasets and standardized evaluation methods to ensure consistency.

- **Accuracy** measures whether model-generated text matches correct answers according to the dataset. The result is binary: one if the generated text matches exactly, zero otherwise. High accuracy indicates the model reliably produces correct factual responses.
+ The **Quality index** provides a high-level overview by averaging accuracy scores across multiple benchmark datasets that measure reasoning, knowledge, question answering, mathematical capabilities, and coding skills. Higher quality index values indicate stronger overall performance across general-purpose language tasks.

- **Coherence** evaluates whether model output flows smoothly and resembles human-like language. A coherent response maintains logical structure and clear relationships between ideas, making it easy for users to follow and understand.
+ - **HumanEval+** and **MBPP+** - code generation tasks
+ - **MATH** - mathematical reasoning
+ - **MMLU-Pro** - general knowledge assessment
+ - **IFEval** - instruction following

- **Fluency** assesses grammatical correctness, syntactic structure, and appropriate vocabulary usage. Fluent responses sound natural and linguistically correct, avoiding awkward phrasing or grammatical errors.
+ Benchmark scores are normalized indexes ranging from zero to one, where higher values indicate better performance.

- **GPT similarity** quantifies semantic similarity between ground truth sentences and model predictions. This metric considers meaning rather than exact wording, allowing for paraphrasing while penalizing responses that miss key concepts.
- Additional NLP metrics you might encounter include:
- - **BLEU** (Bilingual Evaluation Understudy) - commonly used for translation tasks
- - **METEOR** (Metric for Evaluation of Translation with Explicit Ordering) - accounts for synonyms and paraphrasing
- - **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) - emphasizes recall for summarization tasks
- - **F1-score** - measures shared words between generated and ground truth answers
- :::image type="content" source="../media/model-benchmarks.png" alt-text="Screenshot of model benchmarks in Microsoft Foundry portal." lightbox="../media/model-benchmarks.png":::
+ :::image type="content" source="../media/model-leaderboard.png" alt-text="Screenshot of model leaderboard in Microsoft Foundry portal." lightbox="../media/model-benchmarks.png":::
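To make the quality index concrete, here is a small illustrative sketch rather than the portal's exact computation: per-dataset accuracy scores, each already normalized to the zero-to-one range, are averaged into a single index. The scores below are made-up example values.

```python
# Hypothetical normalized accuracy scores (0-1, higher is better) per dataset.
benchmark_scores = {
    "HumanEval+": 0.82,  # code generation
    "MBPP+": 0.75,       # code generation
    "MATH": 0.68,        # mathematical reasoning
    "MMLU-Pro": 0.71,    # general knowledge
    "IFEval": 0.88,      # instruction following
}

# A simple quality index: the mean of the normalized scores.
quality_index = sum(benchmark_scores.values()) / len(benchmark_scores)
print(f"Quality index: {quality_index:.2f}")  # 0.77 for these example values
```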
## Safety and risk benchmarks

Safety metrics ensure models don't generate harmful, biased, or inappropriate content. These benchmarks are crucial for applications exposed to end users, especially in regulated industries or customer-facing scenarios.

Microsoft Foundry evaluates models across multiple safety dimensions:

- **Content harm defect rate** measures the percentage of instances where output exceeds a severity threshold (default is Medium) for categories including:
- - Self-harm-related content
- - Hateful and unfair content
- - Violent content
- - Sexual content
+ **Harmful behavior detection** uses the HarmBench benchmark to measure how well models resist generating unsafe content. The evaluation calculates **Attack Success Rate (ASR)**, where lower values indicate safer, more robust models. HarmBench tests three functional areas:
+ - **Standard harmful behaviors** - cybercrime, illegal activities, general harm
+ - **Copyright violations** - reproducing copyrighted material

**Toxic content detection** uses the ToxiGen dataset to measure how well models identify adversarial and implicit hate speech. Higher F1 scores indicate better detection performance across references to minority groups.

- **Protected material** detection identifies whether models reproduce copyrighted or proprietary content. The defect rate calculates the percentage of instances where output contains protected material.
+ **Sensitive domain knowledge** uses the WMDP (Weapons of Mass Destruction Proxy) benchmark to measure model knowledge in biosecurity, cybersecurity, and chemical security. Higher WMDP scores indicate more knowledge of potentially dangerous capabilities.

- **Indirect attack** (jailbreak) resistance measures how well models maintain safety guardrails when users attempt to manipulate them into generating harmful content through indirect prompting techniques.
+ Safety scores help you understand model robustness, especially important for customer-facing applications where harmful output poses significant concerns.
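As a rough illustration of how these safety numbers are produced (simplified compared with the actual benchmark harnesses), attack success rate is the fraction of adversarial prompts that elicit unsafe output, and toxic-content detection is scored with a standard F1 over the model's classifications. The counts below are invented for the example.

```python
# Attack Success Rate (ASR): share of adversarial prompts that produced
# unsafe output. Lower is better.
successful_attacks = 12
total_attack_prompts = 400
asr = successful_attacks / total_attack_prompts
print(f"ASR: {asr:.3f}")  # 0.030

# F1 score for toxic-content detection (higher is better).
true_positives, false_positives, false_negatives = 180, 20, 30
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)
print(f"F1: {f1:.3f}")  # 0.878
```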
## Cost benchmarks
@@ -73,15 +73,23 @@ Performance metrics measure how quickly and efficiently models respond to reques

- **Total tokens per second (TTPS)** - combined input and output tokens processed per second
- **Time between tokens** - interval between receiving consecutive tokens

- High-throughput, low-latency models provide better user experiences in interactive applications. For batch processing jobs where speed matters less than cost, you can prioritize other factors.
+ The leaderboard summarizes performance using mean time to first token (lower is better) and mean generated tokens per second (higher is better). High-throughput, low-latency models provide better user experiences in interactive applications. For batch processing jobs where speed matters less than cost, you can prioritize other factors.
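If you want to sanity-check latency for your own deployment, a streamed request lets you observe when the first token arrives. The sketch below uses the `azure-ai-inference` Python package with placeholder endpoint, key, and deployment values; a real benchmark averages over many requests and standardized prompts rather than a single call.

```python
import time

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-resource>.services.ai.azure.com/models",  # placeholder
    credential=AzureKeyCredential("<your-api-key>"),                  # placeholder
)

start = time.perf_counter()
first_token_time = None
chunks = 0

# Stream the response so we can observe when the first content arrives.
response = client.complete(
    model="<your-deployment-name>",  # placeholder
    messages=[UserMessage(content="Explain model benchmarks in one paragraph.")],
    stream=True,
)
for update in response:
    if update.choices and update.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter() - start
        chunks += 1

total_time = time.perf_counter() - start
print(f"Time to first token: {first_token_time:.2f} s")
print(f"Approx. streamed chunks per second: {chunks / total_time:.1f}")
```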
+ ## Use leaderboards and comparison features

+ The model leaderboard lets you view top models for specific metrics. You can sort by quality, safety, estimated cost, and throughput to identify models that best match your requirements.

- ## Use leaderboards and comparison charts
+ **Scenario leaderboards** help you find models optimized for specific use cases like reasoning, coding, math, question answering, or groundedness. If your application maps to a particular scenario, start with the relevant scenario leaderboard rather than relying solely on overall quality index.

- The **Browse leaderboards** feature lets you view top models for specific metrics. You can filter leaderboards by scenario (such as question answering or summarization) to find models optimized for your use case.
+ **Trade-off charts** display two metrics simultaneously, such as quality versus cost or quality versus throughput. These visualizations help you find the optimal balance for your requirements. Use the dropdown to compare quality against cost, throughput, or safety. Models closer to the top-right corner of the chart perform well on both metrics. A model that's slightly less accurate but significantly faster or cheaper might better serve your needs.

- **Trade-off charts** display two metrics simultaneously, such as quality versus cost or latency versus throughput. These visualizations help you find the optimal balance for your requirements. A model that's slightly less accurate but significantly faster or cheaper might better serve your needs.
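If you want to reason about the quality-versus-cost trade-off outside the chart, a simple weighted score can rank candidates. The model names, metric values, and weights below are hypothetical, and the linear weighting is just one reasonable way to combine normalized metrics.

```python
# Hypothetical candidates with normalized quality (0-1, higher is better)
# and normalized cost (0-1, lower is better).
candidates = {
    "model-a": {"quality": 0.84, "cost": 0.70},
    "model-b": {"quality": 0.79, "cost": 0.35},
    "model-c": {"quality": 0.66, "cost": 0.15},
}

QUALITY_WEIGHT, COST_WEIGHT = 0.7, 0.3

def trade_off_score(metrics: dict) -> float:
    # Reward quality and penalize cost; both are already on a 0-1 scale.
    return QUALITY_WEIGHT * metrics["quality"] + COST_WEIGHT * (1 - metrics["cost"])

best = max(candidates, key=lambda name: trade_off_score(candidates[name]))
print(best, round(trade_off_score(candidates[best]), 3))  # model-b, 0.748 here
```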
+ **Side-by-side comparison** lets you select two or three models from the leaderboard and compare them across multiple dimensions:
+ - Model details (context window, training data, supported languages)
+ - Supported endpoints (deployment options)
+ - Feature support (function calling, structured output, vision)

- **Comparison tables** show detailed results for each metric across multiple models, making it easy to see exact numbers and compare candidates side by side.
+ Select models by checking boxes next to their names, then choose **Compare** to open the detailed comparison view.