
Commit 62e91cf

review for new portal accuracy
1 parent b45752d commit 62e91cf

9 files changed

Lines changed: 89 additions & 51 deletions


Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+### YamlMime:LearningPath
+uid: learn.wwl.develop-generative-ai-apps
+metadata:
+  title: Develop generative AI apps in Azure
+  description: Learn how to develop generative AI apps in Azure. (AI-3016)
+  ms.date: 02/17/2026
+  author: ivorb
+  ms.author: berryivor
+  ms.topic: learning-path
+  ms.collection: wwl-ai-copilot
+  ms.custom: [copilot-learning-hub]
+title: Develop generative AI apps in Azure
+prerequisites: |
+  Before starting this learning path, you should be familiar with fundamental AI concepts and services in Azure. You should also have programming experience.
+summary: |
+  Generative Artificial Intelligence (AI) is becoming more accessible through comprehensive development platforms like Microsoft Foundry. Learn how to build generative AI applications that use language models to interact with your users.
+iconUrl: /training/achievements/generic-badge.svg
+levels:
+- intermediate
+roles:
+- data-scientist
+- ai-engineer
+products:
+- ai-services
+- azure-ai-foundry
+- azure-ai-foundry-sdk
+subjects:
+- artificial-intelligence
+modules:
+- learn.wwl.prepare-azure-ai-development
+- learn.wwl.model-catalog-evaluate
+- learn.wwl.ai-foundry-sdk
+- learn.wwl.finetune-model-copilot-ai-studio
+- learn.wwl.responsible-ai-studio
+trophy:
+  uid: learn.wwl.develop-generative-ai-apps.trophy

learn-pr/wwl-data-ai/model-catalog-evaluate/includes/2-explore-model-catalog.md

Lines changed: 2 additions & 10 deletions
@@ -2,7 +2,7 @@ The model catalog in Microsoft Foundry portal serves as your central hub for dis
 
 ## Access the model catalog
 
-You access the model catalog from the Microsoft Foundry portal at [ai.azure.com](https://ai.azure.com). After signing in and selecting your project, choose **Model catalog** from the left navigation pane. The catalog displays model cards showing key information about each model, including the provider, capabilities, and deployment options.
+You access the model catalog from the Microsoft Foundry portal at [ai.azure.com](https://ai.azure.com). After signing in and selecting your project, choose **Discover** from the top navigation. The catalog displays model cards showing key information about each model, including the provider, capabilities, and deployment options.
 
 
 :::image type="content" source="../media/model-catalog.png" alt-text="Screenshot of the model catalog in Microsoft Foundry portal.":::
 
@@ -16,16 +16,8 @@ The model catalog provides several filters to help you narrow your search:
 
 **Capabilities** filters highlight unique model features. You can filter for reasoning capabilities (complex problem-solving), tool calling (API and function integration), or multimodal processing (text, images, audio).
 
-**Deployment options** filters help you find models that support your preferred deployment type:
-- Serverless API for pay-per-call flexibility
-- Provisioned deployment for consistent, high-volume workloads
-- Managed compute for virtual machine-based hosting
-- Batch processing for cost-optimized, non-latency-sensitive jobs
-
 **Inference tasks** and **Fine-tune tasks** filters let you find models suited for specific activities like text generation, summarization, translation, or entity extraction.
 
-**License** filters help you identify models that align with your organization's licensing requirements and usage policies.
-
 ## Understand model types
 
 As you explore the catalog, you encounter different categories of models designed for various use cases.
@@ -40,7 +32,7 @@ As you explore the catalog, you encounter different categories of models designe
 
 Most language models in the catalog are **chat completion** models designed to generate coherent, contextually appropriate text responses. These models power conversational interfaces and content generation applications.
 
-For scenarios requiring higher performance in complex tasks like mathematics, coding, science, strategy, and logistics, **reasoning models** like DeepSeek-R1 provide enhanced problem-solving capabilities. These models can break down complex problems and show their reasoning process.
+For scenarios requiring higher performance in complex tasks like mathematics, coding, science, strategy, and logistics, **reasoning models** like Claude Opus 4.6 provide enhanced problem-solving capabilities. These models can break down complex problems and show their reasoning process.
 
 ### Multimodal models
 
learn-pr/wwl-data-ai/model-catalog-evaluate/includes/3-select-models-benchmarks.md

Lines changed: 33 additions & 25 deletions
@@ -4,45 +4,45 @@ Before deploying a model, you want to understand how it performs across differen
 
 You can explore benchmarks in two ways within the Microsoft Foundry portal:
 
-From the **model catalog**, select **Browse leaderboards** to see comparative rankings across all available models. This view helps you identify top-performing models for specific metrics or scenarios.
+From the **model catalog**, select **Go to leaderboard** to see comparative rankings across all available models. This view helps you identify top-performing models for specific metrics or scenarios. The leaderboard displays top models ranked by quality, safety, estimated cost, and throughput.
 
 For detailed benchmarks on a specific model, open its model card and select the **Benchmarks** tab. This view shows how the individual model performs across various metrics and datasets, with comparison charts placing it relative to similar models.
 
 ## Quality benchmarks
 
 Quality benchmarks assess how well a model generates accurate, coherent, and contextually appropriate responses. These metrics use public datasets and standardized evaluation methods to ensure consistency.
 
-**Accuracy** measures whether model-generated text matches correct answers according to the dataset. The result is binary: one if the generated text matches exactly, zero otherwise. High accuracy indicates the model reliably produces correct factual responses.
+The **Quality index** provides a high-level overview by averaging accuracy scores across multiple benchmark datasets that measure reasoning, knowledge, question answering, mathematical capabilities, and coding skills. Higher quality index values indicate stronger overall performance across general-purpose language tasks.
 
-**Coherence** evaluates whether model output flows smoothly and resembles human-like language. A coherent response maintains logical structure and clear relationships between ideas, making it easy for users to follow and understand.
+Quality benchmarks use datasets such as:
+- **Arena-Hard** - adversarial question answering
+- **BIG-Bench Hard** - reasoning capabilities
+- **GPQA** - graduate-level multi-discipline questions
+- **HumanEval+** and **MBPP+** - code generation tasks
+- **MATH** - mathematical reasoning
+- **MMLU-Pro** - general knowledge assessment
+- **IFEval** - instruction following
 
-**Fluency** assesses grammatical correctness, syntactic structure, and appropriate vocabulary usage. Fluent responses sound natural and linguistically correct, avoiding awkward phrasing or grammatical errors.
+Benchmark scores are normalized indexes ranging from zero to one, where higher values indicate better performance.
 
-**GPT similarity** quantifies semantic similarity between ground truth sentences and model predictions. This metric considers meaning rather than exact wording, allowing for paraphrasing while penalizing responses that miss key concepts.
-
-Additional NLP metrics you might encounter include:
-- **BLEU** (Bilingual Evaluation Understudy) - commonly used for translation tasks
-- **METEOR** (Metric for Evaluation of Translation with Explicit Ordering) - accounts for synonyms and paraphrasing
-- **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) - emphasizes recall for summarization tasks
-- **F1-score** - measures shared words between generated and ground truth answers
-
-:::image type="content" source="../media/model-benchmarks.png" alt-text="Screenshot of model benchmarks in Microsoft Foundry portal." lightbox="../media/model-benchmarks.png":::
+:::image type="content" source="../media/model-leaderboard.png" alt-text="Screenshot of model leaderboard in Microsoft Foundry portal." lightbox="../media/model-leaderboard.png":::
 
 ## Safety and risk benchmarks
 
 Safety metrics ensure models don't generate harmful, biased, or inappropriate content. These benchmarks are crucial for applications exposed to end users, especially in regulated industries or customer-facing scenarios.
 
 Microsoft Foundry evaluates models across multiple safety dimensions:
 
-**Content harm defect rate** measures the percentage of instances where output exceeds a severity threshold (default is Medium) for categories including:
-- Self-harm-related content
-- Hateful and unfair content
-- Violent content
-- Sexual content
+**Harmful behavior detection** uses the HarmBench benchmark to measure how well models resist generating unsafe content. The evaluation calculates **Attack Success Rate (ASR)**, where lower values indicate safer, more robust models. HarmBench tests three functional areas:
+- **Standard harmful behaviors** - cybercrime, illegal activities, general harm
+- **Contextually harmful behaviors** - misinformation, harassment, bullying
+- **Copyright violations** - reproducing copyrighted material
+
+**Toxic content detection** uses the ToxiGen dataset to measure how well models identify adversarial and implicit hate speech. Higher F1 scores indicate better detection performance across references to minority groups.
 
-**Protected material** detection identifies whether models reproduce copyrighted or proprietary content. The defect rate calculates the percentage of instances where output contains protected material.
+**Sensitive domain knowledge** uses the WMDP (Weapons of Mass Destruction Proxy) benchmark to measure model knowledge in biosecurity, cybersecurity, and chemical security. Higher WMDP scores indicate more knowledge of potentially dangerous capabilities.
 
-**Indirect attack** (jailbreak) resistance measures how well models maintain safety guardrails when users attempt to manipulate them into generating harmful content through indirect prompting techniques.
+Safety scores help you understand model robustness, especially important for customer-facing applications where harmful output poses significant concerns.
 
 ## Cost benchmarks
 
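The quality index and Attack Success Rate described in this revision are both simple aggregates: a mean of normalized per-dataset scores, and the fraction of adversarial prompts that elicit harmful output. A minimal illustrative sketch (hypothetical values, not real benchmark results or the portal's actual implementation):

```python
# Illustrative sketch only: aggregating a quality index and an attack
# success rate (ASR). Dataset names and scores below are hypothetical
# examples, not published benchmark results.

def quality_index(scores):
    """Mean of per-dataset scores, each already normalized to [0, 1]."""
    if not scores:
        raise ValueError("need at least one benchmark score")
    return sum(scores.values()) / len(scores)

def attack_success_rate(attack_outcomes):
    """Fraction of adversarial prompts that produced harmful output.
    Lower is safer; HarmBench reports ASR with the same orientation."""
    return sum(attack_outcomes) / len(attack_outcomes)

# Hypothetical normalized scores for one model:
scores = {"MMLU-Pro": 0.72, "GPQA": 0.48, "HumanEval+": 0.85, "MATH": 0.66}
print(round(quality_index(scores), 4))      # mean of the four scores
print(attack_success_rate([1, 0, 0, 0, 1])) # 2 of 5 attacks succeeded
```

Because every input score is already normalized to the zero-to-one range, the resulting index stays in that range too, which is what makes cross-model comparison on the leaderboard meaningful.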
@@ -73,15 +73,23 @@ Performance metrics measure how quickly and efficiently models respond to reques
 - **Total tokens per second (TTPS)** - combined input and output tokens processed per second
 - **Time between tokens** - interval between receiving consecutive tokens
 
-High-throughput, low-latency models provide better user experiences in interactive applications. For batch processing jobs where speed matters less than cost, you can prioritize other factors.
+The leaderboard summarizes performance using mean time to first token (lower is better) and mean generated tokens per second (higher is better). High-throughput, low-latency models provide better user experiences in interactive applications. For batch processing jobs where speed matters less than cost, you can prioritize other factors.
+
+## Use leaderboards and comparison features
+
+The model leaderboard lets you view top models for specific metrics. You can sort by quality, safety, estimated cost, and throughput to identify models that best match your requirements.
 
-## Use leaderboards and comparison charts
+**Scenario leaderboards** help you find models optimized for specific use cases like reasoning, coding, math, question answering, or groundedness. If your application maps to a particular scenario, start with the relevant scenario leaderboard rather than relying solely on the overall quality index.
 
-The **Browse leaderboards** feature lets you view top models for specific metrics. You can filter leaderboards by scenario (such as question answering or summarization) to find models optimized for your use case.
+**Trade-off charts** display two metrics simultaneously, such as quality versus cost or quality versus throughput. These visualizations help you find the optimal balance for your requirements. Use the dropdown to compare quality against cost, throughput, or safety. Models closer to the top-right corner of the chart perform well on both metrics. A model that's slightly less accurate but significantly faster or cheaper might better serve your needs.
 
-**Trade-off charts** display two metrics simultaneously, such as quality versus cost or latency versus throughput. These visualizations help you find the optimal balance for your requirements. A model that's slightly less accurate but significantly faster or cheaper might better serve your needs.
+**Side-by-side comparison** lets you select two or three models from the leaderboard and compare them across multiple dimensions:
+- Performance benchmarks (quality, safety, throughput)
+- Model details (context window, training data, supported languages)
+- Supported endpoints (deployment options)
+- Feature support (function calling, structured output, vision)
 
-**Comparison tables** show detailed results for each metric across multiple models, making it easy to see exact numbers and compare candidates side by side.
+Select models by checking boxes next to their names, then choose **Compare** to open the detailed comparison view.
 
 ## Evaluate with your own data
 