
Commit 62e91cf

review for new portal accuracy
1 parent b45752d commit 62e91cf

9 files changed

Lines changed: 89 additions & 51 deletions


Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+### YamlMime:LearningPath
+uid: learn.wwl.develop-generative-ai-apps
+metadata:
+  title: Develop generative AI apps in Azure
+  description: Learn how to develop generative AI apps in Azure. (AI-3016)
+  ms.date: 02/17/2026
+  author: ivorb
+  ms.author: berryivor
+  ms.topic: learning-path
+  ms.collection: wwl-ai-copilot
+  ms.custom: [copilot-learning-hub]
+title: Develop generative AI apps in Azure
+prerequisites: |
+  Before starting this learning path, you should be familiar with fundamental AI concepts and services in Azure. You should also have programming experience.
+summary: |
+  Generative Artificial Intelligence (AI) is becoming more accessible through comprehensive development platforms like Microsoft Foundry. Learn how to build generative AI applications that use language models to interact with your users.
+iconUrl: /training/achievements/generic-badge.svg
+levels:
+- intermediate
+roles:
+- data-scientist
+- ai-engineer
+products:
+- ai-services
+- azure-ai-foundry
+- azure-ai-foundry-sdk
+subjects:
+- artificial-intelligence
+modules:
+- learn.wwl.prepare-azure-ai-development
+- learn.wwl.model-catalog-evaluate
+- learn.wwl.ai-foundry-sdk
+- learn.wwl.finetune-model-copilot-ai-studio
+- learn.wwl.responsible-ai-studio
+trophy:
+  uid: learn.wwl.develop-generative-ai-apps.trophy

learn-pr/wwl-data-ai/model-catalog-evaluate/includes/2-explore-model-catalog.md

Lines changed: 2 additions & 10 deletions
@@ -2,7 +2,7 @@ The model catalog in Microsoft Foundry portal serves as your central hub for dis
 
 ## Access the model catalog
 
-You access the model catalog from the Microsoft Foundry portal at [ai.azure.com](https://ai.azure.com). After signing in and selecting your project, choose **Model catalog** from the left navigation pane. The catalog displays model cards showing key information about each model, including the provider, capabilities, and deployment options.
+You access the model catalog from the Microsoft Foundry portal at [ai.azure.com](https://ai.azure.com). After signing in and selecting your project, choose **Discover** from the top navigation. The catalog displays model cards showing key information about each model, including the provider, capabilities, and deployment options.
 
 
 :::image type="content" source="../media/model-catalog.png" alt-text="Screenshot of the model catalog in Microsoft Foundry portal.":::
 
@@ -16,16 +16,8 @@ The model catalog provides several filters to help you narrow your search:
 
 **Capabilities** filters highlight unique model features. You can filter for reasoning capabilities (complex problem-solving), tool calling (API and function integration), or multimodal processing (text, images, audio).
 
-**Deployment options** filters help you find models that support your preferred deployment type:
-- Serverless API for pay-per-call flexibility
-- Provisioned deployment for consistent, high-volume workloads
-- Managed compute for virtual machine-based hosting
-- Batch processing for cost-optimized, non-latency-sensitive jobs
-
 **Inference tasks** and **Fine-tune tasks** filters let you find models suited for specific activities like text generation, summarization, translation, or entity extraction.
 
-**License** filters help you identify models that align with your organization's licensing requirements and usage policies.
-
 ## Understand model types
 
 As you explore the catalog, you encounter different categories of models designed for various use cases.
@@ -40,7 +32,7 @@ As you explore the catalog, you encounter different categories of models designe
 
 Most language models in the catalog are **chat completion** models designed to generate coherent, contextually appropriate text responses. These models power conversational interfaces and content generation applications.
 
-For scenarios requiring higher performance in complex tasks like mathematics, coding, science, strategy, and logistics, **reasoning models** like DeepSeek-R1 provide enhanced problem-solving capabilities. These models can break down complex problems and show their reasoning process.
+For scenarios requiring higher performance in complex tasks like mathematics, coding, science, strategy, and logistics, **reasoning models** like Claude Opus 4.6 provide enhanced problem-solving capabilities. These models can break down complex problems and show their reasoning process.
 
 ### Multimodal models
 
learn-pr/wwl-data-ai/model-catalog-evaluate/includes/3-select-models-benchmarks.md

Lines changed: 33 additions & 25 deletions
@@ -4,45 +4,45 @@ Before deploying a model, you want to understand how it performs across differen
 
 You can explore benchmarks in two ways within the Microsoft Foundry portal:
 
-From the **model catalog**, select **Browse leaderboards** to see comparative rankings across all available models. This view helps you identify top-performing models for specific metrics or scenarios.
+From the **model catalog**, select **Go to leaderboard** to see comparative rankings across all available models. This view helps you identify top-performing models for specific metrics or scenarios. The leaderboard displays top models ranked by quality, safety, estimated cost, and throughput.
 
 For detailed benchmarks on a specific model, open its model card and select the **Benchmarks** tab. This view shows how the individual model performs across various metrics and datasets, with comparison charts placing it relative to similar models.
 
 ## Quality benchmarks
 
 Quality benchmarks assess how well a model generates accurate, coherent, and contextually appropriate responses. These metrics use public datasets and standardized evaluation methods to ensure consistency.
 
-**Accuracy** measures whether model-generated text matches correct answers according to the dataset. The result is binary: one if the generated text matches exactly, zero otherwise. High accuracy indicates the model reliably produces correct factual responses.
+The **Quality index** provides a high-level overview by averaging accuracy scores across multiple benchmark datasets that measure reasoning, knowledge, question answering, mathematical capabilities, and coding skills. Higher quality index values indicate stronger overall performance across general-purpose language tasks.
 
-**Coherence** evaluates whether model output flows smoothly and resembles human-like language. A coherent response maintains logical structure and clear relationships between ideas, making it easy for users to follow and understand.
+Quality benchmarks use datasets such as:
+- **Arena-Hard** - adversarial question answering
+- **BIG-Bench Hard** - reasoning capabilities
+- **GPQA** - graduate-level multi-discipline questions
+- **HumanEval+** and **MBPP+** - code generation tasks
+- **MATH** - mathematical reasoning
+- **MMLU-Pro** - general knowledge assessment
+- **IFEval** - instruction following
 
-**Fluency** assesses grammatical correctness, syntactic structure, and appropriate vocabulary usage. Fluent responses sound natural and linguistically correct, avoiding awkward phrasing or grammatical errors.
+Benchmark scores are normalized indexes ranging from zero to one, where higher values indicate better performance.
 
-**GPT similarity** quantifies semantic similarity between ground truth sentences and model predictions. This metric considers meaning rather than exact wording, allowing for paraphrasing while penalizing responses that miss key concepts.
-
-Additional NLP metrics you might encounter include:
-- **BLEU** (Bilingual Evaluation Understudy) - commonly used for translation tasks
-- **METEOR** (Metric for Evaluation of Translation with Explicit Ordering) - accounts for synonyms and paraphrasing
-- **ROUGE** (Recall-Oriented Understudy for Gisting Evaluation) - emphasizes recall for summarization tasks
-- **F1-score** - measures shared words between generated and ground truth answers
-
-:::image type="content" source="../media/model-benchmarks.png" alt-text="Screenshot of model benchmarks in Microsoft Foundry portal." lightbox="../media/model-benchmarks.png":::
+:::image type="content" source="../media/model-leaderboard.png" alt-text="Screenshot of model leaderboard in Microsoft Foundry portal." lightbox="../media/model-leaderboard.png":::
 
 ## Safety and risk benchmarks
 
 Safety metrics ensure models don't generate harmful, biased, or inappropriate content. These benchmarks are crucial for applications exposed to end users, especially in regulated industries or customer-facing scenarios.
 
 Microsoft Foundry evaluates models across multiple safety dimensions:
 
-**Content harm defect rate** measures the percentage of instances where output exceeds a severity threshold (default is Medium) for categories including:
-- Self-harm-related content
-- Hateful and unfair content
-- Violent content
-- Sexual content
+**Harmful behavior detection** uses the HarmBench benchmark to measure how well models resist generating unsafe content. The evaluation calculates **Attack Success Rate (ASR)**, where lower values indicate safer, more robust models. HarmBench tests three functional areas:
+- **Standard harmful behaviors** - cybercrime, illegal activities, general harm
+- **Contextually harmful behaviors** - misinformation, harassment, bullying
+- **Copyright violations** - reproducing copyrighted material
+
+**Toxic content detection** uses the ToxiGen dataset to measure how well models identify adversarial and implicit hate speech. Higher F1 scores indicate better detection performance across references to minority groups.
 
-**Protected material** detection identifies whether models reproduce copyrighted or proprietary content. The defect rate calculates the percentage of instances where output contains protected material.
+**Sensitive domain knowledge** uses the WMDP (Weapons of Mass Destruction Proxy) benchmark to measure model knowledge in biosecurity, cybersecurity, and chemical security. Higher WMDP scores indicate more knowledge of potentially dangerous capabilities.
 
-**Indirect attack** (jailbreak) resistance measures how well models maintain safety guardrails when users attempt to manipulate them into generating harmful content through indirect prompting techniques.
+Safety scores help you understand model robustness, especially important for customer-facing applications where harmful output poses significant concerns.
 
 ## Cost benchmarks
 
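The quality index and Attack Success Rate described in this revision are both simple aggregates: a mean of normalized per-dataset scores, and the fraction of adversarial prompts that elicit harmful output. A minimal illustrative sketch (hypothetical values, not real benchmark results or the portal's actual implementation):

```python
# Illustrative sketch only: aggregating a quality index and an attack
# success rate (ASR). Dataset names and scores below are hypothetical
# examples, not published benchmark results.

def quality_index(scores):
    """Mean of per-dataset scores, each already normalized to [0, 1]."""
    if not scores:
        raise ValueError("need at least one benchmark score")
    return sum(scores.values()) / len(scores)

def attack_success_rate(attack_outcomes):
    """Fraction of adversarial prompts that produced harmful output.
    Lower is safer; HarmBench reports ASR with the same orientation."""
    return sum(attack_outcomes) / len(attack_outcomes)

# Hypothetical normalized scores for one model:
scores = {"MMLU-Pro": 0.72, "GPQA": 0.48, "HumanEval+": 0.85, "MATH": 0.66}
print(round(quality_index(scores), 4))      # mean of the four scores
print(attack_success_rate([1, 0, 0, 0, 1])) # 2 of 5 attacks succeeded
```

Because every input score is already normalized to the zero-to-one range, the resulting index stays in that range too, which is what makes cross-model comparison on the leaderboard meaningful.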
@@ -73,15 +73,23 @@ Performance metrics measure how quickly and efficiently models respond to reques
 - **Total tokens per second (TTPS)** - combined input and output tokens processed per second
 - **Time between tokens** - interval between receiving consecutive tokens
 
-High-throughput, low-latency models provide better user experiences in interactive applications. For batch processing jobs where speed matters less than cost, you can prioritize other factors.
+The leaderboard summarizes performance using mean time to first token (lower is better) and mean generated tokens per second (higher is better). High-throughput, low-latency models provide better user experiences in interactive applications. For batch processing jobs where speed matters less than cost, you can prioritize other factors.
+
+## Use leaderboards and comparison features
+
+The model leaderboard lets you view top models for specific metrics. You can sort by quality, safety, estimated cost, and throughput to identify models that best match your requirements.
 
-## Use leaderboards and comparison charts
+**Scenario leaderboards** help you find models optimized for specific use cases like reasoning, coding, math, question answering, or groundedness. If your application maps to a particular scenario, start with the relevant scenario leaderboard rather than relying solely on the overall quality index.
 
-The **Browse leaderboards** feature lets you view top models for specific metrics. You can filter leaderboards by scenario (such as question answering or summarization) to find models optimized for your use case.
+**Trade-off charts** display two metrics simultaneously, such as quality versus cost or quality versus throughput. These visualizations help you find the optimal balance for your requirements. Use the dropdown to compare quality against cost, throughput, or safety. Models closer to the top-right corner of the chart perform well on both metrics. A model that's slightly less accurate but significantly faster or cheaper might better serve your needs.
 
-**Trade-off charts** display two metrics simultaneously, such as quality versus cost or latency versus throughput. These visualizations help you find the optimal balance for your requirements. A model that's slightly less accurate but significantly faster or cheaper might better serve your needs.
+**Side-by-side comparison** lets you select two or three models from the leaderboard and compare them across multiple dimensions:
+- Performance benchmarks (quality, safety, throughput)
+- Model details (context window, training data, supported languages)
+- Supported endpoints (deployment options)
+- Feature support (function calling, structured output, vision)
 
-**Comparison tables** show detailed results for each metric across multiple models, making it easy to see exact numbers and compare candidates side by side.
+Select models by checking boxes next to their names, then choose **Compare** to open the detailed comparison view.
 
 ## Evaluate with your own data
 