
Commit 8f24eb5

Merge pull request #53930 from GraemeMalcolm/main
Updated module
2 parents 1b1cf84 + 5ed4489 commit 8f24eb5

13 files changed: 120 additions & 133 deletions

learn-pr/wwl-data-ai/model-catalog-evaluate/7-knowledge-check.yml

Lines changed: 33 additions & 33 deletions
@@ -13,36 +13,36 @@ durationInMinutes: 3
  quiz:
    title: Check your knowledge
    questions:
-   - content: "Which deployment type in Microsoft Foundry portal requires an Azure Marketplace subscription for models from partners and the community?"
-     choices:
-     - content: "Managed compute"
-       isCorrect: false
-       explanation: "Incorrect. Managed compute doesn't necessarily require Azure Marketplace subscriptions. It uses Azure virtual machines and is billed for hosting and inference costs."
-     - content: "Serverless API"
-       isCorrect: true
-       explanation: "Correct. Serverless API deployments for models from partners and community require Azure Marketplace subscriptions, while models sold directly by Azure don't require this subscription."
-     - content: "Provisioned"
-       isCorrect: false
-       explanation: "Incorrect. Provisioned deployments reserve dedicated capacity but don't specifically require Azure Marketplace subscriptions."
-   - content: "Which evaluation metric measures whether model responses are based on provided context rather than speculation?"
-     choices:
-     - content: "Fluency"
-       isCorrect: false
-       explanation: "Incorrect. Fluency evaluates linguistic correctness and natural language quality, not whether responses are grounded in context."
-     - content: "Groundedness"
-       isCorrect: true
-       explanation: "Correct. Groundedness determines whether responses are based on provided context rather than speculation or the model's general knowledge."
-     - content: "Coherence"
-       isCorrect: false
-       explanation: "Incorrect. Coherence assesses whether responses flow logically and maintain consistent ideas, not whether they're based on provided context."
-   - content: "What type of model should you select if your application needs to process both text and images?"
-     choices:
-     - content: "Small Language Model (SLM)"
-       isCorrect: false
-       explanation: "Incorrect. SLMs are efficient text-focused models but don't inherently process images alongside text."
-     - content: "Embedding model"
-       isCorrect: false
-       explanation: "Incorrect. Embedding models convert text into numerical representations for semantic search and similarity tasks, not for processing images."
-     - content: "Multimodal model"
-       isCorrect: true
-       explanation: "Correct. Multimodal models like GPT-4o and Phi-3-vision can process multiple data types including both text and images."
+   - content: "Which model benchmark indicates the model's ability to process prompts and return comprehensive responses quickly?"
+     choices:
+     - content: "Quality index"
+       isCorrect: false
+       explanation: "Incorrect. Quality index evaluates the overall quality of the model's responses, not its speed or efficiency."
+     - content: "Cost"
+       isCorrect: false
+       explanation: "Incorrect. Cost evaluates the expense of using the model, not its speed or efficiency."
+     - content: "Throughput"
+       isCorrect: true
+       explanation: "Correct. Throughput indicates the model's ability to process prompts and return comprehensive responses quickly."
+   - content: "Which deployment type in Microsoft Foundry is best for general use while offering the largest quota?"
+     choices:
+     - content: "Data Zone Batch"
+       isCorrect: false
+       explanation: "Incorrect. Data Zone Batch deployments are designed for batch processing and may not offer the largest quota for general use."
+     - content: "Global Standard"
+       isCorrect: true
+       explanation: "Correct. Global Standard deployments offer the largest quota and are suitable for general use."
+     - content: "Developer"
+       isCorrect: false
+       explanation: "Incorrect. Developer deployments are intended for development and testing of fine-tuned models, not for general use with the largest quota."
+   - content: "Which evaluation metric measures linguistic correctness and natural language quality?"
+     choices:
+     - content: "Fluency"
+       isCorrect: true
+       explanation: "Correct. Fluency evaluates linguistic correctness and natural language quality."
+     - content: "Groundedness"
+       isCorrect: false
+       explanation: "Incorrect. Groundedness determines whether responses are based on provided context rather than speculation or the model's general knowledge."
+     - content: "Relevance"
+       isCorrect: false
+       explanation: "Incorrect. Relevance assesses whether responses are pertinent to the given context or query, not linguistic correctness or natural language quality."
Lines changed: 29 additions & 27 deletions
@@ -1,57 +1,59 @@
- The model catalog in Microsoft Foundry portal serves as your central hub for discovering and comparing AI models. With over 1,900 models available from various providers, you need effective ways to filter and find models that match your specific requirements.
+ The Foundry Models catalog serves as your central hub for discovering and comparing AI models. With over 1,900 models available from various providers, you need effective ways to filter and find models that match your specific requirements.
  
- ## Access the model catalog
+ The model catalog includes two broad categories of model:
  
- You access the model catalog from the Microsoft Foundry portal at [ai.azure.com](https://ai.azure.com). After signing in and selecting your project, choose **Discover** from the top navigation. The catalog displays model cards showing key information about each model, including the provider, capabilities, and deployment options.
+ - **Foundry Models sold directly by Azure**
  
- :::image type="content" source="../media/model-catalog.png" alt-text="Screenshot of the model catalog in Microsoft Foundry portal.":::
- 
- ## Filter models by key attributes
+   These models are billed directly through your Azure subscription, and include Azure OpenAI models as well as models from Microsoft and other providers.
  
- The model catalog provides several filters to help you narrow your search:
+ - **Foundry Models from partners and community**
  
- **Collection** filters let you browse models by provider, such as Azure OpenAI, Meta, Mistral, Cohere, or Hugging Face. This helps when you have preferences or requirements for specific model families.
+   These models are provided by trusted partners and the community, each with its own licensing and pricing.
  
- **Industry** filters show models trained on industry-specific datasets. These specialized models often outperform general-purpose models in their respective domains.
+ ## Finding models in the model catalog
  
- **Capabilities** filters highlight unique model features. You can filter for reasoning capabilities (complex problem-solving), tool calling (API and function integration), or multimodal processing (text, images, audio).
+ The model catalog user interface in the Foundry portal provides an easy way to search for the right model for your needs. Each model has a *model card* showing its key information, including the provider, capabilities, benchmark metrics, responsible AI considerations, and deployment options.
  
- **Inference tasks** and **Fine-tune tasks** filters let you find models suited for specific activities like text generation, summarization, translation, or entity extraction.
+ :::image type="content" source="../media/model-catalog.png" alt-text="Screenshot of the model catalog in Microsoft Foundry portal.":::
  
- ## Understand model types
+ You can search for models by keyword, and you can filter based on the following attributes:
  
- As you explore the catalog, you encounter different categories of models designed for various use cases.
+ - **Collection**: Models are organized into collections, such as models that are provided directly in Azure, or models in the Hugging Face repository.
+ - **Capabilities**: Specific model abilities, including *reasoning* (complex problem-solving), *tool calling* (API and function integration), or *multimodal processing* (text, images, audio).
+ - **Source**: The model provider, including Azure OpenAI, Microsoft, Cohere, Mistral, Meta, Anthropic, and others.
+ - **Inference tasks**: Specific tasks like text generation, summarization, translation, image generation, speech synthesis, or other common AI tasks.
+ - **Fine-tuning methods**: Supported techniques for fine-tuning a model.
+ - **Industry**: Models trained on industry-specific datasets. These specialized models often outperform general-purpose models in their respective domains.
  
- ### Large Language Models and Small Language Models
+ ## Understand generative AI model types
  
- **Large Language Models (LLMs)** like GPT-4, Mistral Large, and Llama 3 70B are powerful models designed for tasks requiring deep reasoning, complex content generation, and extensive context understanding. These models excel at sophisticated applications but require more computational resources.
+ As you explore the catalog, you encounter different categories of models designed for various use cases. In broad terms, you can categorize language models as:
  
- **Small Language Models (SLMs)** like Phi-3, Mistral OSS models, and Llama 3 8B offer efficiency and cost-effectiveness while handling common natural language processing tasks. They're ideal for scenarios where speed and cost matter more than handling the most complex reasoning tasks. SLMs can run on lower-end hardware or edge devices.
+ - **Large Language Models (LLMs)** like GPT-5, Mistral Large, and Llama 3 70B that are designed for tasks requiring deep reasoning, complex content generation, and extensive context understanding. These models excel at sophisticated applications but require more computational resources.
+ - **Small Language Models (SLMs)** like Phi-4, Mistral OSS models, and Llama 3 8B that offer efficiency and cost-effectiveness while handling common natural language processing tasks. They're ideal for scenarios where speed and cost matter more than handling the most complex reasoning tasks. SLMs can run on lower-end hardware or edge devices.
  
  ### Chat completion and reasoning models
  
  Most language models in the catalog are **chat completion** models designed to generate coherent, contextually appropriate text responses. These models power conversational interfaces and content generation applications.
  
  For scenarios requiring higher performance in complex tasks like mathematics, coding, science, strategy, and logistics, **reasoning models** like Claude Opus 4.6 provide enhanced problem-solving capabilities. These models can break down complex problems and show their reasoning process.
  
- ### Multimodal models
- 
- Beyond text-only processing, **multimodal models** like GPT-4o and Phi-3-vision can handle multiple data types including images, audio, and text. Use these models when your application needs to analyze visual content, such as document understanding, image description, or chart explanation.
- 
  ### Specialized models
  
  The catalog also includes task-specific models:
  
- **Image generation models** like DALL·E 3 create visual content from text descriptions. Use these for generating marketing materials, illustrations, or design mockups.
- 
  **Embedding models** like Ada and Cohere convert text into numerical representations. These models enable semantic search, recommendation systems, and Retrieval Augmented Generation (RAG) scenarios where you need to find relevant information based on meaning rather than exact keyword matches.
  
- ### Regional and domain-specific models
+ **Image generation models** like GPT-image-1 create images from text descriptions. Use these for generating marketing materials, illustrations, or design mockups.
  
- Some models are optimized for specific languages, regions, or industries. When you need specialized performance in a particular domain or language, these models often outperform general-purpose alternatives. Examples include models trained on medical literature, legal documents, or specific language corpora.
+ **Video generation models** like Sora 2 create video content from text descriptions.
+ 
+ **Image analysis models** like GPT-4.1 can accept *multimodal* input, including text and images, and generate natural language output based on prompts that include images for analysis.
  
- ## Use the search and compare features
+ **Text to speech models** like GPT-4o-tts can convert text-based input to synthesized speech.
  
- Beyond filters, the model catalog offers search functionality to find models by name or keywords. You can open multiple model cards to compare their specifications, benchmarks, and capabilities side by side. This comparison helps you make informed decisions about which model best fits your use case, budget, and performance requirements.
+ **Speech to text models** like GPT-4o-transcribe can convert audio data containing speech into text transcriptions.
  
- When you identify promising candidates, you can view detailed benchmark results, test models in the playground, or proceed directly to deployment. The structured approach of filtering, comparing, and testing helps ensure you select the right model for your generative AI application.
+ ### Regional and domain-specific models
+ 
+ Some models are optimized for specific languages, regions, or industries. When you need specialized performance in a particular domain or language, these models often outperform general-purpose alternatives. Examples include models trained on medical literature, legal documents, or specific language corpora.
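
For illustration only (this sketch isn't part of the changed files): if you want to reason about the filter attributes above programmatically, you can treat each model card as a simple record and filter on its advertised capabilities and inference tasks. The field names and catalog entries below are hypothetical stand-ins, not the Foundry SDK or REST API.

```python
# Hypothetical, in-memory stand-in for catalog filtering; field names are assumptions.
from typing import Iterable, Optional

model_cards = [
    {"name": "gpt-5", "collection": "Azure OpenAI",
     "capabilities": {"reasoning", "tool calling"}, "inference_tasks": {"text generation"}},
    {"name": "phi-4", "collection": "Microsoft",
     "capabilities": {"tool calling"}, "inference_tasks": {"text generation", "summarization"}},
    {"name": "gpt-image-1", "collection": "Azure OpenAI",
     "capabilities": set(), "inference_tasks": {"image generation"}},
]

def filter_models(cards: Iterable[dict],
                  capability: Optional[str] = None,
                  task: Optional[str] = None) -> list:
    """Return model cards that advertise the requested capability and inference task."""
    matches = []
    for card in cards:
        if capability and capability not in card["capabilities"]:
            continue
        if task and task not in card["inference_tasks"]:
            continue
        matches.append(card)
    return matches

# Example: find tool-calling models suited to text generation.
for card in filter_models(model_cards, capability="tool calling", task="text generation"):
    print(card["name"], "-", card["collection"])
```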

learn-pr/wwl-data-ai/model-catalog-evaluate/includes/3-select-models-benchmarks.md

Lines changed: 7 additions & 10 deletions
@@ -4,7 +4,7 @@ Before deploying a model, you want to understand how it performs across differen
  
  You can explore benchmarks in two ways within the Microsoft Foundry portal:
  
- From the **model catalog**, select **Go to leaderboard** to see comparative rankings across all available models. This view helps you identify top-performing models for specific metrics or scenarios. The leaderboard displays top models ranked by quality, safety, estimated cost, and throughput.
+ In the **model catalog**, view the **Model leaderboard** to see comparative rankings across all available models. This view helps you identify top-performing models for specific metrics or scenarios. The leaderboard displays top models ranked by quality, safety, estimated cost, and throughput.
  
  For detailed benchmarks on a specific model, open its model card and select the **Benchmarks** tab. This view shows how the individual model performs across various metrics and datasets, with comparison charts placing it relative to similar models.
  
@@ -15,6 +15,7 @@ Quality benchmarks assess how well a model generates accurate, coherent, and con
  The **Quality index** provides a high-level overview by averaging accuracy scores across multiple benchmark datasets that measure reasoning, knowledge, question answering, mathematical capabilities, and coding skills. Higher quality index values indicate stronger overall performance across general-purpose language tasks.
  
  Quality benchmarks use datasets such as:
+ 
  - **Arena-Hard** - adversarial question answering
  - **BIG-Bench Hard** - reasoning capabilities
  - **GPQA** - graduate-level multi-discipline questions
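
For illustration only (not part of the changed file): the quality index described above averages normalized accuracy scores across benchmark datasets. A minimal sketch of that idea, with invented scores and a simple unweighted mean (the leaderboard's actual aggregation may differ):

```python
# Hypothetical, unweighted quality index over normalized (0-1) benchmark scores.
# The scores below are invented for the example.
benchmark_scores = {
    "Arena-Hard": 0.81,      # adversarial question answering
    "BIG-Bench Hard": 0.74,  # reasoning
    "GPQA": 0.62,            # graduate-level questions
}

quality_index = sum(benchmark_scores.values()) / len(benchmark_scores)
print(f"Quality index (simple mean): {quality_index:.2f}")  # higher is better
```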
@@ -27,13 +28,14 @@ Benchmark scores are normalized indexes ranging from zero to one, where higher v
  
  :::image type="content" source="../media/model-leaderboard.png" alt-text="Screenshot of model leaderboard in Microsoft Foundry portal." lightbox="../media/model-leaderboard.png":::
  
- ## Safety and risk benchmarks
+ ## Safety benchmarks
  
  Safety metrics ensure models don't generate harmful, biased, or inappropriate content. These benchmarks are crucial for applications exposed to end users, especially in regulated industries or customer-facing scenarios.
  
  Microsoft Foundry evaluates models across multiple safety dimensions:
  
  **Harmful behavior detection** uses the HarmBench benchmark to measure how well models resist generating unsafe content. The evaluation calculates **Attack Success Rate (ASR)**, where lower values indicate safer, more robust models. HarmBench tests three functional areas:
+ 
  - **Standard harmful behaviors** - cybercrime, illegal activities, general harm
  - **Contextually harmful behaviors** - misinformation, harassment, bullying
  - **Copyright violations** - reproducing copyrighted material
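
For illustration only (not part of the changed file): Attack Success Rate is the fraction of adversarial prompts that elicit the harmful behavior being probed, so lower is better. A toy calculation with invented results (not the actual HarmBench harness):

```python
# One boolean per adversarial test prompt; True means the attack succeeded.
attack_succeeded = [False, False, True, False, False, False, False, True, False, False]

asr = sum(attack_succeeded) / len(attack_succeeded)
print(f"Attack Success Rate: {asr:.0%}")  # 20% here; lower indicates a safer, more robust model
```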
@@ -61,6 +63,7 @@ Cost benchmarks help you identify models that deliver the quality you need at a
  Performance metrics measure how quickly and efficiently models respond to requests. These benchmarks matter for real-time applications where user experience depends on responsiveness.
  
  **Latency** measurements include:
+ 
  - **Latency mean** - average time in seconds to process a request
  - **Latency P50** (median) - 50% of requests complete faster than this time
  - **Latency P90** - 90% of requests complete faster than this time
@@ -69,6 +72,7 @@ Performance metrics measure how quickly and efficiently models respond to reques
  - **Time to first token (TTFT)** - time until the first token arrives when using streaming
  
  **Throughput** measurements include:
+ 
  - **Generated tokens per second (GTPS)** - output tokens generated per second
  - **Total tokens per second (TTPS)** - combined input and output tokens processed per second
  - **Time between tokens** - interval between receiving consecutive tokens
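
For illustration only (not part of the changed file): if you log your own request timings and token counts, the latency and throughput figures above are easy to reproduce. A sketch over invented measurements:

```python
# Latency and throughput math over invented measurements.
from statistics import mean, quantiles

latencies_s = [0.8, 1.1, 0.9, 2.4, 1.0, 1.3, 0.7, 1.9, 1.2, 1.0]  # seconds per request
generated_tokens = 4200            # total output tokens across all requests
total_time_s = sum(latencies_s)    # simplistic: assumes requests ran back to back

cuts = quantiles(latencies_s, n=100)   # 99 cut points; index 49 ~ P50, index 89 ~ P90
p50, p90 = cuts[49], cuts[89]
print(f"Latency mean: {mean(latencies_s):.2f}s, P50: {p50:.2f}s, P90: {p90:.2f}s")
print(f"Generated tokens per second (GTPS): {generated_tokens / total_time_s:.1f}")
```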
@@ -84,17 +88,10 @@ The model leaderboard lets you view top models for specific metrics. You can sor
  **Trade-off charts** display two metrics simultaneously, such as quality versus cost or quality versus throughput. These visualizations help you find the optimal balance for your requirements. Use the dropdown to compare quality against cost, throughput, or safety. Models closer to the top-right corner of the chart perform well on both metrics. A model that's slightly less accurate but significantly faster or cheaper might better serve your needs.
  
  **Side-by-side comparison** lets you select two or three models from the leaderboard and compare them across multiple dimensions:
+ 
  - Performance benchmarks (quality, safety, throughput)
  - Model details (context window, training data, supported languages)
  - Supported endpoints (deployment options)
  - Feature support (function calling, structured output, vision)
  
  Select models by checking boxes next to their names, then choose **Compare** to open the detailed comparison view.
- 
- ## Evaluate with your own data
- 
- While public benchmark results provide valuable guidance, you can also evaluate models using your own test data. From a model's **Benchmarks** tab, select **Try with your own data** to run evaluations on scenarios specific to your application.
- 
- This custom evaluation uses your own prompts, expected responses, and evaluation criteria. The results show how the model performs on your actual use case, complementing the public benchmark data with application-specific insights.
- 
- By combining public benchmarks with custom testing, you gather the evidence needed to select a model confidently. You understand not only how a model performs generally, but specifically how well it addresses your unique requirements for quality, safety, cost, and performance.
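
For illustration only (not part of the changed file): the quality-versus-cost trade-off described above can also be checked numerically by discarding any model that another model beats on both axes, which is roughly what the chart lets you eyeball. All numbers below are invented.

```python
# Keep models that no other model dominates (cheaper or equal AND at least as high quality,
# with a strict improvement on at least one axis). All figures are invented.
models = [
    {"name": "model-a", "quality": 0.91, "cost_per_1m_tokens": 30.0},
    {"name": "model-b", "quality": 0.88, "cost_per_1m_tokens": 6.0},
    {"name": "model-c", "quality": 0.80, "cost_per_1m_tokens": 9.0},   # dominated by model-b
    {"name": "model-d", "quality": 0.75, "cost_per_1m_tokens": 1.5},
]

def pareto_frontier(candidates):
    frontier = []
    for m in candidates:
        dominated = any(
            other is not m
            and other["cost_per_1m_tokens"] <= m["cost_per_1m_tokens"]
            and other["quality"] >= m["quality"]
            and (other["cost_per_1m_tokens"] < m["cost_per_1m_tokens"]
                 or other["quality"] > m["quality"])
            for other in candidates
        )
        if not dominated:
            frontier.append(m)
    return frontier

for m in pareto_frontier(models):
    print(f'{m["name"]}: quality={m["quality"]:.2f}, ${m["cost_per_1m_tokens"]:.2f}/1M tokens')
```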
