1 | | -Some of the earliest techniques used to analyze text with computers involve statistical analysis of a body of text (a *corpus*) to infer some kind of semantic meaning. Put simply, if you can determine the most commonly used words in a given document, you can often get a good idea of what the document is about. |
| 1 | +::: zone pivot="video" |
2 | 2 |
|
3 | | -## Tokenization |
| 3 | +>[!VIDEO https://learn-video.azurefd.net/vod/player?id=74a44e51-3a04-4a19-8b33-40ee0672c782] |
4 | 4 |
|
5 | | -The first step in analyzing a corpus is to break it down into *tokens*. For the sake of simplicity, you can think of each distinct word in the training text as a token, though in reality, tokens can be generated for partial words, or combinations of words and punctuation. |
| 5 | +> [!NOTE] |
| 6 | +> See the **Text and images** tab for more details! |
6 | 7 |
|
7 | | -For example, consider this phrase from a famous US presidential speech: `"we choose to go to the moon"`. The phrase can be broken down into the following tokens, with numeric identifiers: |
| 8 | +::: zone-end |
8 | 9 |
|
9 | | -``` |
10 | | -1. we |
11 | | -2. choose |
12 | | -3. to |
13 | | -4. go |
14 | | -5. the |
15 | | -6. moon |
16 | | -``` |
| 10 | +::: zone pivot="text" |
17 | 11 |
|
18 | | -Notice that `"to"` (token number 3) is used twice in the corpus. The phrase `"we choose to go to the moon"` can be represented by the tokens :::no-loc text="{1,2,3,4,3,5,6}":::. |
| 12 | +The first step in analyzing a body of text (referred to as a *corpus*) is to break it down into *tokens*. For the sake of simplicity, you can think of each distinct word in the text as a token. In reality, tokens can be generated for partial words or combinations of words and punctuation. |
19 | 13 |
|
20 | | -We've used a simple example in which tokens are identified for each distinct word in the text. However, consider the following concepts that may apply to tokenization depending on the specific kind of NLP problem you're trying to solve: |
| 14 | +For example, consider this phrase from a famous US presidential speech: `"We choose to go to the moon"`. The phrase can be broken down into the following tokens, with numeric identifiers: |
21 | 15 |
|
22 | | -|**Concept**|**Description**| |
23 | | -|-|-| |
24 | | -|**Text normalization**| Before generating tokens, you may choose to *normalize* the text by removing punctuation and changing all words to lower case. For analysis that relies purely on word frequency, this approach improves overall performance. However, some semantic meaning may be lost - for example, consider the sentence `"Mr Banks has worked in many banks."`. You may want your analysis to differentiate between the person `"Mr Banks"` and the `"banks"` in which he has worked. You may also want to consider `"banks."` as a separate token to `"banks"` because the inclusion of a period provides the information that the word comes at the end of a sentence| |
25 | | -|**Stop word removal**| Stop words are words that should be excluded from the analysis. For example, `"the"`, `"a"`, or `"it"` make text easier for people to read but add little semantic meaning. By excluding these words, a text analysis solution may be better able to identify the important words.| |
26 | | -|**n-grams**| Multi-term phrases such as `"I have"` or `"he walked"`. A single word phrase is a `unigram`, a two-word phrase is a `bi-gram`, a three-word phrase is a `tri-gram`, and so on. By considering words as groups, a machine learning model can make better sense of the text.| |
27 | | -| **Stemming**| A technique in which algorithms are applied to consolidate words before counting them, so that words with the same root, like `"power"`, `"powered"`, and `"powerful"`, are interpreted as being the same token.| |
| 16 | +- 1: `We`
| 17 | +- 2: `choose`
| 18 | +- 3: `to`
| 19 | +- 4: `go`
| 20 | +- 3: `to`
| 21 | +- 5: `the`
| 22 | +- 6: `moon`
| 23 | + |
| 24 | +Notice that `"to"` (token number 3) is used twice in the corpus. The phrase `"We choose to go to the moon"` can be represented by the tokens :::no-loc text="{1,2,3,4,3,5,6}":::.
28 | 25 |
|
29 | | -Next, let's see how statistical techniques enable us to model language. |
| 26 | +With each token assigned a discrete value, we can easily count their frequency in the text and use that to determine the most commonly used terms, which might help identify the main subject of the text.
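The tokenization and counting described above can be sketched with only the Python standard library (the phrase and variable names are illustrative; real tokenizers also handle punctuation and sub-word units):

```python
from collections import Counter

phrase = "We choose to go to the moon"

# Split the text into word tokens.
tokens = phrase.split()

# Assign each distinct token a numeric identifier, in order of first appearance.
ids = {}
for token in tokens:
    ids.setdefault(token, len(ids) + 1)

# Represent the phrase as a sequence of token identifiers.
sequence = [ids[t] for t in tokens]
print(sequence)  # [1, 2, 3, 4, 3, 5, 6]

# Count token frequencies to find the most commonly used terms.
print(Counter(tokens).most_common(1))  # [('to', 2)]
```

Note that `"to"` maps to identifier 3 both times it appears, which is why the sequence contains a repeated value.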
30 | 27 |
|
| 28 | +We've used a simple example in which tokens are identified for each distinct word in the text. However, consider the following pre-processing techniques that might apply to tokenization depending on the specific text analysis problem you're trying to solve: |
| 29 | + |
| 30 | +|**Technique**|**Description**| |
| 31 | +|-|-| |
| 32 | +|**Text normalization**| Before generating tokens, you might choose to *normalize* the text by removing punctuation and changing all words to lowercase. For analysis that relies purely on word frequency, this approach improves overall performance. However, some semantic meaning could be lost - for example, consider the sentence `"Mr Banks has worked in many banks."`. You may want your analysis to differentiate between the person `"Mr Banks"` and the `"banks"` in which he's worked. You might also want to consider `"banks."` as a separate token from `"banks"` because the inclusion of a period provides the information that the word comes at the end of a sentence.|
| 33 | +|**Stop word removal**| Stop words are words that should be excluded from the analysis. For example, `"the"`, `"a"`, or `"it"` make text easier for people to read but add little semantic meaning. By excluding these words, a text analysis solution might be better able to identify the important words.|
| 34 | +|**N-gram extraction**| Finding multi-term phrases such as `"artificial intelligence"` or `"natural language processing"`. A single word phrase is a *unigram*, a two-word phrase is a *bigram*, a three-word phrase is a *trigram*, and so on. In many cases, by considering frequently appearing sequences of words as groups, a text analysis algorithm can make better sense of the text.| |
| 35 | +| **Stemming**| A technique used to consolidate words by stripping endings like `"s"`, `"ing"`, `"ed"`, and so on, before counting them, so that words with the same etymological root, like `"powering"`, `"powered"`, and `"powerful"`, are interpreted as being the same token (`"power"`).|
| 36 | +| **Lemmatization** | Another approach to reducing words to their base or dictionary form (called a *lemma*). Unlike stemming, which simply chops off word endings, lemmatization uses linguistic rules and vocabulary to ensure the resulting form is a valid word (for example, `"running"` → `"run"`, `"mice"` → `"mouse"`).|
| 37 | +| **Parts of speech (POS) tagging** | Labeling each token with its grammatical category, such as noun, verb, adjective, or adverb. This technique uses linguistic rules and often statistical models to determine the correct tag based on both the token itself and its context within the sentence. | |
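Several of these pre-processing techniques can be sketched with the Python standard library. This is a minimal illustration only: the stop word set and suffix list are tiny, made-up subsets, and production systems typically rely on libraries such as NLTK or spaCy for robust stemming, lemmatization, and POS tagging:

```python
import re
from collections import Counter

# Illustrative subset of stop words; real lists are much longer.
STOP_WORDS = {"the", "a", "it", "to", "we", "in", "of"}

def normalize(text):
    """Text normalization: lowercase the text and strip punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())

def remove_stop_words(tokens):
    """Stop word removal: drop common words with little semantic meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def bigrams(tokens):
    """N-gram extraction: pair each token with its successor (n = 2)."""
    return list(zip(tokens, tokens[1:]))

def stem(token):
    """Naive stemming: strip a few common endings. A real stemmer, such as
    the Porter algorithm, applies many ordered rewrite rules instead."""
    for suffix in ("ing", "ed", "ful", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "We choose to go to the Moon. Powering the rocket powered the dream."
tokens = remove_stop_words(normalize(text).split())
print(tokens)
print(bigrams(tokens))
print(Counter(stem(t) for t in tokens))  # "powering" and "powered" share a stem
```

After normalization and stop word removal, `"powering"` and `"powered"` both stem to `"power"`, so the frequency count treats them as one term, exactly the consolidation the table describes.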
| 38 | + |
| 39 | +::: zone-end |