Commit fc3106f

Merge pull request #52963 from GraemeMalcolm/main
Updated text analysis module
2 parents ca57c41 + 3e67355 commit fc3106f

22 files changed

Lines changed: 386 additions & 126 deletions

learn-pr/wwl-data-ai/introduction-language/1-introduction.yml

Lines changed: 5 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -3,13 +3,14 @@ uid: learn.wwl.introduction-language.introduction
33
title: Introduction
44
metadata:
55
title: Introduction
6-
description: "Introduction"
7-
ms.date: 8/15/2025
8-
author: madiepev
9-
ms.author: madiepev
6+
description: "Introduction to text analysis."
7+
ms.date: 12/16/2025
8+
author: GraemeMalcolm
9+
ms.author: gmalc
1010
ms.topic: unit
1111
ms.custom:
1212
- N/A
13+
zone_pivot_groups: video-or-text
1314
durationInMinutes: 1
1415
content: |
1516
[!include[](includes/1-introduction.md)]
Lines changed: 8 additions & 7 deletions
@@ -1,15 +1,16 @@
11
### YamlMime:ModuleUnit
22
uid: learn.wwl.introduction-language.how-it-works
3-
title: Understand how language is processed
3+
title: Tokenization
44
metadata:
5-
title: Understand how language is processed
6-
description: "Understand how language is processed"
7-
ms.date: 8/15/2025
8-
author: wwlpublish
9-
ms.author: sheryang
5+
title: Tokenization
6+
description: "Tokenization is the process of breaking down text into smaller units for processing and analysis."
7+
ms.date: 12/16/2025
8+
author: GraemeMalcolm
9+
ms.author: gmalc
1010
ms.topic: unit
1111
ms.custom:
12-
- N/A
12+
- N/A
13+
zone_pivot_groups: video-or-text
1314
durationInMinutes: 3
1415
content: |
1516
[!include[](includes/2-how-it-works.md)]
Lines changed: 7 additions & 7 deletions
@@ -1,16 +1,16 @@
11
### YamlMime:ModuleUnit
22
uid: learn.wwl.introduction-language.statistical-techniques
3-
title: Understand statistical techniques for NLP
3+
title: "Statistical text analysis"
44
metadata:
5-
title: Understand statistical techniques for NLP
6-
description: Learn about the history of NLP by first exploring the statistical techniques that were developed.
7-
author: madiepev
8-
ms.author: madiepev
9-
ms.date: 8/15/2025
10-
ms.update-cycle: 180-days
5+
title: "Statistical text analysis"
6+
description: "By using statistical techniques, you can analyze text data to infer meaning."
7+
author: GraemeMalcolm
8+
ms.author: gmalc
9+
ms.date: 12/16/2025
1110
ms.topic: unit
1211
ms.collection:
1312
- wwl-ai-copilot
13+
zone_pivot_groups: video-or-text
1414
durationInMinutes: 3
1515
content: |
1616
[!include[](includes/3-statistical-techniques.md)]
Lines changed: 7 additions & 6 deletions
@@ -1,15 +1,16 @@
11
### YamlMime:ModuleUnit
22
uid: learn.wwl.introduction-language.semantic-models
3-
title: Understand semantic language models
3+
title: "Semantic language models"
44
metadata:
5-
title: Understand semantic language models
6-
description: "Understand semantic language models"
7-
ms.date: 8/15/2025
8-
author: wwlpublish
5+
title: "Semantic language models"
6+
description: "Semantic language models embed meaning and context."
7+
ms.date: 12/16/2025
8+
author: GraemeMalcolm
99
ms.author: gmalc
1010
ms.topic: unit
1111
ms.custom:
12-
- N/A
12+
- N/A
13+
zone_pivot_groups: video-or-text
1314
durationInMinutes: 4
1415
content: |
1516
[!include[](includes/4-semantic-models.md)]

learn-pr/wwl-data-ai/introduction-language/4b-exercise.yml

Lines changed: 4 additions & 4 deletions
@@ -1,12 +1,12 @@
11
### YamlMime:ModuleUnit
22
uid: learn.wwl.introduction-language.exercise
3-
title: Exercise - Explore a natural language processing scenario
3+
title: Exercise - Explore text analytics
44
metadata:
5-
title: Exercise - Explore a natural language processing scenario
6-
description: Explore a natural language processing scenario.
5+
title: Exercise - Explore text analytics
6+
description: Analyze text using AI.
77
author: GraemeMalcolm
88
ms.author: gmalc
9-
ms.date: 08/29/2025
9+
ms.date: 12/16/2025
1010
ms.topic: unit
1111
durationInMinutes: 15
1212
content: |

learn-pr/wwl-data-ai/introduction-language/5-knowledge-check.yml

Lines changed: 13 additions & 13 deletions
@@ -4,17 +4,17 @@ title: Module assessment
44
metadata:
55
title: Module assessment
66
description: "Knowledge check"
7-
ms.date: 8/15/2025
8-
author: wwlpublish
9-
ms.author: sheryang
7+
ms.date: 12/16/2025
8+
author: GraemeMalcolm
9+
ms.author: gmalc
1010
ms.topic: unit
1111
ms.custom:
1212
- N/A
1313
durationInMinutes: 3
1414
quiz:
1515
title: "Check your knowledge"
1616
questions:
17-
- content: "What is the primary purpose of tokenization in natural language processing (NLP)?"
17+
- content: "What is the purpose of tokenization?"
1818
choices:
1919
- content: "To translate text into another language."
2020
isCorrect: false
@@ -25,25 +25,25 @@ quiz:
2525
- content: "To break down text into smaller units for analysis."
2626
isCorrect: true
2727
explanation: "Correct."
28-
- content: "Which of the following techniques is used to determine the importance of words in a document within the context of a larger collection of documents?"
28+
- content: "Which of the following techniques is used to determine the importance of words in a specific document within the context of a larger collection of documents?"
2929
choices:
3030
- content: "Naïve Bayes"
3131
isCorrect: false
3232
explanation: "Incorrect. Naïve Bayes calculates the probability of a document belonging to a particular class based on the probabilities of individual words given that class."
3333
- content: "TF-IDF (Term Frequency-Inverse Document Frequency)"
3434
isCorrect: true
35-
explanation: "Correct. TF-IDF is a technique used to determine the importance of words in a document within the context of a larger collection of documents."
35+
explanation: "Correct. TF-IDF is a technique used to determine the importance of words in a specific document within the context of a larger collection of documents."
3636
- content: " Word2Vec"
3737
isCorrect: false
3838
explanation: "Incorrect. Word2Vec is a technique for generating word embeddings, which are dense vector representations of words that capture semantic relationships between words. "
39-
- content: "Which of the following best describes the role of embeddings in natural language processing (NLP)?"
39+
- content: "Which of the following best describes the role of embedding vectors in natural language processing (NLP)?"
4040
choices:
41-
- content: "They visualize text data in two-dimensional space for easier interpretation."
41+
- content: "They duplicate tokens in multiple languages."
4242
isCorrect: false
43-
explanation: "Incorrect."
44-
- content: "They summarize large text corpora into short, meaningful sentences."
43+
explanation: "Incorrect. Embedding vectors don't duplicate tokens."
44+
- content: "They define stopwords that should be ignored."
4545
isCorrect: false
46-
explanation: "Incorrect."
47-
- content: "They convert language tokens into vectors that capture semantic relationships."
46+
explanation: "Incorrect. Embedding vectors don't define stopwords."
47+
- content: "They capture semantic token relationships in multiple dimensions."
4848
isCorrect: true
49-
explanation: "Correct."
49+
explanation: "Correct. Embedding vectors capture semantic relationships between tokens in multiple dimensions."
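
The quiz's correct answer states that embedding vectors capture semantic relationships between tokens in multiple dimensions. A common way to make that concrete is cosine similarity: semantically related tokens have vectors that point in similar directions. The sketch below uses hypothetical three-dimensional embeddings invented for illustration; real models use hundreds or thousands of dimensions, and these specific values are assumptions, not output from any actual model.

```python
import math

def cosine_similarity(a, b):
    """Measure how closely two embedding vectors point in the same direction (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, chosen so that related words lie close together.
embeddings = {
    "dog": [10.3, 4.5, 1.2],
    "cat": [10.3, 4.9, 1.8],
    "moon": [0.1, 8.3, 9.7],
}

# "dog" and "cat" point in nearly the same direction; "dog" and "moon" do not.
print(cosine_similarity(embeddings["dog"], embeddings["cat"]))
print(cosine_similarity(embeddings["dog"], embeddings["moon"]))
```

Because similarity is based on direction rather than exact position, related tokens score close to 1.0 even when their vectors differ slightly in magnitude.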
Lines changed: 7 additions & 6 deletions
@@ -1,15 +1,16 @@
11
### YamlMime:ModuleUnit
22
uid: learn.wwl.introduction-language.summary
3-
title: Summary
3+
title: "Summary"
44
metadata:
5-
title: Summary
6-
description: "Summary"
7-
ms.date: 8/15/2025
8-
author: wwlpublish
9-
ms.author: sheryang
5+
title: "Summary"
6+
description: "Summary of text analysis concepts."
7+
ms.date: 12/16/2025
8+
author: GraemeMalcolm
9+
ms.author: gmalc
1010
ms.topic: unit
1111
ms.custom:
1212
- N/A
13+
zone_pivot_groups: video-or-text
1314
durationInMinutes: 1
1415
content: |
1516
[!include[](includes/6-summary.md)]
Lines changed: 19 additions & 11 deletions
@@ -1,16 +1,24 @@
1-
In order for computer systems to interpret the subject of a text in a similar way humans do, they use **natural language processing** (NLP), an area within AI that deals with understanding written or spoken language, and responding in kind. *Text analysis* describes NLP processes that extract information from unstructured text.
1+
::: zone pivot="video"
22

3-
Some common NLP text analysis use cases are:
3+
>[!VIDEO https://learn-video.azurefd.net/vod/player?id=9e7cf5af-750f-42a3-8644-a79e993d74cc]
44
5-
:::image type="content" source="../media/natural-language-processing.png" alt-text="Diagram visualizing six common use cases for natural language processing tasks.":::
5+
::: zone-end
66

7-
1. **Speech-to-text and text-to-speech conversion**. For example, generate subtitles for videos.
8-
1. **Machine translation**. For example, translate text from English to Japanese.
9-
1. **Text classification**. For example, label an email as spam or not spam.
10-
1. **Entity extraction**. For example, extract keywords or names from a document.
11-
1. **Question answering**. For example, provide answers to questions like "What is the capital of France?"
12-
1. **Text summarization**. For example, generate a short one-paragraph summary from a multi-page document.
7+
::: zone pivot="text"
138

14-
Historically, NLP has been challenging as our language is complex and computers find it hard to *understand* text. In this module, you learn how developments in AI and specifically NLP have led to the models we use today.
9+
Within artificial intelligence (AI), text analysis is a subset of natural language processing (NLP) that enables machines to extract meaning, structure, and insights from unstructured text. Organizations use text analysis to transform customer feedback, support tickets, contracts, and social media posts into actionable intelligence.
1510

16-
Next, let's examine some general principles and common techniques used to perform text analysis and other NLP tasks.
11+
Techniques to process and analyze text evolved over many years, from simple statistical calculations based on term-frequency to vector-based language models that encapsulate semantic meaning. Some common use cases for text analysis include:
12+
13+
- **Key term extraction**: Identifying important words and phrases in text, to help determine the topics and themes it discusses.
14+
- **Entity detection**: Identifying named entities mentioned in text; for example, places, people, dates, and organizations.
15+
- **Text classification**: Categorizing text documents based on their contents. For example, filtering email as *spam* or *not spam*.
16+
- **Sentiment analysis**: A particular form of text classification that predicts the *sentiment* of text - for example, categorizing social media posts as *positive*, *neutral*, or *negative*.
17+
- **Text summarization**: Reducing the volume of text while retaining its salient points. For example, generating a short one-paragraph summary from a multi-page document.
18+
19+
Text analysis is challenging because language is complex, and computers find it hard to understand. Ultimately, all text analysis techniques are based on the requirement to extract *meaning* from natural language text.
20+
21+
::: zone-end
22+
23+
> [!NOTE]
24+
> We recognize that different people like to learn in different ways. You can choose to complete this module in video-based format or you can read the content as text and images. The text contains greater detail than the videos, so in some cases you might want to refer to it as supplemental material to the video presentation.
Lines changed: 30 additions & 21 deletions
@@ -1,30 +1,39 @@
1-
Some of the earliest techniques used to analyze text with computers involve statistical analysis of a body of text (a *corpus*) to infer some kind of semantic meaning. Put simply, if you can determine the most commonly used words in a given document, you can often get a good idea of what the document is about.
1+
::: zone pivot="video"
22

3-
## Tokenization
3+
>[!VIDEO https://learn-video.azurefd.net/vod/player?id=74a44e51-3a04-4a19-8b33-40ee0672c782]
44
5-
The first step in analyzing a corpus is to break it down into *tokens*. For the sake of simplicity, you can think of each distinct word in the training text as a token, though in reality, tokens can be generated for partial words, or combinations of words and punctuation.
5+
> [!NOTE]
6+
> See the **Text and images** tab for more details!
67
7-
For example, consider this phrase from a famous US presidential speech: `"we choose to go to the moon"`. The phrase can be broken down into the following tokens, with numeric identifiers:
8+
::: zone-end
89

9-
```
10-
1. we
11-
2. choose
12-
3. to
13-
4. go
14-
5. the
15-
6. moon
16-
```
10+
::: zone pivot="text"
1711

18-
Notice that `"to"` (token number 3) is used twice in the corpus. The phrase `"we choose to go to the moon"` can be represented by the tokens :::no-loc text="{1,2,3,4,3,5,6}":::.
12+
The first step in analyzing a body of text (referred to as a *corpus*) is to break it down into *tokens*. For the sake of simplicity, you can think of each distinct word in the text as a token. In reality, tokens can be generated for partial words or combinations of words and punctuation.
1913

20-
We've used a simple example in which tokens are identified for each distinct word in the text. However, consider the following concepts that may apply to tokenization depending on the specific kind of NLP problem you're trying to solve:
14+
For example, consider this phrase from a famous US presidential speech: `"We choose to go to the moon"`. The phrase can be broken down into the following tokens, with numeric identifiers:
2115

22-
|**Concept**|**Description**|
23-
|-|-|
24-
|**Text normalization**| Before generating tokens, you may choose to *normalize* the text by removing punctuation and changing all words to lower case. For analysis that relies purely on word frequency, this approach improves overall performance. However, some semantic meaning may be lost - for example, consider the sentence `"Mr Banks has worked in many banks."`. You may want your analysis to differentiate between the person `"Mr Banks"` and the `"banks"` in which he has worked. You may also want to consider `"banks."` as a separate token to `"banks"` because the inclusion of a period provides the information that the word comes at the end of a sentence|
25-
|**Stop word removal**| Stop words are words that should be excluded from the analysis. For example, `"the"`, `"a"`, or `"it"` make text easier for people to read but add little semantic meaning. By excluding these words, a text analysis solution may be better able to identify the important words.|
26-
|**n-grams**| Multi-term phrases such as `"I have"` or `"he walked"`. A single word phrase is a `unigram`, a two-word phrase is a `bi-gram`, a three-word phrase is a `tri-gram`, and so on. By considering words as groups, a machine learning model can make better sense of the text.|
27-
| **Stemming**| A technique in which algorithms are applied to consolidate words before counting them, so that words with the same root, like `"power"`, `"powered"`, and `"powerful"`, are interpreted as being the same token.|
16+
1. `We`
17+
2. `choose`
18+
3. `to`
19+
4. `go`
20+
3. `to`
21+
5. `the`
22+
6. `moon`
23+
24+
Notice that `"to"` (token number 3) is used twice in the corpus. The phrase `"We choose to go to the moon"` can be represented by the token sequence :::no-loc text="{1,2,3,4,3,5,6}":::.
2825

29-
Next, let's see how statistical techniques enable us to model language.
26+
With each token assigned a discrete value, we can easily count the frequency of each token in the text and use that to determine the most commonly used terms, which might help identify the main subject of the text.
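
The tokenization and frequency-counting approach described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the deliberately simple scheme of one token per whitespace-separated word; the function names are invented for this example.

```python
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple scheme)."""
    return text.lower().split()

def build_vocabulary(tokens):
    """Assign a numeric identifier to each distinct token, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
    return vocab

corpus = "We choose to go to the moon"
tokens = tokenize(corpus)
vocab = build_vocabulary(tokens)

print(vocab)                           # {'we': 1, 'choose': 2, 'to': 3, 'go': 4, 'the': 5, 'moon': 6}
print([vocab[t] for t in tokens])      # [1, 2, 3, 4, 3, 5, 6]
print(Counter(tokens).most_common(1))  # [('to', 2)]
```

Note how the phrase maps to the same token sequence shown above, and how a simple frequency count immediately surfaces the most common term.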
3027

28+
We've used a simple example in which tokens are identified for each distinct word in the text. However, consider the following pre-processing techniques that might apply to tokenization depending on the specific text analysis problem you're trying to solve:
29+
30+
|**Technique**|**Description**|
31+
|-|-|
32+
|**Text normalization**| Before generating tokens, you might choose to *normalize* the text by removing punctuation and changing all words to lower case. For analysis that relies purely on word frequency, this approach improves overall performance. However, some semantic meaning could be lost - for example, consider the sentence `"Mr Banks has worked in many banks."`. You might want your analysis to differentiate between the person `"Mr Banks"` and the `"banks"` in which he's worked. You might also want to consider `"banks."` as a separate token from `"banks"` because the inclusion of a period indicates that the word comes at the end of a sentence.|
33+
|**Stop word removal**| Stop words are words that should be excluded from the analysis. For example, `"the"`, `"a"`, or `"it"` make text easier for people to read but add little semantic meaning. By excluding these words, a text analysis solution might be better able to identify the important words.|
34+
|**N-gram extraction**| Finding multi-term phrases such as `"artificial intelligence"` or `"natural language processing"`. A single word phrase is a *unigram*, a two-word phrase is a *bigram*, a three-word phrase is a *trigram*, and so on. In many cases, by considering frequently appearing sequences of words as groups, a text analysis algorithm can make better sense of the text.|
35+
| **Stemming**| A technique used to consolidate words by stripping endings like "s", "ing", "ed", and so on, before counting them; so that words with the same etymological root, like `"powering"`, `"powered"`, and `"powerful"`, are interpreted as being the same token (`"power"`).|
36+
| **Lemmatization** | Another approach to reducing words to their base or dictionary form (called a *lemma*). Unlike stemming, which simply chops off word endings, lemmatization uses linguistic rules and vocabulary to ensure the resulting form is a valid word (for example, `"running"` → `"run"`, `"better"` → `"good"`).|
37+
| **Parts of speech (POS) tagging** | Labeling each token with its grammatical category, such as noun, verb, adjective, or adverb. This technique uses linguistic rules and often statistical models to determine the correct tag based on both the token itself and its context within the sentence. |
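
Several of the pre-processing techniques in the table above can be sketched with plain Python. This is a minimal illustration under stated assumptions: the stop-word list is a tiny invented sample, and the suffix-stripping stemmer is deliberately naive (real stemmers, such as the Porter stemmer, apply far more careful rules).

```python
import re

# Illustrative sample only; real stop-word lists contain hundreds of entries.
STOP_WORDS = {"we", "to", "the", "a", "it"}

def normalize(text):
    """Lowercase the text and strip punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that add little semantic meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def bigrams(tokens):
    """Pair each token with its successor to form two-word phrases (bigrams)."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def stem(token):
    """Naive suffix-stripping stemmer: consolidate words sharing a root."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "We choose to go to the moon."
tokens = normalize(text).split()
print(remove_stop_words(tokens))  # ['choose', 'go', 'moon']
print(bigrams(tokens))            # ['we choose', 'choose to', 'to go', ...]
print(stem("powering"))           # 'power'
```

Which of these steps to apply, and in what order, depends on the analysis: frequency-based methods benefit from normalization and stop-word removal, while tasks that rely on sentence structure (such as POS tagging) need the original token forms preserved.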
38+
39+
::: zone-end
