Commit 8d9620f

Fixes
1 parent 3e67355 commit 8d9620f

2 files changed

Lines changed: 42 additions & 42 deletions

learn-pr/wwl-data-ai/introduction-language/includes/2-how-it-works.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ We've used a simple example in which tokens are identified for each distinct wor
 |**Technique**|**Description**|
 |-|-|
 |**Text normalization**| Before generating tokens, you might choose to *normalize* the text by removing punctuation and changing all words to lower case. For analysis that relies purely on word frequency, this approach improves overall performance. However, some semantic meaning could be lost - for example, consider the sentence `"Mr Banks has worked in many banks."`. You may want your analysis to differentiate between the person `"Mr Banks"` and the `"banks"` in which he's worked. You might also want to consider `"banks."` as a separate token from `"banks"` because the inclusion of a period provides the information that the word comes at the end of a sentence.|
-|**Stop word removal**| Stop words are words that should be excluded from the analysis. For example, `"the", "a"`, or `"it"` make text easier for people to read but add little semantic meaning. By excluding these words, a text analysis solution might be better able to identify the important words.|
+|**Stop word removal**| Stop words are words that should be excluded from the analysis. For example, `"the"`, `"a"`, or `"it"` make text easier for people to read but add little semantic meaning. By excluding these words, a text analysis solution might be better able to identify the important words.|
 |**N-gram extraction**| Finding multi-term phrases such as `"artificial intelligence"` or `"natural language processing"`. A single-word phrase is a *unigram*, a two-word phrase is a *bigram*, a three-word phrase is a *trigram*, and so on. In many cases, by considering frequently appearing sequences of words as groups, a text analysis algorithm can make better sense of the text.|
 | **Stemming**| A technique used to consolidate words by stripping endings like "s", "ing", "ed", and so on, before counting them, so that words with the same etymological root, like `"powering"`, `"powered"`, and `"powerful"`, are interpreted as being the same token (`"power"`).|
 | **Lemmatization** | Another approach to reducing words to their base or dictionary form (called a *lemma*). Unlike stemming, which simply chops off word endings, lemmatization uses linguistic rules and vocabulary to ensure the resulting form is a valid word (for example, `"running"` → `"run"`, `"global"` → `"globe"`).|
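
To make the techniques in this table concrete, here's a minimal Python sketch of normalization, tokenization, stop word removal, and stemming. The tiny stop-word list and suffix-stripping rule are invented stand-ins for the full linguistic resources that libraries such as NLTK or spaCy provide:

```python
import re

# Illustrative stand-ins; real solutions use library-provided resources.
STOP_WORDS = {"the", "a", "an", "it", "in", "to", "and", "of"}
SUFFIXES = ("ing", "ed", "s")

def preprocess(text: str) -> list[str]:
    # Text normalization: lowercase the text and strip punctuation.
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Tokenization: split on whitespace; drop stop words as we go.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Crude stemming: strip common endings (a real stemmer, like Porter, is subtler).
    stemmed = []
    for token in tokens:
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        stemmed.append(token)
    return stemmed

print(preprocess("Mr Banks has worked in many banks."))
# ['mr', 'bank', 'has', 'work', 'many', 'bank']
```

Note how normalization collapses `"Mr Banks"` and `"banks."` into the same token as `"banks"` - exactly the trade-off the table describes.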

learn-pr/wwl-data-ai/introduction-language/includes/3-statistical-techniques.md

Lines changed: 41 additions & 41 deletions
@@ -17,20 +17,20 @@ Perhaps the most obvious way to ascertain the topics discussed in a document is
 
 For example, consider the following text:
 
-> *:::no-loc text="AI in modern business delivers transformative benefits by enhancing efficiency, decision-making, and customer experiences. Businesses can leverage AI to automate repetitive tasks, freeing employees to focus on strategic work, while predictive analytics and machine learning models enable data-driven decisions that improve accuracy and speed. AI-powered tools like Copilot streamline workflows across marketing, finance, and operations, reducing costs and boosting productivity. Additionally, intelligent applications personalize customer interactions, driving engagement and loyalty. By embedding AI into core processes, businesses benefit from the ability to innovate faster, adapt to market changes, and maintain a competitive edge in an increasingly digital economy.":::*
+> *`AI in modern business delivers transformative benefits by enhancing efficiency, decision-making, and customer experiences. Businesses can leverage AI to automate repetitive tasks, freeing employees to focus on strategic work, while predictive analytics and machine learning models enable data-driven decisions that improve accuracy and speed. AI-powered tools like Copilot streamline workflows across marketing, finance, and operations, reducing costs and boosting productivity. Additionally, intelligent applications personalize customer interactions, driving engagement and loyalty. By embedding AI into core processes, businesses benefit from the ability to innovate faster, adapt to market changes, and maintain a competitive edge in an increasingly digital economy.`*
 
 After tokenizing, normalizing, and applying lemmatization to the text, the frequency of each term can be counted and tabulated, producing the following partial results:
 
 |Term|Frequency|
 |-|-|
-|:::no-loc text="ai":::|4|
-|:::no-loc text="business":::|3|
-|:::no-loc text="benefit":::|2|
-|:::no-loc text="customer":::|2|
-|:::no-loc text="decision":::|2|
-|:::no-loc text="market":::|2|
-|:::no-loc text="ability":::|1|
-|:::no-loc text="accuracy":::|1|
+|`ai`|4|
+|`business`|3|
+|`benefit`|2|
+|`customer`|2|
+|`decision`|2|
+|`market`|2|
+|`ability`|1|
+|`accuracy`|1|
 |...|...|
 
 From these results, the most frequently occurring terms indicate that the text discusses AI and its business benefits.
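
The frequency analysis shown here is just a token tally. Below is a minimal sketch using Python's `collections.Counter`, assuming the text has already been tokenized, normalized, and lemmatized; the token list is abbreviated to the most frequent terms:

```python
from collections import Counter

# Abbreviated, already-lemmatized tokens from the sample text
# (for example, "businesses" has been reduced to "business").
tokens = ["ai", "business", "benefit", "ai", "customer", "decision",
          "business", "ai", "market", "ai", "business", "benefit",
          "customer", "decision", "market"]

frequencies = Counter(tokens)
for term, count in frequencies.most_common(4):
    print(f"{term}: {count}")
# ai: 4
# business: 3
# benefit: 2
# customer: 2
```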
@@ -43,65 +43,65 @@ For example, consider the following two text samples:
 
 > **Sample A:**
 >
-> *:::no-loc text="Microsoft Copilot Studio enables declarative AI agent creation using natural language, prompts, and templates. With this declarative approach, an AI agent is configured rather than programmed: makers define intents, actions, and data connections, then publish the agent to channels. Microsoft Copilot Studio simplifies agent orchestration, governance, and lifecycles so an AI agent can be iterated quickly. Using Microsoft Copilot Studio helps modern businesses deploy Microsoft AI agent solutions fast.":::*
+> *`Microsoft Copilot Studio enables declarative AI agent creation using natural language, prompts, and templates. With this declarative approach, an AI agent is configured rather than programmed: makers define intents, actions, and data connections, then publish the agent to channels. Microsoft Copilot Studio simplifies agent orchestration, governance, and lifecycles so an AI agent can be iterated quickly. Using Microsoft Copilot Studio helps modern businesses deploy Microsoft AI agent solutions fast.`*
 
 > **Sample B:**
 >
-> *:::no-loc text="Microsoft Foundry enables code‑based AI agent development with SDKs and APIs. Developers write code to implement agent conversations, tool calling, state management, and custom pipelines. In Microsoft Foundry, engineers can use Python or Microsoft C#, integrate Microsoft AI services, and manage CI/CD to deploy the AI agent. This code-first development model supports extensibility and performance while building Microsoft Foundry AI agent applications.":::*
+> *`Microsoft Foundry enables code‑based AI agent development with SDKs and APIs. Developers write code to implement agent conversations, tool calling, state management, and custom pipelines. In Microsoft Foundry, engineers can use Python or Microsoft C#, integrate Microsoft AI services, and manage CI/CD to deploy the AI agent. This code-first development model supports extensibility and performance while building Microsoft Foundry AI agent applications.`*
 
 The top three most frequent terms in these samples are shown in the following tables:
 
 **Sample A**:
 
 |Term | Frequency |
 |-|-|
-|:::no-loc text="agent":::| 6|
-|:::no-loc text="ai":::| 4|
-|:::no-loc text="microsoft":::|4|
+|`agent`| 6|
+|`ai`| 4|
+|`microsoft`|4|
 
 **Sample B**:
 
 |Term | Frequency |
 |-|-|
-|:::no-loc text="microsoft":::|5|
-|:::no-loc text="agent":::| 4|
-|:::no-loc text="ai":::| 4|
+|`microsoft`|5|
+|`agent`| 4|
+|`ai`| 4|
 
-As you can see from the results, the most common words in both samples are the same (:::no-loc text=""agent"":::, :::no-loc text=""Microsoft"":::, and :::no-loc text=""AI"":::). This tells us that both documents cover a similar overall theme, but doesn't help us discriminate between the individual documents. Examining the counts of less frequently used terms might help, but you can easily imagine an analysis of a corpus based on Microsoft's AI documentation, which would result in a large number of terms that are common across all documents; making it hard to determine the specific topics covered in each document.
+As you can see from the results, the most common words in both samples are the same (`"agent"`, `"Microsoft"`, and `"AI"`). This tells us that both documents cover a similar overall theme, but doesn't help us discriminate between the individual documents. Examining the counts of less frequently used terms might help, but you can easily imagine an analysis of a corpus based on Microsoft's AI documentation, which would result in a large number of terms that are common across all documents, making it hard to determine the specific topics covered in each document.
 
 To address this problem, *Term Frequency - Inverse Document Frequency* (TF-IDF) is a technique that calculates scores based on how often a word or term appears in one document compared to its more general frequency across the entire collection of documents. Using this technique, a high degree of relevance is assumed for words that appear frequently in a particular document, but relatively infrequently across a wide range of other documents. To calculate TF-IDF for terms in an individual document, you can use the following three-step process:
 
-1. **Calculate Term Frequency (TF)**: This is simply how many times a word appears in a document. For example, if the word :::no-loc text=""agent""::: appears 6 times in a document, then `tf(agent) = 6`.
+1. **Calculate Term Frequency (TF)**: This is simply how many times a word appears in a document. For example, if the word `"agent"` appears 6 times in a document, then `tf(agent) = 6`.
 
 2. **Calculate Inverse Document Frequency (IDF)**: This checks how common or rare a word is across all documents. If a word appears in every document, it’s not special. The formula used to calculate IDF is `idf(t) = log(N / df(t))` (where `N` is the total number of documents and `df(t)` is the number of documents that contain the word `t`).
 
 3. **Combine them to calculate TF-IDF**: Multiply TF and IDF to get the score: `tfidf(t, d) = tf(t, d) * log(N / df(t))`
 
-A high TF-IDF score indicates that a word appears often in one document but rarely in others. A low score indicates that word is common in many documents. In two samples about AI agents, because :::no-loc text=""AI"":::, :::no-loc text=""Microsoft"":::, and :::no-loc text=""agent""::: appear in both samples (`N = 2, df(t) = 2`), their IDF is `log(2/2) = 0`, so they carry no discriminative weight in TF‑IDF. The top three TF-IDF results for the samples are:
+A high TF-IDF score indicates that a word appears often in one document but rarely in others. A low score indicates that a word is common in many documents. In two samples about AI agents, because `"AI"`, `"Microsoft"`, and `"agent"` appear in both samples (`N = 2, df(t) = 2`), their IDF is `log(2/2) = 0`, so they carry no discriminative weight in TF‑IDF. The top three TF-IDF results for the samples are:
 
 **Sample A:**
 
 |Term|TF-IDF|
 |-|-|
-|:::no-loc text="copilot":::|2.0794|
-|:::no-loc text="studio":::|2.0794|
-|:::no-loc text="declarative":::|1.3863|
+|`copilot`|2.0794|
+|`studio`|2.0794|
+|`declarative`|1.3863|
 
 **Sample B:**
 
 |Term|TF-IDF|
 |-|-|
-|:::no-loc text="code":::|2.0794|
-|:::no-loc text="develop":::|2.0794|
-|:::no-loc text="foundry":::|2.0794|
+|`code`|2.0794|
+|`develop`|2.0794|
+|`foundry`|2.0794|
 
 From these results, it's clearer that sample A is about declarative agent creation with Copilot Studio, while sample B is about code-based agent development with Microsoft Foundry.
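
The three-step process above maps directly to a few lines of Python. This sketch uses raw counts and the natural logarithm, as in the formulas; the abbreviated token lists are assumptions that reproduce the term counts from the two samples, and production code would typically use something like scikit-learn's `TfidfVectorizer`, which adds smoothing and normalization:

```python
import math

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    tf = doc.count(term)                             # step 1: term frequency
    df = sum(1 for d in corpus if term in d)         # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0  # step 2: idf(t) = log(N / df(t))
    return tf * idf                                  # step 3: tf * idf

# Abbreviated token lists that reproduce the term counts in the samples.
sample_a = (["copilot"] * 3 + ["studio"] * 3 + ["declarative"] * 2
            + ["agent"] * 6 + ["ai"] * 4 + ["microsoft"] * 4)
sample_b = (["code"] * 3 + ["develop"] * 3 + ["foundry"] * 3
            + ["agent"] * 4 + ["ai"] * 4 + ["microsoft"] * 5)
corpus = [sample_a, sample_b]

print(round(tf_idf("copilot", sample_a, corpus), 4))      # 2.0794
print(round(tf_idf("declarative", sample_a, corpus), 4))  # 1.3863
print(round(tf_idf("agent", sample_a, corpus), 4))        # 0.0 - appears in both documents
```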

 ## "Bag-of-words" machine learning techniques
 
 *Bag-of-words* is the name given to a feature extraction technique that represents text tokens as a vector of word frequencies or occurrences, ignoring grammar and word order. This representation becomes the input for machine learning algorithms like Naive Bayes, a probabilistic classifier that applies Bayes’ theorem to predict the probable class of a document based on word frequency.
 
-For example, you might use this technique to train a machine learning model that performs email spam filtering. The words :::no-loc text=""miracle cure"":::, :::no-loc text=""lose weight fast"":::, and :::no-loc text=""anti-aging`"::: may appear more frequently in spam emails about dubious health products than your regular emails, and a trained model might flag messages containing these words as potential spam.
+For example, you might use this technique to train a machine learning model that performs email spam filtering. The words `"miracle cure"`, `"lose weight fast"`, and `"anti-aging"` may appear more frequently in spam emails about dubious health products than your regular emails, and a trained model might flag messages containing these words as potential spam.
 
 You can implement *sentiment analysis* by using the same method to classify text by emotional tone. The bag-of-words representation provides the features, and the model uses those features to estimate probabilities and assign sentiment labels like "positive" or "negative".
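
Here's a compact sketch of the bag-of-words plus Naive Bayes pipeline this section describes, assuming scikit-learn is available; the four training messages and their labels are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "miracle cure lose weight fast",          # spam
    "anti-aging miracle cure special offer",  # spam
    "meeting moved to three pm",              # not spam
    "quarterly report attached for review",   # not spam
]
train_labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words: each message becomes a vector of word counts,
# ignoring grammar and word order.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_texts)

# Naive Bayes applies Bayes' theorem to those counts to estimate
# the probability of each class.
model = MultinomialNB()
model.fit(X, train_labels)

test = vectorizer.transform(["lose weight fast with this miracle cure"])
print(model.predict(test))  # ['spam']
```

Swapping the labels for "positive" and "negative" gives the sentiment-analysis variant mentioned above.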

@@ -119,32 +119,32 @@ The TextRank algorithm applies the same principle as Google's PageRank algorithm
 
 For example, consider the following document about cloud computing:
 
-> *:::no-loc text="Cloud computing provides on-demand access to computing resources. Computing resources include servers, storage, and networking. Azure is Microsoft's cloud computing platform. Organizations use cloud platforms to reduce infrastructure costs. Cloud computing enables scalability and flexibility.":::*
+> *`Cloud computing provides on-demand access to computing resources. Computing resources include servers, storage, and networking. Azure is Microsoft's cloud computing platform. Organizations use cloud platforms to reduce infrastructure costs. Cloud computing enables scalability and flexibility.`*
 
 To generate a summary of this document, the TextRank process begins by splitting this document into sentences:
 
-1. *:::no-loc text="Cloud computing provides on-demand access to computing resources.":::*
-1. *:::no-loc text="Computing resources include servers, storage, and networking.":::*
-1. *:::no-loc text="Azure is Microsoft's cloud computing platform.":::*
-1. *:::no-loc text="Organizations use cloud platforms to reduce infrastructure costs.":::*
-1. *:::no-loc text="Cloud computing enables scalability and flexibility.":::*
+1. *`Cloud computing provides on-demand access to computing resources.`*
+1. *`Computing resources include servers, storage, and networking.`*
+1. *`Azure is Microsoft's cloud computing platform.`*
+1. *`Organizations use cloud platforms to reduce infrastructure costs.`*
+1. *`Cloud computing enables scalability and flexibility.`*
 
 Next, edges are created between sentences with weights based on similarity (word overlap). For this example, the edge weights might be:
 
-- Sentence 1 <-> Sentence 2: 0.5 (shares :::no-loc text=""computing resources"":::)
-- Sentence 1 <-> Sentence 3: 0.6 (shares :::no-loc text=""cloud computing"":::)
-- Sentence 1 <-> Sentence 4: 0.2 (shares :::no-loc text=""cloud"":::)
-- Sentence 1 <-> Sentence 5: 0.7 (shares :::no-loc text=""cloud computing"":::)
+- Sentence 1 <-> Sentence 2: 0.5 (shares `"computing resources"`)
+- Sentence 1 <-> Sentence 3: 0.6 (shares `"cloud computing"`)
+- Sentence 1 <-> Sentence 4: 0.2 (shares `"cloud"`)
+- Sentence 1 <-> Sentence 5: 0.7 (shares `"cloud computing"`)
 - Sentence 2 <-> Sentence 3: 0.2 (limited overlap)
 - Sentence 2 <-> Sentence 4: 0.1 (limited overlap)
-- Sentence 2 <-> Sentence 5: 0.1 (shares :::no-loc text=""computing"":::)
-- Sentence 3 <-> Sentence 4: 0.5 (shares :::no-loc text=""cloud platforms"":::)
-- Sentence 3 <-> Sentence 5: 0.4 (shares :::no-loc text=""cloud computing"":::)
+- Sentence 2 <-> Sentence 5: 0.1 (shares `"computing"`)
+- Sentence 3 <-> Sentence 4: 0.5 (shares `"cloud platforms"`)
+- Sentence 3 <-> Sentence 5: 0.4 (shares `"cloud computing"`)
 - Sentence 4 <-> Sentence 5: 0.3 (limited overlap)
 
 ![Diagram of connected sentence nodes.](../media/text-rank.png)
 
-After calculating TextRank scores iteratively using these weights, sentences 1, 3, and 5 might receive the highest scores because they connect well to other sentences through shared terminology and concepts. These sentences would be selected to form a concise summary: *:::no-loc text=""Cloud computing provides on-demand access to computing resources. Azure is Microsoft's cloud computing platform. Cloud computing enables scalability and flexibility."":::*
+After calculating TextRank scores iteratively using these weights, sentences 1, 3, and 5 might receive the highest scores because they connect well to other sentences through shared terminology and concepts. These sentences would be selected to form a concise summary: *`"Cloud computing provides on-demand access to computing resources. Azure is Microsoft's cloud computing platform. Cloud computing enables scalability and flexibility."`*
 
 > [!NOTE]
 > Generating a document summary by selecting the most relevant sentences is a form of *extractive* summarization. In this approach, no new text is generated - the summary consists of a subset of the original text. More recent developments in semantic modeling also enable *abstractive* summarization, in which new language that summarizes the key themes of the source document is generated.
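
As a sketch of the iterative scoring step, the following Python runs a PageRank-style update over the edge weights listed above. The damping factor of 0.85 and the fixed iteration count are conventional assumptions, not values specified in the example:

```python
# Illustrative edge weights from the example (undirected graph).
weights = {
    (1, 2): 0.5, (1, 3): 0.6, (1, 4): 0.2, (1, 5): 0.7,
    (2, 3): 0.2, (2, 4): 0.1, (2, 5): 0.1,
    (3, 4): 0.5, (3, 5): 0.4,
    (4, 5): 0.3,
}
sentences = [1, 2, 3, 4, 5]

def edge(i: int, j: int) -> float:
    return weights.get((min(i, j), max(i, j)), 0.0)

# PageRank-style iteration: each sentence passes its score to its neighbors
# in proportion to edge weight, with damping factor d = 0.85.
d = 0.85
scores = {s: 1.0 for s in sentences}
for _ in range(30):
    scores = {
        i: (1 - d) + d * sum(
            scores[j] * edge(j, i) / sum(edge(j, k) for k in sentences if k != j)
            for j in sentences if j != i
        )
        for i in sentences
    }

# The highest-scoring sentences form the extractive summary.
top = sorted(sentences, key=scores.get, reverse=True)[:3]
print(sorted(top))  # [1, 3, 5] with these weights
```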
