
Commit 3e67355

Acrolinx fixes
1 parent 6b95db5 commit 3e67355

4 files changed, with 42 additions and 42 deletions


learn-pr/wwl-data-ai/introduction-language/includes/1-introduction.md

Lines changed: 3 additions & 3 deletions
@@ -8,12 +8,12 @@
 
 Within artificial intelligence (AI), text analysis is a subset of natural language processing (NLP) that enables machines to extract meaning, structure, and insights from unstructured text. Organizations use text analysis to transform customer feedback, support tickets, contracts, and social media posts into actionable intelligence.
 
-Techniques to process and analyze text have evolved over many years, from simple statistical calculations based on term-frequency to vector-based language models that encapsulate semantic meaning. Along the way, some common use cases for text analysis have emerged; including:
+Techniques to process and analyze text evolved over many years, from simple statistical calculations based on term-frequency to vector-based language models that encapsulate semantic meaning. Some common use cases for text analysis include:
 
 - **Key term extraction**: Identifying important words and phrases in text, to help determine the topics and themes it discusses.
 - **Entity detection**: Identifying named entities mentioned in text; for example, places, people, dates, and organizations.
-- **Text classification**: Categorizing text documents based on their contents. For example, filtering email as "spam" or "not spam".
-- **Sentiment analysis**: A particular form of text classification that predicts the *sentiment* of text - for example, categorizing social media posts as "positive", "neutral", or "negative".
+- **Text classification**: Categorizing text documents based on their contents. For example, filtering email as *spam* or *not spam*.
+- **Sentiment analysis**: A particular form of text classification that predicts the *sentiment* of text - for example, categorizing social media posts as *positive*, *neutral*, or *negative*.
 - **Text summarization**: Reducing the volume of text while retaining its salient points. For example, generating a short one-paragraph summary from a multi-page document.
 
 Text analysis is challenging because language is complex, and computers find it hard to understand. Ultimately, all text analysis techniques are based on the requirement to extract *meaning* from natural language text.

learn-pr/wwl-data-ai/introduction-language/includes/2-how-it-works.md

Lines changed: 14 additions & 14 deletions
@@ -11,29 +11,29 @@
 
 The first step in analyzing a body of text (referred to as a *corpus*) is to break it down into *tokens*. For the sake of simplicity, you can think of each distinct word in the text as a token. In reality, tokens can be generated for partial words or combinations of words and punctuation.
 
-For example, consider this phrase from a famous US presidential speech: :::no-loc text=""We choose to go to the moon"":::. The phrase can be broken down into the following tokens, with numeric identifiers:
+For example, consider this phrase from a famous US presidential speech: `"We choose to go to the moon"`. The phrase can be broken down into the following tokens, with numeric identifiers:
 
-1. :::no-loc text="We":::
-2. :::no-loc text="choose":::
-3. :::no-loc text="to":::
-4. :::no-loc text="go":::
-3. :::no-loc text="to":::
-5. :::no-loc text="the":::
-6. :::no-loc text="moon":::
+1. `We`
+2. `choose`
+3. `to`
+4. `go`
+3. `to`
+5. `the`
+6. `moon`
 
-Notice that :::no-loc text=""to""::: (token number 3) is used twice in the corpus. The phrase :::no-loc text=""We choose to go to the moon""::: can be represented by the tokens.
+Notice that `"to"` (token number 3) is used twice in the corpus. The phrase `"We choose to go to the moon"` can be represented by the tokens {1,2,3,4,3,5,6}.
 
 With each token assigned a discrete value, we can easily count their frequency in the text and use that to determine the most commonly used terms; which might help identify the main subject of the text.
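
To make the tokenization and frequency-counting steps concrete, here's a minimal Python sketch of the approach described above (illustrative code, not part of the changed files):

```python
from collections import Counter

corpus = "We choose to go to the moon"

# Assign each distinct word a numeric identifier; repeated words
# (like "to") reuse the same identifier.
vocabulary = {}
token_ids = []
for word in corpus.split():
    if word not in vocabulary:
        vocabulary[word] = len(vocabulary) + 1
    token_ids.append(vocabulary[word])

print(vocabulary)  # {'We': 1, 'choose': 2, 'to': 3, 'go': 4, 'the': 5, 'moon': 6}
print(token_ids)   # [1, 2, 3, 4, 3, 5, 6]

# Counting token frequency hints at the most common terms in the corpus.
print(Counter(corpus.split()).most_common(1))  # [('to', 2)]
```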
 
 We've used a simple example in which tokens are identified for each distinct word in the text. However, consider the following pre-processing techniques that might apply to tokenization depending on the specific text analysis problem you're trying to solve:
 
 |**Technique**|**Description**|
 |-|-|
-|**Text normalization**| Before generating tokens, you might choose to *normalize* the text by removing punctuation and changing all words to lower case. For analysis that relies purely on word frequency, this approach improves overall performance. However, some semantic meaning could be lost - for example, consider the sentence :::no-loc text=""Mr Banks has worked in many banks."":::. You may want your analysis to differentiate between the person :::no-loc text=""Mr Banks""::: and the :::no-loc text=""banks""::: in which he's worked. You might also want to consider :::no-loc text=""banks.""::: as a separate token to :::no-loc text=""banks""::: because the inclusion of a period provides the information that the word comes at the end of a sentence|
-|**Stop word removal**| Stop words are words that should be excluded from the analysis. For example, :::no-loc text=""the", "a"":::, or :::no-loc text=""it""::: make text easier for people to read but add little semantic meaning. By excluding these words, a text analysis solution might be better able to identify the important words.|
-|**N-gram extraction**| Finding multi-term phrases such as :::no-loc text=""artificial intelligence""::: or :::no-loc text=""natural language processing"":::. A single word phrase is a *unigram*, a two-word phrase is a *bigram*, a three-word phrase is a *trigram*, and so on. In many cases, by considering frequently appearing sequences of words as groups, a text analysis algorithm can make better sense of the text.|
-| **Stemming**| A technique used to consolidate words by stripping endings like "s", "ing", "ed", and so on, before counting them; so that words with the same etymological root, like :::no-loc text=""powering"":::, :::no-loc text=""powered"":::, and :::no-loc text=""powerful"":::, are interpreted as being the same token (:::no-loc text=""power"":::).|
-| **Lemmatization** | Another approach to reducing words to their base or dictionary form (called a *lemma*). Unlike stemming, which simply chops off word endings, lemmatization uses linguistic rules and vocabulary to ensure the resulting form is a valid word (for example, :::no-loc text=""running"":::::::no-loc text=""run"":::, :::no-loc text=""global""::::::no-loc text=""globe"":::).|
+|**Text normalization**| Before generating tokens, you might choose to *normalize* the text by removing punctuation and changing all words to lower case. For analysis that relies purely on word frequency, this approach improves overall performance. However, some semantic meaning could be lost - for example, consider the sentence `"Mr Banks has worked in many banks."`. You may want your analysis to differentiate between the person `"Mr Banks"` and the `"banks"` in which he's worked. You might also want to consider `"banks."` as a separate token to `"banks"` because the inclusion of a period provides the information that the word comes at the end of a sentence|
+|**Stop word removal**| Stop words are words that should be excluded from the analysis. For example, `"the", "a"`, or `"it"` make text easier for people to read but add little semantic meaning. By excluding these words, a text analysis solution might be better able to identify the important words.|
+|**N-gram extraction**| Finding multi-term phrases such as `"artificial intelligence"` or `"natural language processing"`. A single word phrase is a *unigram*, a two-word phrase is a *bigram*, a three-word phrase is a *trigram*, and so on. In many cases, by considering frequently appearing sequences of words as groups, a text analysis algorithm can make better sense of the text.|
+| **Stemming**| A technique used to consolidate words by stripping endings like "s", "ing", "ed", and so on, before counting them; so that words with the same etymological root, like `"powering"`, `"powered"`, and `"powerful"`, are interpreted as being the same token (`"power"`).|
+| **Lemmatization** | Another approach to reducing words to their base or dictionary form (called a *lemma*). Unlike stemming, which simply chops off word endings, lemmatization uses linguistic rules and vocabulary to ensure the resulting form is a valid word (for example, `"running"` → `"run"`, `"global"` → `"globe"`).|
 | **Parts of speech (POS) tagging** | Labeling each token with its grammatical category, such as noun, verb, adjective, or adverb. This technique uses linguistic rules and often statistical models to determine the correct tag based on both the token itself and its context within the sentence. |
 
 ::: zone-end
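
Several of the pre-processing techniques in the table can be tried with an off-the-shelf library such as NLTK. A minimal sketch, assuming the NLTK data packages are available (exact package names vary across NLTK versions):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

# One-time downloads for the tokenizer, stop words, WordNet, and the POS tagger.
for package in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
    nltk.download(package, quiet=True)

sentence = "Mr Banks has worked in many banks."

# Text normalization: lower-case the text and strip punctuation.
normalized = "".join(ch for ch in sentence.lower() if ch.isalnum() or ch.isspace())
tokens = nltk.word_tokenize(normalized)

# Stop word removal: drop low-meaning words such as "has" and "in".
content_tokens = [t for t in tokens if t not in stopwords.words("english")]
print(content_tokens)

# N-gram extraction: two-word sequences (bigrams).
print(list(ngrams(tokens, 2))[:2])            # [('mr', 'banks'), ('banks', 'has')]

# Stemming and lemmatization.
print(PorterStemmer().stem("powering"))       # power
print(WordNetLemmatizer().lemmatize("banks")) # bank

# Parts-of-speech tagging on the original (un-normalized) sentence.
print(nltk.pos_tag(nltk.word_tokenize(sentence)))
```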

learn-pr/wwl-data-ai/introduction-language/includes/4-semantic-models.md

Lines changed: 25 additions & 25 deletions
@@ -21,84 +21,84 @@ For example, consider the following three-dimensional embeddings for some common
 
 |Word|Vector|
 |-|-|
-|:::no-loc text="dog":::|[0.8, 0.6, 0.1]|
-|:::no-loc text="puppy":::|[0.9, 0.7, 0.4]|
-|:::no-loc text="cat":::|[0.7, 0.5, 0.2]|
-|:::no-loc text="kitten":::|[0.8, 0.6, 0.5]|
-|:::no-loc text="young":::|[0.1, 0.1, 0.3]|
-|:::no-loc text="ball":::|[0.3, 0.9, 0.1]|
-|:::no-loc text="tree":::|[0.2, 0.1, 0.9]|
+|`dog`|[0.8, 0.6, 0.1]|
+|`puppy`|[0.9, 0.7, 0.4]|
+|`cat`|[0.7, 0.5, 0.2]|
+|`kitten`|[0.8, 0.6, 0.5]|
+|`young`|[0.1, 0.1, 0.3]|
+|`ball`|[0.3, 0.9, 0.1]|
+|`tree`|[0.2, 0.1, 0.9]|
 
 We can visualize these vectors in three-dimensional space as shown here:
 
-![Diagram of a 3D visualization of word vectors.](../media/word-vectors-3d.png)
+![Diagram of a 3D visualization of word vectors.](../media/vectors.png)
 
-The vectors for :::no-loc text=""dog""::: and :::no-loc text=""cat""::: are similar (both domestic animals), as are :::no-loc text=""puppy""::: and :::no-loc text=""kitten""::: (both young animals). The words :::no-loc text=""tree"":::, :::no-loc text=""young"":::, and :::no-loc text="ball"::: have distinctly different vector orientations, reflecting their different semantic meanings.
+The vectors for `"dog"` and `"cat"` are similar (both domestic animals), as are `"puppy"` and `"kitten"` (both young animals). The words `"tree"`, `"young"`, and `"ball"` have distinctly different vector orientations, reflecting their different semantic meanings.
 
 The semantic characteristic encoded in the vectors makes it possible to use vector-based operations that compare words and enable analytical comparisons.
 
 ### Finding related terms
 
 Since the orientation of vectors is determined by their dimension values, words with similar semantic meanings tend to have similar orientations. This means you can use calculations such as the *cosine similarity* between vectors to make meaningful comparisons.
 
-For example, to determine the "odd one out" between :::no-loc text=""dog"":::, :::no-loc text=""cat"":::, and :::no-loc text=""tree"":::, you can calculate the cosine similarity between pairs of vectors. The cosine similarity is calculated as:
+For example, to determine the "odd one out" between `"dog"`, `"cat"`, and `"tree"`, you can calculate the cosine similarity between pairs of vectors. The cosine similarity is calculated as:
 
 `cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)`
 
 Where `A · B` is the dot product and `||A||` is the magnitude of vector A.
 
 Calculating similarities between the three words:
 
-- **:::no-loc text="dog":::** [0.8, 0.6, 0.1] and **:::no-loc text="cat":::** [0.7, 0.5, 0.2]:
+- **`dog`** [0.8, 0.6, 0.1] and **`cat`** [0.7, 0.5, 0.2]:
   - Dot product: (0.8 × 0.7) + (0.6 × 0.5) + (0.1 × 0.2) = 0.56 + 0.30 + 0.02 = 0.88
-  - Magnitude of :::no-loc text="dog":::: √(0.8² + 0.6² + 0.1²) = √(0.64 + 0.36 + 0.01) = √1.01 ≈ 1.005
-  - Magnitude of :::no-loc text="cat":::: √(0.7² + 0.5² + 0.2²) = √(0.49 + 0.25 + 0.04) = √0.78 ≈ 0.883
+  - Magnitude of `dog`: √(0.8² + 0.6² + 0.1²) = √(0.64 + 0.36 + 0.01) = √1.01 ≈ 1.005
+  - Magnitude of `cat`: √(0.7² + 0.5² + 0.2²) = √(0.49 + 0.25 + 0.04) = √0.78 ≈ 0.883
   - Cosine similarity: 0.88 / (1.005 × 0.883) ≈ **0.992** (high similarity)
 
-- **:::no-loc text="dog":::** [0.8, 0.6, 0.1] and **:::no-loc text="tree":::** [0.2, 0.1, 0.9]:
+- **`dog`** [0.8, 0.6, 0.1] and **`tree`** [0.2, 0.1, 0.9]:
   - Dot product: (0.8 × 0.2) + (0.6 × 0.1) + (0.1 × 0.9) = 0.16 + 0.06 + 0.09 = 0.31
-  - Magnitude of :::no-loc text="tree":::: √(0.2² + 0.1² + 0.9²) = √(0.04 + 0.01 + 0.81) = √0.86 ≈ 0.927
+  - Magnitude of `tree`: √(0.2² + 0.1² + 0.9²) = √(0.04 + 0.01 + 0.81) = √0.86 ≈ 0.927
   - Cosine similarity: 0.31 / (1.005 × 0.927) ≈ **0.333** (low similarity)
 
-- **:::no-loc text="cat":::** [0.7, 0.5, 0.2] and **:::no-loc text="tree":::** [0.2, 0.1, 0.9]:
+- **`cat`** [0.7, 0.5, 0.2] and **`tree`** [0.2, 0.1, 0.9]:
   - Dot product: (0.7 × 0.2) + (0.5 × 0.1) + (0.2 × 0.9) = 0.14 + 0.05 + 0.18 = 0.37
   - Cosine similarity: 0.37 / (0.883 × 0.927) ≈ **0.452** (low similarity)
 
 ![Diagram of cosine similarity visualization showing dog, cat, and tree vectors.](../media/cosine-similarity.png)
 
-The results show that :::no-loc text=""dog""::: and :::no-loc text=""cat""::: are highly similar (0.992), while :::no-loc text=""tree""::: has lower similarity to both :::no-loc text=""dog""::: (0.333) and :::no-loc text=""cat""::: (0.452). Therefore, **:::no-loc text="tree":::** is clearly the odd one out.
+The results show that `"dog"` and `"cat"` are highly similar (0.992), while `"tree"` has lower similarity to both `"dog"` (0.333) and `"cat"` (0.452). Therefore, **`tree`** is clearly the odd one out.
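
The worked comparison above is straightforward to reproduce in code. A minimal Python sketch using the example embeddings:

```python
import math

# The example three-dimensional embeddings from the table above.
embeddings = {
    "dog":  [0.8, 0.6, 0.1],
    "cat":  [0.7, 0.5, 0.2],
    "tree": [0.2, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)"""
    dot_product = sum(x * y for x, y in zip(a, b))
    return dot_product / (math.hypot(*a) * math.hypot(*b))

for w1, w2 in [("dog", "cat"), ("dog", "tree"), ("cat", "tree")]:
    print(w1, w2, round(cosine_similarity(embeddings[w1], embeddings[w2]), 2))
# dog cat 0.99
# dog tree 0.33
# cat tree 0.45
```

The small differences from the worked values above come from rounding the magnitudes at different points in the manual calculation.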
 
 ### Vector translation through addition and subtraction
 
 You can add or subtract vectors to produce new vector-based results; which can then be used to find tokens with matching vectors. This technique enables intuitive arithmetic-based logic to determine appropriate terms based on linguistic relationships.
 
 For example, using the vectors from earlier:
 
-- **:::no-loc text="dog":::** + **:::no-loc text="young":::** = [0.8, 0.6, 0.1] + [0.1, 0.1, 0.3] = [0.9, 0.7, 0.4] = **:::no-loc text="puppy":::**
-- **:::no-loc text="cat":::** + **:::no-loc text="young":::** = [0.7, 0.5, 0.2] + [0.1, 0.1, 0.3] = [0.8, 0.6, 0.5] = **:::no-loc text="kitten":::**
+- **`dog`** + **`young`** = [0.8, 0.6, 0.1] + [0.1, 0.1, 0.3] = [0.9, 0.7, 0.4] = **`puppy`**
+- **`cat`** + **`young`** = [0.7, 0.5, 0.2] + [0.1, 0.1, 0.3] = [0.8, 0.6, 0.5] = **`kitten`**
 
 ![Diagram of vector addition showing dog + young = puppy and cat + young = kitten.](../media/vector-addition.png)
 
-These operations work because the vector for :::no-loc text=""young""::: encodes the semantic transformation from an adult animal to its young counterpart.
+These operations work because the vector for `"young"` encodes the semantic transformation from an adult animal to its young counterpart.
 
 > [!NOTE]
 > In practice, vector arithmetic rarely produces exact matches; instead, you would search for the word whose vector is *closest* (most similar) to the result.
 
 The arithmetic works in reverse as well:
 
-- **:::no-loc text="puppy":::** - **:::no-loc text="young":::** = [0.9, 0.7, 0.4] - [0.1, 0.1, 0.3] = [0.8, 0.6, 0.1] = **:::no-loc text="dog":::**
-- **:::no-loc text="kitten":::** - **:::no-loc text="young":::** = [0.8, 0.6, 0.5] - [0.1, 0.1, 0.3] = [0.7, 0.5, 0.2] = **:::no-loc text="cat":::**
+- **`puppy`** - **`young`** = [0.9, 0.7, 0.4] - [0.1, 0.1, 0.3] = [0.8, 0.6, 0.1] = **`dog`**
+- **`kitten`** - **`young`** = [0.8, 0.6, 0.5] - [0.1, 0.1, 0.3] = [0.7, 0.5, 0.2] = **`cat`**
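
A short Python sketch of this add-then-look-up pattern, including the nearest-vector search that the note describes (the `find_closest` helper is illustrative, not from the module):

```python
import math

embeddings = {
    "dog":    [0.8, 0.6, 0.1], "puppy":  [0.9, 0.7, 0.4],
    "cat":    [0.7, 0.5, 0.2], "kitten": [0.8, 0.6, 0.5],
    "young":  [0.1, 0.1, 0.3], "ball":   [0.3, 0.9, 0.1],
    "tree":   [0.2, 0.1, 0.9],
}

def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    return dot_product / (math.hypot(*a) * math.hypot(*b))

def find_closest(vector):
    """Return the vocabulary word whose embedding is most similar to 'vector'."""
    return max(embeddings, key=lambda w: cosine_similarity(embeddings[w], vector))

# dog + young -> the word nearest to [0.9, 0.7, 0.4], which is 'puppy'.
dog_plus_young = [d + y for d, y in zip(embeddings["dog"], embeddings["young"])]
print(find_closest(dog_plus_young))    # puppy

# puppy - young -> the word nearest to [0.8, 0.6, 0.1], which is 'dog'.
puppy_minus_young = [p - y for p, y in zip(embeddings["puppy"], embeddings["young"])]
print(find_closest(puppy_minus_young)) # dog
```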

 ### Analogical reasoning
 
-Vector arithmetic can also answer analogy questions like "*:::no-loc text="puppy":::* is to *:::no-loc text="dog":::* as *:::no-loc text="kitten":::* is to *?*"
+Vector arithmetic can also answer analogy questions like "*`puppy`* is to *`dog`* as *`kitten`* is to *?*"
 
-To solve this, calculate: **:::no-loc text="kitten":::** - **:::no-loc text="puppy":::** + **:::no-loc text="dog":::**
+To solve this, calculate: **`kitten`** - **`puppy`** + **`dog`**
 
 - [0.8, 0.6, 0.5] - [0.9, 0.7, 0.4] + [0.8, 0.6, 0.1]
 - = [-0.1, -0.1, 0.1] + [0.8, 0.6, 0.1]
 - = [0.7, 0.5, 0.2]
-- = **:::no-loc text="cat":::**
+- = **`cat`**
 
 ![Diagram of vector arithmetic showing kitten - puppy + dog = cat.](../media/vector-analogy.png)
 
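The analogy calculation follows the same pattern in code. A self-contained sketch (again with an illustrative nearest-vector helper; excluding the question words from the search is a common design choice in analogy lookups, not something the module prescribes):

```python
import math

# Subset of the example embeddings needed for the analogy.
embeddings = {
    "dog":    [0.8, 0.6, 0.1], "puppy":  [0.9, 0.7, 0.4],
    "cat":    [0.7, 0.5, 0.2], "kitten": [0.8, 0.6, 0.5],
    "tree":   [0.2, 0.1, 0.9],
}

def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    return dot_product / (math.hypot(*a) * math.hypot(*b))

# "puppy is to dog as kitten is to ?"  ->  kitten - puppy + dog
target = [k - p + d for k, p, d in zip(embeddings["kitten"],
                                       embeddings["puppy"],
                                       embeddings["dog"])]
print(target)  # [0.7, 0.5, 0.2], up to floating-point rounding

# Exclude the words in the question, then pick the closest remaining vector.
candidates = {w: v for w, v in embeddings.items()
              if w not in ("kitten", "puppy", "dog")}
answer = max(candidates, key=lambda w: cosine_similarity(candidates[w], target))
print(answer)  # cat
```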

learn-pr/wwl-data-ai/introduction-language/media/word-vectors-3d.png renamed to learn-pr/wwl-data-ai/introduction-language/media/vectors.png

File renamed without changes.
