## learn-pr/wwl-data-ai/introduction-language/includes/1-introduction.md
Within artificial intelligence (AI), text analysis is a subset of natural language processing (NLP) that enables machines to extract meaning, structure, and insights from unstructured text. Organizations use text analysis to transform customer feedback, support tickets, contracts, and social media posts into actionable intelligence.
Techniques to process and analyze text have evolved over many years, from simple statistical calculations based on term frequency to vector-based language models that encapsulate semantic meaning. Some common use cases for text analysis include:
- **Key term extraction**: Identifying important words and phrases in text to help determine the topics and themes it discusses.
- **Entity detection**: Identifying named entities mentioned in text; for example, places, people, dates, and organizations.
- **Text classification**: Categorizing text documents based on their contents. For example, filtering email as *spam* or *not spam*.
- **Sentiment analysis**: A particular form of text classification that predicts the *sentiment* of text; for example, categorizing social media posts as *positive*, *neutral*, or *negative*.
- **Text summarization**: Reducing the volume of text while retaining its salient points. For example, generating a short one-paragraph summary from a multi-page document.
Text analysis is challenging because human language is complex and hard for computers to interpret. Ultimately, all text analysis techniques are based on the requirement to extract *meaning* from natural language text.
## learn-pr/wwl-data-ai/introduction-language/includes/2-how-it-works.md
The first step in analyzing a body of text (referred to as a *corpus*) is to break it down into *tokens*. For the sake of simplicity, you can think of each distinct word in the text as a token. In reality, tokens can be generated for partial words or combinations of words and punctuation.
For example, consider this phrase from a famous US presidential speech: `"We choose to go to the moon"`. The phrase can be broken down into the following tokens, with numeric identifiers:
1. `We`
2. `choose`
3. `to`
4. `go`
3. `to`
5. `the`
6. `moon`
Notice that `"to"` (token number 3) is used twice in the corpus. The phrase `"We choose to go to the moon"` can be represented by the tokens.
25
25
26
26
With each token assigned a discrete identifier, we can easily count how often each token occurs in the text and use the counts to determine the most commonly used terms, which might help identify the main subject of the text.
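The tokenization and counting steps described above can be sketched in plain Python. The phrase and token identifiers come from the example; the code is an illustrative sketch, not a production tokenizer:

```python
from collections import Counter

corpus = "We choose to go to the moon"

# Tokenize: treat each whitespace-separated word as a token.
tokens = corpus.split()

# Assign each distinct token a numeric identifier in order of first appearance.
token_ids = {}
for token in tokens:
    if token not in token_ids:
        token_ids[token] = len(token_ids) + 1

# Represent the phrase as a sequence of token identifiers.
# Note that "to" (token 3) appears twice.
encoded = [token_ids[t] for t in tokens]
print(encoded)  # [1, 2, 3, 4, 3, 5, 6]

# Count token frequencies to find the most commonly used term.
frequencies = Counter(tokens)
print(frequencies.most_common(1))  # [('to', 2)]
```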
We've used a simple example in which tokens are identified for each distinct word in the text. However, consider the following pre-processing techniques that might apply to tokenization depending on the specific text analysis problem you're trying to solve:
|**Technique**|**Description**|
|-|-|
|**Text normalization**| Before generating tokens, you might choose to *normalize* the text by removing punctuation and changing all words to lower case. For analysis that relies purely on word frequency, this approach improves overall performance. However, some semantic meaning could be lost; for example, consider the sentence `"Mr Banks has worked in many banks."`. You might want your analysis to differentiate between the person `"Mr Banks"` and the `"banks"` in which he's worked. You might also want to consider `"banks."` as a separate token from `"banks"`, because the inclusion of a period indicates that the word comes at the end of a sentence.|
|**Stop word removal**| Stop words are words that should be excluded from the analysis. For example, `"the"`, `"a"`, or `"it"` make text easier for people to read but add little semantic meaning. By excluding these words, a text analysis solution might be better able to identify the important words.|
|**N-gram extraction**| Finding multi-term phrases such as `"artificial intelligence"` or `"natural language processing"`. A single-word phrase is a *unigram*, a two-word phrase is a *bigram*, a three-word phrase is a *trigram*, and so on. In many cases, by considering frequently appearing sequences of words as groups, a text analysis algorithm can make better sense of the text.|
|**Stemming**| A technique used to consolidate words by stripping endings like "s", "ing", and "ed" before counting them, so that words with the same etymological root, like `"powering"`, `"powered"`, and `"powerful"`, are interpreted as the same token (`"power"`).|
|**Lemmatization**| Another approach to reducing words to their base or dictionary form (called a *lemma*). Unlike stemming, which simply chops off word endings, lemmatization uses linguistic rules and vocabulary to ensure the resulting form is a valid word (for example, `"running"` → `"run"`, `"global"` → `"globe"`).|
|**Parts of speech (POS) tagging**| Labeling each token with its grammatical category, such as noun, verb, adjective, or adverb. This technique uses linguistic rules and often statistical models to determine the correct tag based on both the token itself and its context within the sentence. |
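Several of these pre-processing techniques can be sketched with a few lines of plain Python, using the `"Mr Banks"` sentence from the table. The stop word list and the suffix-stripping `stem` function are deliberately naive illustrations (real stemmers, such as the Porter algorithm, use far more sophisticated rules):

```python
import re

sentence = "Mr Banks has worked in many banks."

# Text normalization: lower-case the text and strip punctuation.
# Note how this loses the distinction between "Banks" (the person) and "banks".
normalized = re.sub(r"[^\w\s]", "", sentence.lower())
tokens = normalized.split()

# Stop word removal, using a small illustrative stop word list.
stop_words = {"the", "a", "it", "in", "has"}
filtered = [t for t in tokens if t not in stop_words]

# N-gram extraction: collect bigrams (pairs of adjacent tokens).
bigrams = list(zip(tokens, tokens[1:]))

# Naive stemming: strip a few common suffixes so related word
# forms collapse onto the same token.
def stem(token):
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

stems = [stem(t) for t in filtered]
print(stems)  # ['mr', 'bank', 'work', 'many', 'bank']
```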
## learn-pr/wwl-data-ai/introduction-language/includes/4-semantic-models.md

For example, consider the following three-dimensional embeddings for some common words:
|Word|Vector|
|-|-|
|`dog`|[0.8, 0.6, 0.1]|
|`puppy`|[0.9, 0.7, 0.4]|
|`cat`|[0.7, 0.5, 0.2]|
|`kitten`|[0.8, 0.6, 0.5]|
|`young`|[0.1, 0.1, 0.3]|
|`ball`|[0.3, 0.9, 0.1]|
|`tree`|[0.2, 0.1, 0.9]|
We can visualize these vectors in three-dimensional space as shown here:

The vectors for `"dog"` and `"cat"` are similar (both domestic animals), as are `"puppy"` and `"kitten"` (both young animals). The words `"tree"`, `"young"`, and `"ball"` have distinctly different vector orientations, reflecting their different semantic meanings.
The semantic characteristic encoded in the vectors makes it possible to use vector-based operations that compare words and enable analytical comparisons.
### Finding related terms
Since the orientation of vectors is determined by their dimension values, words with similar semantic meanings tend to have similar orientations. This means you can use calculations such as the *cosine similarity* between vectors to make meaningful comparisons.
For example, to determine the "odd one out" between `"dog"`, `"cat"`, and `"tree"`, you can calculate the cosine similarity between pairs of vectors. The cosine similarity is calculated as:
`cosine_similarity(A, B) = (A · B) / (||A|| * ||B||)`
Where `A · B` is the dot product and `||A||` is the magnitude of vector A.
Calculating similarities between the three words:
- **`dog`** [0.8, 0.6, 0.1] and **`cat`** [0.7, 0.5, 0.2]:

The results show that `"dog"` and `"cat"` are highly similar (0.992), while `"tree"` has lower similarity to both `"dog"` (0.333) and `"cat"` (0.452). Therefore, **`tree`** is clearly the odd one out.
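The cosine similarity calculation above can be sketched in plain Python, using the embeddings from the table (a minimal illustration; in practice you would use a library such as NumPy and much higher-dimensional vectors):

```python
import math

# Three-dimensional embeddings from the table above.
embeddings = {
    "dog":  [0.8, 0.6, 0.1],
    "cat":  [0.7, 0.5, 0.2],
    "tree": [0.2, 0.1, 0.9],
}

def cosine_similarity(a, b):
    # cosine_similarity(A, B) = (A . B) / (||A|| * ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))
    return dot / (magnitude_a * magnitude_b)

dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
dog_tree = cosine_similarity(embeddings["dog"], embeddings["tree"])
cat_tree = cosine_similarity(embeddings["cat"], embeddings["tree"])

# "tree" is the odd one out: it has the lowest similarity to the other two.
print(f"dog/cat: {dog_cat:.3f}, dog/tree: {dog_tree:.3f}, cat/tree: {cat_tree:.3f}")
```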
### Vector translation through addition and subtraction
You can add or subtract vectors to produce new vector-based results, which can then be used to find tokens with matching vectors. This technique enables intuitive arithmetic-based logic to determine appropriate terms based on linguistic relationships.

These operations work because the vector for `"young"` encodes the semantic transformation from an adult animal to its young counterpart.
> [!NOTE]
> In practice, vector arithmetic rarely produces exact matches; instead, you would search for the word whose vector is *closest* (most similar) to the result.
Vector arithmetic can also answer analogy questions like "*`puppy`* is to *`dog`* as *`kitten`* is to *?*"
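This analogy can be sketched with the toy embeddings from the table above. The `nearest` helper is an illustrative assumption, implementing the nearest-vector search the note describes (with these hand-picked values, `puppy - dog + cat` happens to land exactly on `kitten`; real embeddings only land close):

```python
import math

# Three-dimensional embeddings from the table above.
embeddings = {
    "dog":    [0.8, 0.6, 0.1],
    "puppy":  [0.9, 0.7, 0.4],
    "cat":    [0.7, 0.5, 0.2],
    "kitten": [0.8, 0.6, 0.5],
    "young":  [0.1, 0.1, 0.3],
    "ball":   [0.3, 0.9, 0.1],
    "tree":   [0.2, 0.1, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest(vector, exclude=()):
    """Find the vocabulary word whose embedding is most similar to `vector`."""
    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))

# "puppy" is to "dog" as "?" is to "cat": compute puppy - dog + cat,
# then search for the closest remaining word.
analogy = [p - d + c for p, d, c in
           zip(embeddings["puppy"], embeddings["dog"], embeddings["cat"])]
result = nearest(analogy, exclude=("puppy", "dog", "cat"))
print(result)  # kitten
```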