
Commit 71dc086

Merge pull request #53693 from GraemeMalcolm/main
Update for technical accuracy
2 parents a89005f + ca370a6 commit 71dc086

2 files changed

Lines changed: 1 addition & 4 deletions

File changed: learn-pr/wwl-data-ai/fundamentals-generative-ai/includes/3-language-models.md

```diff
@@ -50,7 +50,7 @@ and so on.
 As you add more training data, more tokens will be added to the vocabulary and assigned identifiers; so you might end up with tokens for words like *puppy*, *skateboard*, *car*, and others.
 
 > [!NOTE]
-> In this simple example, we've tokenized the example text based on *words*. In reality there would also be sub-words, punctuation, and other tokens.
+> In this simple example, we've tokenized the example text based on *words*. In reality there would also be sub-words, punctuation, and other tokens.
 
 ## Transforming tokens with a *transformer*
 
```
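The unchanged context in this hunk describes how a vocabulary grows as training data is added, with each new token assigned an identifier. A minimal sketch of that idea (not code from the lesson; the corpus strings here are invented for illustration, and a real tokenizer would also emit sub-word and punctuation tokens, as the note says):

```python
def build_vocabulary(corpus):
    """Assign the next free integer identifier to each token the first time it appears."""
    vocabulary = {}
    for text in corpus:
        for word in text.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary

# As more training text is added, more tokens join the vocabulary.
corpus = ["the dog began to bark", "a puppy rode a skateboard past the car"]
vocab = build_vocabulary(corpus)
print(vocab["bark"])   # 4
print(vocab["puppy"])  # 6
```

Adding a third sentence to `corpus` would extend `vocab` with any words not yet seen, without disturbing the identifiers already assigned.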

```diff
@@ -112,9 +112,6 @@ The result of the encoding process is a set of embeddings; vectors that include
 | puppy | 127 | [5, 3, 2 ] |
 | car | 128 | [-2, -2, 1 ] |
 | skateboard | 129 | [-3, -2, 2 ] |
-| bark | 203 | [2, -2, 3 ] |
-
-If you're observant, you might have spotted that our results include two embeddings for the token "bark". It's important to understand that the embeddings represent a token within a particular *context*; and some tokens might be used to mean multiple things. For example, the *bark* of a *dog* is different from the *bark* of a *tree*! Tokens that are commonly used in multiple contexts can produce multiple embeddings.
 
 We can think of the elements of the embeddings as dimensions in a multi-dimensional vector-space. In our simple example, our embeddings only have three elements, so we can visualize them as vectors in three-dimensional space, like this:
 
```
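The retained context treats each embedding as a vector in a multi-dimensional space, where nearby vectors indicate related tokens. One common way to measure that closeness is cosine similarity; this is a sketch using the three-element vectors from the table above (cosine similarity is an assumption here, not a technique named in the lesson):

```python
import math

# Example embeddings from the table in the diff above.
embeddings = {
    "puppy":      [5, 3, 2],
    "car":        [-2, -2, 1],
    "skateboard": [-3, -2, 2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# "car" and "skateboard" point in similar directions (similarity near 1),
# while "puppy" points away from both (negative similarity).
print(cosine_similarity(embeddings["car"], embeddings["skateboard"]))
print(cosine_similarity(embeddings["puppy"], embeddings["car"]))
```

With only three dimensions the vectors can be plotted directly, which is what the visualization referenced at the end of the hunk shows; real models use hundreds or thousands of dimensions, where cosine similarity remains the standard closeness measure.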

Second changed file: a binary image (38.1 KB), not rendered here.
0 commit comments