docs/apis/language-model-best-practices.md
## Handling non-deterministic output
Most code behaves predictably — the same input always produces the same output. The [LanguageModel](/windows/windows-app-sdk/api/winrt/microsoft.windows.ai.text.languagemodel) APIs don't work that way, as the exact same prompt can yield a different response each time it's submitted due to a randomizing seed value built into the APIs.
The Phi Silica model is sensitive to any randomness, with small changes to the input and options producing large changes in the output. For example, the introduction of a single space or typo in a prompt might turn a 100 token answer into a 1000 token answer.
### Why outputs vary
The default sampling parameters (described in the following table) control the creativity of the model. Apparent randomness is generated by the random seed value of the [LanguageModel](/windows/windows-app-sdk/api/winrt/microsoft.windows.ai.text.languagemodel) API.
| Parameter | Default | Effect on variability |
| --- | --- | --- |
| Temperature | 0.9 | Higher values increase randomness; lower values produce more focused output. |
| TopP | 0.9 | Controls the cumulative probability threshold for token candidates. |
| TopP | 0.9 | Controls the cumulative probability threshold for token candidates. |
| TopK | 40 | Limits how many tokens are considered at each step; lower values reduce variability. |
### Guidance
The following guidance can help you address non-deterministic output.
- **Do not write logic that depends on exact output matching.** The API assigns a new random seed on each call, so the same prompt can produce different text every time. Small changes to the prompt — even a single extra space — can also cause large differences in output length and content. Never compare response text with exact string matching; use case-insensitive substring checks, regex, or semantic comparison instead.
- **Lower `Temperature` and `TopK` to reduce variability** when your scenario requires more consistent output. This narrows the range of possible responses but does not guarantee identical results across calls.
- **`Temperature = 0` produces deterministic output on the same machine with the same execution provider (EP) version.** However, expect different results across different hardware or after an EP update, due to differences in how numerical operations are ordered and accumulated.
#### Reducing variability
You can narrow the range of possible outputs by tightening the sampling parameters as shown in the following snippet. Setting a low `Temperature` keeps the model focused on its highest-confidence tokens, and setting `TopK = 1` restricts selection to the single most likely token at each step. This won't produce identical output across calls, but it significantly reduces how much responses diverge.
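For example, a tightened configuration might look like the following sketch. The `LanguageModelOptions` property names and the `GenerateResponseAsync` overload shown here follow the parameter table above and are illustrative; check the API reference for the exact shapes, and assume `languageModel` has already been created.

```c#
using Microsoft.Windows.AI.Text;

// Illustrative sketch: tighten sampling to reduce (not eliminate) variability.
var options = new LanguageModelOptions
{
    Temperature = 0.1f, // keep the model focused on high-confidence tokens
    TopK = 1            // consider only the single most likely token per step
};

var result = await languageModel.GenerateResponseAsync(
    "Summarize the following text in one sentence: ...", options);
```

Even with these settings, don't treat the output as reproducible; combine tightened sampling with tolerant response parsing.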
"Fragile" string comparison refers to cases where exact matching fails due to slight variations in the output, such as typos, formatting differences, or semantic shifts, which can break automated processes. Addressing this requires replacing rigid equality checks with more robust, intelligent, or fuzzy techniques.
For example, while it's tempting to branch on the exact text of a response — especially when you've asked a yes/no question — an equality check will fail unpredictably because the model can return "Yes", "yes.", "Yes, that's correct", or other variations. Instead, parse or classify the response in a way that assumes variation (check whether the response contains "yes" case-insensitively, or use the model for structured extraction).
```c#
// DO NOT do this - output is non-deterministic.
var result = await languageModel.GenerateResponseAsync("Is 2 > 1? Answer yes or no.");
if (result.Text == "Yes") // Fragile: response may be "yes", "Yes.", "Yes, 2 is greater", etc.
{
// ...
}
```
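As the comment above suggests, prefer a tolerant check. The following sketch uses a case-insensitive containment test instead of equality (and assumes the same `languageModel` instance as the previous snippet):

```c#
// DO this instead - tolerate variation in the response text.
var result = await languageModel.GenerateResponseAsync("Is 2 > 1? Answer yes or no.");
if (result.Text.Contains("yes", StringComparison.OrdinalIgnoreCase))
{
    // ...
}
```

A containment test still isn't semantic; for stronger guarantees, compare embeddings as described in the next section.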
#### Semantic comparison with embeddings
When you need to determine whether two responses mean the same thing — not just whether they contain the same words — use embeddings. The [GenerateEmbeddingVectors](/windows/windows-app-sdk/api/winrt/microsoft.windows.ai.text.languagemodel.generateembeddingvectors) method (demonstrated in the following snippet) converts text into a numeric vector that captures its meaning, so you can compare responses with cosine similarity instead of string matching. Two answers that say the same thing in different words will have a high similarity score, while unrelated answers will score low. This makes embeddings a reliable way to evaluate consistency across non-deterministic outputs.
```c#
using System;
using Microsoft.Windows.AI.Text;

// Generate embedding vectors for each response with
// LanguageModel.GenerateEmbeddingVectors, then compare the resulting
// float vectors with cosine similarity.
double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, magnitudeA = 0, magnitudeB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magnitudeA += a[i] * a[i];
        magnitudeB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magnitudeA) * Math.Sqrt(magnitudeB));
}
```