Commit 4003004

rewritten non-deterministic section based on review feedback and Claude Opus 4.6 assistance
1 parent fc4ea8f commit 4003004

1 file changed: docs/apis/language-model-best-practices.md (26 additions & 17 deletions)

## Handling non-deterministic output

Most code behaves predictably — the same input always produces the same output. The [LanguageModel](/windows/windows-app-sdk/api/winrt/microsoft.windows.ai.text.languagemodel) APIs don't work that way: the exact same prompt can yield a different response each time it's submitted, due to a randomizing seed value built into the APIs.

The Phi Silica model is sensitive to any randomness, with small changes to the input and options producing large changes in the output. For example, the introduction of a single space or typo in a prompt might turn a 100-token answer into a 1,000-token answer.

### Why outputs vary

The default sampling parameters (described in the following table) control the creativity of the model. The apparent randomness comes from the random seed value that the [LanguageModel](/windows/windows-app-sdk/api/winrt/microsoft.windows.ai.text.languagemodel) API assigns to each call.

| Parameter | Default | Effect on variability |
| --- | --- | --- |
| Temperature | 0.9 | Higher values increase randomness; lower values produce more focused output. |
| TopP | 0.9 | Controls the cumulative probability threshold for token candidates. |
| TopK | 40 | Limits how many tokens are considered at each step; lower values reduce variability. |
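
To build intuition for how these parameters interact, the following sketch applies temperature scaling, then top-k and top-p filtering, to a set of token scores. It is not part of the [LanguageModel](/windows/windows-app-sdk/api/winrt/microsoft.windows.ai.text.languagemodel) API; `CandidateTokens` and its inputs are invented here purely as a conceptual model of the sampling step that these options configure.

```c#
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only - not the LanguageModel API. Models how temperature,
// top-k, and top-p narrow the set of candidate tokens at each step.
int[] CandidateTokens(double[] logits, double temperature, int topK, double topP)
{
    // Lower temperature sharpens the distribution toward the top token.
    double max = logits.Max();
    var exp = logits.Select(l => Math.Exp((l - max) / Math.Max(temperature, 1e-6))).ToArray();
    double sum = exp.Sum();
    var probs = exp.Select(e => e / sum).ToArray();

    // Top-k: rank tokens by probability and keep only the k most likely.
    var ranked = probs
        .Select((prob, index) => (prob, index))
        .OrderByDescending(t => t.prob)
        .Take(topK);

    // Top-p: keep the smallest prefix whose cumulative probability reaches topP.
    var candidates = new List<int>();
    double cumulative = 0;
    foreach (var (prob, index) in ranked)
    {
        candidates.Add(index);
        cumulative += prob;
        if (cumulative >= topP) break;
    }
    return candidates.ToArray(); // The next token is sampled from this set.
}
```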

### Guidance

The following guidance can help you address non-deterministic output.

- **Do not write logic that depends on exact output matching.** The API assigns a new random seed on each call, so the same prompt can produce different text every time. Small changes to the prompt — even a single extra space — can also cause large differences in output length and content. Never compare response text with exact string matching; use case-insensitive substring checks, regular expressions, or semantic comparison instead.
- **Lower `Temperature` and `TopK` to reduce variability** when your scenario requires more consistent output. This narrows the range of possible responses but does not guarantee identical results across calls.
- **`Temperature = 0` produces deterministic output only on the same machine with the same execution provider (EP) version.** Expect different results across different hardware or after an EP update, due to differences in how numerical operations are ordered and accumulated.

#### Reducing variability

You can narrow the range of possible outputs by tightening the sampling parameters as shown in the following snippet. Setting a low `Temperature` keeps the model focused on its highest-confidence tokens, and setting `TopK = 1` restricts selection to the single most likely token at each step. This won't produce identical output across calls, but it significantly reduces how much responses diverge.

```c#
using Microsoft.Windows.AI.Text;

// A sketch of the full method: the LanguageModelOptions property names and
// the GenerateResponseAsync(prompt, options) overload are assumed here.
async Task<string> GenerateWithLowVariability(LanguageModel languageModel, string prompt)
{
    var options = new LanguageModelOptions
    {
        Temperature = 0.1f, // Stay close to the highest-confidence tokens.
        TopK = 1            // Consider only the single most likely token per step.
    };

    var result = await languageModel.GenerateResponseAsync(prompt, options);
    return result.Text;
}
```
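
A brief usage sketch (assumes `languageModel` is an already-initialized `LanguageModel` instance; the prompt is illustrative):

```c#
// Tighter sampling: repeated calls cluster around the same phrasing,
// though identical output is still not guaranteed.
string summary = await GenerateWithLowVariability(
    languageModel, "Summarize the following paragraph in one sentence: ...");
```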

#### Fragile string comparison

"Fragile" string comparison refers to exact matching that fails because of slight variations in the output, such as typos, formatting differences, or semantic shifts; these variations can break automated processes. Addressing this requires replacing rigid equality checks with more robust, fuzzy, or semantic techniques.

For example, while it's tempting to branch on the exact text of a response — especially when you've asked a yes/no question — an equality check will fail unpredictably because the model can return "Yes", "yes.", "Yes, that's correct", or other variations. Instead, parse or classify the response in a way that assumes variation: check whether the response contains "yes" case-insensitively (as shown after the following snippet), or use the model for structured extraction.

```c#
// DO NOT do this - output is non-deterministic.
var result = await languageModel.GenerateResponseAsync("Is 2 > 1? Answer yes or no.");
if (result.Text == "Yes") // Fragile: response may be "yes", "Yes.", "Yes, 2 is greater", etc.
{
    // ...
}
```
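
In contrast, a tolerant check searches for the expected token instead of comparing the whole response. A minimal sketch, reusing `result` from the snippet above:

```c#
// Tolerant: case-insensitive containment instead of exact equality.
// For richer answers, prefer structured extraction over substring checks.
if (result.Text.Contains("yes", StringComparison.OrdinalIgnoreCase))
{
    // ...
}
```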

#### Semantic comparison with embeddings

When you need to determine whether two responses mean the same thing — not just whether they contain the same words — use embeddings. The [GenerateEmbeddingVectors](/windows/windows-app-sdk/api/winrt/microsoft.windows.ai.text.languagemodel.generateembeddingvectors) method (demonstrated in the following snippet) converts text into a numeric vector that captures its meaning, so you can compare responses with cosine similarity instead of string matching. Two answers that say the same thing in different words will have a high similarity score, while unrelated answers will score low. This makes embeddings a reliable way to evaluate consistency across non-deterministic outputs.

```c#
using System;
using Microsoft.Windows.AI.Text;

// Cosine similarity between two vectors: 1.0 means same direction
// (semantically equivalent embeddings); values near 0 mean unrelated.
double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    double denominator = Math.Sqrt(magA) * Math.Sqrt(magB);
    return denominator == 0 ? 0 : dot / denominator;
}

async Task<bool> AreResponsesSemanticallyEqual(
    LanguageModel languageModel, string prompt, double threshold = 0.9)
{
    var result1 = await languageModel.GenerateResponseAsync(prompt);
    if (result1.Status != LanguageModelResponseStatus.Complete)
        return false;

    var result2 = await languageModel.GenerateResponseAsync(prompt);
    if (result2.Status != LanguageModelResponseStatus.Complete)
        return false;

    var embedding1 = languageModel.GenerateEmbeddingVectors(result1.Text);
    var embedding2 = languageModel.GenerateEmbeddingVectors(result2.Text);

    // ToFloatArray is a hypothetical helper, not part of the API: extract the
    // raw float values in whatever way your embedding type exposes them.
    return CosineSimilarity(ToFloatArray(embedding1), ToFloatArray(embedding2)) >= threshold;
}
```
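
In the sketch above, `ToFloatArray` is a placeholder for extracting raw float values from the embedding result; adapt it to the actual embedding type the API returns. The `0.9` threshold is a starting point: raise it to treat only near-identical meanings as equal, or lower it to tolerate looser paraphrases, and tune it against real outputs from your scenario.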
