Commit 4003004

rewritten non-deterministic section based on review feedback and Claude Opus 4.6 assistance
1 parent fc4ea8f commit 4003004

1 file changed: docs/apis/language-model-best-practices.md (26 additions & 17 deletions)

## Handling non-deterministic output

Most code behaves predictably — the same input always produces the same output. The [LanguageModel](/windows/windows-app-sdk/api/winrt/microsoft.windows.ai.text.languagemodel) APIs don't work that way: the exact same prompt can yield a different response each time it's submitted, due to a randomizing seed value built into the APIs.

The Phi Silica model is sensitive to any randomness, with small changes to the input and options producing large changes in the output. For example, the introduction of a single space or typo in a prompt might turn a 100-token answer into a 1,000-token answer.

### Why outputs vary

The default sampling parameters (described in the following table) control the creativity of the model. The apparent randomness comes from the random seed value that the [LanguageModel](/windows/windows-app-sdk/api/winrt/microsoft.windows.ai.text.languagemodel) API assigns to each call.

| Parameter | Default | Effect on variability |
| --- | --- | --- |
| Temperature | 0.9 | Higher values increase randomness; lower values produce more focused output. |
| TopP | 0.9 | Controls the cumulative probability threshold for token candidates. |
| TopK | 40 | Limits how many tokens are considered at each step; lower values reduce variability. |
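
To build intuition for how these parameters interact, the following sketch applies temperature scaling, then top-k and top-p filtering, to a set of token scores. It is not part of the [LanguageModel](/windows/windows-app-sdk/api/winrt/microsoft.windows.ai.text.languagemodel) API; `CandidateTokens` and its inputs are invented here purely as a conceptual model of the sampling step that these options configure.

```c#
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only - not the LanguageModel API. Models how temperature,
// top-k, and top-p narrow the set of candidate tokens at each step.
int[] CandidateTokens(double[] logits, double temperature, int topK, double topP)
{
    // Lower temperature sharpens the distribution toward the top token.
    double max = logits.Max();
    var exp = logits.Select(l => Math.Exp((l - max) / Math.Max(temperature, 1e-6))).ToArray();
    double sum = exp.Sum();
    var probs = exp.Select(e => e / sum).ToArray();

    // Top-k: rank tokens by probability and keep only the k most likely.
    var ranked = probs
        .Select((prob, index) => (prob, index))
        .OrderByDescending(t => t.prob)
        .Take(topK);

    // Top-p: keep the smallest prefix whose cumulative probability reaches topP.
    var candidates = new List<int>();
    double cumulative = 0;
    foreach (var (prob, index) in ranked)
    {
        candidates.Add(index);
        cumulative += prob;
        if (cumulative >= topP) break;
    }
    return candidates.ToArray(); // The next token is sampled from this set.
}
```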

### Guidance

The following guidance can help you address non-deterministic output.

- **Do not write logic that depends on exact output matching.** The API assigns a new random seed on each call, so the same prompt can produce different text every time. Small changes to the prompt — even a single extra space — can also cause large differences in output length and content. Never compare response text with exact string matching; use case-insensitive substring checks, regular expressions, or semantic comparison instead.
- **Lower `Temperature` and `TopK` to reduce variability** when your scenario requires more consistent output. This narrows the range of possible responses but does not guarantee identical results across calls.
- **`Temperature = 0` produces deterministic output only on the same machine with the same execution provider (EP) version.** Expect different results across different hardware or after an EP update, due to differences in how numerical operations are ordered and accumulated.

#### Reducing variability

You can narrow the range of possible outputs by tightening the sampling parameters as shown in the following snippet. Setting a low `Temperature` keeps the model focused on its highest-confidence tokens, and setting `TopK = 1` restricts selection to the single most likely token at each step. This won't produce identical output across calls, but it significantly reduces how much responses diverge.

```c#
using Microsoft.Windows.AI.Text;

// A sketch of the full method: the LanguageModelOptions property names and
// the GenerateResponseAsync(prompt, options) overload are assumed here.
async Task<string> GenerateWithLowVariability(LanguageModel languageModel, string prompt)
{
    var options = new LanguageModelOptions
    {
        Temperature = 0.1f, // Stay close to the highest-confidence tokens.
        TopK = 1            // Consider only the single most likely token per step.
    };

    var result = await languageModel.GenerateResponseAsync(prompt, options);
    return result.Text;
}
```
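
A brief usage sketch (assumes `languageModel` is an already-initialized `LanguageModel` instance; the prompt is illustrative):

```c#
// Tighter sampling: repeated calls cluster around the same phrasing,
// though identical output is still not guaranteed.
string summary = await GenerateWithLowVariability(
    languageModel, "Summarize the following paragraph in one sentence: ...");
```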

#### Fragile string comparison

"Fragile" string comparison refers to exact matching that fails because of slight variations in the output, such as typos, formatting differences, or semantic shifts; these variations can break automated processes. Addressing this requires replacing rigid equality checks with more robust, fuzzy, or semantic techniques.

For example, while it's tempting to branch on the exact text of a response — especially when you've asked a yes/no question — an equality check will fail unpredictably because the model can return "Yes", "yes.", "Yes, that's correct", or other variations. Instead, parse or classify the response in a way that assumes variation: check whether the response contains "yes" case-insensitively (as shown after the following snippet), or use the model for structured extraction.

```c#
// DO NOT do this - output is non-deterministic.
var result = await languageModel.GenerateResponseAsync("Is 2 > 1? Answer yes or no.");
if (result.Text == "Yes") // Fragile: response may be "yes", "Yes.", "Yes, 2 is greater", etc.
{
    // ...
}
```
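
In contrast, a tolerant check searches for the expected token instead of comparing the whole response. A minimal sketch, reusing `result` from the snippet above:

```c#
// Tolerant: case-insensitive containment instead of exact equality.
// For richer answers, prefer structured extraction over substring checks.
if (result.Text.Contains("yes", StringComparison.OrdinalIgnoreCase))
{
    // ...
}
```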

#### Semantic comparison with embeddings

When you need to determine whether two responses mean the same thing — not just whether they contain the same words — use embeddings. The [GenerateEmbeddingVectors](/windows/windows-app-sdk/api/winrt/microsoft.windows.ai.text.languagemodel.generateembeddingvectors) method (demonstrated in the following snippet) converts text into a numeric vector that captures its meaning, so you can compare responses with cosine similarity instead of string matching. Two answers that say the same thing in different words will have a high similarity score, while unrelated answers will score low. This makes embeddings a reliable way to evaluate consistency across non-deterministic outputs.

```c#
using System;
using Microsoft.Windows.AI.Text;

// Cosine similarity between two vectors: 1.0 means same direction
// (semantically equivalent embeddings); values near 0 mean unrelated.
double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    double denominator = Math.Sqrt(magA) * Math.Sqrt(magB);
    return denominator == 0 ? 0 : dot / denominator;
}

async Task<bool> AreResponsesSemanticallyEqual(
    LanguageModel languageModel, string prompt, double threshold = 0.9)
{
    var result1 = await languageModel.GenerateResponseAsync(prompt);
    if (result1.Status != LanguageModelResponseStatus.Complete)
        return false;

    var result2 = await languageModel.GenerateResponseAsync(prompt);
    if (result2.Status != LanguageModelResponseStatus.Complete)
        return false;

    var embedding1 = languageModel.GenerateEmbeddingVectors(result1.Text);
    var embedding2 = languageModel.GenerateEmbeddingVectors(result2.Text);

    // ToFloatArray is a hypothetical helper, not part of the API: extract the
    // raw float values in whatever way your embedding type exposes them.
    return CosineSimilarity(ToFloatArray(embedding1), ToFloatArray(embedding2)) >= threshold;
}
```
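
In the sketch above, `ToFloatArray` is a placeholder for extracting raw float values from the embedding result; adapt it to the actual embedding type the API returns. The `0.9` threshold is a starting point: raise it to treat only near-identical meanings as equal, or lower it to tolerate looser paraphrases, and tune it against real outputs from your scenario.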
