From 7c5cf248487e3a89a3f747fb16ae14e6a36d484c Mon Sep 17 00:00:00 2001
From: Eric Lam <voidful.stack@gmail.com>
Date: Tue, 12 May 2026 09:22:20 +0800
Subject: [PATCH 1/2] Add practical long-audio ASR inference notes

---
 docs/vibevoice-asr.md | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)
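
Reviewer note (not part of the documentation change): steps 3 to 5 of the chunked workflow added below can be sketched roughly as follows. This is a minimal illustration, not the project's implementation. The `Segment` fields and every function name here are hypothetical, and it assumes the ASR output for each chunk has already been parsed into segments with chunk-local timestamps in seconds.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    # Hypothetical container for one predicted segment; times are in seconds.
    start: float
    end: float
    speaker: str
    text: str


def offset_segments(segments, chunk_start):
    """Step 3: add the chunk start time back to chunk-local timestamps."""
    return [replace(s, start=s.start + chunk_start, end=s.end + chunk_start)
            for s in segments]


def stitch(chunks):
    """Step 4: concatenate adjusted segments; chunks is [(chunk_start, [Segment, ...])]."""
    merged = []
    for chunk_start, segments in sorted(chunks, key=lambda c: c[0]):
        merged.extend(offset_segments(segments, chunk_start))
    return merged


def coverage(segments, total_duration):
    """Step 5a: fraction of the recording covered by at least one segment."""
    covered, cursor = 0.0, 0.0
    for s in sorted(segments, key=lambda s: s.start):
        lo, hi = max(s.start, cursor), min(s.end, total_duration)
        if hi > lo:
            covered += hi - lo
            cursor = hi
    return covered / total_duration


def is_monotonic(segments):
    """Step 5b: every segment is well formed and start times never go backwards."""
    well_formed = all(s.start <= s.end for s in segments)
    ordered = all(a.start <= b.start for a, b in zip(segments, segments[1:]))
    return well_formed and ordered


def longest_repeat_run(segments):
    """Step 5c: longest run of consecutive segments with identical text."""
    longest, run, prev = 0, 0, None
    for s in segments:
        text = s.text.strip().lower()
        run = run + 1 if text and text == prev else 1
        longest = max(longest, run)
        prev = text
    return longest
```

Speaker labels are passed through unchanged in this sketch, so they remain chunk-local. The thresholds for acceptable coverage and repeat-run length are left to the deployment.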

diff --git a/docs/vibevoice-asr.md b/docs/vibevoice-asr.md
index 5e659448..0d618768 100644
--- a/docs/vibevoice-asr.md
+++ b/docs/vibevoice-asr.md
@@ -87,6 +87,32 @@ python demo/vibevoice_asr_gradio_demo.py --model_path microsoft/VibeVoice-ASR --
 python demo/vibevoice_asr_inference_from_file.py --model_path microsoft/VibeVoice-ASR --audio_files [add an audio path here] 
 ```
 
+### Practical notes for long audio
+
+For recordings that exceed the configured single-pass limit, or when GPU memory
+is constrained, a practical fallback is to split the audio into bounded chunks
+and stitch the structured outputs after inference.
+
+A robust chunked workflow should:
+
+1. keep each chunk within the duration and token length validated for the target
+   deployment, for example 30-minute chunks;
+2. run ASR independently for each chunk;
+3. add the chunk start time back to every predicted segment timestamp;
+4. concatenate the timestamp-adjusted segments;
+5. check timestamp coverage and timestamp monotonicity, and scan for
+   repeated-text loops, rather than evaluating WER alone.
+
+When chunking, speaker labels may be local to each chunk. Applications that need
+globally consistent speaker identities should add a separate speaker-linking or
+diarization step across chunks.
+
+For Hugging Face generation, memory use can also depend on prefill-time
+intermediate tensors. If your inference stack supports it, setting
+`logits_to_keep=1` can avoid computing full vocabulary logits for every prefill
+position. Chunked prefill can further reduce activation peaks, at the cost of
+additional runtime.
+
 
 ## Finetuning
 LoRA (Low-Rank Adaptation) fine-tuning is supported. See [Finetuning](../finetuning-asr/README.md) for detailed guide.

From 2d056c8b8c675e1166cdd515c80949b10ba0cc50 Mon Sep 17 00:00:00 2001
From: Eric Lam <voidful.stack@gmail.com>
Date: Tue, 12 May 2026 09:26:26 +0800
Subject: [PATCH 2/2] Document YaRN long-audio ASR findings

---
 docs/vibevoice-asr.md | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
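
Reviewer note (not part of the documentation change): the override values quoted in this patch could be assembled roughly as below. This is a sketch under stated assumptions. The three keys follow the names used in the patch text, but the `rope_scaling` key, the plain-dict config, and both helper names are hypothetical, since the patch does not say where the override is set in the VibeVoice-ASR configuration.

```python
def yarn_rope_override(factor=1.5, original_max_position_embeddings=131072):
    """Build the YaRN override described in the patch as a plain dict."""
    if factor <= 1.0:
        raise ValueError("a YaRN factor is expected to be greater than 1.0")
    return {
        "rope_type": "yarn",
        "factor": float(factor),
        "original_max_position_embeddings": int(original_max_position_embeddings),
    }


def with_rope_override(config, override, key="rope_scaling"):
    """Return a copy of a dict-style config with the override attached.

    The key name is an assumption; adjust it to whatever the target
    inference stack actually reads.
    """
    updated = dict(config)
    updated[key] = dict(override)
    return updated
```

Keeping the override in a separate copy makes it straightforward to run the same evaluation set with and without YaRN before changing any default.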

diff --git a/docs/vibevoice-asr.md b/docs/vibevoice-asr.md
index 0d618768..62773647 100644
--- a/docs/vibevoice-asr.md
+++ b/docs/vibevoice-asr.md
@@ -107,6 +107,23 @@ When chunking, speaker labels may be local to each chunk. Applications that need
 globally consistent speaker identities should add a separate speaker-linking or
 diarization step across chunks.
 
+For single-pass runs near the context boundary, YaRN RoPE scaling can improve
+long-audio robustness when the model's existing context configuration is used as
+the base. In one 11-item long-form stress test, setting
+`rope_type=yarn`, `factor=1.5`, and
+`original_max_position_embeddings=131072` preserved 30-minute quality while
+removing the observed 90-minute collapse cases:
+
+| Setting | e22 90m WER | e22 coverage | TED 90m WER | TED coverage | 11-item mean WER | Collapses |
+|---|---:|---:|---:|---:|---:|---:|
+| No RoPE override | 0.5824 | 77.6% | 0.8250 | 21.8% | 0.2328 | 2 |
+| YaRN, factor=1.5, original_max=131072 | 0.4859 | 82.0% | 0.3422 | 91.0% | 0.2542 | 0 |
+
+This is a robustness trade-off, not a memory optimization: the 11-item mean WER
+rose from 0.2328 to 0.2542, and YaRN changes position scaling without reducing
+KV-cache size or activation memory. Validate the factor on the target audio
+distribution before making it the default path.
+
 For Hugging Face generation, memory use can also depend on prefill-time
 intermediate tensors. If your inference stack supports it, setting
 `logits_to_keep=1` can avoid computing full vocabulary logits for every prefill