initPTS ping-pong prevents recovery on repeated dropped-append after PR #7797 #7811

@hongjun-bae

Description

What version of Hls.js are you using?

v1.6.13 with PR #7797 (commit e97638fec) cherry-picked on top of v1.6.0.

What browser (including version) are you using?

Chrome 147.0.0.0

What OS (including version) are you using?

macOS (Apple Silicon)

Test stream

Live LL-HLS fMP4/CMAF stream (~1s parts, ~2s segments, single muxed audio+video track, avc1.64002A + mp4a.40.2, PTS timescale 48000). Not publicly shareable. Corruption was injected via Proxyman: a single live segment/part in the middle of playback was rewritten so that its moof/mdat contained no decodable samples (the same technique used to reproduce the original PR #7797 bug).

Configuration

{
  "autoStartLoad": true,
  "startFragPrefetch": true,
  "lowLatencyMode": true,
  "manifestLoadingMaxRetry": 3,
  "levelLoadingMaxRetry": 1,
  "fragLoadingMaxRetry": 3,
  "maxMaxBufferLength": 60
}

Steps to reproduce

  1. Start playing the live LL-HLS stream. Playback begins normally.
  2. With Proxyman, inject a corrupted .m4v part (empty moof/mdat) at some live position A. PR #7797 ("Prevent loop-loading of segments with dropped appends") kicks in: Nothing buffered for … is logged, the fragment is marked as a gap, gap-controller's skipping hole advances currentTime past A, and playback resumes after ~5s of stall. The first recovery works.
  3. Once playback is stable again, inject a second corrupted part at position B (≥ ~15s after A).
  4. Observe Network tab and console: after the second injection, hls.js does not mark the fragment as a gap quickly. Instead it enters a passthrough-remuxer Timestamps at … != initPTS … ping-pong loop that oscillates between two initPTS values, repeatedly rewrites SourceBuffer.timestampOffset, and never expands the buffered TimeRange. Playback is stuck for ~20s until stream-controller's "too far from the end of live sliding playlist" safety net kicks in and force-seeks to the live edge.

Repro rate: ~100% on the second injection once initPTS has already been set for the session. The "too far from live edge" safety net exists only for live playback, so on VOD the player would remain permanently stuck; we have not verified a VOD repro, but the only recovery path we observed is live-only.

Expected behaviour

After PR #7797 landed, any dropped-append segment — regardless of whether it's the first or the second in a session — should be detected as an empty append and marked as a gap within 1–2 retries, so that gap-controller can skip it. Recovery latency should be comparable to the first injection (~5s), not ~20s, and it should not depend on the live-edge safety net.

What actually happened?

The first corruption recovers cleanly via PR #7797 plus gap-controller's skipping hole. The second corruption triggers a different failure path: passthrough-remuxer's isInvalidInitPts incorrectly resets initPTS based on the tfdt of the corrupted segment, which permanently poisons fragment.startPTS (it becomes a large negative number); every subsequent iteration then toggles initPTS between the "correct" and the "poisoned" value. PR #7797's Nothing buffered detection still fires, but only after several retries, and the addAsGap marking on the corrupted segment does not help because the next (healthy) segment also fails to append: it inherits the poisoned timestampOffset.

Timeline from the log (attached as 7797-to-be-repro.log)

Startup (22:08:02–03):

22:08:02.719 [level-controller]: manifest loaded, 5 level(s), 1080p SDR avc1,mp4a @8384000
22:08:03.020 [passthrough-remuxer]: Found initPTS at playlist time: 26.008671 offset: 15637.974974833334 (750622798.792/48000) trackId: 2

initPTS = 750622798.792/48000 ≈ 15637.97s. This is the healthy anchor for the rest of the session.

First corruption — recovers via PR #7797 (22:08:08–13):

22:08:08.417 [gap-controller]: Playback stalling at @32.945342 due to low buffer (len 0.08, buffer:[26.008-33.027])
22:08:13.517 [gap-controller]: skipping hole, adjusting currentTime from 32.945342 to 34.058671
22:08:13.618 [gap-controller]: playback not stuck anymore @34.072998, after 5291ms

Total recovery: ~5.3s. Nothing buffered for … log lines confirm PR #7797 fired; addAsGap marked the corrupt fragment, _loadFragForPlayback returned true, gap-controller skipped the hole.

Second corruption — stuck for ~20s (22:08:25–51):

22:08:25.366 [passthrough-remuxer]: No audio or video samples found for initPTS at playlist time: 48.016008
22:08:27.517 [gap-controller]: Playback stalling at @47.944303 due to low buffer
              (buffer:[32.057-33.027][34.009-48.025])
22:08:30.370 [passthrough-remuxer]: No audio or video samples found for initPTS at playlist time: 49.033008

22:08:31.341 [buffer-controller]: Nothing buffered for main level: 4 sn: 7842 retries 1
22:08:31.557 [buffer-controller]: Nothing buffered for main level: 4 sn: 7842 retries 3
22:08:31.698 [passthrough-remuxer]: Timestamps at playlist time: -15589.958966833334 31275.958612666665
              != initPTS: 15637.974974833334 (750622798.792/48000) trackId: 2
22:08:31.812 [passthrough-remuxer]: Found initPTS at playlist time: 52.014004333332196
              offset: 15637.974995666667 (750622799.792/48000) trackId: 2
22:08:31.862 [passthrough-remuxer]: Timestamps at playlist time: -15589.958966833334 31275.958612666665
              != initPTS: 15637.974995666667 (750622799.792/48000)
22:08:31.912 [passthrough-remuxer]: Timestamps at playlist time: 52.014004333332196 15637.974995666667
              != initPTS: 31275.958612666665 (1501246013.408/48000)
22:08:31.915 [buffer-controller]: Nothing buffered for main level: 4 sn: 7843 retries 2

Note three things:

  1. initPTS oscillates between two values that are almost exactly 2× apart: 750622798.792/48000 (≈ 15637.97s) vs 1501246013.408/48000 (≈ 31275.96s); 1501246013.408 = 750622798.792 + 750623214.616 ≈ 2× the original. This pattern is identical to the PR #7797 reproduction: the corrupted segment carries a tfdt that is far enough from the live initPTS to trip isInvalidInitPts's Math.abs(startTime - timeOffset) > minDuration guard. isInvalidInitPts then resets initPTS, which makes the next healthy segment look invalid relative to the new anchor, which resets it back, ad infinitum.
  2. fragment.startPTS is poisoned with a negative value: after the first bad remux of sn 7841, we see Parsed main sn: 7841 of level 4 (frag:[-15589.959-50.009]). On the next iteration stream-controller re-loads sn 7841 with timeOffset = -15589.959 which is the very input that lets isInvalidInitPts continue firing. The loop is self-sustaining.
  3. PR #7797's addAsGap does fire for the corrupt segment (Nothing buffered for … sn: 7842), but sn 7843, which is not corrupt, is also reported as Nothing buffered because its append uses the wrong timestampOffset set by the ping-pong, so the SourceBuffer silently drops its samples. The gap marking cannot propagate forward: only the directly-empty fragment gets gap=true.
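
The 2× relationship in point 1 can be checked directly from the logged rationals (a TypeScript sketch; the numbers are copied from the log lines above):

```typescript
// Values copied verbatim from the passthrough-remuxer log lines above.
const TIMESCALE = 48000;
const healthyBase = 750622798.792;   // "Found initPTS ... (750622798.792/48000)"
const poisonedBase = 1501246013.408; // "!= initPTS: ... (1501246013.408/48000)"

const healthySec = healthyBase / TIMESCALE;   // ≈ 15637.9750 s
const poisonedSec = poisonedBase / TIMESCALE; // ≈ 31275.9586 s

// The poisoned anchor is the healthy offset applied twice (within ~10 ms):
const ratio = poisonedBase / healthyBase;
console.log(healthySec.toFixed(4), poisonedSec.toFixed(4), ratio.toFixed(6));
```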

Final recovery via live-edge safety net (22:08:51):

22:08:51.356 [stream-controller]: Playback: 47.944 is located too far from the end of live sliding playlist:
              80.028, reset currentTime to : 74.979
22:08:51.362 [stream-controller]: Media seeking to 74.979, state: FRAG_LOADING, out of buffer
22:08:51.665 [stream-controller]: Resetting level fragment error count of 4 on frag buffered
22:08:51.868 [stream-controller]: Parsed main sn: 7855 part: 1 of level 4 (part:[77.044-78.027]INDEPENDENT=NO)

Recovery is only via the live-edge reset, ~20s after the corruption. VOD has no such net.

Root cause analysis

The regression introduced by the second injection is in src/remux/passthrough-remuxer.ts:

// passthrough-remuxer.ts L241-258
if (
  (accurateTimeOffset || !initPTS) &&
  (isInvalidInitPts(initPTS, decodeTime, timeOffset, duration) ||
    timescale !== initPTS.timescale)
) {
  if (initPTS) {
    this.warn(`Timestamps at playlist time: …`);
  }
  initPTS = null;                 // ← unconditional reset
  initSegment.initPTS = baseTime; //   even when the "new" PTS is
  initSegment.timescale = timescale; //   obviously wrong (e.g. 2× the original)
  initSegment.trackId = trackId;
}

isInvalidInitPts uses a symmetric Math.abs(startTime - timeOffset) > minDuration test. It cannot distinguish between:

  • a legitimate initPTS-rollover (long playback / 33-bit PTS wrap)
  • a spurious tfdt in a corrupt segment whose timeOffset is itself derived from the previous iteration's poisoned fragment.startPTS

Combined with the fact that stream-controller dutifully persists the remuxer-reported startPTS = decodeTime - initPTS.baseTime/initPTS.timescale back into fragment.startPTS (stream-controller.ts L1367, frag.setElementaryStreamInfo), a single bad segment can poison the timeline for every subsequent iteration.
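
For reference, the symmetric guard can be paraphrased in isolation (a simplified sketch with a hypothetical signature, not the actual hls.js source) to show why a corrupt 2× tfdt and a later healthy segment each look "invalid" relative to the other's anchor:

```typescript
// Simplified paraphrase of the symmetric guard (hypothetical signature).
interface RationalTimestamp {
  baseTime: number;  // in timescale units
  timescale: number;
}

function isInvalidInitPts(
  initPTS: RationalTimestamp | null,
  decodeTime: number, // tfdt baseMediaDecodeTime of the incoming segment (timescale units)
  timeOffset: number, // expected playlist time, derived from fragment.startPTS (seconds)
  duration: number,   // segment duration (seconds)
): boolean {
  if (initPTS === null) return true;
  // Project the segment onto the playlist timeline using the current anchor.
  const startTime = (decodeTime - initPTS.baseTime) / initPTS.timescale;
  // Symmetric test: fires for BOTH a stale anchor and a bogus tfdt.
  return Math.abs(startTime - timeOffset) > Math.max(duration, 1);
}

// Feedback loop: a corrupt tfdt of ~2x the anchor projects to startTime ≈ +15638 s,
// trips the guard, resets the anchor, and the next healthy segment trips it back.
const healthy = { baseTime: 750622798.792, timescale: 48000 };
const corruptTfdt = 2 * 750622798.792; // the ~2x signature from the log
console.log(isInvalidInitPts(healthy, corruptTfdt, 48.016, 2)); // guard fires
```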

PR #7797 closes the "dropped-append" side of the problem (FragmentTracker.detectPartialFragments → addAsGap), but does nothing to defend the initPTS/startPTS anchor itself. Once the anchor is poisoned, healthy segments that immediately follow look corrupt, and addAsGap only tags the directly-empty one. stream-controller then oscillates between the two (Loading main sn: 7843 of level 4 (frag:[52.014-54.019]) ↔ Loading main sn: 7841 of level 4 (frag:[-15589.959-50.009])) until the live-edge safety net force-seeks.

Suggested fix

Three directions (the third optional), independently or combined:

1. Make isInvalidInitPts asymmetric and cc-bounded.

Reject an incoming candidate when it would move the anchor by more than a few multiples of frag.duration and the new candidate is suspiciously close to an integer multiple of the existing one (Math.abs(newBase - k·oldBase) < halfSegmentDuration for small k), which is the tfdt-rollover / parallel-encoder signature shown above. Under cc continuity, tfdt should drift by at most a few segment durations between remuxes — anything larger is a corruption signal, not a legitimate anchor change.
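
A minimal sketch of such a guard (hypothetical helper name and thresholds, not an existing hls.js API):

```typescript
// Hypothetical sketch of fix 1: reject an initPTS candidate that jumps far
// AND lands near an integer multiple of the current anchor -- the
// tfdt-rollover / corrupt-segment signature described above.
function isSuspiciousInitPtsCandidate(
  oldBase: number,      // current initPTS.baseTime (timescale units)
  newBase: number,      // candidate tfdt baseMediaDecodeTime (timescale units)
  timescale: number,
  fragDuration: number, // seconds
  maxDriftFrags = 3,    // allow a few fragment durations of legitimate drift
): boolean {
  const driftSec = Math.abs(newBase - oldBase) / timescale;
  if (driftSec <= maxDriftFrags * fragDuration) {
    return false; // small drift: legitimate anchor refresh
  }
  // Large jump: only suspicious if it sits near k * oldBase for small k.
  const halfFragUnits = (fragDuration / 2) * timescale;
  for (let k = 2; k <= 4; k++) {
    if (Math.abs(newBase - k * oldBase) < halfFragUnits) {
      return true; // corruption signature: keep the existing anchor
    }
  }
  return false; // large jump but not a multiple: could be a real discontinuity
}
```

With the values from the log (oldBase 750622798.792, candidate 1501246013.408, timescale 48000, ~2s fragments), the candidate sits 415.824 units (≈ 8.7 ms) from 2× the anchor, well inside half a fragment, so it would be rejected.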

2. Defend fragment.startPTS in stream-controller.

In _bufferFragmentData / updateLevelTiming, if the newly reported startPTS disagrees with the playlist-declared fragment.start by more than a few segment durations, do not write it back — instead keep the playlist timing, mark the fragment partial, and let FragmentTracker decide whether to addAsGap. A remux result that produces frag:[-15589.959-50.009] for a playlist fragment at [52-54] is never right and should not propagate.
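
A sketch of the write-back guard (hypothetical shape; the real stream-controller types and call sites differ):

```typescript
// Hypothetical sketch of fix 2: refuse to persist a remuxed startPTS that
// contradicts the playlist-declared fragment timing.
interface FragTiming {
  start: number;    // playlist-declared start (seconds)
  duration: number; // playlist-declared duration (seconds)
}

function sanitizeStartPTS(
  frag: FragTiming,
  reportedStartPTS: number,
  maxDriftFrags = 3,
): { startPTS: number; partial: boolean } {
  const drift = Math.abs(reportedStartPTS - frag.start);
  if (drift > maxDriftFrags * frag.duration) {
    // frag:[-15589.959-...] for a playlist fragment at [52-54] is never right:
    // keep the playlist timing and flag the fragment as partial so the
    // fragment tracker can decide whether to addAsGap.
    return { startPTS: frag.start, partial: true };
  }
  return { startPTS: reportedStartPTS, partial: false };
}
```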

3. (Optional) Extend the "too far from live edge" safety net to stalls in VOD.

Today this logic is live-only. A VOD client experiencing the same ping-pong loop has no escape hatch. Consider generalising it to "buffered range has not advanced for N seconds while stream-controller is in IDLE/FRAG_LOADING" and emit a FRAG_GAP-style error that allows the app to startLoad(seekPos) its way out.
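
One possible shape for such a watchdog (a hypothetical class for illustration, not an hls.js API):

```typescript
// Hypothetical sketch of fix 3: a VOD-safe stall watchdog that fires when the
// buffered end has not advanced for N ms while fragments keep being loaded.
class BufferProgressWatchdog {
  private lastBufferedEnd = -1;
  private stalledSince: number | null = null;

  constructor(private stallTimeoutMs = 10000) {}

  // Call on a timer tick; returns true when the app should escape the loop
  // (e.g. emit a FRAG_GAP-style error and startLoad(seekPos) past the hole).
  tick(bufferedEnd: number, loading: boolean, nowMs: number): boolean {
    if (!loading || bufferedEnd > this.lastBufferedEnd) {
      // Buffer advanced (or we are idle): reset the stall clock.
      this.lastBufferedEnd = Math.max(this.lastBufferedEnd, bufferedEnd);
      this.stalledSince = null;
      return false;
    }
    if (this.stalledSince === null) {
      this.stalledSince = nowMs;
      return false;
    }
    return nowMs - this.stalledSince >= this.stallTimeoutMs;
  }
}
```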

Related issues / PRs

  • PR #7797: Prevent loop-loading of segments with dropped appends

Log

Full 8,285-line repro log (startup 22:08:02 → first clean recovery 22:08:08–13 → second broken recovery 22:08:25–51 → live-edge safety-net reset) is not attached here because it contains internal URLs and tokens. Sent privately to @robwalch via Slack DM; happy to share a redacted version on request.
