initPTS ping-pong prevents recovery on repeated dropped-append after PR #7797 #7811

@hongjun-bae

Description

What version of Hls.js are you using?

v1.6.13 with PR #7797 (commit e97638fec) cherry-picked on top of v1.6.0.

What browser (including version) are you using?

Chrome 147.0.0.0

What OS (including version) are you using?

macOS (Apple Silicon)

Test stream

Live LL-HLS fMP4/CMAF stream (~1s parts, ~2s segments, single muxed audio+video track, avc1.64002A + mp4a.40.2, PTS timescale 48000). Not publicly shareable. Corruption was injected via Proxyman: a single live segment/part in the middle of playback was rewritten so that its moof/mdat contained no decodable samples (the same technique used to reproduce the original PR #7797 bug).

Configuration

{
  "autoStartLoad": true,
  "startFragPrefetch": true,
  "lowLatencyMode": true,
  "manifestLoadingMaxRetry": 3,
  "levelLoadingMaxRetry": 1,
  "fragLoadingMaxRetry": 3,
  "maxMaxBufferLength": 60
}

Steps to reproduce

  1. Start playing the live LL-HLS stream. Playback begins normally.
  2. With Proxyman, inject a corrupted .m4v part (empty moof/mdat) at some live position A. PR #7797 ("Prevent loop-loading of segments with dropped appends") kicks in: Nothing buffered for … is logged, the fragment is marked as a gap, gap-controller's skipping hole advances currentTime past A, and playback resumes after ~5s of stall. The first recovery works.
  3. Once playback is stable again, inject a second corrupted part at position B (≥ ~15s after A).
  4. Observe Network tab and console: after the second injection, hls.js does not mark the fragment as a gap quickly. Instead it enters a passthrough-remuxer Timestamps at … != initPTS … ping-pong loop that oscillates between two initPTS values, repeatedly rewrites SourceBuffer.timestampOffset, and never expands the buffered TimeRange. Playback is stuck for ~20s until stream-controller's "too far from the end of live sliding playlist" safety net kicks in and force-seeks to the live edge.

Repro rate: ~100% on the second injection once initPTS has already been set for the session. The "too far from live edge" safety net exists only for live playback, so on VOD the player would remain permanently stuck; we have not verified a VOD repro, but the only recovery path we observed is live-only.

Expected behaviour

After PR #7797 landed, any dropped-append segment — regardless of whether it's the first or the second in a session — should be detected as an empty append and marked as a gap within 1–2 retries, so that gap-controller can skip it. Recovery latency should be comparable to the first injection (~5s), not ~20s, and it should not depend on the live-edge safety net.

What actually happened?

The first corruption recovers cleanly via PR #7797 plus gap-controller's skipping hole. The second corruption triggers a different failure path: passthrough-remuxer's isInvalidInitPts incorrectly resets initPTS based on the tfdt of the corrupted segment, which permanently poisons fragment.startPTS (it becomes a large negative number); every subsequent iteration then toggles initPTS between the "correct" and the "poisoned" value. PR #7797's Nothing buffered detection still fires, but only after several retries, and the addAsGap marking on the corrupted segment does not help because the next (healthy) segment also fails to append: it inherits the poisoned timestampOffset.

Timeline from the log (attached as 7797-to-be-repro.log)

Startup (22:08:02–03):

22:08:02.719 [level-controller]: manifest loaded, 5 level(s), 1080p SDR avc1,mp4a @8384000
22:08:03.020 [passthrough-remuxer]: Found initPTS at playlist time: 26.008671 offset: 15637.974974833334 (750622798.792/48000) trackId: 2

initPTS = 750622798.792/48000 ≈ 15637.97s. This is the healthy anchor for the rest of the session.

First corruption — recovers via PR #7797 (22:08:08–13):

22:08:08.417 [gap-controller]: Playback stalling at @32.945342 due to low buffer (len 0.08, buffer:[26.008-33.027])
22:08:13.517 [gap-controller]: skipping hole, adjusting currentTime from 32.945342 to 34.058671
22:08:13.618 [gap-controller]: playback not stuck anymore @34.072998, after 5291ms

Total recovery: ~5.3s. Nothing buffered for … log lines confirm PR #7797 fired; addAsGap marked the corrupt fragment, _loadFragForPlayback returned true, gap-controller skipped the hole.

Second corruption — stuck for ~20s (22:08:25–51):

22:08:25.366 [passthrough-remuxer]: No audio or video samples found for initPTS at playlist time: 48.016008
22:08:27.517 [gap-controller]: Playback stalling at @47.944303 due to low buffer
              (buffer:[32.057-33.027][34.009-48.025])
22:08:30.370 [passthrough-remuxer]: No audio or video samples found for initPTS at playlist time: 49.033008

22:08:31.341 [buffer-controller]: Nothing buffered for main level: 4 sn: 7842 retries 1
22:08:31.557 [buffer-controller]: Nothing buffered for main level: 4 sn: 7842 retries 3
22:08:31.698 [passthrough-remuxer]: Timestamps at playlist time: -15589.958966833334 31275.958612666665
              != initPTS: 15637.974974833334 (750622798.792/48000) trackId: 2
22:08:31.812 [passthrough-remuxer]: Found initPTS at playlist time: 52.014004333332196
              offset: 15637.974995666667 (750622799.792/48000) trackId: 2
22:08:31.862 [passthrough-remuxer]: Timestamps at playlist time: -15589.958966833334 31275.958612666665
              != initPTS: 15637.974995666667 (750622799.792/48000)
22:08:31.912 [passthrough-remuxer]: Timestamps at playlist time: 52.014004333332196 15637.974995666667
              != initPTS: 31275.958612666665 (1501246013.408/48000)
22:08:31.915 [buffer-controller]: Nothing buffered for main level: 4 sn: 7843 retries 2

Note three things:

  1. initPTS oscillates between two values that are almost exactly 2× apart: 750622798.792/48000 (≈ 15637.97s) vs 1501246013.408/48000 (≈ 31275.96s); 1501246013.408 = 750622798.792 + 750623214.616 ≈ 2× the original. This pattern is identical to the PR #7797 reproduction: the corrupted segment carries a tfdt that is far enough from the live initPTS to trip isInvalidInitPts's Math.abs(startTime - timeOffset) > minDuration guard. isInvalidInitPts then resets initPTS, which makes the next healthy segment look invalid relative to the new anchor, which resets it back, ad infinitum.
  2. fragment.startPTS is poisoned with a negative value: after the first bad remux of sn 7841, we see Parsed main sn: 7841 of level 4 (frag:[-15589.959-50.009]). On the next iteration stream-controller re-loads sn 7841 with timeOffset = -15589.959 which is the very input that lets isInvalidInitPts continue firing. The loop is self-sustaining.
  3. PR #7797's addAsGap does fire for the corrupt segment (Nothing buffered for … sn: 7842), but sn 7843, which is not corrupt, is also reported as Nothing buffered because its append uses the wrong timestampOffset set by the ping-pong, so the SourceBuffer silently drops its samples. The gap marking cannot propagate forward: only the directly-empty fragment gets gap=true.
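
The 2× relationship in point 1 can be checked directly from the logged rationals (a TypeScript sketch; the numbers are copied from the log lines above):

```typescript
// Values copied verbatim from the passthrough-remuxer log lines above.
const TIMESCALE = 48000;
const healthyBase = 750622798.792;   // "Found initPTS ... (750622798.792/48000)"
const poisonedBase = 1501246013.408; // "!= initPTS: ... (1501246013.408/48000)"

const healthySec = healthyBase / TIMESCALE;   // ≈ 15637.9750 s
const poisonedSec = poisonedBase / TIMESCALE; // ≈ 31275.9586 s

// The poisoned anchor is the healthy offset applied twice (within ~10 ms):
const ratio = poisonedBase / healthyBase;
console.log(healthySec.toFixed(4), poisonedSec.toFixed(4), ratio.toFixed(6));
```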

Final recovery via live-edge safety net (22:08:51):

22:08:51.356 [stream-controller]: Playback: 47.944 is located too far from the end of live sliding playlist:
              80.028, reset currentTime to : 74.979
22:08:51.362 [stream-controller]: Media seeking to 74.979, state: FRAG_LOADING, out of buffer
22:08:51.665 [stream-controller]: Resetting level fragment error count of 4 on frag buffered
22:08:51.868 [stream-controller]: Parsed main sn: 7855 part: 1 of level 4 (part:[77.044-78.027]INDEPENDENT=NO)

Recovery is only via the live-edge reset, ~20s after the corruption. VOD has no such net.

Root cause analysis

The regression introduced by the second injection is in src/remux/passthrough-remuxer.ts:

// passthrough-remuxer.ts L241-258
if (
  (accurateTimeOffset || !initPTS) &&
  (isInvalidInitPts(initPTS, decodeTime, timeOffset, duration) ||
    timescale !== initPTS.timescale)
) {
  if (initPTS) {
    this.warn(`Timestamps at playlist time: …`);
  }
  initPTS = null;                 // ← unconditional reset
  initSegment.initPTS = baseTime; //   even when the "new" PTS is
  initSegment.timescale = timescale; //   obviously wrong (e.g. 2× the original)
  initSegment.trackId = trackId;
}

isInvalidInitPts uses a symmetric Math.abs(startTime - timeOffset) > minDuration test. It cannot distinguish between:

  • a legitimate initPTS-rollover (long playback / 33-bit PTS wrap)
  • a spurious tfdt in a corrupt segment whose timeOffset is itself derived from the previous iteration's poisoned fragment.startPTS

Combined with the fact that stream-controller dutifully persists the remuxer-reported startPTS = decodeTime - initPTS.baseTime/initPTS.timescale back into fragment.startPTS (stream-controller.ts L1367, frag.setElementaryStreamInfo), a single bad segment can poison the timeline for every subsequent iteration.
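
For reference, the symmetric guard can be paraphrased in isolation (a simplified sketch with a hypothetical signature, not the actual hls.js source) to show why a corrupt 2× tfdt and a later healthy segment each look "invalid" relative to the other's anchor:

```typescript
// Simplified paraphrase of the symmetric guard (hypothetical signature).
interface RationalTimestamp {
  baseTime: number;  // in timescale units
  timescale: number;
}

function isInvalidInitPts(
  initPTS: RationalTimestamp | null,
  decodeTime: number, // tfdt baseMediaDecodeTime of the incoming segment (timescale units)
  timeOffset: number, // expected playlist time, derived from fragment.startPTS (seconds)
  duration: number,   // segment duration (seconds)
): boolean {
  if (initPTS === null) return true;
  // Project the segment onto the playlist timeline using the current anchor.
  const startTime = (decodeTime - initPTS.baseTime) / initPTS.timescale;
  // Symmetric test: fires for BOTH a stale anchor and a bogus tfdt.
  return Math.abs(startTime - timeOffset) > Math.max(duration, 1);
}

// Feedback loop: a corrupt tfdt of ~2x the anchor projects to startTime ≈ +15638 s,
// trips the guard, resets the anchor, and the next healthy segment trips it back.
const healthy = { baseTime: 750622798.792, timescale: 48000 };
const corruptTfdt = 2 * 750622798.792; // the ~2x signature from the log
console.log(isInvalidInitPts(healthy, corruptTfdt, 48.016, 2)); // guard fires
```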

PR #7797 closes the "dropped-append" side of the problem (FragmentTracker.detectPartialFragments → addAsGap), but does nothing to defend the initPTS/startPTS anchor itself. Once the anchor is poisoned, healthy segments that immediately follow look corrupt, and addAsGap only tags the directly-empty one. stream-controller then oscillates between the two (Loading main sn: 7843 of level 4 (frag:[52.014-54.019]) ↔ Loading main sn: 7841 of level 4 (frag:[-15589.959-50.009])) until the live-edge safety net force-seeks.

Suggested fix

Three directions (the third optional), independently or combined:

1. Make isInvalidInitPts asymmetric and cc-bounded.

Reject an incoming candidate when it would move the anchor by more than a few multiples of frag.duration and the new candidate is suspiciously close to an integer multiple of the existing one (Math.abs(newBase - k·oldBase) < halfSegmentDuration for small k), which is the tfdt-rollover / parallel-encoder signature shown above. Under cc continuity, tfdt should drift by at most a few segment durations between remuxes — anything larger is a corruption signal, not a legitimate anchor change.
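
A minimal sketch of such a guard (hypothetical helper name and thresholds, not an existing hls.js API):

```typescript
// Hypothetical sketch of fix 1: reject an initPTS candidate that jumps far
// AND lands near an integer multiple of the current anchor -- the
// tfdt-rollover / corrupt-segment signature described above.
function isSuspiciousInitPtsCandidate(
  oldBase: number,      // current initPTS.baseTime (timescale units)
  newBase: number,      // candidate tfdt baseMediaDecodeTime (timescale units)
  timescale: number,
  fragDuration: number, // seconds
  maxDriftFrags = 3,    // allow a few fragment durations of legitimate drift
): boolean {
  const driftSec = Math.abs(newBase - oldBase) / timescale;
  if (driftSec <= maxDriftFrags * fragDuration) {
    return false; // small drift: legitimate anchor refresh
  }
  // Large jump: only suspicious if it sits near k * oldBase for small k.
  const halfFragUnits = (fragDuration / 2) * timescale;
  for (let k = 2; k <= 4; k++) {
    if (Math.abs(newBase - k * oldBase) < halfFragUnits) {
      return true; // corruption signature: keep the existing anchor
    }
  }
  return false; // large jump but not a multiple: could be a real discontinuity
}
```

With the values from the log (oldBase 750622798.792, candidate 1501246013.408, timescale 48000, ~2s fragments), the candidate sits 415.824 units (≈ 8.7 ms) from 2× the anchor, well inside half a fragment, so it would be rejected.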

2. Defend fragment.startPTS in stream-controller.

In _bufferFragmentData / updateLevelTiming, if the newly reported startPTS disagrees with the playlist-declared fragment.start by more than a few segment durations, do not write it back — instead keep the playlist timing, mark the fragment partial, and let FragmentTracker decide whether to addAsGap. A remux result that produces frag:[-15589.959-50.009] for a playlist fragment at [52-54] is never right and should not propagate.
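
A sketch of the write-back guard (hypothetical shape; the real stream-controller types and call sites differ):

```typescript
// Hypothetical sketch of fix 2: refuse to persist a remuxed startPTS that
// contradicts the playlist-declared fragment timing.
interface FragTiming {
  start: number;    // playlist-declared start (seconds)
  duration: number; // playlist-declared duration (seconds)
}

function sanitizeStartPTS(
  frag: FragTiming,
  reportedStartPTS: number,
  maxDriftFrags = 3,
): { startPTS: number; partial: boolean } {
  const drift = Math.abs(reportedStartPTS - frag.start);
  if (drift > maxDriftFrags * frag.duration) {
    // frag:[-15589.959-...] for a playlist fragment at [52-54] is never right:
    // keep the playlist timing and flag the fragment as partial so the
    // fragment tracker can decide whether to addAsGap.
    return { startPTS: frag.start, partial: true };
  }
  return { startPTS: reportedStartPTS, partial: false };
}
```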

3. (Optional) Extend the "too far from live edge" safety net to stalls in VOD.

Today this logic is live-only. A VOD client experiencing the same ping-pong loop has no escape hatch. Consider generalising it to "buffered range has not advanced for N seconds while stream-controller is in IDLE/FRAG_LOADING" and emit a FRAG_GAP-style error that allows the app to startLoad(seekPos) its way out.
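
One possible shape for such a watchdog (a hypothetical class for illustration, not an hls.js API):

```typescript
// Hypothetical sketch of fix 3: a VOD-safe stall watchdog that fires when the
// buffered end has not advanced for N ms while fragments keep being loaded.
class BufferProgressWatchdog {
  private lastBufferedEnd = -1;
  private stalledSince: number | null = null;

  constructor(private stallTimeoutMs = 10000) {}

  // Call on a timer tick; returns true when the app should escape the loop
  // (e.g. emit a FRAG_GAP-style error and startLoad(seekPos) past the hole).
  tick(bufferedEnd: number, loading: boolean, nowMs: number): boolean {
    if (!loading || bufferedEnd > this.lastBufferedEnd) {
      // Buffer advanced (or we are idle): reset the stall clock.
      this.lastBufferedEnd = Math.max(this.lastBufferedEnd, bufferedEnd);
      this.stalledSince = null;
      return false;
    }
    if (this.stalledSince === null) {
      this.stalledSince = nowMs;
      return false;
    }
    return nowMs - this.stalledSince >= this.stallTimeoutMs;
  }
}
```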

Related issues / PRs

  • PR #7797: Prevent loop-loading of segments with dropped appends

Log

Full 8,285-line repro log (startup 22:08:02 → first clean recovery 22:08:08–13 → second broken recovery 22:08:25–51 → live-edge safety-net reset) is not attached here because it contains internal URLs and tokens. Sent privately to @robwalch via Slack DM; happy to share a redacted version on request.
