fix(mediorum): skip legacy transient retry ops #339

Open
RolfAris wants to merge 3 commits into OpenAudio:main from RolfAris:fix/mediorum-suppress-transient-retry-ops-upstream

RolfAris commented Jun 4, 2026

Contributor

Problem

Legacy peers can still send repeated transcode retry updates for bad uploads.

An op like this carries no useful replay history once every row in it matches all of the following:

| Field | Value |
| --- | --- |
| table | uploads |
| action | update |
| status | busy or error |
| error_count | > 5 |
| 320 result | missing |

Persisting those remote retry snapshots makes ops grow quickly while adding no durable bootstrap value.

Change

Apply those remote retry ops to current state, but do not persist them into local ops.

The predicate is intentionally narrow:

| Case | Persist? |
| --- | --- |
| local host op | yes |
| explicit Transient op | no |
| remote retry spam over limit, no 320 | no |
| error_count == 5 boundary | yes |
| done/result op | yes |
| mixed batch | yes |
| malformed/unexpected payload | yes |
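To make the cases above concrete, here is a minimal sketch of such a predicate. It is not the code in this PR: the `Op` and `Row` types, their field names, `shouldPersist`, and the `selfHost` parameter are all hypothetical, and checking `Transient` before the local-host case is an assumption about ordering that the table does not settle.

```go
package main

import "fmt"

// retryLimit mirrors the "error_count > 5" threshold from the description.
const retryLimit = 5

// Row is a hypothetical, simplified view of one upload row inside an op.
type Row struct {
	Status     string
	ErrorCount int
	Has320     bool // true once a 320 transcode result exists
}

// Op is a hypothetical stand-in for a replicated op; the real type differs.
type Op struct {
	Host      string
	Table     string
	Action    string
	Transient bool
	Rows      []Row // empty when the payload did not decode as upload rows
}

// shouldPersist reports whether an op should be written to the local op log.
// Anything that does not clearly match the retry-spam shape is persisted.
func shouldPersist(op Op, selfHost string) bool {
	if op.Transient {
		return false // explicitly transient ops are never persisted (ordering assumed)
	}
	if op.Host == selfHost {
		return true // local ops are always persisted
	}
	if op.Table != "uploads" || op.Action != "update" || len(op.Rows) == 0 {
		return true // other tables/actions and malformed payloads stay in the log
	}
	for _, r := range op.Rows {
		retrying := r.Status == "busy" || r.Status == "error"
		if !retrying || r.ErrorCount <= retryLimit || r.Has320 {
			return true // one non-matching row (mixed batch) keeps the whole op
		}
	}
	return false // every row is an over-limit retry snapshot with no 320
}

func main() {
	spam := Op{Host: "peer", Table: "uploads", Action: "update",
		Rows: []Row{{Status: "error", ErrorCount: 9}}}
	boundary := spam
	boundary.Rows = []Row{{Status: "error", ErrorCount: 5}}
	// prints: false true
	fmt.Println(shouldPersist(spam, "self"), shouldPersist(boundary, "self"))
}
```

Defaulting to "persist" on every early exit is what keeps the predicate narrow: only an op whose rows all match the retry shape is dropped from the log.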

Evidence

20 clean hourly canary samples, all ok, from 2026-06-03T18:15:21Z through 2026-06-04T13:15:23Z:

| Metric | Result |
| --- | --- |
| avg byte reduction | 96.64% |
| worst byte reduction | 91.24% |
| avg row reduction | 83.10% |
| worst row reduction | 74.01% |

Latest 1h sample:

| Node | Role | rows | bytes |
| --- | --- | --- | --- |
| val005 | treatment: source + receiver fix | 378 | 563,409 |
| val008 | control: source-only | 4,109 | 35,699,538 |

Last-hour classifier:

| Metric | val005 treatment | val008 control |
| --- | --- | --- |
| persisted upload-update rows | 390 | 4,278 |
| suppressible retry rows | 0 | 3,888 |
| persisted JSON bytes | 665,580 | 418,588,008 |

For the current top 10 toxic uploads, val005 treatment persisted 0 retry rows while val008 control persisted 48-74 rows per upload. Current upload state still replicated on both nodes: status=error, no 320, audio analysis not queued.

Tests

go test ./pkg/mediorum/crudr -count=1

RolfAris force-pushed the fix/mediorum-suppress-transient-retry-ops-upstream branch from 82caf5a to 7b92d0b on June 9, 2026 19:47

RolfAris commented Jun 9, 2026

Contributor Author

Re-validated against current main (now includes #347, async legacy-upload replication). Two paired StoreAll nodes on the same base, differing only by this change, ran for 3h06m:

| since cutover | uploads/update ops | suppressible rows | bytes |
| --- | --- | --- | --- |
| with this change | 1,087 | 0 | 1.48 MB |
| control | 19,113 | 17,964 | 158.76 MB |

−99.1% op-log byte growth; 0 suppressible rows persisted vs 17,964 on the control (~1.2 GB/day/node of superseded retry snapshots). This is higher than the earlier 96.64% because the control here carries no source-side cap.
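For anyone checking the arithmetic, the sketch below recomputes both figures from the byte totals and the 3h06m window. The helper names are hypothetical, and treating the daily figure as the control-minus-treatment difference extrapolated linearly to 24 hours is an assumption about how it was derived.

```go
package main

import "fmt"

// reductionPct returns how much smaller treatment is than control, in percent.
func reductionPct(treatment, control float64) float64 {
	return (1 - treatment/control) * 100
}

// perDay linearly extrapolates an amount observed over `hours` to 24 hours.
func perDay(amount, hours float64) float64 {
	return amount / hours * 24
}

func main() {
	const treatmentMB, controlMB = 1.48, 158.76 // bytes column of the table
	const windowHours = 3 + 6.0/60              // 3h06m observation window

	fmt.Printf("byte reduction: %.1f%%\n", reductionPct(treatmentMB, controlMB))
	fmt.Printf("avoided growth: %.0f MB/day/node\n",
		perDay(controlMB-treatmentMB, windowHours))
}
```

Under those assumptions the reduction rounds to 99.1% and the avoided growth lands a little over 1,200 MB per day, consistent with the ~1.2 GB/day/node quoted above.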

#347 doesn't touch the suppressed path: it changes upload intake, while this change gates persistence in ApplyOp. A suppressed op is still applied to current state and ApplyOp returns success, so the sweep cursor advances past it and the peer never re-sends it. Both nodes held identical legitimate-op volume and stayed in consensus across the window. Rebased onto current main; go test ./pkg/mediorum/crudr is green.
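The apply-but-don't-persist behaviour can be illustrated with a toy model. This is only a sketch: `Node`, `applyOp`, `sweep`, and the precomputed `Suppress` flag are hypothetical simplifications, not the real ApplyOp signature or the actual sweep loop.

```go
package main

import "fmt"

// Op is a hypothetical, minimal op: one upload ID and its new status.
type Op struct {
	UploadID string
	Status   string
	Suppress bool // outcome of the persistence predicate, assumed precomputed
}

// Node holds current upload state plus the persisted op log.
type Node struct {
	State map[string]string
	Log   []Op
}

// applyOp is a simplified stand-in for ApplyOp, not its real signature.
func (n *Node) applyOp(op Op) error {
	n.State[op.UploadID] = op.Status // current state is always updated
	if !op.Suppress {
		n.Log = append(n.Log, op) // only ops worth replaying are persisted
	}
	return nil // success either way, so the caller can advance its cursor
}

// sweep applies peer ops starting at cursor and returns the new cursor.
func sweep(n *Node, peerOps []Op, cursor int) int {
	for cursor < len(peerOps) {
		if err := n.applyOp(peerOps[cursor]); err != nil {
			break // stop on failure so the op is retried on the next sweep
		}
		cursor++
	}
	return cursor
}

func main() {
	n := &Node{State: map[string]string{}}
	peer := []Op{
		{UploadID: "a", Status: "error", Suppress: true},
		{UploadID: "b", Status: "done"},
	}
	cursor := sweep(n, peer, 0)
	// prints: 2 1 error
	fmt.Println(cursor, len(n.Log), n.State["a"])
}
```

Because the suppressed op still returns success, the cursor ends past both ops while the log holds only the one worth replaying, and the suppressed upload's status is still reflected in current state.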
