Problem
Legacy peers can still send repeated transcode retry updates for bad uploads.
Those rows add no useful replay history once every row in the op matches all of the following:
| Field | Value |
| --- | --- |
| table | `uploads` |
| action | `update` |
| status | `busy` or `error` |
| error_count | `> 5` |
| `320` result | missing |
Persisting those remote retry snapshots makes `ops` grow quickly while adding no durable bootstrap value.
Change
Apply those remote retry ops to current state, but do not persist them into local `ops`.
The predicate is intentionally narrow:
| Case | Persist? |
| --- | --- |
| local host op | yes |
| explicit `Transient` op | no |
| remote retry spam over limit, no `320` | no |
| `error_count == 5` boundary | yes |
| done/result op | yes |
| mixed batch | yes |
| malformed/unexpected payload | yes |
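The table above can be read as a fail-closed predicate: an op is dropped from the log only when every row positively matches the retry-spam shape. The following is a minimal sketch of that reading, not the code in this PR. The `Row` and `Op` types, the `Malformed` flag, the `retryLimit` constant, and both function names are hypothetical stand-ins, since the real crudr types are not shown here.

```go
package main

import "fmt"

// Row is a hypothetical, simplified view of one uploads row carried by an op.
type Row struct {
	Status     string
	ErrorCount int
	Has320     bool // true when a 320 result is present
}

// Op is a hypothetical op envelope used only for this sketch.
type Op struct {
	Host      string
	Table     string
	Action    string
	Transient bool
	Malformed bool // payload could not be decoded as expected
	Rows      []Row
}

// retryLimit is assumed from the "error_count > 5" condition above.
const retryLimit = 5

// isSuppressibleRetry reports whether a remote op consists only of
// over-limit retry snapshots. Anything unexpected returns false (persist).
func isSuppressibleRetry(op Op, selfHost string) bool {
	if op.Host == selfHost || op.Malformed {
		return false
	}
	if op.Table != "uploads" || op.Action != "update" || len(op.Rows) == 0 {
		return false
	}
	for _, r := range op.Rows {
		busyOrError := r.Status == "busy" || r.Status == "error"
		if !busyOrError || r.ErrorCount <= retryLimit || r.Has320 {
			return false // one non-matching row keeps the whole batch
		}
	}
	return true
}

// shouldPersist mirrors the Case / Persist? table.
func shouldPersist(op Op, selfHost string) bool {
	if op.Transient {
		return false
	}
	return !isSuppressibleRetry(op, selfHost)
}

func main() {
	boundary := Op{Host: "peer", Table: "uploads", Action: "update",
		Rows: []Row{{Status: "error", ErrorCount: 5}}}
	spam := Op{Host: "peer", Table: "uploads", Action: "update",
		Rows: []Row{{Status: "busy", ErrorCount: 9}}}
	fmt.Println(shouldPersist(boundary, "self"), shouldPersist(spam, "self"))
}
```

Requiring every row to match is what makes the mixed-batch and boundary cases persist: a single row at `error_count == 5`, or a single row with a `320` result, keeps the whole op in the log.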
Evidence
20 clean hourly canary samples, all ok, from 2026-06-03T18:15:21Z through 2026-06-04T13:15:23Z:
| Metric | Result |
| --- | --- |
| avg byte reduction | 96.64% |
| worst byte reduction | 91.24% |
| avg row reduction | 83.10% |
| worst row reduction | 74.01% |
Latest 1h sample:
| Node | Role | rows | bytes |
| --- | --- | --- | --- |
| val005 | treatment: source + receiver fix | 378 | 563,409 |
| val008 | control: source-only | 4,109 | 35,699,538 |
Last-hour classifier:
| Metric | val005 treatment | val008 control |
| --- | --- | --- |
| persisted upload-update rows | 390 | 4,278 |
| suppressible retry rows | 0 | 3,888 |
| persisted JSON bytes | 665,580 | 418,588,008 |
For the current top 10 toxic uploads, val005 treatment persisted `0` retry rows while val008 control persisted `48-74` rows per upload. Current upload state still replicated on both nodes: `status=error`, no `320`, audio analysis not queued.
Re-validated against current main (now includes #347, async legacy-upload replication). Two paired StoreAll nodes, same base differing only by this change, 3h06m:
| since cutover | uploads/update ops | suppressible rows | bytes |
| --- | --- | --- | --- |
| with this change | 1,087 | 0 | 1.48 MB |
| control | 19,113 | 17,964 | 158.76 MB |
Op-log byte growth dropped by 99.1%, and 0 suppressible rows were persisted versus 17,964 on the control (~1.2 GB/day/node of superseded retry snapshots). The reduction is higher than the earlier 96.64% because the control here carries no source-side cap.
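As a sanity check, the headline figures can be recomputed from the 3h06m table. This sketch assumes the byte gap between the two nodes is entirely retry snapshots and that 1 GB = 1000 MB; the function names are illustrative only.

```go
package main

import "fmt"

// byteReduction returns the fractional drop in bytes relative to control.
func byteReduction(treatmentMB, controlMB float64) float64 {
	return 1 - treatmentMB/controlMB
}

// dailyGB extrapolates a byte delta observed over windowHours to GB per day.
// Assumes a constant rate and 1 GB = 1000 MB.
func dailyGB(deltaMB, windowHours float64) float64 {
	return deltaMB / windowHours * 24 / 1000
}

func main() {
	const treatmentMB, controlMB = 1.48, 158.76
	windowHours := 3 + 6.0/60 // 3h06m

	fmt.Printf("byte reduction: %.1f%%\n", 100*byteReduction(treatmentMB, controlMB))
	fmt.Printf("extrapolated: %.1f GB/day\n", dailyGB(controlMB-treatmentMB, windowHours))
}
```

Under those assumptions the output agrees with the 99.1% and ~1.2 GB/day/node figures quoted above.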
#347 doesn't touch the suppressed path: it changes upload intake, while this change gates persistence in ApplyOp. A suppressed op is still applied to current state and ApplyOp returns success, so the sweep cursor advances past it and the peer never re-sends. Both nodes held identical legitimate-op volume and stayed in consensus across the window. Rebased onto current main; go test ./pkg/mediorum/crudr is green.
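The interaction between suppression and the sweep cursor can be illustrated with a toy model. This is a sketch under stated assumptions, not the crudr implementation: the `op` and `node` types, the `suppress` flag (standing in for the retry-spam predicate), and the `sweep` helper are all hypothetical.

```go
package main

import "fmt"

// op is a hypothetical, simplified op; suppress stands in for the
// retry-spam predicate.
type op struct {
	uploadID string
	status   string
	suppress bool
}

// node holds current upload state plus the persisted op log.
type node struct {
	state map[string]string
	ops   []op
}

func newNode() *node { return &node{state: map[string]string{}} }

// applyOp always updates current state and appends to the log only when
// the op is not suppressed. It returns nil in both cases, so callers treat
// a suppressed op as handled.
func (n *node) applyOp(o op) error {
	n.state[o.uploadID] = o.status
	if o.suppress {
		return nil
	}
	n.ops = append(n.ops, o)
	return nil
}

// sweep applies peer ops starting at cursor and returns the new cursor.
// The cursor advances only past ops that applied successfully.
func (n *node) sweep(peerOps []op, cursor int) int {
	for cursor < len(peerOps) {
		if err := n.applyOp(peerOps[cursor]); err != nil {
			break
		}
		cursor++
	}
	return cursor
}

func main() {
	n := newNode()
	peer := []op{
		{uploadID: "u1", status: "error", suppress: true},
		{uploadID: "u2", status: "done"},
	}
	cursor := n.sweep(peer, 0)
	fmt.Println(cursor, len(n.ops), n.state["u1"]) // prints "2 1 error"
}
```

Returning success for a suppressed op is the design point: if it returned an error instead, the cursor in this model would stall on the first retry snapshot and the same op would be fetched again on every sweep.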
Tests
go test ./pkg/mediorum/crudr -count=1