erasure_code: branch-free gf_mul via sentinel log + zeroed antilog #421
Open
scopedog wants to merge 1 commit into
Replace the default (non-GF_LARGE_TABLES) gf_mul with a single
branch-free table lookup:
    return gff_base[gflog_base[a] + gflog_base[b]];
Two table changes make this correct without any branch:
- gflog_base is widened to uint16_t and gflog_base[0] is set to a
sentinel (511) outside the normal log range [0,254].
- gff_base (antilog) holds two full periods so the index never needs a
reduction wrap, followed by a zeroed tail. A zero operand sends the
index into the zeroed tail, so the product is 0 without an
(a == 0 || b == 0) test.
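The two table changes can be sketched as follows. This is a minimal illustration, not the patch itself: the `sketch_*` names are hypothetical, the tables are generated at startup from generator 2 under polynomial 0x11D rather than stored as constants, and `sketch_mul_branchy` only approximates the guarded lookup described above so the two forms can be compared.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the patch itself uses gflog_base / gff_base. */
static uint16_t sketch_log[256];      /* log table, entry 0 holds the sentinel */
static uint8_t  sketch_antilog[1023]; /* 0..509: two periods, 510..1022: zeros */

enum { SKETCH_SENTINEL = 511 };

/* Build both tables from generator 2 and polynomial 0x11D. */
static void sketch_init(void)
{
    unsigned x = 1;
    for (unsigned i = 0; i < 255; i++) {
        sketch_antilog[i] = (uint8_t)x;
        sketch_antilog[i + 255] = (uint8_t)x; /* second period */
        sketch_log[x] = (uint16_t)i;
        x <<= 1;                              /* multiply by 2 ... */
        if (x & 0x100)
            x ^= 0x11D;                       /* ... and reduce */
    }
    assert(x == 1);                           /* the generator has order 255 */
    for (unsigned i = 510; i < 1023; i++)
        sketch_antilog[i] = 0;                /* zeroed tail */
    sketch_log[0] = SKETCH_SENTINEL;
}

/* Branch-free form: a zero operand pushes the index to >= 511. */
static uint8_t sketch_mul(uint8_t a, uint8_t b)
{
    return sketch_antilog[sketch_log[a] + sketch_log[b]];
}

/* Guarded form, for comparison: zero test plus reduction wrap. */
static uint8_t sketch_mul_branchy(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0)
        return 0;
    int i = sketch_log[a] + sketch_log[b];
    return sketch_antilog[i > 254 ? i - 255 : i];
}
```

After one call to `sketch_init()`, `sketch_mul(0, 7)` indexes at 511 plus the log of 7, which is inside the zeroed tail, so it returns 0 without any test. `sketch_mul(2, 0x80)` returns 0x1D, the reduction of x^8 under 0x11D.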
This removes both the zero-guard and the reduction branch from the
hottest scalar multiply, which feeds the scalar dot-product/encode
fallback paths and matrix construction (gf_gen_rs_matrix,
gf_invert_matrix). The polynomial is unchanged (0x11D), so results are
bit-identical to the previous gf_mul.
Tables grow from 256 B + 256 B to ~1 KB (gff_base) + 512 B (gflog_base),
still far smaller than the 64 KB GF_LARGE_TABLES path, which this makes
largely redundant. gf_inv is unchanged.
Microbenchmark in ns/mul (lower is better), built with -O2
-march=native, taking the minimum of 6 passes, with the
gf_mul(data, coeff) argument order used by the base functions. It
compares this change against the previous default (stock) and the 64 KB
GF_LARGE_TABLES path over a table build and a region multiply (fixed
coefficient, streaming data) on dense and 50%-zero data:
                         build   dense   sparse (50% zero)
Zen 3 (Ryzen 7 5800X)
  stock (default)         0.75    0.72    0.73
  this change             0.36    0.36    0.42
  GF_LARGE_TABLES         0.50    0.42    0.22
Raptor Lake (i5-1340P)
  stock (default)         0.81    1.04    1.30
  this change             0.45    0.54    0.54
  GF_LARGE_TABLES         0.73    0.71    0.71
This change is about 2x faster than the previous default in every case
(1.7x to 2.4x in the table above) and stays consistent across data (no
data-dependent branch). It delivers most of the 64 KB GF_LARGE_TABLES
benefit at ~1/40th the memory: it is faster on Raptor Lake, while on
Zen 3 the 64 KB table is still faster on zero-heavy data (0.22 vs 0.42
ns/mul).
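A timing loop of the kind described (minimum over several passes of a fixed-coefficient region multiply) might look like the sketch below. It is not the harness that produced the numbers above: the `bench_*` names, the LCG data generator, and the use of `clock()` are all assumptions made for illustration, and "dense" is taken here to mean bytes that are never zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical names; tables follow the sentinel + zeroed-tail layout. */
static uint16_t bench_log[256];
static uint8_t  bench_antilog[1023];

static void bench_init(void)
{
    unsigned x = 1;
    for (unsigned i = 0; i < 255; i++) {
        bench_antilog[i] = bench_antilog[i + 255] = (uint8_t)x;
        bench_log[x] = (uint16_t)i;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D;
    }
    assert(x == 1);
    for (unsigned i = 510; i < 1023; i++)
        bench_antilog[i] = 0;
    bench_log[0] = 511;
}

static uint8_t bench_mul(uint8_t a, uint8_t b)
{
    return bench_antilog[bench_log[a] + bench_log[b]];
}

/* Fill buf from a simple LCG; sparse != 0 zeroes roughly half the bytes. */
static void bench_fill(uint8_t *buf, size_t n, int sparse)
{
    uint32_t s = 12345u;
    for (size_t i = 0; i < n; i++) {
        s = s * 1664525u + 1013904223u;
        uint8_t v = (uint8_t)((s >> 24) | 1u); /* dense: never zero */
        if (sparse && ((s >> 16) & 1u))
            v = 0;
        buf[i] = v;
    }
}

/* Region multiply with a fixed coefficient; returns the best ns/mul. */
static double bench_region_ns_per_mul(const uint8_t *data, size_t n,
                                      uint8_t coeff, int passes)
{
    volatile uint8_t sink = 0; /* keeps the loop from being optimized away */
    double best = -1.0;
    for (int p = 0; p < passes; p++) {
        uint8_t acc = 0;
        clock_t t0 = clock();
        for (size_t i = 0; i < n; i++)
            acc ^= bench_mul(data[i], coeff); /* gf_mul(data, coeff) order */
        sink = acc;
        clock_t t1 = clock();
        double ns = (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / (double)n;
        if (best < 0.0 || ns < best)
            best = ns;
    }
    (void)sink;
    return best;
}
```

A caller would run `bench_init()`, fill one buffer with `bench_fill(buf, n, 0)` and another with `bench_fill(buf, n, 1)`, and pass each to `bench_region_ns_per_mul(buf, n, coeff, 6)`. `clock()` measures CPU time at coarse resolution, so the buffer needs to be large enough for each pass to take well over a clock tick.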
Technique from the gf-nishida-16 library
(https://github.com/scopedog/gf-nishida-16).
Verified:
- gf_mul bit-identical to a carryless-multiply reference over all
65536 input pairs; gf_inv round-trips for every nonzero input
- erasure_code_test, erasure_code_update_test, gf_inverse_test,
gf_vect_mul_test, gf_vect_dot_prod_test all pass
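The exhaustive check can be sketched as below. This is an illustration under stated assumptions, not the test that was actually run: the `chk_*` names are hypothetical, the reference is a plain shift-and-XOR carryless multiply reduced by 0x11D, `chk_inv` is a log-based stand-in for the library's `gf_inv`, and "round-trips" is interpreted as a * inv(a) == 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; tables follow the sentinel + zeroed-tail layout. */
static uint16_t chk_log[256];
static uint8_t  chk_antilog[1023];

static void chk_init(void)
{
    unsigned x = 1;
    for (unsigned i = 0; i < 255; i++) {
        chk_antilog[i] = chk_antilog[i + 255] = (uint8_t)x;
        chk_log[x] = (uint16_t)i;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D;
    }
    assert(x == 1);
    for (unsigned i = 510; i < 1023; i++)
        chk_antilog[i] = 0;
    chk_log[0] = 511; /* sentinel */
}

/* Branch-free multiply under test. */
static uint8_t chk_mul(uint8_t a, uint8_t b)
{
    return chk_antilog[chk_log[a] + chk_log[b]];
}

/* Independent reference: carryless multiply, then reduce mod 0x11D. */
static uint8_t chk_ref_mul(uint8_t a, uint8_t b)
{
    unsigned acc = 0;
    for (int bit = 0; bit < 8; bit++)
        if (b & (1u << bit))
            acc ^= (unsigned)a << bit;    /* product has up to 15 bits */
    for (int bit = 14; bit >= 8; bit--)
        if (acc & (1u << bit))
            acc ^= 0x11Du << (bit - 8);   /* clear bit, fold in the rest */
    return (uint8_t)acc;
}

/* Stand-in inverse via logs (assumption; not the library's gf_inv). */
static uint8_t chk_inv(uint8_t a)
{
    if (a == 0)
        return 0;
    return chk_antilog[255 - chk_log[a]];
}

/* Count disagreements over all 65536 input pairs. */
static int chk_mul_mismatches(void)
{
    int bad = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (chk_mul((uint8_t)a, (uint8_t)b) !=
                chk_ref_mul((uint8_t)a, (uint8_t)b))
                bad++;
    return bad;
}

/* Count nonzero inputs whose inverse does not multiply back to 1. */
static int chk_inv_failures(void)
{
    int bad = 0;
    for (int a = 1; a < 256; a++)
        if (chk_mul((uint8_t)a, chk_inv((uint8_t)a)) != 1)
            bad++;
    return bad;
}
```

After `chk_init()`, both `chk_mul_mismatches()` and `chk_inv_failures()` should return 0 if the table layout is right. Because the reference never touches the tables, a mistake in the sentinel value or the tail length would show up as a nonzero count.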
Co-Authored-By: Claude Opus 4.8 <[email protected]>
Signed-off-by: Hiroshi Nishida <[email protected]>
Summary
The default (non-GF_LARGE_TABLES) gf_mul() does a log + antilog lookup guarded by two branches: a (a == 0 || b == 0) zero test and a > 254 ? i - 255 reduction wrap. This replaces it with a single, fully branch-free table lookup:
    return gff_base[gflog_base[a] + gflog_base[b]];
The polynomial is unchanged (0x11D), so results are bit-identical to the previous gf_mul(); there is no change to encoded data.
How it stays correct without branches
Two table changes:
- gflog_base is widened to uint16_t, and gflog_base[0] is set to a sentinel 511, outside the normal log range [0, 254].
- gff_base (antilog) holds two full periods, so gflog_base[a] + gflog_base[b] (≤ 508 for nonzero operands) never needs a reduction wrap, and is followed by a zeroed tail.
When either operand is 0, the sentinel pushes the index into the zeroed tail and the result is 0, so no zero-guard is needed. The doubled antilog removes the reduction wrap. Max index is 511 + 511 = 1022.
Scope / compatibility
- The GF_LARGE_TABLES path is untouched.
- gf_inv() is unchanged.
- Tables grow from 256 B + 256 B to ~1 KB (gff_base) + 512 B (gflog_base), still far smaller than the 64 KB GF_LARGE_TABLES table, which this makes largely redundant (see numbers below).
- Only erasure_code/ec_base.{c,h} are touched.
This speeds up the scalar multiply that feeds the base dot-product/encode fallback paths and matrix construction (gf_gen_rs_matrix, gf_invert_matrix); it does not touch the SIMD paths.
Performance
Microbenchmark, ns/mul (lower is better), -O2 -march=native, minimum of 6 passes, with the gf_mul(data, coeff) argument order used by the base functions, over a table build and a region multiply (fixed coefficient, streaming data) on dense and 50%-zero data:
~2× the previous default everywhere (up to ~2.9× on Raptor Lake's sparse case), and consistent across data (no data-dependent branch) where the default zero-guard degrades on zero-heavy input. It delivers most of the 64 KB GF_LARGE_TABLES benefit at ~1/40th the memory: faster on Raptor Lake; on Zen 3 the 64 KB table can still edge it on zero-heavy data.
Testing
- gf_mul checked bit-identical to an independent carryless-multiply reference (mod 0x11D) over all 65 536 input pairs.
- gf_inv round-trips for every nonzero input.
- erasure_code_test, erasure_code_update_test, gf_inverse_test, gf_vect_mul_test, gf_vect_dot_prod_test all pass.
Provenance
The branch-free sentinel + zeroed-tail technique is from the gf-nishida-16
library: https://github.com/scopedog/gf-nishida-16
Commit carries Signed-off-by: per the DCO.