erasure_code: branch-free gf_mul via sentinel log + zeroed antilog #421

Open: scopedog wants to merge 1 commit into intel:master from scopedog:upstream-gfmul-nishida

scopedog commented Jun 3, 2026

Summary

The default (non-GF_LARGE_TABLES) gf_mul() does a log + antilog lookup
guarded by two branches: the zero test (a == 0 || b == 0) and the
reduction wrap (i > 254 ? i - 255 : i) on the summed logs. This replaces
it with a single, fully branch-free table lookup:

return gff_base[gflog_base[a] + gflog_base[b]];

The polynomial is unchanged (0x11D), so results are bit-identical to
the previous gf_mul() — there is no change to encoded data.
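
For comparison, the guarded form described above can be sketched as follows. This is a minimal illustration, not the library source: the names log_tab, exp_tab, init_tables and gf_mul_guarded are hypothetical, and the tables are generated at runtime from the generator 2 under polynomial 0x11D rather than stored as constants.

```c
#include <stdint.h>

/* Hypothetical names for illustration only (not the library's identifiers). */
static uint8_t log_tab[256]; /* log_tab[x] = i with 2^i == x; entry 0 is unused */
static uint8_t exp_tab[255]; /* exp_tab[i] = 2^i in GF(2^8) modulo 0x11D */

/* Build one period of the antilog table and its inverse (the log table). */
static void init_tables(void)
{
    unsigned x = 1;
    for (unsigned i = 0; i < 255; i++) {
        exp_tab[i] = (uint8_t)x;
        log_tab[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D; /* reduce by the field polynomial */
    }
}

/* Log + antilog multiply with the two source-level branches. */
static uint8_t gf_mul_guarded(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0) /* branch 1: zero guard (log of 0 is undefined) */
        return 0;
    int i = log_tab[a] + log_tab[b];       /* 0..508 */
    return exp_tab[i > 254 ? i - 255 : i]; /* branch 2: reduction wrap */
}
```

Call init_tables() once before the first multiply. Whether the compiler emits real branches or conditional moves for the wrap depends on the target and flags; the zero guard is the data-dependent one.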

How it stays correct without branches

Two table changes:

  • gflog_base is widened to uint16_t, and gflog_base[0] is set to a
    sentinel 511, outside the normal log range [0, 254].
  • gff_base (antilog) holds two full periods followed by a zeroed tail.
    Because gflog_base[a] + gflog_base[b] is at most 508 for nonzero
    operands, the index never needs a reduction wrap.

When either operand is 0, the sentinel pushes the index into the zeroed
tail (511 or higher) and the result is 0, so no zero-guard is needed. The
doubled antilog removes the reduction wrap. The maximum index, reached
when both operands are 0, is 511 + 511 = 1022.
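
The table layout above can be sketched as below. This is an assumption-laden illustration rather than the patch itself: log16, antilog, LOG_SENTINEL, ANTILOG_LEN, init_branch_free_tables and gf_mul_branch_free are hypothetical names, the tables are built at runtime instead of being static initializers, and the exact tail length in the PR may differ from the 1023 entries used here.

```c
#include <stdint.h>
#include <string.h>

#define LOG_SENTINEL 511u                    /* sentinel from the description */
#define ANTILOG_LEN (2 * LOG_SENTINEL + 1)   /* max index 1022 -> 1023 entries */

/* Hypothetical names; the PR's tables are gflog_base and gff_base. */
static uint16_t log16[256];
static uint8_t antilog[ANTILOG_LEN];

static void init_branch_free_tables(void)
{
    unsigned x = 1;
    memset(antilog, 0, sizeof antilog); /* everything past index 509 stays 0 */
    for (unsigned i = 0; i < 255; i++) {
        antilog[i] = (uint8_t)x;        /* first period:  indices 0..254 */
        antilog[i + 255] = (uint8_t)x;  /* second period: indices 255..509 */
        log16[x] = (uint16_t)i;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D;                 /* reduce by the field polynomial */
    }
    log16[0] = LOG_SENTINEL;            /* any sum involving 0 is >= 511 */
}

/* Nonzero operands index 0..508; a zero operand indexes 511..1022 (all 0). */
static uint8_t gf_mul_branch_free(uint8_t a, uint8_t b)
{
    return antilog[log16[a] + log16[b]];
}
```

With one zero operand the index is 511 plus a log in [0, 254], so it lands in 511..765; with two zero operands it is 1022. Both ranges fall inside the zeroed tail.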

Scope / compatibility

  • Only the default arm changes; the GF_LARGE_TABLES path is untouched.
  • gf_inv() is unchanged.
  • Public ABI unchanged.
  • Tables grow from 256 B + 256 B to ~1 KB (gff_base) + 512 B
    (gflog_base), still far smaller than the 64 KB GF_LARGE_TABLES
    table, which this makes largely redundant (see numbers below).
  • Only erasure_code/ec_base.{c,h} are touched.

This speeds up the scalar multiply that feeds the base dot-product/encode
fallback paths and matrix construction (gf_gen_rs_matrix,
gf_invert_matrix); it does not touch the SIMD paths.
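
To show where a scalar multiply sits in a dot-product fallback, here is a rough sketch. The function dot_prod_scalar and its signature are hypothetical and are not the library's API. To keep the block self-contained it uses a bitwise shift-and-XOR multiply (gf_mul_bitwise, also hypothetical) as a stand-in for gf_mul, which is a different technique from the table lookup this PR introduces.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in multiply in GF(2^8) modulo 0x11D (shift-and-XOR, not table based). */
static uint8_t gf_mul_bitwise(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        /* multiply a by x, reducing when the top bit falls off */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
        b >>= 1;
    }
    return p;
}

/* dest[i] = XOR over j of src[j][i] * coeff[j], one multiply per source byte.
 * The multiply is called in (data, coeff) order. */
static void dot_prod_scalar(size_t len, size_t k, const uint8_t *coeff,
                            const uint8_t *const *src, uint8_t *dest)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t acc = 0;
        for (size_t j = 0; j < k; j++)
            acc ^= gf_mul_bitwise(src[j][i], coeff[j]);
        dest[i] = acc;
    }
}
```

The inner loop performs len × k multiplies, which is why the per-multiply cost dominates paths of this shape.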

Performance

Microbenchmark, ns/mul (lower is better), -O2 -march=native, minimum of
6 passes, with the gf_mul(data, coeff) argument order used by the base
functions, over a table build and a region multiply (fixed coefficient,
streaming data) on dense and 50%-zero data:

                          build   dense   sparse(50% zero)
  Zen 3 (Ryzen 7 5800X)
    stock (default)       0.75    0.72    0.73
    this change           0.36    0.36    0.42
    GF_LARGE_TABLES       0.50    0.42    0.22
  Raptor Lake (i5-1340P)
    stock (default)       0.81    1.04    1.30
    this change           0.45    0.54    0.54
    GF_LARGE_TABLES       0.73    0.71    0.71

About 1.7× to 2.4× faster than the previous default across the table
(the largest gain, 1.30 → 0.54 ns/mul, is Raptor Lake's sparse case), and
consistent across data (no data-dependent branch) where the default
zero-guard degrades on zero-heavy input. It delivers most of the 64 KB
GF_LARGE_TABLES benefit at ~1/40th the memory: it is faster on Raptor
Lake, while on Zen 3 the 64 KB table can still edge it on zero-heavy data.

Testing

  • gf_mul checked bit-identical to an independent carryless-multiply
    reference (mod 0x11D) over all 65 536 input pairs.
  • gf_inv round-trips for every nonzero input.
  • erasure_code_test, erasure_code_update_test, gf_inverse_test,
    gf_vect_mul_test, gf_vect_dot_prod_test all pass.
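
An exhaustive check of this kind can be sketched as follows. It is not the test code from the PR: all names are hypothetical, the tables are rebuilt locally under the same layout assumptions as the description, and the inverse is computed as a^254 with the reference multiply rather than by calling the library's gf_inv.

```c
#include <stdint.h>
#include <string.h>

static uint16_t log16[256];   /* hypothetical stand-in for gflog_base */
static uint8_t antilog[1023]; /* hypothetical stand-in for gff_base */

static void init_tables(void)
{
    unsigned x = 1;
    memset(antilog, 0, sizeof antilog);
    for (unsigned i = 0; i < 255; i++) {
        antilog[i] = antilog[i + 255] = (uint8_t)x; /* two periods */
        log16[x] = (uint16_t)i;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D;
    }
    log16[0] = 511; /* sentinel */
}

static uint8_t mul_tab(uint8_t a, uint8_t b)
{
    return antilog[log16[a] + log16[b]];
}

/* Independent reference: carryless multiply, then reduce modulo 0x11D. */
static uint8_t mul_ref(uint8_t a, uint8_t b)
{
    unsigned p = 0;
    for (int bit = 0; bit < 8; bit++)
        if (b & (1u << bit))
            p ^= (unsigned)a << bit;      /* up to a 15-bit product */
    for (int bit = 14; bit >= 8; bit--)
        if (p & (1u << bit))
            p ^= 0x11Du << (bit - 8);     /* cancel the high bit */
    return (uint8_t)p;
}

/* Number of (a, b) pairs, out of 65536, where the two multiplies differ. */
static unsigned count_mul_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            bad += mul_tab((uint8_t)a, (uint8_t)b) !=
                   mul_ref((uint8_t)a, (uint8_t)b);
    return bad;
}

/* a^254 equals a^-1 for nonzero a, because a^255 == 1 in GF(2^8). */
static uint8_t inv_ref(uint8_t a)
{
    uint8_t r = 1;
    for (int i = 0; i < 254; i++)
        r = mul_ref(r, a);
    return r;
}

/* Number of nonzero a for which a * a^-1 != 1 under the table multiply. */
static unsigned count_inv_failures(void)
{
    unsigned bad = 0;
    for (unsigned a = 1; a < 256; a++)
        bad += mul_tab((uint8_t)a, inv_ref((uint8_t)a)) != 1;
    return bad;
}
```

After init_tables(), both counters should return 0 if the table multiply matches the reference.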

Provenance

The branch-free sentinel + zeroed-tail technique is from the gf-nishida-16
library: https://github.com/scopedog/gf-nishida-16

Commit carries Signed-off-by: per the DCO.

Replace the default (non-GF_LARGE_TABLES) gf_mul with a single
branch-free table lookup:

    return gff_base[gflog_base[a] + gflog_base[b]];

Two table changes make this correct without any branch:
  - gflog_base is widened to uint16_t and gflog_base[0] is set to a
    sentinel (511) outside the normal log range [0,254].
  - gff_base (antilog) holds two full periods so the index never needs a
    reduction wrap, followed by a zeroed tail. A zero operand sends the
    index into the zeroed tail, so the product is 0 without an
    (a == 0 || b == 0) test.

This removes both the zero-guard and the reduction branch from the
hottest scalar multiply, which feeds the scalar dot-product/encode
fallback paths and matrix construction (gf_gen_rs_matrix,
gf_invert_matrix). The polynomial is unchanged (0x11D), so results are
bit-identical to the previous gf_mul.

Tables grow from 256 B + 256 B to ~1 KB (gff_base) + 512 B (gflog_base),
still far smaller than the 64 KB GF_LARGE_TABLES path, which this makes
largely redundant. gf_inv is unchanged.

Microbenchmark, ns/mul (lower is better), -O2 -march=native, min of 6
passes, gf_mul(data, coeff) order as called by the base functions.
Comparing this change to the previous default (stock) and the 64 KB
GF_LARGE_TABLES, over a table build and a region multiply (fixed
coefficient, streaming data) on dense and 50%-zero data:

                          build   dense   sparse(50% zero)
  Zen 3 (Ryzen 7 5800X)
    stock (default)       0.75    0.72    0.73
    this change           0.36    0.36    0.42
    GF_LARGE_TABLES       0.50    0.42    0.22
  Raptor Lake (i5-1340P)
    stock (default)       0.81    1.04    1.30
    this change           0.45    0.54    0.54
    GF_LARGE_TABLES       0.73    0.71    0.71

This change is ~2x faster than the previous default and stays consistent
across data (no data-dependent branch). It delivers most of the 64 KB
GF_LARGE_TABLES benefit at ~1/40th the memory: faster on Raptor Lake,
while on Zen 3 the 64 KB table can still edge it on zero-heavy data.

Technique from the gf-nishida-16 library
(https://github.com/scopedog/gf-nishida-16).

Verified:
  - gf_mul bit-identical to a carryless-multiply reference over all
    65536 input pairs; gf_inv round-trips for every nonzero input
  - erasure_code_test, erasure_code_update_test, gf_inverse_test,
    gf_vect_mul_test, gf_vect_dot_prod_test all pass

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Signed-off-by: Hiroshi Nishida <[email protected]>