Support Lloyd-Max quantization #448

Draft
CC-Yeh wants to merge 14 commits into main from lut

CC-Yeh commented May 26, 2026

Contributor
| M | K | N | LUT NF4 | ZP | ScaleBias | LUT vs ZP | LUT vs ScaleBias |
|---|---|---|---|---|---|---|---|
| 1 | 2048 | 2048 | 13.21 +/- 0.17 | 13.21 +/- 0.21 | 16.25 +/- 0.30 | +0.0% | -18.7% |
| 2 | 2048 | 2048 | 20.66 +/- 0.18 | 21.25 +/- 0.44 | 19.77 +/- 0.42 | -2.8% | +4.5% |
| 4 | 2048 | 2048 | 39.95 +/- 0.66 | 40.00 +/- 0.57 | 36.38 +/- 0.42 | -0.1% | +9.8% |
| 1 | 4096 | 4096 | 82.27 +/- 2.41 | 81.00 +/- 1.48 | 88.96 +/- 3.30 | +1.6% | -7.5% |
| 2 | 4096 | 4096 | 83.30 +/- 0.69 | 86.27 +/- 2.69 | 87.93 +/- 2.92 | -3.4% | -5.3% |
| 4 | 4096 | 4096 | 159.31 +/- 0.48 | 159.28 +/- 0.95 | 147.03 +/- 0.29 | +0.0% | +8.3% |
| 1 | 4096 | 14336 | 311.23 +/- 3.57 | 317.59 +/- 7.47 | 333.77 +/- 10.88 | -2.0% | -6.8% |
| 2 | 4096 | 14336 | 313.65 +/- 10.87 | 310.81 +/- 2.51 | 334.17 +/- 15.47 | +0.9% | -6.1% |
| 4 | 4096 | 14336 | 546.43 +/- 6.16 | 543.17 +/- 1.03 | 502.50 +/- 1.13 | +0.6% | +8.7% |
| 1 | 14336 | 4096 | 307.90 +/- 3.87 | 315.62 +/- 5.92 | 328.01 +/- 2.63 | -2.4% | -6.1% |
| 2 | 14336 | 4096 | 309.49 +/- 3.95 | 314.63 +/- 3.91 | 327.04 +/- 4.39 | -1.6% | -5.4% |
| 4 | 14336 | 4096 | 584.91 +/- 36.89 | 545.00 +/- 1.63 | 503.72 +/- 1.03 | +7.3% | +16.1% |
| 1 | 14336 | 14336 | 1102.04 +/- 25.39 | 1089.21 +/- 15.51 | 1143.24 +/- 12.77 | +1.2% | -3.6% |
| 2 | 14336 | 14336 | 1106.88 +/- 58.83 | 1068.91 +/- 14.79 | 1154.25 +/- 56.39 | +3.6% | -4.1% |
| 4 | 14336 | 14336 | 2043.22 +/- 33.13 | 1872.84 +/- 7.52 | 1742.68 +/- 20.34 | +9.1% | +17.2% |

Summary

  • LUT is roughly tied with ZP overall: geomean +0.14% slower, wins 8/15.
  • LUT beats ZP at M=1, is roughly neutral at M=2, and loses at M=4.
  • LUT beats ScaleBias at M=1/M=2, but loses badly at M=4; overall geomean is +0.71% slower.
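
For readers who want to recompute aggregate figures like these from per-shape mean timings, here is a minimal sketch. It assumes the geomean is taken over the ratio of mean times and that a "win" means a strictly lower mean; the function names are hypothetical and not part of this PR.

```rust
/// Geometric-mean slowdown of `candidate` relative to `baseline`, in percent.
/// Positive means the candidate is slower. Both slices hold per-shape mean
/// timings in the same unit and in the same order.
fn geomean_slowdown_pct(candidate: &[f64], baseline: &[f64]) -> f64 {
    assert_eq!(candidate.len(), baseline.len());
    assert!(!candidate.is_empty());
    // Average the log-ratios, then exponentiate: exp(mean(ln(c / b))).
    let log_sum: f64 = candidate
        .iter()
        .zip(baseline)
        .map(|(c, b)| (c / b).ln())
        .sum();
    ((log_sum / candidate.len() as f64).exp() - 1.0) * 100.0
}

/// Number of shapes where the candidate's mean time is strictly lower.
fn wins(candidate: &[f64], baseline: &[f64]) -> usize {
    candidate.iter().zip(baseline).filter(|(c, b)| c < b).count()
}
```

For a single cell this reduces to the per-row percentage: `geomean_slowdown_pct(&[13.21], &[16.25])` is about -18.7, matching the LUT vs ScaleBias entry for M=1, K=N=2048.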

CC-Yeh commented May 26, 2026

Contributor Author

What improved

  • Normal tg16 used one shared 16-entry threadgroup LUT across all simdgroups.
  • That was bad for batched rows: about +33% to +48% slower at M=4.
  • tg16-duplicate did not fix it: still about +38% to +53% slower at M=4.
  • tg16-vec4 was worse: about +43% to +57% slower at M=4.
  • tg16-ilp and no-barrier variants did not explain the win.
  • The winning change was simdgroup separation: each simdgroup gets its own 16-entry LUT slice and
    synchronizes with simdgroup_barrier.
  • That brought the old bad tg16 path down from roughly +33%..+48% slower to roughly +5%..+15%
    slower at M=4 in the old standalone LUT experiments, and near parity with ZP in the current
    apples-to-apples benchmark.
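
The simdgroup-separated layout can be modelled on the CPU as follows. This is only a sketch of the indexing, written in Rust rather than Metal; the names and the `simdgroups` parameter are hypothetical. On the GPU the buffer would live in threadgroup memory, with each simdgroup synchronizing on its own slice via `simdgroup_barrier` instead of a threadgroup-wide barrier, which this model cannot express.

```rust
const LUT_ENTRIES: usize = 16; // 4-bit codes index a 16-entry codebook

/// CPU-side model of the threadgroup LUT: one 16-entry slice per simdgroup.
/// `simdgroups` is the number of simdgroups in the threadgroup (assumed).
fn build_threadgroup_lut(codebook: &[f32; LUT_ENTRIES], simdgroups: usize) -> Vec<f32> {
    let mut lut = Vec::with_capacity(simdgroups * LUT_ENTRIES);
    for _ in 0..simdgroups {
        // Each simdgroup gets its own copy of the codebook.
        lut.extend_from_slice(codebook);
    }
    lut
}

/// Look up `code` in the slice owned by `simdgroup_id`.
fn lookup(lut: &[f32], simdgroup_id: usize, code: u8) -> f32 {
    debug_assert!((code as usize) < LUT_ENTRIES);
    lut[simdgroup_id * LUT_ENTRIES + code as usize]
}
```

Because a simdgroup only ever touches indices in `simdgroup_id * 16 .. simdgroup_id * 16 + 16`, no other simdgroup needs to wait on its writes.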

Bad variants

  • FLUTE-ish duplicate/select/shuffle variants were catastrophic.
  • Duplicate variants reached hundreds of percent slower.
  • select was often over +1100% slower in M=2/M=4 cells.

CC-Yeh commented May 26, 2026

Contributor Author

Passing the threadgroup codebook as a kernel argument could hurt performance on non-codebook quantization paths. This needs to be benchmarked on a different machine; if it turns out to be true, we may need to split this into two kernels.

CC-Yeh commented May 26, 2026

Contributor Author

Previous attempts are in #394.

@CC-Yeh CC-Yeh marked this pull request as ready for review May 26, 2026 18:16

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 69f4ef5c8d

Comment on lines +89 to +92
QuantizationMethod::Codebook => {
    let codebook_value =
        (*codebook.expect("Codebook quantized QMV requires a codebook").add(val_q)).to_f32();
    scale * codebook_value

P1: Reject non-4-bit codebook QMV combinations

QuantizedMatmulQmvFast now accepts quant_method == Codebook for both BITS=4 and BITS=8, but the CPU path indexes the codebook with the unpacked quantized value (val_q) without any bounds check. With BITS=8, val_q can be 0..255, so callers that provide the expected 16-entry NF4 codebook will trigger out-of-bounds reads in qmv and produce undefined behavior. Please fail fast for Codebook unless BITS == 4 (or validate/require a 256-entry codebook).
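
A fail-fast check along the lines suggested here might look like the sketch below. It takes the "validate/require a 256-entry codebook" option by requiring at least 2^BITS entries; `CodebookError` and `validate_codebook` are hypothetical names, not part of the reviewed code.

```rust
#[derive(Debug, PartialEq)]
enum CodebookError {
    /// Codebook quantization was requested but no codebook was supplied.
    Missing,
    /// The codebook has fewer entries than a `bits`-wide value can index.
    TooShort { required: usize, actual: usize },
}

/// Check that every possible `bits`-wide quantized value is a valid index.
fn validate_codebook(bits: u32, codebook: Option<&[f32]>) -> Result<&[f32], CodebookError> {
    let codebook = codebook.ok_or(CodebookError::Missing)?;
    let required = 1usize << bits; // 16 for 4-bit, 256 for 8-bit
    if codebook.len() < required {
        return Err(CodebookError::TooShort {
            required,
            actual: codebook.len(),
        });
    }
    Ok(codebook)
}
```

With this check, a 16-entry NF4 codebook passes for `bits = 4` and is rejected for `bits = 8` before any lookup happens.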

@CC-Yeh CC-Yeh marked this pull request as draft May 26, 2026 18:46
@CC-Yeh CC-Yeh changed the title from "support qmv LUT" to "Support Lloyd-Max quantization" May 28, 2026
sady4850 commented Jun 1, 2026

Contributor

Numbers from Apple M1 Pro (32 GB, no MXU) — not in the matrix above.

Setup: criterion, gpu_execution_time(), bf16, group_size=64, 4-bit, --measurement-time 3 (the largest-shape cell re-checked at 10s). main = 8cf3630a, this branch = 6e50362c.

LUT vs ZP / ScaleBias (positive = LUT slower), same layout as your table:

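| M | K | N | LUT vs ZP | LUT vs ScaleBias |
|---|---|---|---|---|
| 1 | 2048 | 2048 | +7.4% | +7.0% |
| 2 | 2048 | 2048 | +9.1% | +11.3% |
| 4 | 2048 | 2048 | +9.5% | +13.0% |
| 1 | 4096 | 4096 | +9.6% | +12.5% |
| 2 | 4096 | 4096 | +10.1% | +13.6% |
| 4 | 4096 | 4096 | +10.2% | +13.7% |
| 1 | 4096 | 14336 | +6.3% | +4.7% |
| 2 | 4096 | 14336 | +9.9% | +13.6% |
| 4 | 4096 | 14336 | +9.9% | +13.4% |
| 1 | 14336 | 4096 | +4.9% | −0.2% |
| 2 | 14336 | 4096 | +16.1% | +19.9% |
| 4 | 14336 | 4096 | +10.3% | +13.2% |
| 1 | 14336 | 14336 | +1.5% | −4.2% |
| 2 | 14336 | 14336 | +9.7% | +4.4% |
| 4 | 14336 | 14336 | +10.0% | +13.6% |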

Non-codebook paths, main → this branch (12 overlapping shapes, gs64):

| Path | geomean | median | max |
|---|---|---|---|
| ZP | +0.2% | +0.0% | +1.7% |
| ScaleBias | +0.6% | +0.1% | +3.6% |

Can add M=8, other group sizes, or longer runs if useful.

CC-Yeh added 10 commits June 6, 2026 13:55
# Conflicts:
#	crates/backend-uzu/src/backends/metal/kernel/matmul/gemm/specialization.rs
#	crates/backend-uzu/src/backends/metal/kernel/matmul/gemv/common/b_source.h
#	crates/backend-uzu/src/backends/metal/kernel/matmul/gemv/gemv.metal
#	crates/backend-uzu/src/backends/metal/kernel/matmul/gemv/kernel.rs
#	crates/backend-uzu/src/backends/metal/kernel/matmul/mod.rs