Support Lloyd-Max quantization by CC-Yeh · Pull Request #448 · trymirai/uzu

CC-Yeh · 2026-05-26T17:15:17Z

M	K	N	LUT NF4	ZP	ScaleBias	LUT vs ZP	LUT vs ScaleBias
1	2048	2048	13.21 +/- 0.17	13.21 +/- 0.21	16.25 +/- 0.30	+0.0%	-18.7%
2	2048	2048	20.66 +/- 0.18	21.25 +/- 0.44	19.77 +/- 0.42	-2.8%	+4.5%
4	2048	2048	39.95 +/- 0.66	40.00 +/- 0.57	36.38 +/- 0.42	-0.1%	+9.8%
1	4096	4096	82.27 +/- 2.41	81.00 +/- 1.48	88.96 +/- 3.30	+1.6%	-7.5%
2	4096	4096	83.30 +/- 0.69	86.27 +/- 2.69	87.93 +/- 2.92	-3.4%	-5.3%
4	4096	4096	159.31 +/- 0.48	159.28 +/- 0.95	147.03 +/- 0.29	+0.0%	+8.3%
1	4096	14336	311.23 +/- 3.57	317.59 +/- 7.47	333.77 +/- 10.88	-2.0%	-6.8%
2	4096	14336	313.65 +/- 10.87	310.81 +/- 2.51	334.17 +/- 15.47	+0.9%	-6.1%
4	4096	14336	546.43 +/- 6.16	543.17 +/- 1.03	502.50 +/- 1.13	+0.6%	+8.7%
1	14336	4096	307.90 +/- 3.87	315.62 +/- 5.92	328.01 +/- 2.63	-2.4%	-6.1%
2	14336	4096	309.49 +/- 3.95	314.63 +/- 3.91	327.04 +/- 4.39	-1.6%	-5.4%
4	14336	4096	584.91 +/- 36.89	545.00 +/- 1.63	503.72 +/- 1.03	+7.3%	+16.1%
1	14336	14336	1102.04 +/- 25.39	1089.21 +/- 15.51	1143.24 +/- 12.77	+1.2%	-3.6%
2	14336	14336	1106.88 +/- 58.83	1068.91 +/- 14.79	1154.25 +/- 56.39	+3.6%	-4.1%
4	14336	14336	2043.22 +/- 33.13	1872.84 +/- 7.52	1742.68 +/- 20.34	+9.1%	+17.2%

Summary

LUT is roughly tied with ZP overall: geomean +0.14% slower, wins 8/15.
LUT beats ZP at M=1, is roughly neutral at M=2, and loses at M=4.
LUT beats ScaleBias at M=1/M=2, but loses badly at M=4; overall geomean is +0.71% slower.
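
For readers who want to reproduce the percentage columns and a geomean figure, here is a minimal sketch. The script behind the summary numbers is not shown in this thread, so the helper names `pct_diff` and `geomean_pct` are hypothetical, and the sketch assumes the geomean is taken over per-cell ratios of mean times (positive = LUT slower).

```rust
/// Relative difference of `a` versus `b` in percent; positive means `a` is slower.
fn pct_diff(a: f64, b: f64) -> f64 {
    (a / b - 1.0) * 100.0
}

/// Geometric mean of the per-cell ratios a/b, expressed as a percent slowdown.
/// Assumes `pairs` is non-empty and all times are positive.
fn geomean_pct(pairs: &[(f64, f64)]) -> f64 {
    let log_sum: f64 = pairs.iter().map(|(a, b)| (a / b).ln()).sum();
    ((log_sum / pairs.len() as f64).exp() - 1.0) * 100.0
}

fn main() {
    // (LUT, ScaleBias) mean times for the three 2048x2048 rows of the table above.
    let cells = [(13.21, 16.25), (20.66, 19.77), (39.95, 36.38)];
    for (lut, scale_bias) in cells {
        println!("{:+.1}%", pct_diff(lut, scale_bias));
    }
    println!("geomean over these cells: {:+.2}%", geomean_pct(&cells));
}
```

The geomean is computed in log space so that a 2x slowdown and a 2x speedup cancel out, which a plain arithmetic mean of percentages would not do.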

CC-Yeh · 2026-05-26T17:15:31Z

What improved

Normal tg16 used one shared 16-entry threadgroup LUT across all simdgroups.
That was bad for batched rows: about +33% to +48% slower at M=4.
tg16-duplicate did not fix it: still about +38% to +53% slower at M=4.
tg16-vec4 was worse: about +43% to +57% slower at M=4.
tg16-ilp and no-barrier variants did not explain the win.
The winning change was simdgroup separation: each simdgroup gets its own 16-entry LUT slice and
synchronizes with simdgroup_barrier.
That brought the old bad tg16 path down from roughly +33%..+48% slower to roughly +5%..+15%
slower at M=4 in the old standalone LUT experiments, and near parity with ZP in the current
apples-to-apples benchmark.
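
To make the layout concrete, here is a CPU-side sketch of the indexing only. It is not the Metal kernel: the type `PerSimdgroupLut` and its methods are hypothetical, and it models neither `simdgroup_barrier` nor any performance behavior. It only shows each simdgroup reading through its own 16-entry slice of one shared buffer.

```rust
const LUT_ENTRIES: usize = 16; // one entry per 4-bit code

/// Stand-in for threadgroup memory: one 16-entry slice per simdgroup.
struct PerSimdgroupLut {
    storage: Vec<f32>,
}

impl PerSimdgroupLut {
    /// Copy the codebook into each simdgroup's own slice.
    fn new(codebook: &[f32; LUT_ENTRIES], simdgroups: usize) -> Self {
        let mut storage = Vec::with_capacity(LUT_ENTRIES * simdgroups);
        for _ in 0..simdgroups {
            storage.extend_from_slice(codebook);
        }
        Self { storage }
    }

    /// Look up a 4-bit code through the slice owned by `simdgroup`.
    fn lookup(&self, simdgroup: usize, code: u8) -> f32 {
        let base = simdgroup * LUT_ENTRIES;
        self.storage[base + (code & 0x0F) as usize]
    }
}

fn main() {
    let mut codebook = [0.0f32; LUT_ENTRIES];
    for (i, v) in codebook.iter_mut().enumerate() {
        *v = i as f32 * 0.5; // placeholder values, not a real NF4 codebook
    }
    let lut = PerSimdgroupLut::new(&codebook, 4);
    // Every simdgroup sees the same values, but through a different slice.
    assert_eq!(lut.lookup(0, 3), lut.lookup(3, 3));
    println!("{}", lut.lookup(2, 7));
}
```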

Bad variants

FLUTE-ish duplicate/select/shuffle variants were catastrophic.
Duplicate variants reached hundreds of percent slower.
select was often over +1100% slower in M=2/M=4 cells.

CC-Yeh · 2026-05-26T17:16:31Z

Passing the threadgroup codebook as a kernel argument could hurt performance for non-codebook quantization methods. This needs to be benchmarked on a different machine; if it turns out to be true, we may need to split this into two kernels.

CC-Yeh · 2026-05-26T17:35:29Z

Previous attempts are in #394.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 69f4ef5c8d


chatgpt-codex-connector · 2026-05-26T18:23:55Z

+                        QuantizationMethod::Codebook => {
+                            let codebook_value =
+                                (*codebook.expect("Codebook quantized QMV requires a codebook").add(val_q)).to_f32();
+                            scale * codebook_value


Reject non-4-bit codebook QMV combinations

QuantizedMatmulQmvFast now accepts quant_method == Codebook for both BITS=4 and BITS=8, but the CPU path indexes the codebook with the unpacked quantized value (val_q) without any bounds check. With BITS=8, val_q can be 0..255, so callers that provide the expected 16-entry NF4 codebook will trigger out-of-bounds reads in qmv and produce undefined behavior. Please fail fast for Codebook unless BITS == 4 (or validate/require a 256-entry codebook).
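
One possible shape for the suggested fail-fast check is sketched below. The names `Method` and `validate_codebook` are hypothetical and are not taken from this PR; the sketch assumes the codebook must hold one entry per possible quantized value, that is 2^BITS entries.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Method {
    Codebook,
    Other, // stands in for the non-codebook methods
}

/// Reject codebook configurations whose table cannot cover every quantized value.
fn validate_codebook(method: Method, bits: u32, codebook_len: Option<usize>) -> Result<(), String> {
    if method != Method::Codebook {
        return Ok(());
    }
    if bits == 0 || bits > 8 {
        return Err(format!("unsupported bit width {bits} for codebook quantization"));
    }
    let required = 1usize << bits; // 16 entries for 4-bit, 256 for 8-bit
    match codebook_len {
        None => Err("codebook quantization requires a codebook".to_string()),
        Some(len) if len < required => Err(format!(
            "codebook has {len} entries but {bits}-bit values need {required}"
        )),
        Some(_) => Ok(()),
    }
}

fn main() {
    // A 16-entry codebook is fine for 4-bit values but not for 8-bit values.
    println!("{:?}", validate_codebook(Method::Codebook, 4, Some(16)));
    println!("{:?}", validate_codebook(Method::Codebook, 8, Some(16)));
}
```

Checking once at kernel construction keeps the inner loop free of per-element bounds checks while still ruling out the out-of-bounds read described above.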


sady4850 · 2026-06-01T14:56:58Z

Numbers from Apple M1 Pro (32 GB, no MXU) — not in the matrix above.

Setup: criterion, gpu_execution_time(), bf16, group_size=64, 4-bit, --measurement-time 3 (the largest-shape cell re-checked at 10s). main = 8cf3630a, this branch = 6e50362c.

LUT vs ZP / ScaleBias (positive = LUT slower), same layout as your table:

M	K	N	LUT vs ZP	LUT vs ScaleBias
1	2048	2048	+7.4%	+7.0%
2	2048	2048	+9.1%	+11.3%
4	2048	2048	+9.5%	+13.0%
1	4096	4096	+9.6%	+12.5%
2	4096	4096	+10.1%	+13.6%
4	4096	4096	+10.2%	+13.7%
1	4096	14336	+6.3%	+4.7%
2	4096	14336	+9.9%	+13.6%
4	4096	14336	+9.9%	+13.4%
1	14336	4096	+4.9%	−0.2%
2	14336	4096	+16.1%	+19.9%
4	14336	4096	+10.3%	+13.2%
1	14336	14336	+1.5%	−4.2%
2	14336	14336	+9.7%	+4.4%
4	14336	14336	+10.0%	+13.6%

Non-codebook paths, main → this branch (12 overlapping shapes, gs64):

Path	geomean	median	max
ZP	+0.2%	+0.0%	+1.7%
ScaleBias	+0.6%	+0.1%	+3.6%

Can add M=8, other group sizes, or longer runs if useful.


# Conflicts:
#	crates/backend-uzu/src/backends/metal/kernel/matmul/gemm/specialization.rs
#	crates/backend-uzu/src/backends/metal/kernel/matmul/gemv/common/b_source.h
#	crates/backend-uzu/src/backends/metal/kernel/matmul/gemv/gemv.metal
#	crates/backend-uzu/src/backends/metal/kernel/matmul/gemv/kernel.rs
#	crates/backend-uzu/src/backends/metal/kernel/matmul/mod.rs

CC-Yeh mentioned this pull request May 26, 2026

[Experiment] QMV LUT Dequant #394

Closed

CC-Yeh marked this pull request as ready for review May 26, 2026 18:16

CC-Yeh requested review from LuckyIYI, eugenebokhan and uuuvn as code owners May 26, 2026 18:16

chatgpt-codex-connector Bot reviewed May 26, 2026


CC-Yeh marked this pull request as draft May 26, 2026 18:46

CC-Yeh changed the title from "support qmv LUT" to "Support Lloyd-Max quantization" May 28, 2026

CC-Yeh added 4 commits June 4, 2026 00:44

support qmv LUT

dee0e34

lloyd-max draft

7ff65bb

wip

2fc2544

Harden Lloyd-Max QMV integration on lut

cd34fc7

CC-Yeh force-pushed the lut branch from 6e50362 to cd34fc7 Compare June 6, 2026 05:54

CC-Yeh added 10 commits June 6, 2026 13:55

Merge branch 'main' into lut

6c132f0

cleanup

064fe95

Keep Lloyd-Max test runners local

9b7f5f0

Move Lloyd-Max fixtures into owning tests

3ec6dd2

Revert "Move Lloyd-Max fixtures into owning tests"

e1bf77c

This reverts commit 3ec6dd2.

cleanup

2e82531

format

17114c8

Merge branch 'main' into lut

28dab06

extract sum optimization

0fc20b1

CC-Yeh wants to merge 14 commits into main from lut

2 participants
