Add core serial TurboQuant attention and packed state #90
Open
scouzi1966 wants to merge 5 commits into feature/codex-turboquant-plumbing from
Conversation
Reviewer's Guide

Implements explicit TurboQuant-aware attention routing and converts TurboQuantKVCache to use packed low-bit MSE-style codec state as the serialized source of truth, while keeping dense shadow buffers for fallback attention and maintaining backward-compatible cache metadata and serialization.

Sequence diagram for TurboQuant-aware attention dispatch
sequenceDiagram
participant Caller
participant AttentionUtils as attentionWithCacheUpdate
participant KV as KVCache
participant TurboKV as TurboQuantKVCacheProtocol
participant QuantKV as QuantizedKVCacheProtocol
participant FastAttn as MLXFast
participant QuantAttn as quantizedScaledDotProductAttention
Caller->>AttentionUtils: attentionWithCacheUpdate(queries, keys, values, cache, scale, mask)
alt cache is nil
AttentionUtils->>FastAttn: scaledDotProductAttention(queries, keys, values, scale, mask)
FastAttn-->>AttentionUtils: output
AttentionUtils-->>Caller: output
else cache is TurboQuantKVCacheProtocol
AttentionUtils->>AttentionUtils: queries.dim(2) == 1?
alt decode path
AttentionUtils->>TurboKV: decodeAttention(queries, keys, values, scale, mask)
TurboKV-->>AttentionUtils: output
else prefill path
AttentionUtils->>TurboKV: prefillAttention(queries, keys, values, scale, mask)
TurboKV-->>AttentionUtils: output
end
AttentionUtils-->>Caller: output
else cache is QuantizedKVCacheProtocol
AttentionUtils->>QuantKV: updateQuantized(keys, values)
QuantKV-->>AttentionUtils: quantizedKeys, quantizedValues
AttentionUtils->>QuantAttn: quantizedScaledDotProductAttention(queries, quantizedKeys, quantizedValues, scale, mask, groupSize, bits, mode)
QuantAttn-->>AttentionUtils: output
AttentionUtils-->>Caller: output
else generic KVCache
AttentionUtils->>KV: update(keys, values)
KV-->>AttentionUtils: cachedKeys, cachedValues
AttentionUtils->>FastAttn: scaledDotProductAttention(queries, cachedKeys, cachedValues, scale, mask)
FastAttn-->>AttentionUtils: output
AttentionUtils-->>Caller: output
end
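Read top to bottom, the dispatch is a type-based cascade over the cache. The Swift sketch below mirrors the diagram; it is a minimal illustration that assumes the protocol members and argument labels shown above (`TurboQuantKVCacheProtocol`, `QuantizedKVCacheProtocol`, `quantizedScaledDotProductAttention`), which may differ from the PR's actual source.

```swift
import MLX
import MLXFast

// Minimal sketch of the dispatch above. Protocol members and argument labels
// follow the diagram; the PR's real signatures may differ.
func attentionWithCacheUpdateSketch(
    queries: MLXArray, keys: MLXArray, values: MLXArray,
    cache: KVCache?, scale: Float,
    mask: MLXFast.ScaledDotProductAttentionMaskMode
) -> MLXArray {
    guard let cache else {
        // No cache at all: fused attention over just the fresh keys/values.
        return MLXFast.scaledDotProductAttention(
            queries: queries, keys: keys, values: values, scale: scale, mask: mask)
    }
    if let turbo = cache as? TurboQuantKVCacheProtocol {
        // Single-token queries take the decode path; multi-token spans prefill.
        if queries.dim(2) == 1 {
            return turbo.decodeAttention(
                queries: queries, keys: keys, values: values, scale: scale, mask: mask)
        }
        return turbo.prefillAttention(
            queries: queries, keys: keys, values: values, scale: scale, mask: mask)
    }
    if let quantized = cache as? QuantizedKVCacheProtocol {
        // Append to the quantized cache, then attend over the quantized state.
        let (quantizedKeys, quantizedValues) = quantized.updateQuantized(keys: keys, values: values)
        return quantizedScaledDotProductAttention(
            queries, quantizedKeys, quantizedValues, scale: scale, mask: mask,
            groupSize: quantized.groupSize, bits: quantized.bits, mode: quantized.mode)
    }
    // Generic dense cache: append, then fused attention over the full history.
    let (cachedKeys, cachedValues) = cache.update(keys: keys, values: values)
    return MLXFast.scaledDotProductAttention(
        queries: queries, keys: cachedKeys, values: cachedValues, scale: scale, mask: mask)
}
```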
Sequence diagram for TurboQuantKVCache packed update and fallback attention
sequenceDiagram
participant Model
participant TurboKV as TurboQuantKVCache
participant CodecK as TurboQuantMSECodec(keys)
participant CodecV as TurboQuantMSECodec(values)
participant FastAttn as MLXFast
Model->>TurboKV: decodeAttention(queries, keys, values, scale, mask)
activate TurboKV
TurboKV->>TurboKV: fallbackAttention(queries, keys, values, scale, mask)
TurboKV->>TurboKV: update(keys, values)
note over TurboKV: ensureCodecs(keyDim, valueDim)
TurboKV->>CodecK: quantize(keys)
CodecK-->>TurboKV: keyStateUpdate
TurboKV->>CodecV: quantize(values)
CodecV-->>TurboKV: valueStateUpdate
TurboKV->>TurboKV: appendPackedState(keyState, keyStateUpdate, previousOffset)
TurboKV->>TurboKV: appendPackedState(valueState, valueStateUpdate, previousOffset)
TurboKV->>TurboKV: appendShadow(keys, values, previousOffset)
TurboKV->>TurboKV: denseState()
TurboKV-->>TurboKV: cachedKeys, cachedValues
TurboKV->>FastAttn: scaledDotProductAttention(queries, cachedKeys, cachedValues, scale, mask)
FastAttn-->>TurboKV: attentionOutput
TurboKV-->>Model: attentionOutput
deactivate TurboKV
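The `appendPackedState` steps are where the packed codec state grows. Below is a sketch of that append using a hypothetical mirror of `TurboQuantMSEState`; the field shapes and axis layout are assumptions for illustration, not the PR's actual definitions.

```swift
import MLX

// Hypothetical mirror of TurboQuantMSEState: per-vector norms plus packed
// low-bit code indices. Axis layout is assumed for illustration only.
struct MSEStateSketch {
    var norms: MLXArray    // [batch, heads, seq]
    var indices: MLXArray  // [batch, heads, seq, packedDim]
}

// Sketch of appendPackedState: drop any stale tail beyond the previous offset
// (e.g. after a rollback), then concatenate the new rows on the sequence axis.
func appendPackedStateSketch(
    _ current: MSEStateSketch?, _ update: MSEStateSketch, previous: Int
) -> MSEStateSketch {
    guard let current else { return update }
    let keptNorms = split(current.norms, indices: [previous], axis: -1)[0]
    let keptIndices = split(current.indices, indices: [previous], axis: -2)[0]
    return MSEStateSketch(
        norms: concatenated([keptNorms, update.norms], axis: -1),
        indices: concatenated([keptIndices, update.indices], axis: -2))
}
```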
Updated class diagram for TurboQuantKVCache and TurboQuant codecs
classDiagram
class KVCache {
<<protocol>>
+offset: Int
+state: [MLXArray]
+metaState: [String]
+update(keys: MLXArray, values: MLXArray) MLXArray MLXArray
+truncateToOffset()
}
class TurboQuantKVCacheProtocol {
<<protocol>>
+configuration: TurboQuantConfiguration
+decodeAttention(queries: MLXArray, keys: MLXArray, values: MLXArray, scale: Float, mask: MLXFast.ScaledDotProductAttentionMaskMode) MLXArray
+prefillAttention(queries: MLXArray, keys: MLXArray, values: MLXArray, scale: Float, mask: MLXFast.ScaledDotProductAttentionMaskMode) MLXArray
}
class QuantizedKVCacheProtocol {
<<protocol>>
+groupSize: Int
+bits: Int
+mode: Int
+updateQuantized(keys: MLXArray, values: MLXArray) MLXArray MLXArray
}
class BaseKVCache {
+offset: Int
+state: [MLXArray]
+metaState: [String]
+innerState() [MLXArray]
+update(keys: MLXArray, values: MLXArray) MLXArray MLXArray
+truncateToOffset()
}
class TurboQuantMSEState {
+norms: MLXArray
+indices: MLXArray
}
class TurboQuantMSECodec {
+dim: Int
+bits: Int
+useRHT: Bool
+signs: MLXArray?
+codebook: MLXArray
+midpoints: [Float]
+TurboQuantMSECodec(dim: Int, bits: Int, seed: Int)
+quantize(vectors: MLXArray) TurboQuantMSEState
+dequantize(state: TurboQuantMSEState) MLXArray
-rotateForward(array: MLXArray) MLXArray
-rotateInverse(array: MLXArray) MLXArray
}
class TurboQuantConfiguration {
+bits: Float
+variant: TurboQuantVariant
+metadataPath: String?
+metadataVersion: Int
+transformVersion: String
+codebookVersion: String
}
class TurboQuantKVCache {
+configuration: TurboQuantConfiguration
+step: Int
+didGrow: Bool
-keyState: TurboQuantMSEState?
-valueState: TurboQuantMSEState?
-shadowKeys: MLXArray?
-shadowValues: MLXArray?
-legacyDenseState: (keys: MLXArray, values: MLXArray)?
-keyCodec: TurboQuantMSECodec?
-valueCodec: TurboQuantMSECodec?
-keyDimension: Int?
-valueDimension: Int?
+innerState() [MLXArray]
+update(keys: MLXArray, values: MLXArray) MLXArray MLXArray
+decodeAttention(queries: MLXArray, keys: MLXArray, values: MLXArray, scale: Float, mask: MLXFast.ScaledDotProductAttentionMaskMode) MLXArray
+prefillAttention(queries: MLXArray, keys: MLXArray, values: MLXArray, scale: Float, mask: MLXFast.ScaledDotProductAttentionMaskMode) MLXArray
+truncateToOffset()
+toUnquantized() KVCacheSimple
+state: [MLXArray]
+metaState: [String]
+debugDescription: String
-ensureCodecs(keyDim: Int, valueDim: Int)
-appendShadow(keys: MLXArray, values: MLXArray, previous: Int)
-appendPackedState(current: TurboQuantMSEState?, update: TurboQuantMSEState, previous: Int)
-rebuildFromDenseState(keys: MLXArray, values: MLXArray)
-rehydrateShadowFromPackedState()
-denseState() MLXArray MLXArray
-fallbackAttention(queries: MLXArray, keys: MLXArray, values: MLXArray, scale: Float, mask: MLXFast.ScaledDotProductAttentionMaskMode) MLXArray
}
class KVCacheSimple {
+state: [MLXArray]
}
KVCache <|-- BaseKVCache
BaseKVCache <|-- TurboQuantKVCache
TurboQuantKVCacheProtocol <|.. TurboQuantKVCache
QuantizedKVCacheProtocol <|.. KVCache
TurboQuantConfiguration <.. TurboQuantKVCache
TurboQuantMSECodec <.. TurboQuantKVCache
TurboQuantMSEState <.. TurboQuantMSECodec
TurboQuantMSEState <.. TurboQuantKVCache
KVCacheSimple <.. TurboQuantKVCache
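To ground the `TurboQuantMSECodec` half of the diagram, here is a toy norm-plus-codes round trip. It is a sketch only: the real codec also applies the randomized Hadamard rotation (`rotateForward`/`rotateInverse`) and packs the indices into low-bit storage, both omitted here; the function names are hypothetical.

```swift
import MLX

// Toy version of the quantize/dequantize split in the diagram: store each
// vector's L2 norm separately, and quantize only the normalized components
// against a small shared codebook. The real TurboQuantMSECodec additionally
// applies a randomized Hadamard transform before coding; omitted here.
func toyQuantize(_ vectors: MLXArray, codebook: MLXArray) -> (norms: MLXArray, indices: MLXArray) {
    let norms = sqrt((vectors * vectors).sum(axis: -1, keepDims: true))
    let unit = vectors / norms
    // Nearest codebook entry per component: argmin over |x - c|.
    let distances = abs(expandedDimensions(unit, axis: -1) - codebook)
    return (norms, argMin(distances, axis: -1))
}

func toyDequantize(norms: MLXArray, indices: MLXArray, codebook: MLXArray) -> MLXArray {
    // Look the codes back up in the codebook, then restore the stored scale.
    take(codebook, indices, axis: 0) * norms
}
```

A round trip through `toyQuantize` then `toyDequantize` recovers the input up to codebook resolution, which is the MSE trade the packed cache makes.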
Hey - I've found 1 issue, and left some high level feedback:

- The low-bit pack/unpack helpers (`turboQuantPackLowBit` / `turboQuantUnpackLowBit`) currently loop over positions and perform per-index MLXArray slicing and assignment, which will be quite slow for long sequences; consider restructuring these into vectorized operations or a dedicated kernel-style helper so the bit packing work scales better (see the sketch after this list).
- In `TurboQuantKVCache.truncateToOffset()`, only `keyState`/`valueState` and shadow tensors are trimmed; if `legacyDenseState` is present it remains unmodified and will be re-quantized to the full length on the next access, which can be surprising; either trim `legacyDenseState` as well or make it explicit that truncate is a no-op when running in dense-compatibility mode.
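On the first point, the per-index loop can typically be replaced by a reshape plus a weighted reduction. A sketch of the idea for the 2-bit case follows; the helper name, bit width, and layout are assumptions for illustration, not the PR's actual code.

```swift
import MLX

// Hypothetical vectorized 2-bit packer: rather than looping over positions and
// assigning per-index slices, fold each group of four 2-bit codes into one
// byte with a broadcasted multiply and a sum along the group axis.
func packLowBitVectorizedSketch(_ codes: MLXArray) -> MLXArray {
    // codes: integer values in [0, 3], total length divisible by 4.
    let grouped = codes.reshaped([-1, 4]).asType(.uint32)
    let weights = MLXArray([Int32(1), 4, 16, 64]).asType(.uint32)  // 4^k per slot
    // Max per byte is 3 * (1 + 4 + 16 + 64) = 255, so the result fits in uint8.
    return (grouped * weights).sum(axis: -1).asType(.uint8)
}
```

Unpacking would mirror this with whole-array integer division and remainder by the same weights, again avoiding per-position assignment.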
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The low-bit pack/unpack helpers (`turboQuantPackLowBit` / `turboQuantUnpackLowBit`) currently loop over positions and perform per-index MLXArray slicing and assignment, which will be quite slow for long sequences; consider restructuring these into vectorized operations or a dedicated kernel-style helper so the bit packing work scales better.
- In `TurboQuantKVCache.truncateToOffset()`, only `keyState`/`valueState` and shadow tensors are trimmed; if `legacyDenseState` is present it remains unmodified and will be re-quantized to the full length on the next access, which can be surprising—either trim `legacyDenseState` as well or make it explicit that truncate is a no-op when running in dense-compatibility mode.
## Individual Comments
### Comment 1
<location path="Scripts/apply-mlx-patches.sh" line_range="22-24" />
<code_context>
+PATCH_FILES=("Qwen3VL.swift" "Qwen3Next.swift" "GatedDelta.swift" "Qwen3_5MoE.swift" "DeepseekV3.swift" "MiniMaxM2.swift" "NemotronH.swift" "GLM4MoeLite.swift" "GLM5MoeDsa.swift" "KimiK25.swift" "Gemma4Text.swift" "Gemma4VLM.swift" "LLMModelFactory.swift" "Load.swift" "Evaluate.swift" "LanguageModel.swift" "Tokenizer.swift" "AttentionUtils.swift" "Qwen3_5MoEVL.swift" "VLMModelFactory.swift" "SamplerTests.swift" "ToolCallFormat.swift" "KVCache.swift" "SwitchLayers.swift" "BatchKVCache.swift" "SSM.swift" "Chat.swift" "Gemma4FunctionParser.swift")
</code_context>
<issue_to_address>
**issue (bug_risk):** AttentionUtils.swift is added to PATCH_FILES/TARGET_PATHS but not to NEW_FILES, which may break patching on a clean tree.
On a clean checkout where this file doesn’t exist yet, the script won’t pre-create it, so `patch` may fail. Please add `AttentionUtils.swift` to NEW_FILES to keep the arrays consistent and ensure patching works from a clean tree.
</issue_to_address>
Summary
Add the core serial TurboQuant behavior on top of the plumbing branch.
What this PR includes
What this PR does not include
Validation
MACAFM_MLX_METALLIB="$PWD/default.metallib" swift test --filter TurboQuantCacheTests --parallel --num-workers 1
MACAFM_MLX_METALLIB="$PWD/default.metallib" swift test --filter 'KVCacheTruncateTests|BatchedPrefillTests' --parallel --num-workers 1

Stack
This is PR 2 of a stacked TurboQuant series and targets PR 1.
Summary by Sourcery
Introduce packed TurboQuant KV cache state and explicit attention dispatch hooks, routing model attention through TurboQuant-specific decode/prefill paths while preserving compatibility with existing quantized and dense caches.
New Features:
Enhancements:
Build:
Documentation:
Tests: