
Add core serial TurboQuant attention and packed state #90

Open
scouzi1966 wants to merge 5 commits into feature/codex-turboquant-plumbing from feature/codex-turboquant-core

Conversation

scouzi1966 (Owner) commented Apr 5, 2026

Summary

Add the core serial TurboQuant behavior on top of the plumbing branch.

What this PR includes

  • explicit TurboQuant attention dispatch
  • packed TurboQuant codec state for prompt-cache serialization
  • recursive cache conversion for supported cache topologies
  • design notes for the attention/codecs slices

What this PR does not include

  • Metal fast-path execution
  • batch/concurrent TurboQuant cache execution
  • rotating/sliding-window TurboQuant support

Validation

  • MACAFM_MLX_METALLIB="$PWD/default.metallib" swift test --filter TurboQuantCacheTests --parallel --num-workers 1
  • MACAFM_MLX_METALLIB="$PWD/default.metallib" swift test --filter 'KVCacheTruncateTests|BatchedPrefillTests' --parallel --num-workers 1

Stack

This is PR 2 of a stacked TurboQuant series and targets PR 1.

Summary by Sourcery

Introduce packed TurboQuant KV cache state and explicit attention dispatch hooks, routing model attention through TurboQuant-specific decode/prefill paths while preserving compatibility with existing quantized and dense caches.

New Features:

  • Add TurboQuant-specific decode and prefill attention entrypoints and route attention through TurboQuant, quantized, or dense paths via a shared helper.

Enhancements:

  • Introduce packed TurboQuant KV cache representation using MSE-style norms and low-bit indices as the serialized source of truth while maintaining dense shadow buffers for fallback attention.
  • Support asymmetric fractional TurboQuant bit-widths for keys and values and extend cache state/metaState handling for backward-compatible prompt-cache serialization (a small sketch of the bit split follows this list).
  • Expand TurboQuant tests to cover packed-state save/load, dense round-trips, and attention dispatch behavior for decode vs prefill flows.
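
For illustration, the floor/ceil split mentioned above might look like the sketch below; the function name, the validation range, and which of keys or values receives the wider codec are assumptions, not taken from the patch.

```swift
// Hypothetical sketch of splitting a fractional bit budget (e.g. 2.5) into two
// integer widths via floor/ceil. Which tensor gets the extra bit and the set of
// allowed widths are details of the actual patch and are only assumed here.
func splitKVBits(_ kvBits: Double) -> (keyBits: Int, valueBits: Int) {
    let narrow = Int(kvBits.rounded(.down))   // floor, e.g. 2
    let wide = Int(kvBits.rounded(.up))       // ceil,  e.g. 3
    precondition((1...8).contains(narrow) && (1...8).contains(wide),
                 "allowed bit-widths are validated by the real implementation")
    return (keyBits: wide, valueBits: narrow) // assumed assignment, not confirmed
}
```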

Build:

  • Update MLX patch application script to include the new attention utilities patch and corresponding target path.

Documentation:

  • Add design notes documenting TurboQuant packed codec state and attention dispatch architecture and their intended scope and limitations.

Tests:

  • Add a fake TurboQuant cache and new tests to verify attention utilities correctly route single-token decode and multi-token prefill through TurboQuant-specific APIs and validate packed-state round-trips (a sketch of the routing-test pattern follows).
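
A hedged sketch of that routing-test pattern is below; the fake cache's KVCache and configuration conformance requirements are omitted, and the shapes, mask mode, and sentinel values are illustrative assumptions. The real tests live in Tests/MacLocalAPITests/TurboQuantCacheTests.swift.

```swift
import MLX
import MLXFast
import XCTest

// Sketch only: a call-counting fake cache plus a decode-dispatch check.
// KVCache and configuration protocol requirements are omitted for brevity.
final class FakeTurboQuantCache: TurboQuantKVCacheProtocol {
    private(set) var decodeCalls = 0
    private(set) var prefillCalls = 0

    func decodeAttention(queries: MLXArray, keys: MLXArray, values: MLXArray,
                         scale: Float,
                         mask: MLXFast.ScaledDotProductAttentionMaskMode) -> MLXArray {
        decodeCalls += 1
        return MLXArray(Float(0))  // sentinel output
    }

    func prefillAttention(queries: MLXArray, keys: MLXArray, values: MLXArray,
                          scale: Float,
                          mask: MLXFast.ScaledDotProductAttentionMaskMode) -> MLXArray {
        prefillCalls += 1
        return MLXArray(Float(1))  // sentinel output
    }
}

func exampleDecodeDispatchCheck() {
    let cache = FakeTurboQuantCache()
    let single = MLXArray.zeros([1, 8, 1, 64])  // sequence length 1, so decode path
    _ = attentionWithCacheUpdate(queries: single, keys: single, values: single,
                                 cache: cache, scale: 1.0, mask: .causal)
    XCTAssertEqual(cache.decodeCalls, 1)
    XCTAssertEqual(cache.prefillCalls, 0)
}
```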


sourcery-ai Bot commented Apr 5, 2026

Reviewer's Guide

Implements explicit TurboQuant-aware attention routing and converts TurboQuantKVCache to use packed low-bit MSE-style codec state as the serialized source of truth, while keeping dense shadow buffers for fallback attention and maintaining backward-compatible cache metadata and serialization.
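
In code, that routing seam looks roughly like the sketch below. Type and method names follow the diagrams that follow; the quantized and nil-cache branches are abbreviated, and the actual Scripts/patches/AttentionUtils.swift patch may differ in signatures and detail.

```swift
import MLX
import MLXFast

// Rough sketch of the dispatch described above; illustrative, not the actual patch.
func attentionWithCacheUpdate(
    queries: MLXArray, keys: MLXArray, values: MLXArray,
    cache: KVCache?, scale: Float,
    mask: MLXFast.ScaledDotProductAttentionMaskMode
) -> MLXArray {
    if let turbo = cache as? TurboQuantKVCacheProtocol {
        // A single new token means decode; a longer update means prefill.
        return queries.dim(2) == 1
            ? turbo.decodeAttention(queries: queries, keys: keys, values: values,
                                    scale: scale, mask: mask)
            : turbo.prefillAttention(queries: queries, keys: keys, values: values,
                                     scale: scale, mask: mask)
    }
    // Quantized caches would be routed through updateQuantized plus
    // quantizedScaledDotProductAttention here (see the sequence diagram below).
    let (cachedKeys, cachedValues) = cache?.update(keys: keys, values: values)
        ?? (keys, values)
    return MLXFast.scaledDotProductAttention(
        queries: queries, keys: cachedKeys, values: cachedValues,
        scale: scale, mask: mask)
}
```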

Sequence diagram for TurboQuant-aware attention dispatch

sequenceDiagram
    participant Caller
    participant AttentionUtils as attentionWithCacheUpdate
    participant KV as KVCache
    participant TurboKV as TurboQuantKVCacheProtocol
    participant QuantKV as QuantizedKVCacheProtocol
    participant FastAttn as MLXFast
    participant QuantAttn as quantizedScaledDotProductAttention

    Caller->>AttentionUtils: attentionWithCacheUpdate(queries, keys, values, cache, scale, mask)
    alt cache is nil
        AttentionUtils->>FastAttn: scaledDotProductAttention(queries, keys, values, scale, mask)
        FastAttn-->>AttentionUtils: output
        AttentionUtils-->>Caller: output
    else cache is TurboQuantKVCacheProtocol
        AttentionUtils->>TurboKV: queries.dim(2) == 1?
        alt decode path
            AttentionUtils->>TurboKV: decodeAttention(queries, keys, values, scale, mask)
            TurboKV-->>AttentionUtils: output
        else prefill path
            AttentionUtils->>TurboKV: prefillAttention(queries, keys, values, scale, mask)
            TurboKV-->>AttentionUtils: output
        end
        AttentionUtils-->>Caller: output
    else cache is QuantizedKVCacheProtocol
        AttentionUtils->>QuantKV: updateQuantized(keys, values)
        QuantKV-->>AttentionUtils: quantizedKeys, quantizedValues
        AttentionUtils->>QuantAttn: quantizedScaledDotProductAttention(queries, quantizedKeys, quantizedValues, scale, mask, groupSize, bits, mode)
        QuantAttn-->>AttentionUtils: output
        AttentionUtils-->>Caller: output
    else generic KVCache
        AttentionUtils->>KV: update(keys, values)
        KV-->>AttentionUtils: cachedKeys, cachedValues
        AttentionUtils->>FastAttn: scaledDotProductAttention(queries, cachedKeys, cachedValues, scale, mask)
        FastAttn-->>AttentionUtils: output
        AttentionUtils-->>Caller: output
    end

Sequence diagram for TurboQuantKVCache packed update and fallback attention

sequenceDiagram
    participant Model
    participant TurboKV as TurboQuantKVCache
    participant CodecK as TurboQuantMSECodec(keys)
    participant CodecV as TurboQuantMSECodec(values)
    participant FastAttn as MLXFast

    Model->>TurboKV: decodeAttention(queries, keys, values, scale, mask)
    activate TurboKV
    TurboKV->>TurboKV: fallbackAttention(queries, keys, values, scale, mask)
    TurboKV->>TurboKV: update(keys, values)
    note over TurboKV: ensureCodecs(keyDim, valueDim)
    TurboKV->>CodecK: quantize(keys)
    CodecK-->>TurboKV: keyStateUpdate
    TurboKV->>CodecV: quantize(values)
    CodecV-->>TurboKV: valueStateUpdate
    TurboKV->>TurboKV: appendPackedState(keyState, keyStateUpdate, previousOffset)
    TurboKV->>TurboKV: appendPackedState(valueState, valueStateUpdate, previousOffset)
    TurboKV->>TurboKV: appendShadow(keys, values, previousOffset)
    TurboKV->>TurboKV: denseState()
    TurboKV-->>TurboKV: cachedKeys, cachedValues
    TurboKV->>FastAttn: scaledDotProductAttention(queries, cachedKeys, cachedValues, scale, mask)
    FastAttn-->>TurboKV: attentionOutput
    TurboKV-->>Model: attentionOutput
    deactivate TurboKV

Updated class diagram for TurboQuantKVCache and TurboQuant codecs

classDiagram
    class KVCache {
        <<protocol>>
        +offset: Int
        +state: [MLXArray]
        +metaState: [String]
        +update(keys: MLXArray, values: MLXArray) MLXArray MLXArray
        +truncateToOffset()
    }

    class TurboQuantKVCacheProtocol {
        <<protocol>>
        +configuration: TurboQuantConfiguration
        +decodeAttention(queries: MLXArray, keys: MLXArray, values: MLXArray, scale: Float, mask: MLXFast.ScaledDotProductAttentionMaskMode) MLXArray
        +prefillAttention(queries: MLXArray, keys: MLXArray, values: MLXArray, scale: Float, mask: MLXFast.ScaledDotProductAttentionMaskMode) MLXArray
    }

    class QuantizedKVCacheProtocol {
        <<protocol>>
        +groupSize: Int
        +bits: Int
        +mode: Int
        +updateQuantized(keys: MLXArray, values: MLXArray) MLXArray MLXArray
    }

    class BaseKVCache {
        +offset: Int
        +state: [MLXArray]
        +metaState: [String]
        +innerState() [MLXArray]
        +update(keys: MLXArray, values: MLXArray) MLXArray MLXArray
        +truncateToOffset()
    }

    class TurboQuantMSEState {
        +norms: MLXArray
        +indices: MLXArray
    }

    class TurboQuantMSECodec {
        +dim: Int
        +bits: Int
        +useRHT: Bool
        +signs: MLXArray?
        +codebook: MLXArray
        +midpoints: [Float]
        +TurboQuantMSECodec(dim: Int, bits: Int, seed: Int)
        +quantize(vectors: MLXArray) TurboQuantMSEState
        +dequantize(state: TurboQuantMSEState) MLXArray
        -rotateForward(array: MLXArray) MLXArray
        -rotateInverse(array: MLXArray) MLXArray
    }

    class TurboQuantConfiguration {
        +bits: Float
        +variant: TurboQuantVariant
        +metadataPath: String?
        +metadataVersion: Int
        +transformVersion: String
        +codebookVersion: String
    }

    class TurboQuantKVCache {
        +configuration: TurboQuantConfiguration
        +step: Int
        +didGrow: Bool
        -keyState: TurboQuantMSEState?
        -valueState: TurboQuantMSEState?
        -shadowKeys: MLXArray?
        -shadowValues: MLXArray?
        -legacyDenseState: (keys: MLXArray, values: MLXArray)?
        -keyCodec: TurboQuantMSECodec?
        -valueCodec: TurboQuantMSECodec?
        -keyDimension: Int?
        -valueDimension: Int?
        +innerState() [MLXArray]
        +update(keys: MLXArray, values: MLXArray) MLXArray MLXArray
        +decodeAttention(queries: MLXArray, keys: MLXArray, values: MLXArray, scale: Float, mask: MLXFast.ScaledDotProductAttentionMaskMode) MLXArray
        +prefillAttention(queries: MLXArray, keys: MLXArray, values: MLXArray, scale: Float, mask: MLXFast.ScaledDotProductAttentionMaskMode) MLXArray
        +truncateToOffset()
        +toUnquantized() KVCacheSimple
        +state: [MLXArray]
        +metaState: [String]
        +debugDescription: String
        -ensureCodecs(keyDim: Int, valueDim: Int)
        -appendShadow(keys: MLXArray, values: MLXArray, previous: Int)
        -appendPackedState(current: TurboQuantMSEState?, update: TurboQuantMSEState, previous: Int)
        -rebuildFromDenseState(keys: MLXArray, values: MLXArray)
        -rehydrateShadowFromPackedState()
        -denseState() MLXArray MLXArray
        -fallbackAttention(queries: MLXArray, keys: MLXArray, values: MLXArray, scale: Float, mask: MLXFast.ScaledDotProductAttentionMaskMode) MLXArray
    }

    class KVCacheSimple {
        +state: [MLXArray]
    }

    KVCache <|-- BaseKVCache
    BaseKVCache <|-- TurboQuantKVCache
    TurboQuantKVCacheProtocol <|.. TurboQuantKVCache
    QuantizedKVCacheProtocol <|.. KVCache
    TurboQuantConfiguration <.. TurboQuantKVCache
    TurboQuantMSECodec <.. TurboQuantKVCache
    TurboQuantMSEState <.. TurboQuantMSECodec
    TurboQuantMSEState <.. TurboQuantKVCache
    KVCacheSimple <.. TurboQuantKVCache

File-Level Changes

Add shared attention utility that routes to TurboQuant, quantized, or standard attention based on cache type and token count.
  • Introduce attentionWithCacheUpdate helper that wraps MLXFast.scaledDotProductAttention and cache updates.
  • Detect TurboQuantKVCacheProtocol and call its decodeAttention or prefillAttention depending on sequence length.
  • Route QuantizedKVCacheProtocol through quantizedScaledDotProductAttention using cache-provided quantization parameters.
  • Fall back to generic KVCache.update plus dense MLXFast.scaledDotProductAttention when cache is neither TurboQuant nor quantized.
  Files: Scripts/patches/AttentionUtils.swift, Scripts/apply-mlx-patches.sh
Redesign TurboQuantKVCache to store packed low-bit norms/indices as primary state with dense shadow keys/values for runtime fallback.
  • Add TurboQuantMSECodec, TurboQuantMSEState, packing/unpacking helpers, and table caches to build and reuse codebooks, sign vectors, and midpoints.
  • Split fractional kvBits asymmetrically between keys and values using floor/ceil and validate allowed bit-widths.
  • Change TurboQuantKVCache to maintain keyState/valueState (packed), shadowKeys/shadowValues (dense), legacyDenseState for backward compatibility, and per-dimension codecs.
  • Update update(), state, metaState, truncateToOffset, and toUnquantized to operate on packed state while reconstructing dense views on demand and supporting both legacy 2-array and new 4-array serialized formats.
  • Add TurboQuantKVCacheProtocol decodeAttention and prefillAttention methods that currently delegate to a dense fallbackAttention implemented via MLXFast.scaledDotProductAttention.
  Files: Scripts/patches/KVCache.swift
Extend TurboQuant tests to cover packed state, dense round-tripping, and attention dispatch routing for TurboQuant caches.
  • Add FakeTurboQuantCache implementing TurboQuantKVCacheProtocol that records decode/prefill calls and returns sentinel outputs.
  • Update TurboQuant cache serialization tests to expect 4-array packed state (norms and indices for keys/values) and verify toUnquantized produces dense tensors close to the originals.
  • Add tests that attentionWithCacheUpdate routes single-token decode through decodeAttention and multi-token prefill through prefillAttention on TurboQuant caches.
  Files: Tests/MacLocalAPITests/TurboQuantCacheTests.swift
Document TurboQuant packed codec state and attention dispatch design and how they fit in the TurboQuant rollout.
  • Add a design doc describing TurboQuant packed MSE-style state, asymmetric key/value bit allocation, and dense shadow buffers as runtime helpers (a conceptual sketch follows below).
  • Add a design doc describing explicit TurboQuant attention dispatch via cache-specific decode/prefill entry points and their role as a seam for future optimizations.
  Files: docs/feature-codex-turboquant-codecs.md, docs/feature-codex-turboquant-attention.md
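
As a rough picture of what those notes describe, the packed codec state can be thought of as per-vector norms plus low-bit codebook indices, from which a dense view is rebuilt on demand. The struct and function names below are placeholders mirroring the class diagram in the reviewer's guide; rotation (RHT), bit packing, and shape handling are omitted.

```swift
import MLX

// Conceptual sketch only: per-vector norms plus codebook indices are the
// serialized source of truth; a dense view is rebuilt on demand for attention.
struct SketchPackedState {
    var norms: MLXArray    // one scale per cached vector
    var indices: MLXArray  // codebook index per vector component (left unpacked here)
}

func sketchDequantize(_ state: SketchPackedState, codebook: MLXArray) -> MLXArray {
    // Look up each component's codebook entry, then rescale by the vector's norm.
    let unit = take(codebook, state.indices, axis: 0)
    return unit * expandedDimensions(state.norms, axis: -1)
}
```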


scouzi1966 marked this pull request as ready for review April 5, 2026 00:49

sourcery-ai Bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The low-bit pack/unpack helpers (turboQuantPackLowBit / turboQuantUnpackLowBit) currently loop over positions and perform per-index MLXArray slicing and assignment, which will be quite slow for long sequences; consider restructuring them into vectorized operations or a dedicated kernel-style helper so the bit-packing work scales better (a sketch of such a restructuring follows this list).
  • In TurboQuantKVCache.truncateToOffset(), only keyState/valueState and shadow tensors are trimmed; if legacyDenseState is present it remains unmodified and will be re-quantized to the full length on the next access, which can be surprising—either trim legacyDenseState as well or make it explicit that truncate is a no-op when running in dense-compatibility mode.
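
One possible shape for the vectorized restructuring suggested in the first point is sketched below; the function name, the 2-bit/4-per-byte layout, and the multiply-and-sum formulation (instead of shifts) are illustrative assumptions, not the helper's actual design.

```swift
import MLX

// Hypothetical vectorized packing of 2-bit codes, four per byte, replacing a
// per-position slicing loop. Assumes the last axis is divisible by 4 and the
// index values are 0...3; a real helper would also cover other bit-widths.
func packTwoBitCodesVectorized(_ indices: MLXArray) -> MLXArray {
    let shape = indices.shape
    let grouped = indices.asType(.int32)
        .reshaped(Array(shape.dropLast()) + [shape.last! / 4, 4])
    // Weight each slot by its place value (1, 4, 16, 64) and sum the group:
    // arithmetically the same as shift-and-OR, but expressed as array ops.
    let weights = MLXArray([1, 4, 16, 64].map { Int32($0) })
    return (grouped * weights).sum(axis: -1).asType(.uint8)
}
```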
Individual Comments

Comment 1: Scripts/apply-mlx-patches.sh, lines 22-24

Code context:

+PATCH_FILES=("Qwen3VL.swift" "Qwen3Next.swift" "GatedDelta.swift" "Qwen3_5MoE.swift" "DeepseekV3.swift" "MiniMaxM2.swift" "NemotronH.swift" "GLM4MoeLite.swift" "GLM5MoeDsa.swift" "KimiK25.swift" "Gemma4Text.swift" "Gemma4VLM.swift" "LLMModelFactory.swift" "Load.swift" "Evaluate.swift" "LanguageModel.swift" "Tokenizer.swift" "AttentionUtils.swift" "Qwen3_5MoEVL.swift" "VLMModelFactory.swift" "SamplerTests.swift" "ToolCallFormat.swift" "KVCache.swift" "SwitchLayers.swift" "BatchKVCache.swift" "SSM.swift" "Chat.swift" "Gemma4FunctionParser.swift")

Issue (bug_risk): AttentionUtils.swift is added to PATCH_FILES/TARGET_PATHS but not to NEW_FILES, which may break patching on a clean tree.

On a clean checkout where this file doesn't exist yet, the script won't pre-create it, so `patch` may fail. Please add `AttentionUtils.swift` to NEW_FILES to keep the arrays consistent and ensure patching works from a clean tree.


Outdated comment thread on Scripts/apply-mlx-patches.sh
