Skip to content

Fix CAS false-success via per-segment generation in a 64-bit token#24

Merged
brayniac merged 3 commits into
pelikan-io:mainfrom
brayniac:brayniac/cas-generation-token
Jun 11, 2026
Merged

Fix CAS false-success via per-segment generation in a 64-bit token#24
brayniac merged 3 commits into
pelikan-io:mainfrom
brayniac:brayniac/cas-generation-token

Conversation

@brayniac

Copy link
Copy Markdown
Contributor

Summary

The current CAS token is the item's packed location truncated to u32. This has a false-success (ABA) case:

  • Truncating the 44-bit location to u32 aliases locations whose segment ids differ by a multiple of 4096.
  • Even at full width, a recycled segment can place the same key at the same offset, so a stale token matches a value the client never observed and the CAS falsely succeeds.

This follows the discussion on #23: rather than reintroducing a side counter (which becomes a two-word protocol once operations are concurrent), keep the location-derived token and fix its defects the way crucible's CasToken does.

Changes

  • Add a 16-bit generation counter to SegmentHeader (fits in existing padding; header stays 64 bytes), bumped in reset() — called only from push_free, which all recycle paths (eviction, merge, S3-FIFO, expiration, clear) funnel through.
  • Port crucible's CasToken: a 60-bit token in u64 — generation in bits 59..44, location in bits 43..0. The token remains the content of the single hashtable slot word plus quasi-static segment metadata, preserving the single-word atomic swap path for future concurrent CAS.
  • Widen Item::cas() and Segcache::cas() from u32 to u64, matching the memcache protocol's 64-bit "cas unique".

Testing

  • New cas_stale_token_rejected_after_segment_recycle test deterministically reproduces the ABA scenario (clear() recycles the segment LIFO; reinserting the same key lands at the identical 44-bit location) and asserts the stale token now fails with Exists. Precondition asserts verify the same-location setup, so the test fails loudly rather than silently not exercising ABA if free-queue ordering changes.
  • Verified the test bites: with the generation bump removed, stale and fresh tokens are bit-identical.
  • Ported crucible's CasToken layout tests; added a header generation unit test.
  • cargo test --workspace, --features debug, loom suite (8/8 models), clippy --all-targets --all-features -- -D warnings, and fmt --check are all green; fuzz target updated and compiles.

🤖 Generated with Claude Code

brayniac and others added 3 commits June 11, 2026 08:03
Add a 16-bit generation counter to SegmentHeader, bumped each time a
segment is returned to the free queue (SegmentHeader::reset, called
only from Segments::push_free, which all recycle paths funnel
through). This will let CAS tokens distinguish different lifetimes of
the same location, fixing a false-success ABA case where a recycled
segment places the same key at the same offset.

The field fits in existing padding (26 -> 24 bytes); the header stays
64 bytes. Mirrors the generation field in crucible's segment headers.

Co-Authored-By: Claude Fable 5 <[email protected]>
Replace the u32 location-truncated CAS token with a 64-bit token
combining the 44-bit packed location with the segment's 16-bit
generation counter (bits 59..44), ported from crucible's CasToken.

The old token had a false-success case: (a) truncating the 44-bit
location to u32 aliases locations whose seg_ids differ by a multiple
of 4096, and (b) even at full width, a recycled segment can place the
same key at the same offset, letting a stale token match a value the
client never observed. The generation counter, bumped on every
segment recycle, distinguishes lifetimes of the same location.

The 64-bit width also matches the memcache protocol's "cas unique".

Co-Authored-By: Claude Fable 5 <[email protected]>
Deterministically reproduce the segment-recycle ABA: insert a key,
capture its token, clear() the cache (push_free prepends the freed
segment to the free queue and bumps its generation), then reinsert
the same key — it lands at the identical 44-bit location. The stale
token must now fail with Exists; with location-only tokens it falsely
succeeded. Precondition asserts verify the test really reproduces the
same-location scenario, so it fails loudly if free-queue ordering
ever changes.

Verified the test bites: with the generation bump removed, stale and
fresh tokens are identical (the false-success case).

Also widen the fuzz target's cas argument to u64.

Co-Authored-By: Claude Fable 5 <[email protected]>
@thinkingfish

Copy link
Copy Markdown
Member

Neat design. LGTM

@brayniac brayniac merged commit 868705f into pelikan-io:main Jun 11, 2026
4 checks passed
thinkingfish pushed a commit to pelikan-io/pelikan that referenced this pull request Jun 12, 2026
Upstream cache-rs fixed the CAS false-success regression by encoding a
per-segment generation into a 64-bit token (pelikan-io/cache-rs#24).
Adopt the new commit:

- Drop the u32 narrowing casts in entrystore now that the storage cas()
  method takes the full u64 token from the memcache protocol.
- Rework the 'cas stored' integration test to fetch the real cas token
  via gets instead of assuming tokens are a counter starting at 1; the
  token is opaque to clients under the new location+generation scheme.

https://claude.ai/code/session_01MbdtNhjhyFVtsrPQF6wJjJ
thinkingfish added a commit to pelikan-io/pelikan that referenced this pull request Jun 14, 2026
* Update dependencies and address security advisories

- Run cargo update to bring all transitive dependencies to their latest
  semver-compatible versions. This resolves three rustls-webpki
  vulnerabilities (RUSTSEC-2026-0098, RUSTSEC-2026-0099,
  RUSTSEC-2026-0104) by moving to 0.103.13, and clears the rand
  unsoundness warning (RUSTSEC-2026-0097) via rand 0.8.6.
- Replace the unmaintained rustls-pemfile crate (RUSTSEC-2025-0134)
  with the PEM parsing built into rustls-pki-types.
- Keep the segcache git dependency pinned at 5d1d870b: the newer
  upstream commit regresses memcache CAS semantics (returns EXISTS
  where STORED is expected), failing the segcache integration tests.

cargo audit now reports no vulnerabilities or warnings.

https://claude.ai/code/session_01MbdtNhjhyFVtsrPQF6wJjJ

* Update segcache to upstream with CAS generation-token fix

Upstream cache-rs fixed the CAS false-success regression by encoding a
per-segment generation into a 64-bit token (pelikan-io/cache-rs#24).
Adopt the new commit:

- Drop the u32 narrowing casts in entrystore now that the storage cas()
  method takes the full u64 token from the memcache protocol.
- Rework the 'cas stored' integration test to fetch the real cas token
  via gets instead of assuming tokens are a counter starting at 1; the
  token is opaque to clients under the new location+generation scheme.

https://claude.ai/code/session_01MbdtNhjhyFVtsrPQF6wJjJ

* Bump remaining transitive dependencies to latest patch releases

Pick up patch/minor updates that became available since the last bump:
cc, js-sys, memchr, smallvec, wasip2, the wasm-bindgen family, web-sys,
and zeroize. cache-rs is already at the latest main commit (868705f).

cargo audit remains clean; full workspace test suite passes.

* Bump segcache to cache-rs PR #25 (reader-pinned segments)

Pin the segcache git dependency to the head of cache-rs PR #25
(745246571a7a67738f6db73021b5609d5d6cc2d7), which adds reader-pin RAII
guards and drain-but-defer-free handling so pinned segments are not
recycled while readers hold references.

The change is internal to segcache; the only caller-visible effect is
that Item::cas() now returns u64 directly, so drop the now-useless
.into() widening conversions in the memcache entrystore.

This is pinned to an unmerged PR head; re-point to the cache-rs default
branch once PR #25 lands on main.

Full workspace tests pass; cargo audit and clippy are clean.

* Move segcache from git dependency to crates.io 0.3.0

cache-rs published segcache 0.3.0 (and keyvalue 0.2.0) to crates.io,
incorporating the reader-pinned segment work. Replace the git+rev pin
with the released crate so the build no longer depends on an unmerged
PR head or a git checkout.

Both segcache and its transitive keyvalue dependency now resolve from
the registry. No source changes required; full workspace tests pass,
cargo audit and clippy are clean.

---------

Co-authored-by: Claude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants