Skip to content

fix: guard reset path against concurrent overwrites in ETS and Atomic backends#184

Merged
epinault merged 1 commit into
ExHammer:feature/fix-window-per-keyfrom
vittorio-reinaudo:feature/fix-window-per-key
May 14, 2026
Merged

fix: guard reset path against concurrent overwrites in ETS and Atomic backends#184
epinault merged 1 commit into
ExHammer:feature/fix-window-per-keyfrom
vittorio-reinaudo:feature/fix-window-per-key

Conversation

@vittorio-reinaudo

Copy link
Copy Markdown
Contributor

Fix concurrent window-reset race in fix_window_per_key (ETS and Atomic)

Background

PR #183 introduced :fix_window_per_key, a per-key fixed-window algorithm that anchors each key's window to its first hit rather than a shared wall-clock epoch. The feature itself is sound; this PR addresses a correctness bug in the reset path that appears when multiple processes hit the same key at the moment its window expires.


The race condition

ETS backend — hit/5 and inc/4

When a key's window has expired (or the key is absent), hit/5 falls through to:

_ ->
  new_expires_at = now + scale
  :ets.insert(table, {key, increment, new_expires_at})   # ← not conditional
  {:allow, increment}

:ets.insert/2 is an unconditional overwrite. If N processes all call :ets.lookup and observe expires_at ≤ now before any of them completes the insert, each one will write {key, 1, new_expires_at} and the last write wins. All N callers receive {:allow, 1}, but the stored counter is 1 — N−1 hits are silently dropped.

The same pattern exists in inc/4.

Atomic backend — do_hit/5 and inc/4

The reset path in do_hit uses two separate atomic operations:

:atomics.put(atomic, 1, increment)        # reset counter  ← not conditional
:atomics.exchange(atomic, 2, now + scale) # publish window

These are not atomic together. N processes that all read expires_at ≤ now before any exchange completes will each call put(1, increment), overwriting one another. The last put wins and the counter is increment while N {:allow, increment} responses have already been returned.

Additionally, inc/4 in the Atomic backend used :ets.insert (unconditional) instead of :ets.insert_new for the initial atomics-ref creation, matching a separate but related race already correctly avoided in hit/5.

Why :fix_window does not have this problem

Hammer.ETS.FixWindow and Hammer.Atomic.FixWindow use {key, window_number} as the ETS key. A new window is a genuinely new key; the active-window update_counter / add_get path is reached on the very first hit. There is no "reset in place" of an existing key — the problem is specific to :fix_window_per_key.


Fix

ETS — insert_new + select_replace + retry

Replace the bare :ets.insert with a three-step CAS-style sequence:

1. :ets.insert_new   → wins if the key does not exist yet (atomic)
2. :ets.select_replace with guard expires_at < now
                     → wins if the key exists but is expired (atomic)
3. recurse           → someone else claimed the window; next call lands in
                       the active-window update_counter path

Both :ets.insert_new and :ets.select_replace are ETS operations that guarantee at most one writer succeeds. Exactly one process resets the counter; all others fall back to the update_counter path once the window is live.

Atomic — compare_exchange + reset-lock sentinel

Replace the non-atomic put + exchange pair with a CAS on the expires_at slot using a sentinel value (@reset_lock = 0xFFFFFFFFFFFFFFFF, the unsigned 64-bit max — unreachable by any real timestamp):

1. compare_exchange(expires_at, old_val, @reset_lock)
   → exactly one process wins ownership of the reset
2. Winner: put(counter, increment), put(expires_at, now + scale)
3. Losers that read @reset_lock: spin-retry until step 2 completes,
   then use add_get on the now-active window
4. Losers whose CAS fails for any other reason: recurse immediately

While @reset_lock is set, no other process can enter the add_get path (the cond checks == @reset_lock before > now). This guarantees the winner's put(counter) is never overwritten by a stale peer's put before the real expires_at is published.


Files changed

File Change
lib/hammer/ets/fix_window_per_key.ex hit/5, inc/4: replace :ets.insert with insert_new + select_replace + recurse
lib/hammer/atomic/fix_window_per_key.ex do_hit/5: CAS with @reset_lock sentinel; inc/4: same CAS pattern + fix insertinsert_new for atomics-ref creation
test/hammer/ets/fix_window_per_key_test.exs Add "concurrent expiry reset" describe: 200-task stress test asserting get == allows after a synchronized expiry-boundary hit
test/hammer/atomic/fix_window_per_key_test.exs Same stress test for the Atomic backend

Tests

The new "concurrent expiry reset" tests in both files follow the same structure:

  1. Pre-plant an expired entry so every task enters the reset path.
  2. Spawn 200 tasks, all blocked on a receive :go barrier.
  3. Release all at once to maximise overlap at the reset boundary.
  4. Assert get(table, key, scale) == total_allows — any under-count reveals a lost reset.

The stress tests pass consistently with the fix in place. The invariant they protect (counter == allows) cannot be satisfied by the unfixed code under concurrent expiry pressure on multi-core schedulers.

All 131 existing tests continue to pass.


Notes for reviewers

  • The @reset_lock sentinel is safe: now is milliseconds since epoch (~1.7 × 10¹²); now + scale for any realistic scale never approaches 2⁶⁴ − 1 ≈ 1.8 × 10¹⁹.
  • The retry loops (do_hit calling itself, inc calling itself) terminate: after the winner publishes now + scale, all spinners read a valid expires_at > now and proceed via add_get. There is no livelock — each reset event has exactly one winner.
  • Hammer.ETS.FixWindow and Hammer.Atomic.FixWindow are not touched; their {key, window} keying strategy does not have this class of bug.

@epinault epinault merged commit 1fc8302 into ExHammer:feature/fix-window-per-key May 14, 2026
epinault added a commit that referenced this pull request May 18, 2026
* Add :fix_window_per_key algorithm for ETS and Atomic backends

A fixed-window variant whose window is anchored to each key's first hit
instead of a globally-aligned wall-clock epoch. Keeps the same
one-entry-per-key memory profile as :fix_window. The 2x boundary burst
is still possible per key, but boundaries are no longer globally
synchronized, so they cannot be exploited deterministically. Same
semantics as the common Redis INCR + EXPIRE NX pattern.

Closes #181.

* fix(fix_window_per_key): guard reset path against concurrent overwrites in ETS and Atomic backends (#184)

* refactor: extract helpers to fix credo nesting depth warnings

- Add allow_or_deny/3 to ETS and Atomic FixWindowPerKey backends
- Extract do_inc/4 in Atomic backend, mirroring do_hit/5

---------

Co-authored-by: vittorio-reinaudo <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants