STOMP M6: load tests + lived-experience writeup #21

Merged
samuelSavanovic merged 1 commit into main from stomp-m6-load-tests on May 10, 2026

@samuelSavanovic
Owner

Summary

  • Adds benchmarks/stomp/ — Python STOMP load harness (fanout, slow, queue, soak), driver script with per-phase fresh broker and RSS sampling, a fanout cell sweep, and Gem-level microbenches isolating deep-copy fanout cost and gen_server.loop survival.
  • Appends "Milestone 6: lived experience" to examples/stomp_broker/NOTES.md with quantitative findings, a sketch of the backpressure design space, and notes on what the language pushed back on.
  • Bumps the deep non-tail recursion ceiling in docs/ROADMAP.md from P2 to P1, as this is the first user hit.

Headline finding

The STOMP broker dies at exactly 175 frames per connection writer. Cause: in connection.gem, writer_loop ↔ handle_frame ↔ handle_<command> ↔ writer_loop is mutual tail recursion across three functions. The compiler applies TCO only to direct self-recursive tail calls (per the existing mark_process_tail GFP pass), so each cycle consumes ~1.4 KB of C stack, and minicoro's 256 KB stack is exhausted after ~175 cycles. A bare gen_server microbench handles 500+ casts cleanly, which isolates the bug to the broker's writer-loop pattern.
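The arithmetic behind the cliff can be sketched with a small Python model. This is not the code in connection.gem: the three functions only mirror the shape of the cycle, the stack is simulated with a byte counter, and the constants are the rounded figures quoted above. `writer_loop_flat` models what a direct self-recursive loop looks like once TCO has turned it into iteration.

```python
STACK_BUDGET_BYTES = 256 * 1024    # minicoro stack size quoted above
BYTES_PER_CYCLE = int(1.4 * 1024)  # ~1.4 KB per cycle (rounded figure from above)


class StackOverflow(Exception):
    """Raised by the model when the simulated stack budget is exhausted."""


def writer_loop(frames_left, used=0, handled=0):
    # Hypothetical stand-in for the broker's writer loop. No call in the
    # cycle is a direct self-call, so the model never reclaims stack.
    if frames_left == 0:
        return handled
    used += BYTES_PER_CYCLE
    if used > STACK_BUDGET_BYTES:
        raise StackOverflow(handled)
    return handle_frame(frames_left, used, handled)


def handle_frame(frames_left, used, handled):
    return handle_command(frames_left, used, handled)


def handle_command(frames_left, used, handled):
    # Tail call back into the loop: this closes the three-function cycle.
    return writer_loop(frames_left - 1, used, handled + 1)


def writer_loop_flat(frames_left):
    # Constant stack use: the shape a TCO'd direct self-recursive loop takes.
    handled = 0
    while frames_left > 0:
        frames_left -= 1
        handled += 1
    return handled


if __name__ == "__main__":
    try:
        writer_loop(500)
    except StackOverflow as exc:
        print("model overflows after", exc.args[0], "frames")
    print("flat loop handles", writer_loop_flat(500), "frames")
```

With these rounded inputs the model overflows a little after 180 frames, in the same range as the observed 175. The gap is plausibly explained by stack already in use when the loop starts and by the per-cycle figure being approximate, but that is an assumption, not a measurement.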

This dominates milestone 6 — the deep-copy fanout cost and slow-consumer OOM that the tutorial flags are both masked by hitting the recursion cliff first.

Working envelope (below the cliff)

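| n_subs | n_msgs | body (B) | delivered | publish msg/s | fanout msg/s | p100 first-msg |
|--------|--------|----------|-----------|---------------|--------------|----------------|
| 500    | 50     | 256      | 25,000    | 89k           | 54k          | 45 ms          |
| 500    | 50     | 1024     | 25,000    | 106k          | 54k          | 37 ms          |
| 100    | 100    | 256      | 10,000    | 141k          | 46k          | 4 ms           |
| 200    | 100    | 256      | 20,000    | 64k           | 59k          | 27 ms          |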
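The delivered column is consistent with every subscriber receiving every message. A small Python sketch that re-derives it and estimates fanout wall time from the rounded rates (the rows are copied from the table above; the function names are illustrative and not part of the harness):

```python
# (n_subs, n_msgs, delivered, fanout_msgs_per_sec), copied from the table above.
ROWS = [
    (500, 50, 25_000, 54_000),
    (500, 50, 25_000, 54_000),
    (100, 100, 10_000, 46_000),
    (200, 100, 20_000, 59_000),
]


def expected_deliveries(n_subs, n_msgs):
    # Full fanout: each of n_msgs published messages reaches each subscriber.
    return n_subs * n_msgs


def approx_fanout_seconds(delivered, fanout_rate):
    # Rough wall time implied by the rounded fanout rate in the table.
    return delivered / fanout_rate


if __name__ == "__main__":
    for n_subs, n_msgs, delivered, rate in ROWS:
        ok = expected_deliveries(n_subs, n_msgs) == delivered
        secs = approx_fanout_seconds(delivered, rate)
        print(f"{n_subs:>4} subs x {n_msgs:>3} msgs: full fanout={ok}, ~{secs:.2f} s")
```

Because the rates are rounded to the nearest thousand, the derived times are only approximate.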

Microbench: deep-copy fanout cost

Per-send cost is ~2 µs flat across body sizes 64–1024 B at K up to 800 subscribers. The cost is the table structure, not the body bytes. Body size starts to show only at 4 KB (~3 µs), so small-message workloads pay full per-frame overhead.
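As a back-of-the-envelope reading of these numbers, a flat per-send cost means the copy cost of one published message grows linearly with the subscriber count K. A minimal Python sketch, assuming the rounded per-send figures above and ignoring everything else on the delivery path (socket writes, scheduling):

```python
PER_SEND_US_SMALL = 2.0  # ~2 µs per send for 64-1024 B bodies (figure quoted above)
PER_SEND_US_4KB = 3.0    # ~3 µs per send at 4 KB (figure quoted above)


def fanout_cost_us(n_subscribers, per_send_us=PER_SEND_US_SMALL):
    # Copy cost of fanning one published message out to every subscriber.
    return n_subscribers * per_send_us


if __name__ == "__main__":
    for k in (100, 200, 500, 800):
        print(f"K={k}: ~{fanout_cost_us(k):.0f} µs per published message")
```

Under this assumption, K = 800 costs roughly 1.6 ms of copying per published message whether the body is 64 B or 1024 B.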

Test plan

  • make test — all examples + LSP smoke tests pass
  • bash benchmarks/stomp/run.sh runs all four phases end-to-end
  • PHASES=fanout_small bash benchmarks/stomp/run.sh works as a smoke test
  • bash benchmarks/stomp/sweep.sh reproduces the cliff
  • build/gem benchmarks/stomp/microbench_gen_server.gem runs to completion (500 casts, no overflow)
  • build/gem benchmarks/stomp/microbench_fanout_one.gem 500 1024 200 produces clean numbers

Adds benchmarks/stomp/ — Python STOMP harness (fanout/slow/queue/soak),
shell driver with per-phase fresh broker + RSS sampling, fanout cell
sweep, and pure-Gem microbenches for deep-copy fanout cost and
gen_server.loop survival.

Headline finding: the broker dies at exactly 175 frames per connection
writer, due to the writer_loop ↔ handle_frame ↔ handle_<command> ↔
writer_loop mutual tail recursion (not direct self-recursion, so not
TCO'd) overflowing the 256 KB minicoro stack. Bumps the
non-tail-recursion-ceiling roadmap entry from P2 to P1 — first user hit.

Working envelope: 25k MESSAGE deliveries at 50–55k msg/s, p100 first-msg
latency 35–45 ms at 500 subs. Slow-consumer mailbox growth observed but
masked by the recursion cliff. Queue round-robin fair to ratio 0.977
right up to the death point. Microbench shows per-send cost is ~2 µs
flat across body sizes 64–1024 B at K up to 800 — fanout cost is in the
table structure, not the body bytes.

Reproduce: bash benchmarks/stomp/run.sh

samuelSavanovic enabled auto-merge (squash) on May 10, 2026 13:15
samuelSavanovic merged commit a690bfb into main on May 10, 2026
2 checks passed