perf(snell): bufio.Reader on v4 conn — ~90× fewer Read syscalls#2821

Open
missuo wants to merge 1 commit into MetaCubeX:Alpha from missuo:perf-snell-bufio

@missuo missuo commented May 22, 2026

Summary

v4Reader.readFrame() deserialises every snell frame with two distinct io.ReadFull calls — one for the 23-byte AEAD'd frame header, one for padding + payload + tag — and the underlying net.Conn was being read directly with no userspace buffering. At a typical frame size of ~1.5 KB, a 10 MB transfer touches ~7300 frames and therefore costs ~14 000 recv() syscalls per direction.

This PR wraps the underlying conn in a 64 KiB bufio.Reader inside initReader(). A single recv() can now pull up to ~40 typical-size (~1.5 KB) snell frames into userspace, cutting syscalls on the read path by roughly 90×.

This is a wire-format-transparent change — the v4 frame parser sees the exact same byte stream, just delivered through fewer syscalls. No protocol or config behavior change. Existing tests under transport/snell/... pass unchanged.

Why this matters

I noticed this while benchmarking a from-scratch Go implementation of snell against the official Surge snell-server v5.0.1 (the closed-source C/libuv binary). Two symptoms surfaced under load:

  1. Empty ACK pressure. Linux delays ACKs when the application drains the receive buffer in large bursts, but issues them more aggressively when the buffer is drained through many small reads. Two syscalls per frame defeats delayed-ACK and puts noticeably more empty ACK packets on the wire than the C reference.
  2. Concurrent-throughput gap. Each snell connection runs two goroutines (one per direction). At N=8 concurrent SOCKS5 sessions that is 16 goroutines, each doing thousands of small syscalls and trading off through Go's runtime scheduler. The C reference (libuv single-threaded event loop) pays none of that overhead.

The issue lives in transport/snell/v4.go and affects both directions in mihomo, because v4Conn is used by both listener/snell (inbound) and adapter/outbound/snell (outbound).

Benchmark

Methodology: two co-located Linux hosts. One runs both my Go snell server and the official Surge snell-server v5.0.1 on different ports for an apples-to-apples comparison (so upstream link, kernel and CDN cooldown apply equally). The other host drives SOCKS5 downloads through each via curl --socks5-hostname.

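| Metric                           | Before       | After        | Official C reference |
| -------------------------------- | -----------: | -----------: | -------------------: |
| TTFB median                      | within noise | within noise | (baseline)           |
| N = 8 concurrent throughput      | 6.49 MB/s    | 47.34 MB/s   | 48.19 MB/s           |
| Gap vs. official at N = 8        | −30 %        | −1.8 %       | -                    |
| Empty ACKs over a 10 MB transfer | +33 %        | +10.8 %      | (baseline)           |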

(The absolute MB/s before-vs-after numbers were measured on different paths; what is directly comparable is the gap-to-official ratio.)

Related: OpenSnell

I wrote a from-scratch Go snell-server / -client implementation, missuo/opensnell, that aims to cover nearly all features of the official Surge snell-server v5.0.1 — v4/v5 TCP wire (this same v4 code is what I'm patching here), reuse / CommandConnectV2, UDP-over-TCP, http/tls obfs, Dynamic Record Sizing, egress-interface, ipv6 outbound toggle, dns = … custom upstream resolver (Surge v4.1.0), TCP Fast Open, and the v5 QUIC proxy mode (envelope decoded server-side, raw QUIC forwarded thereafter). The benchmark above and the full write-up live at https://github.com/missuo/opensnell/blob/main/README.md#performance — feel free to crib any other piece of it if useful (everything is GPL-3.0-or-later, same as mihomo).

Risk

  • Wire-format transparent: bufio preserves byte ordering exactly.
  • bufio.NewReaderSize is stdlib, no new dependency.
  • 64 KiB holds ~40 typical ~1.5 KB frames, or about 4 frames at the maximum payload (MaxPayloadLength = 0x3FFF, i.e. 16 383 bytes); larger buffers don't help because the kernel hands us segments ~MTU-sized anyway.
  • The salt read happens before wrapping with bufio, so we don't accidentally over-consume during the salt step.

Test plan

  • go build ./...
  • go vet ./transport/snell/...
  • go test ./transport/snell/...
  • (Maintainer) verify the snell listener / outbound test path under your usual e2e setup

🤖 Generated with Claude Code

`v4Reader.readFrame()` deserialises every snell frame with two distinct
`io.ReadFull` calls — one for the 23-byte AEAD'd frame header, one for
padding + payload + tag — and the underlying `net.Conn` was being read
directly with no userspace buffering. At a typical frame size of ~1.5
KB, a 10 MB transfer touches ~7300 frames and therefore costs ~14 000
`recv()` syscalls per direction.

Two observable problems followed:

1. **Empty ACK pressure.** Linux delays ACKs when the application
   drains the receive buffer in large bursts, but issues them more
   aggressively when the buffer is drained through many small reads.
   Two syscalls per frame defeats delayed-ACK and puts noticeably
   more empty ACK packets on the wire than the C/libuv reference.

2. **Concurrent-throughput gap.** Each snell connection runs two
   goroutines (one per direction). At N=8 concurrent SOCKS5 sessions
   that is 16 goroutines, each doing thousands of small syscalls and
   trading off through Go's runtime scheduler. The C reference (libuv
   single-threaded event loop) pays none of that overhead.

The fix is to wrap the underlying conn in a 64 KiB `bufio.Reader`
before feeding it to `v4Reader`. A single `recv()` can now pull up to
~40 typical-size (~1.5 KB) snell frames into userspace, so the two
ReadFull calls per frame are served from the buffer without syscalls.
This is a wire-format-transparent change: the v4 frame parser sees the
same byte stream, just delivered through fewer syscalls.

Measured on two co-located Linux hosts (one running both an OpenSnell
server and the official Surge `snell-server v5.0.1` on different ports
for an apples-to-apples comparison), with the other host driving
SOCKS5 downloads through each:

| Metric                                     | Before        | After         | Official      |
| ------------------------------------------ | ------------: | ------------: | ------------: |
| TTFB median                                | within noise  | within noise  | (baseline)    |
| N = 8 concurrent throughput                | 6.49 MB/s     | 47.34 MB/s    | 48.19 MB/s    |
| Gap vs. official at N = 8                  | **−30 %**     | **−1.8 %**    | —             |
| Empty ACKs over a 10 MB transfer           | +33 %         | +10.8 %       | (baseline)    |

(The absolute MB/s before-vs-after numbers were measured on different
paths; what is directly comparable is the gap-to-official ratio.)

Both server-side (`listener/snell`) and outbound (`adapter/outbound/snell`)
paths share the same `v4Conn` struct, so both directions benefit
without further changes. Tests under `transport/snell/...` still pass.