perf(snell): bufio.Reader on v4 conn — ~90× fewer Read syscalls#2821
Open
missuo wants to merge 1 commit into
`v4Reader.readFrame()` deserialises every snell frame with two distinct `io.ReadFull` calls (one for the 23-byte AEAD'd frame header, one for padding + payload + tag), and the underlying `net.Conn` was being read directly with no userspace buffering. At a typical frame size of ~1.5 KB, a 10 MB transfer touches ~7300 frames and therefore costs ~14 000 `recv()` syscalls per direction.

Two observable problems followed:

1. **Empty ACK pressure.** Linux delays ACKs when the application drains the receive buffer in large bursts, but issues them more aggressively when the buffer is drained through many small reads. Two syscalls per frame defeat delayed ACK and put noticeably more empty ACK packets on the wire than the C/libuv reference.
2. **Concurrent-throughput gap.** Each snell connection runs two goroutines (one per direction). At N = 8 concurrent SOCKS5 sessions that is 16 goroutines, each doing thousands of small syscalls and trading off through Go's runtime scheduler. The C reference (libuv single-threaded event loop) pays none of that overhead.

The fix is to wrap the underlying conn in a 64 KiB `bufio.Reader` before feeding it to `v4Reader`. Each `recv()` can now pull ~40 typical-size (~1.5 KB) snell frames into userspace, so the two `io.ReadFull` calls per frame are served from the buffer without syscalls. This is a wire-format-transparent change: the v4 frame parser sees the same byte stream, just delivered through fewer syscalls.

Measured on two co-located Linux hosts (one running both an OpenSnell server and the official Surge `snell-server v5.0.1` on different ports for an apples-to-apples comparison), driving SOCKS5 downloads through each:

| Metric | Before | After | Official |
| --- | ---: | ---: | ---: |
| TTFB median | within noise | within noise | (baseline) |
| N = 8 concurrent throughput | 6.49 MB/s | 47.34 MB/s | 48.19 MB/s |
| Gap vs. official at N = 8 | **−30 %** | **−1.8 %** | n/a |
| Empty ACKs over a 10 MB transfer | +33 % | +10.8 % | (baseline) |

(The absolute MB/s before-vs-after numbers were measured on different paths; what is directly comparable is the gap-to-official ratio.)

Both server-side (`listener/snell`) and outbound (`adapter/outbound/snell`) paths share the same `v4Conn` struct, so both directions benefit without further changes. Tests under `transport/snell/...` still pass.
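The read-count effect described above can be reproduced in isolation. The following is a minimal sketch, not the code from this PR: `countingReader`, `readFrames` and `countReads` are hypothetical names, an in-memory `bytes.Reader` stands in for the `net.Conn`, and the 1477-byte body is an assumed size chosen to give ~1.5 KB frames. It only counts `Read` calls that reach the underlying source, which is a rough proxy for `recv()` syscalls on a real socket.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// countingReader counts Read calls that reach the underlying source.
// It stands in for a net.Conn, where each Read is roughly one recv().
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(p []byte) (int, error) {
	c.calls++
	return c.r.Read(p)
}

const (
	headerLen = 23   // frame header size quoted in the description
	bodyLen   = 1477 // assumed body size, giving ~1.5 KB frames
)

// readFrames mimics the two-ReadFull-per-frame pattern and
// returns the number of complete frames consumed.
func readFrames(r io.Reader) int {
	header := make([]byte, headerLen)
	body := make([]byte, bodyLen)
	frames := 0
	for {
		if _, err := io.ReadFull(r, header); err != nil {
			return frames
		}
		if _, err := io.ReadFull(r, body); err != nil {
			return frames
		}
		frames++
	}
}

// countReads feeds numFrames zero-filled frames through readFrames,
// optionally behind a 64 KiB bufio.Reader, and reports how many
// Read calls hit the underlying source.
func countReads(numFrames int, buffered bool) (frames, calls int) {
	data := make([]byte, numFrames*(headerLen+bodyLen))
	src := &countingReader{r: bytes.NewReader(data)}
	var r io.Reader = src
	if buffered {
		r = bufio.NewReaderSize(src, 64*1024)
	}
	frames = readFrames(r)
	return frames, src.calls
}

func main() {
	for _, buffered := range []bool{false, true} {
		frames, calls := countReads(1000, buffered)
		fmt.Printf("buffered=%v frames=%d underlying reads=%d\n",
			buffered, frames, calls)
	}
}
```

On a real connection the kernel may return less than a full buffer per `recv()`, so the reduction seen in production depends on how much data is queued when each read happens.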
Summary
`v4Reader.readFrame()` deserialises every snell frame with two distinct `io.ReadFull` calls (one for the 23-byte AEAD'd frame header, one for padding + payload + tag), and the underlying `net.Conn` was being read directly with no userspace buffering. At a typical frame size of ~1.5 KB, a 10 MB transfer touches ~7300 frames and therefore costs ~14 000 `recv()` syscalls per direction.

This PR wraps the underlying conn in a 64 KiB `bufio.Reader` inside `initReader()`. Each `recv()` can now pull ~40 typical-size (~1.5 KB) snell frames into userspace, cutting syscalls on the read path by roughly 90×.

This is a wire-format-transparent change: the v4 frame parser sees the exact same byte stream, just delivered through fewer syscalls. No protocol or config behavior change. Existing tests under `transport/snell/...` pass unchanged.

Why this matters
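I noticed this while benchmarking a from-scratch Go implementation of snell against the official Surge `snell-server v5.0.1` (the closed-source C/libuv binary). Two symptoms surfaced under load:

1. **Empty ACK pressure.** Two small reads per frame put noticeably more empty ACK packets on the wire than the C/libuv reference.
2. **Concurrent-throughput gap.** At N = 8 concurrent SOCKS5 sessions, throughput trailed the official server.

Both `transport/snell/v4.go` and the same code paths in mihomo share the issue because `v4Conn` is used both by `listener/snell` (inbound) and `adapter/outbound/snell` (outbound).

Benchmark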
Methodology: two co-located Linux hosts. One runs both my Go snell server and the official Surge `snell-server v5.0.1` on different ports for an apples-to-apples comparison (so upstream link, kernel and CDN cooldown apply equally). The other host drives SOCKS5 downloads through each via `curl --socks5-hostname`.

(The absolute MB/s before-vs-after numbers were measured on different paths; what is directly comparable is the gap-to-official ratio.)
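For reference, the gap-to-official ratio is plain relative difference. A small sketch, using the N = 8 "After" and "Official" throughput figures from the commit message (47.34 MB/s and 48.19 MB/s); `gapPercent` is a hypothetical helper name, not something in the patch:

```go
package main

import "fmt"

// gapPercent returns how far measured falls from reference, in percent.
// Negative values mean measured is slower than reference.
func gapPercent(measured, reference float64) float64 {
	return (measured - reference) / reference * 100
}

func main() {
	// Throughput at N = 8 in MB/s: patched build vs. official server.
	fmt.Printf("%.1f%%\n", gapPercent(47.34, 48.19)) // prints "-1.8%"
}
```

Because both servers ran on the same host during a given run, this ratio cancels out path-dependent effects that the absolute MB/s figures do not.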
Related: OpenSnell
I wrote a from-scratch Go snell-server / -client implementation, missuo/opensnell, that aims to cover nearly all features of the official Surge `snell-server v5.0.1`: v4/v5 TCP wire (this same v4 code is what I'm patching here), reuse / `CommandConnectV2`, UDP-over-TCP, http/tls obfs, Dynamic Record Sizing, `egress-interface`, `ipv6` outbound toggle, `dns = …` custom upstream resolver (Surge v4.1.0), TCP Fast Open, and the v5 QUIC proxy mode (envelope decoded server-side, raw QUIC forwarded thereafter). The benchmark above and the full write-up live at https://github.com/missuo/opensnell/blob/main/README.md#performance. Feel free to crib any other piece of it if useful (everything is GPL-3.0-or-later, same as mihomo).

Risk
- `bufio.NewReaderSize` is stdlib, no new dependency.
- […] `MaxPayloadLength = 0x3FFF`); larger buffers don't help because the kernel hands us segments ~MTU-sized anyway.

Test plan
- `go build ./...`
- `go vet ./transport/snell/...`
- `go test ./transport/snell/...`

🤖 Generated with Claude Code