perf(snell): bufio.Reader on v4 conn — ~90× fewer Read syscalls by missuo · Pull Request #2821 · MetaCubeX/mihomo

missuo · 2026-05-22T09:35:50Z

Summary

v4Reader.readFrame() deserialises every snell frame with two distinct io.ReadFull calls — one for the 23-byte AEAD'd frame header, one for padding + payload + tag — and the underlying net.Conn was being read directly with no userspace buffering. At a typical frame size of ~1.5 KB, a 10 MB transfer touches ~7300 frames and therefore costs ~14 000 recv() syscalls per direction.

This PR wraps the underlying conn in a 64 KiB bufio.Reader inside initReader(). Each recv() can now pull ~40 typical-size (~1.5 KB) snell frames into userspace, so both io.ReadFull calls per frame are served from the buffer and read-path syscalls drop by roughly 90×.
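The figures above can be reproduced with a few lines of arithmetic. The sketch below assumes frames of exactly 1500 bytes and a 10 MB transfer of exactly 10,000,000 bytes, so its results land near, not on, the rounded numbers quoted in this PR; header and tag overhead on a max-payload frame is ignored. All function names are illustrative and not taken from mihomo.

```go
package main

import "fmt"

const (
	transferBytes   = 10 * 1000 * 1000 // 10 MB transfer from the summary
	frameBytes      = 1500             // "~1.5 KB" frame, rounded (assumption)
	readsPerFrame   = 2                // header read + body read
	bufferBytes     = 64 * 1024        // bufio.Reader size chosen by the PR
	maxPayloadBytes = 0x3FFF           // MaxPayloadLength quoted in the PR
)

// unbufferedReads estimates Read calls for the whole transfer without bufio.
func unbufferedReads() int {
	return transferBytes / frameBytes * readsPerFrame
}

// framesPerFill is how many frames of the given size one full buffer can hold.
func framesPerFill(frameSize int) int {
	return bufferBytes / frameSize
}

// bestCaseReduction is the syscall ratio when every recv() fills the buffer.
func bestCaseReduction() int {
	return framesPerFill(frameBytes) * readsPerFrame
}

func main() {
	fmt.Println("unbuffered reads per 10 MB:", unbufferedReads())                      // 13332
	fmt.Println("typical frames per 64 KiB fill:", framesPerFill(frameBytes))          // 43
	fmt.Println("max-payload frames per 64 KiB fill:", framesPerFill(maxPayloadBytes)) // 4
	fmt.Println("best-case reduction factor:", bestCaseReduction())                    // 86
}
```

The reduction factor is a best case: a real recv() returns only what the kernel has queued, so a fill is often smaller than 64 KiB.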

This is a wire-format-transparent change — the v4 frame parser sees the exact same byte stream, just delivered through fewer syscalls. No protocol or config behavior change. Existing tests under transport/snell/... pass unchanged.
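To make the effect concrete, here is a minimal simulation, not the mihomo code. It assumes a hypothetical `countingReader` standing in for `net.Conn`, where each `Read` call models one `recv()`, and a hypothetical `drainFrames` that mimics the two `io.ReadFull` calls per frame on fixed 1500-byte frames. The in-memory source always fills the buffer, so this is the best case.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

const (
	headerLen = 23   // frame header size quoted in the PR
	bodyLen   = 1477 // hypothetical body size, giving 1500-byte frames
)

// countingReader stands in for a net.Conn; each Read models one recv() syscall.
type countingReader struct {
	r     io.Reader
	calls int
}

func (c *countingReader) Read(p []byte) (int, error) {
	c.calls++
	return c.r.Read(p)
}

// drainFrames mimics readFrame's two io.ReadFull calls until a clean EOF.
func drainFrames(r io.Reader) (int, error) {
	header := make([]byte, headerLen)
	body := make([]byte, bodyLen)
	frames := 0
	for {
		if _, err := io.ReadFull(r, header); err == io.EOF {
			return frames, nil // stream ended on a frame boundary
		} else if err != nil {
			return frames, err
		}
		if _, err := io.ReadFull(r, body); err != nil {
			return frames, err
		}
		frames++
	}
}

// readCalls drains n frames and reports how many Read calls reached the source.
func readCalls(n int, buffered bool) int {
	src := &countingReader{r: bytes.NewReader(make([]byte, n*(headerLen+bodyLen)))}
	var r io.Reader = src
	if buffered {
		r = bufio.NewReaderSize(src, 64*1024)
	}
	if got, err := drainFrames(r); err != nil || got != n {
		panic(fmt.Sprintf("parsed %d of %d frames: %v", got, n, err))
	}
	return src.calls
}

func main() {
	direct, buffered := readCalls(1000, false), readCalls(1000, true)
	fmt.Printf("direct=%d buffered=%d\n", direct, buffered)
}
```

The parser (`drainFrames`) is identical in both runs; only the reader handed to it changes, which is why the change is transparent to the wire format.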

Why this matters

I noticed this while benchmarking a from-scratch Go implementation of snell against the official Surge snell-server v5.0.1 (the closed-source C/libuv binary). Two symptoms surfaced under load:

1. Empty ACK pressure. Linux delays ACKs when the application drains the receive buffer in large bursts, but issues them more aggressively when the buffer is drained through many small reads. Two syscalls per frame defeats delayed-ACK and puts noticeably more empty ACK packets on the wire than the C reference.
2. Concurrent-throughput gap. Each snell connection runs two goroutines (one per direction). At N=8 concurrent SOCKS5 sessions that is 16 goroutines, each doing thousands of small syscalls and trading off through Go's runtime scheduler. The C reference (libuv single-threaded event loop) pays none of that overhead.

Both the inbound (listener/snell) and outbound (adapter/outbound/snell) paths in mihomo are affected, because they share the same v4Conn from transport/snell/v4.go. Fixing it there benefits both directions without further changes.

Benchmark

Methodology: two co-located Linux hosts. One runs both my Go snell server and the official Surge snell-server v5.0.1 on different ports for an apples-to-apples comparison (so upstream link, kernel and CDN cooldown apply equally). The other host drives SOCKS5 downloads through each via curl --socks5-hostname.

| Metric | Before | After | Official C reference |
| --- | ---: | ---: | ---: |
| TTFB median | within noise | within noise | (baseline) |
| N = 8 concurrent throughput | 6.49 MB/s | 47.34 MB/s | 48.19 MB/s |
| Gap vs. official at N = 8 | −30 % | −1.8 % | — |
| Empty ACKs over a 10 MB transfer | +33 % | +10.8 % | (baseline) |

(The absolute MB/s before-vs-after numbers were measured on different paths; what is directly comparable is the gap-to-official ratio.)

Related: OpenSnell

I wrote a from-scratch Go snell-server / -client implementation, missuo/opensnell, that aims to cover nearly all features of the official Surge snell-server v5.0.1 — v4/v5 TCP wire (this same v4 code is what I'm patching here), reuse / CommandConnectV2, UDP-over-TCP, http/tls obfs, Dynamic Record Sizing, egress-interface, ipv6 outbound toggle, dns = … custom upstream resolver (Surge v4.1.0), TCP Fast Open, and the v5 QUIC proxy mode (envelope decoded server-side, raw QUIC forwarded thereafter). The benchmark above and the full write-up live at https://github.com/missuo/opensnell/blob/main/README.md#performance — feel free to crib any other piece of it if useful (everything is GPL-3.0-or-later, same as mihomo).

Risk

Wire-format transparent: bufio preserves byte ordering exactly.
bufio.NewReaderSize is stdlib, no new dependency.
64 KiB is sized to hold ~40 typical ~1.5 KB frames, or about four max-payload frames (MaxPayloadLength = 0x3FFF); larger buffers don't help because the kernel hands us segments ~MTU-sized anyway.
The salt read happens before wrapping with bufio, so we don't accidentally over-consume during the salt step.

Test plan

go build ./...
go vet ./transport/snell/...
go test ./transport/snell/...
(Maintainer) verify the snell listener / outbound test path under your usual e2e setup

🤖 Generated with Claude Code


missuo wants to merge 1 commit into MetaCubeX:Alpha from missuo:perf-snell-bufio
