celt+silk: migrate hot kernels to Go simd/archsimd (vanilla wins + silk widen + gated complex IMDCT/MDCT)#487
thesyncim wants to merge 25 commits into
Add an archsimd implementation of scaleFloat32IntoNEON (dst[i]=src[i]*gain), gated on goexperiment.simd && !purego and compiled only under the Go tip toolchain. amd64 scales eight lanes per VMULPS through the 256-bit Float32x8 behind an X86.AVX() guard, with a 128-bit Float32x4 tail and fallback; arm64 uses the 128-bit Float32x4 NEON path. Every lane is a bare per-lane multiply, the same single-rounding product as the scalar reference and the hand asm, so the result is bit-exact. The arm64 NEON asm and the portable scalar fallback stay as the default for every non-experiment build and as the A/B baseline; their tags gain !goexperiment.simd so the three paths partition cleanly across every arch/purego/experiment config. TestScaleFloat32IntoBitExact asserts Float32bits equality against the scalar reference on whichever path the build selects. BenchmarkScaleInto times the kernel against that reference. A non-blocking CI job installs the tip toolchain and builds, parity-checks, and benchmarks the archsimd path on an amd64 runner. On arm64 (M-series) the hand NEON asm stays fastest and remains the default; archsimd trails it but beats scalar, and on amd64 it brings the first SIMD path to a kernel that was scalar-only there.
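The bit-exact claim above can be illustrated without the experimental toolchain. The following is a minimal sketch in plain Go, not the PR's code: `scaleRef`, `scaleUnrolled4` and `bitExact` are hypothetical names, and the unrolled loop only stands in for a vector path. It shows the parity pattern the commit describes, comparing `math.Float32bits` of each element rather than using a tolerance.

```go
package main

import (
	"fmt"
	"math"
)

// scaleRef is a scalar reference: one float32 multiply (one rounding) per element.
// dst must be at least as long as src.
func scaleRef(dst, src []float32, gain float32) {
	for i := range src {
		dst[i] = src[i] * gain
	}
}

// scaleUnrolled4 stands in for a SIMD path (hypothetical): four elements per
// step, then a scalar tail. Every element is still a single bare multiply.
func scaleUnrolled4(dst, src []float32, gain float32) {
	n := len(src)
	i := 0
	for ; i+4 <= n; i += 4 {
		dst[i+0] = src[i+0] * gain
		dst[i+1] = src[i+1] * gain
		dst[i+2] = src[i+2] * gain
		dst[i+3] = src[i+3] * gain
	}
	for ; i < n; i++ {
		dst[i] = src[i] * gain
	}
}

// bitExact reports whether two slices match bit for bit.
func bitExact(a, b []float32) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if math.Float32bits(a[i]) != math.Float32bits(b[i]) {
			return false
		}
	}
	return true
}

func main() {
	src := []float32{0.1, -0.25, 3.5, 1e-20, -7, 0.3333, 42, 9.75, 1e20}
	want := make([]float32, len(src))
	got := make([]float32, len(src))
	scaleRef(want, src, 0.7)
	scaleUnrolled4(got, src, 0.7)
	fmt.Println("bit-exact:", bitExact(want, got))
}
```

Comparing bit patterns also distinguishes +0 from -0 and treats identical NaN payloads as equal, which a `==` comparison on floats would not.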
… arches in CI
Add scripts/bench-simd-kernel.sh: runs a kernel benchmark under the default toolchain (arm64 NEON asm, amd64/purego scalar) and under gotip+GOEXPERIMENT=simd (archsimd), then benchstats them. The dispatch compiles one path per build, so the comparison is cross-binary on one host; the always-compiled scalar Ref benchmark anchors both runs. Make test-simd-experiment a 2-arch matrix (ubuntu-latest amd64 + ubuntu-24.04-arm arm64) that runs the script, so the archsimd path is measured against the current hand asm on arm64 and against the scalar fallback on amd64, and arm64 archsimd parity now runs in CI too. Still non-blocking. On arm64 (M4 Max) the NEON asm stays well ahead of archsimd (benchstat geomean +128% sec/op for archsimd); on amd64 archsimd is the only SIMD path and beats scalar 2-7x.
Port stereoMergeRescaleNEON (l=mid*x; x=lgain*(l-y); y=rgain*(l+y)) to archsimd, gated goexperiment.simd && !purego. amd64 processes 8 lanes per step through the 256-bit Float32x8 behind an X86.AVX() guard with a 128-bit Float32x4 tail and fallback; arm64 uses the 128-bit Float32x4 NEON path. Mul, Sub and Add are distinct archsimd ops — objdump confirms separate FMUL/FSUB/FADD on arm64 and VMULPS/VSUBPS/VADDPS (no VFMADD/VFMSUB) on amd64 — so each lane keeps the two-rounding shape of the scalar noFMA32 reference and the hand asm, and the result is bit-exact. The arm64 asm and the scalar fallback stay the default for non-experiment builds; their tags gain !goexperiment.simd so the three paths partition per config. TestStereoMergeRescaleBitExact asserts Float32bits equality vs the scalar reference on whichever path the build selects; BenchmarkStereoMerge feeds the existing asm-vs-archsimd benchstat harness. On arm64 (M4 Max) archsimd is competitive — faster at N16/N64, behind the asm at N176/N480 (geomean +11% sec/op) — so the asm stays the arm64 default; on amd64 it adds the first SIMD path.
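The two-rounding shape matters because the Go compiler is allowed to fuse `x*y + z` into a single FMA on some targets. A minimal scalar sketch of the formula in this commit, assuming a hypothetical name `stereoMergeRescaleRef` (this is not the repository's noFMA32 helper): the explicit `float32(...)` conversion forces the product to be rounded before the add or subtract, which the Go spec guarantees prevents fusion.

```go
package main

import "fmt"

// stereoMergeRescaleRef applies l=mid*x; x=lgain*(l-y); y=rgain*(l+y) in place.
// y must be at least as long as x.
func stereoMergeRescaleRef(x, y []float32, mid, lgain, rgain float32) {
	for i := range x {
		// The explicit conversion rounds the product to float32, so the
		// compiler may not fuse it with the following Sub/Add.
		l := float32(mid * x[i])
		r := y[i]
		x[i] = lgain * (l - r)
		y[i] = rgain * (l + r)
	}
}

func main() {
	x := []float32{2, 4}
	y := []float32{1, -1}
	stereoMergeRescaleRef(x, y, 0.5, 2, 3)
	fmt.Println(x, y)
}
```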
…xact, faster than NEON)
Port celtInnerProd8FMA32 to a 4-lane archsimd Float32x4 accumulator with fused MulAdd (FMLA on arm64, VFMADD on amd64), gated goexperiment.simd && !purego. Lane L sums elements L,L+4,L+8,… and the reduction is (acc0+acc2)+(acc1+acc3) with a scalar fused tail, which is the exact order of the scalar reference and the hand asm, so the result is bit-exact (proven by the existing TestCeltInnerProd8FMA32MatchesReference). Four lanes is mandatory: a wider Float32x8 accumulator would build a different partial-sum tree and diverge. Unlike the elementwise kernels this stays 4-wide on amd64 too (bit-exactness), and since amd64 MulAdd lowers to VFMADD unconditionally it gates on archsimd.X86.FMA() with a scalar fallback. arm64 NEON always has FMLA. The asm, scalar default and arm64 purego paths stay the default for non-experiment builds; their tags gain !goexperiment.simd where needed so the paths partition per config. Being compute-bound rather than memory-bound, archsimd here beats the hand NEON asm on arm64 (M4 Max benchstat: N64 -16%, N176 -15%, geomean -11% sec/op) while staying bit-exact, and brings a 4-lane FMA SIMD path to amd64 where it was scalar.
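The accumulation order described above can be written out in scalar Go. This is a sketch under stated assumptions, not the repository's reference: `innerProd4Lane` and `fma32` are hypothetical names, and `fma32` goes through `math.FMA` in float64 and then rounds to float32, which can differ from a true single-rounding float32 FMA in rare double-rounding cases.

```go
package main

import (
	"fmt"
	"math"
)

// fma32 approximates a float32 fused multiply-add (assumption: float64 FMA
// followed by a rounding to float32; not guaranteed identical to FMLA/VFMADD).
func fma32(a, b, c float32) float32 {
	return float32(math.FMA(float64(a), float64(b), float64(c)))
}

// innerProd4Lane mirrors the lane layout from the commit message:
// lane L accumulates elements L, L+4, L+8, ...; the lanes are reduced as
// (acc0+acc2)+(acc1+acc3); leftover elements are folded in by a scalar tail.
func innerProd4Lane(x, y []float32) float32 {
	n := len(x)
	if len(y) < n {
		n = len(y)
	}
	var acc [4]float32
	i := 0
	for ; i+4 <= n; i += 4 {
		for l := 0; l < 4; l++ {
			acc[l] = fma32(x[i+l], y[i+l], acc[l])
		}
	}
	sum := (acc[0] + acc[2]) + (acc[1] + acc[3])
	for ; i < n; i++ {
		sum = fma32(x[i], y[i], sum)
	}
	return sum
}

func main() {
	x := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9}
	y := []float32{2, 2, 2, 2, 2, 2, 2, 2, 2}
	fmt.Println(innerProd4Lane(x, y))
}
```

An 8-lane accumulator would pair different elements in each partial sum, so its rounding errors would land differently; that is the divergence the commit message refers to.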
…nd bench
Extend the test-simd-experiment matrix to parity-check and benchstat all three migrated kernels (scale_into, stereo_merge, inner_prod) against the default build, not just scale_into.
…, ~matches asm)
Port xcorrKernel4Float32Neon4Acc to four phase-parallel archsimd Float32x4 MulAdd chains (one per sample in each 4-sample block, lane k = lag k), gated goexperiment.simd && !purego on arm64. Phase p fuses x[p] into y[p:p+4]; phases combine as (acc0+acc1)+(acc2+acc3) with a scalar-order tail, which is the exact order of xcorrKernel4Float32FourAccRef, so it is bit-exact (proven by the existing TestXcorrKernel4Float32Neon4AccBitExact). The const pitchXcorrUsesNeonFMA moves into the experiment file; the asm .go/.s gain !goexperiment.simd. amd64 keeps the scalar path (it has no 4-acc asm), so this is arm64-only. Being compute-bound, archsimd lands within ~2% of the hand NEON asm on arm64 (M4 Max benchstat geomean +1.7%) and is faster on the realistic coarse/half pitch searches (Coarse -7.7%, Half -3.9%); only tiny length-5 windows regress.
…hand asm
The archsimd scale_into was 3-5x slower than the NEON asm not because of the SIMD but because archsimd.LoadFloat32x4(src[i:])/.Store(dst[i:]) emit a slice length bounds check and panic path on every load and store, dominating this load/store-bound kernel. Loading through *[4]float32 views over advancing unsafe pointers (LoadFloat32x4Array/StoreArray) drops the checks; with an 8-wide unroll the archsimd loop now beats the hand NEON asm on M4 Max (N16 -17%, N64 -31%, N480 -21%) while staying bit-exact (TestScaleFloat32IntoBitExact). The pointers only advance while i+k <= n, so all accesses stay in range; empty slices skip every loop.
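The array-view idea can be shown in plain Go with no SIMD at all. The following is a sketch, assuming a hypothetical function `scaleViaArrayViews`; it is not the PR's `loadF32x4`/`storeF32x4` helpers. Each block is accessed through a `*[4]float32` derived from the slice's base pointer, so the constant indices 0..3 need no per-access slice bounds check. The block pointer is recomputed from the base only while `i+4 <= n`, so it never points past the backing array.

```go
package main

import (
	"fmt"
	"unsafe"
)

const f32Size = 4 // bytes per float32

// scaleViaArrayViews computes dst[i] = src[i]*gain over the common length,
// reading and writing four elements at a time through fixed-size array views.
func scaleViaArrayViews(dst, src []float32, gain float32) {
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	if n == 0 {
		return // empty slices skip every loop
	}
	sbase := unsafe.Pointer(unsafe.SliceData(src))
	dbase := unsafe.Pointer(unsafe.SliceData(dst))
	i := 0
	for ; i+4 <= n; i += 4 {
		// Only formed while elements i..i+3 exist, so both views are in range.
		s := (*[4]float32)(unsafe.Add(sbase, i*f32Size))
		d := (*[4]float32)(unsafe.Add(dbase, i*f32Size))
		d[0] = s[0] * gain
		d[1] = s[1] * gain
		d[2] = s[2] * gain
		d[3] = s[3] * gain
	}
	for ; i < n; i++ {
		dst[i] = src[i] * gain // checked scalar tail
	}
}

func main() {
	src := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	dst := make([]float32, len(src))
	scaleViaArrayViews(dst, src, 2)
	fmt.Println(dst)
}
```

The trade-off is the usual one for `unsafe`: the range argument that the compiler used to enforce now has to hold by construction, which is why the loop condition and the pointer computation are kept next to each other.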
…he hand asm
Apply the scale_into win to every archsimd kernel: stereo_merge, inner_prod and xcorr now load/store through raw advancing pointers via shared loadF32x4/storeF32x4 (and loadF32x8/storeF32x8 on amd64) instead of archsimd.LoadFloat32x4(s[i:]), which emits a slice bounds check and panic path on every access. The check machinery, not the SIMD, was what made archsimd several times slower than the hand asm. With the checks gone the archsimd kernels beat the hand NEON asm on M4 Max (combined benchstat geomean -21% sec/op, +19% bandwidth) while staying bit-exact; every parity test passes on arm64 archsimd and amd64 (SSE fallback under Rosetta). The helpers inline to a bare vector load/store; pointers only advance within range, so accesses stay valid and empty slices skip all loops.
Record the goexperiment.simd archsimd kernels as a measurement track toward replacing hand assembly with portable Go SIMD, bit-exact and (with bounds-check elimination) beating the asm on Apple Silicon; asm stays the released default.
… (bit-exact)
Port prefilterDualInnerProdAsm (sum1=<x,y1>, sum2=<x,y2>) to two 4-lane archsimd FMA accumulators loading through raw pointers (loadF32x4, no slice bounds checks), gated goexperiment.simd && !purego. Lane L sums elements L,L+4,L+8,… and each reduction is (a0+a2)+(a1+a3) with a scalar fused tail, which is the exact order of the scalar reference, so bit-exact (TestPrefilterDualInnerProdMatchesReference). amd64 gates on archsimd.X86.FMA() with a scalar fallback; arm64 NEON always has FMLA. The asm .go/.s gain !goexperiment.simd; default.go excludes amd64-under-experiment.
…m64)
Port l1AbsSumNeon to a 4-lane archsimd accumulator (Abs + Add) loading through raw pointers (loadF32x4, no slice bounds checks), gated goexperiment.simd && !purego on arm64. Lane k sums |tmp[k]|,|tmp[k+4]|,… and the reduction is (a0+a1)+(a2+a3)+tail, which is the exact order of l1AbsSumNeonReference, so bit-exact (TestL1AbsSumNeonBitExact). That order diverges from the scalar L1 sum by a few ULP (the arm64 quality-gated regime), so amd64/purego keep the scalar sum and this stays arm64-only. The asm .go/.s gain !goexperiment.simd.
The archsimd kernels match or beat the hand asm on Apple Silicon and Graviton, and the 4-way codec scoreboard shows Go SIMD as the closest Go build to libopus C.
4-way comparison: Decode half (CELT decode, gopus/libopus-C ratio, lower = closer to C)
Complements the Encode table in the PR description. Decode is where
Go SIMD wins on most decode configs (a couple noisy ties at 20x). Final CI green.
Evaluated: the portable scalable
scale_into is the one archsimd kernel that trailed the hand asm on Neoverse (+19%, the pure memory-copy case). Widen the arm64 unroll from 8 to 16 elements/iter (the width that profiled fastest in the optimization sweep) to amortize loop control on the overhead-bound core. M4 Max improves (N480 -20% vs asm, was -17%) and stays bit-exact; Graviton effect measured in CI.
scale_into Graviton holdout — closed via 16-wide unroll
All sizes are now within 8.6% of the asm or faster, inside the 10-15% replace-the-asm bar. M4 Max also
Updated verdict: with the holdout closed, archsimd is within-or-better than the
Re-examined the "blocked" kernels: root cause is archsimd's thinner arm64 backend
Pushed harder on whether more kernels can replace the asm. Findings:
Root cause: archsimd's arm64 backend is less complete than amd64 (doc: "currently supports
Port the constant-gain 5-tap comb filter (combFilterConstNeon) from hand Plan9 asm to Go simd/archsimd, gated goexperiment.simd && !purego. Two tap sums round as plain Adds and the three accumulates are fused MulAdds, matching combFilterConstValue and the asm bit-for-bit. Raw-pointer loads (loadF32x4/storeF32x4) skip the per-lane bounds check. The hand asm .go/.s retag !goexperiment.simd so the paths partition per build config.
Port the spreading-rotation pass (expRotation1PassNeon) from hand Plan9 asm to Go simd/archsimd, gated goexperiment.simd && !purego. Each block runs two rounded Muls and two fused MulAdds over the x[i] and x[i+stride] lanes, matching expRotationMac32 and the asm bit-for-bit; stride>=4 keeps the four lanes independent. The shared expRotation1StrideNeon wrapper and the expRotationUsesNeon flag stay in the umbrella arm64 file; the asm decl moves to a !goexperiment.simd file and the .s retags so the kernel symbol partitions per build. Parity holds and the Stride2 case runs ~12% faster than the asm.
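Why stride>=4 keeps the four lanes independent can be checked with a scalar model. This is a sketch, not the PR's kernel: `rotateSerial` and `rotateBlocked4` are hypothetical names, the rotation formula follows the usual one-pass form (`x[i+stride] = c*x2 + s*x1`, `x[i] = c*x1 - s*x2`), and every product is rounded separately here, whereas the PR describes two rounded Muls plus two fused MulAdds. The blocked version loads all eight inputs before storing, as a vector implementation would.

```go
package main

import "fmt"

// rotateSerial applies the rotation to one (x[i], x[i+stride]) pair at a time.
func rotateSerial(x []float32, stride int, c, s float32) {
	for i := 0; i+stride < len(x); i++ {
		x1, x2 := x[i], x[i+stride]
		x[i+stride] = float32(c*x2) + float32(s*x1)
		x[i] = float32(c*x1) - float32(s*x2)
	}
}

// rotateBlocked4 handles four pairs per step when stride >= 4. With a smaller
// stride, a pair in the block would read a value another pair in the same
// block has just written, so those cases fall through to the serial loop.
func rotateBlocked4(x []float32, stride int, c, s float32) {
	n := len(x) - stride // number of pairs
	i := 0
	if stride >= 4 {
		for ; i+4 <= n; i += 4 {
			var lo, hi [4]float32
			copy(lo[:], x[i:i+4])               // "vector load" of x[i..i+3]
			copy(hi[:], x[i+stride:i+stride+4]) // "vector load" of x[i+stride..]
			for k := 0; k < 4; k++ {
				x[i+stride+k] = float32(c*hi[k]) + float32(s*lo[k])
				x[i+k] = float32(c*lo[k]) - float32(s*hi[k])
			}
		}
	}
	for ; i < n; i++ {
		x1, x2 := x[i], x[i+stride]
		x[i+stride] = float32(c*x2) + float32(s*x1)
		x[i] = float32(c*x1) - float32(s*x2)
	}
}

func main() {
	a := make([]float32, 20)
	for i := range a {
		a[i] = float32(i) * 0.37
	}
	b := append([]float32(nil), a...)
	rotateSerial(a, 4, 0.6, 0.8)
	rotateBlocked4(b, 4, 0.6, 0.8)
	same := true
	for i := range a {
		same = same && a[i] == b[i]
	}
	fmt.Println("blocked matches serial:", same)
}
```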
Port the int16->float32 widening conversion (writeInt16AsFloat32Core) from hand NEON asm to simd/archsimd, gated arm64 && goexperiment.simd && !purego; retag the asm .go/.s !goexperiment.simd so the build partitions per config. The default (asm + purego) build is untouched. Each int16 is sign-extended to int32, converted to float32 and multiplied by the exact-in-float32 1/32768 scale -- the same per-element op sequence as the scalar reference, so it is bit-exact under both the experiment toolchain and normal go test. The low 4 lanes widen with ExtendLo4ToInt32 and the high 4 with HiToLo().ExtendLo4ToInt32() (arm64 has no ExtendHi4), the extra shift emulating SXTL2. Loads/stores go through raw advancing pointers (new simd_load_store.go int16/int32/f32 helpers) to drop the per-access slice bounds check. Measured: archsimd 22.97 ns/op vs asm 19.02 ns/op (+20.8%, n=6) -- the expected gap from the HiToLo shift; we take the coverage.
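The per-element op sequence named above (sign-extend to int32, convert to float32, multiply by 1/32768) looks like this in scalar Go. A minimal sketch with a hypothetical name, `int16ToFloat32Ref`, not the repository's reference function:

```go
package main

import "fmt"

// int16ToFloat32Ref widens each sample and scales it into [-1, 1).
// dst must be at least as long as src.
func int16ToFloat32Ref(dst []float32, src []int16) {
	// 1/32768 is 2^-15, exactly representable in float32, so the multiply
	// only adjusts the exponent and introduces no rounding error.
	const scale = float32(1.0 / 32768.0)
	for i, v := range src {
		dst[i] = float32(int32(v)) * scale
	}
}

func main() {
	src := []int16{-32768, 16384, 0, 32767}
	dst := make([]float32, len(src))
	int16ToFloat32Ref(dst, src)
	fmt.Println(dst)
}
```

Because every int16 fits in float32's 24-bit significand and the scale is a power of two, any implementation that performs these three steps per element produces the same bits, regardless of how the lanes are grouped.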
Port the float32->int16 narrowing conversion (floatToInt16ScaledCore plus the floatToInt16Scaled wrapper) from hand NEON asm to simd/archsimd, gated arm64 && goexperiment.simd && !purego; retag the asm .go/.s !goexperiment.simd so the build partitions per config. The default (asm + purego) build is untouched. The op sequence reproduces the FCVTNS+SQXTN asm and the scalar saturate-then-round-even path bit-for-bit: Mul(scale), Round() (VFRINTN, ties-to-even) so the value is an exact integer, ConvertToInt32() (VFCVTZS truncation is then lossless), and SaturateToInt16() (VSQXTN clamp to [-32768,32767]). Two int32 quads saturate separately and their low halves merge with a VZIP1 (Uint64x2.InterleaveLo), reproducing SQXTN2. TestFloatToInt16ScaledBitExact and ...Boundaries pass under both the experiment toolchain and normal go test. Measured: archsimd 23.87 ns/op vs asm 17.66 ns/op (+35.2%, n=6) -- the gap is Round+ConvertToInt32 (two ops) vs the asm's fused FCVTNS plus the VZIP1 merge vs free SQXTN2; we take the coverage.
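The scale, round-to-nearest-even, saturate sequence can be sketched in scalar Go as follows. `floatToInt16ScaledRef` is a hypothetical name, not the repository's function, and mapping NaN to 0 is an assumption made here so the sketch is fully defined.

```go
package main

import (
	"fmt"
	"math"
)

// floatToInt16ScaledRef converts src[i]*scale to int16 with ties-to-even
// rounding and saturation to [-32768, 32767]. dst must be at least as long
// as src.
func floatToInt16ScaledRef(dst []int16, src []float32, scale float32) {
	for i, v := range src {
		// The product is a float32; widening it to float64 is exact, so
		// RoundToEven sees the same value a float32 round-to-even would.
		r := math.RoundToEven(float64(v * scale))
		switch {
		case math.IsNaN(r):
			dst[i] = 0 // assumption of this sketch
		case r > 32767:
			dst[i] = 32767
		case r < -32768:
			dst[i] = -32768
		default:
			dst[i] = int16(r) // r is already an integer in range
		}
	}
}

func main() {
	src := []float32{0.5, 1.0, -1.0, -2.0}
	dst := make([]int16, len(src))
	floatToInt16ScaledRef(dst, src, 32768)
	fmt.Println(dst)
}
```

Since the clamp bounds are integers, rounding before saturating gives the same result as saturating before rounding, which is why the two orderings mentioned in the commit message can agree bit for bit.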
…4 patch)
Port the IMDCT pre-rotation and TDAC windowing kernels to archsimd, gated goexperiment.simd && !purego. Both rely on a full 4-lane reverse (reverse4) built on Float32x4.Reverse64 (the arm64 VREV64 op added in github.com/thesyncim/go branch arm64-simd-reverse64) for their descending spectrum/window accesses; they will not build on a vanilla gotip. Bit-exact vs the scalar references (TestIMDCTPreRotateFMA32KissMatchesScalar, TestIMDCTTDACWindowFMA32MatchesScalar). imdct_tdac (real-valued, reverse only) beats the hand asm on M4 Max (-15%); imdct_pre (complex re/im deinterleave) is +26%, the cost of archsimd lacking a VLD2-style deinterleaving load. The asm/purego paths stay the default for non-experiment builds.
Port the IMDCT post-rotation (imdctPostRotateF32FromKiss) from hand asm to Go simd/archsimd, gated goexperiment.simd && !purego (needs the Reverse64 op from the patched toolchain). Forward four come from a ConcatEven/Odd deinterleave of the kissCpx scratch; the backward four walk down and reverse4 so descending lanes line up; products are single-round Muls and accumulates fused MulAdds, matching mdctMulAddMix/mdctMulSubMix bit-for-bit. The asm .go/.s retag !goexperiment.simd. Benchmarks at asm parity.
Port the forward-MDCT post-twiddle (mdctPostTwiddleNeon) from hand asm to Go simd/archsimd, gated goexperiment.simd && !purego. Each block pairs a forward run and its mirror j=n4-1-i; ConcatEven/Odd deinterleaves the kissCpx, the two ends tile coeffs contiguously via reverse4 + InterleaveLo/Hi. Products are single-round Muls and the combines plain Sub/Add (no fusion), matching mdctMul and the scalar loop bit-for-bit. The asm .go/.s retag !goexperiment.simd.
Port the forward-MDCT middle fold (mdctMidFoldStoreNeon) from hand asm to Go simd/archsimd, gated goexperiment.simd && !purego. The (re,im) compute is vectorized four lanes at a time (ConcatEven deinterleave, descending re via reverse4, fused MulAdd, scaling Mul) matching mdctStoreDirectStageFMALike bit-for-bit; the bit-reversed dst[bitrev[i]] scatter stays scalar, exactly as the asm does. The asm .go/.s retag !goexperiment.simd.
Port the two windowed forward-MDCT fold kernels (mdctFold1StoreNeon, mdctFold3StoreNeon) from hand asm to Go simd/archsimd, gated goexperiment.simd && !purego. Each deinterleaves six sample/window streams (ConcatEven ascending, reverse4 descending) and combines them with fused MulAdds + single-round Muls matching mdctMulAddMixEncode/mdctMulSubMixEncode/mdctMulSubMixAlt; the shared twiddle/scale/bit-reversed scatter tail (mdctFoldStore) mirrors mid_fold. The asm .go/.s retag !goexperiment.simd.
The 7 complex IMDCT/MDCT archsimd kernels need Float32x4.Reverse64, which is not in upstream Go yet (golang/go#80032). Gate their simd path behind the gopus_reverse64 build tag and route the experiment build to the asm fallback when it is unset, so the stock-gotip experiment CI stays green (vanilla archsimd + complex asm) and the complex archsimd path builds only on the patched toolchain with -tags gopus_reverse64.
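Whether a set of build constraints selects exactly one file in every configuration can be checked mechanically with the standard `go/build/constraint` package. The sketch below uses assumed constraint lines: the fallback expression is the one quoted in this PR, the simd expression is its complement, and other tags such as `purego` or the architecture are left out for brevity. `exactlyOnePerConfig` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// exactlyOnePerConfig reports whether, for every on/off combination of the
// given tags, exactly one of the constraint lines evaluates to true.
func exactlyOnePerConfig(lines, tags []string) (bool, error) {
	exprs := make([]constraint.Expr, 0, len(lines))
	for _, l := range lines {
		e, err := constraint.Parse(l)
		if err != nil {
			return false, err
		}
		exprs = append(exprs, e)
	}
	for mask := 0; mask < 1<<len(tags); mask++ {
		on := map[string]bool{}
		for i, t := range tags {
			on[t] = mask&(1<<i) != 0
		}
		hits := 0
		for _, e := range exprs {
			if e.Eval(func(tag string) bool { return on[tag] }) {
				hits++
			}
		}
		if hits != 1 {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	lines := []string{
		"//go:build goexperiment.simd && gopus_reverse64",   // complex archsimd path (assumed form)
		"//go:build !goexperiment.simd || !gopus_reverse64", // asm fallback, as quoted in the PR
	}
	ok, err := exactlyOnePerConfig(lines, []string{"goexperiment.simd", "gopus_reverse64"})
	fmt.Println(ok, err)
}
```

A check like this catches the two failure modes of tag partitioning: a configuration where no file defines the symbol, and one where two files both do.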
Summary
Migrates the hot CELT float kernels from hand-written Plan9 assembly to Go's experimental `simd/archsimd` package (Go tip, `GOEXPERIMENT=simd`) and answers the question: is portable Go SIMD close enough to delete the hand assembly? Short answer: yes for the float kernels. Go SIMD matches or beats the asm on both Apple Silicon and Graviton and at the codec level, once per-access bounds checks are removed. Everything is gated on `goexperiment.simd`; asm and purego paths are untouched for normal builds. No kernel was migrated without bit-exact parity + benchmarks.

Ported (bit-exact): `scale_into`, `stereo_merge`, `inner_prod`, `prefilter_dual_inner_prod` (both arches); `xcorr`, `l1_abs_sum`, `comb_const`, `exp_rotation` (arm64, where the hand asm lived). `comb_const` runs at asm parity; `exp_rotation` ties and is ~12% faster on the stride-2 case.

Also ported (silk, arm64) as measurement-track coverage, slower than the asm because archsimd lacks the native widen/narrow fusion: `int16_float32` (+20.8%, `HiToLo` emulates the missing `ExtendHi4` = SXTL2) and `float_to_int16` (+35.2%, `Round`+`ConvertToInt32`+`SaturateToInt16` vs the asm's fused FCVTNS+SQXTN2). These are honest negative data points: archsimd can be bit-exact here but is not yet competitive.
Finding 1: bounds-check elimination is the whole game
The kernels first used `archsimd.LoadFloat32x4(s[i:])`, which emits a slice bounds check + panic path on every load/store. A subagent sweep proved that this machinery, not the SIMD, made archsimd 3-5× slower than the asm. Loading through raw advancing pointers (`unsafe.SliceData` + `unsafe.Add` + `LoadFloat32x4Array`, via shared `loadF32x4`/`storeF32x4`/`loadF32x8`/`storeF32x8` helpers) removes it.

Finding 2: kernel gap, archsimd vs hand asm (benchstat)
Kernels compared: `inner_prod`, `l1_abs_sum`, `xcorr`/`prefilter_dual`, `stereo_merge`, `scale_into`.
archsimd wins on compute-bound, ties on mixed, and trails only on the pure memory-copy `scale_into` on Neoverse (+19%, down from +280% before the fix). On M4 Max it beats the asm across the board. amd64 (no prior asm) gets a clean 2-7× over scalar.
Finding 3: the 4-way comparison (codec scoreboard vs libopus 1.6.1 C)
CELT encode, gopus/libopus C ratio (lower = closer to C), via `benchmark_libopus_scoreboard_test.go` under three Go builds + self-timed libopus C:
Go SIMD matches or edges the hand asm at the codec level and is the closest Go build to libopus C; both Go-SIMD and Go-asm are far ahead of pure Go. SILK is unaffected (no SILK kernels migrated). (Single-run 20x; archsimd-vs-asm is roughly a wash within noise on some configs.)
Complex IMDCT/MDCT kernels: gated behind `gopus_reverse64`
The 7 complex kernels (`imdct_pre_kiss`, `imdct_tdac`, `imdct_post_kiss`, `mdct_post_twiddle`, `mdct_mid_fold`, `mdct_fold1`, `mdct_fold3`) were the last CELT float kernels blocked on a missing op: a full 4-lane lane-reverse. They deinterleave the `kissCpx` re/im streams with `ConcatEven`/`Odd`, reverse descending streams with a `reverse4` built from `Float32x4.Reverse64` (NEON `VREV64`), and scatter the bit-reversed fold output via temps + a scalar loop (as the asm does).

`Reverse64` is not in upstream Go yet; it was submitted as golang/go#80032. So the complex archsimd path is gated behind the `gopus_reverse64` build tag:
- tag unset: asm fallback (`!goexperiment.simd || !gopus_reverse64`), and CI validates the vanilla archsimd + the complex asm;
- tag set: `GOEXPERIMENT=simd -tags gopus_reverse64` on the patched toolchain (`thesyncim/go`), until #80032 lands and the gate is dropped.

What we win: archsimd hits parity on the complex family (not the regression you'd expect from emulated VLD2): `imdct_post` 37.6↔37.6 ns, `imdct_tdac` beats asm by 15%, the folds bit-exact + full-suite green; `imdct_pre` is the lone +26% outlier (the VLD2-deinterleave gap). With these, every CELT float kernel is portable.
Verdict on deleting the asm
For the migrated CELT float kernels, the hand asm is replaceable by portable Go archsimd with equal-or-better performance on both arches and at the codec level; `scale_into` (pure copy-scale) is the lone marginal case (+19% on Neoverse). The asm stays the default here (archsimd is gotip-only); flipping the default is a follow-up once `simd/archsimd` ships in a Go release.

Not migrated (genuinely blocked: needs an op archsimd lacks)
(The complex twiddle/fold kernels that needed a lane-reverse are now ported above, behind `gopus_reverse64`.)
- `kf_bfly`/`kf_bfly4m1`: strided twiddle gather + cross-lane radix recombine
- `haar1`: strided gather + even/odd deinterleave archsimd lacks for floats
- `resample_fir`: indexed-lane FMLA plus int16↔int32 widening (`int16_float32` and `float_to_int16` are now ported above, at a measured widen/narrow-fusion gap)
- `pitch_xcorr`: indexed-lane FMLA (`fmla v0.4s, v5.4s, v4.s[i]`) has no archsimd form; emulating it per lane regresses, and its output mixes fused (vector lags) and non-fused (scalar-tail lags) accumulation
- `deemphasis`, `lpc_synth`, `up2hq_core`: serial IIR, not vectorizable
- `cwrs`, `pvq_search`: integer combinatorial

Harness
`scripts/bench-simd-kernel.sh <pkg> <bench>` benchstats default-build (asm) vs archsimd. The non-blocking `test-simd-experiment` CI job runs it on a 2-arch matrix (`ubuntu-latest` amd64 + `ubuntu-24.04-arm` Graviton).

Test plan
`GOEXPERIMENT=simd gotip` and the normal toolchain (9 configs)