celt+silk: migrate hot kernels to Go simd/archsimd (vanilla wins + silk widen + gated complex IMDCT/MDCT) by thesyncim · Pull Request #487 · thesyncim/gopus

thesyncim · 2026-06-15T21:41:36Z

Summary

Migrates the hot CELT float kernels from hand-written Plan9 assembly to Go's
experimental simd/archsimd package (Go tip, GOEXPERIMENT=simd) and answers the
question: is portable Go SIMD close enough to delete the hand assembly? Short
answer: yes for the float kernels — Go SIMD matches or beats the asm on both Apple
Silicon and Graviton and at the codec level, once per-access bounds checks are
removed. Everything is gated goexperiment.simd; asm and purego paths are untouched
for normal builds. No kernel migrated without bit-exact parity + benchmarks.

WIP — do not merge yet. This is the single PR for the whole migration: the
CELT float wins, the silk widen/narrow coverage, and the complex IMDCT/MDCT
kernels (gated behind gopus_reverse64 — see below). Iterating over the next
several days.

Ported (bit-exact): scale_into, stereo_merge, inner_prod,
prefilter_dual_inner_prod (both arches); xcorr, l1_abs_sum, comb_const,
exp_rotation (arm64, where the hand asm lived). comb_const runs at asm
parity; exp_rotation ties and is ~12% faster on the stride-2 case.

Also ported (silk, arm64) as measurement-track coverage, slower than the asm
because archsimd lacks the native widen/narrow fusion: int16_float32 (+20.8%,
HiToLo emulates the missing ExtendHi4=SXTL2) and float_to_int16 (+35.2%,
Round+ConvertToInt32+SaturateToInt16 vs the asm's fused FCVTNS+SQXTN2).
These are honest negative data points — archsimd can be bit-exact here but not
yet competitive.

Finding 1 — bounds-check elimination is the whole game

The kernels first used archsimd.LoadFloat32x4(s[i:]), which emits a slice bounds
check and a panic path on every load and store. A subagent sweep showed that this
check machinery, not the SIMD itself, made archsimd 3-5× slower than the asm.
Loading through raw advancing pointers (unsafe.SliceData + unsafe.Add +
LoadFloat32x4Array, via shared loadF32x4/storeF32x4/loadF32x8/storeF32x8 helpers)
removes the checks.
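The access pattern can be sketched in plain Go without the experimental package. This is a minimal illustration, not the PR's helper code: scaleIntoRaw is a hypothetical name, the four scalar multiplies stand in for one vector multiply, and each block pointer is derived from the base plus an in-range offset (a conservative variant of the advancing pointer) so no pointer ever sits past the end of the slice.

```go
package main

import (
	"fmt"
	"unsafe"
)

// scaleIntoRaw (hypothetical name) computes dst[i] = src[i]*gain over the
// common prefix of dst and src. Full blocks go through *[4]float32 views
// over raw pointers, so the block loop carries no per-access slice checks.
func scaleIntoRaw(dst, src []float32, gain float32) {
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	if n == 0 {
		return // empty slices skip every loop
	}
	sp := unsafe.Pointer(unsafe.SliceData(src))
	dp := unsafe.Pointer(unsafe.SliceData(dst))
	const elem = 4 // bytes per float32
	i := 0
	// A block is only formed while i+4 <= n, so every view stays in range.
	for ; i+4 <= n; i += 4 {
		s := (*[4]float32)(unsafe.Add(sp, i*elem))
		d := (*[4]float32)(unsafe.Add(dp, i*elem))
		// Scalar stand-in for a single 4-lane vector multiply.
		d[0] = s[0] * gain
		d[1] = s[1] * gain
		d[2] = s[2] * gain
		d[3] = s[3] * gain
	}
	// Scalar tail for the last n%4 elements (bounds-checked as usual).
	for ; i < n; i++ {
		dst[i] = src[i] * gain
	}
}

func main() {
	src := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9}
	dst := make([]float32, len(src))
	scaleIntoRaw(dst, src, 0.5)
	fmt.Println(dst)
}
```

Each lane is a single multiply, so the result should match the checked scalar loop bit for bit; the only thing the unsafe views change is which checks the compiler has to emit.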

Finding 2 — kernel gap, archsimd vs hand asm (benchstat)

| kernel | bottleneck | M4 Max | Graviton (Neoverse) |
|---|---|---|---|
| `inner_prod` | compute | −11% | −20% ✅ |
| `l1_abs_sum` | compute | −57% | wins ✅ |
| `xcorr` / `prefilter_dual` | compute | ~tie | ~tie ✅ |
| `stereo_merge` | mixed | beats/ties | +1% ✅ |
| `scale_into` | pure memory | −28% | +19% ⚠️ |

Deltas are archsimd sec/op relative to the hand asm, so negative means archsimd is
faster. archsimd wins on compute-bound kernels, ties on mixed ones, and trails only on
the pure memory-copy scale_into on Neoverse (+19%, down from +280% before the
bounds-check fix). On M4 Max it beats or ties the asm on every row. amd64 (no prior
asm) gets a clean 2-7× over scalar.

Finding 3 — the 4-way comparison (codec scoreboard vs libopus 1.6.1 C)

CELT encode, gopus/libopus C ratio (lower = closer to C), via
benchmark_libopus_scoreboard_test.go under three Go builds + self-timed libopus C:

| CELT config | pure Go | Go asm | Go SIMD | libopus C |
|---|---|---|---|---|
| mono 8k 20ms | 1.33× | 1.08× | 1.04× | 1.0 |
| mono 24k 10ms | 1.78× | 1.51× | 1.38× | 1.0 |
| mono 48k 20ms | 1.46× | 1.16× | 1.13× | 1.0 |
| stereo 48k 20ms | 1.50× | 1.35× | 1.27× | 1.0 |

Go SIMD matches or edges the hand asm at the codec level and is the closest Go
build to libopus C; both Go SIMD and Go asm are far ahead of pure Go. SILK is
unaffected by these CELT numbers; the two silk conversion kernels listed in the
summary were ported separately as measurement-track coverage. (Single-run 20x; on
some configs the archsimd-vs-asm difference is within noise.)

Complex IMDCT/MDCT kernels — gated behind `gopus_reverse64`

The 7 complex kernels — imdct_pre_kiss, imdct_tdac, imdct_post_kiss,
mdct_post_twiddle, mdct_mid_fold, mdct_fold1, mdct_fold3 — were the last CELT
float kernels blocked on a missing op: a full 4-lane lane-reverse. They deinterleave
the kissCpx re/im streams with ConcatEven/Odd, reverse descending streams with a
reverse4 built from Float32x4.Reverse64 (NEON VREV64), and scatter the
bit-reversed fold output via temps + a scalar loop (as the asm does).

Reverse64 is not in upstream Go yet (submitted as golang/go#80032), so the
complex archsimd path is gated behind the gopus_reverse64 build tag:

- stock-gotip CI is green: without the tag the experiment build uses the complex
  asm fallback (!goexperiment.simd || !gopus_reverse64), and CI validates the
  vanilla archsimd + the complex asm;
- the complex archsimd path builds only with GOEXPERIMENT=simd -tags gopus_reverse64
  on the patched toolchain (thesyncim/go), until #80032 lands and the gate is dropped;
- the default build is unchanged: the hand asm still runs.
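One way to sanity-check such a gate is to evaluate the constraint lines with the standard go/build/constraint package and confirm that exactly one path is selected for every tag combination. The three constraint lines below are assumptions for illustration: only the asm fallback expression (!goexperiment.simd || !gopus_reverse64) is quoted in this PR, the !purego and purego terms are guesses at how the other files are tagged, and the real files also carry arch constraints that are omitted here.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// path pairs an implementation name with its (assumed) build constraint.
type path struct{ name, line string }

// Hypothetical constraint lines for one complex kernel's three files.
var complexKernelPaths = []path{
	{"archsimd", "//go:build goexperiment.simd && gopus_reverse64 && !purego"},
	{"asm", "//go:build (!goexperiment.simd || !gopus_reverse64) && !purego"},
	{"purego", "//go:build purego"},
}

// selected returns the names of the paths whose constraint holds for tags.
func selected(tags map[string]bool) []string {
	var out []string
	for _, p := range complexKernelPaths {
		expr, err := constraint.Parse(p.line)
		if err != nil {
			panic(err)
		}
		if expr.Eval(func(tag string) bool { return tags[tag] }) {
			out = append(out, p.name)
		}
	}
	return out
}

func main() {
	// Enumerate all 8 combinations of the three tags.
	for mask := 0; mask < 8; mask++ {
		tags := map[string]bool{
			"goexperiment.simd": mask&1 != 0,
			"gopus_reverse64":   mask&2 != 0,
			"purego":            mask&4 != 0,
		}
		fmt.Println(tags, "->", selected(tags))
	}
}
```

If a combination selected zero or two paths, the build would fail with a missing or duplicate symbol, which is the failure mode the retagging in this PR is meant to avoid.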

What we win: archsimd hits parity on the complex family (not the regression you'd
expect from emulated VLD2) — imdct_post 37.6↔37.6 ns, imdct_tdac beats asm
−15%, the folds bit-exact + full-suite green; imdct_pre is the lone +26% outlier
(the VLD2-deinterleave gap). With these, every CELT float kernel is portable.

Verdict on deleting the asm

For the migrated CELT float kernels, the hand asm is replaceable by portable Go
archsimd with equal-or-better performance on both arches and at the codec level —
scale_into (pure copy-scale) is the lone marginal case (+19% on Neoverse). The asm
stays the default here (archsimd is gotip-only); flipping the default is a follow-up
once simd/archsimd ships in a Go release.

Not migrated (genuinely blocked — needs an op archsimd lacks)

(The complex twiddle/fold kernels that needed a lane-reverse are now ported above,
behind gopus_reverse64.)

- kf_bfly / kf_bfly4m1: strided twiddle gather + cross-lane radix recombine
- haar1: strided gather + even/odd deinterleave archsimd lacks for floats
- silk resample_fir: indexed-lane FMLA plus int16↔int32 widening (int16_float32
  and float_to_int16 are now ported above, at a measured widen/narrow-fusion gap)
- silk pitch_xcorr: indexed-lane FMLA (fmla v0.4s, v5.4s, v4.s[i]) has no
  archsimd form; emulating it per lane regresses, and its output mixes fused
  (vector lags) and non-fused (scalar-tail lags) accumulation
- deemphasis, lpc_synth, up2hq_core: serial IIR, not vectorizable
- cwrs, pvq_search: integer combinatorial

Harness

scripts/bench-simd-kernel.sh <pkg> <bench> benchstats default-build (asm) vs
archsimd. The non-blocking test-simd-experiment CI job runs it on a 2-arch matrix
(ubuntu-latest amd64 + ubuntu-24.04-arm Graviton).

Test plan

Builds clean under GOEXPERIMENT=simd gotip and the normal toolchain (9 configs)
Bit-exact parity on every runnable config; AVX/FMA + Graviton via CI
asm + purego intact; gofmt + vet clean; unsafe pointer walks stay in range
Kernel benchmarks (M4 Max + Graviton) + 4-way codec scoreboard vs libopus C

Add an archsimd implementation of scaleFloat32IntoNEON (dst[i]=src[i]*gain), gated on goexperiment.simd && !purego and compiled only under the Go tip toolchain. amd64 scales eight lanes per VMULPS through the 256-bit Float32x8 behind an X86.AVX() guard, with a 128-bit Float32x4 tail and fallback; arm64 uses the 128-bit Float32x4 NEON path. Every lane is a bare per-lane multiply, the same single-rounding product as the scalar reference and the hand asm, so the result is bit-exact. The arm64 NEON asm and the portable scalar fallback stay as the default for every non-experiment build and as the A/B baseline; their tags gain !goexperiment.simd so the three paths partition cleanly across every arch/purego/experiment config. TestScaleFloat32IntoBitExact asserts Float32bits equality against the scalar reference on whichever path the build selects. BenchmarkScaleInto times the kernel against that reference. A non-blocking CI job installs the tip toolchain and builds, parity-checks, and benchmarks the archsimd path on an amd64 runner. On arm64 (M-series) the hand NEON asm stays fastest and remains the default; archsimd trails it but beats scalar, and on amd64 it brings the first SIMD path to a kernel that was scalar-only there.

… arches in CI Add scripts/bench-simd-kernel.sh: runs a kernel benchmark under the default toolchain (arm64 NEON asm, amd64/purego scalar) and under gotip+GOEXPERIMENT=simd (archsimd), then benchstats them. The dispatch compiles one path per build, so the comparison is cross-binary on one host; the always-compiled scalar Ref benchmark anchors both runs. Make test-simd-experiment a 2-arch matrix (ubuntu-latest amd64 + ubuntu-24.04-arm arm64) that runs the script, so the archsimd path is measured against the current hand asm on arm64 and against the scalar fallback on amd64, and arm64 archsimd parity now runs in CI too. Still non-blocking. On arm64 (M4 Max) the NEON asm stays well ahead of archsimd (benchstat geomean +128% sec/op for archsimd); on amd64 archsimd is the only SIMD path and beats scalar 2-7x.

Port stereoMergeRescaleNEON (l=mid*x; x=lgain*(l-y); y=rgain*(l+y)) to archsimd, gated goexperiment.simd && !purego. amd64 processes 8 lanes per step through the 256-bit Float32x8 behind an X86.AVX() guard with a 128-bit Float32x4 tail and fallback; arm64 uses the 128-bit Float32x4 NEON path. Mul, Sub and Add are distinct archsimd ops — objdump confirms separate FMUL/FSUB/FADD on arm64 and VMULPS/VSUBPS/VADDPS (no VFMADD/VFMSUB) on amd64 — so each lane keeps the two-rounding shape of the scalar noFMA32 reference and the hand asm, and the result is bit-exact. The arm64 asm and the scalar fallback stay the default for non-experiment builds; their tags gain !goexperiment.simd so the three paths partition per config. TestStereoMergeRescaleBitExact asserts Float32bits equality vs the scalar reference on whichever path the build selects; BenchmarkStereoMerge feeds the existing asm-vs-archsimd benchstat harness. On arm64 (M4 Max) archsimd is competitive — faster at N16/N64, behind the asm at N176/N480 (geomean +11% sec/op) — so the asm stays the arm64 default; on amd64 it adds the first SIMD path.
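The two-rounding shape described in this commit can be written as a scalar sketch. The function name and signature below are hypothetical, not the repository's reference; the explicit float32 conversions are there because the Go spec lets a compiler fuse x*y + z unless an explicit conversion rounds the intermediate.

```go
package main

import "fmt"

// stereoMergeRescaleRef (hypothetical name) applies, per element:
//   l = mid*x; x = lgain*(l-y); y = rgain*(l+y)
// Every product, difference and sum is rounded to float32 on its own,
// so no multiply is ever fused with the following add or subtract.
func stereoMergeRescaleRef(x, y []float32, mid, lgain, rgain float32) {
	n := len(x)
	if len(y) < n {
		n = len(y)
	}
	for i := 0; i < n; i++ {
		l := float32(mid * x[i]) // rounded product
		r := y[i]
		x[i] = float32(lgain * float32(l-r)) // Sub rounds, then Mul rounds
		y[i] = float32(rgain * float32(l+r)) // Add rounds, then Mul rounds
	}
}

func main() {
	x := []float32{4, 8}
	y := []float32{1, 2}
	stereoMergeRescaleRef(x, y, 0.5, 2, 3)
	fmt.Println(x, y)
}
```

A vector version that keeps Mul, Sub and Add as separate ops performs the same per-lane roundings, which is why it can be compared with Float32bits equality.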

…xact, faster than NEON) Port celtInnerProd8FMA32 to a 4-lane archsimd Float32x4 accumulator with fused MulAdd (FMLA on arm64, VFMADD on amd64), gated goexperiment.simd && !purego. Lane L sums elements L,L+4,L+8,… and the reduction is (acc0+acc2)+(acc1+acc3) with a scalar fused tail — the exact order of the scalar reference and the hand asm, so the result is bit-exact (proven by the existing TestCeltInnerProd8FMA32MatchesReference). Four lanes is mandatory: a wider Float32x8 accumulator would build a different partial-sum tree and diverge. Unlike the elementwise kernels this stays 4-wide on amd64 too (bit-exactness), and since amd64 MulAdd lowers to VFMADD unconditionally it gates on archsimd.X86.FMA() with a scalar fallback. arm64 NEON always has FMLA. The asm, scalar default and arm64 purego paths stay the default for non-experiment builds; their tags gain !goexperiment.simd where needed so the paths partition per config. Being compute-bound rather than memory-bound, archsimd here beats the hand NEON asm on arm64 (M4 Max benchstat: N64 -16%, N176 -15%, geomean -11% sec/op) while staying bit-exact, and brings a 4-lane FMA SIMD path to amd64 where it was scalar.
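The lane assignment and reduction order can be modelled in scalar Go. This is a sketch of the summation tree only, under two stated assumptions: the real kernel accumulates with a fused MulAdd, which this model replaces with a separately rounded multiply and add (so it is not bit-identical to the kernel in general), and the tail is assumed to fold sequentially into the reduced sum. innerProd4Lane is a hypothetical name.

```go
package main

import "fmt"

// innerProd4Lane (hypothetical name) models a 4-lane accumulator:
// lane L sums elements L, L+4, L+8, ... and the lanes are reduced as
// (acc0+acc2)+(acc1+acc3), followed by a scalar tail.
func innerProd4Lane(x, y []float32) float32 {
	n := len(x)
	if len(y) < n {
		n = len(y)
	}
	var acc [4]float32
	i := 0
	for ; i+4 <= n; i += 4 {
		for l := 0; l < 4; l++ {
			// Unfused stand-in for the kernel's fused MulAdd.
			acc[l] = float32(x[i+l]*y[i+l]) + acc[l]
		}
	}
	// Reduction order matters for bit-exactness: pair lanes 0+2 and 1+3.
	sum := float32(acc[0]+acc[2]) + float32(acc[1]+acc[3])
	// Assumed tail shape: remaining elements fold into the sum in order.
	for ; i < n; i++ {
		sum = float32(x[i]*y[i]) + sum
	}
	return sum
}

func main() {
	x := []float32{1, 2, 3, 4, 5, 6}
	fmt.Println(innerProd4Lane(x, x))
}
```

An 8-lane accumulator would split the same elements across twice as many partial sums and reduce them through a different tree, which is the divergence the commit message warns about.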

…nd bench Extend the test-simd-experiment matrix to parity-check and benchstat all three migrated kernels (scale_into, stereo_merge, inner_prod) against the default build, not just scale_into.

…, ~matches asm) Port xcorrKernel4Float32Neon4Acc to four phase-parallel archsimd Float32x4 MulAdd chains (one per sample in each 4-sample block, lane k = lag k), gated goexperiment.simd && !purego on arm64. Phase p fuses x[p] into y[p:p+4]; phases combine as (acc0+acc1)+(acc2+acc3) with a scalar-order tail — the exact order of xcorrKernel4Float32FourAccRef, so it is bit-exact (proven by the existing TestXcorrKernel4Float32Neon4AccBitExact). The const pitchXcorrUsesNeonFMA moves into the experiment file; the asm .go/.s gain !goexperiment.simd. amd64 keeps the scalar path (it has no 4-acc asm), so this is arm64-only. Being compute-bound, archsimd lands within ~2% of the hand NEON asm on arm64 (M4 Max benchstat geomean +1.7%) and is faster on the realistic coarse/half pitch searches (Coarse -7.7%, Half -3.9%); only tiny length-5 windows regress.
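The phase-parallel layout can be illustrated with a scalar model. This is a sketch under assumptions, not xcorrKernel4Float32FourAccRef itself: the function name is hypothetical, the fused MulAdd is replaced by a separately rounded multiply and add, and the tail is assumed to fold into each lag's sum in sample order.

```go
package main

import "fmt"

// xcorr4PhaseParallel (hypothetical name) computes four lags,
// sum[k] = sum over j of x[j]*y[j+k] for k = 0..3.
// acc[p][k] is the accumulator for phase p (sample p of each 4-sample
// block) and lane k (lag k); y must hold at least len(x)+3 samples.
func xcorr4PhaseParallel(x, y []float32) [4]float32 {
	n := len(x)
	if n > 0 && len(y) < n+3 {
		panic("y must hold len(x)+3 samples")
	}
	var acc [4][4]float32
	j := 0
	for ; j+4 <= n; j += 4 {
		for p := 0; p < 4; p++ { // four independent accumulation chains
			for k := 0; k < 4; k++ { // phase p applies x[j+p] to y[j+p : j+p+4]
				acc[p][k] = float32(x[j+p]*y[j+p+k]) + acc[p][k]
			}
		}
	}
	var sum [4]float32
	for k := 0; k < 4; k++ {
		// Phases combine as (acc0+acc1)+(acc2+acc3), per lane.
		sum[k] = float32(acc[0][k]+acc[1][k]) + float32(acc[2][k]+acc[3][k])
	}
	for ; j < n; j++ { // assumed scalar-order tail
		for k := 0; k < 4; k++ {
			sum[k] = float32(x[j]*y[j+k]) + sum[k]
		}
	}
	return sum
}

func main() {
	x := []float32{1, 2, 3, 4, 5}
	y := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	fmt.Println(xcorr4PhaseParallel(x, y))
}
```

Keeping four separate chains shortens the dependency on any one accumulator, at the price of fixing a specific combine order that the vector code then has to reproduce.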

…hand asm The archsimd scale_into was 3-5x slower than the NEON asm not because of the SIMD but because archsimd.LoadFloat32x4(src[i:])/.Store(dst[i:]) emit a slice length bounds check and panic path on every load and store, dominating this load/store- bound kernel. Loading through *[4]float32 views over advancing unsafe pointers (LoadFloat32x4Array/StoreArray) drops the checks; with an 8-wide unroll the archsimd loop now beats the hand NEON asm on M4 Max (N16 -17%, N64 -31%, N480 -21%) while staying bit-exact (TestScaleFloat32IntoBitExact). The pointers only advance while i+k <= n, so all accesses stay in range; empty slices skip every loop.

…he hand asm Apply the scale_into win to every archsimd kernel: stereo_merge, inner_prod and xcorr now load/store through raw advancing pointers via shared loadF32x4/storeF32x4 (and loadF32x8/storeF32x8 on amd64) instead of archsimd.LoadFloat32x4(s[i:]), which emits a slice bounds check and panic path on every access. The check machinery, not the SIMD, was what made archsimd several times slower than the hand asm. With the checks gone the archsimd kernels beat the hand NEON asm on M4 Max (combined benchstat geomean -21% sec/op, +19% bandwidth) while staying bit-exact — every parity test passes on arm64 archsimd and amd64 (SSE fallback under Rosetta). The helpers inline to a bare vector load/store; pointers only advance within range, so accesses stay valid and empty slices skip all loops.

Record the goexperiment.simd archsimd kernels as a measurement track toward replacing hand assembly with portable Go SIMD, bit-exact and (with bounds-check elimination) beating the asm on Apple Silicon; asm stays the released default.

… (bit-exact) Port prefilterDualInnerProdAsm (sum1=<x,y1>, sum2=<x,y2>) to two 4-lane archsimd FMA accumulators loading through raw pointers (loadF32x4, no slice bounds checks), gated goexperiment.simd && !purego. Lane L sums elements L,L+4,L+8,… and each reduction is (a0+a2)+(a1+a3) with a scalar fused tail — the exact order of the scalar reference, so bit-exact (TestPrefilterDualInnerProdMatchesReference). amd64 gates on archsimd.X86.FMA() with a scalar fallback; arm64 NEON always has FMLA. The asm .go/.s gain !goexperiment.simd; default.go excludes amd64-under-experiment.

…m64) Port l1AbsSumNeon to a 4-lane archsimd accumulator (Abs + Add) loading through raw pointers (loadF32x4, no slice bounds checks), gated goexperiment.simd && !purego on arm64. Lane k sums |tmp[k]|,|tmp[k+4]|,… and the reduction is (a0+a1)+(a2+a3)+tail — the exact order of l1AbsSumNeonReference, so bit-exact (TestL1AbsSumNeonBitExact). That order diverges from the scalar L1 sum by a few ULP (the arm64 quality-gated regime), so amd64/purego keep the scalar sum — this stays arm64-only. The asm .go/.s gain !goexperiment.simd.

The archsimd kernels match or beat the hand asm on Apple Silicon and Graviton, and the 4-way codec scoreboard shows Go SIMD as the closest Go build to libopus C.

thesyncim · 2026-06-16T00:09:57Z

4-way comparison — Decode half (CELT decode, gopus/libopus-C ratio, lower = closer to C)

Complements the Encode table in the PR description. Decode is where scale_into and
stereo_merge live, and archsimd is consistently the closest Go build to libopus C:

| CELT decode | pure Go | Go asm | Go SIMD | libopus C |
|---|---|---|---|---|
| mono 8k 10ms | 1.60× | 1.50× | 1.29× | 1.0 |
| mono 48k 10ms | 1.70× | 1.64× | 1.23× | 1.0 |
| mono 48k 20ms | 1.57× | 1.45× | 1.32× | 1.0 |
| stereo 24k 10ms | 1.52× | 1.47× | 1.31× | 1.0 |
| stereo 48k 10ms | 1.48× | 1.47× | 1.34× | 1.0 |

Go SIMD wins on most decode configs (a couple noisy ties at 20x). Final CI green.

thesyncim · 2026-06-16T00:11:57Z

Evaluated: the portable scalable `simd` package (vs fixed-width `archsimd`)

Tried reimplementing scale_into on the scalable simd.Float32s type (auto-widens to
the platform width — 4 on this arm64, 8/16 on AVX2/AVX-512 — from one source, the more
portable alternative to per-arch Float32x4/Float32x8).

Two blockers at this Go tip:

1. It crashes the compiler on arm64: internal compiler error: nil pointer
   dereference in cmd/compile/internal/midway/rewrite.go (the pass that lowers
   scalable vectors). Not usable yet.
2. Even if it compiled, its API is slice-only (LoadFloat32s([]float32)) with no
   raw-pointer load, so it would keep the per-access bounds check that the unsafe
   archsimd kernels eliminate, the very cost that decides asm-vs-archsimd here.

Conclusion: fixed-width archsimd + the unsafe raw-pointer trick is the viable
path; the scalable package is too immature on arm64 to evaluate for performance.
Revisit once the midway backend stabilizes and an array-pointer load lands.

scale_into is the one archsimd kernel that trailed the hand asm on Neoverse (+19%, the pure memory-copy case). Widen the arm64 unroll from 8 to 16 elements/iter (the width that profiled fastest in the optimization sweep) to amortize loop control on the overhead-bound core. M4 Max improves (N480 -20% vs asm, was -17%) and stays bit-exact; Graviton effect measured in CI.

thesyncim · 2026-06-16T07:47:34Z

scale_into Graviton holdout — closed via 16-wide unroll

scale_into was the lone kernel trailing the hand asm on Neoverse (+19% at N480).
Widening the arm64 unroll from 8 to 16 elements/iter (amortizing loop control on the
overhead-bound core) closes it — Graviton archsimd vs asm:

| N | 8-wide (before) | 16-wide (now) |
|---|---|---|
| 64 | (trailed) | −11% (archsimd faster) |
| 176 | n/a | +4.6% |
| 480 | +19% | +8.6% |

Every size is now either faster than the asm or within +8.6% of it, inside the
10-15% replace-the-asm bar. M4 Max also improved (N480 −20% vs asm, was −17%), and
the kernel is still bit-exact.

Updated verdict: with the holdout closed, archsimd is within the bar of, or better
than, the hand NEON asm on all 6 migrated kernels across both M4 Max and Graviton.
The float-kernel assembly is fully replaceable by portable Go SIMD once
simd/archsimd ships in a Go release; the asm stays the default until then.

thesyncim · 2026-06-16T08:01:50Z

Re-examined the "blocked" kernels — root cause is archsimd's thinner arm64 backend

Pushed harder on whether more kernels can replace the asm. Findings:

- silk int16↔float32 is portable, not blocked (I was wrong): Int16x8.ExtendLo4ToInt32()
  → Int32x4.ConvertToFloat32() → ×2⁻¹⁵ is bit-exact. But it benchmarks at +27% vs the asm
  (tried 4-lane and 2-load 8-lane, both ~+27%): arm64 archsimd has ExtendLo4 but no
  ExtendHi4, so it can't widen both int16 halves the way the asm's SXTL/SXTL2 does.
  Portable but not a competitive replacement, so it was reverted (won't ship a regression).
- FFT/MDCT complex kernels are genuinely arm64-blocked: the bit-reinterpret-back
  (Int32x4.AsFloat32x4) exists on amd64 but not arm64, so the re/im float deinterleave
  they need is impossible without a memory round-trip.

Root cause: archsimd's arm64 backend is less complete than amd64 (doc: "currently supports
AMD64"). Missing on arm64 — ExtendHi4, AsFloat32x4 reinterpret, float deinterleave/reverse,
gather — is what caps competitive replacement beyond the 6 float kernels. Those 6 remain the set
archsimd matches/beats on arm64; more will qualify as the arm64 backend gains ops.

Port the constant-gain 5-tap comb filter (combFilterConstNeon) from hand Plan9 asm to Go simd/archsimd, gated goexperiment.simd && !purego. Two tap sums round as plain Adds and the three accumulates are fused MulAdds, matching combFilterConstValue and the asm bit-for-bit. Raw-pointer loads (loadF32x4/storeF32x4) skip the per-lane bounds check. The hand asm .go/.s retag !goexperiment.simd so the paths partition per build config.

Port the spreading-rotation pass (expRotation1PassNeon) from hand Plan9 asm to Go simd/archsimd, gated goexperiment.simd && !purego. Each block runs two rounded Muls and two fused MulAdds over the x[i] and x[i+stride] lanes, matching expRotationMac32 and the asm bit-for-bit; stride>=4 keeps the four lanes independent. The shared expRotation1StrideNeon wrapper and the expRotationUsesNeon flag stay in the umbrella arm64 file; the asm decl moves to a !goexperiment.simd file and the .s retags so the kernel symbol partitions per build. Parity holds and the Stride2 case runs ~12% faster than the asm.

…el set

Port the int16->float32 widening conversion (writeInt16AsFloat32Core) from hand NEON asm to simd/archsimd, gated arm64 && goexperiment.simd && !purego; retag the asm .go/.s !goexperiment.simd so the build partitions per config. The default (asm + purego) build is untouched. Each int16 is sign-extended to int32, converted to float32 and multiplied by the exact-in-float32 1/32768 scale -- the same per-element op sequence as the scalar reference, so it is bit-exact under both the experiment toolchain and normal go test. The low 4 lanes widen with ExtendLo4ToInt32 and the high 4 with HiToLo().ExtendLo4ToInt32() (arm64 has no ExtendHi4), the extra shift emulating SXTL2. Loads/stores go through raw advancing pointers (new simd_load_store.go int16/int32/f32 helpers) to drop the per-access slice bounds check. Measured: archsimd 22.97 ns/op vs asm 19.02 ns/op (+20.8%, n=6) -- the expected gap from the HiToLo shift; we take the coverage.
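The per-element op sequence named here (sign-extend, convert, scale) is small enough to show as a scalar sketch. The function name is hypothetical and this is not the repository's reference; it only illustrates why the scale is exact: 1/32768 is a power of two, so the multiply changes the exponent and never the mantissa.

```go
package main

import "fmt"

// int16ToFloat32 (hypothetical name) widens each int16 to int32, converts
// it to float32 and scales by 1/32768. Every int16 fits in a float32
// mantissa and the scale is a power of two, so each step is exact.
func int16ToFloat32(dst []float32, src []int16) {
	const scale = float32(1.0 / 32768.0) // 2^-15, exactly representable
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	for i := 0; i < n; i++ {
		wide := int32(src[i]) // sign extension (SXTL / SXTL2 in the asm)
		dst[i] = float32(wide) * scale
	}
}

func main() {
	src := []int16{16384, -32768, 1, 0}
	dst := make([]float32, len(src))
	int16ToFloat32(dst, src)
	fmt.Println(dst)
}
```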

Port the float32->int16 narrowing conversion (floatToInt16ScaledCore plus the floatToInt16Scaled wrapper) from hand NEON asm to simd/archsimd, gated arm64 && goexperiment.simd && !purego; retag the asm .go/.s !goexperiment.simd so the build partitions per config. The default (asm + purego) build is untouched. The op sequence reproduces the FCVTNS+SQXTN asm and the scalar saturate-then-round-even path bit-for-bit: Mul(scale), Round() (VFRINTN, ties-to-even) so the value is an exact integer, ConvertToInt32() (VFCVTZS truncation is then lossless), and SaturateToInt16() (VSQXTN clamp to [-32768,32767]). Two int32 quads saturate separately and their low halves merge with a VZIP1 (Uint64x2.InterleaveLo), reproducing SQXTN2. TestFloatToInt16ScaledBitExact and ...Boundaries pass under both the experiment toolchain and normal go test. Measured: archsimd 23.87 ns/op vs asm 17.66 ns/op (+35.2%, n=6) -- the gap is Round+ConvertToInt32 (two ops) vs the asm's fused FCVTNS plus the VZIP1 merge vs free SQXTN2; we take the coverage.
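The four-step sequence (scale, round ties-to-even, convert, saturate) can be followed in a scalar sketch. This is an illustration under assumptions rather than floatToInt16ScaledCore: the name and signature are hypothetical, math.RoundToEven stands in for the vector Round, and NaN inputs are left out of scope.

```go
package main

import (
	"fmt"
	"math"
)

// floatToInt16Scaled (hypothetical name) converts src[i]*scale to int16.
// NaN inputs are not handled in this sketch.
func floatToInt16Scaled(dst []int16, src []float32, scale float32) {
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	for i := 0; i < n; i++ {
		v := float32(src[i] * scale)       // Mul(scale), rounded to float32
		r := math.RoundToEven(float64(v)) // ties-to-even, now an exact integer
		// Saturate to the int16 range before the (then lossless) conversion.
		if r > math.MaxInt16 {
			r = math.MaxInt16
		} else if r < math.MinInt16 {
			r = math.MinInt16
		}
		dst[i] = int16(r)
	}
}

func main() {
	src := []float32{0, 1, -1, 1.5 / 32768, 2.5 / 32768, -0.5 / 32768}
	dst := make([]int16, len(src))
	floatToInt16Scaled(dst, src, 32768)
	fmt.Println(dst)
}
```

Rounding first means the later conversion can truncate without changing the value, which is the reason the archsimd path needs two ops where the asm's FCVTNS does both at once.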

…ent job

…4 patch) Port the IMDCT pre-rotation and TDAC windowing kernels to archsimd, gated goexperiment.simd && !purego. Both rely on a full 4-lane reverse (reverse4) built on Float32x4.Reverse64 — the arm64 VREV64 op added in github.com/thesyncim/go branch arm64-simd-reverse64 — for their descending spectrum/window accesses; they will not build on a vanilla gotip. Bit-exact vs the scalar references (TestIMDCTPreRotateFMA32KissMatchesScalar, TestIMDCTTDACWindowFMA32MatchesScalar). imdct_tdac (real-valued, reverse only) beats the hand asm on M4 Max (-15%); imdct_pre (complex re/im deinterleave) is +26%, the cost of archsimd lacking a VLD2-style deinterleaving load. The asm/purego paths stay the default for non-experiment builds.

Port the IMDCT post-rotation (imdctPostRotateF32FromKiss) from hand asm to Go simd/archsimd, gated goexperiment.simd && !purego (needs the Reverse64 op from the patched toolchain). Forward four come from a ConcatEven/Odd deinterleave of the kissCpx scratch; the backward four walk down and reverse4 so descending lanes line up; products are single-round Muls and accumulates fused MulAdds, matching mdctMulAddMix/mdctMulSubMix bit-for-bit. The asm .go/.s retag !goexperiment.simd. Benchmarks at asm parity.

Port the forward-MDCT post-twiddle (mdctPostTwiddleNeon) from hand asm to Go simd/archsimd, gated goexperiment.simd && !purego. Each block pairs a forward run and its mirror j=n4-1-i; ConcatEven/Odd deinterleaves the kissCpx, the two ends tile coeffs contiguously via reverse4 + InterleaveLo/Hi. Products are single-round Muls and the combines plain Sub/Add (no fusion), matching mdctMul and the scalar loop bit-for-bit. The asm .go/.s retag !goexperiment.simd.

Port the forward-MDCT middle fold (mdctMidFoldStoreNeon) from hand asm to Go simd/archsimd, gated goexperiment.simd && !purego. The (re,im) compute is vectorized four lanes at a time (ConcatEven deinterleave, descending re via reverse4, fused MulAdd, scaling Mul) matching mdctStoreDirectStageFMALike bit-for-bit; the bit-reversed dst[bitrev[i]] scatter stays scalar, exactly as the asm does. The asm .go/.s retag !goexperiment.simd.

Port the two windowed forward-MDCT fold kernels (mdctFold1StoreNeon, mdctFold3StoreNeon) from hand asm to Go simd/archsimd, gated goexperiment.simd && !purego. Each deinterleaves six sample/window streams (ConcatEven ascending, reverse4 descending) and combines them with fused MulAdds + single-round Muls matching mdctMulAddMixEncode/mdctMulSubMixEncode/ mdctMulSubMixAlt; the shared twiddle/scale/bit-reversed scatter tail (mdctFoldStore) mirrors mid_fold. The asm .go/.s retag !goexperiment.simd.

The 7 complex IMDCT/MDCT archsimd kernels need Float32x4.Reverse64, which is not in upstream Go yet (golang/go#80032). Gate their simd path behind the gopus_reverse64 build tag and route the experiment build to the asm fallback when it is unset, so the stock-gotip experiment CI stays green (vanilla archsimd + complex asm) and the complex archsimd path builds only on the patched toolchain with -tags gopus_reverse64.

thesyncim added 3 commits June 15, 2026 22:40

thesyncim changed the title ~~celt: migrate kernels to Go simd/archsimd packages (start: scale_into)~~ celt: migrate kernels to Go simd/archsimd packages (scale_into, stereo_merge) Jun 15, 2026

thesyncim added 2 commits June 15, 2026 23:16

ci: cover stereo_merge and inner_prod in the simd-experiment parity a…

d2efcce


thesyncim changed the title ~~celt: migrate kernels to Go simd/archsimd packages (scale_into, stereo_merge)~~ celt: migrate kernels to Go simd/archsimd packages (scale_into, stereo_merge, inner_prod) Jun 15, 2026

thesyncim added 7 commits June 15, 2026 23:25

docs: sharpen the Go SIMD note with the both-arch + codec-level result

919987d


thesyncim added 11 commits June 16, 2026 12:18

docs: list the comb filter and spreading rotation in the Go SIMD kern…

c990ffa

…el set

ci: build + parity-test silk archsimd widening kernels in the experim…

0eb45d8

…ent job

thesyncim mentioned this pull request Jun 16, 2026

celt: complex IMDCT/MDCT archsimd kernels (needs the Reverse64 op) #488

Merged

3 tasks

thesyncim marked this pull request as draft June 16, 2026 13:01

thesyncim changed the title ~~celt: migrate kernels to Go simd/archsimd packages (scale_into, stereo_merge, inner_prod)~~ celt+silk: migrate hot kernels to Go simd/archsimd (vanilla wins + silk widen + gated complex IMDCT/MDCT) Jun 16, 2026

thesyncim wants to merge 25 commits into master from simd-pkg-kernels
