celt+silk: migrate hot kernels to Go simd/archsimd (vanilla wins + silk widen + gated complex IMDCT/MDCT)#487

Draft. thesyncim wants to merge 25 commits into master from simd-pkg-kernels.

thesyncim (Owner) commented Jun 15, 2026

Summary

Migrates the hot CELT float kernels from hand-written Plan9 assembly to Go's
experimental simd/archsimd package (Go tip, GOEXPERIMENT=simd) and answers the
question: is portable Go SIMD close enough to delete the hand assembly? Short
answer: yes for the float kernels — Go SIMD matches or beats the asm on both Apple
Silicon and Graviton and at the codec level, once per-access bounds checks are
removed. Everything is gated goexperiment.simd; asm and purego paths are untouched
for normal builds. No kernel migrated without bit-exact parity + benchmarks.

WIP, do not merge yet. This is the single PR for the whole migration: the
CELT float wins, the silk widen/narrow coverage, and the complex IMDCT/MDCT
kernels (gated behind gopus_reverse64; see below). Iterating over the next
several days.

Ported (bit-exact): scale_into, stereo_merge, inner_prod,
prefilter_dual_inner_prod (both arches); xcorr, l1_abs_sum, comb_const,
exp_rotation (arm64, where the hand asm lived). comb_const runs at asm
parity; exp_rotation ties and is ~12% faster on the stride-2 case.

Also ported (silk, arm64) as measurement-track coverage, slower than the asm
because archsimd lacks the native widen/narrow fusion: int16_float32 (+20.8%,
HiToLo emulates the missing ExtendHi4=SXTL2) and float_to_int16 (+35.2%,
Round+ConvertToInt32+SaturateToInt16 vs the asm's fused FCVTNS+SQXTN2).
These are honest negative data points — archsimd can be bit-exact here but not
yet competitive.

Finding 1 — bounds-check elimination is the whole game

The kernels first used archsimd.LoadFloat32x4(s[i:]), which emits a slice bounds
check + panic path on every load/store. A subagent sweep proved that machinery — not
the SIMD — made archsimd 3-5× slower than the asm. Loading through raw advancing
pointers (unsafe.SliceData + unsafe.Add + LoadFloat32x4Array, via shared
loadF32x4/storeF32x4/loadF32x8/storeF32x8 helpers) removes it.
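
A minimal sketch of the raw-pointer access pattern, written in plain Go so it runs on a release toolchain. It is not the PR's code: load4 is a hypothetical stand-in for the loadF32x4/storeF32x4 helpers, and the per-lane multiplies stand in for the single vector Mul that the archsimd path would issue on the loaded array. Each block address is derived from the slice base, so no pointer ever leaves the backing array.

```go
package main

import (
	"fmt"
	"unsafe"
)

// load4 views four consecutive float32 values starting at p as an array.
// Hypothetical stand-in for the PR's loadF32x4 helper; the archsimd version
// would pass this array view to LoadFloat32x4Array rather than return it.
func load4(p unsafe.Pointer) *[4]float32 {
	return (*[4]float32)(p)
}

// scaleInto computes dst[i] = src[i] * gain for the common length of dst and
// src. The 4-wide loop indexes fixed-size array views, so it carries no
// per-access slice bounds check.
func scaleInto(dst, src []float32, gain float32) {
	n := len(src)
	if len(dst) < n {
		n = len(dst)
	}
	if n == 0 {
		return // empty slices skip every loop
	}
	sp := unsafe.Pointer(unsafe.SliceData(src))
	dp := unsafe.Pointer(unsafe.SliceData(dst))
	i := 0
	for ; i+4 <= n; i += 4 {
		// i+4 <= n guarantees the four-element view stays inside the slice.
		s := load4(unsafe.Add(sp, i*4))
		d := load4(unsafe.Add(dp, i*4))
		d[0] = s[0] * gain
		d[1] = s[1] * gain
		d[2] = s[2] * gain
		d[3] = s[3] * gain
	}
	for ; i < n; i++ { // scalar tail
		dst[i] = src[i] * gain
	}
}

func main() {
	src := []float32{1, 2, 3, 4, 5, 6, 7}
	dst := make([]float32, len(src))
	scaleInto(dst, src, 2)
	fmt.Println(dst)
}
```

The safety argument is the loop condition: the view is only formed when a full block fits, and the tail falls back to ordinary checked indexing.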

Finding 2 — kernel gap, archsimd vs hand asm (benchstat)

| kernel | bottleneck | M4 Max | Graviton (Neoverse) |
|---|---|---|---|
| inner_prod | compute | −11% | −20% |
| l1_abs_sum | compute | −57% | wins ✅ |
| xcorr / prefilter_dual | compute | ~tie | ~tie ✅ |
| stereo_merge | mixed | beats/ties | +1% |
| scale_into | pure memory | −28% | +19% ⚠️ |

archsimd wins on compute-bound, ties on mixed, and trails only on the pure
memory-copy scale_into on Neoverse (+19%, down from +280% before the fix). On M4 Max
it matches or beats the asm on every kernel. amd64 (no prior asm) gets a clean 2-7× over scalar.

Finding 3 — the 4-way comparison (codec scoreboard vs libopus 1.6.1 C)

CELT encode, gopus/libopus C ratio (lower = closer to C), via
benchmark_libopus_scoreboard_test.go under three Go builds + self-timed libopus C:

| CELT config | pure Go | Go asm | Go SIMD | libopus C |
|---|---|---|---|---|
| mono 8k 20ms | 1.33× | 1.08× | 1.04× | 1.0 |
| mono 24k 10ms | 1.78× | 1.51× | 1.38× | 1.0 |
| mono 48k 20ms | 1.46× | 1.16× | 1.13× | 1.0 |
| stereo 48k 20ms | 1.50× | 1.35× | 1.27× | 1.0 |

Go SIMD matches or edges the hand asm at the codec level and is the closest Go
build to libopus C; both Go-SIMD and Go-asm are far ahead of pure Go. SILK is
unaffected (no SILK kernels migrated). (Single-run 20x; archsimd-vs-asm is ~a wash
within noise on some configs.)

Complex IMDCT/MDCT kernels — gated behind gopus_reverse64

The 7 complex kernels — imdct_pre_kiss, imdct_tdac, imdct_post_kiss,
mdct_post_twiddle, mdct_mid_fold, mdct_fold1, mdct_fold3 — were the last CELT
float kernels blocked on a missing op: a full 4-lane lane-reverse. They deinterleave
the kissCpx re/im streams with ConcatEven/Odd, reverse descending streams with a
reverse4 built from Float32x4.Reverse64 (NEON VREV64), and scatter the
bit-reversed fold output via temps + a scalar loop (as the asm does).
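
A scalar model of the lane shuffles named above, as a sketch only. The lane semantics are assumptions inferred from the NEON instructions these ops are described as mapping to (REV64, and UZP1/UZP2 for even/odd concatenation); swapHalves is a hypothetical helper, since the text does not say how reverse4 exchanges the two 64-bit halves after Reverse64.

```go
package main

import "fmt"

// vec4 models one 128-bit vector of four float32 lanes.
type vec4 [4]float32

// reverse64 swaps the two lanes inside each 64-bit half: [a b c d] -> [b a d c].
// Assumed to match Float32x4.Reverse64 (NEON REV64 on 32-bit elements).
func reverse64(v vec4) vec4 { return vec4{v[1], v[0], v[3], v[2]} }

// swapHalves exchanges the 64-bit halves: [a b c d] -> [c d a b].
// Hypothetical helper; the real kernels may use a different op for this step.
func swapHalves(v vec4) vec4 { return vec4{v[2], v[3], v[0], v[1]} }

// reverse4 is the full 4-lane reverse built from the two steps above.
func reverse4(v vec4) vec4 { return swapHalves(reverse64(v)) }

// concatEven keeps the even-indexed lanes of a then b (assumed ConcatEven).
func concatEven(a, b vec4) vec4 { return vec4{a[0], a[2], b[0], b[2]} }

// concatOdd keeps the odd-indexed lanes of a then b (assumed ConcatOdd).
func concatOdd(a, b vec4) vec4 { return vec4{a[1], a[3], b[1], b[3]} }

func main() {
	// Two vectors holding four interleaved complex values (re, im, re, im, ...).
	a := vec4{10, 11, 20, 21}
	b := vec4{30, 31, 40, 41}
	re := concatEven(a, b) // real parts in ascending order
	im := concatOdd(a, b)  // imaginary parts in ascending order
	fmt.Println(re, im, reverse4(re))
}
```

With a VLD2-style load the deinterleave would be free; here it costs two shuffles per pair of vectors, which is consistent with the imdct_pre gap reported below.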

Reverse64 is not in upstream Go yet — submitted as golang/go#80032. So the
complex archsimd path is gated behind the gopus_reverse64 build tag:

  • stock-gotip CI is green: without the tag the experiment build uses the complex
    asm fallback (!goexperiment.simd || !gopus_reverse64), and CI validates the
    vanilla archsimd + the complex asm;
  • the complex archsimd path builds only with GOEXPERIMENT=simd -tags gopus_reverse64
    on the patched toolchain (thesyncim/go), until #80032 lands and the gate is dropped;
  • the default build is unchanged — the hand asm still runs.

What we win: archsimd hits parity on the complex family (not the regression you'd
expect from emulated VLD2) — imdct_post 37.6↔37.6 ns, imdct_tdac beats asm
−15%, the folds bit-exact + full-suite green; imdct_pre is the lone +26% outlier
(the VLD2-deinterleave gap). With these, every CELT float kernel is portable.

Verdict on deleting the asm

For the migrated CELT float kernels, the hand asm is replaceable by portable Go
archsimd with equal-or-better performance on both arches and at the codec level;
scale_into (pure copy-scale) is the lone marginal case (+19% on Neoverse). The asm
stays the default here (archsimd is gotip-only); flipping the default is a follow-up
once simd/archsimd ships in a Go release.

Not migrated (genuinely blocked — needs an op archsimd lacks)

(The complex twiddle/fold kernels that needed a lane-reverse are now ported above,
behind gopus_reverse64.)

  • kf_bfly / kf_bfly4m1 — strided twiddle gather + cross-lane radix recombine
  • haar1 — strided gather + even/odd deinterleave archsimd lacks for floats
  • silk resample_fir — indexed-lane FMLA plus int16↔int32 widening (int16_float32
    and float_to_int16 are now ported above, at a measured widen/narrow-fusion gap)
  • silk pitch_xcorr — indexed-lane FMLA (fmla v0.4s, v5.4s, v4.s[i]) has no
    archsimd form; emulating it per lane regresses, and its output mixes fused
    (vector lags) and non-fused (scalar-tail lags) accumulation
  • deemphasis, lpc_synth, up2hq_core — serial IIR, not vectorizable
  • cwrs, pvq_search — integer combinatorial

Harness

scripts/bench-simd-kernel.sh <pkg> <bench> benchstats default-build (asm) vs
archsimd. The non-blocking test-simd-experiment CI job runs it on a 2-arch matrix
(ubuntu-latest amd64 + ubuntu-24.04-arm Graviton).

Test plan

  • Builds clean under GOEXPERIMENT=simd gotip and the normal toolchain (9 configs)
  • Bit-exact parity on every runnable config; AVX/FMA + Graviton via CI
  • asm + purego intact; gofmt + vet clean; unsafe pointer walks stay in range
  • Kernel benchmarks (M4 Max + Graviton) + 4-way codec scoreboard vs libopus C

Add an archsimd implementation of scaleFloat32IntoNEON (dst[i]=src[i]*gain),
gated on goexperiment.simd && !purego and compiled only under the Go tip
toolchain. amd64 scales eight lanes per VMULPS through the 256-bit Float32x8
behind an X86.AVX() guard, with a 128-bit Float32x4 tail and fallback; arm64
uses the 128-bit Float32x4 NEON path. Every lane is a bare per-lane multiply,
the same single-rounding product as the scalar reference and the hand asm, so
the result is bit-exact.

The arm64 NEON asm and the portable scalar fallback stay as the default for
every non-experiment build and as the A/B baseline; their tags gain
!goexperiment.simd so the three paths partition cleanly across every
arch/purego/experiment config.
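
One way to check that a set of build constraints partitions cleanly is to evaluate them over every tag combination with the standard go/build/constraint package. The three constraint lines below are hypothetical, written to match the description above rather than copied from the repository's files.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// Hypothetical constraint lines for the three scale_into paths; the actual
// file headers in the repository may differ.
var paths = map[string]string{
	"archsimd": "//go:build goexperiment.simd && !purego",
	"asm":      "//go:build arm64 && !purego && !goexperiment.simd",
	"scalar":   "//go:build purego || (!arm64 && !goexperiment.simd)",
}

// selected returns the names of the paths whose constraint is satisfied by
// the given tag set (absent tags count as false).
func selected(tags map[string]bool) []string {
	var out []string
	for name, line := range paths {
		expr, err := constraint.Parse(line)
		if err != nil {
			panic(err)
		}
		if expr.Eval(func(tag string) bool { return tags[tag] }) {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	for _, arch := range []string{"amd64", "arm64"} {
		for _, purego := range []bool{false, true} {
			for _, simd := range []bool{false, true} {
				tags := map[string]bool{
					arch:                true,
					"purego":            purego,
					"goexperiment.simd": simd,
				}
				fmt.Println(arch, "purego:", purego, "simd:", simd, "->", selected(tags))
			}
		}
	}
}
```

A clean partition means every configuration selects exactly one path; two would be a duplicate-symbol build failure and zero a missing symbol.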

TestScaleFloat32IntoBitExact asserts Float32bits equality against the scalar
reference on whichever path the build selects. BenchmarkScaleInto times the
kernel against that reference. A non-blocking CI job installs the tip toolchain
and builds, parity-checks, and benchmarks the archsimd path on an amd64 runner.

On arm64 (M-series) the hand NEON asm stays fastest and remains the default;
archsimd trails it but beats scalar, and on amd64 it brings the first SIMD path
to a kernel that was scalar-only there.
… arches in CI

Add scripts/bench-simd-kernel.sh: runs a kernel benchmark under the default
toolchain (arm64 NEON asm, amd64/purego scalar) and under gotip+GOEXPERIMENT=simd
(archsimd), then benchstats them. The dispatch compiles one path per build, so the
comparison is cross-binary on one host; the always-compiled scalar Ref benchmark
anchors both runs.

Make test-simd-experiment a 2-arch matrix (ubuntu-latest amd64 + ubuntu-24.04-arm
arm64) that runs the script, so the archsimd path is measured against the current
hand asm on arm64 and against the scalar fallback on amd64, and arm64 archsimd
parity now runs in CI too. Still non-blocking.

On arm64 (M4 Max) the NEON asm stays well ahead of archsimd (benchstat geomean
+128% sec/op for archsimd); on amd64 archsimd is the only SIMD path and beats
scalar 2-7x.

Port stereoMergeRescaleNEON (l=mid*x; x=lgain*(l-y); y=rgain*(l+y)) to archsimd,
gated goexperiment.simd && !purego. amd64 processes 8 lanes per step through the
256-bit Float32x8 behind an X86.AVX() guard with a 128-bit Float32x4 tail and
fallback; arm64 uses the 128-bit Float32x4 NEON path. Mul, Sub and Add are distinct
archsimd ops — objdump confirms separate FMUL/FSUB/FADD on arm64 and
VMULPS/VSUBPS/VADDPS (no VFMADD/VFMSUB) on amd64 — so each lane keeps the
two-rounding shape of the scalar noFMA32 reference and the hand asm, and the result
is bit-exact.
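
For reference, a scalar sketch of the same per-element recurrence with every intermediate rounded separately. It is an illustration, not the repository's noFMA32 reference: the function name is hypothetical, and the explicit float32 conversions are there because the Go spec lets a compiler fuse a multiply with a following add or subtract unless an explicit conversion forces the rounding.

```go
package main

import "fmt"

// stereoMergeRescale applies, per element and in place:
//
//	l = mid*x;  x = lgain*(l-y);  y = rgain*(l+y)
//
// Each explicit float32 conversion rounds its operand, which prevents the
// compiler from fusing mid*x into the following subtract or add.
func stereoMergeRescale(x, y []float32, mid, lgain, rgain float32) {
	n := len(x)
	if len(y) < n {
		n = len(y)
	}
	for i := 0; i < n; i++ {
		l := float32(mid * x[i])  // first rounding: the product
		diff := float32(l - y[i]) // second rounding: the difference
		sum := float32(l + y[i])  // second rounding: the sum
		x[i] = float32(lgain * diff)
		y[i] = float32(rgain * sum)
	}
}

func main() {
	x := []float32{2, 4}
	y := []float32{1, 2}
	stereoMergeRescale(x, y, 0.5, 2, 4)
	fmt.Println(x, y)
}
```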

The arm64 asm and the scalar fallback stay the default for non-experiment builds;
their tags gain !goexperiment.simd so the three paths partition per config.
TestStereoMergeRescaleBitExact asserts Float32bits equality vs the scalar reference
on whichever path the build selects; BenchmarkStereoMerge feeds the existing
asm-vs-archsimd benchstat harness.

On arm64 (M4 Max) archsimd is competitive — faster at N16/N64, behind the asm at
N176/N480 (geomean +11% sec/op) — so the asm stays the arm64 default; on amd64 it
adds the first SIMD path.

thesyncim changed the title from "celt: migrate kernels to Go simd/archsimd packages (start: scale_into)" to "celt: migrate kernels to Go simd/archsimd packages (scale_into, stereo_merge)" on Jun 15, 2026
…xact, faster than NEON)

Port celtInnerProd8FMA32 to a 4-lane archsimd Float32x4 accumulator with fused
MulAdd (FMLA on arm64, VFMADD on amd64), gated goexperiment.simd && !purego. Lane L
sums elements L,L+4,L+8,… and the reduction is (acc0+acc2)+(acc1+acc3) with a scalar
fused tail — the exact order of the scalar reference and the hand asm, so the result
is bit-exact (proven by the existing TestCeltInnerProd8FMA32MatchesReference). Four
lanes is mandatory: a wider Float32x8 accumulator would build a different partial-sum
tree and diverge.
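
The summation order can be modeled in scalar Go. This sketch reads acc0..acc3 as the four lanes of the single accumulator, which is an assumption; it also uses a plain multiply-add, so it models only the order of the partial sums, not the fused rounding of MulAdd.

```go
package main

import "fmt"

// innerProd4Lane models a 4-lane accumulator: lane l sums the products of
// elements l, l+4, l+8, ... and the lanes reduce as (acc0+acc2)+(acc1+acc3),
// followed by a scalar tail for the last len%4 elements.
func innerProd4Lane(x, y []float32) float32 {
	n := len(x)
	if len(y) < n {
		n = len(y)
	}
	var acc [4]float32
	i := 0
	for ; i+4 <= n; i += 4 {
		for l := 0; l < 4; l++ {
			// Stand-in for one lane of a vector MulAdd; fusion is not modeled.
			acc[l] += x[i+l] * y[i+l]
		}
	}
	sum := (acc[0] + acc[2]) + (acc[1] + acc[3])
	for ; i < n; i++ {
		sum += x[i] * y[i]
	}
	return sum
}

func main() {
	x := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	y := []float32{1, 1, 1, 1, 1, 1, 1, 1, 1, 1}
	fmt.Println(innerProd4Lane(x, y))
}
```

Because float32 addition is not associative, an 8-lane accumulator would group the same products into different partial sums, which is why the port keeps four lanes even where wider vectors are available.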

Unlike the elementwise kernels this stays 4-wide on amd64 too (bit-exactness), and
since amd64 MulAdd lowers to VFMADD unconditionally it gates on archsimd.X86.FMA()
with a scalar fallback. arm64 NEON always has FMLA. The asm, scalar default and arm64
purego paths stay the default for non-experiment builds; their tags gain
!goexperiment.simd where needed so the paths partition per config.

Being compute-bound rather than memory-bound, archsimd here beats the hand NEON asm
on arm64 (M4 Max benchstat: N64 -16%, N176 -15%, geomean -11% sec/op) while staying
bit-exact, and brings a 4-lane FMA SIMD path to amd64 where it was scalar.
…nd bench

Extend the test-simd-experiment matrix to parity-check and benchstat all three
migrated kernels (scale_into, stereo_merge, inner_prod) against the default build,
not just scale_into.

thesyncim changed the title from "celt: migrate kernels to Go simd/archsimd packages (scale_into, stereo_merge)" to "celt: migrate kernels to Go simd/archsimd packages (scale_into, stereo_merge, inner_prod)" on Jun 15, 2026
…, ~matches asm)

Port xcorrKernel4Float32Neon4Acc to four phase-parallel archsimd Float32x4 MulAdd
chains (one per sample in each 4-sample block, lane k = lag k), gated
goexperiment.simd && !purego on arm64. Phase p fuses x[p] into y[p:p+4]; phases
combine as (acc0+acc1)+(acc2+acc3) with a scalar-order tail — the exact order of
xcorrKernel4Float32FourAccRef, so it is bit-exact (proven by the existing
TestXcorrKernel4Float32Neon4AccBitExact). The const pitchXcorrUsesNeonFMA moves into
the experiment file; the asm .go/.s gain !goexperiment.simd. amd64 keeps the scalar
path (it has no 4-acc asm), so this is arm64-only.

Being compute-bound, archsimd lands within ~2% of the hand NEON asm on arm64 (M4 Max
benchstat geomean +1.7%) and is faster on the realistic coarse/half pitch searches
(Coarse -7.7%, Half -3.9%); only tiny length-5 windows regress.
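
A scalar sketch of the four-phase layout described above, under the assumption that acc0..acc3 are the four phase accumulators and each holds one partial sum per lag. The function name and signature are hypothetical, and the plain multiply-add models the accumulation order only, not the fused rounding.

```go
package main

import "fmt"

// xcorr4 returns sums[k] = sum over i<n of x[i]*y[i+k] for lags k = 0..3.
// y must hold at least n+3 elements. Within each 4-sample block, phase p
// multiplies x[i+p] into y[i+p : i+p+4], one lag per lane.
func xcorr4(x, y []float32, n int) [4]float32 {
	var acc [4][4]float32 // acc[p][k]: phase p, lag k
	i := 0
	for ; i+4 <= n; i += 4 {
		for p := 0; p < 4; p++ {
			for k := 0; k < 4; k++ {
				acc[p][k] += x[i+p] * y[i+p+k]
			}
		}
	}
	var sums [4]float32
	for k := 0; k < 4; k++ {
		// Phases combine pairwise: (acc0+acc1)+(acc2+acc3).
		sums[k] = (acc[0][k] + acc[1][k]) + (acc[2][k] + acc[3][k])
	}
	for ; i < n; i++ { // leftover samples, accumulated in scalar order
		for k := 0; k < 4; k++ {
			sums[k] += x[i] * y[i+k]
		}
	}
	return sums
}

func main() {
	x := []float32{1, 2, 3, 4, 5}
	y := []float32{0, 1, 2, 3, 4, 5, 6, 7}
	fmt.Println(xcorr4(x, y, len(x)))
}
```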
…hand asm

The archsimd scale_into was 3-5x slower than the NEON asm not because of the SIMD
but because archsimd.LoadFloat32x4(src[i:])/.Store(dst[i:]) emit a slice length
bounds check and panic path on every load and store, dominating this load/store-
bound kernel. Loading through *[4]float32 views over advancing unsafe pointers
(LoadFloat32x4Array/StoreArray) drops the checks; with an 8-wide unroll the archsimd
loop now beats the hand NEON asm on M4 Max (N16 -17%, N64 -31%, N480 -21%) while
staying bit-exact (TestScaleFloat32IntoBitExact). The pointers only advance while
i+k <= n, so all accesses stay in range; empty slices skip every loop.
…he hand asm

Apply the scale_into win to every archsimd kernel: stereo_merge, inner_prod and
xcorr now load/store through raw advancing pointers via shared loadF32x4/storeF32x4
(and loadF32x8/storeF32x8 on amd64) instead of archsimd.LoadFloat32x4(s[i:]), which
emits a slice bounds check and panic path on every access. The check machinery, not
the SIMD, was what made archsimd several times slower than the hand asm.

With the checks gone the archsimd kernels beat the hand NEON asm on M4 Max (combined
benchstat geomean -21% sec/op, +19% bandwidth) while staying bit-exact — every
parity test passes on arm64 archsimd and amd64 (SSE fallback under Rosetta). The
helpers inline to a bare vector load/store; pointers only advance within range, so
accesses stay valid and empty slices skip all loops.

Record the goexperiment.simd archsimd kernels as a measurement track toward
replacing hand assembly with portable Go SIMD, bit-exact and (with bounds-check
elimination) beating the asm on Apple Silicon; asm stays the released default.
… (bit-exact)

Port prefilterDualInnerProdAsm (sum1=<x,y1>, sum2=<x,y2>) to two 4-lane archsimd
FMA accumulators loading through raw pointers (loadF32x4, no slice bounds checks),
gated goexperiment.simd && !purego. Lane L sums elements L,L+4,L+8,… and each
reduction is (a0+a2)+(a1+a3) with a scalar fused tail — the exact order of the
scalar reference, so bit-exact (TestPrefilterDualInnerProdMatchesReference). amd64
gates on archsimd.X86.FMA() with a scalar fallback; arm64 NEON always has FMLA. The
asm .go/.s gain !goexperiment.simd; default.go excludes amd64-under-experiment.
…m64)

Port l1AbsSumNeon to a 4-lane archsimd accumulator (Abs + Add) loading through raw
pointers (loadF32x4, no slice bounds checks), gated goexperiment.simd && !purego on
arm64. Lane k sums |tmp[k]|,|tmp[k+4]|,… and the reduction is (a0+a1)+(a2+a3)+tail —
the exact order of l1AbsSumNeonReference, so bit-exact (TestL1AbsSumNeonBitExact).
That order diverges from the scalar L1 sum by a few ULP (the arm64 quality-gated
regime), so amd64/purego keep the scalar sum — this stays arm64-only. The asm .go/.s
gain !goexperiment.simd.

The archsimd kernels match or beat the hand asm on Apple Silicon and Graviton, and
the 4-way codec scoreboard shows Go SIMD as the closest Go build to libopus C.

thesyncim commented (Owner, Author)

4-way comparison — Decode half (CELT decode, gopus/libopus-C ratio, lower = closer to C)

Complements the Encode table in the PR description. Decode is where scale_into and
stereo_merge live, and archsimd is consistently the closest Go build to libopus C:

| CELT decode | pure Go | Go asm | Go SIMD | libopus C |
|---|---|---|---|---|
| mono 8k 10ms | 1.60× | 1.50× | 1.29× | 1.0 |
| mono 48k 10ms | 1.70× | 1.64× | 1.23× | 1.0 |
| mono 48k 20ms | 1.57× | 1.45× | 1.32× | 1.0 |
| stereo 24k 10ms | 1.52× | 1.47× | 1.31× | 1.0 |
| stereo 48k 10ms | 1.48× | 1.47× | 1.34× | 1.0 |

Go SIMD wins on most decode configs (a couple noisy ties at 20x). Final CI green.

thesyncim commented (Owner, Author)

Evaluated: the portable scalable simd package (vs fixed-width archsimd)

Tried reimplementing scale_into on the scalable simd.Float32s type (auto-widens to
the platform width — 4 on this arm64, 8/16 on AVX2/AVX-512 — from one source, the more
portable alternative to per-arch Float32x4/Float32x8).

Two blockers at this Go tip:

  1. It crashes the compiler on arm64 — internal compiler error: nil pointer dereference in cmd/compile/internal/midway/rewrite.go (the pass that lowers
    scalable vectors). Not usable yet.
  2. Even if it compiled, its API is slice-only (LoadFloat32s([]float32)) with no
    raw-pointer load, so it would keep the per-access bounds check that the unsafe
    archsimd kernels eliminate — the very cost that decides asm-vs-archsimd here.

Conclusion: fixed-width archsimd + the unsafe raw-pointer trick is the viable
path; the scalable package is too immature on arm64 to evaluate for performance.
Revisit once the midway backend stabilizes and an array-pointer load lands.

scale_into is the one archsimd kernel that trailed the hand asm on Neoverse (+19%,
the pure memory-copy case). Widen the arm64 unroll from 8 to 16 elements/iter (the
width that profiled fastest in the optimization sweep) to amortize loop control on
the overhead-bound core. M4 Max improves (N480 -20% vs asm, was -17%) and stays
bit-exact; Graviton effect measured in CI.

thesyncim commented (Owner, Author)

scale_into Graviton holdout — closed via 16-wide unroll

scale_into was the lone kernel trailing the hand asm on Neoverse (+19% at N480).
Widening the arm64 unroll from 8 to 16 elements/iter (amortizing loop control on the
overhead-bound core) closes it — Graviton archsimd vs asm:

| N | 8-wide (before) | 16-wide (now) |
|---|---|---|
| 64 | (trailed) | −11% (archsimd faster) |
| 176 | | +4.6% |
| 480 | +19% | +8.6% |

All sizes now ≤8.6% or faster — inside the 10-15% replace-the-asm bar. M4 Max also
improved (N480 −20% vs asm, was −17%); still bit-exact.

Updated verdict: with the holdout closed, archsimd is within-or-better than the
hand NEON asm on all 6 migrated kernels across both M4 Max and Graviton — the
float-kernel assembly is fully replaceable by portable Go SIMD (once simd/archsimd
ships in a Go release; asm stays the default until then).

thesyncim commented (Owner, Author)

Re-examined the "blocked" kernels — root cause is archsimd's thinner arm64 backend

Pushed harder on whether more kernels can replace the asm. Findings:

  • silk int16↔float32 is portable, not blocked (I was wrong): Int16x8.ExtendLo4ToInt32() →
    Int32x4.ConvertToFloat32() → ×2⁻¹⁵ is bit-exact. But benchmarked it's +27% vs the asm
    (tried 4-lane and 2-load 8-lane, both ~+27%): arm64 archsimd has ExtendLo4 but no
    ExtendHi4, so it can't widen both int16 halves like the asm's SXTL/SXTL2. Portable but
    not a competitive replacement, so reverted (won't ship a regression).
  • FFT/MDCT complex kernels are genuinely arm64-blocked: the bit-reinterpret-back
    (Int32x4.AsFloat32x4) exists on amd64 but not arm64, so the re/im float deinterleave they
    need is impossible without a memory round-trip.

Root cause: archsimd's arm64 backend is less complete than amd64 (doc: "currently supports
AMD64"). What is missing on arm64 (ExtendHi4, AsFloat32x4 reinterpret, float deinterleave/reverse,
gather) is what caps competitive replacement beyond the 6 float kernels. Those 6 remain the set
archsimd matches/beats on arm64; more will qualify as the arm64 backend gains ops.

thesyncim added 11 commits June 16, 2026 12:18

Port the constant-gain 5-tap comb filter (combFilterConstNeon) from hand
Plan9 asm to Go simd/archsimd, gated goexperiment.simd && !purego. Two tap
sums round as plain Adds and the three accumulates are fused MulAdds,
matching combFilterConstValue and the asm bit-for-bit. Raw-pointer loads
(loadF32x4/storeF32x4) skip the per-lane bounds check. The hand asm .go/.s
retag !goexperiment.simd so the paths partition per build config.

Port the spreading-rotation pass (expRotation1PassNeon) from hand Plan9 asm
to Go simd/archsimd, gated goexperiment.simd && !purego. Each block runs two
rounded Muls and two fused MulAdds over the x[i] and x[i+stride] lanes,
matching expRotationMac32 and the asm bit-for-bit; stride>=4 keeps the four
lanes independent. The shared expRotation1StrideNeon wrapper and the
expRotationUsesNeon flag stay in the umbrella arm64 file; the asm decl moves
to a !goexperiment.simd file and the .s retags so the kernel symbol
partitions per build. Parity holds and the Stride2 case runs ~12% faster
than the asm.

Port the int16->float32 widening conversion (writeInt16AsFloat32Core)
from hand NEON asm to simd/archsimd, gated arm64 && goexperiment.simd &&
!purego; retag the asm .go/.s !goexperiment.simd so the build partitions
per config. The default (asm + purego) build is untouched.

Each int16 is sign-extended to int32, converted to float32 and multiplied
by the exact-in-float32 1/32768 scale -- the same per-element op sequence
as the scalar reference, so it is bit-exact under both the experiment
toolchain and normal go test. The low 4 lanes widen with ExtendLo4ToInt32
and the high 4 with HiToLo().ExtendLo4ToInt32() (arm64 has no ExtendHi4),
the extra shift emulating SXTL2. Loads/stores go through raw advancing
pointers (new simd_load_store.go int16/int32/f32 helpers) to drop the
per-access slice bounds check.

Measured: archsimd 22.97 ns/op vs asm 19.02 ns/op (+20.8%, n=6) -- the
expected gap from the HiToLo shift; we take the coverage.

Port the float32->int16 narrowing conversion (floatToInt16ScaledCore plus
the floatToInt16Scaled wrapper) from hand NEON asm to simd/archsimd, gated
arm64 && goexperiment.simd && !purego; retag the asm .go/.s
!goexperiment.simd so the build partitions per config. The default
(asm + purego) build is untouched.

The op sequence reproduces the FCVTNS+SQXTN asm and the scalar
saturate-then-round-even path bit-for-bit: Mul(scale), Round() (VFRINTN,
ties-to-even) so the value is an exact integer, ConvertToInt32() (VFCVTZS
truncation is then lossless), and SaturateToInt16() (VSQXTN clamp to
[-32768,32767]). Two int32 quads saturate separately and their low halves
merge with a VZIP1 (Uint64x2.InterleaveLo), reproducing SQXTN2.
TestFloatToInt16ScaledBitExact and ...Boundaries pass under both the
experiment toolchain and normal go test.

Measured: archsimd 23.87 ns/op vs asm 17.66 ns/op (+35.2%, n=6) -- the
gap is Round+ConvertToInt32 (two ops) vs the asm's fused FCVTNS plus the
VZIP1 merge vs free SQXTN2; we take the coverage.
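
A per-element scalar model of that op sequence (multiply, round to nearest even, convert, saturate), offered as a sketch rather than the repository's scalar reference. The function name is hypothetical, and mapping NaN to 0 is an assumption about the conversion step, not something the commit message states.

```go
package main

import (
	"fmt"
	"math"
)

// floatToInt16Scaled models one lane of the conversion:
// Mul(scale) -> Round (ties to even) -> ConvertToInt32 -> SaturateToInt16.
func floatToInt16Scaled(x, scale float32) int16 {
	p := float32(x * scale)           // single float32 rounding of the product
	r := math.RoundToEven(float64(p)) // exact integer, ties go to the even value
	switch {
	case r != r: // NaN; returning 0 is an assumption made for this sketch
		return 0
	case r > math.MaxInt16:
		return math.MaxInt16 // saturate high
	case r < math.MinInt16:
		return math.MinInt16 // saturate low
	}
	return int16(r) // lossless: r is already an in-range integer
}

func main() {
	for _, v := range []float32{0.5, 1.5, 2.5, -0.5} {
		fmt.Println(v, "->", floatToInt16Scaled(v, 1))
	}
	fmt.Println(floatToInt16Scaled(1, 32768), floatToInt16Scaled(-1, 32768))
}
```

Rounding before the integer conversion is what makes the truncating convert lossless, which is the property the two-op archsimd sequence relies on to match the fused FCVTNS.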
…4 patch)

Port the IMDCT pre-rotation and TDAC windowing kernels to archsimd, gated
goexperiment.simd && !purego. Both rely on a full 4-lane reverse (reverse4) built on
Float32x4.Reverse64 — the arm64 VREV64 op added in github.com/thesyncim/go branch
arm64-simd-reverse64 — for their descending spectrum/window accesses; they will not
build on a vanilla gotip. Bit-exact vs the scalar references
(TestIMDCTPreRotateFMA32KissMatchesScalar, TestIMDCTTDACWindowFMA32MatchesScalar).

imdct_tdac (real-valued, reverse only) beats the hand asm on M4 Max (-15%); imdct_pre
(complex re/im deinterleave) is +26%, the cost of archsimd lacking a VLD2-style
deinterleaving load. The asm/purego paths stay the default for non-experiment builds.

Port the IMDCT post-rotation (imdctPostRotateF32FromKiss) from hand asm to
Go simd/archsimd, gated goexperiment.simd && !purego (needs the Reverse64 op
from the patched toolchain). Forward four come from a ConcatEven/Odd
deinterleave of the kissCpx scratch; the backward four walk down and reverse4
so descending lanes line up; products are single-round Muls and accumulates
fused MulAdds, matching mdctMulAddMix/mdctMulSubMix bit-for-bit. The asm
.go/.s retag !goexperiment.simd. Benchmarks at asm parity.

Port the forward-MDCT post-twiddle (mdctPostTwiddleNeon) from hand asm to Go
simd/archsimd, gated goexperiment.simd && !purego. Each block pairs a forward
run and its mirror j=n4-1-i; ConcatEven/Odd deinterleaves the kissCpx, the two
ends tile coeffs contiguously via reverse4 + InterleaveLo/Hi. Products are
single-round Muls and the combines plain Sub/Add (no fusion), matching mdctMul
and the scalar loop bit-for-bit. The asm .go/.s retag !goexperiment.simd.

Port the forward-MDCT middle fold (mdctMidFoldStoreNeon) from hand asm to Go
simd/archsimd, gated goexperiment.simd && !purego. The (re,im) compute is
vectorized four lanes at a time (ConcatEven deinterleave, descending re via
reverse4, fused MulAdd, scaling Mul) matching mdctStoreDirectStageFMALike
bit-for-bit; the bit-reversed dst[bitrev[i]] scatter stays scalar, exactly as
the asm does. The asm .go/.s retag !goexperiment.simd.

Port the two windowed forward-MDCT fold kernels (mdctFold1StoreNeon,
mdctFold3StoreNeon) from hand asm to Go simd/archsimd, gated
goexperiment.simd && !purego. Each deinterleaves six sample/window streams
(ConcatEven ascending, reverse4 descending) and combines them with fused
MulAdds + single-round Muls matching mdctMulAddMixEncode/mdctMulSubMixEncode/
mdctMulSubMixAlt; the shared twiddle/scale/bit-reversed scatter tail
(mdctFoldStore) mirrors mid_fold. The asm .go/.s retag !goexperiment.simd.

The 7 complex IMDCT/MDCT archsimd kernels need Float32x4.Reverse64, which is
not in upstream Go yet (golang/go#80032). Gate their simd path behind the
gopus_reverse64 build tag and route the experiment build to the asm fallback
when it is unset, so the stock-gotip experiment CI stays green (vanilla
archsimd + complex asm) and the complex archsimd path builds only on the
patched toolchain with -tags gopus_reverse64.

thesyncim marked this pull request as draft June 16, 2026 13:01

thesyncim changed the title from "celt: migrate kernels to Go simd/archsimd packages (scale_into, stereo_merge, inner_prod)" to "celt+silk: migrate hot kernels to Go simd/archsimd (vanilla wins + silk widen + gated complex IMDCT/MDCT)" on Jun 16, 2026