Remove align-up vectors in dense Load/Store of SVE2 by stevesuzuki-arm · Pull Request #9120 · halide/Halide

stevesuzuki-arm · 2026-05-04T15:54:59Z

Now that fallback mechanism excludes odd sized vectors in SVE2, this change results in better instructions by LLVM optimization than using predicates.

Breaking changes

No

Checklist

Tests added or updated (not required for docs, CI config, or typo fixes)
Documentation updated (if public API changed)
Python bindings updated (if public API changed)
Benchmarks are included here if the change is intended to affect performance.
Commits include AI attribution where applicable (see Code of Conduct)

Now that fallback mechanism excludes odd sized vectors in SVE2, this change results in better instructions by LLVM optimization than using predicates.

alexreinking · 2026-05-04T17:14:50Z

re: "Benchmarks are included here if the change is intended to affect performance" and "this change results in better instructions by LLVM optimization than using predicates."

Can you please show us a before/after codegen snippet? If the point is to change codegen, I like to see the change 🙂

stevesuzuki-arm · 2026-05-04T19:12:16Z

re: "Benchmarks are included here if the change is intended to affect performance" and "this change results in better instructions by LLVM optimization than using predicates."

Can you please show us a before/after codegen snippet? If the point is to change codegen, I like to see the change 🙂

Before

  %222 = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr nonnull align 2 %54, <vscale x 16 x i1> %38, <vscale x 16 x i8> poison), !tbaa !184
  %223 = tail call <vscale x 8 x i8> @llvm.vector.extract.nxv8i8.nxv16i8(<vscale x 16 x i8> %222, i64 0)
  %224 = tail call <vscale x 16 x i8> @llvm.vector.insert.nxv16i8.nxv8i8(<vscale x 16 x i8> zeroinitializer, <vscale x 8 x i8> %223, i64 0)
  %225 = add nuw nsw i64 %56, 524
  %226 = mul nsw i64 %225, %39
  %227 = getelementptr i8, ptr %bilateral_slice_and_apply_2, i64 %226
  %228 = getelementptr i8, ptr %227, i64 930
  tail call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> %224, ptr align 1 %228, <vscale x 16 x i1> %38), !tbaa !148

	ld1b	{ z0.b }, p2/z, [x14]
	uunpklo	z0.h, z0.b
	uzp1	z0.b, z0.b, z2.b
	st1b	{ z0.b }, p2, [x13, x20]

Align-up to x16 and slice down/up is probablly complicating

After

  %185 = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr nonnull align 2 %49, <vscale x 8 x i1> %33, <vscale x 8 x i8> poison), !tbaa !184
  %186 = add nuw nsw i64 %51, 524
  %187 = mul nsw i64 %186, %34
  %188 = getelementptr i8, ptr %bilateral_slice_and_apply_2, i64 %187
  %189 = getelementptr i8, ptr %188, i64 930
  tail call void @llvm.masked.store.nxv8i8.p0(<vscale x 8 x i8> %185, ptr align 1 %189, <vscale x 8 x i1> %33), !tbaa !148

	ld1b	{ z0.h }, p1/z, [x17]
	st1b	{ z0.h }, p1, [x9, x20]

alexreinking · 2026-05-04T20:10:35Z

Thanks! That's clearly better... is there a simple regression test we can add that checks for the absence of uunpklo and uzp1 instructions?

stevesuzuki-arm · 2026-05-05T07:21:40Z

Thanks! That's clearly better... is there a simple regression test we can add that checks for the absence of uunpklo and uzp1 instructions?

    Target t("arm-64-linux-sve2-vector_bits_128-no_asserts-no_runtime-no_bounds_query");
    ImageParam im(UInt(8), 1);
    Func f("f");
    f(x) = im(x);
    f.compute_root().vectorize(x, 8);
    f.compile_to_assembly("/dev/stdout", f.infer_arguments(), t);

This is a simple repro. If you think it worth adding to existing tests, would you suggest where to place?

alexreinking · 2026-05-05T15:04:29Z

This is a simple repro. If you think it worth adding to existing tests, would you suggest where to place?

In simd_op_check_arm should be good.

stevesuzuki-arm · 2026-05-05T15:13:44Z

This is a simple repro. If you think it worth adding to existing tests, would you suggest where to place?

In simd_op_check_arm should be good.

Just in case, do you mean simd_op_check_arm instead of simd_op_check_sve2?
Do you know if there is any existing test which checks if some instruction is not emitted?

Remove align-up vectors in dense Load/Store of SVE2

d6d20f3

Now that fallback mechanism excludes odd sized vectors in SVE2, this change results in better instructions by LLVM optimization than using predicates.

alexreinking requested a review from halidebuildbots May 5, 2026 15:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove align-up vectors in dense Load/Store of SVE2#9120

Remove align-up vectors in dense Load/Store of SVE2#9120
stevesuzuki-arm wants to merge 1 commit intohalide:mainfrom
stevesuzuki-arm:pr-rm_ls_predicate

stevesuzuki-arm commented May 4, 2026 •

edited

Loading

Uh oh!

alexreinking commented May 4, 2026

Uh oh!

stevesuzuki-arm commented May 4, 2026

Uh oh!

alexreinking commented May 4, 2026

Uh oh!

stevesuzuki-arm commented May 5, 2026

Uh oh!

alexreinking commented May 5, 2026

Uh oh!

stevesuzuki-arm commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

stevesuzuki-arm commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Breaking changes

Checklist

Uh oh!

alexreinking commented May 4, 2026

Uh oh!

stevesuzuki-arm commented May 4, 2026

Uh oh!

alexreinking commented May 4, 2026

Uh oh!

stevesuzuki-arm commented May 5, 2026

Uh oh!

alexreinking commented May 5, 2026

Uh oh!

stevesuzuki-arm commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stevesuzuki-arm commented May 4, 2026 •

edited

Loading