Skip to content

Remove align-up vectors in dense Load/Store of SVE2#9120

Open
stevesuzuki-arm wants to merge 1 commit intohalide:mainfrom
stevesuzuki-arm:pr-rm_ls_predicate
Open

Remove align-up vectors in dense Load/Store of SVE2#9120
stevesuzuki-arm wants to merge 1 commit intohalide:mainfrom
stevesuzuki-arm:pr-rm_ls_predicate

Conversation

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor

@stevesuzuki-arm stevesuzuki-arm commented May 4, 2026

Now that fallback mechanism excludes odd sized vectors in SVE2, this change results in better instructions by LLVM optimization than using predicates.

Breaking changes

No

Checklist

  • Tests added or updated (not required for docs, CI config, or typo fixes)
  • Documentation updated (if public API changed)
  • Python bindings updated (if public API changed)
  • Benchmarks are included here if the change is intended to affect performance.
  • Commits include AI attribution where applicable (see Code of Conduct)

Now that fallback mechanism excludes odd sized vectors in SVE2,
this change results in better instructions by LLVM optimization
than using predicates.
@alexreinking
Copy link
Copy Markdown
Member

re: "Benchmarks are included here if the change is intended to affect performance" and "this change results in better instructions by LLVM optimization than using predicates."

Can you please show us a before/after codegen snippet? If the point is to change codegen, I like to see the change 🙂

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

re: "Benchmarks are included here if the change is intended to affect performance" and "this change results in better instructions by LLVM optimization than using predicates."

Can you please show us a before/after codegen snippet? If the point is to change codegen, I like to see the change 🙂

Before

  %222 = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0(ptr nonnull align 2 %54, <vscale x 16 x i1> %38, <vscale x 16 x i8> poison), !tbaa !184
  %223 = tail call <vscale x 8 x i8> @llvm.vector.extract.nxv8i8.nxv16i8(<vscale x 16 x i8> %222, i64 0)
  %224 = tail call <vscale x 16 x i8> @llvm.vector.insert.nxv16i8.nxv8i8(<vscale x 16 x i8> zeroinitializer, <vscale x 8 x i8> %223, i64 0)
  %225 = add nuw nsw i64 %56, 524
  %226 = mul nsw i64 %225, %39
  %227 = getelementptr i8, ptr %bilateral_slice_and_apply_2, i64 %226
  %228 = getelementptr i8, ptr %227, i64 930
  tail call void @llvm.masked.store.nxv16i8.p0(<vscale x 16 x i8> %224, ptr align 1 %228, <vscale x 16 x i1> %38), !tbaa !148
	ld1b	{ z0.b }, p2/z, [x14]
	uunpklo	z0.h, z0.b
	uzp1	z0.b, z0.b, z2.b
	st1b	{ z0.b }, p2, [x13, x20]

Align-up to x16 and slice down/up is probablly complicating

After

  %185 = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0(ptr nonnull align 2 %49, <vscale x 8 x i1> %33, <vscale x 8 x i8> poison), !tbaa !184
  %186 = add nuw nsw i64 %51, 524
  %187 = mul nsw i64 %186, %34
  %188 = getelementptr i8, ptr %bilateral_slice_and_apply_2, i64 %187
  %189 = getelementptr i8, ptr %188, i64 930
  tail call void @llvm.masked.store.nxv8i8.p0(<vscale x 8 x i8> %185, ptr align 1 %189, <vscale x 8 x i1> %33), !tbaa !148
	ld1b	{ z0.h }, p1/z, [x17]
	st1b	{ z0.h }, p1, [x9, x20]

@alexreinking
Copy link
Copy Markdown
Member

Thanks! That's clearly better... is there a simple regression test we can add that checks for the absence of uunpklo and uzp1 instructions?

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

Thanks! That's clearly better... is there a simple regression test we can add that checks for the absence of uunpklo and uzp1 instructions?

    Target t("arm-64-linux-sve2-vector_bits_128-no_asserts-no_runtime-no_bounds_query");
    ImageParam im(UInt(8), 1);
    Func f("f");
    f(x) = im(x);
    f.compute_root().vectorize(x, 8);
    f.compile_to_assembly("/dev/stdout", f.infer_arguments(), t);

This is a simple repro. If you think it worth adding to existing tests, would you suggest where to place?

@alexreinking
Copy link
Copy Markdown
Member

This is a simple repro. If you think it worth adding to existing tests, would you suggest where to place?

In simd_op_check_arm should be good.

@stevesuzuki-arm
Copy link
Copy Markdown
Contributor Author

This is a simple repro. If you think it worth adding to existing tests, would you suggest where to place?

In simd_op_check_arm should be good.

Just in case, do you mean simd_op_check_arm instead of simd_op_check_sve2?
Do you know if there is any existing test which checks if some instruction is not emitted?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants