8361582: AArch64: Some ConH values cannot be replicated with SVE #26589

Bhavana-Kilambi · 2025-08-01T09:31:40Z

After this commit - a49ecb2, the JTREG test -
test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java fails for some of the tests which contain constant values such as -

public void vectorAddConstInputFloat16() {
         for (int i = 0; i < LEN; ++i) {
             output[i] = float16ToRawShortBits(add(shortBitsToFloat16(input1[i]), FP16_CONST));
         }
     }

The current code in the JDK results in the generation of sve_dup instruction for every 16-bit immediate while the acceptable range is [-128, 127] for 8-bit immediates and [-127 << 8, 128 << 8] with a multiple of 256 for 16-bit signed immediates.

This patch allows the generation of sve_dup instruction for only those 16-bit values which are within the limits as specified above and for the values which are out of range, the immediate half float value is loaded from the constant pool into a register ("loadConH" mach node) which is then replicated or broadcasted to an SVE register ("replicateHF" mach node).

Both the tests - test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java and test/hotspot/jtreg/compiler/c2/aarch64/TestFloat16Replicate.java pass on 256-bit SVE machine. JTREG tests - hotspot (hotspot_all), langtools (tier1) and jdk(tier 1-3) pass on the same machine.

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

JDK-8361582: AArch64: Some ConH values cannot be replicated with SVE (Bug - P3)

Reviewers

Aleksey Shipilev (@shipilev - Reviewer)
Andrew Haley (@theRealAph - Reviewer) 🔄 Re-review required (review applies to bcecc6e1)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26589/head:pull/26589
$ git checkout pull/26589

Update a local copy of the PR:
$ git checkout pull/26589
$ git pull https://git.openjdk.org/jdk.git pull/26589/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26589

View PR using the GUI difftool:
$ git pr show -t 26589

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26589.diff

Using Webrev

Link to Webrev Comment

After this commit - openjdk@a49ecb2, the JTREG test - hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java fails for some of the tests which contain constant values such as - public void vectorAddConstInputFloat16() { for (int i = 0; i < LEN; ++i) { output[i] = float16ToRawShortBits(add(shortBitsToFloat16(input1[i]), FP16_CONST)); } } <The full failure log is present in the JBS ticket, thus not reproducing it here> The current code in the JDK results in the generation of sve_dup instruction for every 16-bit immediate while the acceptable range is [-128, 127] for 8-bit immediates and [-127 << 8, 128 << 8] with a multiple of 256 for 16-bit signed immediates. This patch allows the generation of sve_dup instruction for only those 16-bit values which are within the limits as specified above and for the values which are out of range, the immediate half float value is loaded from the constant pool into a register ("loadConH" mach node) which is then replicated or broadcasted to an SVE register ("replicateHF" mach node).

bridgekeeper · 2025-08-01T09:32:45Z

👋 Welcome back bkilambi! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-08-01T09:33:36Z

@Bhavana-Kilambi This change is no longer ready for integration - check the PR body for details.

openjdk · 2025-08-01T09:34:34Z

@Bhavana-Kilambi The following label will be automatically applied to this pull request:

hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-08-01T09:41:47Z

Webrevs

src/hotspot/cpu/aarch64/aarch64_vector.ad

shipilev

Thank you for taking care of this.

I am still a bit confused what matches Replicate with immH that does not fit immH8_shift8 when Matcher::vector_length_in_bytes(n) > 16?

src/hotspot/cpu/aarch64/aarch64.ad

Bhavana-Kilambi · 2025-08-01T12:04:35Z

I am still a bit confused what matches Replicate with immH that does not fit immH8_shift8 when Matcher::vector_length_in_bytes(n) > 16?

Hi, thanks for your review. If the immediate value does not fit immH8_shift8 for Matcher::vector_length_in_bytes(n) > 16 , the compiler would generate loadConH [1] -> replicateHF [2] backend nodes instead. The constant would be loaded from the constant pool instead and then broadcasted/replicated to every lane of an SVE register.

[1]

jdk/src/hotspot/cpu/aarch64/aarch64.ad

Line 6963 in 8ac4a88

instruct loadConH(vRegF dst, immH con) %{

[2]

jdk/src/hotspot/cpu/aarch64/aarch64_vector.ad

Line 4806 in 8ac4a88

instruct replicateHF(vReg dst, vRegF src) %{

src/hotspot/cpu/aarch64/assembler_aarch64.cpp

shipilev · 2025-08-01T14:29:47Z

If the immediate value does not fit immH8_shift8 for Matcher::vector_length_in_bytes(n) > 16 , the compiler would generate loadConH [1] -> replicateHF [2] backend nodes instead.

Ah OK, just checking. I ran this patch on the machine where I have originally found the issue, and it seems to work.

openjdk · 2025-08-06T18:22:15Z

@Bhavana-Kilambi this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8361582
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

Bhavana-Kilambi · 2025-08-07T13:19:47Z

Hi @shipilev @theRealAph Can I please ask for another round of review? I have addressed the review comments in the latest patch. Thanks a lot!

test/hotspot/jtreg/compiler/c2/aarch64/TestFloat16Replicate.java

src/hotspot/cpu/aarch64/assembler_aarch64.cpp

shipilev

This looks reasonable to me, thanks. Some nits in the test remain, but they are non-blocking.

/reviewers 2

test/hotspot/jtreg/compiler/c2/aarch64/TestFloat16Replicate.java

shipilev · 2025-08-11T09:11:24Z

test/hotspot/jtreg/compiler/c2/aarch64/TestFloat16Replicate.java

+    private static short[] input;
+    private static short[] output;


This might give things even more chance to vectorize? Not sure, feel free to ignore.

Suggested change

private static short[] input;

private static short[] output;

private static final short[] INPUTE;

private static final short[] OUTPUT;

I hope it's ok to not add these changes to the code. The loops are getting vectorized fine and the tests do pass on aarch64 and x86. I will consider this if there's any issue with auto vectorization in the futurre. Thanks

openjdk · 2025-08-11T09:15:54Z

@shipilev
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

Bhavana-Kilambi · 2025-08-11T10:13:43Z

Thanks for your review comments and approval @shipilev. I will address your review comments in the next patchset along with any other comments.

@theRealAph Would you be able to take another look at the updated patch please? Thank you in advance!

theRealAph · 2025-08-11T12:09:18Z

For loadConH, LLVM and GCC use

        mov     wscratch, #const
        dup     v0.4h, wscratch

We should investigate that.

As far as I can see, LLVM and GCC do this for all vector immediates that don't need more than 2 movz/movk instructions.

shipilev · 2025-08-12T09:01:51Z

Can we/should we plug this problem in encoding first, without going too much into the optimizing the non-broken case? As it stands now, real FP16-using code can run into matcher errors in JDK 25. I would like to fix that first.

Bhavana-Kilambi · 2025-08-12T09:13:15Z

HI @theRealAph Thanks a lot for your comment. I feel it is a good idea to modify loadConH to move a constant instead of doing an ldr from the constant pool (it could probably get us some performance benefit as well). However, the scope of this ticket was to mainly fix the JTREG errors that >16B SVE machines were running into due to illegal immediates being passed to the sve_dup instruction. Would it be acceptable if I push this fix first and then create a follow up task to work on optimizing loadConH? I can create a new JBS ticket and assign it to myself and tag it here as well if that helps. Thank you!

theRealAph · 2025-08-12T09:50:21Z

HI @theRealAph Thanks a lot for your comment. I feel it is a good idea to modify loadConH to move a constant instead of doing an ldr from the constant pool (it could probably get us some performance benefit as well). However, the scope of this ticket was to mainly fix the JTREG errors that >16B SVE machines were running into due to illegal immediates being passed to the sve_dup instruction. Would it be acceptable if I push this fix first and then create a follow up task to work on optimizing loadConH? I can create a new JBS ticket and assign it to myself and tag it here as well if that helps. Thank you!

Well, yes, but I'm proposing a simpler and better fix to the problem. Sure, if you want to do this in two steps go ahead.

Bhavana-Kilambi · 2025-08-12T10:25:55Z

For loadConH, LLVM and GCC use
        mov     wscratch, #const
        dup     v0.4h, wscratch
We should investigate that.

As far as I can see, LLVM and GCC do this for all vector immediates that don't need more than 2 movz/movk instructions.

Just a quick look at what can be done to improve the codegen -
The code shown above is Neon and this is similar to what we generate in hotspot as well (for Neon) -

0x0000e8a5e482ea80:   mov     w8, #0x40b                      // #1035
0x0000e8a5e482ea84:   dup     v18.8h, w8

Only on SVE machines with >16B vectors, we load from the constant table -

0x0000e9cdc902e370:   ldr     s16, 0x0000e9cdc902e280     ;   {section_word}
0x0000e9cdc902e440:   mov     z17.h, p7/m, h16

We could generate a similar mov and a dup from a GPR (instead of immediate) even in the SVE case. I'll update the patch as soon as I can. Thanks a lot for your suggestion

Bhavana-Kilambi · 2025-08-13T10:17:54Z

HI @theRealAph Thanks a lot for your comment. I feel it is a good idea to modify loadConH to move a constant instead of doing an ldr from the constant pool (it could probably get us some performance benefit as well). However, the scope of this ticket was to mainly fix the JTREG errors that >16B SVE machines were running into due to illegal immediates being passed to the sve_dup instruction. Would it be acceptable if I push this fix first and then create a follow up task to work on optimizing loadConH? I can create a new JBS ticket and assign it to myself and tag it here as well if that helps. Thank you!

Well, yes, but I'm proposing a simpler and better fix to the problem. Sure, if you want to do this in two steps go ahead.

Apologies, I thought I could change just the replicate backend nodes to be able to generate the mov to a scratch reg -> dup to replicate the value but missed the point that I can't still get rid of the loadConH node that loads the immediate from the constant pool. If we want to change loadConH to instead generate a mov of an immediate to a scratch register, then we might have to change the dst from being a vRegF to a iRegI -
instruct loadConH(vRegF dst, immH con) %{
which I am not in favour of as the scalar FP16 operations still expect the value to be in an FPR. I will work on this on a follow up JBS ticket. Thanks.
If this is acceptable, could I please get another round of review of the updated patch?

theRealAph · 2025-08-13T12:27:14Z

HI @theRealAph Thanks a lot for your comment. I feel it is a good idea to modify loadConH to move a constant instead of doing an ldr from the constant pool (it could probably get us some performance benefit as well). However, the scope of this ticket was to mainly fix the JTREG errors that >16B SVE machines were running into due to illegal immediates being passed to the sve_dup instruction. Would it be acceptable if I push this fix first and then create a follow up task to work on optimizing loadConH? I can create a new JBS ticket and assign it to myself and tag it here as well if that helps. Thank you!

Well, yes, but I'm proposing a simpler and better fix to the problem. Sure, if you want to do this in two steps go ahead.

Apologies, I thought I could change just the replicate backend nodes to be able to generate the mov to a scratch reg -> dup to replicate the value but missed the point that I can't still get rid of the loadConH node that loads the immediate from the constant pool.

Why not?

If we want to change loadConH to instead generate a mov of an immediate to a scratch register, then we might have to change the dst from being a vRegF to a iRegI

I don't understand.

Why not do something along these lines?

// Replicate a 16-bit half precision float
instruct replicateHF_imm8_gt128b(vReg dst, immHDupV con) %{
  predicate(Matcher::vector_length_in_bytes(n) > 16);
  match(Set dst (Replicate con));
  format %{ "replicateHF_imm8_gt128b $dst, $con\t# vector > 128 bits" %}
  ins_encode %{
    assert(UseSVE > 0, "must be sve");
   if (constant fits) {
    __ sve_dup($dst$$FloatRegister, __ H, (int)($con$$constant));
  } else
    __ mov(rscratch1, (int)($con$$constant));
    __ sve_dup($dst$$FloatRegister, __ H, rscratch1);
  }

  %}
  ins_pipe(pipe_slow);
%}

Bhavana-Kilambi · 2025-08-13T12:50:09Z

Why not do something along these lines?

I tried exactly that and it does generate a mov and a dup for illegal immediates which is why I initially said I would put up a patch soon but I realised later that the loadConH node is also being generated somewhere above (most likely because the value it loads is required for the scalar AddHF nodes). This isn't ideal? As we wanted to get rid of the load from the constant pool in the first place, if I got you right?

I don't understand.

Apologies for not being clear.
Another approach I thought was to directly modify the loadConH itself.

loadConH is defined as -

instruct loadConH(vRegF dst, immH con) %{
  match(Set dst con);
  format %{
    "ldrs $dst, [$constantaddress]\t# load from constant table: half float=$con\n\t"
  %}
  ins_encode %{
    __ ldrs(as_FloatRegister($dst$$reg), $constantaddress($con));
  %}
  ins_pipe(fp_load_constant_s);
%}

The destination register is an FPR. If we would want to modify this to generate a move to a scratch register instead (something similar to loadConI) then we would have to change the destination register to iregI which probably could be acceptable for autovectorization as we are replicating the value in a vector register anyway but for the scalar AddHF operation (the iterations that get peeled or the ones in pre/post loop which are not autovectorized), it would expect the value to be available in an FPR instead (the h register variant). So we might have to introduce a move from the GPR to an FPR. The reason why I felt I needed more time to investigate this. Please let me know your thoughts. Thanks!

theRealAph · 2025-08-13T13:01:57Z

The destination register is an FPR. If we would want to modify this to generate a move to a scratch register instead (something similar to loadConI) then we would have to change the destination register to iregI

This is the part I don't understand. Why would you have to change the destination register to iregI? I wouldn't.

instruct loadConH(vRegF dst, immH con) %{
  match(Set dst con);
  format %{
    "something"
  %}
  ins_encode %{
    __ movw(rscratch1, $con$$constant);
    __ fmovs($dst$$reg, rscratch1);
  %}

Bhavana-Kilambi · 2025-08-13T14:27:20Z

The destination register is an FPR. If we would want to modify this to generate a move to a scratch register instead (something similar to loadConI) then we would have to change the destination register to iregI

This is the part I don't understand. Why would you have to change the destination register to iregI? I wouldn't.
instruct loadConH(vRegF dst, immH con) %{
  match(Set dst con);
  format %{
    "something"
  %}
  ins_encode %{
    __ movw(rscratch1, $con$$constant);
    __ fmovs($dst$$reg, rscratch1);
  %}

Thanks. Yes I could do that (as I mentioned earlier in my comment), but I was trying to avoid the extra mov -fmov. Just that I wasn't sure if this version would be faster than an ldr. But it just occured to me that I could compare the latencies. ldr on V1 has a latency of 4 cyc and mov + fmov is 1 + 2 = 3 cyc. So it probably makes sense to go with two moves. Thanks!

Bhavana-Kilambi marked this pull request as draft August 1, 2025 09:34

openjdk bot added the rfr Pull request is ready for review label Aug 1, 2025

openjdk bot added hotspot-compiler hotspot-compiler-dev@openjdk.org and removed rfr Pull request is ready for review labels Aug 1, 2025

Bhavana-Kilambi marked this pull request as ready for review August 1, 2025 09:36

openjdk bot added the rfr Pull request is ready for review label Aug 1, 2025

theRealAph reviewed Aug 1, 2025

View reviewed changes

src/hotspot/cpu/aarch64/aarch64_vector.ad Outdated Show resolved Hide resolved

shipilev reviewed Aug 1, 2025

View reviewed changes

src/hotspot/cpu/aarch64/aarch64.ad Show resolved Hide resolved

theRealAph reviewed Aug 1, 2025

View reviewed changes

src/hotspot/cpu/aarch64/assembler_aarch64.cpp Outdated Show resolved Hide resolved

openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Aug 6, 2025

Merge branch 'master' into JDK-8361582

30d82f8

openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Aug 7, 2025

Addressed review comments

a44eccc

shipilev reviewed Aug 7, 2025

View reviewed changes

Addressed review comments and modified some comments

bcecc6e

shipilev approved these changes Aug 11, 2025

View reviewed changes

openjdk bot added the ready Pull request is ready to be integrated label Aug 11, 2025

openjdk bot removed the ready Pull request is ready to be integrated label Aug 11, 2025

theRealAph approved these changes Aug 12, 2025

View reviewed changes

openjdk bot added the ready Pull request is ready to be integrated label Aug 12, 2025

Add an extra space in one of the comments

f8dc132

openjdk bot removed the ready Pull request is ready to be integrated label Aug 13, 2025

Bhavana-Kilambi requested review from theRealAph and shipilev August 13, 2025 10:18

shipilev approved these changes Aug 13, 2025

View reviewed changes

8361582: AArch64: Some ConH values cannot be replicated with SVE #26589

Are you sure you want to change the base?

8361582: AArch64: Some ConH values cannot be replicated with SVE #26589

Conversation

Bhavana-Kilambi commented Aug 1, 2025 • edited by openjdk bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Issue

Reviewers

Reviewing

Uh oh!

bridgekeeper bot commented Aug 1, 2025

Uh oh!

openjdk bot commented Aug 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openjdk bot commented Aug 1, 2025

Uh oh!

mlbridge bot commented Aug 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Webrevs

Uh oh!

Uh oh!

shipilev left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Bhavana-Kilambi commented Aug 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

shipilev commented Aug 1, 2025

Uh oh!

openjdk bot commented Aug 6, 2025

Uh oh!

Bhavana-Kilambi commented Aug 7, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

shipilev left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

shipilev Aug 11, 2025

Choose a reason for hiding this comment

Uh oh!

Bhavana-Kilambi Aug 13, 2025

Choose a reason for hiding this comment

Uh oh!

openjdk bot commented Aug 11, 2025

Uh oh!

Bhavana-Kilambi commented Aug 11, 2025

Uh oh!

theRealAph commented Aug 11, 2025

Uh oh!

shipilev commented Aug 12, 2025

Uh oh!

Bhavana-Kilambi commented Aug 12, 2025

Uh oh!

theRealAph commented Aug 12, 2025

Uh oh!

Bhavana-Kilambi commented Aug 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Bhavana-Kilambi commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

theRealAph commented Aug 13, 2025

Uh oh!

Bhavana-Kilambi commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

theRealAph commented Aug 13, 2025

Uh oh!

Bhavana-Kilambi commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Bhavana-Kilambi commented Aug 1, 2025 •

edited by openjdk bot

Loading

openjdk bot commented Aug 1, 2025 •

edited

Loading

mlbridge bot commented Aug 1, 2025 •

edited

Loading

Bhavana-Kilambi commented Aug 1, 2025 •

edited

Loading

Bhavana-Kilambi commented Aug 12, 2025 •

edited

Loading

Bhavana-Kilambi commented Aug 13, 2025 •

edited

Loading

Bhavana-Kilambi commented Aug 13, 2025 •

edited

Loading

Bhavana-Kilambi commented Aug 13, 2025 •

edited

Loading