Skip to content

8361582: AArch64: Some ConH values cannot be replicated with SVE #26589

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Bhavana-Kilambi
Copy link
Contributor

@Bhavana-Kilambi Bhavana-Kilambi commented Aug 1, 2025

After this commit - a49ecb2, the JTREG test -
test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java fails for some of the tests which contain constant values such as -

public void vectorAddConstInputFloat16() {
         for (int i = 0; i < LEN; ++i) {
             output[i] = float16ToRawShortBits(add(shortBitsToFloat16(input1[i]), FP16_CONST));
         }
     }

<The full failure log is present in the JBS ticket, thus not reproducing it here>

The current code in the JDK results in the generation of sve_dup instruction for every 16-bit immediate while the acceptable range is [-128, 127] for 8-bit immediates and [-127 << 8, 128 << 8] with a multiple of 256 for 16-bit signed immediates.

This patch allows the generation of sve_dup instruction for only those 16-bit values which are within the limits as specified above and for the values which are out of range, the immediate half float value is loaded from the constant pool into a register ("loadConH" mach node) which is then replicated or broadcasted to an SVE register ("replicateHF" mach node).

Both the tests - test/hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java and test/hotspot/jtreg/compiler/c2/aarch64/TestFloat16Replicate.java pass on 256-bit SVE machine. JTREG tests - hotspot (hotspot_all), langtools (tier1) and jdk(tier 1-3) pass on the same machine.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8361582: AArch64: Some ConH values cannot be replicated with SVE (Bug - P3)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26589/head:pull/26589
$ git checkout pull/26589

Update a local copy of the PR:
$ git checkout pull/26589
$ git pull https://git.openjdk.org/jdk.git pull/26589/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26589

View PR using the GUI difftool:
$ git pr show -t 26589

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26589.diff

Using Webrev

Link to Webrev Comment

After this commit -
openjdk@a49ecb2,
the JTREG test -
hotspot/jtreg/compiler/vectorization/TestFloat16VectorOperations.java
fails for some of the tests which contain constant values such as -

public void vectorAddConstInputFloat16() {
         for (int i = 0; i < LEN; ++i) {
             output[i] = float16ToRawShortBits(add(shortBitsToFloat16(input1[i]), FP16_CONST));
         }
     }
<The full failure log is present in the JBS ticket, thus not
reproducing it here>

The current code in the JDK results in the generation of sve_dup
instruction for every 16-bit immediate while the acceptable range is
[-128, 127] for 8-bit immediates and [-127 << 8, 128 << 8] with a
multiple of 256 for 16-bit signed immediates.

This patch allows the generation of sve_dup instruction for only those
16-bit values which are within the limits as specified above and for the
values which are out of range, the immediate half float value is loaded
from the constant pool into a register ("loadConH" mach node) which is
then replicated or broadcasted to an SVE register ("replicateHF" mach
node).
@bridgekeeper
Copy link

bridgekeeper bot commented Aug 1, 2025

👋 Welcome back bkilambi! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Aug 1, 2025

@Bhavana-Kilambi This change is no longer ready for integration - check the PR body for details.

@Bhavana-Kilambi Bhavana-Kilambi marked this pull request as draft August 1, 2025 09:34
@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 1, 2025
@openjdk
Copy link

openjdk bot commented Aug 1, 2025

@Bhavana-Kilambi The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added hotspot-compiler hotspot-compiler-dev@openjdk.org and removed rfr Pull request is ready for review labels Aug 1, 2025
@Bhavana-Kilambi Bhavana-Kilambi marked this pull request as ready for review August 1, 2025 09:36
@openjdk openjdk bot added the rfr Pull request is ready for review label Aug 1, 2025
@mlbridge
Copy link

mlbridge bot commented Aug 1, 2025

Webrevs

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for taking care of this.

I am still a bit confused what matches Replicate with immH that does not fit immH8_shift8 when Matcher::vector_length_in_bytes(n) > 16?

@Bhavana-Kilambi
Copy link
Contributor Author

Bhavana-Kilambi commented Aug 1, 2025

I am still a bit confused what matches Replicate with immH that does not fit immH8_shift8 when Matcher::vector_length_in_bytes(n) > 16?

Hi, thanks for your review. If the immediate value does not fit immH8_shift8 for Matcher::vector_length_in_bytes(n) > 16 , the compiler would generate loadConH [1] -> replicateHF [2] backend nodes instead. The constant would be loaded from the constant pool instead and then broadcasted/replicated to every lane of an SVE register.

[1]

instruct loadConH(vRegF dst, immH con) %{

[2]

instruct replicateHF(vReg dst, vRegF src) %{

@shipilev
Copy link
Member

shipilev commented Aug 1, 2025

If the immediate value does not fit immH8_shift8 for Matcher::vector_length_in_bytes(n) > 16 , the compiler would generate loadConH [1] -> replicateHF [2] backend nodes instead.

Ah OK, just checking. I ran this patch on the machine where I have originally found the issue, and it seems to work.

@openjdk
Copy link

openjdk bot commented Aug 6, 2025

@Bhavana-Kilambi this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8361582
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Aug 6, 2025
@openjdk openjdk bot removed the merge-conflict Pull request has merge conflict with target branch label Aug 7, 2025
@Bhavana-Kilambi
Copy link
Contributor Author

Hi @shipilev @theRealAph Can I please ask for another round of review? I have addressed the review comments in the latest patch. Thanks a lot!

Copy link
Member

@shipilev shipilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me, thanks. Some nits in the test remain, but they are non-blocking.

/reviewers 2

Comment on lines +44 to +45
private static short[] input;
private static short[] output;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might give things even more chance to vectorize? Not sure, feel free to ignore.

Suggested change
private static short[] input;
private static short[] output;
private static final short[] INPUTE;
private static final short[] OUTPUT;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope it's ok to not add these changes to the code. The loops are getting vectorized fine and the tests do pass on aarch64 and x86. I will consider this if there's any issue with auto vectorization in the futurre. Thanks

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 11, 2025
@openjdk
Copy link

openjdk bot commented Aug 11, 2025

@shipilev
The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Aug 11, 2025
@Bhavana-Kilambi
Copy link
Contributor Author

Thanks for your review comments and approval @shipilev. I will address your review comments in the next patchset along with any other comments.

@theRealAph Would you be able to take another look at the updated patch please? Thank you in advance!

@theRealAph
Copy link
Contributor

For loadConH, LLVM and GCC use

        mov     wscratch, #const
        dup     v0.4h, wscratch

We should investigate that.

As far as I can see, LLVM and GCC do this for all vector immediates that don't need more than 2 movz/movk instructions.

@shipilev
Copy link
Member

Can we/should we plug this problem in encoding first, without going too much into the optimizing the non-broken case? As it stands now, real FP16-using code can run into matcher errors in JDK 25. I would like to fix that first.

@Bhavana-Kilambi
Copy link
Contributor Author

HI @theRealAph Thanks a lot for your comment. I feel it is a good idea to modify loadConH to move a constant instead of doing an ldr from the constant pool (it could probably get us some performance benefit as well). However, the scope of this ticket was to mainly fix the JTREG errors that >16B SVE machines were running into due to illegal immediates being passed to the sve_dup instruction. Would it be acceptable if I push this fix first and then create a follow up task to work on optimizing loadConH? I can create a new JBS ticket and assign it to myself and tag it here as well if that helps. Thank you!

@theRealAph
Copy link
Contributor

HI @theRealAph Thanks a lot for your comment. I feel it is a good idea to modify loadConH to move a constant instead of doing an ldr from the constant pool (it could probably get us some performance benefit as well). However, the scope of this ticket was to mainly fix the JTREG errors that >16B SVE machines were running into due to illegal immediates being passed to the sve_dup instruction. Would it be acceptable if I push this fix first and then create a follow up task to work on optimizing loadConH? I can create a new JBS ticket and assign it to myself and tag it here as well if that helps. Thank you!

Well, yes, but I'm proposing a simpler and better fix to the problem. Sure, if you want to do this in two steps go ahead.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Aug 12, 2025
@Bhavana-Kilambi
Copy link
Contributor Author

Bhavana-Kilambi commented Aug 12, 2025

For loadConH, LLVM and GCC use

        mov     wscratch, #const
        dup     v0.4h, wscratch

We should investigate that.

As far as I can see, LLVM and GCC do this for all vector immediates that don't need more than 2 movz/movk instructions.

Just a quick look at what can be done to improve the codegen -
The code shown above is Neon and this is similar to what we generate in hotspot as well (for Neon) -

0x0000e8a5e482ea80:   mov     w8, #0x40b                      // #1035
0x0000e8a5e482ea84:   dup     v18.8h, w8

Only on SVE machines with >16B vectors, we load from the constant table -

0x0000e9cdc902e370:   ldr     s16, 0x0000e9cdc902e280     ;   {section_word}
0x0000e9cdc902e440:   mov     z17.h, p7/m, h16

We could generate a similar mov and a dup from a GPR (instead of immediate) even in the SVE case. I'll update the patch as soon as I can. Thanks a lot for your suggestion

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Aug 13, 2025
@Bhavana-Kilambi
Copy link
Contributor Author

Bhavana-Kilambi commented Aug 13, 2025

HI @theRealAph Thanks a lot for your comment. I feel it is a good idea to modify loadConH to move a constant instead of doing an ldr from the constant pool (it could probably get us some performance benefit as well). However, the scope of this ticket was to mainly fix the JTREG errors that >16B SVE machines were running into due to illegal immediates being passed to the sve_dup instruction. Would it be acceptable if I push this fix first and then create a follow up task to work on optimizing loadConH? I can create a new JBS ticket and assign it to myself and tag it here as well if that helps. Thank you!

Well, yes, but I'm proposing a simpler and better fix to the problem. Sure, if you want to do this in two steps go ahead.

Apologies, I thought I could change just the replicate backend nodes to be able to generate the mov to a scratch reg -> dup to replicate the value but missed the point that I can't still get rid of the loadConH node that loads the immediate from the constant pool. If we want to change loadConH to instead generate a mov of an immediate to a scratch register, then we might have to change the dst from being a vRegF to a iRegI -
instruct loadConH(vRegF dst, immH con) %{
which I am not in favour of as the scalar FP16 operations still expect the value to be in an FPR. I will work on this on a follow up JBS ticket. Thanks.
If this is acceptable, could I please get another round of review of the updated patch?

@theRealAph
Copy link
Contributor

HI @theRealAph Thanks a lot for your comment. I feel it is a good idea to modify loadConH to move a constant instead of doing an ldr from the constant pool (it could probably get us some performance benefit as well). However, the scope of this ticket was to mainly fix the JTREG errors that >16B SVE machines were running into due to illegal immediates being passed to the sve_dup instruction. Would it be acceptable if I push this fix first and then create a follow up task to work on optimizing loadConH? I can create a new JBS ticket and assign it to myself and tag it here as well if that helps. Thank you!

Well, yes, but I'm proposing a simpler and better fix to the problem. Sure, if you want to do this in two steps go ahead.

Apologies, I thought I could change just the replicate backend nodes to be able to generate the mov to a scratch reg -> dup to replicate the value but missed the point that I can't still get rid of the loadConH node that loads the immediate from the constant pool.

Why not?

If we want to change loadConH to instead generate a mov of an immediate to a scratch register, then we might have to change the dst from being a vRegF to a iRegI

I don't understand.

Why not do something along these lines?

// Replicate a 16-bit half precision float
instruct replicateHF_imm8_gt128b(vReg dst, immHDupV con) %{
  predicate(Matcher::vector_length_in_bytes(n) > 16);
  match(Set dst (Replicate con));
  format %{ "replicateHF_imm8_gt128b $dst, $con\t# vector > 128 bits" %}
  ins_encode %{
    assert(UseSVE > 0, "must be sve");
   if (constant fits) {
    __ sve_dup($dst$$FloatRegister, __ H, (int)($con$$constant));
  } else
    __ mov(rscratch1, (int)($con$$constant));
    __ sve_dup($dst$$FloatRegister, __ H, rscratch1);
  }

  %}
  ins_pipe(pipe_slow);
%}

@Bhavana-Kilambi
Copy link
Contributor Author

Bhavana-Kilambi commented Aug 13, 2025

Why not do something along these lines?

I tried exactly that and it does generate a mov and a dup for illegal immediates which is why I initially said I would put up a patch soon but I realised later that the loadConH node is also being generated somewhere above (most likely because the value it loads is required for the scalar AddHF nodes). This isn't ideal? As we wanted to get rid of the load from the constant pool in the first place, if I got you right?

I don't understand.

Apologies for not being clear.
Another approach I thought was to directly modify the loadConH itself.

loadConH is defined as -

instruct loadConH(vRegF dst, immH con) %{
  match(Set dst con);
  format %{
    "ldrs $dst, [$constantaddress]\t# load from constant table: half float=$con\n\t"
  %}
  ins_encode %{
    __ ldrs(as_FloatRegister($dst$$reg), $constantaddress($con));
  %}
  ins_pipe(fp_load_constant_s);
%}

The destination register is an FPR. If we would want to modify this to generate a move to a scratch register instead (something similar to loadConI) then we would have to change the destination register to iregI which probably could be acceptable for autovectorization as we are replicating the value in a vector register anyway but for the scalar AddHF operation (the iterations that get peeled or the ones in pre/post loop which are not autovectorized), it would expect the value to be available in an FPR instead (the h register variant). So we might have to introduce a move from the GPR to an FPR. The reason why I felt I needed more time to investigate this. Please let me know your thoughts. Thanks!

@theRealAph
Copy link
Contributor

The destination register is an FPR. If we would want to modify this to generate a move to a scratch register instead (something similar to loadConI) then we would have to change the destination register to iregI

This is the part I don't understand. Why would you have to change the destination register to iregI? I wouldn't.

instruct loadConH(vRegF dst, immH con) %{
  match(Set dst con);
  format %{
    "something"
  %}
  ins_encode %{
    __ movw(rscratch1, $con$$constant);
    __ fmovs($dst$$reg, rscratch1);
  %}

@Bhavana-Kilambi
Copy link
Contributor Author

Bhavana-Kilambi commented Aug 13, 2025

The destination register is an FPR. If we would want to modify this to generate a move to a scratch register instead (something similar to loadConI) then we would have to change the destination register to iregI

This is the part I don't understand. Why would you have to change the destination register to iregI? I wouldn't.

instruct loadConH(vRegF dst, immH con) %{
  match(Set dst con);
  format %{
    "something"
  %}
  ins_encode %{
    __ movw(rscratch1, $con$$constant);
    __ fmovs($dst$$reg, rscratch1);
  %}

Thanks. Yes I could do that (as I mentioned earlier in my comment), but I was trying to avoid the extra mov -fmov. Just that I wasn't sure if this version would be faster than an ldr. But it just occured to me that I could compare the latencies. ldr on V1 has a latency of 4 cyc and mov + fmov is 1 + 2 = 3 cyc. So it probably makes sense to go with two moves. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler hotspot-compiler-dev@openjdk.org rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

3 participants