Solve Einstein's Equations on GPUs #4770
You can set the thread block size with ParallelFor quite easily by putting it as a template argument: amrex::ParallelFor<128>(...);. The default is 256 (set by AMREX_GPU_MAX_THREADS), while values between 32 and 1024 could be worth trying. Dynamic shared memory is not usable with ParallelFor, as only some of the threads might execute the last thread block. Instead, you would have to use amrex::launch, which gives direct access to the dynamic shared memory amount, thread block size, and grid size. However, it is not available when compiling for CPUs.

For the equations, I can only give some general advice, which is to use profiling tools such as AMReX TinyProfiler or NVIDIA Nsight Systems and to make sure the kernel has a coalesced memory access pattern.
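To make the template-argument suggestion concrete, here is a minimal sketch of an MFIter loop that launches its kernel with 128 threads per block instead of the default 256. The function and variable names (compute_rhs, rhs, state) and the kernel body are illustrative placeholders, not code from this project.

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_MFIter.H>

// Hypothetical example: run an RHS kernel with 128 threads per block by
// giving ParallelFor the block size as a template argument.
// "rhs" and "state" are placeholder MultiFabs, not from the thread.
void compute_rhs (amrex::MultiFab& rhs, amrex::MultiFab const& state)
{
    for (amrex::MFIter mfi(rhs); mfi.isValid(); ++mfi) {
        const amrex::Box& bx = mfi.validbox();
        auto const& s = state.const_array(mfi);
        auto const& r = rhs.array(mfi);
        // 128 threads per block; values between 32 and 1024 are worth sweeping.
        amrex::ParallelFor<128>(bx,
            [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
            {
                // ... evaluate the evolution equations at cell (i,j,k) ...
                r(i,j,k) = s(i,j,k); // placeholder body
            });
    }
}
```

Whether a smaller block size actually helps depends on per-thread register usage, so it is worth comparing a few values with TinyProfiler or Nsight Compute.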
You might want to look at https://github.com/AMReX-Astro/STvAR. The code hasn't been updated in two years, so I am not sure if it is still used.
Hi all, thanks for starting the discussion! There are already some nice suggestions here we could try. I have found similar performance bottlenecks in an Einstein Toolkit/CarpetX code we have been working on. In the baseline version of the code, profiling with Nsight Systems on DeltaAI's GH200 nodes showed high register usage during the Ricci tensor and finite difference calculations.

I have since tried refactoring the code and made some of the optimizations @AlexanderSinn suggested. I also found that splitting the Ricci tensor calculation into a separate kernel helped reduce the overall runtime (a rough sketch of this split is given below), and it might be worth doing the same for other parts of the RHS calculation. However, even with these optimizations, register usage remains high and warp occupancy is low (12.5%) because of the complexity of the tensor contractions. This was measured without changing AMREX_GPU_MAX_THREADS, so it would be interesting to see how the occupancy improves with a smaller thread block size.

The other major issue the profiler identified is long scoreboard warp stalls from global memory accesses, which likely occur during the finite difference calculations of the evolution variables. These are 3D quantities, and it's not obvious to me how to optimize the memory layout and improve coalescing when loading the 1D and 2D stencils. I was also considering storing some of the frequently used tensors, like the spatial metric, in shared memory.

Finally, I found an interesting AMD blog post suggesting the use of "nontemporal stores" to reduce L2 cache pollution when writing the final results to global memory. Is there a similar mechanism in CUDA or other programming models? The uncoalesced local stores for the final RHS results were also identified by the profiler as a hotspot for optimization.
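For reference, here is a rough sketch of the kind of kernel split mentioned above: one ParallelFor writes the Ricci tensor into a temporary MultiFab, and a second, lighter kernel combines it into the RHS, so each kernel keeps fewer values live in registers. All names (split_rhs, metric, ricci, rhs) and the 6-component layout are assumptions for illustration, not the actual CarpetX code.

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_MFIter.H>

// Illustrative sketch of splitting the Ricci computation into its own kernel.
// "metric", "ricci", "rhs", and the component count are assumptions.
void split_rhs (amrex::MultiFab& rhs, amrex::MultiFab const& metric)
{
    // Temporary storage for the 6 independent components of the symmetric
    // Ricci tensor (assumed layout), with no ghost cells.
    amrex::MultiFab ricci(metric.boxArray(), metric.DistributionMap(), 6, 0);

    // Kernel 1: finite differences and contractions that produce R_ij.
    for (amrex::MFIter mfi(ricci); mfi.isValid(); ++mfi) {
        const amrex::Box& bx = mfi.validbox();
        auto const& g  = metric.const_array(mfi);
        auto const& Rc = ricci.array(mfi);
        amrex::ParallelFor(bx, [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            // ... stencils on g and tensor contractions go here ...
            for (int n = 0; n < 6; ++n) { Rc(i,j,k,n) = g(i,j,k); } // placeholder
        });
    }

    // Kernel 2: assemble the evolution RHS from the precomputed Ricci tensor.
    for (amrex::MFIter mfi(rhs); mfi.isValid(); ++mfi) {
        const amrex::Box& bx = mfi.validbox();
        auto const& Rc = ricci.const_array(mfi);
        auto const& r  = rhs.array(mfi);
        amrex::ParallelFor(bx, [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            // ... combine Rc and the other terms into the RHS ...
            r(i,j,k) = Rc(i,j,k,0); // placeholder
        });
    }
}
```

The trade-off is the extra global-memory traffic for the temporary ricci MultiFab, so whether the split wins depends on how much occupancy improves; profiling both variants, as discussed above, is the way to decide.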
Thank you all again for the excellent suggestions. Based on the discussion, our immediate focus will be on reducing register usage. We will monitor the initial results and evaluate next steps from there.
amrex::ParallelFor<128>(...);. The default is 256 (set by AMREX_GPU_MAX_THREADS), while values between 32 and 1024 could be worth trying. Dynamic shared memory is not usable with ParallelFor, as only some of the threads might execute the last threadblock. Instead, you would have to useamrex::launchwhich gives direct access to dynamic shared memory amount, thread block size, and grid size. However, it is not available when compiling for CPUs.For the equations, I can only give some general advice, which is to use profiling tools such as AMReX TinyProfiler or NVIDIA Nsight Systems and to …