Solve Einstein's Equations on GPUs #4770
You can set the thread block size with ParallelFor quite easily by putting it as a template argument: amrex::ParallelFor<128>(...);. The default is 256 (set by AMREX_GPU_MAX_THREADS), while values between 32 and 1024 could be worth trying. Dynamic shared memory is not usable with ParallelFor, as only some of the threads might execute the last thread block. Instead, you would have to use amrex::launch, which gives direct access to the dynamic shared memory amount, thread block size, and grid size. However, it is not available when compiling for CPUs.

For the equations, I can only give some general advice, which is to use profiling tools such as AMReX TinyProfiler or NVIDIA Nsight Systems and to make sure the kernel has a coalesced memory access pattern.
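To make the template-argument suggestion concrete, here is a minimal sketch of an MFIter loop that launches its kernel with 128 threads per block instead of the default 256. The function and variable names (compute_rhs, rhs, state) and the kernel body are illustrative placeholders, not code from this project.

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_MFIter.H>

// Hypothetical example: run an RHS kernel with 128 threads per block by
// giving ParallelFor the block size as a template argument.
// "rhs" and "state" are placeholder MultiFabs, not from the thread.
void compute_rhs (amrex::MultiFab& rhs, amrex::MultiFab const& state)
{
    for (amrex::MFIter mfi(rhs); mfi.isValid(); ++mfi) {
        const amrex::Box& bx = mfi.validbox();
        auto const& s = state.const_array(mfi);
        auto const& r = rhs.array(mfi);
        // 128 threads per block; values between 32 and 1024 are worth sweeping.
        amrex::ParallelFor<128>(bx,
            [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
            {
                // ... evaluate the evolution equations at cell (i,j,k) ...
                r(i,j,k) = s(i,j,k); // placeholder body
            });
    }
}
```

Whether a smaller block size actually helps depends on per-thread register usage, so it is worth comparing a few values with TinyProfiler or Nsight Compute.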
You might want to look at https://github.com/AMReX-Astro/STvAR. The code hasn't been updated in two years, so I am not sure if it is still used.
Hi all, thanks for starting the discussion! There are already some nice suggestions here we could try. I have found similar performance bottlenecks in an Einstein Toolkit/CarpetX code we have been working on. In the baseline version of the code, profiling with Nsight Systems on DeltaAI's GH200 nodes showed high register usage during the Ricci tensor and finite difference calculations.

I have since tried refactoring the code and made some of the optimizations @AlexanderSinn suggested. I also found that splitting the Ricci tensor calculation into a separate kernel helped reduce the overall runtime (a rough sketch of this split is given below), and it might be worth doing the same for other parts of the RHS calculation. However, even with these optimizations, register usage remains high and warp occupancy is low (12.5%) because of the complexity of the tensor contractions. This was measured without changing AMREX_GPU_MAX_THREADS, so it would be interesting to see how the occupancy improves with a smaller thread block size.

The other major issue the profiler identified is long scoreboard warp stalls from global memory accesses, which likely occur during the finite difference calculations of the evolution variables. These are 3D quantities, and it's not obvious to me how to optimize the memory layout and improve coalescing when loading the 1D and 2D stencils. I was also considering storing some of the frequently used tensors, like the spatial metric, in shared memory.

Finally, I found an interesting AMD blog post suggesting the use of "nontemporal stores" to reduce L2 cache pollution when writing the final results to global memory. Is there a similar mechanism in CUDA or other programming models? The uncoalesced local stores for the final RHS results were also identified by the profiler as a hotspot for optimization.
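For reference, here is a rough sketch of the kind of kernel split mentioned above: one ParallelFor writes the Ricci tensor into a temporary MultiFab, and a second, lighter kernel combines it into the RHS, so each kernel keeps fewer values live in registers. All names (split_rhs, metric, ricci, rhs) and the 6-component layout are assumptions for illustration, not the actual CarpetX code.

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_MFIter.H>

// Illustrative sketch of splitting the Ricci computation into its own kernel.
// "metric", "ricci", "rhs", and the component count are assumptions.
void split_rhs (amrex::MultiFab& rhs, amrex::MultiFab const& metric)
{
    // Temporary storage for the 6 independent components of the symmetric
    // Ricci tensor (assumed layout), with no ghost cells.
    amrex::MultiFab ricci(metric.boxArray(), metric.DistributionMap(), 6, 0);

    // Kernel 1: finite differences and contractions that produce R_ij.
    for (amrex::MFIter mfi(ricci); mfi.isValid(); ++mfi) {
        const amrex::Box& bx = mfi.validbox();
        auto const& g  = metric.const_array(mfi);
        auto const& Rc = ricci.array(mfi);
        amrex::ParallelFor(bx, [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            // ... stencils on g and tensor contractions go here ...
            for (int n = 0; n < 6; ++n) { Rc(i,j,k,n) = g(i,j,k); } // placeholder
        });
    }

    // Kernel 2: assemble the evolution RHS from the precomputed Ricci tensor.
    for (amrex::MFIter mfi(rhs); mfi.isValid(); ++mfi) {
        const amrex::Box& bx = mfi.validbox();
        auto const& Rc = ricci.const_array(mfi);
        auto const& r  = rhs.array(mfi);
        amrex::ParallelFor(bx, [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            // ... combine Rc and the other terms into the RHS ...
            r(i,j,k) = Rc(i,j,k,0); // placeholder
        });
    }
}
```

The trade-off is the extra global-memory traffic for the temporary ricci MultiFab, so whether the split wins depends on how much occupancy improves; profiling both variants, as discussed above, is the way to decide.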
Thank you all again for the excellent suggestions. Based on the discussion, our immediate focus will be on reducing register usage. We will monitor the initial results and evaluate next steps from there.
amrex::ParallelFor<128>(...);. The default is 256 (set by AMREX_GPU_MAX_THREADS), while values between 32 and 1024 could be worth trying. Dynamic shared memory is not usable with ParallelFor, as only some of the threads might execute the last threadblock. Instead, you would have to useamrex::launchwhich gives direct access to dynamic shared memory amount, thread block size, and grid size. However, it is not available when compiling for CPUs.For the equations, I can only give some general advice, which is to use profiling tools such as AMReX TinyProfiler or NVIDIA Nsight Systems and to …