Conversation

@efaulhaber (Member) commented Oct 30, 2025

This PR contains two small performance optimizations. The first is an algebraic simplification of the derivatives of the Wendland kernels and the normalization factors:

main:

```julia
julia> r = rand(SVector{3, Float64}); d = norm(r); h = 1.0; kernel = WendlandC2Kernel{3}();

julia> @b TrixiParticles.kernel_grad($kernel, $r, $d, $h) seconds=1
3.409 ns

julia> r = rand(SVector{2, Float64}); d = norm(r); h = 1.0; kernel = WendlandC2Kernel{2}();

julia> @b TrixiParticles.kernel_grad($kernel, $r, $d, $h) seconds=1
2.650 ns
```

With `result` simplified algebraically:

```julia
julia> r = rand(SVector{3, Float64}); d = norm(r); h = 1.0; kernel = WendlandC2Kernel{3}();

julia> @b TrixiParticles.kernel_grad($kernel, $r, $d, $h) seconds=1
3.006 ns

julia> r = rand(SVector{2, Float64}); d = norm(r); h = 1.0; kernel = WendlandC2Kernel{2}();

julia> @b TrixiParticles.kernel_grad($kernel, $r, $d, $h) seconds=1
2.573 ns
```

With `normalization_factor` simplified to `7 / (pi * h^2 * 4)`:

```julia
julia> r = rand(SVector{3, Float64}); d = norm(r); h = 1.0; kernel = WendlandC2Kernel{3}();

julia> @b TrixiParticles.kernel_grad($kernel, $r, $d, $h) seconds=1
2.843 ns

julia> r = rand(SVector{2, Float64}); d = norm(r); h = 1.0; kernel = WendlandC2Kernel{2}();

julia> @b TrixiParticles.kernel_grad($kernel, $r, $d, $h) seconds=1
2.557 ns
```
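As a minimal illustration of this kind of rewrite (the chained-division form below is an assumption; the exact original expression in `smoothing_kernels.jl` may differ), consolidating divisions trades several floating-point divisions for cheaper multiplications while leaving the value unchanged up to rounding:

```julia
# Hypothetical before/after for the 2D Wendland C2 normalization factor.
# The "chained" form is assumed for illustration, not quoted from the PR diff.
norm_chained(h) = 7 / pi / h^2 / 4     # three divisions
norm_single(h)  = 7 / (pi * h^2 * 4)   # one division, two multiplications
```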

Interestingly, this difference is not measurable when benchmarking only `kernel_deriv`.
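The algebraic simplification can be sketched as follows, assuming the standard Wendland C2 shape function `W(q) ∝ (1 - q)^4 * (4q + 1)` on `0 ≤ q ≤ 1` (the exact scaling and support radius used in TrixiParticles may differ):

```julia
# Product-rule form of dW/dq (up to the normalization factor):
deriv_product_rule(q) = -4 * (1 - q)^3 * (4q + 1) + (1 - q)^4 * 4

# Algebraically reduced: factor out 4(1 - q)^3, and the bracket collapses
# to -(4q + 1) + (1 - q) = -5q, giving fewer floating-point operations:
deriv_simplified(q) = -20q * (1 - q)^3
```

Both forms agree numerically; the payoff only appears once the derivative is inlined into `kernel_grad`, consistent with the benchmarks above.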

The second is a small optimization of the computation of `v_max` (apparently only relevant on the CPU).

CPU:

```julia
julia> A = rand(3, 10_000_000);

julia> @b maximum(x -> sqrt(dot(x, x)), reinterpret(reshape, SVector{3, eltype($A)}, view($A, 1:3, :))) seconds=1
6.966 ms

julia> @b sqrt(maximum(x -> dot(x, x), reinterpret(reshape, SVector{3, eltype($A)}, view($A, 1:3, :)))) seconds=1
6.574 ms
```

GPU (Metal):

```julia
julia> A = Metal.rand(3, 10_000_000);

julia> @b maximum(x -> sqrt(dot(x, x)), reinterpret(reshape, SVector{3, eltype($A)}, view($A, 1:3, :))) seconds=1
982.000 μs (664 allocs: 15.070 KiB)

julia> @b sqrt(maximum(x -> dot(x, x), reinterpret(reshape, SVector{3, eltype($A)}, view($A, 1:3, :)))) seconds=1
981.459 μs (664 allocs: 15.070 KiB)
```
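The rewrite is valid because `sqrt` is monotone: the maximum of the norms equals the square root of the maximum squared magnitude, so `sqrt` runs once instead of once per particle. A self-contained sketch of the equivalence (variable and function names here are illustrative, not the ones in `shifting_techniques.jl`):

```julia
using LinearAlgebra

# Illustrative velocity vectors (plain Vectors instead of SVectors for brevity)
velocities = [rand(3) for _ in 1:1000]

# One sqrt per element:
v_max_naive(vs) = maximum(x -> sqrt(dot(x, x)), vs)

# sqrt is monotone, so it can be pulled out of the maximum and applied once:
v_max_fast(vs) = sqrt(maximum(x -> dot(x, x), vs))
```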

@efaulhaber efaulhaber self-assigned this Oct 30, 2025
@efaulhaber efaulhaber requested a review from Copilot October 31, 2025 17:25
Copilot AI left a comment

Pull Request Overview

This PR refactors smoothing kernel normalization factors and related computations to improve GPU performance and code clarity. The changes focus on simplifying arithmetic expressions to reduce instructions and improve readability.

Key Changes:

  • Simplified normalization factor expressions by consolidating divisions (e.g., `a / b / c` → `a / (b * c)`)
  • Optimized the `v_max` computation in particle shifting to compute the squared magnitude first, then take the square root
  • Simplified kernel derivative formulas for the Wendland kernels by algebraically reducing expressions

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/schemes/fluid/shifting_techniques.jl | Optimized the `v_max` calculation to compute the maximum of squared velocities before taking the square root |
| src/general/smoothing_kernels.jl | Simplified normalization factors and kernel derivatives across multiple kernel types (Schoenberg, Wendland, Poly6) |


@efaulhaber efaulhaber requested review from LasNikas, Copilot and svchb and removed request for Copilot October 31, 2025 17:29
@efaulhaber efaulhaber marked this pull request as ready for review October 31, 2025 17:29
@efaulhaber (Member, Author) commented:

/run-gpu-tests

LasNikas previously approved these changes Nov 1, 2025