Commit f97754c

Refactor CudaOffloadFactorization into LU and QR variants (#709)
* Refactor CudaOffloadFactorization into LU and QR variants

  - Created CudaOffloadLUFactorization using lu factorization
  - Created CudaOffloadQRFactorization using qr factorization
  - Deprecated CudaOffloadFactorization to use QR (with deprecation warning)
  - Updated CUDA extension to implement both algorithms
  - Updated LinearSolveAutotune to use LU version for better performance
  - Updated tests to include both new algorithms
  - Exported both new algorithms from LinearSolve module

* Fix syntax issues in CudaOffload factorizations

  - Fixed namespace issues (removed LinearSolve. prefix)
  - Fixed constructor syntax (new() instead of new{}())
  - Added debug comment
  - Updated exports to separate lines

  Note: The new types are defined correctly, but there appears to be a precompilation caching issue preventing them from being recognized immediately. A clean rebuild may be required.

* Update ext/LinearSolveCUDAExt.jl

* Update src/extension_algs.jl

* Update src/extension_algs.jl

* Update test/gpu/cuda.jl

* Update documentation for CudaOffload factorization changes

  - Updated GPU tutorial to show new CudaOffloadLUFactorization/QRFactorization
  - Updated solver documentation to explain both algorithms
  - Added deprecation warning in documentation
  - Updated release notes with upcoming changes
  - Created example demonstrating usage of both new algorithms
  - Explained when to use each algorithm (LU for performance, QR for stability)

* Update docs/src/solvers/solvers.md

* Delete examples/cuda_offload_example.jl

* Update ext/LinearSolveCUDAExt.jl

* Update ext/LinearSolveCUDAExt.jl
1 parent: 046b789 · commit: f97754c

File tree: 9 files changed, +126 -14 lines changed

docs/src/release_notes.md
Lines changed: 7 additions & 0 deletions

@@ -1,5 +1,12 @@
 # Release Notes
 
+## Upcoming Changes
+
+  - `CudaOffloadFactorization` has been split into two algorithms:
+      - `CudaOffloadLUFactorization` - Uses LU factorization for better performance
+      - `CudaOffloadQRFactorization` - Uses QR factorization for better numerical stability
+  - `CudaOffloadFactorization` is now deprecated and will show a warning suggesting to use one of the new algorithms
+
 ## v2.0
 
   - `LinearCache` changed from immutable to mutable. With this, the out of place interfaces like
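As context for this entry, a minimal migration sketch (assuming a CUDA-capable NVIDIA GPU; the matrix size and element type are illustrative):

```julia
using LinearSolve, CUDA

A = rand(Float32, 1000, 1000)
b = rand(Float32, 1000)
prob = LinearProblem(A, b)

# Deprecated spelling: warns and now behaves like the QR variant
# sol = solve(prob, CudaOffloadFactorization())

# New spellings: pick the variant explicitly
sol = solve(prob, CudaOffloadLUFactorization())  # favors performance
sol = solve(prob, CudaOffloadQRFactorization())  # favors numerical stability
```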

docs/src/solvers/solvers.md
Lines changed: 9 additions & 6 deletions

@@ -23,12 +23,14 @@ use your base system BLAS which can be fast or slow depending on the hardware co
 
 For very large dense factorizations, offloading to the GPU can be preferred. Metal.jl can be used
 on Mac hardware to offload, and has a cutoff point of being faster at around size 20,000 x 20,000
-matrices (and only supports Float32). `CudaOffloadFactorization` can be more efficient at a
-much smaller cutoff, possibly around size 1,000 x 1,000 matrices, though this is highly dependent
-on the chosen GPU hardware. `CudaOffloadFactorization` requires a CUDA-compatible NVIDIA GPU.
+matrices (and only supports Float32). `CudaOffloadLUFactorization` and `CudaOffloadQRFactorization`
+can be more efficient at a much smaller cutoff, possibly around size 1,000 x 1,000 matrices, though
+this is highly dependent on the chosen GPU hardware. These algorithms require a CUDA-compatible NVIDIA GPU.
 CUDA offload supports Float64 but most consumer GPU hardware will be much faster on Float32
 (many are >32x faster for Float32 operations than Float64 operations) and thus for most hardware
-this is only recommended for Float32 matrices.
+this is only recommended for Float32 matrices. Choose `CudaOffloadLUFactorization` for better
+performance on well-conditioned problems, or `CudaOffloadQRFactorization` for better numerical
+stability on ill-conditioned problems.
 
 !!! note

@@ -232,10 +234,11 @@ The following are non-standard GPU factorization routines.
 
 !!! note
 
-    Using this solver requires adding the package CUDA.jl, i.e. `using CUDA`
+    Using these solvers requires adding the package CUDA.jl, i.e. `using CUDA`
 
 ```@docs
-CudaOffloadFactorization
+CudaOffloadLUFactorization
+CudaOffloadQRFactorization
 ```
 
 ### AMDGPU.jl
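To make the LU-vs-QR guidance concrete, here is a sketch of choosing between the two (the `1.0f4` threshold is an arbitrary illustrative value, and computing `cond(A)` is itself as expensive as a factorization, so real code should rely on knowledge of the problem class instead):

```julia
using LinearSolve, LinearAlgebra, CUDA

A = rand(Float32, 2000, 2000)
b = rand(Float32, 2000)

# Illustrative only: in practice you rarely compute cond(A) just to pick a solver.
alg = cond(A) < 1.0f4 ? CudaOffloadLUFactorization() : CudaOffloadQRFactorization()
sol = solve(LinearProblem(A, b), alg)
```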

docs/src/tutorials/gpu.md
Lines changed: 8 additions & 0 deletions

@@ -41,6 +41,8 @@ This computation can be moved to the GPU by the following:
 ```julia
 using CUDA # Add the GPU library for NVIDIA GPUs
 sol = LS.solve(prob, LS.CudaOffloadLUFactorization())
+# or
+sol = LS.solve(prob, LS.CudaOffloadQRFactorization())
 sol.u
 ```

@@ -54,6 +56,12 @@ sol = LS.solve(prob, LS.AMDGPUOffloadQRFactorization()) # QR factorization
 sol.u
 ```
 
+LinearSolve.jl provides multiple GPU offloading algorithms:
+  - `CudaOffloadLUFactorization()` - Uses LU factorization on NVIDIA GPUs (generally faster for well-conditioned problems)
+  - `CudaOffloadQRFactorization()` - Uses QR factorization on NVIDIA GPUs (more stable for ill-conditioned problems)
+  - `AMDGPUOffloadLUFactorization()` - Uses LU factorization on AMD GPUs (generally faster for well-conditioned problems)
+  - `AMDGPUOffloadQRFactorization()` - Uses QR factorization on AMD GPUs (more stable for ill-conditioned problems)
+
 ## GPUArray Interface
 
 For more manual control over the factorization setup, you can use the
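Putting the tutorial snippets together, a self-contained CUDA run might look like the following (a sketch assuming the tutorial's `LS` alias for LinearSolve and a CUDA-capable GPU; sizes are illustrative):

```julia
import LinearSolve as LS
using CUDA

A = rand(Float32, 1024, 1024)
b = rand(Float32, 1024)
prob = LS.LinearProblem(A, b)

sol_lu = LS.solve(prob, LS.CudaOffloadLUFactorization())  # LU on the GPU, result back on the CPU
sol_qr = LS.solve(prob, LS.CudaOffloadQRFactorization())  # QR on the GPU, result back on the CPU
sol_lu.u ≈ sol_qr.u  # should agree up to floating-point error
```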

ext/LinearSolveCUDAExt.jl
Lines changed: 43 additions & 1 deletion

@@ -5,7 +5,7 @@ using LinearSolve: LinearSolve, is_cusparse, defaultalg, cudss_loaded, DefaultLi
                   DefaultAlgorithmChoice, ALREADY_WARNED_CUDSS, LinearCache,
                   needs_concrete_A,
                   error_no_cudss_lu, init_cacheval, OperatorAssumptions,
-                  CudaOffloadFactorization,
+                  CudaOffloadFactorization, CudaOffloadLUFactorization, CudaOffloadQRFactorization,
                   SparspakFactorization, KLUFactorization, UMFPACKFactorization
 using LinearSolve.LinearAlgebra, LinearSolve.SciMLBase, LinearSolve.ArrayInterface
 using SciMLBase: AbstractSciMLOperator

@@ -35,6 +35,48 @@ function LinearSolve.error_no_cudss_lu(A::CUDA.CUSPARSE.CuSparseMatrixCSR)
     nothing
 end
 
+function SciMLBase.solve!(cache::LinearSolve.LinearCache, alg::CudaOffloadLUFactorization;
+        kwargs...)
+    if cache.isfresh
+        fact = lu(CUDA.CuArray(cache.A))
+        cache.cacheval = fact
+        cache.isfresh = false
+    end
+    y = Array(ldiv!(CUDA.CuArray(cache.u), cache.cacheval, CUDA.CuArray(cache.b)))
+    cache.u .= y
+    SciMLBase.build_linear_solution(alg, y, nothing, cache)
+end
+
+function LinearSolve.init_cacheval(alg::CudaOffloadLUFactorization, A, b, u, Pl, Pr,
+        maxiters::Int, abstol, reltol, verbose::Bool,
+        assumptions::OperatorAssumptions)
+    T = eltype(A)
+    noUnitT = typeof(zero(T))
+    luT = LinearAlgebra.lutype(noUnitT)
+    ipiv = CuVector{Int32}(undef, 0)
+    info = zero(LinearAlgebra.BlasInt)
+    return LU{luT}(CuMatrix{Float64}(undef, 0, 0), ipiv, info)
+end
+
+function SciMLBase.solve!(cache::LinearSolve.LinearCache, alg::CudaOffloadQRFactorization;
+        kwargs...)
+    if cache.isfresh
+        fact = qr(CUDA.CuArray(cache.A))
+        cache.cacheval = fact
+        cache.isfresh = false
+    end
+    y = Array(ldiv!(CUDA.CuArray(cache.u), cache.cacheval, CUDA.CuArray(cache.b)))
+    cache.u .= y
+    SciMLBase.build_linear_solution(alg, y, nothing, cache)
+end
+
+function LinearSolve.init_cacheval(alg::CudaOffloadQRFactorization, A, b, u, Pl, Pr,
+        maxiters::Int, abstol, reltol, verbose::Bool,
+        assumptions::OperatorAssumptions)
+    qr(CUDA.CuArray(A))
+end
+
+# Keep the deprecated CudaOffloadFactorization working by forwarding to QR
 function SciMLBase.solve!(cache::LinearSolve.LinearCache, alg::CudaOffloadFactorization;
         kwargs...)
     if cache.isfresh
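Both `solve!` methods above follow LinearSolve.jl's standard caching pattern: the GPU factorization is recomputed only while `cache.isfresh` is true, so repeated solves with new right-hand sides reuse the cached factor and only pay for the transfers and the triangular solves. A user-level sketch of what this enables (sizes are illustrative):

```julia
using LinearSolve, CUDA

A = rand(Float32, 512, 512)
b = rand(Float32, 512)
cache = init(LinearProblem(A, b), CudaOffloadLUFactorization())

sol1 = solve!(cache)  # first call: factorizes A on the GPU and caches the result

cache.b = rand(Float32, 512)
sol2 = solve!(cache)  # A unchanged: reuses the cached LU factor, only ldiv! runs
```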

lib/LinearSolveAutotune/src/algorithms.jl
Lines changed: 3 additions & 3 deletions

@@ -84,10 +84,10 @@ function get_gpu_algorithms(; skip_missing_algs::Bool = false)
     # CUDA algorithms
     if is_cuda_available()
         try
-            push!(gpu_algs, CudaOffloadFactorization())
-            push!(gpu_names, "CudaOffloadFactorization")
+            push!(gpu_algs, CudaOffloadLUFactorization())
+            push!(gpu_names, "CudaOffloadLUFactorization")
         catch e
-            msg = "CUDA hardware detected but CudaOffloadFactorization not available: $e. Load CUDA.jl package."
+            msg = "CUDA hardware detected but CudaOffloadLUFactorization not available: $e. Load CUDA.jl package."
             if skip_missing_algs
                 @warn msg
             else

src/LinearSolve.jl
Lines changed: 4 additions & 1 deletion

@@ -254,7 +254,10 @@ export KrylovJL, KrylovJL_CG, KrylovJL_MINRES, KrylovJL_GMRES,
 export SimpleGMRES
 
 export HYPREAlgorithm
-export CudaOffloadFactorization, AMDGPUOffloadLUFactorization, AMDGPUOffloadQRFactorization
+export CudaOffloadFactorization
+export CudaOffloadLUFactorization
+export CudaOffloadQRFactorization
+export AMDGPUOffloadLUFactorization, AMDGPUOffloadQRFactorization
 export MKLPardisoFactorize, MKLPardisoIterate
 export PanuaPardisoFactorize, PanuaPardisoIterate
 export PardisoJL

src/extension_algs.jl
Lines changed: 49 additions & 2 deletions

@@ -61,23 +61,70 @@ struct HYPREAlgorithm <: SciMLLinearSolveAlgorithm
     end
 end
 
+# Debug: About to define CudaOffloadLUFactorization
+"""
+`CudaOffloadLUFactorization()`
+
+An offloading technique used to GPU-accelerate CPU-based computations using LU factorization.
+Requires a sufficiently large `A` to overcome the data transfer costs.
+
+!!! note
+
+    Using this solver requires adding the package CUDA.jl, i.e. `using CUDA`
+"""
+struct CudaOffloadLUFactorization <: AbstractFactorization
+    function CudaOffloadLUFactorization()
+        ext = Base.get_extension(@__MODULE__, :LinearSolveCUDAExt)
+        if ext === nothing
+            error("CudaOffloadLUFactorization requires that CUDA is loaded, i.e. `using CUDA`")
+        else
+            return new()
+        end
+    end
+end
+
+"""
+`CudaOffloadQRFactorization()`
+
+An offloading technique used to GPU-accelerate CPU-based computations using QR factorization.
+Requires a sufficiently large `A` to overcome the data transfer costs.
+
+!!! note
+
+    Using this solver requires adding the package CUDA.jl, i.e. `using CUDA`
+"""
+struct CudaOffloadQRFactorization <: AbstractFactorization
+    function CudaOffloadQRFactorization()
+        ext = Base.get_extension(@__MODULE__, :LinearSolveCUDAExt)
+        if ext === nothing
+            error("CudaOffloadQRFactorization requires that CUDA is loaded, i.e. `using CUDA`")
+        else
+            return new()
+        end
+    end
+end
+
 """
 `CudaOffloadFactorization()`
 
+!!! warning
+    This algorithm is deprecated. Use `CudaOffloadLUFactorization` or `CudaOffloadQRFactorization` instead.
+
 An offloading technique used to GPU-accelerate CPU-based computations.
 Requires a sufficiently large `A` to overcome the data transfer costs.
 
 !!! note
 
     Using this solver requires adding the package CUDA.jl, i.e. `using CUDA`
 """
-struct CudaOffloadFactorization <: LinearSolve.AbstractFactorization
+struct CudaOffloadFactorization <: AbstractFactorization
     function CudaOffloadFactorization()
+        Base.depwarn("`CudaOffloadFactorization` is deprecated, use `CudaOffloadLUFactorization` or `CudaOffloadQRFactorization` instead.", :CudaOffloadFactorization)
         ext = Base.get_extension(@__MODULE__, :LinearSolveCUDAExt)
         if ext === nothing
            error("CudaOffloadFactorization requires that CUDA is loaded, i.e. `using CUDA`")
         else
-            return new{}()
+            return new()
         end
     end
 end
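All three constructors use the same guard: `Base.get_extension` returns the extension module only when the weak dependency is actually loaded, so constructing the algorithm without CUDA.jl fails eagerly with a clear message rather than at solve time. A generic sketch of this pattern, with hypothetical names (`MyGPUAlgorithm`, `MyPkgCUDAExt`) and assuming it lives inside a package that declares the corresponding weak dependency:

```julia
# `MyGPUAlgorithm` and `MyPkgCUDAExt` are illustrative names, not part of LinearSolve.jl.
struct MyGPUAlgorithm <: AbstractFactorization
    function MyGPUAlgorithm()
        # get_extension returns `nothing` unless the weak dep (and thus the extension) is loaded
        ext = Base.get_extension(@__MODULE__, :MyPkgCUDAExt)
        if ext === nothing
            error("MyGPUAlgorithm requires that CUDA is loaded, i.e. `using CUDA`")
        end
        return new()
    end
end
```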

test/gpu/cuda.jl
Lines changed: 1 addition & 1 deletion

@@ -45,7 +45,7 @@ function test_interface(alg, prob1, prob2)
     return
 end
 
-@testset "$alg" for alg in (CudaOffloadFactorization(), NormalCholeskyFactorization())
+@testset "$alg" for alg in (CudaOffloadLUFactorization(), CudaOffloadQRFactorization(), NormalCholeskyFactorization())
     test_interface(alg, prob1, prob2)
 end
 
test/resolve.jl
Lines changed: 2 additions & 0 deletions

@@ -11,6 +11,8 @@ for alg in vcat(InteractiveUtils.subtypes(AbstractDenseFactorization),
     if !(alg in [
         DiagonalFactorization,
         CudaOffloadFactorization,
+        CudaOffloadLUFactorization,
+        CudaOffloadQRFactorization,
         CUSOLVERRFFactorization,
         AppleAccelerateLUFactorization,
         MetalLUFactorization
