AcceleratedKernels v0.4.3
- Made ScanPrefixes the default algorithm for accumulate / cumsum / cumprod. It is almost always faster than DecoupledLookback on real-world data and does not depend on cross-block communication, even though DecoupledLookback theoretically scales better asymptotically.
- Prepared AcceleratedKernels for the future PoCL backend becoming the default KernelAbstractions CPU backend; the Threads-based algorithms will remain the defaults until the PoCL-based ones become faster.
- A lot of housekeeping.
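
For readers unfamiliar with the two scan strategies, here is a minimal, language-agnostic sketch in Python of the general scan-then-propagate pattern that a prefixes-based scan typically follows. It is an illustration under assumptions, not the AcceleratedKernels implementation: the function name `blocked_scan` and the `block_size` parameter are hypothetical, and the three passes run sequentially here, whereas on a GPU each block would be processed in parallel. Because every pass only reads results that a previous pass has fully written, no block has to wait on a neighbouring block mid-kernel, which is the cross-block communication a decoupled look-back scan relies on.

```python
from itertools import accumulate
from operator import add


def blocked_scan(values, block_size=4, op=add):
    """Inclusive prefix scan in three passes (hypothetical illustration)."""
    # Pass 1: scan each block independently; in a real kernel, one GPU
    # block would handle each chunk in parallel.
    blocks = [
        list(accumulate(values[i:i + block_size], op))
        for i in range(0, len(values), block_size)
    ]
    # Pass 2: scan the per-block totals, giving the running prefix that
    # ends at each block.
    totals = list(accumulate((b[-1] for b in blocks), op))
    # Pass 3: combine every block after the first with the prefix of all
    # blocks before it. The prefix goes on the left so that associative
    # but non-commutative operators keep their order.
    out = list(blocks[0]) if blocks else []
    for prefix, block in zip(totals, blocks[1:]):
        out.extend(op(prefix, x) for x in block)
    return out
```

Calling `blocked_scan(list(range(1, 11)), block_size=4)` gives the same result as a sequential cumulative sum over the same list; the block size only changes how the work is split, not the output.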
Merged pull requests:
- Typo in accumulate benchmarks (#42) (@christiangnrd)
- Use UnsafeAtomics to fix race in accumulate (#44) (@vchuravy)
- Stop relying on backend type to determine algorithm used (#45) (@christiangnrd)
- Test both 1d accumulate algorithms when supported (#49) (@christiangnrd)
- neutral_element fixes (#52) (@christiangnrd)
- Deduplicate reduce_group (#55) (@christiangnrd)
- Tweak backend selection (#56) (@christiangnrd)
- Vc/accumulate alg: made ScanPrefixes the default accumulate algorithm; added atomic orderings to DecoupledLookback. (#57) (@anicusan)
Closed issues:
- Port over GPUArrays neutral_element fixes (#51)