# CUDA Backend
AdaptiveArrayPools provides native CUDA support through a package extension that loads automatically when CUDA.jl is loaded in the session.
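To confirm the extension is active (for example, in a test suite), you can query it with `Base.get_extension`. The extension module name below is an assumption; check the package's `Project.toml` for the actual name:

```julia
using AdaptiveArrayPools
using CUDA  # loading CUDA.jl triggers the package extension

# Hypothetical extension module name -- see [extensions] in Project.toml
ext = Base.get_extension(AdaptiveArrayPools, :AdaptiveArrayPoolsCUDAExt)
@assert ext !== nothing "CUDA extension did not load"
```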
## Quick Start
```julia
using AdaptiveArrayPools, CUDA

# Use the :cuda backend for GPU arrays
@with_pool :cuda pool function gpu_computation(n)
    A = acquire!(pool, Float64, n, n)  # CuArray
    B = acquire!(pool, Float64, n, n)  # CuArray
    fill!(A, 1.0)
    fill!(B, 2.0)
    return sum(A .+ B)
end

# Zero GPU allocation in hot loops
for i in 1:1000
    gpu_computation(100)  # GPU memory reused from pool
end
```

## API
The CUDA backend uses the same API as the CPU backend, with the `:cuda` backend specifier (a short inspection sketch follows the table):
| Macro/Function | Description |
|---|---|
| `@with_pool :cuda pool expr` | GPU pool with automatic checkpoint/rewind |
| `acquire!(pool, T, dims...)` | Returns a `CuArray` (always 0 bytes of GPU allocation) |
| `acquire_view!(pool, T, dims...)` | Returns a `CuArray` (same as `acquire!` on CUDA) |
| `get_task_local_cuda_pool()` | Returns the task-local CUDA pool |
| `pool_stats(:cuda)` | Prints CUDA pool statistics |
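For inspection outside `@with_pool`, the two helper functions from the table can be used directly; a minimal sketch:

```julia
using AdaptiveArrayPools, CUDA

# Grab the pool owned by the current Task
pool = get_task_local_cuda_pool()
@show typeof(pool)

# Print statistics for the CUDA backend's pool
pool_stats(:cuda)
```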
## Return Types
| Function | 1-D Return | N-D Return |
|---|---|---|
| `acquire!` | `CuArray{T,1}` | `CuArray{T,N}` |
| `acquire_view!` | `CuArray{T,1}` | `CuArray{T,N}` |
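A quick sanity check of these return types, reusing the macro pattern from Quick Start (sketch):

```julia
using AdaptiveArrayPools, CUDA

@with_pool :cuda pool function check_types()
    v = acquire!(pool, Float64, 16)         # 1-D
    M = acquire_view!(pool, Float64, 4, 4)  # N-D
    @assert v isa CuArray{Float64,1}
    @assert M isa CuArray{Float64,2}
    return nothing
end

check_types()
```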
## Allocation Behavior
**GPU memory**: always 0 bytes allocated after warmup. The underlying `CuVector` is resized as needed and reused.

**CPU-side wrapper memory** (for N-D `acquire!` on CUDA):

- The CUDA backend uses `arr_wrappers`-based direct-index caching for `CuArray` wrapper reuse
- Each dimensionality `N` has one cached wrapper per slot, reused via `setfield!(:dims)`
- After warmup: zero CPU-side allocation for any number of dimension patterns with the same `N` (see the measurement sketch after this list)
- Different `N` values each get their own cached wrapper (also zero-alloc after first use)
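One way to measure both claims after warmup; this assumes `CUDA.@allocated` is available in your CUDA.jl version (it reports bytes allocated on the GPU, analogous to Base's `@allocated` for the CPU):

```julia
using AdaptiveArrayPools, CUDA

@with_pool :cuda pool function touch!(n)
    A = acquire!(pool, Float64, n, n)
    fill!(A, 1.0)
    return nothing
end

touch!(64)  # warmup: grows the pool and caches the 2-D wrapper

gpu_bytes = CUDA.@allocated touch!(64)  # expect 0 after warmup
cpu_bytes = @allocated touch!(64)       # CPU-side wrapper reuse: expect 0 too
@show gpu_bytes cpu_bytes
```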
## Fixed Slot Types
Optimized types with pre-allocated slots (same as CPU):
| Type | Field |
|---|---|
| `Float64` | `.float64` |
| `Float32` | `.float32` |
| `Float16` | `.float16` |
| `Int64` | `.int64` |
| `Int32` | `.int32` |
| `ComplexF64` | `.complexf64` |
| `ComplexF32` | `.complexf32` |
| `Bool` | `.bool` |
Other types use the fallback dictionary (`.others`).
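For instance, a `Float64` request hits the pre-allocated `.float64` slot, while a type outside the table (say `UInt8`) goes through the `.others` dictionary; both return `CuArray`s (sketch):

```julia
using AdaptiveArrayPools, CUDA

@with_pool :cuda pool function mixed_types(n)
    x = acquire!(pool, Float64, n)  # fixed slot: the pool's .float64 field
    y = acquire!(pool, UInt8, n)    # fallback: the .others dictionary
    fill!(x, 0.0)
    fill!(y, 0x00)
    return nothing
end

mixed_types(128)
```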
## Limitations
- **Julia 1.11+**: required for the `setfield!`-based `Array` internals used by the GPU extension
- **No `@maybe_with_pool :cuda`**: the runtime toggle is not supported for the CUDA backend
- **Task-local only**: each `Task` gets its own CUDA pool, same as CPU (demonstrated in the sketch below)
- **Same device**: all arrays in a pool use the same CUDA device
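The task-local rule means pools are never shared across tasks; a minimal sketch using `get_task_local_cuda_pool` from the API table:

```julia
using AdaptiveArrayPools, CUDA

pool_here  = get_task_local_cuda_pool()
pool_there = fetch(Threads.@spawn get_task_local_cuda_pool())

@show pool_here === pool_there  # false: each Task owns a separate CUDA pool
```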
## Example: Matrix Multiplication
```julia
using AdaptiveArrayPools, CUDA, LinearAlgebra
using Random  # rand! is exported by Random, not by CUDA

@with_pool :cuda pool function gpu_matmul(n)
    A = acquire!(pool, Float64, n, n)
    B = acquire!(pool, Float64, n, n)
    C = acquire!(pool, Float64, n, n)
    rand!(A); rand!(B)
    mul!(C, A, B)
    return sum(C)
end

# Warmup
gpu_matmul(100)

# Benchmark - zero GPU allocation
using BenchmarkTools
@benchmark gpu_matmul(1000)
```

## Debugging
```julia
# Check pool state
pool_stats(:cuda)

# Output:
# CuAdaptiveArrayPool (device 0)
#   Float64 (fixed) [GPU]
#     slots: 3 (active: 0)
#     elements: 30000 (234.375 KiB)
```