# CUDA Backend
AdaptiveArrayPools provides native CUDA support through a package extension that loads automatically when CUDA.jl is available.
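On Julia 1.9+ you can confirm the extension actually loaded via `Base.get_extension`; the extension module name below is an assumption, so check the package's `Project.toml` for the real one:

```julia
using AdaptiveArrayPools, CUDA

# :AdaptiveArrayPoolsCUDAExt is a hypothetical extension module name
ext = Base.get_extension(AdaptiveArrayPools, :AdaptiveArrayPoolsCUDAExt)
ext === nothing && @warn "CUDA extension not loaded"
```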
## Quick Start
```julia
using AdaptiveArrayPools, CUDA

# Use the :cuda backend for GPU arrays
@with_pool :cuda pool function gpu_computation(n)
    A = acquire!(pool, Float64, n, n)  # CuArray view
    B = acquire!(pool, Float64, n, n)  # CuArray view
    fill!(A, 1.0)
    fill!(B, 2.0)
    return sum(A .+ B)
end

# Zero GPU allocation in hot loops
for i in 1:1000
    gpu_computation(100)  # GPU memory reused from pool
end
```

## API
The CUDA backend exposes the same API as the CPU backend, selected with the `:cuda` backend specifier:
| Macro/Function | Description |
|---|---|
| `@with_pool :cuda pool expr` | GPU pool with automatic checkpoint/rewind |
| `acquire!(pool, T, dims...)` | Returns a `CuArray` view (always 0 bytes of GPU allocation) |
| `unsafe_acquire!(pool, T, dims...)` | Returns a raw `CuArray` (for FFI/type constraints) |
| `get_task_local_cuda_pool()` | Returns the task-local CUDA pool |
| `pool_stats(:cuda)` | Prints CUDA pool statistics |
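A minimal sketch of the lower-level entry points, assuming the same calling conventions as the CPU backend (the FFI constraint in the comment is illustrative, not a real downstream requirement):

```julia
using AdaptiveArrayPools, CUDA

@with_pool :cuda pool begin
    # acquire! returns a view; fine for broadcasting and most kernels
    A = acquire!(pool, Float32, 256, 256)

    # unsafe_acquire! returns a raw CuArray, e.g. for a call that
    # cannot accept a view (hypothetical constraint)
    B = unsafe_acquire!(pool, Float32, 256, 256)

    fill!(A, 1f0)
    fill!(B, 2f0)
end

# The task-local pool can also be inspected directly
p = get_task_local_cuda_pool()
```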
## Return Types
| Function | 1-D Return | N-D Return |
|---|---|---|
| `acquire!` | `CuArray{T,1}` (view) | `CuArray{T,N}` (view) |
| `unsafe_acquire!` | `CuArray{T,1}` | `CuArray{T,N}` |
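A quick sanity check of the table above; these assertions are a sketch based on the documented return types, not a test shipped with the package:

```julia
using AdaptiveArrayPools, CUDA

@with_pool :cuda pool begin
    v = acquire!(pool, Float64, 100)     # 1-D
    M = acquire!(pool, Float64, 10, 10)  # N-D
    @assert v isa CuArray{Float64,1}
    @assert M isa CuArray{Float64,2}
end
```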
## Allocation Behavior
**GPU memory:** Always 0 bytes of GPU allocation after warmup. The underlying `CuVector` is resized as needed and reused.

**CPU memory:**

- Cache hit (≤ 4 dimension patterns per slot): 0 bytes
- Cache miss (> 4 patterns): ~100 bytes for wrapper metadata
```julia
# Example: 4 patterns fit in the 4-way cache → zero CPU allocation
dims_list = ((10, 10), (5, 20), (20, 5), (4, 25))
for dims in dims_list
    @with_pool :cuda p begin
        A = acquire!(p, Float64, dims...)
        # Use A...
    end
end
```
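Conversely, cycling through more than four patterns forces cache misses. A sketch of measuring the CPU-side cost with `@allocated`; the exact byte counts depend on the Julia version and are an assumption here:

```julia
# 5 patterns cycle through the 4-way cache → occasional small CPU allocations
dims_list = ((10, 10), (5, 20), (20, 5), (4, 25), (2, 50))
for dims in dims_list  # warmup pass
    @with_pool :cuda p begin
        acquire!(p, Float64, dims...)
    end
end
cpu_bytes = @allocated for dims in dims_list
    @with_pool :cuda p begin
        acquire!(p, Float64, dims...)
    end
end
@show cpu_bytes  # small and bounded: wrapper metadata only
```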
## Fixed Slot Types

Optimized types with pre-allocated slots (same as CPU):
| Type | Field |
|---|---|
| `Float64` | `.float64` |
| `Float32` | `.float32` |
| `Float16` | `.float16` |
| `Int64` | `.int64` |
| `Int32` | `.int32` |
| `ComplexF64` | `.complexf64` |
| `ComplexF32` | `.complexf32` |
| `Bool` | `.bool` |
Other types use the fallback dictionary (`.others`).
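Non-fixed types still work; they simply route through the dictionary path. A sketch, assuming `UInt8` is not among the fixed slots listed above:

```julia
using AdaptiveArrayPools, CUDA

@with_pool :cuda pool begin
    mask  = acquire!(pool, Bool, 1024)   # fixed slot: .bool
    bytes = acquire!(pool, UInt8, 1024)  # fallback dictionary: .others
    fill!(mask, false)
    fill!(bytes, 0x00)
end
```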
## Limitations
- **No `@maybe_with_pool :cuda`**: the runtime toggle is not supported for the CUDA backend
- **Task-local only**: each `Task` gets its own CUDA pool, same as the CPU backend (see the sketch below)
- **Same device**: all arrays in a pool use the same CUDA device
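A sketch of the task-local behavior, using only `get_task_local_cuda_pool` from the API table; the identity check reflects the documented per-`Task` semantics:

```julia
using AdaptiveArrayPools, CUDA

pool_main = get_task_local_cuda_pool()

t = Threads.@spawn get_task_local_cuda_pool()
pool_task = fetch(t)

# Each Task owns its own pool
@assert pool_main !== pool_task
```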
## Example: Matrix Multiplication
```julia
using AdaptiveArrayPools, CUDA, LinearAlgebra, Random

@with_pool :cuda pool function gpu_matmul(n)
    A = acquire!(pool, Float64, n, n)
    B = acquire!(pool, Float64, n, n)
    C = acquire!(pool, Float64, n, n)
    rand!(A); rand!(B)
    mul!(C, A, B)
    return sum(C)
end

# Warmup
gpu_matmul(100)

# Benchmark - zero GPU allocation
using BenchmarkTools
@benchmark gpu_matmul(1000)
```

## Debugging
```julia
# Check pool state
pool_stats(:cuda)

# Output:
# CuAdaptiveArrayPool (device 0)
#   Float64 (fixed) [GPU]
#     slots: 3 (active: 0)
#     elements: 30000 (234.375 KiB)
```
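To cross-check the zero-GPU-allocation claim independently of the pool's own statistics, `CUDA.@time` reports GPU allocation counts (standard CUDA.jl tooling, not part of AdaptiveArrayPools):

```julia
gpu_matmul(100)             # warmup
CUDA.@time gpu_matmul(100)  # expect 0 GPU allocations after warmup
```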