# CUDA Backend

AdaptiveArrayPools provides native CUDA support through a package extension that loads automatically when CUDA.jl is available.

## Quick Start

```julia
using AdaptiveArrayPools, CUDA

# Use the :cuda backend for GPU arrays
@with_pool :cuda pool function gpu_computation(n)
    A = acquire!(pool, Float64, n, n)  # CuArray
    B = acquire!(pool, Float64, n, n)  # CuArray

    fill!(A, 1.0)
    fill!(B, 2.0)

    B .+= A       # in-place broadcast: no temporary GPU array
    return sum(B)
end

# Zero pool-related GPU allocation in hot loops
for i in 1:1000
    gpu_computation(100)  # GPU memory reused from the pool
end
```

## API

The CUDA backend exposes the same API as the CPU backend, with the `:cuda` backend specifier:

| Macro/Function | Description |
|----------------|-------------|
| `@with_pool :cuda pool expr` | GPU pool with automatic checkpoint/rewind |
| `acquire!(pool, T, dims...)` | Returns a `CuArray` (always 0 bytes of GPU allocation) |
| `acquire_view!(pool, T, dims...)` | Returns a `CuArray` (same as `acquire!` on CUDA) |
| `get_task_local_cuda_pool()` | Returns the task-local CUDA pool |
| `pool_stats(:cuda)` | Prints CUDA pool statistics |
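As a small sketch of the table above: on the `:cuda` backend both acquisition functions hand back a `CuArray` (the function name `gpu_norms` is hypothetical, not part of the package):

```julia
using AdaptiveArrayPools, CUDA

# Hypothetical example: both calls return CuArrays on the :cuda backend.
@with_pool :cuda pool function gpu_norms(n)
    v = acquire!(pool, Float32, n)          # CuArray{Float32,1}
    M = acquire_view!(pool, Float32, n, n)  # CuArray{Float32,2} (same as acquire! on CUDA)
    fill!(v, 1.0f0)
    fill!(M, 2.0f0)
    return sum(v) + sum(M)
end
```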

### Return Types

| Function | 1-D Return | N-D Return |
|----------|------------|------------|
| `acquire!` | `CuArray{T,1}` | `CuArray{T,N}` |
| `acquire_view!` | `CuArray{T,1}` | `CuArray{T,N}` |

## Allocation Behavior

**GPU memory**: always 0 bytes allocated after warmup. The underlying `CuVector` is resized as needed and reused.

**CPU-side wrapper memory** (for N-D `acquire!` on CUDA):

- The CUDA backend uses `arr_wrappers`-based direct-index caching to reuse `CuArray` wrappers
- Each dimensionality `N` has one cached wrapper per slot, reused via `setfield!(:dims)`
- After warmup: zero CPU-side allocation for any number of distinct dimension patterns sharing the same `N`
- Different `N` values each get their own cached wrapper (also zero-alloc after first use)
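To illustrate what the wrapper caching implies in practice, repeated calls with different shapes of the same dimensionality should incur no new wrapper allocation after warmup (a sketch; `shape_probe` is a hypothetical name, not part of the package):

```julia
using AdaptiveArrayPools, CUDA

# Hypothetical example: both calls below are 2-D (N = 2), so after the first
# call the cached 2-D wrapper is reused and only its dims field is updated.
@with_pool :cuda pool function shape_probe(m, n)
    A = acquire!(pool, Float64, m, n)
    fill!(A, 0.0)
    return size(A)
end

shape_probe(10, 20)   # warmup: creates the 2-D wrapper for this slot
shape_probe(32, 8)    # same N: wrapper reused, zero CPU-side allocation
```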

## Fixed Slot Types

Optimized types with pre-allocated slots (same as the CPU backend):

| Type | Field |
|------|-------|
| `Float64` | `.float64` |
| `Float32` | `.float32` |
| `Float16` | `.float16` |
| `Int64` | `.int64` |
| `Int32` | `.int32` |
| `ComplexF64` | `.complexf64` |
| `ComplexF32` | `.complexf32` |
| `Bool` | `.bool` |

Other types use the fallback dictionary (`.others`).
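For instance, a fixed type is served from its dedicated slot while a type outside the table falls back to the dictionary (a sketch; `mixed_types` is a hypothetical name):

```julia
using AdaptiveArrayPools, CUDA

@with_pool :cuda pool function mixed_types(n)
    a = acquire!(pool, Float64, n)  # fixed slot (.float64)
    b = acquire!(pool, UInt8, n)    # fallback dictionary (.others)
    fill!(a, 1.0)
    fill!(b, 0x01)
    return sum(a) + sum(b)
end
```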

## Limitations

- **Julia 1.11+**: required for the `setfield!`-based `Array` internals used by the GPU extension
- **No `@maybe_with_pool :cuda`**: the runtime toggle is not supported for the CUDA backend
- **Task-local only**: each `Task` gets its own CUDA pool, same as the CPU backend
- **Same device**: all arrays in a pool use the same CUDA device
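Because pools are task-local, concurrent tasks never share a pool; each spawned `Task` below acquires from its own CUDA pool (a sketch; `gpu_sum` is a hypothetical helper, not part of the package):

```julia
using AdaptiveArrayPools, CUDA

# Hypothetical helper used to demonstrate task-local pools.
@with_pool :cuda pool function gpu_sum(n)
    A = acquire!(pool, Float64, n)
    fill!(A, 1.0)
    return sum(A)
end

# Each spawned Task gets its own task-local CUDA pool.
tasks = [Threads.@spawn gpu_sum(1024) for _ in 1:4]
results = fetch.(tasks)
```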

## Example: Matrix Multiplication

```julia
using AdaptiveArrayPools, CUDA, LinearAlgebra, Random

@with_pool :cuda pool function gpu_matmul(n)
    A = acquire!(pool, Float64, n, n)
    B = acquire!(pool, Float64, n, n)
    C = acquire!(pool, Float64, n, n)

    rand!(A); rand!(B)   # Random.rand! dispatches to CUDA's GPU RNG
    mul!(C, A, B)        # in-place multiply into the pooled array

    return sum(C)
end

# Warmup
gpu_matmul(100)

# Benchmark: zero pool-related GPU allocation
using BenchmarkTools
@benchmark gpu_matmul(1000)
```

## Debugging

```julia
# Check pool state
pool_stats(:cuda)

# Output:
# CuAdaptiveArrayPool (device 0)
#   Float64 (fixed) [GPU]
#     slots: 3 (active: 0)
#     elements: 30000 (234.375 KiB)
```