CUDA Backend

AdaptiveArrayPools provides native CUDA support through a package extension that activates automatically once CUDA.jl is loaded.

Quick Start

using AdaptiveArrayPools, CUDA

# Use :cuda backend for GPU arrays
@with_pool :cuda pool function gpu_computation(n)
    A = acquire!(pool, Float64, n, n)  # CuArray view
    B = acquire!(pool, Float64, n, n)  # CuArray view

    fill!(A, 1.0)
    fill!(B, 2.0)

    A .+= B   # in-place update, so the broadcast does not allocate a temporary GPU array
    return sum(A)
end

# Zero GPU allocation in hot loops
for i in 1:1000
    gpu_computation(100)  # GPU memory reused from pool
end

API

The CUDA backend uses the same API as the CPU backend, selected with the :cuda backend specifier:

Macro/Function                       Description
@with_pool :cuda pool expr           GPU pool with automatic checkpoint/rewind
acquire!(pool, T, dims...)           Returns CuArray view (always 0 bytes GPU alloc)
unsafe_acquire!(pool, T, dims...)    Returns raw CuArray (for FFI/type constraints)
get_task_local_cuda_pool()           Returns the task-local CUDA pool
pool_stats(:cuda)                    Print CUDA pool statistics

Return Types

Function           1D Return              N-D Return
acquire!           CuArray{T,1} (view)    CuArray{T,N} (view)
unsafe_acquire!    CuArray{T,1}           CuArray{T,N}
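
For reference, a minimal sketch contrasting the two acquisition paths; the function name views_vs_raw and the values are illustrative, and the calling convention follows the Quick Start example above:

using AdaptiveArrayPools, CUDA

@with_pool :cuda pool function views_vs_raw(n)
    A = acquire!(pool, Float32, n)         # view into the pooled GPU buffer
    B = unsafe_acquire!(pool, Float32, n)  # standalone CuArray, e.g. when a
                                           # callee insists on a plain CuArray

    fill!(A, 1f0)
    fill!(B, 2f0)
    return sum(A) + sum(B)
end

views_vs_raw(1024)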

Allocation Behavior

GPU Memory: Always 0 bytes allocated after warmup. The underlying CuVector is resized as needed and reused.

CPU Memory:

  • Cache hit (≤4 dimension patterns per slot): 0 bytes
  • Cache miss (>4 patterns): ~100 bytes for wrapper metadata

# Example: 4 patterns fit in 4-way cache → zero CPU allocation
dims_list = ((10, 10), (5, 20), (20, 5), (4, 25))
for dims in dims_list
    @with_pool :cuda p begin
        A = acquire!(p, Float64, dims...)
        # Use A...
    end
end
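
For contrast, a sketch of the cache-miss case: once a slot sees more than four distinct dimension patterns, each additional pattern pays the ~100-byte CPU wrapper cost described above (the specific dimensions are arbitrary):

# Five patterns overflow the 4-way cache, so some calls rebuild the wrapper
# metadata (~100 bytes CPU) even though GPU memory is still reused
dims_list = ((10, 10), (5, 20), (20, 5), (4, 25), (2, 50))
for dims in dims_list
    @with_pool :cuda p begin
        A = acquire!(p, Float64, dims...)
        # Use A...
    end
end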

Fixed Slot Types

Optimized types with pre-allocated slots (same as CPU):

Type          Field
Float64       .float64
Float32       .float32
Float16       .float16
Int64         .int64
Int32         .int32
ComplexF64    .complexf64
ComplexF32    .complexf32
Bool          .bool

Other types use the fallback dictionary (.others).
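
A small sketch of the fallback path, assuming it is used exactly like the fixed slots and only the internal lookup differs:

@with_pool :cuda pool begin
    # UInt8 has no fixed slot, so this acquisition goes through .others
    mask = acquire!(pool, UInt8, 1024)
    fill!(mask, 0x01)
    sum(mask)
end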

Limitations

  • No @maybe_with_pool :cuda: the runtime toggle is not supported for the CUDA backend
  • Task-local only: each Task gets its own CUDA pool, same as CPU (see the sketch below)
  • Same device: all arrays in a pool use the same CUDA device
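
The task-local behaviour can be observed directly; a minimal sketch, assuming get_task_local_cuda_pool() lazily creates the pool for the calling Task:

using AdaptiveArrayPools, CUDA

# Two separate Tasks each see their own CUDA pool object
t1 = Threads.@spawn get_task_local_cuda_pool()
t2 = Threads.@spawn get_task_local_cuda_pool()
@assert fetch(t1) !== fetch(t2)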

Example: Matrix Multiplication

using AdaptiveArrayPools, CUDA, LinearAlgebra, Random  # Random provides rand!

@with_pool :cuda pool function gpu_matmul(n)
    A = acquire!(pool, Float64, n, n)
    B = acquire!(pool, Float64, n, n)
    C = acquire!(pool, Float64, n, n)

    rand!(A); rand!(B)
    mul!(C, A, B)

    return sum(C)
end

# Warmup
gpu_matmul(100)

# Benchmark - zero GPU allocation
using BenchmarkTools
@benchmark gpu_matmul(1000)
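
CUDA.@time is another quick check: it prints GPU allocation counts alongside the timing, so after the warmup call above the pool acquires themselves should contribute no GPU allocations (reductions such as sum may still allocate small temporaries):

CUDA.@time gpu_matmul(1000)   # inspect the GPU allocation count in the output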

Debugging

# Check pool state
pool_stats(:cuda)

# Output:
# CuAdaptiveArrayPool (device 0)
#   Float64 (fixed) [GPU]
#     slots: 3 (active: 0)
#     elements: 30000 (234.375 KiB)