# How It Works

This page explains the core mechanisms that enable zero-allocation array reuse.
## The Zero-Allocation Promise
```text
+--------------------------------------------------------------+
| Call 1 (warmup):                                             |
|   checkpoint! --> acquire! x 3 --> rewind!                   |
|                   |                                          |
|                   +-- backing memory allocated               |
|                                                              |
| Call 2+ (zero-alloc):                                        |
|   checkpoint! --> acquire! x 3 --> rewind!                   |
|                   |                                          |
|                   +-- same memory reused, 0 bytes allocated  |
+--------------------------------------------------------------+
```

## Checkpoint/Rewind Lifecycle
The core mechanism that enables memory reuse:
```julia
@with_pool pool function foo()
    |
    +---> checkpoint!(pool)        # Save current state (n_active counters)
    |
    |     A = acquire!(pool, ...)  # n_active += 1
    |     B = acquire!(pool, ...)  # n_active += 1
    |     C = acquire!(pool, ...)  # n_active += 1
    |     ... compute ...
    |
    +---> rewind!(pool)            # Restore n_active, arrays available for reuse
end
```

On repeated calls, the same memory is reused without any allocation.
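The lifecycle above can be sketched in plain Julia. This is a minimal, hypothetical model (`ToyPool`, `toy_acquire!`, etc. are illustration names, not the package's API): checkpoint saves a counter, acquire bumps it, and rewind restores it so the backing vectors become reusable.

```julia
# Minimal sketch of the checkpoint/rewind idea (NOT the real implementation):
# backing vectors are grown once during warmup, then reused on later passes.
mutable struct ToyPool
    slots::Vector{Vector{Float64}}  # backing memory, grown on first use
    n_active::Int                   # how many slots are currently handed out
end
ToyPool() = ToyPool(Vector{Float64}[], 0)

toy_checkpoint(p::ToyPool) = p.n_active        # save current state

function toy_acquire!(p::ToyPool, n::Int)
    p.n_active += 1
    if p.n_active > length(p.slots)
        push!(p.slots, Vector{Float64}(undef, n))  # warmup allocation
    end
    buf = p.slots[p.n_active]
    length(buf) < n && resize!(buf, n)
    return buf
end

toy_rewind!(p::ToyPool, cp::Int) = (p.n_active = cp)  # slots reusable again

pool = ToyPool()
cp = toy_checkpoint(pool)
a = toy_acquire!(pool, 8)   # first call: allocates backing memory
toy_rewind!(pool, cp)
b = toy_acquire!(pool, 8)   # second call: reuses the same backing vector
@assert a === b             # identical object, no new allocation
```

The real pool tracks one counter per element type, but the reuse mechanism is the same.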
## Exception Safety: `try...finally`

The `@with_pool` macro generates code with exception-safe cleanup:
```julia
# What you write:
@with_pool pool begin
    A = acquire!(pool, Float64, 100)
    result = compute(A)
end

# What the macro generates:
let pool = get_task_local_pool()
    checkpoint!(pool)
    try
        A = acquire!(pool, Float64, 100)
        result = compute(A)
    finally
        rewind!(pool)  # Always executes, even on exception
    end
end
```

**Key guarantee:** The `finally` block ensures `rewind!` is called even if an exception occurs, preventing memory leaks and state corruption.
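A self-contained sketch of this guarantee, using a plain counter as a stand-in for the pool's state (`Counter` and `with_guard` are illustration names, not part of the package):

```julia
# Hedged sketch: how a try/finally guard keeps counter-based pool state
# consistent even when the wrapped user code throws.
mutable struct Counter
    n_active::Int
end

function with_guard(f, c::Counter)
    saved = c.n_active        # checkpoint!
    try
        return f()
    finally
        c.n_active = saved    # rewind! always runs, even on exception
    end
end

c = Counter(0)
try
    with_guard(c) do
        c.n_active += 3       # pretend three acquire! calls happened
        error("boom")         # user code throws mid-block
    end
catch
end
@assert c.n_active == 0       # state restored despite the exception
```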
## Fixed-Slot Type Dispatch

To achieve zero-lookup overhead, common types have dedicated struct fields:
```julia
struct AdaptiveArrayPool
    float64::TypedPool{Float64}
    float32::TypedPool{Float32}
    int64::TypedPool{Int64}
    int32::TypedPool{Int32}
    complexf64::TypedPool{ComplexF64}
    complexf32::TypedPool{ComplexF32}
    bool::TypedPool{Bool}
    others::IdDict{DataType, Any}  # Fallback for rare types
end
```

When you call `acquire!(pool, Float64, n)`, the compiler inlines directly to `pool.float64` — no dictionary lookup, no type instability.
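The effect of fixed slots can be demonstrated with a reduced model (`MiniPool` and `slot_for` are hypothetical names for this sketch): a method per common type compiles down to a plain field access with a concrete return type, while the `IdDict` fallback needs a runtime lookup.

```julia
# Illustrative sketch of fixed-slot dispatch vs. dict fallback.
struct MiniPool
    float64::Vector{Float64}
    int64::Vector{Int64}
    others::IdDict{DataType,Any}
end

slot_for(p::MiniPool, ::Type{Float64}) = p.float64  # inlined field access
slot_for(p::MiniPool, ::Type{Int64})   = p.int64
slot_for(p::MiniPool, ::Type{T}) where {T} =
    get!(() -> T[], p.others, T)::Vector{T}          # fallback: dict lookup

p = MiniPool(Float64[], Int64[], IdDict{DataType,Any}())
@assert slot_for(p, Float64) === p.float64

# The fast path is fully inferred to a concrete type:
@assert Base.return_types(slot_for, (MiniPool, Type{Float64}))[1] == Vector{Float64}
```

The type assertion on the fallback path keeps it type-stable too, at the cost of one dictionary lookup per call.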
## N-Way Set Associative Cache

For `unsafe_acquire!` (which returns native `Array` types), we use an N-way cache to reduce header allocation:
```text
CACHE_WAYS = 4 (default)

                  ┌────┬────┬────┬────┐
Slot 0 (Float64): │way0│way1│way2│way3│  ← round-robin eviction
                  └────┴────┴────┴────┘
                  ┌────┬────┬────┬────┐
Slot 1 (Float32): │way0│way1│way2│way3│
                  └────┴────┴────┴────┘
...
```

### Cache Lookup Pseudocode
```julia
# Pseudocode — simplified from the actual implementation
function unsafe_acquire!(pool, T, dims...)
    typed_pool = get_typed_pool!(pool, T)
    slot = n_active + 1
    base = (slot - 1) * CACHE_WAYS

    # Search all ways for matching dimensions
    for k in 1:CACHE_WAYS
        idx = base + k
        if dims == typed_pool.nd_dims[idx]
            # Cache hit! Check if the underlying vector was resized
            if pointer matches
                return typed_pool.nd_arrays[idx]
            end
        end
    end

    # Cache miss: create a new Array header, store in next way (round-robin)
    way = typed_pool.nd_next_way[slot]
    typed_pool.nd_next_way[slot] = (way + 1) % CACHE_WAYS
    # ... create and cache Array ...
end
```

**Key insight:** Even on a cache miss, only the `Array` header (~80-144 bytes) is allocated. The actual data memory is always reused from the pool.
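A runnable toy version of the lookup logic may help. This sketch (`WayCache` and `lookup!` are illustration names) keeps up to `WAYS` cached `(dims, header)` pairs for one slot and evicts in round-robin order on a miss:

```julia
# Toy sketch of an N-way, round-robin header cache for one slot.
const WAYS = 4

mutable struct WayCache
    dims::Vector{Any}       # cached dims per way (nothing = empty)
    headers::Vector{Any}    # cached Array headers per way
    next_way::Int           # round-robin eviction cursor (0-based)
end
WayCache() = WayCache(Vector{Any}(nothing, WAYS), Vector{Any}(nothing, WAYS), 0)

function lookup!(c::WayCache, make, dims)
    # Search all ways for matching dimensions
    for k in 1:WAYS
        c.dims[k] == dims && return (c.headers[k], true)   # hit: reuse header
    end
    # Miss: build a fresh header, evict the next way round-robin
    w = c.next_way + 1
    c.next_way = (c.next_way + 1) % WAYS
    c.dims[w] = dims
    c.headers[w] = make(dims)
    return (c.headers[w], false)
end

cache = WayCache()
h1, hit1 = lookup!(cache, d -> zeros(d...), (2, 3))  # miss: header created
h2, hit2 = lookup!(cache, d -> zeros(d...), (2, 3))  # hit: header reused
@assert !hit1 && hit2 && h1 === h2
```

In the real pool the "header" wraps existing backing memory, so only the small wrapper object is created on a miss.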
## View vs Array Return Types

Type stability is critical for performance. AdaptiveArrayPools provides two APIs:
| API | 1D Return | N-D Return | Allocation |
|---|---|---|---|
| `acquire!` | `SubArray{T,1}` | `ReshapedArray{T,N}` | Always 0 bytes |
| `unsafe_acquire!` | `Vector{T}` | `Array{T,N}` | 0 bytes (hit) / ~100 bytes (miss) |
### Why Two APIs?

**`acquire!` (views)** — The compiler can eliminate view wrappers entirely through SROA (Scalar Replacement of Aggregates) and escape analysis. This is why 1D `SubArray` and N-D `ReshapedArray` achieve true zero allocation.

**`unsafe_acquire!` (arrays)** — Sometimes you need a concrete `Array` type:

- FFI/C interop requiring `Ptr{T}` from contiguous memory
- Type signatures that explicitly require `Array{T,N}`
- Avoiding runtime dispatch in polymorphic code
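The view-based return types can be reproduced with plain Base operations. This sketch uses an ordinary `Vector` as a stand-in for pooled backing memory and shows that `view` and `reshape` produce the lightweight wrapper types from the table, all aliasing the same buffer:

```julia
# Sketch: view/reshape wrappers alias pooled memory instead of copying it.
backing = Vector{Float64}(undef, 64)   # stand-in for pooled backing memory

v = view(backing, 1:12)                # 1D: a SubArray wrapper
m = reshape(v, 3, 4)                   # N-D: a ReshapedArray wrapper

@assert v isa SubArray{Float64,1}
@assert m isa Base.ReshapedArray{Float64,2}

m[2, 2] = 42.0                         # write through the N-D wrapper...
@assert backing[5] == 42.0             # ...lands in the backing vector
```

Because these wrappers are small immutable structs, the compiler can often remove them entirely, which is what makes the view API allocation-free in steady state.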
## Typed Checkpoint/Rewind Optimization

When the `@with_pool` macro can statically determine which types are used, it generates optimized code:
```julia
# If only Float64 is used in the block:
checkpoint!(pool, Float64)  # ~77% faster than full checkpoint
# ... compute ...
rewind!(pool, Float64)
```

This avoids iterating over all type slots and the `others` IdDict.
## 1-Based Sentinel Pattern

Internal state vectors are seeded with a sentinel entry representing depth 0, which eliminates `isempty()` checks:

```julia
_checkpoint_n_active = [0]  # Sentinel at depth=0
_checkpoint_depths   = [0]  # Global scope marker
```

This pattern reduces branching in hot paths where every nanosecond counts.
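A small sketch of the payoff: because the sentinel entry is always present, a hot-path read like `last(stack)` never needs an emptiness branch (`current_depth` is an illustration name, not the package's API):

```julia
# Sentinel sketch: the stack always holds at least one entry.
stack = [0]                          # depth-0 sentinel, never popped

current_depth(s::Vector{Int}) = last(s)   # no isempty() check required

@assert current_depth(stack) == 0
push!(stack, 2)                      # enter a checkpoint scope
@assert current_depth(stack) == 2
pop!(stack)                          # leave the scope
@assert current_depth(stack) == 0    # sentinel still there
```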
## Further Reading

For detailed design documents:

- `hybrid_api_design.md` — Two-API strategy (`acquire!` vs `unsafe_acquire!`) and type stability analysis
- `nd_array_approach_comparison.md` — N-way cache design, boxing analysis, and ReshapedArray benchmarks
- `untracked_acquire_design.md` — Macro-based untracked acquire detection and 1-based sentinel pattern
- `fixed_slots_codegen_design.md` — Zero-allocation iteration via `@generated` functions
- `cuda_extension_design.md` — CUDA backend architecture and extension loading