Tiered GPU Memory Pool (RMM)

Context

vibeSpatial’s overlay and spatial-join pipelines allocate 80-120 device arrays per operation at 1M geometry scale. CuPy’s built-in MemoryPool uses power-of-2 binning, which wastes up to 50% of VRAM on large coordinate arrays. More critically, there is no recovery path when an allocation fails — the process crashes with OutOfMemoryError.

RAPIDS RMM (RAPIDS Memory Manager) provides composable memory resource adaptors: coalescing pools, failure callbacks, managed memory, and statistics tracking. It integrates with CuPy via rmm.allocators.cupy.rmm_cupy_allocator.

Decision

Adopt a three-tier memory management architecture, with RMM as an optional dependency and CuPy’s pool as the fallback:

Tier

Activation

Allocator Stack

A (default)

RMM installed

PoolMemoryResourceCudaMemoryResource

B (safe)

VIBESPATIAL_GPU_OOM_SAFETY=1

FailureCallbackResourceAdaptor → Pool → Cuda

C (oversubscription)

VIBESPATIAL_GPU_MANAGED_MEMORY=1

Bare ManagedMemoryResource

Fallback

RMM not installed

CuPy MemoryPool

Design choices

  • Tier A uses initial_pool_size=0 to avoid starving other processes on shared GPUs. The pool grows on demand.

  • Tier B’s OOM callback calls gc.collect() and retries up to 3 times per allocation attempt, with a time-based reset (>1 s gap) so independent OOM events each get the full retry budget.

  • Tier C uses bare ManagedMemoryResource without pool wrapping. Pooling managed memory adds overhead without benefit because CUDA already handles demand paging. PrefetchResourceAdaptor was evaluated and rejected: vibeSpatial’s SoA coordinate layout means each segment access touches 4-8 separate array pages, making prefetch ineffective under oversubscription.

  • Deferred initialization: RMM resources require a CUDA context, but CudaDriverRuntime.__init__ runs at module import time before any CUDA call. RMM setup is deferred to _ensure_context(). If it fails, the runtime falls back to the CuPy pool with a warning.

  • CCCL one-shot cudaMallocAsync bypasses RMM: CCCL primitives without make_* precompilation allocate temp storage via the CUDA driver’s internal async pool. This is a known limitation affecting only cold-start paths.

Consequences

  • Positive: ~5-15% peak VRAM reduction from coalescing (Tier A); OOM resilience without overhead (Tier B); ability to process datasets exceeding VRAM (Tier C, with documented 2-10× slowdown).

  • Negative: New optional dependency (rmm). memory_pool_stats() returns different key sets per backend. free_pool_memory() becomes a no-op for RMM backends (the pool retains its arena for reuse).

  • Risk: The SoA coordinate layout is worst-case for managed memory page faults. The face-walk kernel’s pointer-chasing through next_edge_ids can degrade 50-100× under Tier C oversubscription. This is documented but not mitigated — Tier C is opt-in for users who accept the tradeoff.