Tiered GPU Memory Pool (RMM)¶
Context¶
vibeSpatial’s overlay and spatial-join pipelines allocate 80-120 device
arrays per operation at 1M geometry scale. CuPy’s built-in MemoryPool
uses power-of-2 binning, which wastes up to 50% of VRAM on large
coordinate arrays. More critically, there is no recovery path when an
allocation fails — the process crashes with OutOfMemoryError.
RAPIDS RMM (RAPIDS Memory Manager) provides composable memory resource
adaptors: coalescing pools, failure callbacks, managed memory, and
statistics tracking. It integrates with CuPy via
rmm.allocators.cupy.rmm_cupy_allocator.
Decision¶
Adopt a three-tier memory management architecture, with RMM as an optional dependency and CuPy’s pool as the fallback:
Tier |
Activation |
Allocator Stack |
|---|---|---|
A (default) |
RMM installed |
|
B (safe) |
|
|
C (oversubscription) |
|
Bare |
Fallback |
RMM not installed |
CuPy |
Design choices¶
Tier A uses
initial_pool_size=0to avoid starving other processes on shared GPUs. The pool grows on demand.Tier B’s OOM callback calls
gc.collect()and retries up to 3 times per allocation attempt, with a time-based reset (>1 s gap) so independent OOM events each get the full retry budget.Tier C uses bare
ManagedMemoryResourcewithout pool wrapping. Pooling managed memory adds overhead without benefit because CUDA already handles demand paging.PrefetchResourceAdaptorwas evaluated and rejected: vibeSpatial’s SoA coordinate layout means each segment access touches 4-8 separate array pages, making prefetch ineffective under oversubscription.Deferred initialization: RMM resources require a CUDA context, but
CudaDriverRuntime.__init__runs at module import time before any CUDA call. RMM setup is deferred to_ensure_context(). If it fails, the runtime falls back to the CuPy pool with a warning.CCCL one-shot
cudaMallocAsyncbypasses RMM: CCCL primitives withoutmake_*precompilation allocate temp storage via the CUDA driver’s internal async pool. This is a known limitation affecting only cold-start paths.
Consequences¶
Positive: ~5-15% peak VRAM reduction from coalescing (Tier A); OOM resilience without overhead (Tier B); ability to process datasets exceeding VRAM (Tier C, with documented 2-10× slowdown).
Negative: New optional dependency (rmm).
memory_pool_stats()returns different key sets per backend.free_pool_memory()becomes a no-op for RMM backends (the pool retains its arena for reuse).Risk: The SoA coordinate layout is worst-case for managed memory page faults. The face-walk kernel’s pointer-chasing through
next_edge_idscan degrade 50-100× under Tier C oversubscription. This is documented but not mitigated — Tier C is opt-in for users who accept the tradeoff.