ADR-0034: CCCL make_* Pre-Compilation and Warmup Strategy¶
Context¶
CCCL Python’s cuda.compute algorithms are JIT-compiled on first
invocation. The cold-call penalty is ~950–1460 ms per unique
(algorithm, dtype, op, compute-capability) combination (measured in
ADR-0033 benchmarks). After that first call, CCCL caches the compiled
GPU code in process memory and subsequent calls are fast.
The make_* object-based API goes further: it returns a reusable
callable with explicit temp-storage management, eliminating both the
JIT-compilation check and the per-call temp-storage query/allocation
overhead. Our benchmarks show make_* variants are 1.4–3.7x faster
than CuPy at all scales for scan and compaction.
Current state: All 9 CCCL algorithm call-sites in
cccl_primitives.py use the one-shot API. The make_* pattern is
only exercised in the benchmark script. NVRTC kernels are cached
in-memory by CudaDriverRuntime._module_cache but compilation
happens lazily on first kernel launch.
Problem: The first spatial operation a user runs pays the full
cold-JIT cost of every CCCL primitive in its execution path. For a
multi-stage pipeline (e.g., sjoin touching scan + select + sort +
segmented reduce), this can add 4–6 seconds of latency to the first
call — entirely host-side compilation, not useful GPU work.
Opportunity: vibeSpatial knows at import time the full inventory of CCCL primitives it uses. At GeoDataFrame construction time, it additionally knows the geometry families present, which narrows the relevant kernel set. At pipeline construction time, it knows the exact operation sequence. We can exploit all three levels of knowledge to pre-compile kernels on background CPU threads well before the user’s first operation lands.
Thread-Safety and Parallelism Analysis¶
A key question: does CCCL’s JIT release the GIL, and can we throw many CPU threads at compilation in parallel? Yes to both.
GIL release during compilation¶
The expensive work runs outside the GIL at every layer:
Layer |
Call |
GIL released? |
Evidence |
|---|---|---|---|
CCCL Cython bindings |
|
Yes |
|
cuda-python NVRTC |
|
Yes |
|
cuda-python driver |
|
Yes |
|
Numba op compilation |
|
No |
Holds GIL during LLVM pipeline (~50–100ms per unique op, |
The cccl_device_*_build() step is the dominant cost (~1–1.5s),
and it runs fully parallel across threads.
NVRTC is thread-safe¶
NVRTC operates on independent nvrtcProgram handles with no shared
mutable state between programs. Different threads can concurrently
compile different programs. NVRTC does not require a CUDA context
for compilation — context is only needed for cuModuleLoadData to
load the resulting CUBIN (compiled to native SASS via -arch=sm_XX).
Serialization bottlenecks¶
Bottleneck |
Severity |
Detail |
|---|---|---|
CCCL cache (plain dict, no lock) |
Low |
Check-then-set is not atomic, but different cache keys = zero contention. Same key = redundant compilation (not corruption). |
Numba |
Low |
GIL protects dict ops; first compile of each unique op holds GIL ~50–100ms. Only ~6 unique ops across 18 specs. |
|
Negligible |
|
CUDA driver |
Low |
Supports concurrent loads on the same context from multiple threads. |
Our |
Medium |
No lock. Two threads compiling the same kernel race. Must add a lock before Level 2 NVRTC warmup. |
Thread scaling¶
Each NVRTC compilation inside cccl_device_*_build() is
single-threaded C++ compilation (does not parallelize internally).
With N cores, N concurrent compilations achieve ~N× throughput
until CPU-bound:
Threads |
Wall time (18 specs) |
Speedup vs. serial |
Notes |
|---|---|---|---|
1 |
~18–22s |
1× |
Current lazy behavior |
4 |
~4–5s |
~4× |
Conservative, safe on 4-core laptops |
8 |
~2–3s |
~7× |
Sweet spot for most machines |
16 |
~1.5s |
~12× |
Works; slight diminishing returns from Numba GIL phase |
32 |
~1.2s |
~15× |
Past diminishing returns; memory waste (compiler temps ~100–200 MB per concurrent compile) |
The Numba GIL bottleneck is small: ~6 unique operators × ~100ms =
~600ms serialized, after which every thread hits the Numba cache and
skips straight to the nogil CCCL build.
Recommended max_workers¶
Level 1 (CCCL warmup):
max_workers=8. Only 2–12 specs submitted at a time (demand-driven). 8 threads finish any single module’s specs in <1s wall.Level 2 (NVRTC warmup):
max_workers=min(os.cpu_count(), 16). Larger kernel set, benefits from more cores, but needs the_module_cachelock fix first.
Design¶
Demand-Driven Warmup Architecture¶
Core principle: Don’t pre-compile everything at import time.
Pre-compile only the CCCL primitives that the user’s code path
actually needs, triggered by module-level dependency declarations.
The dependency graph from consumer modules to CCCL primitives is static and known at code-writing time:
sjoin / spatial_query.py
├── compact_indices → select
├── exclusive_sum → exclusive_scan
├── sort_pairs → radix_sort / merge_sort
├── lower_bound → lower_bound
├── upper_bound → upper_bound
└── segmented_reduce_* → segmented_reduce
overlay / overlay_gpu.py
├── exclusive_sum → exclusive_scan
├── sort_pairs → radix_sort / merge_sort
└── unique_sorted_pairs → unique_by_key
dissolve / dissolve_pipeline.py
└── sort_pairs → radix_sort / merge_sort
spatial index / indexing.py
└── sort_pairs → radix_sort / merge_sort
GeoArrow IO / io_arrow.py
└── exclusive_sum → exclusive_scan
point_in_polygon / predicates
└── compact_indices → select
segment_primitives.py, point_constructive.py
└── compact_indices → select
A user who does gs.within(other) only needs select (2 specs:
int32 and int64). Pre-compiling segmented_reduce, unique_by_key,
and sort for them wastes ~14s of background CPU.
Level 1: Demand-driven module warmup
- Triggered: when a consumer module first imports from cccl_primitives
- Scope: only the CCCL specs that module declares as dependencies
- Parallelism: ThreadPoolExecutor across the module's spec set
- Blocking: NO --- background threads; first real call blocks
only if its own spec isn't compiled yet
Level 2: GeoDataFrame-aware NVRTC warmup (deferred)
- Triggered: when GeoDataFrame is created with device-resident data
- Scope: NVRTC kernels for the geometry families present
Level 3: Pipeline-aware pre-compilation (deferred)
- Triggered: when a pipeline plan is constructed (sjoin, overlay)
- Scope: ensures all deps are warm before first stage executes
Level 1: Demand-Driven Module Warmup¶
Each consumer module declares the CCCL primitives it needs via a module-level constant. When that module is first imported, the precompiler picks up only those specs and submits them to background threads.
Primitive Spec Registry¶
@dataclass(frozen=True, slots=True)
class CCCLWarmupSpec:
"""Specification for a CCCL make_* pre-compilation target."""
name: str # e.g. "exclusive_scan_i32"
make_fn: str # e.g. "make_exclusive_scan"
input_dtype: np.dtype # e.g. np.int32
output_dtype: np.dtype # e.g. np.int32
op: Callable | OpKind # the binary/unary operator
h_init: np.ndarray | None # host initial value
extra_arrays: int # additional typed arrays needed
representative_n: int # dummy array size (e.g. 128)
The full inventory (derived from current cccl_primitives.py):
Spec Name |
make_* function |
Input dtype |
Op |
Used by |
|---|---|---|---|---|
|
|
int32 |
sum |
|
|
|
int64 |
sum |
|
|
|
int32 |
predicate |
|
|
|
int64 |
predicate |
|
|
|
float64 |
sum |
|
|
|
int32 |
sum |
|
|
|
float64 |
sum |
|
|
|
float64 |
min |
|
|
|
float64 |
max |
|
|
|
int32 |
None |
|
|
|
uint64 |
None |
|
|
|
int32 |
None |
|
|
|
uint64 |
None |
|
|
|
int32/int32 |
ascending |
|
|
|
uint64/int32 |
ascending |
|
|
|
uint64/int32 |
less |
|
|
|
int32/int32 |
eq |
|
|
|
uint64/int32 |
eq |
|
18 specs total, but no single operation uses all 18.
Module Dependency Declarations¶
Each consumer module declares its CCCL needs at the top of the file. This is a static declaration, not a runtime probe:
# spatial_query.py
from vibespatial.cuda.cccl_primitives import (
compact_indices,
exclusive_sum,
lower_bound,
segmented_reduce_min,
sort_pairs,
upper_bound,
)
from vibespatial.cuda.cccl_precompile import request_warmup
# Declare the 12 specs this module actually uses.
# request_warmup is a no-op if GPU is unavailable or specs
# are already submitted. It never blocks.
request_warmup([
"exclusive_scan_i32", "exclusive_scan_i64",
"select_i32", "select_i64",
"radix_sort_i32_i32", "radix_sort_u64_i32",
"merge_sort_u64_i32",
"lower_bound_i32", "lower_bound_u64",
"upper_bound_i32", "upper_bound_u64",
"segmented_reduce_min_f64",
])
# point_in_polygon.py --- only needs select
from vibespatial.cuda.cccl_precompile import request_warmup
request_warmup(["select_i32", "select_i64"])
# io_arrow.py --- only needs exclusive_scan
from vibespatial.cuda.cccl_precompile import request_warmup
request_warmup(["exclusive_scan_i32", "exclusive_scan_i64"])
Because many vibespatial consumer modules are lazily imported from
the vendored GeoPandas layer (e.g., spatial_query and indexing
are imported inside method bodies in array.py), the warmup
naturally fires only when the user first touches code that needs
those primitives — not at bare import geopandas.
What About Eagerly-Imported Modules?¶
Some consumer modules are imported eagerly at import geopandas
time (e.g., dissolve_pipeline via geodataframe.py). For those,
their request_warmup(["radix_sort_i32_i32", ...]) fires at import
time. This is fine — it’s only the 3 sort specs that dissolve
needs, not all 18. The cost is ~1s background CPU, not ~3s.
Deduplication¶
request_warmup is idempotent. If spatial_query.py requests
select_i32 and point_in_polygon.py also requests select_i32,
the spec is compiled once. The precompiler tracks submitted spec
names in a set and skips duplicates.
Cost by Operation¶
First operation used |
Specs warmed |
Background CPU cost |
|---|---|---|
|
2 (select) |
~1s on 2 threads |
|
2 (exclusive_scan) |
~1s on 2 threads |
|
12 (scan+select+sort+bound+seg_reduce) |
~2s on 8 threads |
|
6 (scan+sort+unique) |
~1.5s on 6 threads |
|
3 (sort) |
~1s on 3 threads |
Compare to the blast-everything-at-import approach: ~3s background CPU for every user, even those who never touch sjoin.
Compilation Flow¶
class CCCLPrecompiler:
"""Demand-driven background pre-compilation of CCCL make_* callables.
Unlike a blast-all-at-import precompiler, this accumulates specs
incrementally via request_warmup() calls from consumer modules.
Each consumer declares only the specs it needs.
"""
_instance: CCCLPrecompiler | None = None
def __init__(self, max_workers: int = 8):
self._spec_registry: dict[str, CCCLWarmupSpec] = {}
self._cache: dict[str, PrecompiledPrimitive] = {}
self._futures: dict[str, Future] = {}
self._submitted: set[str] = set() # dedup guard
self._lock = threading.Lock()
self._executor = ThreadPoolExecutor(
max_workers=max_workers,
thread_name_prefix="cccl-warmup",
)
self._start_time: float | None = None
self._diagnostics: list[WarmupDiagnostic] = []
@classmethod
def get(cls) -> CCCLPrecompiler:
"""Lazy singleton. Created on first request_warmup() call."""
if cls._instance is None:
cls._instance = cls()
return cls._instance
def request(self, spec_names: list[str]) -> None:
"""Submit specs for background compilation. Idempotent.
Called by consumer modules at import time. Never blocks.
Skips specs that are already submitted or compiled.
"""
with self._lock:
new_specs = [
n for n in spec_names
if n not in self._submitted
]
if not new_specs:
return
if self._start_time is None:
self._start_time = perf_counter()
for name in new_specs:
self._submitted.add(name)
spec = SPEC_REGISTRY[name]
future = self._executor.submit(self._compile_one, spec)
self._futures[name] = future
def _compile_one(self, spec: CCCLWarmupSpec) -> PrecompiledPrimitive:
"""JIT-compile one make_* callable with dummy arrays."""
import cupy as cp
from cuda.compute import algorithms
t0 = perf_counter()
# Create small dummy arrays with the right dtypes
d_in = cp.empty(spec.representative_n, dtype=spec.input_dtype)
d_out = cp.empty(spec.representative_n, dtype=spec.output_dtype)
# Call the make_* function (triggers JIT compilation)
make_fn = getattr(algorithms, spec.make_fn)
# ... build args from spec ...
callable_obj = make_fn(*args)
# Query temp storage size for a representative workload
temp_bytes = callable_obj(None, *query_args)
d_temp = cp.empty(max(int(temp_bytes), 1), dtype=cp.uint8)
elapsed_ms = (perf_counter() - t0) * 1000.0
result = PrecompiledPrimitive(
name=spec.name,
make_callable=callable_obj,
temp_storage=d_temp,
temp_storage_bytes=int(temp_bytes),
warmup_ms=elapsed_ms,
)
self._cache[spec.name] = result
return result
def get_compiled(self, name: str, timeout: float = 5.0) -> PrecompiledPrimitive | None:
"""Get a pre-compiled primitive. Blocks if compilation in progress.
If the spec was never requested (consumer didn't declare it),
returns None immediately --- caller uses one-shot fallback.
"""
if name in self._cache:
return self._cache[name]
if name in self._futures:
try:
return self._futures[name].result(timeout=timeout)
except (TimeoutError, Exception):
return None # Fall back to one-shot
return None
def status(self) -> dict:
"""Diagnostic snapshot for observability."""
return {
"submitted": len(self._submitted),
"compiled": len(self._cache),
"pending": sum(1 for f in self._futures.values() if not f.done()),
"failed": sum(1 for f in self._futures.values()
if f.done() and f.exception()),
"wall_ms": (perf_counter() - self._start_time) * 1000
if self._start_time else 0,
"per_primitive": [
{"name": d.name, "ms": d.elapsed_ms, "ok": d.success}
for d in self._diagnostics
],
}
# Module-level convenience function used by consumer modules.
def request_warmup(spec_names: list[str]) -> None:
"""Non-blocking request to pre-compile CCCL specs.
Safe to call at module scope. No-op if GPU is not available.
"""
if not has_gpu_runtime():
return
CCCLPrecompiler.get().request(spec_names)
Integration with cccl_primitives.py¶
Each function in cccl_primitives.py gains a fast path that checks
the precompiler cache before falling back to one-shot:
def exclusive_sum(values, out=None, *, strategy=PrimitiveStrategy.AUTO):
"""Exclusive prefix sum."""
# ... existing strategy selection ...
if use_cccl:
dtype_key = f"exclusive_scan_{_dtype_suffix(values.dtype)}"
precompiled = CCCLPrecompiler.get().get_compiled(dtype_key, timeout=2.0)
if precompiled is not None:
# Ensure temp storage is large enough
temp = _ensure_temp(precompiled, int(values.size))
precompiled.make_callable(
temp, values, out, _sum_op, int(values.size), _h_init
)
return out
# Fallback: one-shot API (cold or cache miss)
algorithms.exclusive_scan(values, out, _sum_op, _h_init, int(values.size))
return out
Temp Storage Sizing¶
The make_* callable’s temp storage requirement depends on
num_items, not on the data content. For most CUB algorithms, temp
storage scales as O(n) or less. Strategy:
Pre-allocate temp storage at
representative_n = 128during warmup (just to trigger JIT; actual buffer is tiny).At first real call, re-query temp storage for actual
num_itemsand allocate. Cache this as the “high-water mark” buffer.On subsequent calls, reuse if
num_items <= high_water_n, else re-query and grow. Never shrink — spatial workloads are typically consistent in scale within a session.
def _ensure_temp(precompiled: PrecompiledPrimitive, num_items: int):
"""Grow temp storage if the current buffer is too small."""
if num_items <= precompiled.high_water_n:
return precompiled.temp_storage
# Re-query for the new size
needed = precompiled.make_callable(None, *query_args_for(num_items))
if needed > precompiled.temp_storage_bytes:
precompiled.temp_storage = cp.empty(needed, dtype=cp.uint8)
precompiled.temp_storage_bytes = needed
precompiled.high_water_n = num_items
return precompiled.temp_storage
Level 2: Demand-Driven NVRTC Warmup¶
The same demand-driven pattern applies to our 14 custom NVRTC compilation units. Unlike CCCL (where each spec is a unique algorithm × dtype × op combo), NVRTC kernels are simpler: each source string compiles exactly once per process — no dtype variance, no compile-option variance, deterministic cache keys.
NVRTC Kernel Inventory¶
# |
Source Constant |
Module |
CUDA Lines |
Kernels |
Est. Compile Cost |
|---|---|---|---|---|---|
1 |
|
|
948 |
19 |
~200–400ms |
2 |
|
|
750 |
10 |
~150–300ms |
3 |
|
|
493 |
10 |
~100–250ms |
4 |
|
|
434 |
8 |
~80–200ms |
5 |
|
|
399 |
4 |
~80–180ms |
6 |
|
|
345 |
5 |
~70–150ms |
7 |
|
|
329 |
6 |
~60–140ms |
8 |
|
|
253 |
4 |
~50–120ms |
9 |
|
|
247 |
6 |
~50–100ms |
10 |
|
|
213 |
6 |
~40–90ms |
11 |
|
|
133 |
3 |
~30–60ms |
12 |
|
|
117 |
1 |
~25–50ms |
13 |
|
|
83 |
3 |
~20–40ms |
14 |
|
|
41 |
1 |
~10–20ms |
Totals: ~4,785 lines of CUDA C, 85 __global__ entry points,
14 compilation units, ~1.0–2.5s serial compilation.
Why It’s Simpler Than CCCL Warmup¶
No type specialization. All kernels use FP64 exclusively. One source string → one cache key → one compilation. No combinatorial explosion of (algorithm, dtype, op) tuples.
No operator JIT. No Numba compilation of Python callables. The CUDA C source is a module-level Python string constant, fully determined at import time.
Deterministic cache keys.
make_kernel_cache_key(prefix, source)hashes the source string. Since sources are constants, cache keys are process-invariant.Same GIL release.
nvrtcCompileProgramandcuModuleLoadDataboth run withnogilin Cython. 14 compilations on 8 threads = ~200–400ms wall time vs. ~1–2.5s serial.
Trigger Pattern¶
Like CCCL Level 1, use the same demand-driven module-import trigger. Each module that defines NVRTC kernel sources declares its compile units at module scope:
# spatial_query.py
from vibespatial.cuda.nvrtc_precompile import request_nvrtc_warmup
request_nvrtc_warmup([
("spatial-query", _SPATIAL_QUERY_KERNEL_SOURCE, _SQ_KERNEL_NAMES),
("morton-range", _MORTON_RANGE_KERNEL_SOURCE, _MR_KERNEL_NAMES),
])
# point_in_polygon.py
from vibespatial.cuda.nvrtc_precompile import request_nvrtc_warmup
request_nvrtc_warmup([
("pip", _POINT_IN_POLYGON_KERNEL_SOURCE, _PIP_KERNEL_NAMES),
])
The singleton precompiler deduplicates by cache key and submits
new compilations to its ThreadPoolExecutor. The _xxx_kernels()
lazy-init functions gain a fast path checking the cache first:
def _spatial_query_kernels():
runtime = get_cuda_runtime()
cache_key = make_kernel_cache_key("spatial-query",
_SPATIAL_QUERY_KERNEL_SOURCE)
# Fast path: already warm from background thread
if cache_key in runtime._module_cache:
return runtime._module_cache[cache_key]
# Slow path: compile synchronously (first call before warmup finishes)
return runtime.compile_kernels(
cache_key=cache_key,
source=_SPATIAL_QUERY_KERNEL_SOURCE,
kernel_names=_SQ_KERNEL_NAMES,
)
NVRTCPrecompiler Sketch¶
class NVRTCPrecompiler:
"""Demand-driven background NVRTC compilation.
Follows the same singleton + request() + dedup pattern
as CCCLPrecompiler, but simpler: no type specs, just
(prefix, source, kernel_names) tuples.
"""
_instance: NVRTCPrecompiler | None = None
def __init__(self, max_workers: int | None = None):
self._submitted: set[str] = set() # cache keys
self._lock = threading.Lock()
self._executor = ThreadPoolExecutor(
max_workers=max_workers or min(os.cpu_count() or 4, 16),
thread_name_prefix="nvrtc-warmup",
)
@classmethod
def get(cls) -> NVRTCPrecompiler:
if cls._instance is None:
cls._instance = cls()
return cls._instance
def request(self, units: list[tuple[str, str, tuple[str, ...]]]) -> None:
"""Submit (prefix, source, kernel_names) for background compile."""
runtime = get_cuda_runtime()
with self._lock:
for prefix, source, kernel_names in units:
cache_key = make_kernel_cache_key(prefix, source)
if cache_key in self._submitted:
continue
self._submitted.add(cache_key)
self._executor.submit(
runtime.compile_kernels,
cache_key=cache_key,
source=source,
kernel_names=kernel_names,
)
Cost by Operation (NVRTC only)¶
First operation used |
Compilation units |
Background wall time |
|---|---|---|
|
1 unit (434 lines) |
~80–200ms |
|
1 unit (399 lines) |
~80–180ms |
|
2–3 units (spatial_query + bounds + pip) |
~200–400ms |
|
2 units (overlay + seg_intersection) |
~75–170ms |
All 14 units |
14 units (~4,785 lines) |
~200–400ms on 8+ threads |
Prerequisite: _module_cache Thread Safety¶
CudaDriverRuntime.compile_kernels() currently has no lock on
_module_cache. Before enabling NVRTC warmup, add a
threading.Lock to guard the check-then-compile-then-store sequence:
class CudaDriverRuntime:
def __init__(self, ...):
...
self._module_cache: dict[str, dict[str, Any]] = {}
self._module_cache_lock = threading.Lock()
def compile_kernels(self, *, cache_key, source, kernel_names, options=()):
# Fast path: no lock needed for reads (dict reads are GIL-atomic)
if cache_key in self._module_cache:
return self._module_cache[cache_key]
with self._module_cache_lock:
# Double-check after acquiring lock
if cache_key in self._module_cache:
return self._module_cache[cache_key]
# ... compile ...
self._module_cache[cache_key] = kernels
return kernels
Level 3: Pipeline-Aware Pre-Compilation¶
When a high-level operation like sjoin, overlay, or dissolve is
invoked, it constructs an execution plan. Before executing the first
stage, the plan can declare its CCCL and NVRTC requirements:
class PipelinePlan:
"""Execution plan for a multi-stage spatial operation."""
stages: list[PipelineStage]
cccl_requirements: list[str] # spec names needed
nvrtc_requirements: list[str] # cache keys needed
def ensure_warm(self, timeout: float = 3.0) -> list[str]:
"""Block until all required primitives are compiled.
Returns list of any specs that timed out (will use one-shot)."""
cold = []
for spec_name in self.cccl_requirements:
result = _precompiler.get(spec_name, timeout=timeout)
if result is None:
cold.append(spec_name)
return cold
This level has the highest precision but requires pipeline code to declare its requirements. It naturally emerges as pipelines are built.
Key Design Properties¶
1. Never blocks import¶
Level 1 warmup runs on daemon threads. import vibespatial returns
immediately. If a user operation arrives before warmup completes,
get() blocks for up to timeout seconds on just the needed
primitive, then falls back to one-shot. Worst case = same as today.
2. Never fails loudly¶
If CCCL or GPU is not available, the precompiler is never
instantiated. If individual make_* calls fail (unsupported API
version, etc.), the fallback is the existing one-shot path. All
failures are logged to diagnostics, never raised.
3. Observable¶
Warmup status is visible via vibespatial.precompile_status() and
dispatch events log whether a pre-compiled callable was used:
>>> import geopandas as gpd
>>> # At this point, dissolve_pipeline imported eagerly.
>>> # Only 3 specs (sort variants) are warming in background.
>>> from vibespatial.cuda.cccl_precompile import CCCLPrecompiler
>>> CCCLPrecompiler.get().status()
{"submitted": 3, "compiled": 3, "pending": 0, "wall_ms": 1012.4, ...}
>>> # User calls sjoin --- spatial_query.py is lazily imported.
>>> # Its request_warmup() fires, adding 12 specs (some deduplicated
>>> # against the 3 sort specs already compiled by dissolve).
>>> gpd.sjoin(a, b)
>>> CCCLPrecompiler.get().status()
{"submitted": 15, "compiled": 15, "pending": 0, "wall_ms": 2341.7, ...}
4. Thread-safe — and truly parallel¶
Every
cccl_device_*_build()call releases the GIL via Cythonwith nogil:. Threads run on separate CPU cores during the ~1–1.5s compilation phase — this is real parallelism, not just concurrency.CCCL’s internal JIT cache is thread-safe (keyed by compute capability, dtype, op bytecode). Different specs use different cache keys → zero contention.
make_*returns an independent callable; no shared mutable state.Pre-compiled objects are stored in a
dictprotected by our_lockand the GIL (safe for concurrent reads after write).CuPy allocations on background threads use CuPy’s per-thread default memory pool.
See “Thread-Safety and Parallelism Analysis” in Context for full bottleneck inventory and scaling data.
5. Compatible with CCCL’s own caching¶
CCCL already caches compiled GPU code internally. Our make_*
pre-compilation triggers that cache population as a side effect. Even
if the user later calls the one-shot API (bypassing our make_*
object), the one-shot call also hits CCCL’s warm cache. The
pre-compilation is strictly additive.
Cost Model¶
CCCL (Level 1):
What |
Cost |
When Paid |
|---|---|---|
Per-spec |
~1–1.5s CPU per spec |
Background, on first import of consumer module |
Dummy arrays per spec (128 elements) |
~2 KB device memory |
Freed after warmup |
Temp-storage buffers per spec |
~few hundred bytes |
Grown on first real call |
|
~1 MB RSS |
Lazy singleton, lives for process |
Worst case (user touches every operation): all 18 specs warm across multiple background bursts, total ~2–3s wall on 8 threads. Typical case (single operation family): 2–6 specs, ~1s wall.
NVRTC (Level 2):
What |
Cost |
When Paid |
|---|---|---|
Per-unit NVRTC compile |
~20–400ms CPU per unit |
Background, on first import of kernel module |
14 compilation units total |
~1.0–2.5s serial |
~200–400ms wall on 8+ threads |
|
~1 MB RSS |
Lazy singleton, lives for process |
NVRTC warmup is cheaper per unit and lacks the Numba GIL bottleneck.
Thread scaling is near-linear because nvrtcCompileProgram and
cuModuleLoadData both release the GIL.
Combined worst case: user imports every operation module and triggers all 18 CCCL specs + 14 NVRTC units. Background CPU time: ~22–30s total across two thread pools, but wall time is ~2–3s (CCCL-dominated). CPU-only users pay exactly zero.
What About On-Disk Caching?¶
CCCL does not currently expose an API to serialize compiled GPU code
to disk. NVRTC CUBIN output is already disk-cached by
CudaDriverRuntime (see _read_cached_cubin / _write_cached_cubin
in cuda_runtime.py), keyed by compute capability, NVRTC version,
source hash, and compile options. This eliminates NVRTC recompilation
across process restarts. Defer CCCL on-disk caching until CCCL ships
its own persistent cache or benchmarks show restart latency is
critical.
What About cuda.coop?¶
cuda.coop provides block- and warp-level primitives for use inside
Numba CUDA kernels. These have a different compilation model (they
produce .files that link into @cuda.jit kernels). Pre-compilation
would mean calling make_* eagerly at import time, which pre-compiles
the LTO-IR. This is compatible with the same background-thread
pattern but only becomes relevant if we adopt Numba CUDA for kernel
authoring. Currently all our kernels are NVRTC-based.
Action: Revisit when we evaluate Numba CUDA for any kernel family.
Decision¶
Implement Level 1 (demand-driven CCCL warmup) as the first deliverable. Each consumer module declares its specs via
request_warmup()at module scope; the singleton precompiler accumulates and deduplicates across modules.Implement Level 2 (demand-driven NVRTC warmup) alongside Level 1. Same pattern: each kernel module declares its
(prefix, source, kernel_names)tuples viarequest_nvrtc_warmup()at module scope. Simpler than Level 1 (no dtype/op variance — just 14 fixed compilation units).Add
_module_cache_locktoCudaDriverRuntimewith double-checked locking incompile_kernels(). Required prerequisite for Level 2 thread safety.Refactor
cccl_primitives.pyto usemake_*callables at runtime, with the precompiler cache as the primary path and one-shot as fallback.Add
CCCLWarmupSpecdeclarations in a centralSPEC_REGISTRYdict insidecccl_precompile.py, keyed by spec name.Add
request_warmup()/request_nvrtc_warmup()calls to each consumer module listing only their own deps.Expose
precompile_status()for observability across both CCCL and NVRTC precompilers.Defer Level 3 (pipeline-aware pre-compilation) until pipeline plan objects are mature enough to declare requirements.
Do not implement on-disk caching until CCCL provides a stable serialization API or benchmarks show restart latency is critical.
Consequences¶
Positive¶
GPU operations see zero JIT latency when their consumer module was imported early enough for background threads to finish.
Both CCCL and NVRTC compilation run outside the GIL, enabling true multi-core parallelism during background warmup.
Users who only touch one operation family (e.g., point-in-polygon) pay only for their CCCL specs + 1 NVRTC unit, not all 18 + 14.
make_*reusable callables with pre-allocated temp storage are 1.4–3.7x faster than CuPy for scan/compaction at every call, not just the first.NVRTC warmup eliminates the ~50–400ms first-launch penalty per kernel module, which compounds to ~1–2.5s in multi-stage pipelines.
Observable warmup status prevents silent cold-start regressions.
Strictly additive: no risk of breaking existing one-shot or lazy-compile paths.
CPU-only users pay exactly zero (both
request_warmupandrequest_nvrtc_warmupare no-ops without GPU).
Negative¶
Background threads consume CPU when consumer modules are imported (mitigated: only the specs/units that module needs, bounded pools).
Two thread pools (CCCL + NVRTC) may compete for CPU with user code in the first ~2–3s (mitigated: daemon threads at normal priority, total background CPU is bounded).
Small device memory allocation during CCCL warmup (mitigated: dummy arrays are 128 elements, freed after warmup).
Each consumer module needs a
request_warmup()/request_nvrtc_warmup()call at module scope (mitigated: simple one-liner, enforced by test).Adding
_module_cache_locktoCudaDriverRuntimeadds a small overhead to the cache-hit path (mitigated: double-checked locking means the lock is only acquired on cache miss).
Neutral¶
CCCL version upgrades may change
make_*signatures. Guarded byhasattrchecks, same as current benchmark code.Does not change the GPU-first / CPU-fallback contract.
Alternatives Considered¶
A: Lazy caching only (no background warmup)¶
Let CCCL’s internal cache handle warm-up. First call pays JIT, all subsequent calls are fast.
Rejected: This is the status quo. The first-operation latency
spike is a real UX problem, especially in notebook workflows where
users expect sub-second response after import.
B: Blast all 18 specs at import vibespatial¶
Pre-compile every CCCL primitive signature at import time regardless of which operations the user will actually call.
Rejected: Wastes ~3s of background CPU for users who only need 2 specs. Also forces CUDA context initialization for CPU-only users unless guarded. Demand-driven warmup is strictly better: same background-thread mechanism, but scoped to actual usage.
C: Explicit vibespatial.warmup() API only¶
Require the user to opt in to pre-compilation.
Rejected: Users who don’t know about warmup get the worst experience. The background-thread approach gives the benefit automatically without requiring user action.
D: Fork / multiprocessing for warmup isolation¶
Use a separate process for warmup to avoid GIL contention.
Rejected: CCCL JIT caches are per-process. Compiling in a child
process and transferring compiled GPU code back is not supported.
Thread-based warmup is sufficient because every cccl_device_*_build()
call releases the GIL via Cython with nogil: — threads achieve
true CPU parallelism, not just concurrency. Measured scaling is
near-linear up to 8 threads (see Thread-Safety Analysis in Context).
E: max_workers=32 to minimize wall time¶
Throw maximum CPU at warmup to finish compilation as fast as possible.
Rejected for Level 1: We only have 2–12 specs per demand burst. 8 threads already finish in <1s wall. Beyond 8 threads, diminishing returns from Numba GIL serialization (~600ms) and compiler memory overhead (~100–200 MB per concurrent NVRTC compile). Revisit for Level 2 if the NVRTC kernel set grows large.