GPU Kernel Caching¶
vibeSpatial JIT-compiles two families of GPU code at runtime:
- **NVRTC kernels** – custom CUDA C kernels compiled via NVRTC (spatial query, point-in-polygon, overlay, bounds, WKB encode, etc.)
- **CCCL algorithms** – CUB-based primitives compiled via CCCL's `make_*` API (scan, reduce, radix sort, merge sort, binary search, unique-by-key, segmented reduce, select, segmented sort)
Both families use on-disk CUBIN caches so the JIT cost is paid once per install, not once per Python process.
Two-tier caching architecture¶
```
Process start
  |
  |-- CCCL precompiler (8 threads)
  |     For each spec:
  |       1. Check CCCL CUBIN disk cache (~/.cache/vibespatial/cccl/)
  |       2. Hit:  cuLibraryLoadData (1-40 ms) --> ready
  |       3. Miss: CCCL make_* build (1,300-9,000 ms) --> extract CUBIN --> save to disk
  |
  |-- NVRTC precompiler (16 threads)
  |     For each unit:
  |       1. Check NVRTC CUBIN disk cache (~/.cache/vibespatial/nvrtc/)
  |       2. Hit:  cuModuleLoadData (50-200 ms) --> ready
  |       3. Miss: nvrtcCompileProgram (80-150 ms) --> save to disk
  |
  V
Pipeline execution (all kernels warm)
```
Timing summary¶
| Scenario | CCCL (21 specs) | NVRTC (61 units) | Total wall |
|---|---|---|---|
| Cold (no cache) | ~3,400 ms | ~200-400 ms | ~3.5 s |
| Warm (disk hit, lazy) | ~0.1 ms request | ~0.1 ms request | ~0.2 ms at import |
| In-memory hit | 0.04 ms | <0.01 ms | instant |
With lazy warmup, disk-cached specs are deferred at request_warmup() time
and loaded on first get_compiled() call (~2 ms/spec). No thread pool is
created when all specs are cached. See “Lazy warmup” below.
CCCL CUBIN disk cache¶
This is the novel component. CCCL has no built-in on-disk cache – each Python process re-runs NVRTC + nvJitLink on first use of each algorithm spec. The CCCL CUBIN cache eliminates this by intercepting the build result and replaying it via ctypes on subsequent starts.
How it works¶
First run (cache miss):

1. `CCCLPrecompiler._compile_one()` calls `algorithms.make_exclusive_scan()`, which triggers CCCL's C build function (NVRTC compile + nvJitLink, ~1,300 ms).
2. After the build, `extract_cache_entry()` reads the C `build_result` struct directly from the Cython object's memory via ctypes.
3. It extracts: the compiled CUBIN bytes (`_get_cubin()`), kernel entry-point names (parsed from the CUBIN's ELF symbol table), runtime policy bytes (via `malloc_usable_size`), and all scalar metadata fields.
4. The CUBIN is normalized (nvJitLink session hash zeroed) for content-addressable keying, then the entry is atomically written to disk.

Subsequent runs (cache hit):

1. `_compile_one()` checks the disk cache before calling CCCL.
2. On hit, `reconstruct_build()` loads the cached CUBIN via `cuLibraryLoadData` (~1-2 ms), gets kernel handles via `cuLibraryGetKernel`, restores the runtime policy, and populates a ctypes replica of the C build result struct.
3. A `_CachedScan` (or `_CachedReduce`, etc.) wrapper is constructed with the same `__call__` protocol as CCCL's `_Scan`/`_Reduce`.
4. On each compute call, iterator/op/value arguments are serialized via CCCL's `as_bytes()` method into opaque ctypes buffers, and the C compute function (`cccl_device_exclusive_scan()`, etc.) is called directly through `libcccl.c.parallel.so` via ctypes.
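The extraction step can be sketched with a toy struct. `MiniBuildResult` and `extract_cubin` here are hypothetical stand-ins; the real ctypes mirrors follow the CCCL C headers field-for-field, and the real address comes from inside the Cython object rather than a Python-allocated struct.

```python
import ctypes

# Hypothetical miniature of a build-result struct (the real mirrors track
# cccl/c/scan.h etc. exactly).
class MiniBuildResult(ctypes.Structure):
    _fields_ = [
        ("cc", ctypes.c_int),             # compute capability
        ("cubin", ctypes.c_void_p),       # pointer to the compiled CUBIN
        ("cubin_size", ctypes.c_size_t),  # number of CUBIN bytes
    ]

def extract_cubin(struct_addr: int) -> bytes:
    # Reinterpret raw memory at struct_addr as the struct (no copy), then
    # snapshot the CUBIN bytes it points to -- the same ctypes trick the
    # cache uses against the build_result living inside the Cython object.
    result = MiniBuildResult.from_address(struct_addr)
    return ctypes.string_at(result.cubin, result.cubin_size)

# Demo: fake a "build result" in plain Python memory and extract from it.
cubin_buf = ctypes.create_string_buffer(b"\x7fELF-demo", 9)
fake = MiniBuildResult(cc=89, cubin=ctypes.addressof(cubin_buf), cubin_size=9)
print(extract_cubin(ctypes.addressof(fake)))
```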
Fallback: any exception during cache load silently falls back to the standard CCCL build, so a bad cache entry costs nothing beyond the normal cold-start build.
Why ctypes replay?¶
The CCCL C API functions (cccl_device_exclusive_scan(), etc.) take the
build result struct by value and handle all dispatch logic internally
(grid/block sizing, argument marshaling, CUB dispatch). We do not
replicate CUB internals – we call the same C function CCCL uses, just
with a struct we populated from cache instead of from NVRTC.
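The same ctypes machinery handles structs passed or returned by value: declare a `Structure` mirroring the C layout and set it as `argtypes`/`restype`. This is not the CCCL API itself, just the mechanism; as a portable illustration, libc's `div()` returns a `div_t` struct by value (assumes a POSIX libc is loadable):

```python
import ctypes
import ctypes.util

# Mirror of C's div_t from <stdlib.h>.
class div_t(ctypes.Structure):
    _fields_ = [("quot", ctypes.c_int), ("rem", ctypes.c_int)]

# On glibc systems find_library("c") resolves libc; CDLL(None) also exposes
# libc symbols through the main program as a fallback.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.div.argtypes = [ctypes.c_int, ctypes.c_int]
libc.div.restype = div_t  # struct handled by value across the FFI boundary

r = libc.div(7, 2)
print(r.quot, r.rem)  # 3 1
```

The cached replay does the by-value dance in the other direction: it populates the mirrored struct from disk and hands it to the C compute function as an argument.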
The build result structs (defined in cccl/c/scan.h, cccl/c/reduce.h,
etc.) contain: compute capability, CUBIN pointer, CUlibrary handle,
CUkernel handles, runtime policy pointer, and algorithm-specific metadata
(accumulator type, tile sizes, sort order, etc.).
CUBIN normalization¶
nvJitLink embeds _INTERNAL_..._XXXXXXXX_ symbols with a unique 8-char
hex session hash per build. This hash differs between builds even when the
source is identical. We zero all occurrences (always exactly one unique
hash, at ~72 positions) to produce a content-addressable CUBIN. The
normalized CUBIN loads correctly – the hash is in the ELF string table,
not in code sections.
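The normalization can be sketched as a byte-level substitution; the symbol shape matched here is a simplification of the real scanner, which works over the ELF string table:

```python
import re

def normalize_cubin(cubin: bytes) -> bytes:
    # Locate the 8-hex-char nvJitLink session hash inside an
    # _INTERNAL_..._XXXXXXXX_ symbol name (illustrative pattern).
    m = re.search(rb"_INTERNAL_\w*?_([0-9a-f]{8})_", cubin)
    if m is None:
        return cubin  # nothing to normalize
    # Zero every occurrence so two builds of identical source hash equally.
    return cubin.replace(m.group(1), b"00000000")

# Two "builds" of the same source, differing only in session hash:
a = b"ELF _INTERNAL_mod_deadbeef_fn code _INTERNAL_mod_deadbeef_fn2"
b = b"ELF _INTERNAL_mod_12345678_fn code _INTERNAL_mod_12345678_fn2"
print(normalize_cubin(a) == normalize_cubin(b))  # True
```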
Cache key format¶
```
v1-sm{CC}-cccl{VERSION}-{spec_name}-{normalized_cubin_sha256_12}.cache
```
Example: `v1-sm89-cccl0.5.1-exclusive_scan_i32-dd7dbbd47276.cache`
Components: format version, compute capability, CCCL package version, spec name, and truncated SHA-256 of the normalized CUBIN. A CCCL version change automatically invalidates the entire cache.
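Key construction is a one-liner; `cccl_cache_key` is an illustrative helper name, not the cache module's actual function:

```python
import hashlib

def cccl_cache_key(cc: int, cccl_version: str, spec_name: str,
                   normalized_cubin: bytes) -> str:
    # Key layout documented above: format version tag, SM arch, CCCL
    # version, spec name, first 12 hex chars of SHA-256 over the
    # normalized CUBIN.
    digest = hashlib.sha256(normalized_cubin).hexdigest()[:12]
    return f"v1-sm{cc}-cccl{cccl_version}-{spec_name}-{digest}.cache"

key = cccl_cache_key(89, "0.5.1", "exclusive_scan_i32", b"not-a-real-cubin")
print(key)
```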
Cache file format¶
Each .cache file uses a safe binary format with no executable
deserialization (no pickle):
| Offset | Size | Content |
|---|---|---|
| 0 | 8 | Magic: `"CCCLCCH\0"` |
| 8 | 4 | `header_len` (little-endian uint32) |
| 12 | header_len | JSON header (UTF-8) |
| 12+N | cubin_size | Raw CUBIN bytes |
| ... | policy_size | Raw `runtime_policy` bytes |
The JSON header contains: spec_name, family, kernel_names (dict
mapping struct field names to ELF entry-point names), metadata (all
scalar fields: cc, tile sizes, accumulator type, etc.), cubin_size,
and policy_size. The two large binary blobs (CUBIN and policy) are
appended as raw bytes after the header, referenced by size fields in
the JSON. This avoids any code execution surface while keeping the
format self-describing and trivially auditable.
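A minimal reader/writer for this layout, assuming the offsets and field names documented above (helper names are illustrative):

```python
import json
import struct

MAGIC = b"CCCLCCH\0"  # 8 bytes, as in the offset table

def pack_entry(header: dict, cubin: bytes, policy: bytes) -> bytes:
    # Magic, little-endian uint32 header length, JSON header, then the two
    # raw blobs whose sizes the header records.
    hdr = dict(header, cubin_size=len(cubin), policy_size=len(policy))
    hdr_bytes = json.dumps(hdr).encode("utf-8")
    return MAGIC + struct.pack("<I", len(hdr_bytes)) + hdr_bytes + cubin + policy

def unpack_entry(blob: bytes) -> tuple[dict, bytes, bytes]:
    if blob[:8] != MAGIC:
        raise ValueError("not a CCCL cache entry")
    (hdr_len,) = struct.unpack_from("<I", blob, 8)
    header = json.loads(blob[12:12 + hdr_len].decode("utf-8"))
    off = 12 + hdr_len
    cubin = blob[off:off + header["cubin_size"]]
    policy = blob[off + header["cubin_size"]:
                  off + header["cubin_size"] + header["policy_size"]]
    return header, cubin, policy

blob = pack_entry({"spec_name": "exclusive_scan_i32", "family": "scan",
                   "kernel_names": {}, "metadata": {"cc": 89}},
                  cubin=b"\x7fELF...", policy=b"\x00" * 16)
hdr, cubin, policy = unpack_entry(blob)
```

Note that `json.loads` on the header and raw slicing on the blobs are the only deserialization steps; nothing in the file is ever executed.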
Struct definitions¶
ctypes Structure subclasses mirror the CCCL C headers exactly. Sizes
are validated at extraction time by locating the cubin_size field in
the Cython object’s memory.
| Family | C struct | sizeof | Kernels |
|---|---|---|---|
| Scan | | 104 | 2 |
| Reduce | | 88 | 4 |
| SegmentedReduce | | 56 | 1 |
| RadixSort | | 168 | 9 |
| MergeSort | | 112 | 3 |
| UniqueByKey | | 72 | 2 |
| BinarySearch | | 40 | 1 |
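The validation idea can be shown with a toy struct: scan the struct's memory for a `size_t` equal to the CUBIN size already known from `_get_cubin()`, and check it appears at the offset the ctypes mirror predicts. `find_size_field_offset` and `Toy` are illustrative, not the real mirrors:

```python
import ctypes

def find_size_field_offset(base_addr: int, struct_size: int,
                           known_cubin_size: int):
    # Walk the struct's memory in size_t steps looking for the known CUBIN
    # size; the offset where it appears confirms (or refutes) the layout.
    step = ctypes.sizeof(ctypes.c_size_t)
    for off in range(0, struct_size - step + 1, step):
        if ctypes.c_size_t.from_address(base_addr + off).value == known_cubin_size:
            return off
    return None  # layout mismatch: extraction should refuse to proceed

# Toy struct standing in for a real build-result mirror.
class Toy(ctypes.Structure):
    _fields_ = [("cc", ctypes.c_int),
                ("cubin", ctypes.c_void_p),
                ("cubin_size", ctypes.c_size_t)]

t = Toy(cc=89, cubin=0, cubin_size=4096)
off = find_size_field_offset(ctypes.addressof(t), ctypes.sizeof(Toy), 4096)
```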
Families not cached¶
- **Select** – uses `DeviceThreeWayPartitionBuildResult` with a Numba-compiled predicate LTOIR embedded in the build. The predicate's state arrays make caching non-trivial.
- **SegmentedSort** – the build result struct embeds `cccl_op_t` sub-structs with LTOIR code pointers that require special serialization.
Both fall through to the standard CCCL build path with no performance regression.
NVRTC CUBIN disk cache¶
The NVRTC disk cache (cuda_runtime.py) is simpler – NVRTC produces a
standard CUBIN that can be loaded directly via cuModuleLoadData. This
cache predates the CCCL cache and uses the same patterns:
- **Cache key**: `v2-sm{CC}-nvrtc{VER}-{source_hash}[-opts-{hash}].cubin`
- **Atomic writes**: temp file + `os.replace()`
- **Corruption recovery**: if `cuModuleLoadData` fails on a cached CUBIN, the file is deleted and the kernel is recompiled
- **Location**: `~/.cache/vibespatial/nvrtc/`
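The corruption-recovery and atomic-write pattern above can be sketched like this, with `load_cubin` and `compile_kernel` as stand-ins for `cuModuleLoadData` and the NVRTC compile:

```python
import os

def load_or_recompile(path: str, load_cubin, compile_kernel):
    # Try the cached CUBIN; if the loader rejects it, treat the file as
    # corrupt, delete it, and fall through to a fresh compile.
    if os.path.exists(path):
        with open(path, "rb") as f:
            data = f.read()
        try:
            return load_cubin(data)
        except Exception:
            os.remove(path)  # corrupt cache entry: discard and rebuild
    cubin = compile_kernel()
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(cubin)
    os.replace(tmp, path)  # atomic publish: readers never see a partial file
    return load_cubin(cubin)
```

The temp-file plus `os.replace()` step is what makes concurrent processes safe: a reader either sees the old file, no file, or the complete new file.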
Lazy warmup for disk-cached specs¶
When request_warmup() or request_nvrtc_warmup() is called at module
scope, each spec/unit is probed against the disk cache before any work
is submitted:
- **Batch probe**: `_cached_spec_name_set()` (CCCL) or `_nvrtc_cached_key_set()` (NVRTC) scans the cache directory once and returns the set of cached names. The underlying helpers (`_compute_capability`, `_cccl_version`, `_get_cache_dir`) are all `@lru_cache`'d, so repeated probes are cheap.
- **Defer on hit**: specs with disk cache entries are added to a `_deferred_disk` set. No thread pool task is created for them.
- **Lazy load**: on the first `get_compiled()` call (CCCL) or `compile_kernels()` call (NVRTC), the deferred spec is loaded from disk synchronously (~2 ms), then cached in memory.
- **No thread pool if all cached**: the `ThreadPoolExecutor` is created lazily. If every spec is deferred, no threads are spawned.
ensure_warm() / ensure_pipelines_warm() handle deferred specs
correctly – they trigger lazy loads before waiting on futures.
The status() dict includes a "deferred" count alongside "compiled"
and "pending".
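The deferral bookkeeping reduces to a few sets; `LazyWarmup` below is a hypothetical condensation of the precompiler's behavior, not its actual class (and the real `status()` also reports `"pending"`):

```python
import threading

class LazyWarmup:
    def __init__(self, cached_on_disk):
        self._cached = set(cached_on_disk)  # batch probe result
        self._deferred = set()
        self._compiled = {}
        self._lock = threading.Lock()

    def request_warmup(self, specs):
        for spec in specs:
            if spec in self._cached:
                self._deferred.add(spec)    # disk hit: no pool task created
            else:
                self._submit_to_pool(spec)  # cold spec: background compile

    def get_compiled(self, spec, load_from_disk):
        with self._lock:
            if spec in self._deferred:
                # First use: synchronous disk load (~2 ms), then memoize.
                self._compiled[spec] = load_from_disk(spec)
                self._deferred.discard(spec)
            return self._compiled[spec]

    def _submit_to_pool(self, spec):
        # Stand-in for the real cold path, which lazily creates the
        # ThreadPoolExecutor and submits a build task.
        self._compiled[spec] = f"compiled:{spec}"

    def status(self):
        return {"compiled": len(self._compiled), "deferred": len(self._deferred)}

w = LazyWarmup(cached_on_disk={"exclusive_scan_i32"})
w.request_warmup(["exclusive_scan_i32"])  # deferred: no threads spawned
mod = w.get_compiled("exclusive_scan_i32", load_from_disk=lambda s: f"cubin:{s}")
```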
Environment variables¶
| Variable | Default | Effect |
|---|---|---|
| | enabled | Set |
| | enabled | Set |
| | | Override CCCL cache directory |
| | | Override NVRTC cache directory |
| | enabled | Set |
All variables respect XDG_CACHE_HOME when the _DIR override is not set.
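The resolution order implied above can be sketched as follows; the override variable name is passed in as a parameter because this page's table does not reproduce the exact names:

```python
import os
from pathlib import Path

def cache_dir(dir_override_var: str, subdir: str) -> Path:
    # 1. Explicit _DIR override wins.
    override = os.environ.get(dir_override_var)
    if override:
        return Path(override)
    # 2. XDG_CACHE_HOME, if set, replaces ~/.cache as the base.
    xdg = os.environ.get("XDG_CACHE_HOME")
    base = Path(xdg) if xdg else Path.home() / ".cache"
    return base / "vibespatial" / subdir

os.environ.pop("HYPOTHETICAL_CACHE_DIR", None)
os.environ["XDG_CACHE_HOME"] = "/tmp/xdg"
print(cache_dir("HYPOTHETICAL_CACHE_DIR", "cccl"))
```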
Pre-compilation API¶
```python
from vibespatial.cuda.cccl_precompile import precompile_all

# Compile everything and block until done (CI warm-up, post-install)
result = precompile_all(timeout=120.0)
# {'cccl': {'compiled': 12, 'submitted': 21, ...},
#  'nvrtc': {'compiled': 61, 'submitted': 61, ...},
#  'cccl_cold': [], 'nvrtc_cold': []}
```
For demand-driven warmup (the default):
```python
from vibespatial.cuda.cccl_precompile import request_warmup, ensure_pipelines_warm

# Non-blocking: request specific specs (typically called at module scope)
request_warmup(["exclusive_scan_i32", "radix_sort_i32_i32"])

# Blocking: wait for all requested compilations before pipeline execution
cold = ensure_pipelines_warm(timeout=60.0)
```
Cache management¶
```python
from vibespatial.cuda.cccl_cubin_cache import clear_cache, cache_stats
from vibespatial.cuda._runtime import clear_nvrtc_cache, nvrtc_cache_stats

# Inspect
print(cache_stats())        # CCCL: file count, total bytes, directory
print(nvrtc_cache_stats())  # NVRTC: file count, total bytes, directory

# Clear (e.g. after CUDA driver upgrade)
clear_cache()        # CCCL
clear_nvrtc_cache()  # NVRTC
```
Key risks and mitigations¶
ctypes struct layout mismatch: if CCCL changes its C struct fields
between versions, the cached build result will have the wrong layout. This
is mitigated by including the CCCL version in the cache key (version change
= automatic cache miss) and by validating the cubin_size field offset at
extraction time.
Runtime policy changes: the runtime policy struct is an opaque
allocation whose size we read via malloc_usable_size. If the policy
format changes, the restored bytes may be invalid. The CCCL version key
handles this for inter-version changes; intra-version changes would
require a cache clear.
Pointer invalidation: kernel handles and library handles are
process-specific. On cache hit we call cuLibraryLoadData and
cuLibraryGetKernel to obtain fresh handles for the current process.
Source files¶
| File | Role |
|---|---|
| | CCCL CUBIN cache: ctypes structs, extraction, reconstruction, disk I/O, cached algorithm wrappers |
| | CCCL precompiler singleton, cache integration, |
| | NVRTC disk cache, CUDA driver runtime |
| | NVRTC precompiler singleton |
| | Cache unit tests (no GPU required) |
| | Precompiler unit tests |
| | NVRTC cache unit tests |