GPU Performance Remediation Plan¶
Intent¶
Turn the GPU performance audit into an execution plan for the next major performance push. This document defines workstreams, sequencing, milestone checklists, measurement gates, and completion criteria for fixing the known CPU-shaped GPU behavior in vibeSpatial.
Request Signals¶
gpu remediation plan
performance push
fix gpu performance
execution plan
milestone plan
de-host gpu path
stream and transfer cleanup
cccl plan
overlay performance plan
Open First¶
docs/testing/gpu-performance-remediation-plan.md
docs/testing/gpu-performance-checklist.md
docs/architecture/runtime.md
docs/architecture/residency.md
docs/testing/performance-tiers.md
docs/testing/profiling-rails.md
Verify¶
uv run python scripts/check_docs.py --check
uv run python scripts/profile_kernels.py --kernel all --rows 10000 --repeat 1
uv run python scripts/health.py --gpu-coverage
uv run python scripts/benchmark_pipelines.py --suite full --repeat 1 --gpu-sparkline
Risks¶
Fixing low-level synchronization without changing host orchestration can produce cleaner code without materially improving throughput.
Reworking residency and metadata ownership can regress correctness if public host boundaries are not revalidated.
Stream-enabling CCCL wrappers without a caller-level contract can introduce races instead of speedups.
Overlay and microcell work can absorb the entire push unless earlier tracks reduce known structural bottlenecks first.
GPU coverage numbers can improve without end-to-end speed improving if the newly-GPU stages are still dominated by host setup and transfer costs.
Mission¶
The next performance push is not a general cleanup. It is a structural rewrite campaign focused on removing the code shapes that make GPU paths behave like CPU paths.
The push is successful only if it achieves all of the following:
device-resident paths remain device-resident longer
reusable GPU helpers stop forcing null-stream completion
profiling rails begin selecting GPU for important benchmark surfaces at meaningful scales
end-to-end pipeline profiles show less CPU-dominated orchestration
real-world public workflows improve through reusable physical-plan shapes, not one-off benchmark patches
GPU acceleration coverage improves materially from the April 7, 2026 baseline
Baseline Snapshot¶
This plan starts from the audit snapshot recorded on April 7, 2026:
profile_kernels.py --kernel all --rows 10000 --repeat 1 selected CPU for both join and overlay on the local RTX 4090
that run showed effectively 0% GPU utilization for those profiled surfaces
health.py --gpu-coverage reported:
total dispatches: 10134
GPU dispatches: 400
CPU dispatches: 9154
fallback dispatches: 361
GPU acceleration rate: 3.95%
These values are the baseline to beat, not a permanent target. Re-capture them at the start of the push and again at every milestone boundary.
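The acceleration rate above follows directly from the dispatch counts. A minimal sketch of that arithmetic, where the function name `gpu_acceleration_rate` is an illustrative assumption and not part of `health.py`:

```python
def gpu_acceleration_rate(gpu_dispatches: int, total_dispatches: int) -> float:
    """Return GPU dispatches as a percentage of all dispatches."""
    if total_dispatches <= 0:
        raise ValueError("total_dispatches must be positive")
    return 100.0 * gpu_dispatches / total_dispatches


# Counts taken from the April 7, 2026 audit snapshot.
baseline_rate = gpu_acceleration_rate(gpu_dispatches=400, total_dispatches=10134)
print(f"{baseline_rate:.2f}%")  # prints 3.95%
```

Recomputing the rate from raw counts at each milestone boundary keeps comparisons honest when the total dispatch volume changes between runs.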
Non-Goals¶
This push is not done when:
a few .get() calls have been removed
a handful of kernels have nicer launch parameters
profiler output looks cleaner but execution is still mostly CPU
GPU dispatch count rises only because tiny helper stages moved to device
docs claim GPU-first behavior without corresponding runtime evidence
This push is also not about:
changing user-facing APIs unless required by explicit fallback visibility
polishing cold-start latency before hot-path execution shape is fixed
broad architectural abstraction work not tied to measured bottlenecks
Working Principles¶
Apply these principles throughout the push:
fix structural blockers before localized micro-optimizations
prefer batching, scans, compaction, and segmented primitives over Python loops
prefer caller-controlled synchronization over helper-controlled synchronization
prefer device-native metadata ownership over convenience host mirrors
verify with profiler rails and pipeline benchmarks after each milestone
do not accept “planner would choose GPU” as evidence of real improvement
Program Structure¶
This push is divided into seven milestones, M0 through M6: a baseline milestone followed by six workstreams. They are intentionally ordered. Do not start with overlay micro-optimizations while residency, CCCL wrappers, and device-native decode are still structurally wrong.
| Milestone | Name | Primary Surfaces | Why First |
|---|---|---|---|
| M0 | Baseline And Guardrails | profiling, health, docs, benchmark rails | Prevents the push from drifting without evidence |
| M1 | Residency And Metadata Ownership | | Removes hidden D2H taxes from “device” paths |
| M2 | CCCL And Synchronization Contract | | Unlocks overlap and same-stream composition |
| M3 | Device-Native Decode And Compaction | WKB and related count-scatter paths | Removes host-driven nested decode loops |
| M4 | Predicate And Query Execution Shape | PIP, candidate assembly, work estimation | Fixes a core refine primitive and query path |
| M5 | Overlay And Constructive De-Hosting | grouped overlay, microcells, contraction, union-all | Fixes the highest-value structural CPU orchestration |
| M6 | Public Physical-Plan Coverage | shootouts, semijoins, anti-joins, mask clip, grouped reduce | Proves performance generalizes beyond focused workflows |
Milestone M0: Baseline And Guardrails¶
Goal¶
Establish current measurements, ensure the rails report actual execution device, and define hard acceptance gates for the rest of the push.
Primary Surfaces¶
scripts/profile_kernels.py
src/vibespatial/bench/profiling.py
src/vibespatial/bench/profile_rails.py
scripts/benchmark_pipelines.py
scripts/health.py --gpu-coverage
Checklist¶
[ ] Re-run the baseline profiler on the target machine.
[ ] Re-run GPU coverage and record the exact percentages.
[ ] Capture nvidia-smi -L, /dev/nvidia*, and CUDA_VISIBLE_DEVICES.
[ ] Confirm profiler rails report actual selected runtime, not only planned runtime.
[ ] Confirm pipeline benchmark stage names are sufficient to identify CPU orchestration bottlenecks.
[ ] Write down the baseline for:
GPU acceleration coverage
join profiler selected runtime
overlay profiler selected runtime
pipeline sparkline stage times for the 1M run
Exit Criteria¶
documented baseline is captured for the target machine
profiler rails and health rails are trusted as evidence sources
later milestones can compare against a stable before-state
Milestone M1: Residency And Metadata Ownership¶
Goal¶
Stop calling paths “device-resident” when they eagerly materialize host metadata during construction.
Primary Surfaces¶
src/vibespatial/io/pylibcudf.py
owned geometry builders and host-state helpers
residency diagnostics and transfer visibility surfaces
Known Problems To Fix¶
_build_device_single_family_owned eagerly copies validity, tags, family-row offsets, geometry offsets, empty masks, and optional part/ring offsets to host
_build_device_mixed_owned does the same for mixed-family cases
decode helpers build host mirrors before downstream GPU work requests them
Checklist¶
[ ] Inventory which host arrays are truly required at construction time.
[ ] Split mandatory public-boundary metadata from convenience mirrors.
[ ] Make host structural metadata lazy where possible.
[ ] Preserve explicit materialization events and diagnostics.
[ ] Re-run transfer audits for any path that starts from pylibcudf decode.
[ ] Re-check downstream callers that may have been relying on implicit host mirrors.
[ ] Update tests so device-resident outputs do not require host metadata unless explicitly materialized.
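One way to read "make host structural metadata lazy" is sketched below. This is a pure-Python stand-in under stated assumptions: the class `LazyHostMetadata`, its `fetchers` callables (each representing one device-to-host copy), and the `d2h:` event strings are all hypothetical and do not reflect the project's owned-geometry types.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class LazyHostMetadata:
    """Hold device-side metadata and copy a field to host only on request."""

    # Each fetcher stands in for one device-to-host copy (hypothetical).
    fetchers: Dict[str, Callable[[], Sequence[int]]]
    events: List[str] = field(default_factory=list)
    _host: Dict[str, List[int]] = field(default_factory=dict)

    def host(self, name: str) -> List[int]:
        """Return the host mirror for `name`, materializing it at most once."""
        if name not in self._host:
            self.events.append(f"d2h:{name}")  # explicit, auditable event
            self._host[name] = list(self.fetchers[name]())
        return self._host[name]


meta = LazyHostMetadata(
    fetchers={
        "validity": lambda: [1, 1, 0],
        "geometry_offsets": lambda: [0, 4, 9, 9],
    }
)
assert meta.events == []        # construction copies nothing
meta.host("geometry_offsets")
meta.host("geometry_offsets")   # second read hits the cached mirror
print(meta.events)              # prints ['d2h:geometry_offsets']
```

Recording an event on each first materialization keeps the transfer visible to diagnostics, which is what lets a test assert that a device-resident output never touched host metadata.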
Exit Criteria¶
device-backed builders no longer force broad D2H copies by default
downstream GPU consumers can continue from decode without hidden host setup
transfer counts for decode-to-GPU pipelines decrease measurably
Milestone M2: CCCL And Synchronization Contract¶
Goal¶
Move synchronization ownership to callers and stop null-stream completion from being baked into reusable primitive wrappers.
Primary Surfaces¶
src/vibespatial/cuda/cccl_primitives.py
src/vibespatial/cuda/_runtime.py
CCCL helper call sites in indexing, queries, overlay, and constructive code
Known Problems To Fix¶
primitive wrappers default to Stream.null.synchronize()
count-returning wrappers read scalar results on host immediately
wrappers do not expose a stream-aware contract even though lower layers can already accept streams
Checklist¶
[ ] Review every CCCL helper for hardcoded null-stream synchronization.
[ ] Add or normalize synchronize= behavior so callers can defer completion.
[ ] Decide which primitives need stream parameters immediately and which can stay same-stream but caller-synchronized first.
[ ] Replace legacy call sites that still expect helper-owned completion.
[ ] Verify count-scatter, sort, search, and segmented reduce stages compose without extra host syncs.
[ ] Keep correctness by adding tests around deferred completion paths.
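The caller-owned completion contract can be illustrated without a GPU. In the sketch below, `RecordingStream` is a hypothetical stand-in for a CUDA stream that runs queued work in order when synchronized, and `exclusive_scan` and `add_scalar` are illustrative wrappers, not the functions in `cccl_primitives.py`. Only the `synchronize=` keyword comes from the checklist above.

```python
from itertools import accumulate
from typing import Callable, List


class RecordingStream:
    """Stand-in for a stream: queues work in order, runs it on synchronize()."""

    def __init__(self) -> None:
        self.sync_count = 0
        self._queue: List[Callable[[], None]] = []

    def enqueue(self, work: Callable[[], None]) -> None:
        self._queue.append(work)

    def synchronize(self) -> None:
        self.sync_count += 1
        for work in self._queue:
            work()
        self._queue.clear()


def exclusive_scan(stream, src, out, *, synchronize: bool = False) -> None:
    """Queue an exclusive prefix sum; completion is the caller's decision."""
    def work() -> None:
        out[:] = list(accumulate(src, initial=0))[:-1]
    stream.enqueue(work)
    if synchronize:
        stream.synchronize()


def add_scalar(stream, src, out, value, *, synchronize: bool = False) -> None:
    """Queue an elementwise add that reads `src` only when the work runs."""
    def work() -> None:
        out[:] = [x + value for x in src]
    stream.enqueue(work)
    if synchronize:
        stream.synchronize()


stream = RecordingStream()
counts, offsets, shifted = [2, 0, 3], [], []
exclusive_scan(stream, counts, offsets)   # no barrier here
add_scalar(stream, offsets, shifted, 10)  # same-stream ordering is enough
stream.synchronize()                      # one barrier at the host boundary
print(offsets, shifted, stream.sync_count)  # prints [0, 2, 2] [10, 12, 12] 1
```

The second primitive consumes the first one's output without an intervening barrier because work on one stream executes in submission order. A deferred-completion test can assert on the barrier count in the same way.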
Exit Criteria¶
CCCL wrappers no longer force completion by default in hot reusable paths
same-stream pipelines can chain primitives without repeated barriers
stream-aware extension path is clear and minimally invasive
Milestone M3: Device-Native Decode And Compaction¶
Goal¶
Replace Python-controlled nested decode logic with device-native staged scans and count-scatter passes.
Primary Surfaces¶
src/vibespatial/io/pylibcudf.py
legacy count-scatter total sites in io/shp_gpu.py and io/fgb_gpu.py
any adjacent decode helpers still driven by host maxima
Known Problems To Fix¶
WKB polygon, multilinestring, and multipolygon decode walk nested structure with Python for range(max_...) loops
repeated cp.asnumpy(...max()) reads drive control flow
some count-scatter totals still use runtime.synchronize() plus multiple .get() calls instead of the async helper
Checklist¶
[ ] Replace host maxima loops with segmented device passes where feasible.
[ ] Batch byte-start discovery and nested offset generation on device.
[ ] Use count_scatter_total() or count_scatter_total_with_transfer() for legacy total sites.
[ ] Verify mixed, polygon, and multipolygon decode correctness against existing fixtures.
[ ] Re-measure ingest surfaces that were previously paying hidden host loops.
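The count-scatter shape referred to above has three steps: a count pass, an exclusive scan that turns counts into write offsets, and a scatter pass. A pure-Python sketch of that shape follows. The function `count_scan_scatter` is illustrative only; it does not call the project's `count_scatter_total()` helper, whose signature is not shown in this plan. On a device, each loop would be a kernel or a CCCL primitive instead of a Python loop.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def count_scan_scatter(rows: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Flatten ragged rows with a count pass, an exclusive scan, and a scatter pass."""
    # Pass 1 (count): one independent count per row.
    counts = [len(row) for row in rows]
    # Scan: exclusive prefix sum gives each row's write offset;
    # the final entry is the total output size.
    offsets = list(accumulate(counts, initial=0))
    # Pass 2 (scatter): each row writes into its own disjoint slice.
    flat = [0] * offsets[-1]
    for row_index, row in enumerate(rows):
        start = offsets[row_index]
        flat[start:start + len(row)] = row
    return offsets, flat


# Ragged input, for example per-polygon ring sizes (values are made up).
offsets, flat = count_scan_scatter([[4, 5], [], [6, 7, 8]])
print(offsets)  # prints [0, 2, 2, 5]
print(flat)     # prints [4, 5, 6, 7, 8]
```

No step depends on the maximum row length, so nothing forces a host read of a device maximum to drive control flow. The only scalar the host may need is the total, which is the last offset.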
Exit Criteria¶
decode control flow is no longer Python-driven for nested WKB structure
legacy sync-plus-.get() count-scatter sites are removed from major IO paths
decode throughput and transfer shape improve on realistic mixed geometry data
Milestone M4: Predicate And Query Execution Shape¶
Goal¶
Fix point-in-polygon and adjacent query paths so refine logic behaves like a GPU pipeline, not a host-managed dispatch loop.
Primary Surfaces¶
src/vibespatial/kernels/predicates/point_in_polygon.py
candidate assembly helpers in query code
any work estimation or binning code that pulls candidate rows to host
Known Problems To Fix¶
dense and compacted helpers end with unconditional runtime.synchronize()
binned mode copies candidate rows to host for work estimation
same-stream launch chains synchronize before the caller actually needs host data
Checklist¶
[ ] Remove helper-level syncs where same-stream ordering already guarantees correctness.
[ ] Move work estimation and bin selection onto device where possible.
[ ] Keep candidate rows device-side through coarse filter and refine.
[ ] Re-check dense, compacted, binned, and fused paths separately.
[ ] Re-run profiler rails and inspect selected runtime and stage times.
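Work estimation and bin selection can be phrased as a per-candidate lookup followed by a histogram, neither of which needs candidate rows on host. The sketch below shows that shape in plain Python. The bin edges, function names, and sample values are assumptions made for illustration, not the thresholds or helpers used by the point-in-polygon kernels.

```python
from bisect import bisect_right
from typing import List, Sequence

# Hypothetical bin edges in vertices per polygon; real thresholds need measurement.
BIN_EDGES = (64, 1024)


def assign_bins(candidate_polygons: Sequence[int], vertex_counts: Sequence[int]) -> List[int]:
    """Per-candidate gather plus threshold lookup; no cross-row dependency."""
    return [bisect_right(BIN_EDGES, vertex_counts[p]) for p in candidate_polygons]


def bin_sizes(bins: Sequence[int], n_bins: int) -> List[int]:
    """Histogram of bin ids; on device this could be a reduce-by-key or atomic add."""
    sizes = [0] * n_bins
    for b in bins:
        sizes[b] += 1
    return sizes


vertex_counts = [5, 200, 5000]         # vertices per polygon (made-up values)
candidate_polygons = [0, 0, 1, 2, 1]   # polygon index for each candidate row
bins = assign_bins(candidate_polygons, vertex_counts)
sizes = bin_sizes(bins, n_bins=len(BIN_EDGES) + 1)
print(bins, sizes)  # prints [0, 0, 1, 2, 1] [2, 2, 1]
```

Because each candidate's bin depends only on its own polygon, the assignment maps onto one thread per candidate row. Only the small per-bin size array would ever need to cross to host, and only if launch planning requires it.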
Exit Criteria¶
PIP helpers synchronize only at real host boundaries
work estimation no longer requires candidate rows on host
query and predicate rails show a more GPU-shaped refine stage
Milestone M5: Overlay And Constructive De-Hosting¶
Goal¶
Remove the highest-value remaining host orchestration from grouped overlay, microcells, contraction, and constructive reduction flows.
Primary Surfaces¶
src/vibespatial/overlay/gpu.py
src/vibespatial/overlay/microcells.py
src/vibespatial/overlay/contract.py
src/vibespatial/overlay/assemble.py
src/vibespatial/constructive/union_all.py
adjacent constructive helpers still forcing same-stream syncs
Known Problems To Fix¶
grouped overlay materializes group boundaries to host and iterates in Python
current stream pool is limited by null-stream NVRTC and wrapper behavior
microcell labeling loops row-by-row over host materialized row ids
contraction moves full band arrays to host and runs union-find in Python
some constructive helpers still synchronize after same-stream scatter
union-all still performs Python tree reduction over single-row objects
Checklist¶
[ ] Replace host grouping loops with device-side grouping or batched planning where correctness allows.
[ ] Revisit stream pool usefulness after M2 lands.
[ ] Move microcell labeling control flow off host.
[ ] Replace host union-find contraction with a device-friendly contraction plan or a clearly bounded fallback path.
[ ] Audit overlay assembly for remaining hardcoded launch geometry and avoidable host scalar control decisions.
[ ] Reassess union-all reduction shape once overlay batching improves.
[ ] Re-run the full pipeline benchmark and inspect the 1M sparkline.
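As one example of what a "device-friendly contraction plan" could look like, the sketch below replaces sequential union-find with min-label propagation over an edge list. This technique is swapped in purely for illustration and is not the project's contraction algorithm. Each round reads only the previous round's labels, so every edge can be processed independently, which corresponds to one thread per edge with an atomic minimum on a device.

```python
from typing import List, Sequence, Tuple


def propagate_min_labels(n_nodes: int, edges: Sequence[Tuple[int, int]]) -> List[int]:
    """Label each node with the smallest node id in its connected component."""
    labels = list(range(n_nodes))
    while True:
        updated = labels[:]
        # One data-parallel round: every edge reads the old labels and
        # lowers both endpoints to the smaller of the two.
        for u, v in edges:
            smallest = min(labels[u], labels[v])
            updated[u] = min(updated[u], smallest)
            updated[v] = min(updated[v], smallest)
        if updated == labels:  # converged: no label changed this round
            return labels
        labels = updated


component = propagate_min_labels(6, [(0, 1), (1, 2), (3, 4)])
print(component)  # prints [0, 0, 0, 3, 3, 5]
```

The number of rounds grows with component diameter, so a real plan would likely add pointer jumping or keep a clearly bounded host fallback for pathological inputs, as the checklist allows.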
Exit Criteria¶
grouped overlay no longer depends on host-materialized per-group ranges in its main GPU path
microcells and contraction are no longer structurally host-managed
overlay pipeline stages show materially less CPU orchestration in profiling
Cross-Cutting Cleanup Sweep¶
After M1 through M5 land, do a sweep for smaller but repeated anti-patterns.
Sweep Targets¶
src/vibespatial/io/shp_gpu.py
src/vibespatial/io/fgb_gpu.py
src/vibespatial/constructive/clip_rect.py
src/vibespatial/constructive/linestring.py
src/vibespatial/constructive/shortest_line.py
src/vibespatial/overlay/assemble.py
any residual hardcoded (256, 1, 1) launch patterns not justified by data
Sweep Checklist¶
[ ] replace sync-plus-scalar-read totals with the runtime helper
[ ] remove same-stream syncs that only guard later device work
[ ] switch obvious hardcoded launch sizes to occupancy-aware launch config
[ ] collapse repeated small D2H reads into one batched transfer where host reads remain necessary
Milestone M6: Public Physical-Plan Coverage¶
Goal¶
Make real-world public workflow performance generalize through reusable physical execution shapes.
Primary Surfaces¶
src/vibespatial/bench/shootout.py
src/vibespatial/bench/profile_rails.py
public sjoin, clip, overlay, dissolve, and buffer chains
real-world shootouts under benchmarks/shootout
Known Problems To Fix¶
new real-world shootouts can be correct while still 4x to 65x slower than GeoPandas at 10K rows
end-to-end shootout timings do not yet explain stage time, actual backend, fallback events, transfers, materialization, or hotpath stage dominance
semijoin, anti-semijoin, many-few overlay, mask clip, grouped geometry reduce, and area-filter-after-overlay are not tracked as first-class shapes
Checklist¶
[ ] Extend shootout artifacts with physical-plan stage evidence.
[ ] Tag public workflow canaries by reusable physical shape.
[ ] Profile emergency response catchments and retail trade-area screening before touching workflow code.
[ ] Add shape-level benchmarks or profiler rails for the slowest common patterns.
[ ] Fix shared execution shapes before applying workflow-specific changes.
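A physical-plan breakdown can be as small as a list of stage records tagged with the backend that actually ran and the reusable shape each stage belongs to. The sketch below is an assumption about what such an artifact might contain. `StageRecord`, its fields, and the sample timings are hypothetical placeholders, not output from `shootout.py` and not measurements.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class StageRecord:
    name: str
    backend: str    # backend that actually executed, e.g. "gpu" or "cpu"
    seconds: float
    shape: str      # reusable physical shape tag, e.g. "semijoin"


def dominant_stage(stages: List[StageRecord]) -> Tuple[StageRecord, float]:
    """Return the slowest stage and its share of total stage time."""
    total = sum(stage.seconds for stage in stages)
    if total <= 0:
        raise ValueError("stage timings must sum to a positive value")
    top = max(stages, key=lambda stage: stage.seconds)
    return top, top.seconds / total


# Placeholder timings chosen for illustration only.
stages = [
    StageRecord("read", "gpu", 0.5, "io"),
    StageRecord("sjoin", "cpu", 3.0, "semijoin"),
    StageRecord("reduce", "gpu", 0.5, "grouped_reduce"),
]
top, share = dominant_stage(stages)
print(top.shape, top.backend, share)  # prints semijoin cpu 0.75
```

Reporting the dominant stage's shape and backend alongside the end-to-end time is what lets a slow workflow result name the reusable shape that explains the cost.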
Exit Criteria¶
each real-world shootout has a physical-plan breakdown
sub-par workflow results name the reusable shape that explains the cost
remediation improves a shared shape or documents an external-bound limit
Measurement Gates¶
Every milestone must report:
exact commands used
machine and GPU model
selected runtime
before and after timing
before and after transfer shape if the milestone touches residency
At minimum, re-run these after each milestone:
uv run python scripts/profile_kernels.py --kernel all --rows 10000 --repeat 1
uv run python scripts/health.py --gpu-coverage
uv run python scripts/benchmark_pipelines.py --suite full --repeat 1 --gpu-sparkline
If a milestone changes join, overlay, IO, predicate, or runtime surfaces, also run the narrowest relevant pytest slice before the broad rails.
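The reporting requirements above can be captured as a small record that refuses incomplete reports. This is a sketch under stated assumptions: `MilestoneReport` and its field names are hypothetical, and the timings and transfer counts in the example are placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MilestoneReport:
    milestone: str
    commands: List[str]            # exact commands used
    machine: str                   # machine and GPU model
    selected_runtime: str          # runtime that actually executed
    before_seconds: float
    after_seconds: float
    touches_residency: bool = False
    transfers_before: Optional[int] = None
    transfers_after: Optional[int] = None

    def validate(self) -> None:
        """Raise ValueError if a required piece of evidence is missing."""
        if not self.commands:
            raise ValueError("report must list the exact commands used")
        if self.touches_residency and None in (self.transfers_before, self.transfers_after):
            raise ValueError("residency milestones must report transfer shape")

    def speedup(self) -> float:
        return self.before_seconds / self.after_seconds


report = MilestoneReport(
    milestone="M1",
    commands=["uv run python scripts/health.py --gpu-coverage"],
    machine="local RTX 4090",
    selected_runtime="gpu",
    before_seconds=2.0,   # placeholder timing
    after_seconds=1.0,    # placeholder timing
    touches_residency=True,
    transfers_before=12,  # placeholder count
    transfers_after=3,    # placeholder count
)
report.validate()
print(report.speedup())  # prints 2.0
```

Making the transfer fields mandatory only when `touches_residency` is set mirrors the conditional requirement in the list above.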
Program-Level Exit Criteria¶
The full push is complete only when all of the following are true:
GPU acceleration coverage is materially above the April 7, 2026 baseline of 3.95%
profiler rails select GPU for important benchmark surfaces at useful scales, not just tiny helper stages
end-to-end pipeline sparkline shows reduced CPU-heavy orchestration relative to the starting baseline
major device-resident pipelines no longer pay eager host metadata mirroring
major reusable GPU helpers no longer force null-stream synchronization by default
grouped overlay and microcells are no longer fundamentally host-managed in their mainline GPU execution shape
What Counts As Failure¶
This push fails if it ends with:
more GPU dispatches but no meaningful end-to-end improvement
more stream code but the same null-stream serialization
cleaner helper APIs but unchanged host-driven decode and grouping
overlay still structurally controlled by Python loops
new correctness regressions accepted as the price of performance
Recommended Delivery Order¶
Use this order unless measurement proves otherwise:
M0 baseline and measurement hardening
M1 residency and metadata ownership
M2 CCCL synchronization contract
M3 device-native decode and compaction
M4 predicate and query execution shape
M5 overlay and constructive de-hosting
M6 public physical-plan coverage
cross-cutting cleanup sweep
The ordering matters because:
M1 removes hidden D2H taxes that pollute later measurements
M2 makes later stream and composition work possible
M3 and M4 fix shared primitives used by many higher-level paths
M5 is the hardest and should start after the foundations stop fighting back
M6 validates that the lower-level fixes generalize to public workflows instead of only improving focused benchmark surfaces
Session Checklist¶
Use this at the start of each session in the push:
[ ] Which milestone is active?
[ ] What exact baseline numbers am I trying to beat?
[ ] What structural blocker am I removing first?
[ ] What profiler and benchmark commands will prove the change mattered?
[ ] What correctness slices must stay green while I change the execution shape?
Use this at the end of each session:
[ ] What changed in execution shape, not just code structure?
[ ] What measurements improved?
[ ] What measurements did not move?
[ ] What blocker remains for the current milestone?
[ ] What is the next smallest structural step?