Profiling Rails¶
This repo now has a dedicated profiling rail for join and overlay kernel work.
Intent¶
Provide one local entrypoint that reports stage-level wall-clock time, actual execution device, row flow, and Nsight-friendly range boundaries for the current join and overlay hot paths.
Request Signals¶
profiler
profiling
nsight
nvtx
benchmark rail
join profile
overlay profile
Open First¶
docs/testing/profiling-rails.md
scripts/profile_kernels.py
src/vibespatial/bench/profiling.py
src/vibespatial/bench/profile_rails.py
Verify¶
uv run python scripts/profile_kernels.py --kernel join --rows 1000 --tile-size 256
uv run python scripts/profile_kernels.py --kernel overlay --rows 500 --tile-size 256
uv run python scripts/check_docs.py --check
Risks¶
Reporting planned GPU selection instead of actual execution device hides CPU fallback costs and makes traces misleading.
End-to-end timers alone blur sort, filter, and refine costs together.
Profiling rails that are not machine-readable are hard to diff and easy to ignore during performance regressions.
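The point about diffable traces can be illustrated with a small sketch. It assumes a trace is a dict with a top-level `stages` list whose entries carry `name` and `elapsed_seconds`; the `stages` and `name` keys, the `diff_stage_times` helper, and the 10% threshold are hypothetical and not taken from the rail itself.

```python
def diff_stage_times(baseline, candidate, threshold=0.10):
    """Return stages whose elapsed_seconds grew by more than `threshold` (fractional)."""
    # Assumed layout: {"stages": [{"name": ..., "elapsed_seconds": ...}, ...]}
    before_by_name = {s["name"]: s["elapsed_seconds"] for s in baseline["stages"]}
    regressions = {}
    for stage in candidate["stages"]:
        before = before_by_name.get(stage["name"])
        if before is None or before <= 0:
            continue  # new stage or unusable baseline: nothing to compare against
        growth = (stage["elapsed_seconds"] - before) / before
        if growth > threshold:
            regressions[stage["name"]] = round(growth, 3)
    return regressions


# Illustrative numbers only, not measured results.
baseline = {"stages": [{"name": "sort", "elapsed_seconds": 0.010},
                       {"name": "refine", "elapsed_seconds": 0.040}]}
candidate = {"stages": [{"name": "sort", "elapsed_seconds": 0.0105},
                        {"name": "refine", "elapsed_seconds": 0.060}]}
regressions = diff_stage_times(baseline, candidate)
```

Comparing per stage rather than end to end keeps a slower refine from being masked by a faster sort.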
Entry Point¶
Run:
uv run python scripts/profile_kernels.py --kernel all --rows 10000
Available kernels:
join
overlay
all
Useful flags:
--rows
--join-rows
--overlay-rows
--tile-size
--repeat
--nvtx
Stage Contracts¶
The JSON trace must include stage categories that make the current execution shape obvious:
setup
sort
filter
refine
Join profiling currently records:
owned-buffer build
flat-index build
coarse candidate / fast-path selection
optional predicate refine
output sort
Overlay profiling currently records:
owned-buffer build
segment extraction
segment MBR filter
exact intersection refine
reconstruction-event sort
Each stage reports:
device
elapsed_seconds
rows_in
rows_out
stage metadata such as pairs_examined, ambiguous_pairs, or tile_size
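A minimal sketch of how one stage record with these fields could be captured and serialized. The `StageRecord` class, the `timed_stage` helper, and the `name`, `category`, and `metadata` keys are hypothetical; only `device`, `elapsed_seconds`, `rows_in`, `rows_out`, and `tile_size` come from the contract above, and the real rail in src/vibespatial/bench/profiling.py may be structured differently.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class StageRecord:
    name: str
    category: str  # one of: setup, sort, filter, refine
    device: str    # the device that actually ran the stage
    elapsed_seconds: float = 0.0
    rows_in: int = 0
    rows_out: int = 0
    metadata: dict = field(default_factory=dict)


@contextmanager
def timed_stage(trace, name, category, device, rows_in):
    """Time the enclosed block and append its record to `trace`."""
    record = StageRecord(name=name, category=category, device=device, rows_in=rows_in)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record.elapsed_seconds = time.perf_counter() - start
        trace.append(record)


trace = []
with timed_stage(trace, "segment MBR filter", "filter", "cpu", rows_in=500) as stage:
    survivors = [i for i in range(500) if i % 5 == 0]  # stand-in for real filtering
    stage.rows_out = len(survivors)
    stage.metadata["tile_size"] = 256

trace_json = json.dumps([asdict(r) for r in trace], indent=2)
```

Recording `rows_in` and `rows_out` next to the timer is what lets a reader tell a slow stage from a stage that was simply handed more rows.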
Trace Interpretation¶
Top-level selected_runtime is the device that actually executed the profiled
join or overlay hot path. If metadata.planner_selected_runtime differs, the
planner and the realised path diverged; for example, a join may build its flat
index on CPU but still execute the candidate query path on GPU via a regular-grid
or owned-query fast path.
This distinction is intentional. The profiling rail is meant to explain real execution, not aspirational dispatch.
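As a sketch, a consumer of the JSON trace could flag this divergence as follows. The `planner_diverged` helper and the "cpu" and "gpu" values are assumptions for illustration; only the `selected_runtime` and `metadata.planner_selected_runtime` keys come from the description above.

```python
def planner_diverged(trace):
    """True when the planner's choice differs from the device that actually ran."""
    planned = trace.get("metadata", {}).get("planner_selected_runtime")
    actual = trace.get("selected_runtime")
    return planned is not None and planned != actual


# Hypothetical trace fragments; the runtime labels are assumed.
diverged_trace = {"selected_runtime": "gpu",
                  "metadata": {"planner_selected_runtime": "cpu"}}
matching_trace = {"selected_runtime": "gpu",
                  "metadata": {"planner_selected_runtime": "gpu"}}

diverged = planner_diverged(diverged_trace)
matched = planner_diverged(matching_trace)
```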
NVTX¶
When --nvtx is passed and the optional nvtx Python package is installed, the
rail emits NVTX ranges per stage. That makes the same stage boundaries
visible inside external profilers such as Nsight Systems.
Example:
nsys profile --trace=cuda,nvtx uv run python scripts/profile_kernels.py --kernel all --rows 10000 --nvtx
The JSON trace remains the source of truth inside the repo. NVTX is an external augmentation, not a replacement.
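One way to keep NVTX optional is sketched below. It assumes the `nvtx` package's `annotate` context manager and falls back to a no-op when the package is missing or the flag is off; the `stage_range` helper and the range name are hypothetical, and the rail's own wiring may differ.

```python
from contextlib import nullcontext

try:
    import nvtx  # optional dependency; absent on machines without profiling tools
except ImportError:
    nvtx = None


def stage_range(name, enabled):
    """Return an NVTX range for `name`, or a no-op context when unavailable."""
    if enabled and nvtx is not None:
        return nvtx.annotate(message=name)
    return nullcontext()


# The same code path runs whether or not nvtx is installed.
with stage_range("join/output sort", enabled=True):
    ordered = sorted([3, 1, 2])
```

Because the fallback is a plain no-op, stage timing for the JSON trace can wrap the same block without depending on NVTX being present.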
Shootout Physical Plans¶
Public shootouts should expose physical-plan evidence when they are used for performance decisions. Whole-script medians are not enough once correctness is already passing.
Shootout artifacts should report:
actual backend by stage
fallback events and reasons
host materialization counts, owned transfer-boundary counts, and runtime D2H copy counts/bytes/synchronous seconds
top hotpath stages by elapsed time
statement-level timed_stages with source line spans and physical-shape tags
stage_totals_by_tag and stage_totals_by_backend
hotpath_total_seconds, composition_overhead_seconds, and composition_overhead_ratio
row-flow counts through joins, overlays, filters, and grouped reductions
physical shape tags such as semijoin, anti-semijoin, many-few overlay, mask clip, grouped geometry reduce, and area-filter-after-overlay
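The aggregate fields named above could be derived from the timed stages roughly as follows. This is a sketch under stated assumptions: each stage is taken to carry hypothetical `tag`, `backend`, and `elapsed_seconds` keys, composition overhead is assumed to be whole-workflow wall time minus the hotpath total, and the ratio is assumed to be overhead divided by wall time. The source does not define these formulas, and the `summarize` helper is illustrative.

```python
from collections import defaultdict


def summarize(timed_stages, wall_seconds):
    """Aggregate stage timings and estimate composition overhead (assumed formulas)."""
    by_tag = defaultdict(float)
    by_backend = defaultdict(float)
    for stage in timed_stages:
        by_tag[stage["tag"]] += stage["elapsed_seconds"]
        by_backend[stage["backend"]] += stage["elapsed_seconds"]
    hotpath = sum(s["elapsed_seconds"] for s in timed_stages)
    # Assumption: anything not spent inside a timed stage counts as composition overhead.
    overhead = max(wall_seconds - hotpath, 0.0)
    return {
        "stage_totals_by_tag": dict(by_tag),
        "stage_totals_by_backend": dict(by_backend),
        "hotpath_total_seconds": hotpath,
        "composition_overhead_seconds": overhead,
        "composition_overhead_ratio": overhead / wall_seconds if wall_seconds > 0 else 0.0,
    }


# Illustrative numbers only, not measured results.
stages = [
    {"tag": "semijoin", "backend": "gpu", "elapsed_seconds": 0.25},
    {"tag": "mask clip", "backend": "gpu", "elapsed_seconds": 0.5},
    {"tag": "semijoin", "backend": "cpu", "elapsed_seconds": 0.25},
]
summary = summarize(stages, wall_seconds=2.0)
```

In this made-up example half of the wall time falls outside any timed stage, which under the assumed definition would point at orchestration rather than kernel throughput.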
Interpretation rules:
a passing fingerprint is correctness evidence, not performance evidence
a fast individual kernel does not prove the public workflow shape is fast
a GPU-only execution trace with high composition overhead points at public API orchestration, scalar synchronization, or frame assembly rather than kernel throughput
workflow fixes should improve a named reusable shape or document a measured external-bound limit
operation-vs-operation floor checks should accompany workflow analysis so stage-floor gaps are not confused with composition overhead