# Profiling Rails
This repo now has a dedicated profiling rail for join and overlay kernel work.
## Intent
Provide one local entrypoint that reports stage-level wall-clock time, actual execution device, row flow, and Nsight-friendly range boundaries for the current join and overlay hot paths.
## Request Signals

- profiler
- profiling
- nsight
- nvtx
- benchmark rail
- join profile
- overlay profile
## Open First

- `docs/testing/profiling-rails.md`
- `scripts/profile_kernels.py`
- `src/vibespatial/bench/profiling.py`
- `src/vibespatial/bench/profile_rails.py`
## Verify

```shell
uv run python scripts/profile_kernels.py --kernel join --rows 1000 --tile-size 256
uv run python scripts/profile_kernels.py --kernel overlay --rows 500 --tile-size 256
uv run python scripts/check_docs.py --check
```
## Risks

- Reporting planned GPU selection instead of the actual execution device hides CPU fallback costs and makes traces misleading.
- End-to-end timers alone blur sort, filter, and refine costs together.
- Profiling rails that are not machine-readable are hard to diff and easy to ignore during performance regressions.
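The last risk is the argument for keeping traces machine-readable. A minimal sketch of the kind of stage-level diff this enables is below; the trace layout and `stage_regressions` helper are illustrative assumptions, not the repo's actual format:

```python
# Hypothetical helper: compare per-stage timings between two trace dicts
# shaped like {"stages": [{"name": ..., "elapsed_seconds": ...}, ...]}.
# The schema here is an assumption for illustration only.
def stage_regressions(baseline, candidate, threshold=1.5):
    base = {s["name"]: s["elapsed_seconds"] for s in baseline["stages"]}
    regressions = {}
    for stage in candidate["stages"]:
        old = base.get(stage["name"])
        if old is not None and stage["elapsed_seconds"] > old * threshold:
            regressions[stage["name"]] = (old, stage["elapsed_seconds"])
    return regressions

baseline = {"stages": [{"name": "sort", "elapsed_seconds": 0.10}]}
candidate = {"stages": [{"name": "sort", "elapsed_seconds": 0.25}]}
print(stage_regressions(baseline, candidate))  # flags the slower sort stage
```

A check like this can run in CI against a stored baseline trace, which is exactly what an end-to-end timer cannot support.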
## Entry Point

Run:

```shell
uv run python scripts/profile_kernels.py --kernel all --rows 10000
```

Available kernels:

- `join`
- `overlay`
- `all`

Useful flags:

- `--rows`
- `--join-rows`
- `--overlay-rows`
- `--tile-size`
- `--repeat`
- `--nvtx`
## Stage Contracts

The JSON trace must include stage categories that make the current execution shape obvious:

- `setup`
- `sort`
- `filter`
- `refine`
Join profiling currently records:
- owned-buffer build
- bounds computation
- Morton sort
- coarse candidate filter
- predicate refine
- output sort
Overlay profiling currently records:
- owned-buffer build
- segment extraction
- segment MBR filter
- exact intersection refine
- reconstruction-event sort
Each stage reports:
- `device`
- `elapsed_seconds`
- `rows_in`
- `rows_out`
- stage metadata such as `pairs_examined`, `ambiguous_pairs`, or `tile_size`
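As a sketch, a validator for this stage contract might look like the following. The required keys and category names come from the lists above, but the exact dict layout of a stage record is an assumption:

```python
# Sketch of a stage-contract check. Required keys and category names are
# taken from the documented contract; the record layout is assumed.
REQUIRED_KEYS = {"device", "elapsed_seconds", "rows_in", "rows_out"}
CATEGORIES = {"setup", "sort", "filter", "refine"}

def validate_stage(stage):
    missing = REQUIRED_KEYS - stage.keys()
    if missing:
        raise ValueError(f"stage missing keys: {sorted(missing)}")
    if stage.get("category") not in CATEGORIES:
        raise ValueError(f"unknown stage category: {stage.get('category')}")
    return True

stage = {
    "category": "filter",
    "device": "cpu",
    "elapsed_seconds": 0.042,
    "rows_in": 10_000,
    "rows_out": 1_850,
    "metadata": {"pairs_examined": 92_113, "tile_size": 256},
}
assert validate_stage(stage)
```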
## Trace Interpretation
The top-level `selected_runtime` is the device that actually executed the profiled stages. If `metadata.planner_selected_runtime` differs, the runtime planner would prefer the GPU on this machine, but the current implementation surface still executed on CPU.
This distinction is intentional. The profiling rail is meant to explain real execution, not aspirational dispatch.
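A small check for this mismatch can be sketched as follows. The two field names come from the description above; the rest of the trace shape is an assumption:

```python
import json

# Flags the planner/execution mismatch described above. Only
# selected_runtime and metadata.planner_selected_runtime are documented;
# the surrounding trace structure is assumed for illustration.
def planner_mismatch(trace):
    actual = trace.get("selected_runtime")
    planned = trace.get("metadata", {}).get("planner_selected_runtime")
    return planned is not None and planned != actual

trace = json.loads(
    '{"selected_runtime": "cpu", '
    '"metadata": {"planner_selected_runtime": "gpu"}}'
)
if planner_mismatch(trace):
    print("planner preferred GPU but stages executed on CPU")
```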
## NVTX

When the optional `nvtx` Python package is installed and `--nvtx` is passed, the rail emits NVTX ranges per stage. That makes the same stage boundaries visible inside external profilers such as Nsight Systems.
Example:
```shell
nsys profile --trace=cuda,nvtx uv run python scripts/profile_kernels.py --kernel all --rows 10000 --nvtx
```
The JSON trace remains the source of truth inside the repo. NVTX is an external augmentation, not a replacement.
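The per-stage NVTX emission could be sketched like this, using the `nvtx` package's `annotate` context manager and falling back to a no-op when the package is absent. The stage loop and names here are illustrative, not the rail's actual implementation:

```python
import contextlib

try:
    import nvtx  # optional dependency; exposes the nvtx.annotate range API

    def stage_range(name):
        return nvtx.annotate(name)
except ImportError:
    def stage_range(name):
        # No-op context manager when nvtx is unavailable, so the rail
        # still runs and the JSON trace is still produced.
        return contextlib.nullcontext()

# Illustrative stage loop; these names mirror the join contract above.
for name in ["owned-buffer build", "Morton sort", "coarse candidate filter"]:
    with stage_range(name):
        pass  # the stage's actual kernel work would run here
```

Because the fallback is a no-op, the same code path serves both the plain JSON trace and the Nsight-augmented run.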