Profiling Rails

This repo now has a dedicated profiling rail for join and overlay kernel work.

Intent

Provide one local entry point that reports stage-level wall-clock time, actual execution device, row flow, and Nsight-friendly range boundaries for the current join and overlay hot paths.

Request Signals

  • profiler

  • profiling

  • nsight

  • nvtx

  • benchmark rail

  • join profile

  • overlay profile

Open First

  • docs/testing/profiling-rails.md

  • scripts/profile_kernels.py

  • src/vibespatial/bench/profiling.py

  • src/vibespatial/bench/profile_rails.py

Verify

  • uv run python scripts/profile_kernels.py --kernel join --rows 1000 --tile-size 256

  • uv run python scripts/profile_kernels.py --kernel overlay --rows 500 --tile-size 256

  • uv run python scripts/check_docs.py --check

Risks

  • Reporting planned GPU selection instead of actual execution device hides CPU fallback costs and makes traces misleading.

  • End-to-end timers alone blur sort, filter, and refine costs together.

  • Profiling rails that are not machine-readable are hard to diff and easy to ignore during performance regressions.

Entry Point

Run:

uv run python scripts/profile_kernels.py --kernel all --rows 10000

Available kernels:

  • join

  • overlay

  • all

Useful flags:

  • --rows

  • --join-rows

  • --overlay-rows

  • --tile-size

  • --repeat

  • --nvtx

Stage Contracts

The JSON trace must include stage categories that make the current execution shape obvious:

  • setup

  • sort

  • filter

  • refine

Join profiling currently records:

  • owned-buffer build

  • flat-index build

  • coarse candidate / fast-path selection

  • optional predicate refine

  • output sort

Overlay profiling currently records:

  • owned-buffer build

  • segment extraction

  • segment MBR filter

  • exact intersection refine

  • reconstruction-event sort

Each stage reports:

  • device

  • elapsed_seconds

  • rows_in

  • rows_out

  • stage metadata such as pairs_examined, ambiguous_pairs, or tile_size
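As a hedged sketch of consuming this contract (the stage fields follow the list above, but the surrounding trace structure and stage names here are illustrative assumptions, not the rail's actual schema), a reader of the JSON trace might total elapsed time per execution device so CPU fallback cost stands out:

```python
import json

# Hypothetical trace shaped like the stage contract above: each stage
# reports device, elapsed_seconds, rows_in, rows_out, and metadata.
trace = json.loads("""
{
  "stages": [
    {"name": "sort",   "device": "gpu", "elapsed_seconds": 0.012,
     "rows_in": 1000, "rows_out": 1000, "metadata": {"tile_size": 256}},
    {"name": "filter", "device": "gpu", "elapsed_seconds": 0.004,
     "rows_in": 1000, "rows_out": 420,  "metadata": {"pairs_examined": 5200}},
    {"name": "refine", "device": "cpu", "elapsed_seconds": 0.031,
     "rows_in": 420,  "rows_out": 97,   "metadata": {"ambiguous_pairs": 12}}
  ]
}
""")

# Sum elapsed time by the device that actually ran each stage.
totals: dict[str, float] = {}
for stage in trace["stages"]:
    totals[stage["device"]] = totals.get(stage["device"], 0.0) + stage["elapsed_seconds"]

print(totals)
```

In this shape, a large cpu total next to a gpu-labelled run is exactly the hidden-fallback signal the Risks section warns about.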

Trace Interpretation

Top-level selected_runtime is the device that actually executed the profiled join or overlay hot path. If metadata.planner_selected_runtime differs, the planner and the realised path diverged; for example, a join may build its flat index on CPU but still execute the candidate query path on GPU via a regular-grid or owned-query fast path.

This distinction is intentional. The profiling rail is meant to explain real execution, not aspirational dispatch.
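A minimal sketch of that check, assuming only the two keys named above (selected_runtime and metadata.planner_selected_runtime; the trace fragment itself is invented for illustration):

```python
# Hypothetical trace fragment using the top-level keys described above.
trace = {
    "selected_runtime": "gpu",
    "metadata": {"planner_selected_runtime": "cpu"},
}

planned = trace["metadata"].get("planner_selected_runtime")
actual = trace["selected_runtime"]

if planned is not None and planned != actual:
    # Planner and realised path diverged, e.g. a CPU flat-index build
    # followed by a GPU candidate-query fast path.
    print(f"divergence: planned={planned} actual={actual}")
```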

NVTX

When the optional nvtx Python package is installed and --nvtx is passed, the rail emits one NVTX range per stage. That makes the same stage boundaries visible inside external profilers such as Nsight Systems.

Example:

nsys profile --trace=cuda,nvtx uv run python scripts/profile_kernels.py --kernel all --rows 10000 --nvtx

The JSON trace remains the source of truth inside the repo. NVTX is an external augmentation, not a replacement.
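A hedged sketch of the per-stage range pattern, assuming the nvtx package's annotate context manager and falling back to a no-op when the package is absent (the stage names and sleep are stand-ins, not the rail's real implementation):

```python
import contextlib
import time

try:
    import nvtx  # optional dependency; nvtx.annotate acts as a range context manager

    def stage_range(name: str):
        return nvtx.annotate(name)
except ImportError:
    def stage_range(name: str):
        # Without nvtx installed the rail still runs; ranges become no-ops.
        return contextlib.nullcontext()

for stage in ("setup", "sort", "filter", "refine"):
    with stage_range(stage):
        time.sleep(0.001)  # stand-in for real stage work
print("done")
```

This keeps the JSON trace as the source of truth: the NVTX layer only mirrors boundaries that already exist in the stage contract.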

Shootout Physical Plans

Public shootouts should expose physical-plan evidence when they are used for performance decisions. Whole-script medians are not enough once correctness is already passing.

Shootout artifacts should report:

  • actual backend by stage

  • fallback events and reasons

  • host materialization counts, owned transfer-boundary counts, and runtime D2H copy counts/bytes/synchronous seconds

  • top hotpath stages by elapsed time

  • statement-level timed_stages with source line spans and physical-shape tags

  • stage_totals_by_tag and stage_totals_by_backend

  • hotpath_total_seconds, composition_overhead_seconds, and composition_overhead_ratio

  • row-flow counts through joins, overlays, filters, and grouped reductions

  • physical shape tags such as semijoin, anti-semijoin, many-few overlay, mask clip, grouped geometry reduce, and area-filter-after-overlay
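As a sketch of the aggregation (the timed_stages record shape, the hotpath flag, and the overhead formula below are all assumptions for illustration, not the shootout's actual schema), stage_totals_by_tag, stage_totals_by_backend, and a composition overhead ratio could be derived like this:

```python
# Hypothetical timed_stages records carrying physical-shape tags and backends.
timed_stages = [
    {"tag": "semijoin",       "backend": "gpu", "elapsed_seconds": 0.20, "hotpath": True},
    {"tag": "mask clip",      "backend": "gpu", "elapsed_seconds": 0.05, "hotpath": True},
    {"tag": "frame assembly", "backend": "cpu", "elapsed_seconds": 0.10, "hotpath": False},
]

stage_totals_by_tag: dict[str, float] = {}
stage_totals_by_backend: dict[str, float] = {}
for s in timed_stages:
    stage_totals_by_tag[s["tag"]] = stage_totals_by_tag.get(s["tag"], 0.0) + s["elapsed_seconds"]
    stage_totals_by_backend[s["backend"]] = stage_totals_by_backend.get(s["backend"], 0.0) + s["elapsed_seconds"]

wall_total = sum(s["elapsed_seconds"] for s in timed_stages)
hotpath_total_seconds = sum(s["elapsed_seconds"] for s in timed_stages if s["hotpath"])
# Assumed definition: time outside the named hot path counts as composition overhead.
composition_overhead_seconds = wall_total - hotpath_total_seconds
composition_overhead_ratio = composition_overhead_seconds / wall_total

print(stage_totals_by_backend, round(composition_overhead_ratio, 3))
```

Under this assumed definition, a GPU-heavy stage_totals_by_backend with a high composition_overhead_ratio points at orchestration and frame assembly rather than kernel throughput, matching the interpretation rules below.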

Interpretation rules:

  • a passing fingerprint is correctness evidence, not performance evidence

  • a fast individual kernel does not prove the public workflow shape is fast

  • a GPU-only execution trace with high composition overhead points at public API orchestration, scalar synchronization, or frame assembly rather than kernel throughput

  • workflow fixes should improve a named reusable shape or document a measured external-bound limit

  • operation-vs-operation floor checks should accompany workflow analysis so stage-floor gaps are not confused with composition overhead