Profiling Rails

This repo now has a dedicated profiling rail for join and overlay kernel work.

Intent

Provide one local entry point that reports stage-level wall-clock time, actual execution device, row flow, and Nsight-friendly range boundaries for the current join and overlay hot paths.

Request Signals

  • profiler

  • profiling

  • nsight

  • nvtx

  • benchmark rail

  • join profile

  • overlay profile

Open First

  • docs/testing/profiling-rails.md

  • scripts/profile_kernels.py

  • src/vibespatial/bench/profiling.py

  • src/vibespatial/bench/profile_rails.py

Verify

  • uv run python scripts/profile_kernels.py --kernel join --rows 1000 --tile-size 256

  • uv run python scripts/profile_kernels.py --kernel overlay --rows 500 --tile-size 256

  • uv run python scripts/check_docs.py --check

Risks

  • Reporting planned GPU selection instead of actual execution device hides CPU fallback costs and makes traces misleading.

  • End-to-end timers alone blur sort, filter, and refine costs together.

  • Profiling rails that are not machine-readable are hard to diff and easy to ignore during performance regressions.

Entry Point

Run:

uv run python scripts/profile_kernels.py --kernel all --rows 10000

Available kernels:

  • join

  • overlay

  • all

Useful flags:

  • --rows

  • --join-rows

  • --overlay-rows

  • --tile-size

  • --repeat

  • --nvtx

Stage Contracts

The JSON trace must include stage categories that make the current execution shape obvious:

  • setup

  • sort

  • filter

  • refine

Join profiling currently records:

  • owned-buffer build

  • flat-index build

  • coarse candidate / fast-path selection

  • optional predicate refine

  • output sort

Overlay profiling currently records:

  • owned-buffer build

  • segment extraction

  • segment MBR filter

  • exact intersection refine

  • reconstruction-event sort

Each stage reports:

  • device

  • elapsed_seconds

  • rows_in

  • rows_out

  • stage metadata such as pairs_examined, ambiguous_pairs, or tile_size
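As a hedged sketch of consuming this contract (the stage fields follow the list above, but the surrounding trace structure and stage names here are illustrative assumptions, not the rail's actual schema), a reader of the JSON trace might total elapsed time per execution device so CPU fallback cost stands out:

```python
import json

# Hypothetical trace shaped like the stage contract above: each stage
# reports device, elapsed_seconds, rows_in, rows_out, and metadata.
trace = json.loads("""
{
  "stages": [
    {"name": "sort",   "device": "gpu", "elapsed_seconds": 0.012,
     "rows_in": 1000, "rows_out": 1000, "metadata": {"tile_size": 256}},
    {"name": "filter", "device": "gpu", "elapsed_seconds": 0.004,
     "rows_in": 1000, "rows_out": 420,  "metadata": {"pairs_examined": 5200}},
    {"name": "refine", "device": "cpu", "elapsed_seconds": 0.031,
     "rows_in": 420,  "rows_out": 97,   "metadata": {"ambiguous_pairs": 12}}
  ]
}
""")

# Sum elapsed time by the device that actually ran each stage.
totals: dict[str, float] = {}
for stage in trace["stages"]:
    totals[stage["device"]] = totals.get(stage["device"], 0.0) + stage["elapsed_seconds"]

print(totals)
```

In this shape, a large cpu total next to a gpu-labelled run is exactly the hidden-fallback signal the Risks section warns about.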

Trace Interpretation

Top-level selected_runtime is the device that actually executed the profiled join or overlay hot path. If metadata.planner_selected_runtime differs, the planner and the realised path diverged; for example, a join may build its flat index on CPU but still execute the candidate query path on GPU via a regular-grid or owned-query fast path.

This distinction is intentional. The profiling rail is meant to explain real execution, not aspirational dispatch.
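A minimal sketch of that check, assuming only the two keys named above (selected_runtime and metadata.planner_selected_runtime; the trace fragment itself is invented for illustration):

```python
# Hypothetical trace fragment using the top-level keys described above.
trace = {
    "selected_runtime": "gpu",
    "metadata": {"planner_selected_runtime": "cpu"},
}

planned = trace["metadata"].get("planner_selected_runtime")
actual = trace["selected_runtime"]

if planned is not None and planned != actual:
    # Planner and realised path diverged, e.g. a CPU flat-index build
    # followed by a GPU candidate-query fast path.
    print(f"divergence: planned={planned} actual={actual}")
```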

NVTX

When the optional nvtx Python package is installed and --nvtx is passed, the rail emits one NVTX range per stage. That makes the same stage boundaries visible inside external profilers such as Nsight Systems.

Example:

nsys profile --trace=cuda,nvtx uv run python scripts/profile_kernels.py --kernel all --rows 10000 --nvtx

The JSON trace remains the source of truth inside the repo. NVTX is an external augmentation, not a replacement.
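A hedged sketch of the per-stage range pattern, assuming the nvtx package's annotate context manager and falling back to a no-op when the package is absent (the stage names and sleep are stand-ins, not the rail's real implementation):

```python
import contextlib
import time

try:
    import nvtx  # optional dependency; nvtx.annotate acts as a range context manager

    def stage_range(name: str):
        return nvtx.annotate(name)
except ImportError:
    def stage_range(name: str):
        # Without nvtx installed the rail still runs; ranges become no-ops.
        return contextlib.nullcontext()

for stage in ("setup", "sort", "filter", "refine"):
    with stage_range(stage):
        time.sleep(0.001)  # stand-in for real stage work
print("done")
```

This keeps the JSON trace as the source of truth: the NVTX layer only mirrors boundaries that already exist in the stage contract.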

Shootout Physical Plans

Public shootouts should expose physical-plan evidence when they are used for performance decisions. Whole-script medians are not enough once correctness is already passing.

Shootout artifacts should report:

  • actual backend by stage

  • fallback events and reasons

  • host materialization counts, owned transfer-boundary counts, and runtime D2H copy counts/bytes/synchronous seconds

  • top hotpath stages by elapsed time

  • statement-level timed_stages with source line spans and physical-shape tags

  • stage_totals_by_tag and stage_totals_by_backend

  • hotpath_total_seconds, composition_overhead_seconds, and composition_overhead_ratio

  • row-flow counts through joins, overlays, filters, and grouped reductions

  • physical shape tags such as semijoin, anti-semijoin, many-few overlay, mask clip, grouped geometry reduce, and area-filter-after-overlay
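As a sketch of the aggregation (the timed_stages record shape, the hotpath flag, and the overhead formula below are all assumptions for illustration, not the shootout's actual schema), stage_totals_by_tag, stage_totals_by_backend, and a composition overhead ratio could be derived like this:

```python
# Hypothetical timed_stages records carrying physical-shape tags and backends.
timed_stages = [
    {"tag": "semijoin",       "backend": "gpu", "elapsed_seconds": 0.20, "hotpath": True},
    {"tag": "mask clip",      "backend": "gpu", "elapsed_seconds": 0.05, "hotpath": True},
    {"tag": "frame assembly", "backend": "cpu", "elapsed_seconds": 0.10, "hotpath": False},
]

stage_totals_by_tag: dict[str, float] = {}
stage_totals_by_backend: dict[str, float] = {}
for s in timed_stages:
    stage_totals_by_tag[s["tag"]] = stage_totals_by_tag.get(s["tag"], 0.0) + s["elapsed_seconds"]
    stage_totals_by_backend[s["backend"]] = stage_totals_by_backend.get(s["backend"], 0.0) + s["elapsed_seconds"]

wall_total = sum(s["elapsed_seconds"] for s in timed_stages)
hotpath_total_seconds = sum(s["elapsed_seconds"] for s in timed_stages if s["hotpath"])
# Assumed definition: time outside the named hot path counts as composition overhead.
composition_overhead_seconds = wall_total - hotpath_total_seconds
composition_overhead_ratio = composition_overhead_seconds / wall_total

print(stage_totals_by_backend, round(composition_overhead_ratio, 3))
```

Under this assumed definition, a GPU-heavy stage_totals_by_backend with a high composition_overhead_ratio points at orchestration and frame assembly rather than kernel throughput, matching the interpretation rules below.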

Interpretation rules:

  • a passing fingerprint is correctness evidence, not performance evidence

  • a fast individual kernel does not prove the public workflow shape is fast

  • a GPU-only execution trace with high composition overhead points at public API orchestration, scalar synchronization, or frame assembly rather than kernel throughput

  • workflow fixes should improve a named reusable shape or document a measured external-bound limit

  • operation-vs-operation floor checks should accompany workflow analysis so stage-floor gaps are not confused with composition overhead