Pipeline Benchmarks¶
This repo now has a dedicated end-to-end pipeline benchmark rail for regression gating.
Intent¶
Measure whole-pipeline cost, not just kernel microbenchmarks. The rail is meant to catch regressions from host<->device movement, materialization, allocation churn, and bad execution-shape changes that do not show up in isolated kernel timers.
Request Signals¶
pipeline benchmark
regression gate
ci perf
nvtx
cpu gpu movement
benchmark artifact
vsbench
bench cli
benchmark operation
benchmark suite
benchmark compare
nvbench kernel
shootout
geopandas vs vibespatial
Open First¶
docs/testing/pipeline-benchmarks.md
src/vibespatial/bench/cli.py
src/vibespatial/bench/catalog.py
src/vibespatial/bench/runner.py
src/vibespatial/bench/schema.py
src/vibespatial/bench/fixtures.py
src/vibespatial/bench/fixture_loader.py
src/vibespatial/bench/pipeline.py
src/vibespatial/bench/compare.py
src/vibespatial/bench/shootout.py
scripts/benchmark_pipelines.py
.github/workflows/pipeline-benchmarks.yml
Verify¶
uv run vsbench list operations
uv run vsbench run bounds --scale 1k --repeat 1 --quiet
uv run vsbench fixtures generate --scale 1k --format parquet
uv run vsbench compare baseline.json current.json
uv run pytest tests/test_pipeline_benchmarks.py tests/test_profiling_rails.py -q
uv run python scripts/benchmark_pipelines.py --suite smoke --repeat 2
uv run python scripts/check_docs.py --check
Risks¶
Comparing current results to a stale or missing baseline can hide regressions or create false confidence.
Reporting planner-selected GPU instead of actual hybrid execution hides where host materialization or transfer churn still dominates.
Single-run timings are noisy; median-over-repeats is the local source of truth for wall-clock regression checks.
Entry Points¶
Run the local smoke suite:
uv run python scripts/benchmark_pipelines.py --suite smoke --repeat 2
Pipeline benchmarks default to --profile-mode lean, which keeps wall-clock
stage timing plus runtime D2H count/byte/seconds counters. Use
--profile-mode audit when you need NVML samples and CUDA event stage timing.
--gpu-trace and --gpu-sparkline imply audit mode.
Compare a current run against a baseline artifact:
uv run vsbench compare baseline.json current.json
Discover operation-specific arguments before running a benchmark:
uv run vsbench list operations --json
uv run vsbench run clip-rect --arg kind=polygon --arg rect=100,100,700,700
uv run vsbench run bounds-pairs --rows 20000 --arg dataset=uniform --arg tile_size=256
Default operation listings and suites are public-API benchmarks only. Internal
owned-array or kernel diagnostics are hidden from vsbench list operations and
excluded from vsbench suite; use --include-internal or vsbench kernel
when you explicitly want private-path diagnostics.
vsbench suite runs serially and isolates each operation, pipeline, or kernel
item in a child process by default. That keeps CUDA allocator state and OOM
failures from bleeding across benchmark items. Use --in-process only for
local debugging when you intentionally want the old single-process behavior.
Pipelines¶
The active benchmarked pipelines are:
join-heavy: read_parquet -> build_index -> sjoin_query -> dissolve -> to_parquet
constructive: read_parquet -> clip -> buffer -> to_parquet
predicate-heavy: read_geojson -> load cached polygons -> point_in_polygon -> filter -> DGA-backed to_parquet
predicate-heavy-geopandas: read_geojson (pyogrio-first) -> covers -> filter -> to_parquet
raster-to-vector: currently emitted as deferred until Phase 8 polygonize work lands
Suites¶
smoke: 1K rows, local verification only
ci: 100K rows, intended for pull requests
full: 100K and 1M rows, intended for main and manual GPU runs
Each pipeline/scale can be repeated with --repeat N. Reported wall-clock is
the median elapsed time across repeats. Device memory and movement counters are
reported conservatively from the worst observed sample.
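The aggregation described above can be sketched as follows. This is a minimal illustration, not the code in src/vibespatial/bench/runner.py: the function name aggregate_repeats and the elapsed_seconds key are assumptions made for the example, while the counter names come from the trace contract in this document.

```python
from statistics import median


def aggregate_repeats(samples):
    """Collapse per-repeat samples into one reported result.

    Assumption: each sample is a dict with a hypothetical "elapsed_seconds"
    key plus the counter names listed in the trace contract.
    """
    if not samples:
        raise ValueError("at least one repeat is required")
    return {
        # Wall-clock: the median damps single-run noise.
        "elapsed_seconds": median(s["elapsed_seconds"] for s in samples),
        # Memory and movement: report the worst observed sample.
        "peak_device_memory_bytes": max(s["peak_device_memory_bytes"] for s in samples),
        "runtime_d2h_transfer_count": max(s["runtime_d2h_transfer_count"] for s in samples),
        "materialization_count": max(s["materialization_count"] for s in samples),
    }
```

Taking the maximum for counters means that a transfer which shows up in only one repeat still surfaces in the reported result.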
The suite CLI enforces per-item timeouts with --item-timeout N for isolated
runs. On timeout it kills only the owned child process group and records any
remaining non-orchestrator nvidia-smi compute apps in result metadata; it
does not kill unrelated GPU work on the machine.
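One way to get "kill only the owned child process group" on POSIX is to start each item in its own session and signal that group on timeout. The sketch below illustrates the technique only; run_isolated and its return shape are hypothetical, and the actual suite runner may be structured differently. It also omits the nvidia-smi bookkeeping.

```python
import os
import signal
import subprocess


def run_isolated(argv, timeout_seconds):
    """Run one benchmark item in its own process group (POSIX only).

    Hypothetical helper: on timeout, only the child's own process group is
    killed, so unrelated work on the machine is left alone.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process group id equals its pid.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        returncode = proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        proc.wait()  # reap the child so it does not linger as a zombie
        return {"status": "timeout", "returncode": None}
    return {"status": "ok" if returncode == 0 else "failed", "returncode": returncode}
```

Signalling the group rather than the single pid also cleans up any helper processes the benchmark item spawned itself.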
Regression Rules¶
The regression checker currently fails when:
wall-clock grows by more than 5%
peak device memory grows by more than 10%
CUDA-runtime D2H transfer count increases
host materialization count increases
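The four rules above amount to two relative thresholds and two strict "must not increase" checks. A minimal sketch, assuming each result is a flat dict: find_regressions and the elapsed_seconds key are hypothetical names, and the real checker in src/vibespatial/bench/compare.py may differ.

```python
WALL_CLOCK_TOLERANCE = 0.05   # fail above 5% growth
PEAK_MEMORY_TOLERANCE = 0.10  # fail above 10% growth


def find_regressions(baseline, current):
    """Return the names of the rules that the current result violates."""
    failures = []
    # "elapsed_seconds" is an assumed key for the median wall-clock time.
    if current["elapsed_seconds"] > baseline["elapsed_seconds"] * (1 + WALL_CLOCK_TOLERANCE):
        failures.append("wall-clock")
    if current["peak_device_memory_bytes"] > baseline["peak_device_memory_bytes"] * (1 + PEAK_MEMORY_TOLERANCE):
        failures.append("peak device memory")
    # Counts have no tolerance: any increase fails.
    if current["runtime_d2h_transfer_count"] > baseline["runtime_d2h_transfer_count"]:
        failures.append("runtime D2H transfer count")
    if current["materialization_count"] > baseline["materialization_count"]:
        failures.append("host materialization count")
    return failures
```

Counts get no tolerance because a single extra transfer or materialization is a deterministic change in execution shape, not timing noise.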
Trace Contract¶
Each pipeline result includes:
top-level selected_runtime
planner_selected_runtime
transfer_count
owned_transfer_count
runtime_d2h_transfer_count
runtime_d2h_transfer_bytes
runtime_d2h_transfer_seconds
materialization_count
peak_device_memory_bytes
stage traces with per-stage device
transfer_count is the runtime D2H count in current artifacts. Older
artifacts used it for owned-array residency transfer diagnostics, so new
artifacts also include owned_transfer_count to keep that semantic boundary
visible without hiding internal runtime copies.
When a pipeline runs partly on GPU and partly on CPU, selected_runtime becomes
hybrid. This is intentional. The benchmark rail reports what actually
executed, not what the planner wished would execute.
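As an illustration of reporting what actually executed, the top-level runtime can be derived from the per-stage device values rather than from the planner's choice. This is a sketch under assumptions: summarize_runtime is a hypothetical name, and the "cpu" / "gpu" strings are assumed device values rather than confirmed artifact contents.

```python
def summarize_runtime(stages):
    """Derive the reported runtime from where each stage actually ran.

    Assumption: every stage dict has a "device" value such as "cpu" or "gpu".
    """
    devices = {stage["device"] for stage in stages}
    if not devices:
        raise ValueError("a pipeline result needs at least one stage")
    if len(devices) == 1:
        return next(iter(devices))  # every stage ran on the same device
    return "hybrid"  # mixed execution, regardless of what the planner selected
```

Comparing this value against planner_selected_runtime is a quick way to spot pipelines where a fallback silently moved a stage off the GPU.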
Each stage may also carry:
requested_backend / actual_backend
requested_mode / actual_mode
fallback_note
transfer_count_delta
owned_transfer_count_delta
runtime_d2h_transfer_count_delta
runtime_d2h_transfer_bytes_delta
runtime_d2h_transfer_seconds_delta
materialization_count_delta
peak_device_memory_bytes
That makes CPU<->GPU movement visible in the same artifact as the wall-clock timing.
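For example, the per-stage deltas make it possible to rank stages by how much data they pulled back to the host. The sketch below assumes stage traces are a list of dicts with a "name" key, which is a guess at the artifact layout; rank_stages_by_d2h_bytes is a hypothetical helper, and only the *_delta field name comes from this document.

```python
def rank_stages_by_d2h_bytes(stages):
    """Return (stage name, D2H bytes) pairs, largest mover first.

    Stages "may" carry the delta fields, so a missing field counts as zero.
    """
    ranked = sorted(
        stages,
        key=lambda stage: stage.get("runtime_d2h_transfer_bytes_delta", 0),
        reverse=True,
    )
    return [
        (stage["name"], stage.get("runtime_d2h_transfer_bytes_delta", 0))
        for stage in ranked
    ]
```

The same pattern applies to materialization_count_delta or runtime_d2h_transfer_seconds_delta when the question is which stage forces host materialization or spends the most time copying.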
CI Workflow¶
.github/workflows/pipeline-benchmarks.yml runs the suite in two modes:
CPU job
  PRs: ci suite
  main / manual: full suite
optional GPU job on a self-hosted NVIDIA runner
  full suite with --nvtx
The workflow runs the current commit, attempts the same suite on the base
commit in a detached worktree, stores both artifacts, and diffs them with
uv run vsbench compare.
Bootstrap note: if the base commit predates these scripts, the workflow uploads the current artifact and records the baseline comparison as unavailable instead of pretending the gate ran.