Pipeline Benchmarks

This repo now has a dedicated end-to-end pipeline benchmark rail for regression gating.

Intent

Measure whole-pipeline cost, not just kernel microbenchmarks. The rail is meant to catch regressions from host<->device movement, materialization, allocation churn, and bad execution-shape changes that do not show up in isolated kernel timers.

Request Signals

  • pipeline benchmark

  • regression gate

  • ci perf

  • nvtx

  • cpu gpu movement

  • benchmark artifact

  • vsbench

  • bench cli

  • benchmark operation

  • benchmark suite

  • benchmark compare

  • nvbench kernel

  • shootout

  • geopandas vs vibespatial

Open First

  • docs/testing/pipeline-benchmarks.md

  • src/vibespatial/bench/cli.py

  • src/vibespatial/bench/catalog.py

  • src/vibespatial/bench/runner.py

  • src/vibespatial/bench/schema.py

  • src/vibespatial/bench/fixtures.py

  • src/vibespatial/bench/fixture_loader.py

  • src/vibespatial/bench/pipeline.py

  • src/vibespatial/bench/compare.py

  • src/vibespatial/bench/shootout.py

  • scripts/benchmark_pipelines.py

  • .github/workflows/pipeline-benchmarks.yml

Verify

  • uv run vsbench list operations

  • uv run vsbench run bounds --scale 1k --repeat 1 --quiet

  • uv run vsbench fixtures generate --scale 1k --format parquet

  • uv run vsbench compare baseline.json current.json

  • uv run pytest tests/test_pipeline_benchmarks.py tests/test_profiling_rails.py -q

  • uv run python scripts/benchmark_pipelines.py --suite smoke --repeat 2

  • uv run python scripts/check_docs.py --check

Risks

  • Comparing current results to a stale or missing baseline can hide regressions or create false confidence.

  • Reporting the planner-selected runtime (GPU) instead of the runtime that actually executed (hybrid) hides the stages where host materialization or transfer churn still dominates.

  • Single-run timings are noisy; median-over-repeats is the local source of truth for wall-clock regression checks.

Entry Points

Run the local smoke suite:

uv run python scripts/benchmark_pipelines.py --suite smoke --repeat 2

Pipeline benchmarks default to --profile-mode lean, which keeps wall-clock stage timing plus the runtime D2H count, byte, and seconds counters. Use --profile-mode audit when you need NVML samples and CUDA event stage timing. --gpu-trace and --gpu-sparkline imply audit mode.

Compare a current run against a baseline artifact:

uv run vsbench compare baseline.json current.json

Discover operation-specific arguments before running a benchmark:

uv run vsbench list operations --json
uv run vsbench run clip-rect --arg kind=polygon --arg rect=100,100,700,700
uv run vsbench run bounds-pairs --rows 20000 --arg dataset=uniform --arg tile_size=256

Default operation listings and suites are public-API benchmarks only. Internal owned-array or kernel diagnostics are hidden from vsbench list operations and excluded from vsbench suite; use --include-internal or vsbench kernel when you explicitly want private-path diagnostics.

vsbench suite runs serially and isolates each operation, pipeline, or kernel item in a child process by default. That keeps CUDA allocator state and OOM failures from bleeding across benchmark items. Use --in-process only for local debugging when you intentionally want the old single-process behavior.

Pipelines

The active benchmarked pipelines are:

  • join-heavy

    • read_parquet -> build_index -> sjoin_query -> dissolve -> to_parquet

  • constructive

    • read_parquet -> clip -> buffer -> to_parquet

  • predicate-heavy

    • read_geojson -> load cached polygons -> point_in_polygon -> filter -> DGA-backed to_parquet

  • predicate-heavy-geopandas

    • read_geojson(pyogrio-first) -> covers -> filter -> to_parquet

  • raster-to-vector

    • currently emitted as deferred until Phase 8 polygonize work lands

Suites

  • smoke

    • 1K rows, local verification only

  • ci

    • 100K rows, intended for pull requests

  • full

    • 100K and 1M rows, intended for main and manual GPU runs

Each pipeline and scale combination can be repeated with --repeat N. Reported wall-clock is the median elapsed time across repeats. Device memory and movement counters are reported conservatively, using the worst observed sample across repeats.
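The aggregation rule can be sketched as follows. This is an illustrative sketch, not the runner's code: the function name aggregate_repeats and the elapsed_seconds key are assumptions, while the remaining keys borrow field names from the trace contract.

```python
from statistics import median


def aggregate_repeats(samples):
    """Collapse per-repeat samples into one reported record.

    `samples` is a list of dicts, one per repeat. The function and the
    `elapsed_seconds` key are hypothetical; they are not the repo's API.
    """
    if not samples:
        raise ValueError("need at least one repeat")
    return {
        # Wall-clock: the median damps single-run noise.
        "elapsed_seconds": median(s["elapsed_seconds"] for s in samples),
        # Memory and movement counters: keep the worst observed sample.
        "peak_device_memory_bytes": max(
            s["peak_device_memory_bytes"] for s in samples
        ),
        "runtime_d2h_transfer_count": max(
            s["runtime_d2h_transfer_count"] for s in samples
        ),
        "materialization_count": max(
            s["materialization_count"] for s in samples
        ),
    }
```

Taking the maximum for counters means one bad repeat is enough to surface extra movement, while one slow repeat does not move the reported wall-clock.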

The suite CLI enforces per-item timeouts with --item-timeout N for isolated runs. On timeout it kills only the owned child process group and records any remaining non-orchestrator nvidia-smi compute apps in result metadata; it does not kill unrelated GPU work on the machine.
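One common way to get "kill only the owned child process group" on POSIX systems is to start the child in its own session and signal that group on timeout. The sketch below shows the general technique only; run_with_timeout is a hypothetical helper, not the suite CLI's implementation, and it does not cover the nvidia-smi bookkeeping described above.

```python
import os
import signal
import subprocess


def run_with_timeout(argv, timeout_seconds):
    """Run argv in its own process group; kill only that group on timeout.

    Returns (returncode, timed_out). POSIX only. Hypothetical helper.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process group id equals its pid.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_seconds), False
    except subprocess.TimeoutExpired:
        # Signal the child's group only; unrelated processes are untouched.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the child so it does not linger as a zombie
        return proc.returncode, True
```

Because the signal targets a group that only the benchmark item and its descendants belong to, other GPU work on the machine is never addressed by the kill.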

Regression Rules

The regression checker currently fails when:

  • wall-clock grows by more than 5%

  • peak device memory grows by more than 10%

  • CUDA-runtime D2H transfer count increases

  • host materialization count increases
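Read as code, the four rules might look like the sketch below. The thresholds come from the list above; the function name find_regressions, the flat dict layout, and the elapsed_seconds key are assumptions made for illustration and may not match the real checker or artifact schema.

```python
WALL_CLOCK_TOLERANCE = 0.05   # fail when wall-clock grows by more than 5%
PEAK_MEMORY_TOLERANCE = 0.10  # fail when peak device memory grows by more than 10%


def find_regressions(baseline, current):
    """Return the names of metrics that violate the regression rules.

    Hypothetical helper; both arguments are flat dicts of metrics.
    """
    failures = []
    if current["elapsed_seconds"] > baseline["elapsed_seconds"] * (
        1 + WALL_CLOCK_TOLERANCE
    ):
        failures.append("elapsed_seconds")
    if current["peak_device_memory_bytes"] > baseline[
        "peak_device_memory_bytes"
    ] * (1 + PEAK_MEMORY_TOLERANCE):
        failures.append("peak_device_memory_bytes")
    # Counts have no tolerance: any increase fails the gate.
    for key in ("runtime_d2h_transfer_count", "materialization_count"):
        if current[key] > baseline[key]:
            failures.append(key)
    return failures
```

The timing and memory rules are relative because those metrics are noisy; the transfer and materialization counts are treated as exact, so a single extra copy is enough to fail.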

Trace Contract

Each pipeline result includes:

  • top-level selected_runtime

  • planner_selected_runtime

  • transfer_count

  • owned_transfer_count

  • runtime_d2h_transfer_count

  • runtime_d2h_transfer_bytes

  • runtime_d2h_transfer_seconds

  • materialization_count

  • peak_device_memory_bytes

  • stage traces with per-stage device

In current artifacts, transfer_count is the runtime D2H count. Older artifacts used the same field for owned-array residency transfer diagnostics, so new artifacts also report owned_transfer_count. That keeps the two meanings separate without hiding internal runtime copies.

When a pipeline runs partly on GPU and partly on CPU, selected_runtime becomes hybrid. This is intentional: the benchmark rail reports what actually executed, not what the planner selected.
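As a rough illustration of that rule, a top-level runtime can be derived from the per-stage devices. The helper name actual_runtime and the literal device strings "cpu" and "gpu" are assumptions here; the real artifact may use different values.

```python
def actual_runtime(stage_devices):
    """Derive a top-level runtime label from per-stage device names.

    Hypothetical helper; assumes each stage reports "cpu" or "gpu".
    """
    devices = {d.lower() for d in stage_devices}
    if not devices:
        raise ValueError("pipeline has no stages")
    if devices == {"gpu"}:
        return "gpu"
    if devices == {"cpu"}:
        return "cpu"
    # Any mix means part of the work left the device the planner picked.
    return "hybrid"
```

For example, actual_runtime(["gpu", "cpu", "gpu"]) returns "hybrid" even if the planner asked for GPU throughout.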

Each stage may also carry:

  • requested_backend / actual_backend

  • requested_mode / actual_mode

  • fallback_note

  • transfer_count_delta

  • owned_transfer_count_delta

  • runtime_d2h_transfer_count_delta

  • runtime_d2h_transfer_bytes_delta

  • runtime_d2h_transfer_seconds_delta

  • materialization_count_delta

  • peak_device_memory_bytes

That makes CPU<->GPU movement visible in the same artifact as the wall-clock timing.
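A short sketch of how those per-stage fields could be scanned to find where movement happens. The delta and fallback_note field names come from the list above; the function name stages_with_movement, the "name" key, and the idea that stages arrive as a list of dicts are assumptions about the artifact layout.

```python
def stages_with_movement(stages):
    """List stages that moved data off the device or fell back.

    Hypothetical helper; `stages` is assumed to be a list of dicts.
    Missing fields are treated as "no movement".
    """
    flagged = []
    for stage in stages:
        d2h = stage.get("runtime_d2h_transfer_count_delta", 0)
        materialized = stage.get("materialization_count_delta", 0)
        note = stage.get("fallback_note")
        if d2h > 0 or materialized > 0 or note:
            flagged.append(
                {
                    "name": stage.get("name", "<unnamed>"),
                    "d2h_transfers": d2h,
                    "materializations": materialized,
                    "fallback_note": note,
                }
            )
    return flagged
```

Reading the flagged stages next to their wall-clock timing shows whether a slow stage is slow because of compute or because of transfers.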

CI Workflow

.github/workflows/pipeline-benchmarks.yml runs the suite in two jobs:

  • CPU job

    • PRs: ci suite

    • main / manual: full suite

  • optional GPU job on a self-hosted NVIDIA runner

    • full suite with --nvtx

The workflow runs the current commit, attempts the same suite on the base commit in a detached worktree, stores both artifacts, and diffs them with uv run vsbench compare.

Bootstrap note:

  • if the base commit predates these scripts, the workflow uploads the current artifact and records the baseline comparison as unavailable instead of pretending the gate ran
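The bootstrap behavior amounts to treating a missing baseline as an explicit "unavailable" outcome rather than a pass. A minimal sketch of that idea, assuming JSON artifacts on disk; the function name baseline_status and the returned keys are hypothetical and are not the workflow's actual output.

```python
import json
from pathlib import Path


def baseline_status(baseline_path):
    """Load a baseline artifact, or report that no comparison is possible.

    Hypothetical helper; the returned keys are illustrative only.
    """
    path = Path(baseline_path)
    if not path.is_file():
        # No baseline: say so explicitly instead of reporting a pass.
        return {"comparison": "unavailable", "gate_ran": False, "baseline": None}
    return {
        "comparison": "available",
        "gate_ran": True,
        "baseline": json.loads(path.read_text()),
    }
```

Recording gate_ran as false keeps a green check from being mistaken for a passed regression gate.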