# Pipeline Benchmarks

This repo now has a dedicated end-to-end pipeline benchmark rail for regression gating.
## Intent

Measure whole-pipeline cost, not just kernel microbenchmarks. The rail is meant to catch regressions from host<->device movement, materialization, allocation churn, and bad execution-shape changes that do not show up in isolated kernel timers.
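To illustrate the difference, here is a minimal sketch of an end-to-end run loop. All names (`run_pipeline_once`, `PipelineCounters`, the toy stages) are hypothetical and not the repo's actual API: the point is that timing the whole loop charges inter-stage transfers and materializations to the run, which isolated per-kernel timers miss.

```python
import time
from dataclasses import dataclass

@dataclass
class PipelineCounters:
    """Hypothetical counters a run accumulates alongside wall-clock."""
    transfer_count: int = 0          # host<->device copies observed
    materialization_count: int = 0   # lazy results forced into host memory

def run_pipeline_once(stages, counters):
    """Time the whole pipeline, not each stage in isolation."""
    start = time.perf_counter()
    data = None
    for stage in stages:
        data = stage(data, counters)
    return time.perf_counter() - start

# Toy stages: the second one simulates a host materialization between stages.
def load(_, counters):
    return list(range(1000))

def reduce_to_host(data, counters):
    counters.materialization_count += 1
    return sum(data)

counters = PipelineCounters()
elapsed = run_pipeline_once([load, reduce_to_host], counters)
print(counters.materialization_count)  # 1
```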
## Request Signals

- pipeline benchmark
- regression gate
- ci perf
- nvtx
- cpu gpu movement
- benchmark artifact
- vsbench
- bench cli
- benchmark operation
- benchmark suite
- benchmark compare
- nvbench kernel
- shootout
- geopandas vs vibespatial
## Open First

- docs/testing/pipeline-benchmarks.md
- src/vibespatial/bench/cli.py
- src/vibespatial/bench/catalog.py
- src/vibespatial/bench/runner.py
- src/vibespatial/bench/schema.py
- src/vibespatial/bench/fixtures.py
- src/vibespatial/bench/fixture_loader.py
- src/vibespatial/bench/pipeline.py
- src/vibespatial/bench/shootout.py
- scripts/benchmark_pipelines.py
- scripts/check_pipeline_regressions.py
- .github/workflows/pipeline-benchmarks.yml
## Verify

```shell
uv run vsbench list operations
uv run vsbench run bounds --scale 1k --repeat 1 --quiet
uv run vsbench fixtures generate --scale 1k --format parquet
uv run pytest tests/test_pipeline_benchmarks.py tests/test_profiling_rails.py -q
uv run python scripts/benchmark_pipelines.py --suite smoke --repeat 2
uv run python scripts/check_docs.py --check
```
## Risks
Comparing current results to a stale or missing baseline can hide regressions or create false confidence.
Reporting planner-selected GPU instead of actual hybrid execution hides where host materialization or transfer churn still dominates.
Single-run timings are noisy; median-over-repeats is the local source of truth for wall-clock regression checks.
## Entry Points
Run the local smoke suite:

```shell
uv run python scripts/benchmark_pipelines.py --suite smoke --repeat 2
```
Compare a current run against a baseline artifact:

```shell
uv run python scripts/check_pipeline_regressions.py --baseline baseline.json --current current.json
```
## Pipelines

The active benchmarked pipelines are:

- join-heavy: read_parquet -> build_index -> sjoin_query -> dissolve -> to_parquet
- constructive: read_parquet -> clip -> buffer -> to_parquet
- predicate-heavy: read_geojson -> load cached polygons -> point_in_polygon -> filter -> DGA-backed to_parquet
- predicate-heavy-geopandas: read_geojson (pyogrio-first) -> covers -> filter -> to_parquet
- raster-to-vector: currently emitted as `deferred` until Phase 8 polygonize work lands
## Suites

- smoke: 1K rows, local verification only
- ci: 100K rows, intended for pull requests
- full: 100K and 1M rows, intended for main and manual GPU runs
Each pipeline/scale can be repeated with --repeat N. Reported wall-clock is
the median elapsed time across repeats. Device memory and movement counters are
reported conservatively from the worst observed sample.
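That aggregation rule can be sketched as follows, assuming hypothetical per-repeat sample dicts (the key names here are illustrative, not the artifact schema):

```python
import statistics

def aggregate_repeats(samples):
    """Reduce per-repeat samples to one reported result.

    Wall-clock uses the median (robust to one noisy repeat); memory and
    movement counters use the worst observed sample (conservative).
    """
    return {
        "elapsed_s": statistics.median(s["elapsed_s"] for s in samples),
        "peak_device_memory_bytes": max(
            s["peak_device_memory_bytes"] for s in samples
        ),
        "transfer_count": max(s["transfer_count"] for s in samples),
    }

samples = [
    {"elapsed_s": 1.20, "peak_device_memory_bytes": 900, "transfer_count": 4},
    {"elapsed_s": 0.95, "peak_device_memory_bytes": 1000, "transfer_count": 4},
    {"elapsed_s": 1.00, "peak_device_memory_bytes": 950, "transfer_count": 5},
]
print(aggregate_repeats(samples))
# {'elapsed_s': 1.0, 'peak_device_memory_bytes': 1000, 'transfer_count': 5}
```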
## Regression Rules
The regression checker currently fails when:
- wall-clock grows by more than 5%
- peak device memory grows by more than 10%
- host<->device transfer count increases
- host materialization count increases
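The four rules amount to one comparison per metric. A sketch, with a hypothetical function name and metric keys (the real logic lives in scripts/check_pipeline_regressions.py):

```python
def find_regressions(baseline, current, wall_clock_tol=0.05, memory_tol=0.10):
    """Apply the gate rules to one pipeline's baseline/current metrics."""
    failures = []
    if current["elapsed_s"] > baseline["elapsed_s"] * (1 + wall_clock_tol):
        failures.append("wall-clock grew > 5%")
    if (current["peak_device_memory_bytes"]
            > baseline["peak_device_memory_bytes"] * (1 + memory_tol)):
        failures.append("peak device memory grew > 10%")
    # Movement counters get zero tolerance: any increase fails the gate.
    if current["transfer_count"] > baseline["transfer_count"]:
        failures.append("transfer count increased")
    if current["materialization_count"] > baseline["materialization_count"]:
        failures.append("materialization count increased")
    return failures

baseline = {"elapsed_s": 1.00, "peak_device_memory_bytes": 1000,
            "transfer_count": 4, "materialization_count": 2}
current = {"elapsed_s": 1.04, "peak_device_memory_bytes": 1200,
           "transfer_count": 4, "materialization_count": 2}
print(find_regressions(baseline, current))  # ['peak device memory grew > 10%']
```

A 4% wall-clock increase passes (within the 5% tolerance) while the 20% memory growth fails, matching the asymmetric thresholds above.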
## Trace Contract

Each pipeline result includes:

- top-level `selected_runtime`, `planner_selected_runtime`, `transfer_count`, `materialization_count`, and `peak_device_memory_bytes`
- stage traces with a per-stage `device`
When a pipeline runs partly on GPU and partly on CPU, selected_runtime becomes
hybrid. This is intentional. The benchmark rail reports what actually
executed, not what the planner wished would execute.
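One way to derive that label, as a sketch (assuming each stage trace records a `device` of `"cpu"` or `"gpu"`; the function name is hypothetical):

```python
def selected_runtime(stage_devices):
    """Report what actually executed: one runtime, or 'hybrid' when mixed."""
    runtimes = set(stage_devices)
    return runtimes.pop() if len(runtimes) == 1 else "hybrid"

print(selected_runtime(["gpu", "gpu"]))         # gpu
print(selected_runtime(["gpu", "cpu", "gpu"]))  # hybrid
```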
Each stage may also carry:
- `requested_backend` / `actual_backend`
- `requested_mode` / `actual_mode`
- `fallback_note`
- `transfer_count_delta`
- `materialization_count_delta`
- `peak_device_memory_bytes`
That makes CPU<->GPU movement visible in the same artifact as the wall-clock timing.
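A hypothetical artifact satisfying this contract might look like the dict below. The field names follow the lists in this section; every value (stage names, backends, counts) is invented for illustration:

```python
# Invented example of one pipeline's result in the benchmark artifact.
result = {
    "pipeline": "predicate-heavy",
    "selected_runtime": "hybrid",          # what actually executed
    "planner_selected_runtime": "gpu",     # what the planner chose
    "transfer_count": 3,
    "materialization_count": 1,
    "peak_device_memory_bytes": 268435456,
    "stages": [
        {
            "name": "point_in_polygon",
            "device": "gpu",
            "requested_backend": "cuda",
            "actual_backend": "cuda",
            "transfer_count_delta": 1,
            "materialization_count_delta": 0,
        },
        {
            "name": "filter",
            "device": "cpu",
            "requested_backend": "cuda",
            "actual_backend": "numpy",
            "fallback_note": "predicate result materialized to host",
            "transfer_count_delta": 2,
            "materialization_count_delta": 1,
        },
    ],
}

# Per-stage deltas should reconcile with the pipeline totals.
assert (sum(s["transfer_count_delta"] for s in result["stages"])
        == result["transfer_count"])
print(result["selected_runtime"])  # hybrid
```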
## CI Workflow

.github/workflows/pipeline-benchmarks.yml runs the suite in two modes:

- CPU job: PRs run the `ci` suite; `main` and manual runs use the `full` suite
- optional GPU job on a self-hosted NVIDIA runner: `full` suite with `--nvtx`
The workflow runs the current commit, attempts the same suite on the base
commit in a detached worktree, stores both artifacts, and diffs them with
scripts/check_pipeline_regressions.py.
Bootstrap note: if the base commit predates these scripts, the workflow uploads the current artifact and records the baseline comparison as unavailable instead of pretending the gate ran.
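That fallback can be sketched as follows (function name, artifact keys, and the 5% threshold reuse are illustrative, not the workflow's actual code):

```python
import json
from pathlib import Path

def gate_status(baseline_path, current_path):
    """Diff artifacts only when a baseline actually exists.

    If the base commit predates the benchmark scripts, no baseline
    artifact was produced; report that honestly rather than passing
    an empty gate.
    """
    if not Path(baseline_path).exists():
        return "baseline-unavailable"
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(current_path).read_text())
    grew = current["elapsed_s"] > baseline["elapsed_s"] * 1.05
    return "regression" if grew else "ok"

print(gate_status("missing-baseline.json", "current.json"))
# baseline-unavailable
```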