ADR0044 Rich Baseline 2026-04-25

Use this baseline as the ADR0044/0045 checkpoint before the next generalized performance pass.

Intent

Capture where the repo stands after the private native execution substrate checkpoint and the benchmark harness repeat/JSON fixes. The goal is not to optimize these specific workflows; they are measurement canaries for generic public API composition, transient work, and host/device transfer shape.

Request Signals

adr0044 baseline
rich baseline
workflow shootout
operation benchmark
transient D2H

Open First

docs/testing/adr0044-rich-baseline-2026-04-25.md
docs/testing/pipeline-benchmarks.md
docs/testing/performance-tiers.md
docs/dev/private-native-execution-substrate-plan.md

Verify

uv run python scripts/check_docs.py --check
uv run vsbench shootout benchmarks/shootout --scale 10k --repeat 3
uv run python scripts/benchmark_pipelines.py --suite full --repeat 1 --gpu-sparkline

Risks

Raw artifacts live under the ignored benchmark_results/working/ directory, so their numbers are durable only if they are recorded in this note.
Sandbox GPU visibility can make timings meaningless; rerun performance commands outside the sandbox before comparing.
Workflow wins are canaries only; do not overfit implementation strategy to a single shootout script.

Environment

Date: 2026-04-25.
Commit: bae6767 (Honor public CRS on native Arrow exports).
GPU: NVIDIA GeForce RTX 4090.
CUDA_VISIBLE_DEVICES: unset.
Device nodes visible outside sandbox: /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm, /dev/nvidia-uvm-tools, /dev/nvidia-modeset.
Raw artifacts: benchmark_results/working/rich_baseline_2026_04_25_head_bae6767/.
Artifact policy: benchmark_results/ is ignored, so this tracked note is the durable repo record.
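
Because sandbox GPU visibility can invalidate timings, a quick pre-flight check before rerunning the performance commands may help. The following is a minimal sketch, not part of the repo: the helper name `missing_device_nodes` is hypothetical, and it only tests that the device nodes listed above exist as paths. A node being present does not by itself prove the GPU is usable.

```python
from pathlib import Path

# Device nodes this note recorded as visible outside the sandbox.
EXPECTED_NODES = [
    "/dev/nvidia0",
    "/dev/nvidiactl",
    "/dev/nvidia-uvm",
    "/dev/nvidia-uvm-tools",
    "/dev/nvidia-modeset",
]


def missing_device_nodes(nodes=EXPECTED_NODES):
    """Return the expected device nodes that are absent on this host.

    Hypothetical helper for illustration; it checks path existence only.
    """
    return [node for node in nodes if not Path(node).exists()]


if __name__ == "__main__":
    missing = missing_device_nodes()
    if missing:
        print("GPU device nodes not visible; timings may be meaningless:", missing)
    else:
        print("All expected GPU device nodes are visible.")
```

If any node is reported missing, the run is probably inside the sandbox and the timings should not be compared against this baseline.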

Commands

env UV_CACHE_DIR=/tmp/uv-cache uv run --no-sync vsbench shootout benchmarks/shootout --scale 10k --repeat 3 --json --output benchmark_results/working/rich_baseline_2026_04_25_head_bae6767/shootouts_10k_repeat3.json --timeout 900

Operation baselines used vsbench run <operation> --scale 10k --repeat 3 --json --quiet --output ... for the operation variants listed below.

env UV_CACHE_DIR=/tmp/uv-cache uv run --no-sync python scripts/benchmark_pipelines.py --suite full --repeat 1 --gpu-sparkline --output benchmark_results/working/rich_baseline_2026_04_25_head_bae6767/pipelines_full_repeat1_gpu_sparkline.json
env UV_CACHE_DIR=/tmp/uv-cache uv run --no-sync python scripts/benchmark_pipelines.py --suite full --repeat 3 --profile-mode lean --output benchmark_results/working/rich_baseline_2026_04_25_head_bae6767/pipelines_full_repeat3_lean.json

Summary

Workflow shootouts: 14/14 passed fingerprint checks at 10K, repeat 3.
Workflow geomean speedup: 1.11x vs GeoPandas.
Workflow total median speedup: 1.38x vs GeoPandas, 3.411s GeoPandas vs 2.468s vibeSpatial across all 14 workflows.
Operation baselines now report real repeat summaries (sample_count=3).
Pipeline full repeat-3 lean stays healthy at 1M: zero-transfer is 21.6ms with zero runtime D2H; join-heavy is 111.4ms with 24 runtime D2H transfers.
Main warning signal: gpu-dissolve method=disjoint_subset is 4.50s vs a 71.2ms baseline (about 63x slower), with repeated D2H warnings.

Workflow Shootouts

Workflow	GeoPandas ms	vibeSpatial ms	Speedup	Runtime D2H	Materializations	Fallbacks
`accessibility_redevelopment.py`	217.1	257.4	0.84x	247	12	0
`corridor_flood_priority.py`	160.0	193.7	0.83x	192	10	0
`emergency_response_catchments.py`	95.6	162.8	0.59x	226	9	0
`flood_exposure.py`	38.0	32.6	1.16x	41	3	0
`habitat_corridor_compliance.py`	124.3	145.5	0.85x	174	12	1
`insurance_flood_screening.py`	39.8	110.3	0.36x	244	7	0
`nearby_buildings.py`	97.5	47.3	2.06x	27	1	0
`network_service_area.py`	91.0	88.9	1.02x	50	4	0
`parcel_zoning.py`	64.7	58.8	1.10x	51	6	0
`redevelopment_screening.py`	688.9	373.9	1.84x	179	10	1
`retail_trade_area_screening.py`	612.9	296.2	2.07x	81	9	0
`site_suitability.py`	660.2	267.2	2.47x	73	7	0
`transit_service_gap.py`	227.0	222.1	1.02x	185	8	1
`vegetation_corridor.py`	294.0	211.2	1.39x	92	6	1
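
The two headline speedups in the Summary can be re-derived from this table. The sketch below assumes the geomean is taken over per-workflow ratios of GeoPandas ms to vibeSpatial ms, and that the total speedup is the ratio of the summed columns; the harness may compute them from unrounded timings, so small rounding differences are possible. The function names are illustrative only.

```python
import math

# (GeoPandas ms, vibeSpatial ms) copied from the table above.
WORKFLOWS = {
    "accessibility_redevelopment.py": (217.1, 257.4),
    "corridor_flood_priority.py": (160.0, 193.7),
    "emergency_response_catchments.py": (95.6, 162.8),
    "flood_exposure.py": (38.0, 32.6),
    "habitat_corridor_compliance.py": (124.3, 145.5),
    "insurance_flood_screening.py": (39.8, 110.3),
    "nearby_buildings.py": (97.5, 47.3),
    "network_service_area.py": (91.0, 88.9),
    "parcel_zoning.py": (64.7, 58.8),
    "redevelopment_screening.py": (688.9, 373.9),
    "retail_trade_area_screening.py": (612.9, 296.2),
    "site_suitability.py": (660.2, 267.2),
    "transit_service_gap.py": (227.0, 222.1),
    "vegetation_corridor.py": (294.0, 211.2),
}


def geomean_speedup(rows):
    """Geometric mean of per-workflow speedups (GeoPandas ms / vibeSpatial ms)."""
    ratios = [gp / vs for gp, vs in rows.values()]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def total_speedup(rows):
    """Ratio of summed GeoPandas time to summed vibeSpatial time."""
    total_gp = sum(gp for gp, _ in rows.values())
    total_vs = sum(vs for _, vs in rows.values())
    return total_gp / total_vs


print(f"geomean speedup: {geomean_speedup(WORKFLOWS):.2f}x")
print(f"total speedup:   {total_speedup(WORKFLOWS):.2f}x")
```

The geomean weights every workflow equally, while the total is dominated by the three longest workflows, which is why the total (1.38x) is higher than the geomean (1.11x).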

Operation Baselines

Operation variant	vibeSpatial ms	Baseline	Baseline ms	Speedup	Samples
`binary-predicates-contains`	1.171	shapely	0.421	0.36x	3
`binary-predicates-covered-by`	1.156	shapely	0.428	0.37x	3
`binary-predicates-intersects`	1.140	shapely	1.065	0.93x	3
`bounds`	0.225	shapely	0.100	0.44x	3
`clip-rect-line`	3.199	shapely	3.880	1.21x	3
`clip-rect-polygon`	0.972	shapely	2.387	2.46x	3
`gpu-dissolve-coverage`	4.709	shapely-coverage	13.620	2.89x	3
`gpu-dissolve-disjoint-subset`	4502.373	shapely-disjoint_subset	71.228	0.02x	3
`gpu-dissolve-unary`	70.557	shapely-unary	68.657	0.97x	3
`gpu-overlay`	63.008	shapely-strtree-intersection	233.404	3.70x	3
`make-valid`	12.441	baseline	20.491	1.65x	3
`spatial-query-overlap02`	1.420	shapely_strtree	3.034	2.14x	3
`spatial-query-overlap08`	1.527	shapely_strtree	9.388	6.15x	3
`stroke-offset-curve`	214.995	shapely	311.595	1.45x	3
`stroke-point-buffer`	0.560	shapely	35.265	62.96x	3

Pipeline Baselines

Pipeline	Scale	Elapsed ms	Runtime	Runtime D2H	Runtime D2H MB	Materializations
`join-heavy`	100000	40.1	hybrid	24	2.63	0
`relation-semijoin`	100000	14.2	gpu	3	0.00	0
`small-grouped-constructive-reduce`	100000	119.7	hybrid	11	0.33	2
`constructive`	100000	10.7	hybrid	3	0.60	0
`predicate-heavy`	100000	12.1	gpu	6	1.20	0
`zero-transfer`	100000	10.1	gpu	0	0.00	0
`join-heavy`	1000000	111.4	hybrid	24	25.97	0
`relation-semijoin`	1000000	25.1	gpu	3	0.00	0
`small-grouped-constructive-reduce`	1000000	119.7	hybrid	11	0.33	2
`constructive`	1000000	22.8	hybrid	3	6.00	0
`predicate-heavy`	1000000	42.1	gpu	6	12.00	0
`zero-transfer`	1000000	21.6	gpu	0	0.00	0
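
To read the transfer shape out of this table, it can help to compare the 1M rows against the 100K rows directly. The sketch below is an illustration over the numbers above, with hypothetical names; it reports how elapsed time, D2H transfer count, and D2H volume change for a 10x increase in scale.

```python
# (elapsed ms, runtime D2H count, runtime D2H MB) per pipeline, from the table above.
AT_100K = {
    "join-heavy": (40.1, 24, 2.63),
    "relation-semijoin": (14.2, 3, 0.00),
    "small-grouped-constructive-reduce": (119.7, 11, 0.33),
    "constructive": (10.7, 3, 0.60),
    "predicate-heavy": (12.1, 6, 1.20),
    "zero-transfer": (10.1, 0, 0.00),
}
AT_1M = {
    "join-heavy": (111.4, 24, 25.97),
    "relation-semijoin": (25.1, 3, 0.00),
    "small-grouped-constructive-reduce": (119.7, 11, 0.33),
    "constructive": (22.8, 3, 6.00),
    "predicate-heavy": (42.1, 6, 12.00),
    "zero-transfer": (21.6, 0, 0.00),
}


def scaling(small, large):
    """Compare each pipeline at the larger scale against the smaller one."""
    out = {}
    for name, (ms_s, n_s, mb_s) in small.items():
        ms_l, n_l, mb_l = large[name]
        out[name] = {
            "elapsed_ratio": ms_l / ms_s,
            "d2h_count_delta": n_l - n_s,
            # Ratio is undefined when nothing was transferred at the small scale.
            "d2h_mb_ratio": (mb_l / mb_s) if mb_s else None,
        }
    return out


for name, row in scaling(AT_100K, AT_1M).items():
    print(name, row)
```

On these numbers the transfer count is identical at both scales for every pipeline, while the transferred volume grows roughly 10x where it is nonzero, except for small-grouped-constructive-reduce, whose row is unchanged between scales.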

Profile Gate

Top 1M audit stages from --gpu-sparkline:

Pipeline	Total ms	Runtime D2H	Runtime D2H MB	Largest stages
`join-heavy`	98.4	24	25.97	`dissolve_groups` 47.9ms CPU, `assemble_join_rows` 18.1ms GPU, `sjoin_query` 16.7ms GPU
`relation-semijoin`	24.0	3	0.00	`read_inputs` 14.7ms GPU, `write_output` 4.7ms GPU, `subset_rows` 1.5ms GPU
`small-grouped-constructive-reduce`	128.3	11	0.33	`shapely_reference` 72.7ms CPU, `build_device_grouped_polygons` 35.4ms GPU, `native_grouped_union` 19.3ms GPU
`constructive`	28.8	3	6.00	`write_output` 21.0ms GPU, `read_points` 3.3ms GPU, `buffer_points` 1.7ms GPU
`predicate-heavy`	89.3	6	12.00	`read_geojson` 60.9ms GPU, `load_polygons` 10.1ms GPU, `point_in_polygon` 3.4ms GPU
`zero-transfer`	22.1	0	0.00	`read_input` 13.8ms GPU, `write_output` 4.6ms GPU, `subset_rows` 1.6ms GPU

Interpretation

The ADR0044 substrate is helping where workflows can stay in larger native chunks: site suitability, retail, nearby buildings, redevelopment, and vegetation corridor are all faster than GeoPandas. The slow workflows are not missing “GPU work” in a simple sense; they are dominated by many small public API transitions, runtime D2H checks, and compatibility materializations.
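
One rough way to check this reading against the Workflow Shootouts table is to split the workflows by whether they beat GeoPandas and compare average runtime D2H counts and materializations. This is only a sketch over the published numbers, with illustrative names; a difference in group means is suggestive, not proof of cause.

```python
from statistics import mean

# (speedup, runtime D2H, materializations) from the Workflow Shootouts table.
ROWS = {
    "accessibility_redevelopment.py": (0.84, 247, 12),
    "corridor_flood_priority.py": (0.83, 192, 10),
    "emergency_response_catchments.py": (0.59, 226, 9),
    "flood_exposure.py": (1.16, 41, 3),
    "habitat_corridor_compliance.py": (0.85, 174, 12),
    "insurance_flood_screening.py": (0.36, 244, 7),
    "nearby_buildings.py": (2.06, 27, 1),
    "network_service_area.py": (1.02, 50, 4),
    "parcel_zoning.py": (1.10, 51, 6),
    "redevelopment_screening.py": (1.84, 179, 10),
    "retail_trade_area_screening.py": (2.07, 81, 9),
    "site_suitability.py": (2.47, 73, 7),
    "transit_service_gap.py": (1.02, 185, 8),
    "vegetation_corridor.py": (1.39, 92, 6),
}


def summarize(group):
    """Average D2H count and materializations for a list of table rows."""
    return {
        "count": len(group),
        "mean_d2h": mean(r[1] for r in group),
        "mean_materializations": mean(r[2] for r in group),
    }


def group_means(rows):
    """Split workflows at speedup 1.0x and summarize each side."""
    slower = [r for r in rows.values() if r[0] < 1.0]
    faster = [r for r in rows.values() if r[0] >= 1.0]
    return {"slower": summarize(slower), "faster": summarize(faster)}


for label, stats in group_means(ROWS).items():
    print(label, stats)
```

The five slower workflows average about 217 runtime D2H events and 10 materializations, against about 87 and 6 for the nine faster ones. Outliers such as `redevelopment_screening.py` (179 D2H events, 1.84x) show that transfer count alone does not decide the outcome.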

The operation table shows the same shape. High-throughput native kernels are excellent once the work is large enough (stroke-point-buffer, spatial query, overlay), while tiny bounds and binary predicate calls are still slower than Shapely because launch/composition overhead dominates.
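
The same point can be read mechanically from the Operation Baselines table by listing the variants that lose to their baseline. The sketch below copies the table values and uses a hypothetical helper name; speedup is computed as baseline ms divided by vibeSpatial ms, matching the table's Speedup column.

```python
# (vibeSpatial ms, baseline ms) from the Operation Baselines table.
OPS = {
    "binary-predicates-contains": (1.171, 0.421),
    "binary-predicates-covered-by": (1.156, 0.428),
    "binary-predicates-intersects": (1.140, 1.065),
    "bounds": (0.225, 0.100),
    "clip-rect-line": (3.199, 3.880),
    "clip-rect-polygon": (0.972, 2.387),
    "gpu-dissolve-coverage": (4.709, 13.620),
    "gpu-dissolve-disjoint-subset": (4502.373, 71.228),
    "gpu-dissolve-unary": (70.557, 68.657),
    "gpu-overlay": (63.008, 233.404),
    "make-valid": (12.441, 20.491),
    "spatial-query-overlap02": (1.420, 3.034),
    "spatial-query-overlap08": (1.527, 9.388),
    "stroke-offset-curve": (214.995, 311.595),
    "stroke-point-buffer": (0.560, 35.265),
}


def slower_than_baseline(ops):
    """Return (name, speedup) pairs with speedup below 1.0, slowest first."""
    pairs = [(name, base / vs) for name, (vs, base) in ops.items()]
    return sorted((p for p in pairs if p[1] < 1.0), key=lambda p: p[1])


for name, speedup in slower_than_baseline(OPS):
    vs_ms, base_ms = OPS[name]
    print(f"{name}: {speedup:.2f}x ({vs_ms} ms vs {base_ms} ms)")
```

Apart from the two dissolve variants, every entry this lists has a baseline near or below 1 ms, which fits the reading that fixed launch and composition overhead dominates at that size.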

The highest-priority generic fixes remain: remove transient runtime D2H from predicate/bounds/buffer setup paths, make public rowset/copy/filter composition consume private native state without compatibility exports, and quarantine or rewrite bad-shape modes such as dissolve(method="disjoint_subset").