# ADR0044 Rich Baseline 2026-04-25
Use this baseline as the ADR0044/0045 checkpoint before the next generalized performance pass.
## Intent
Capture where the repo stands after the private native execution substrate checkpoint and the benchmark harness repeat/JSON fixes. The target is not to optimize these specific workflows. They are measurement canaries for generic public API composition, transient work, and host/device transfer shape.
## Request Signals

- adr0044 baseline
- rich baseline
- workflow shootout
- operation benchmark
- transient D2H
## Open First

- `docs/testing/adr0044-rich-baseline-2026-04-25.md`
- `docs/testing/pipeline-benchmarks.md`
- `docs/testing/performance-tiers.md`
- `docs/dev/private-native-execution-substrate-plan.md`
## Verify

```shell
uv run python scripts/check_docs.py --check
uv run vsbench shootout benchmarks/shootout --scale 10k --repeat 3
uv run python scripts/benchmark_pipelines.py --suite full --repeat 1 --gpu-sparkline
```
## Risks

- Raw artifacts live under the ignored `benchmark_results/working/` and are not durable unless this note is updated.
- Sandbox GPU visibility can make timings meaningless; rerun performance commands outside the sandbox before comparing.
- Workflow wins are canaries only; do not overfit implementation strategy to a single shootout script.
## Environment

- Date: 2026-04-25.
- Commit: `bae6767` (Honor public CRS on native Arrow exports).
- GPU: NVIDIA GeForce RTX 4090.
- `CUDA_VISIBLE_DEVICES`: unset.
- Device nodes visible outside sandbox: `/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`, `/dev/nvidia-uvm-tools`, `/dev/nvidia-modeset`.
- Raw artifacts: `benchmark_results/working/rich_baseline_2026_04_25_head_bae6767/`.
- Artifact policy: `benchmark_results/` is ignored, so this tracked note is the durable repo record.
## Commands

```shell
env UV_CACHE_DIR=/tmp/uv-cache uv run --no-sync vsbench shootout benchmarks/shootout --scale 10k --repeat 3 --json --output benchmark_results/working/rich_baseline_2026_04_25_head_bae6767/shootouts_10k_repeat3.json --timeout 900
```

Operation baselines used `vsbench run <operation> --scale 10k --repeat 3 --json --quiet --output ...` for the operation variants listed below.

```shell
env UV_CACHE_DIR=/tmp/uv-cache uv run --no-sync python scripts/benchmark_pipelines.py --suite full --repeat 1 --gpu-sparkline --output benchmark_results/working/rich_baseline_2026_04_25_head_bae6767/pipelines_full_repeat1_gpu_sparkline.json
env UV_CACHE_DIR=/tmp/uv-cache uv run --no-sync python scripts/benchmark_pipelines.py --suite full --repeat 3 --profile-mode lean --output benchmark_results/working/rich_baseline_2026_04_25_head_bae6767/pipelines_full_repeat3_lean.json
```
## Summary

- Workflow shootouts: 14/14 passed fingerprint checks at 10K, repeat 3.
- Workflow geomean speedup: 1.11x vs GeoPandas.
- Workflow total median speedup: 1.38x vs GeoPandas (3.411s GeoPandas vs 2.468s vibeSpatial across all 14 workflows).
- Operation baselines now report real repeat summaries (`sample_count=3`).
- Pipeline full repeat-3 lean stays healthy at 1M: zero-transfer is 21.6ms with zero runtime D2H; join-heavy is 111.4ms with 24 runtime D2H transfers.
- Main warning signal: `gpu-dissolve method=disjoint_subset` is 4.50s vs the 71.2ms baseline, with repeated D2H warnings.
## Workflow Shootouts

| Workflow | GeoPandas ms | vibeSpatial ms | Speedup | Runtime D2H | Materializations | Fallbacks |
|---|---|---|---|---|---|---|
| | 217.1 | 257.4 | 0.84x | 247 | 12 | 0 |
| | 160.0 | 193.7 | 0.83x | 192 | 10 | 0 |
| | 95.6 | 162.8 | 0.59x | 226 | 9 | 0 |
| | 38.0 | 32.6 | 1.16x | 41 | 3 | 0 |
| | 124.3 | 145.5 | 0.85x | 174 | 12 | 1 |
| | 39.8 | 110.3 | 0.36x | 244 | 7 | 0 |
| | 97.5 | 47.3 | 2.06x | 27 | 1 | 0 |
| | 91.0 | 88.9 | 1.02x | 50 | 4 | 0 |
| | 64.7 | 58.8 | 1.10x | 51 | 6 | 0 |
| | 688.9 | 373.9 | 1.84x | 179 | 10 | 1 |
| | 612.9 | 296.2 | 2.07x | 81 | 9 | 0 |
| | 660.2 | 267.2 | 2.47x | 73 | 7 | 0 |
| | 227.0 | 222.1 | 1.02x | 185 | 8 | 1 |
| | 294.0 | 211.2 | 1.39x | 92 | 6 | 1 |
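As a cross-check, the summary's aggregate numbers can be recomputed from the per-workflow medians above (values re-typed from the GeoPandas and vibeSpatial columns of the table; this is a verification sketch, not part of the harness):

```python
import math

# Per-workflow median times (ms), re-typed from the shootout table above.
geopandas_ms = [217.1, 160.0, 95.6, 38.0, 124.3, 39.8, 97.5,
                91.0, 64.7, 688.9, 612.9, 660.2, 227.0, 294.0]
vibespatial_ms = [257.4, 193.7, 162.8, 32.6, 145.5, 110.3, 47.3,
                  88.9, 58.8, 373.9, 296.2, 267.2, 222.1, 211.2]

# Per-workflow speedup is GeoPandas time over vibeSpatial time.
speedups = [g / v for g, v in zip(geopandas_ms, vibespatial_ms)]

# Geometric mean of per-workflow speedups (the "geomean speedup" line).
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Ratio of total medians (the "total median speedup" line).
total_ratio = sum(geopandas_ms) / sum(vibespatial_ms)

print(f"geomean speedup: {geomean:.2f}x")   # 1.11x
print(f"total ratio:     {total_ratio:.2f}x")  # 1.38x
print(f"totals: {sum(geopandas_ms) / 1000:.3f}s GeoPandas "
      f"vs {sum(vibespatial_ms) / 1000:.3f}s vibeSpatial")
```

Both aggregates land on the summary's 1.11x and 1.38x, with totals of 3.411s vs 2.468s.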
## Operation Baselines

| Operation variant | vibeSpatial ms | Baseline | Baseline ms | Speedup | Samples |
|---|---|---|---|---|---|
| | 1.171 | shapely | 0.421 | 0.36x | 3 |
| | 1.156 | shapely | 0.428 | 0.37x | 3 |
| | 1.140 | shapely | 1.065 | 0.93x | 3 |
| | 0.225 | shapely | 0.100 | 0.44x | 3 |
| | 3.199 | shapely | 3.880 | 1.21x | 3 |
| | 0.972 | shapely | 2.387 | 2.46x | 3 |
| | 4.709 | shapely-coverage | 13.620 | 2.89x | 3 |
| | 4502.373 | shapely-disjoint_subset | 71.228 | 0.02x | 3 |
| | 70.557 | shapely-unary | 68.657 | 0.97x | 3 |
| | 63.008 | shapely-strtree-intersection | 233.404 | 3.70x | 3 |
| | 12.441 | baseline | 20.491 | 1.65x | 3 |
| | 1.420 | shapely_strtree | 3.034 | 2.14x | 3 |
| | 1.527 | shapely_strtree | 9.388 | 6.15x | 3 |
| | 214.995 | shapely | 311.595 | 1.45x | 3 |
| | 0.560 | shapely | 35.265 | 62.96x | 3 |
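The Speedup column is baseline ms over vibeSpatial ms. A small sketch over the table's values (re-typed here, with rows keyed by their Baseline label) flags every variant slower than its baseline, which surfaces the `disjoint_subset` outlier named in the summary:

```python
# (baseline label, vibeSpatial ms, baseline ms), re-typed from the table above.
rows = [
    ("shapely", 1.171, 0.421),
    ("shapely", 1.156, 0.428),
    ("shapely", 1.140, 1.065),
    ("shapely", 0.225, 0.100),
    ("shapely", 3.199, 3.880),
    ("shapely", 0.972, 2.387),
    ("shapely-coverage", 4.709, 13.620),
    ("shapely-disjoint_subset", 4502.373, 71.228),
    ("shapely-unary", 70.557, 68.657),
    ("shapely-strtree-intersection", 63.008, 233.404),
    ("baseline", 12.441, 20.491),
    ("shapely_strtree", 1.420, 3.034),
    ("shapely_strtree", 1.527, 9.388),
    ("shapely", 214.995, 311.595),
    ("shapely", 0.560, 35.265),
]

# Speedup = baseline ms / vibeSpatial ms; anything below 1.0x is a regression
# against the baseline library.
slower = [(label, round(base / vs, 2)) for label, vs, base in rows if base < vs]
for label, speedup in slower:
    print(f"{speedup:5.2f}x  {label}")
```

Six variants come out below 1.0x, and the worst by far is the disjoint_subset dissolve at 0.02x.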
## Pipeline Baselines

| Pipeline | Scale | Elapsed ms | Runtime | Runtime D2H | Runtime D2H MB | Materializations | Fallbacks |
|---|---|---|---|---|---|---|---|
| | 100000 | 40.1 | hybrid | 24 | 2.63 | 0 | 0 |
| | 100000 | 14.2 | gpu | 3 | 0.00 | 0 | 0 |
| | 100000 | 119.7 | hybrid | 11 | 0.33 | 2 | 0 |
| | 100000 | 10.7 | hybrid | 3 | 0.60 | 0 | 0 |
| | 100000 | 12.1 | gpu | 6 | 1.20 | 0 | 0 |
| | 100000 | 10.1 | gpu | 0 | 0.00 | 0 | 0 |
| | 1000000 | 111.4 | hybrid | 24 | 25.97 | 0 | 0 |
| | 1000000 | 25.1 | gpu | 3 | 0.00 | 0 | 0 |
| | 1000000 | 119.7 | hybrid | 11 | 0.33 | 2 | 0 |
| | 1000000 | 22.8 | hybrid | 3 | 6.00 | 0 | 0 |
| | 1000000 | 42.1 | gpu | 6 | 12.00 | 0 | 0 |
| | 1000000 | 21.6 | gpu | 0 | 0.00 | 0 | 0 |
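One way to read the 1M row the summary identifies as join-heavy (111.4ms, hybrid, 24 transfers, 25.97 MB) is by average payload per transfer; many small transfers rather than a few large ones is exactly the transient-D2H shape this note is tracking. A quick sketch with those values re-typed:

```python
# Join-heavy 1M row from the pipeline table above: 24 runtime D2H transfers
# moving 25.97 MB in total.
transfers, total_mb = 24, 25.97

# Average payload per device-to-host transfer.
mb_per_transfer = total_mb / transfers
print(f"{mb_per_transfer:.2f} MB per D2H transfer")
```

At roughly 1.08 MB per transfer, the cost is dominated by transfer count and sync points, not bandwidth.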
## Profile Gate

Top 1M audit stages from `--gpu-sparkline`:

| Pipeline | Total ms | Runtime D2H | Runtime D2H MB | Largest stages |
|---|---|---|---|---|
| | 98.4 | 24 | 25.97 | |
| | 24.0 | 3 | 0.00 | |
| | 128.3 | 11 | 0.33 | |
| | 28.8 | 3 | 6.00 | |
| | 89.3 | 6 | 12.00 | |
| | 22.1 | 0 | 0.00 | |
## Interpretation
The ADR0044 substrate is helping where workflows can stay in larger native chunks: site suitability, retail, nearby buildings, redevelopment, and vegetation corridor are all faster than GeoPandas. The slow workflows are not missing “GPU work” in a simple sense; they are dominated by many small public API transitions, runtime D2H checks, and compatibility materializations.
The operation table shows the same shape. High-throughput native kernels are
excellent once the work is large enough (stroke-point-buffer, spatial query,
overlay), while tiny bounds and binary predicate calls are still slower than
Shapely because launch/composition overhead dominates.
The highest priority generic fixes remain: remove transient runtime D2H from
predicate/bounds/buffer setup paths, make public rowset/copy/filter composition
consume private native state without compatibility exports, and quarantine or
rewrite bad-shape modes such as `dissolve(method="disjoint_subset")`.