IO Performance Rails¶
This repo has public IO benchmark entry points plus internal component rails for the accelerated format work in Phase 6d.
Intent¶
Turn IO performance claims into repeatable suite artifacts without forcing private fast paths through user-facing benchmark scripts.
Request Signals¶
io benchmark
io performance
throughput floor
format benchmark
io rail
Open First¶
docs/testing/io-performance-rails.md
scripts/benchmark_io_arrow.py
scripts/benchmark_io_file.py
Verify¶
uv run python scripts/benchmark_io_arrow.py --suite smokeuv run python scripts/benchmark_io_file.py --suite smokeuv run python scripts/check_docs.py --check
Risks¶
Drift in performance baselines if the rail does not run regularly.
GeoJSON informational-only status masking real regression.
Throughput floors set too low to catch meaningful regressions.
Entry Points¶
Run the public Arrow/WKB API suite:
uv run python scripts/benchmark_io_arrow.py --suite all
Run the public file-format API suite:
uv run python scripts/benchmark_io_file.py --suite all
Both entry points also accept --suite smoke and --suite ci. These script
suites call the registered public vsbench operations only. Low-level
GeoArrow, WKB, GeoParquet planner/scan, and direct-parser file component rails
remain internal health surfaces under src/vibespatial/bench/io_benchmark_rails.py
and tests/test_io_benchmark_rails.py.
Suites¶
smoke: local verification at10K.ci:100K-class runs intended for pull requests.all:10K,100K, and1Mwhere practical.Internal component rails may use smaller high-cost scales for polygon-heavy paths.
10M is still treated as a manual deep-run scale for the cheapest point-heavy
paths. It is not part of the default all suite because the rail is meant to
run often enough to catch drift, not only on rare manual sweeps.
Result Contract¶
Public script cases report the standard BenchmarkResult schema:
operation, scale, geometry type, input format, and status
timing summary
reference baseline timing and speedup when available
operation metadata
Internal component rail cases report:
rows_inputrows_decodedbytes_scannedcopies_madefallback_pool_sharebaseline and candidate throughput when a speedup target exists
target_floorstatus
The status vocabulary is:
passfailinformationalunavailable
informational is reserved for internal component cases that are tracked but
not currently enforced.
Current Enforcement¶
Internal component floors currently cover:
GeoArrow aligned import and export
native GeoArrow decode and encode
WKB decode and encode
GeoParquet selective decode avoidance
GeoParquet GPU scan when
pylibcudfis availableShapefile ingest for point-heavy, line-heavy, and polygon-heavy inputs
Informational-only cases currently cover:
GeoJSON ingest
mixed-family WKB decode watch runs
Notes¶
copies_madeis a conservative path-level estimate, not a byte-accurate allocator trace.GeoJSON is present in the rail so drift stays visible, but it is not part of the hard floor gate until the device-rowized tokenizer path actually wins.
Pair these rails with profiling-rails.md and pipeline-benchmarks.md when tracing CPU/GPU movement through end-to-end workloads.