Performance Tiers¶
Define the minimum performance gates for GPU-first kernel work before the benchmark harness and synthetic datasets are fully implemented.
Intent¶
Set explicit speedup floors, aspirational targets, and reference benchmark rules so each kernel can declare success against the same denominator.
Request Signals¶
benchmark
performance
perf gates
speedup
throughput
latency
kernel tier
Open First¶
docs/testing/performance-tiers.md
docs/implementation-order.md
docs/testing/upstream-inventory.md
src/vibespatial/runtime/_runtime.py
Verify¶
uv run python scripts/check_docs.py --check
uv run python scripts/intake.py "define performance tier gates for GPU kernels"
Risks¶
Speedup targets can become meaningless if the reference denominator shifts.
Small synthetic cases can overstate GPU wins by hiding transfer and setup costs.
Gates that are too strict too early can block correct-but-incomplete kernel landings.
Denominator¶
Baseline comparisons are against single-threaded GeoPandas or Shapely on the same machine unless a benchmark explicitly documents a different host-side denominator.
The gate measures steady-state kernel-path speedup at the reference scale, not cold-start import time or one-time environment setup.
CPU fallback paths do not count as passing a GPU benchmark gate.
A kernel must state both its expected tier and the benchmark command or harness entrypoint that will eventually enforce the gate.
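The steady-state rule above can be sketched as a small timing helper. This is a minimal illustration, not the project's benchmark harness: the names `steady_state_seconds` and `speedup` are hypothetical, and the warm-up and repeat counts are arbitrary assumptions.

```python
import statistics
import time
from typing import Callable


def steady_state_seconds(fn: Callable[[], object], warmup: int = 2, repeats: int = 5) -> float:
    """Return the median wall-clock time of fn(), excluding warm-up calls."""
    for _ in range(warmup):
        # Warm-up calls absorb one-time costs (imports, caches, compilation)
        # so they stay outside the timed region.
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Assumption: for a GPU path, fn blocks until device work has finished.
        # Asynchronous launches would otherwise be under-measured.
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Speedup of the candidate over the host baseline (baseline / candidate)."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / candidate_seconds
```

Both the host baseline and the kernel path would be timed with the same helper on the same machine, so the ratio keeps a consistent denominator.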
Reference Scale¶
Required scales: 10K, 100K, and 1M geometries.
The default gate scale is 100K mixed polygons unless the operation is clearly point- or IO-dominated.
10K exists to observe crossover behavior and dispatch thresholds.
1M exists to catch memory-pressure and batching regressions once kernels move beyond smoke status.
Use these reference dataset families:
uniform grids for regular work and predictable memory access
polygon-heavy parcel-like subdivisions for overlay, clip, and dissolve
point clouds for joins, distance, and coarse-filter workloads
admin-boundary style polygon sets for real-world irregularity
o17.1.8 should generate license-free versions of these families instead of
checking in sourced benchmark data.
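As a rough sketch of what license-free synthetic families could look like, the snippet below generates a uniform grid of square rings and a seeded point cloud using only the standard library. The function names, the `SCALES` mapping, and the plain coordinate-tuple representation are assumptions for illustration. They do not reflect what o17.1.8 will actually produce.

```python
import math
import random

# Reference scales named in this document, as geometry counts.
SCALES = {"10K": 10_000, "100K": 100_000, "1M": 1_000_000}


def uniform_grid(n: int, cell: float = 1.0) -> list[list[tuple[float, float]]]:
    """Return n closed square rings laid out row by row on a near-square grid."""
    cols = math.ceil(math.sqrt(n))
    rings = []
    for i in range(n):
        x0 = (i % cols) * cell
        y0 = (i // cols) * cell
        # Five vertices: the ring is closed by repeating the first vertex.
        rings.append([
            (x0, y0),
            (x0 + cell, y0),
            (x0 + cell, y0 + cell),
            (x0, y0 + cell),
            (x0, y0),
        ])
    return rings


def point_cloud(n: int, extent: float = 1000.0, seed: int = 0) -> list[tuple[float, float]]:
    """Return n uniformly distributed points, reproducible for a given seed."""
    rng = random.Random(seed)
    return [(rng.uniform(0.0, extent), rng.uniform(0.0, extent)) for _ in range(n)]
```

Seeding the generator keeps runs comparable across machines. The parcel-like and admin-boundary families need irregular shapes and are not covered by this sketch.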
Tier Table¶
| Tier | Parallelism shape | Gate | Aspirational | Example operations |
|---|---|---|---|---|
| 5 | embarrassingly parallel | 100x | 1000x | bounds, centroid, area, length, affine transforms, SFC keys, coordinate access |
| 4 | per-geometry parallel | 20x | 100x | buffer, simplify, convex hull, point-in-polygon, unary predicates |
| 3 | filtered parallel | 10x | 50x | sjoin, sindex query, nearest |
| 2 | structured parallel | 5x | 20x | clip, intersection, union, difference, dissolve |
| 1 | external-bound | 1x | 3x to 5x | file IO, GDAL-mediated reads, format parsing |
CRS transforms are out of scope for these tiers because that work is expected
to route through cuProj policy later in o17.6.3.
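The tier table can also be expressed as data so that a measured speedup is classified mechanically. The following is a minimal sketch: the `Tier` dataclass, the `TIERS` mapping, and `classify` are hypothetical names, and the Tier 1 aspirational value uses the lower bound of the "3x to 5x" range as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    level: int
    shape: str
    gate: float          # minimum speedup over the host baseline
    aspirational: float  # aspirational speedup target


# Values copied from the tier table in this document.
TIERS = {
    5: Tier(5, "embarrassingly parallel", 100.0, 1000.0),
    4: Tier(4, "per-geometry parallel", 20.0, 100.0),
    3: Tier(3, "filtered parallel", 10.0, 50.0),
    2: Tier(2, "structured parallel", 5.0, 20.0),
    # Assumption: the table gives "3x to 5x"; the lower bound is used here.
    1: Tier(1, "external-bound", 1.0, 3.0),
}


def classify(level: int, measured_speedup: float) -> str:
    """Classify a measured speedup against the declared tier."""
    tier = TIERS[level]
    if measured_speedup >= tier.aspirational:
        return "aspirational"
    if measured_speedup >= tier.gate:
        return "gate"
    return "below-gate"
```

For example, a Tier 4 kernel measured at 25x would classify as `"gate"`, while a Tier 5 kernel at 99.9x would classify as `"below-gate"`.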
Tier Rules¶
Tier 5 kernels should usually be memory-bandwidth bound. If they do not beat the gate, treat the implementation as suspect until profiling proves otherwise.
Tier 4 kernels may have variable geometry complexity, but they should still scale mostly with per-geometry independence.
Tier 3 kernels must include the coarse-filter stage in benchmark accounting. Reporting only the refine pass is not acceptable.
Tier 2 kernels may land below aspirational targets early, but the minimum gate still applies once correctness and batching stabilize.
Tier 1 work is allowed to land with parity-only performance if the bottleneck is dominated by host parsing, legacy libraries, or disk.
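The Tier 3 accounting rule can be made concrete with a small calculation. This is an illustrative sketch with made-up timings. The function name `tier3_speedup` is hypothetical, and the only point is that the coarse-filter time belongs in the denominator.

```python
def tier3_speedup(host_seconds: float, coarse_filter_seconds: float, refine_seconds: float) -> float:
    """Speedup for a filtered-parallel kernel, counting both device stages."""
    device_total = coarse_filter_seconds + refine_seconds
    if host_seconds <= 0 or device_total <= 0:
        raise ValueError("timings must be positive")
    return host_seconds / device_total


# Made-up timings: a 5.0 s host baseline, 0.375 s coarse filter, 0.125 s refine.
full = tier3_speedup(5.0, 0.375, 0.125)   # 5.0 / 0.5
refine_only = 5.0 / 0.125                 # what a refine-only report would claim
```

With these numbers the full pipeline reaches 10x, exactly at the Tier 3 gate, while a refine-only report would claim 40x.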
Mapping To Roadmap¶
Phase 2 geometry-buffer kernels should mostly declare Tier 5 or Tier 4.
Phase 3 indexing work should declare Tier 5 for pair generation and Tier 3 for query paths.
Phase 4 predicates and joins should mostly declare Tier 3, with some Tier 4 unary predicate coverage.
Phase 5 overlay and constructive geometry should declare Tier 2 unless a narrower fast path clearly fits Tier 4.
Phase 6b IO work should default to Tier 1 unless a GPU-native scanner moves parsing and filtering onto the device.
Acceptance Policy¶
Every kernel-oriented task must name its tier in the description or notes.
Every benchmark result should report:
dataset family
scale
requested runtime mode
selected runtime mode
speedup versus the documented host baseline
Gates are enforced first in docs and manual benchmark runs, then in o17.1.3 and o17.1.7 once benchmark rails and CI are in place.
Falling below the gate is allowed only with an explicit blocker or follow-up explaining why the kernel still needs to land.
o17.2.7 should use the 10K and 100K scales to reason about dispatch crossover thresholds.
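The reporting fields listed above can be captured in one record, together with the rule that a CPU fallback never passes a GPU gate. This is a hedged sketch: `BenchmarkResult`, `passes_gate`, and the mode strings `"gpu"`, `"cpu"`, and `"auto"` are assumptions for illustration and are not taken from the runtime module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    dataset_family: str   # e.g. "point-cloud"
    scale: str            # "10K", "100K", or "1M"
    requested_mode: str   # runtime mode the caller asked for
    selected_mode: str    # runtime mode that actually ran
    speedup: float        # versus the documented host baseline


def passes_gate(result: BenchmarkResult, gate: float) -> bool:
    """Return True only if the GPU path ran and met the minimum speedup."""
    # Assumption: "gpu" is the label for a device run. A CPU fallback does
    # not count as passing a GPU benchmark gate, whatever its speedup.
    if result.selected_mode != "gpu":
        return False
    return result.speedup >= gate
```

Recording both the requested and the selected mode makes a silent fallback visible in the result itself, not only in logs.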
Verification¶
Use this doc as the policy source until benchmark rails exist:
uv run python scripts/check_docs.py --check
uv run python scripts/intake.py "define performance tier gates for GPU kernels"