Performance Tiers

Define the minimum performance gates for GPU-first kernel work before the benchmark harness and synthetic datasets are fully implemented.

Intent

Set explicit speedup floors, aspirational targets, and reference benchmark rules so each kernel can declare success against the same denominator.

Request Signals

  • benchmark

  • performance

  • perf gates

  • speedup

  • throughput

  • latency

  • kernel tier

Open First

  • docs/testing/performance-tiers.md

  • docs/implementation-order.md

  • docs/testing/upstream-inventory.md

  • src/vibespatial/runtime/_runtime.py

Verify

  • uv run python scripts/check_docs.py --check

  • uv run python scripts/intake.py "define performance tier gates for GPU kernels"

Risks

  • Speedup targets can become meaningless if the reference denominator shifts.

  • Small synthetic cases can overstate GPU wins by hiding transfer and setup costs.

  • Gates that are too strict too early can block correct-but-incomplete kernel landings.

Denominator

  • Baseline comparisons are against single-threaded GeoPandas or Shapely on the same machine unless a benchmark explicitly documents a different host-side denominator.

  • The gate measures steady-state kernel-path speedup at the reference scale, not cold-start import time or one-time environment setup.

  • CPU fallback paths do not count as passing a GPU benchmark gate.

  • A kernel must state both its expected tier and the benchmark command or harness entrypoint that will eventually enforce the gate.
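The steady-state rule above can be sketched as a small timing helper. This is a minimal illustration, not the project's actual harness: `steady_state_speedup`, `warmup`, and `reps` are hypothetical names, and the baseline and kernel callables stand in for whatever the benchmark entrypoint eventually wires up.

```python
import statistics
import time

def steady_state_speedup(baseline_fn, kernel_fn, *, warmup=3, reps=10):
    """Measure steady-state speedup of kernel_fn over baseline_fn.

    Warmup runs are discarded so one-time costs (imports, JIT, device
    setup, first-touch transfers) never enter the gate comparison, and
    the median of the remaining samples damps scheduler noise.
    """
    def median_seconds(fn):
        for _ in range(warmup):          # absorb cold-start and setup costs
            fn()
        samples = []
        for _ in range(reps):
            start = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - start)
        return statistics.median(samples)

    return median_seconds(baseline_fn) / median_seconds(kernel_fn)
```

The returned ratio is the speedup against the documented host baseline, with the denominator always measured on the same machine in the same session.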

Reference Scale

  • Required scales: 10K, 100K, and 1M geometries.

  • The default gate scale is 100K mixed polygons unless the operation is clearly point- or IO-dominated.

  • 10K exists to observe crossover behavior and dispatch thresholds.

  • 1M exists to catch memory-pressure and batching regressions once kernels move beyond smoke status.

Use these reference dataset families:

  • uniform grids for regular work and predictable memory access

  • polygon-heavy parcel-like subdivisions for overlay, clip, and dissolve

  • point clouds for joins, distance, and coarse-filter workloads

  • admin-boundary style polygon sets for real-world irregularity

o17.1.8 should generate license-free versions of these families instead of checking in sourced benchmark data.
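A license-free family like the uniform grid can be produced synthetically. The sketch below is illustrative only: `uniform_grid_polygons` is a hypothetical name, and the jitter amount and ring representation are assumptions, not the generator o17.1.8 will actually ship.

```python
import random

def uniform_grid_polygons(n, cell=1.0, seed=0):
    """Generate n axis-aligned square polygons laid out on a regular grid.

    Each polygon is a closed ring of (x, y) tuples. A small seeded jitter
    keeps the data non-degenerate while staying reproducible and free of
    any sourced, license-encumbered benchmark data.
    """
    rng = random.Random(seed)
    side = int(n ** 0.5) + 1             # grid width in cells
    polys = []
    for i in range(n):
        x0 = (i % side) * cell + rng.uniform(0.0, 0.1 * cell)
        y0 = (i // side) * cell + rng.uniform(0.0, 0.1 * cell)
        polys.append([
            (x0, y0),
            (x0 + cell, y0),
            (x0 + cell, y0 + cell),
            (x0, y0 + cell),
            (x0, y0),                    # close the ring
        ])
    return polys
```

Calling it with n=10_000, 100_000, and 1_000_000 covers the three required scales with identical data characteristics at each size.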

Tier Table

| Tier | Parallelism shape       | Gate | Aspirational | Example operations |
|------|-------------------------|------|--------------|--------------------|
| 5    | embarrassingly parallel | 100x | 1000x        | bounds, centroid, area, length, affine transforms, SFC keys, coordinate access |
| 4    | per-geometry parallel   | 20x  | 100x         | buffer, simplify, convex hull, point-in-polygon, unary predicates |
| 3    | filtered parallel       | 10x  | 50x          | sjoin, sindex query, nearest, dwithin, binary predicates after coarse filtering |
| 2    | structured parallel     | 5x   | 20x          | clip, intersection, union, difference, dissolve |
| 1    | external-bound          | 1x   | 3x to 5x     | file IO, GDAL-mediated reads, format parsing |

CRS transforms are out of scope for these tiers because that work is expected to route through cuProj policy later in o17.6.3.
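The tier floors can be encoded as data so gate checks stay consistent everywhere they run. This is a minimal sketch under assumed names (`TIER_GATES`, `gate_status`); the values mirror the table above but the structure is not a committed API.

```python
# Minimum gate and aspirational speedup per tier, from the tier table.
TIER_GATES = {
    5: {"gate": 100.0, "aspirational": 1000.0},
    4: {"gate": 20.0,  "aspirational": 100.0},
    3: {"gate": 10.0,  "aspirational": 50.0},
    2: {"gate": 5.0,   "aspirational": 20.0},
    1: {"gate": 1.0,   "aspirational": 5.0},
}

def gate_status(tier, speedup):
    """Classify a measured speedup against the tier's floors."""
    floors = TIER_GATES[tier]
    if speedup >= floors["aspirational"]:
        return "aspirational"
    if speedup >= floors["gate"]:
        return "pass"
    return "fail"
```

A Tier 5 kernel measuring 150x would classify as "pass" under this scheme, since it clears the 100x gate without reaching the 1000x aspirational mark.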

Tier Rules

  • Tier 5 kernels should usually be memory-bandwidth bound. If they do not beat the gate, treat the implementation as suspect until profiling proves otherwise.

  • Tier 4 kernels may have variable geometry complexity, but they should still scale mostly with per-geometry independence.

  • Tier 3 kernels must include the coarse-filter stage in benchmark accounting. Reporting only the refine pass is not acceptable.

  • Tier 2 kernels may land below aspirational targets early, but the minimum gate still applies once correctness and batching stabilize.

  • Tier 1 work is allowed to land with parity-only performance if the bottleneck is dominated by host parsing, legacy libraries, or disk.
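The Tier 3 accounting rule can be made concrete with a timing wrapper that records both stages. This is a hypothetical sketch, not harness code: `timed_tier3_join` and its stage names are assumptions, and the coarse filter and refine callables stand in for the real index query and predicate evaluation.

```python
import time

def timed_tier3_join(coarse_filter, refine, left, right):
    """Run a filter-then-refine join and report per-stage timings.

    The gate must reflect total end-to-end cost, so coarse-filter time is
    recorded alongside the refine pass rather than silently discarded.
    """
    t0 = time.perf_counter()
    candidates = coarse_filter(left, right)   # e.g. bbox or index query
    t1 = time.perf_counter()
    result = refine(candidates)               # exact predicate evaluation
    t2 = time.perf_counter()
    timings = {
        "coarse_s": t1 - t0,
        "refine_s": t2 - t1,
        "total_s": t2 - t0,
    }
    return result, timings
```

Reporting `total_s` as the gate numerator makes it impossible to pass by quoting only the refine pass.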

Mapping To Roadmap

  • Phase 2 geometry-buffer kernels should mostly declare Tier 5 or Tier 4.

  • Phase 3 indexing work should declare Tier 5 for pair generation and Tier 3 for query paths.

  • Phase 4 predicates and joins should mostly declare Tier 3, with some Tier 4 unary predicate coverage.

  • Phase 5 overlay and constructive geometry should declare Tier 2 unless a narrower fast path clearly fits Tier 4.

  • Phase 6b IO work should default to Tier 1 unless a GPU-native scanner moves parsing and filtering onto the device.

Acceptance Policy

  • Every kernel-oriented task must name its tier in the description or notes.

  • Every benchmark result should report:

    • dataset family

    • scale

    • requested runtime mode

    • selected runtime mode

    • speedup versus the documented host baseline

  • Gates are enforced first in docs and manual benchmark runs, then in o17.1.3 and o17.1.7 once benchmark rails and CI are in place.

  • Falling below the gate is allowed only with an explicit blocker or follow-up explaining why the kernel still needs to land.

  • o17.2.7 should use the 10K and 100K scales to reason about dispatch crossover thresholds.
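The required reporting fields and the CPU-fallback rule can be combined into one record-plus-check sketch. `BenchmarkResult` and `passes_gate` are hypothetical names used for illustration, not the project's acceptance API.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    dataset_family: str   # e.g. "uniform-grid"
    scale: int            # geometry count, e.g. 100_000
    requested_mode: str   # runtime mode asked for, e.g. "gpu"
    selected_mode: str    # runtime mode that actually executed
    speedup: float        # versus the documented host baseline

def passes_gate(result: BenchmarkResult, gate: float) -> bool:
    """A result passes only if the GPU path actually ran.

    A CPU fallback never satisfies a GPU benchmark gate, no matter what
    speedup it happens to report.
    """
    return result.selected_mode == "gpu" and result.speedup >= gate
```

Recording both requested and selected mode is what makes silent fallbacks visible in the report.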

Verification

Use this doc as the policy source until benchmark rails exist:

uv run python scripts/check_docs.py --check
uv run python scripts/intake.py "define performance tier gates for GPU kernels"