Per-Kernel Dispatch Crossover Policy

Context

The runtime already distinguishes the auto, cpu, and gpu modes, but it does not yet define the workload size at which auto should stop preferring CPU. A single global size gate would be too coarse, because bounds, metrics, predicates, and constructive kernels have materially different launch overhead and crossover behavior.

Decision

Use fixed per-kernel-class crossover thresholds until the adaptive runtime lands.

  • explicit cpu always stays on host

  • explicit gpu always attempts device execution

  • auto dispatches to CPU below the class threshold and to GPU at or above it

  • the initial thresholds are 1K, 5K, 10K, and 50K rows for coarse, metric, predicate, and constructive kernels respectively
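
The rules above can be sketched as one shared lookup plus a selection function. This is a minimal illustration, not the landed implementation: the names ExecutionMode, KernelClass, CROSSOVER_ROWS, and select_device are hypothetical, and it assumes that 1K means 1,000 rows. It only chooses a target. What happens when an explicit gpu request cannot run on the device is not specified here and is left out.

```python
from enum import Enum


class ExecutionMode(Enum):
    """Hypothetical names for the three modes the runtime distinguishes."""
    AUTO = "auto"
    CPU = "cpu"
    GPU = "gpu"


class KernelClass(Enum):
    """Hypothetical names for the four kernel classes with their own thresholds."""
    COARSE = "coarse"
    METRIC = "metric"
    PREDICATE = "predicate"
    CONSTRUCTIVE = "constructive"


# Initial fixed thresholds in rows (assuming 1K = 1,000).
CROSSOVER_ROWS = {
    KernelClass.COARSE: 1_000,
    KernelClass.METRIC: 5_000,
    KernelClass.PREDICATE: 10_000,
    KernelClass.CONSTRUCTIVE: 50_000,
}


def select_device(mode: ExecutionMode, kernel_class: KernelClass, row_count: int) -> str:
    """Return "cpu" or "gpu" for one dispatch decision."""
    # Explicit overrides never consult the thresholds.
    if mode is ExecutionMode.CPU:
        return "cpu"
    if mode is ExecutionMode.GPU:
        return "gpu"  # the caller then attempts device execution
    # auto: CPU strictly below the class threshold, GPU at or above it.
    return "gpu" if row_count >= CROSSOVER_ROWS[kernel_class] else "cpu"
```

Keeping the constants in a single table means a later adaptive runtime could swap the lookup for measured values without touching the two explicit-override branches.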

Consequences

  • Kernel dispatch code can rely on one shared threshold policy instead of ad hoc size checks.

  • Benchmark work now has concrete constants to validate and replace when measurements improve.

  • Explicit overrides remain stable even if later runtime adaptation changes the auto path.

Alternatives Considered

  • one global crossover threshold for all kernels

  • always preferring GPU whenever the runtime is available

  • deciding thresholds independently inside each kernel module

  • delaying all crossover policy until the adaptive runtime exists

Acceptance Notes

The landed policy encodes fixed thresholds only. o17.2.10 may replace the constants with adaptive inputs later, but should preserve the same override semantics.