Adaptive Runtime¶
Use a probe-first planner now, and leave room for a fuller controller later.
Intent¶
Define how runtime adaptation works before owned buffers and kernel families expand, without overcommitting to a live feedback controller too early.
Request Signals¶
adaptive runtime
nvml
variant registry
probe and adapt
chunk planning
saturation monitoring
Open First¶
docs/architecture/adaptive-runtime.md
docs/architecture/runtime.md
src/vibespatial/runtime/adaptive.py
src/vibespatial/runtime/kernel_registry.py
docs/decisions/0007-probe-first-adaptive-runtime.md
Verify¶
uv run pytest tests/test_adaptive_runtime.pyuv run python scripts/check_docs.py --check
Risks¶
A full live controller would create more machinery than value before real kernels and chunked workloads exist.
Overfitting variant choice too early can freeze bad metadata into the registry contract.
Hard-coding NVML into call sites would make later telemetry upgrades expensive.
Canonical Rule¶
Adaptive planning happens before execution and, for streaming work, at chunk boundaries.
The first landing is a planner, not a continuous controller.
Telemetry is optional. When monitoring is unavailable, planning falls back to static heuristics and declared metadata.
Explicit
cpu,gpu, and precision overrides remain authoritative.
Required Layers¶
telemetry snapshot: GPU availability plus optional NVML saturation and memory signals
variant registry: typed metadata, not just variant names
planner input: kernel class, row count, geometry mix, residency, and requested mode
planner output: selected runtime, variant, precision plan, chunk size hint, and reason log
Decision Scope¶
The planner may adapt:
kernel variant
chunk size hint
precision path through the existing precision-policy contract
autoruntime target through the existing crossover policy
The planner must not:
switch mid-kernel
override explicit user pins
depend on continuous background polling
Upgrade Path¶
This design is intentionally a stepping stone.
Today: one-shot planning plus optional re-plan after the first chunk.
Later: richer telemetry, runtime history, and tighter re-plan cadence.
Moving from the planner to a live controller should only replace internal policy and telemetry sources. Kernel call sites, registry metadata, and plan objects should stay stable.