Staged Fusion Strategy

Context

The repo now has owned geometry arrays, coarse kernels, residency diagnostics, and a probe-first runtime planner. The remaining question is how to eliminate intermediate buffers in multi-step pipelines without overcommitting to a full graph runtime too early.

Decision

Use a lightweight staged operator DAG as the default fusion mechanism.

  • fuse ephemeral device-local chains

  • persist reusable structures such as indexes and partition metadata

  • treat explicit host materialization as a hard boundary

  • keep user-facing APIs unchanged

  • allow specialized fused kernels as an optimization inside the staged model, not as the only strategy

Consequences

  • The runtime planner has a clear place to choose stage shapes at dispatch time.

  • Diagnostics remain visible because stage boundaries are explicit.

  • Future kernels can register fusible chains without building a whole-program graph engine first.

Alternatives Considered

  • full lazy evaluation graph

  • explicit fused kernels only

  • no shared fusion contract until much later

Acceptance Notes

The landed implementation is a policy module plus tests. It defines how to classify ephemeral versus persisted intermediates and where fusion must stop. Actual fused execution and memory accounting remain follow-up implementation work.