Transient Native Work Budget

Context

ADR-0042 moved GPU-selected workflows toward device-native result boundaries. ADR-0044 defines the private native execution substrate underneath exact GeoPandas-compatible public APIs. Recent ADR-0044 profiling showed a second performance center: many real workflows are composed from small transient work items rather than one large throughput kernel.

Examples include rowset filters, relation consumers, many-few overlay fragments, grouped reductions with small groups, bounds/type/empty metadata checks, count-scatter allocation totals, and terminal export preparation. These items may each do little arithmetic, but they can dominate a workflow when every step rebuilds public objects, asks the host for scalar sizes, allocates temporary buffers independently, or crosses a compatibility boundary.

Accessibility-style workflow canaries exposed this pattern, but the pattern is not workflow-specific. The reusable lesson is that device-resident small operations need a latency/control-plane budget just as large kernels need throughput gates.

Decision

Make transient native work a P0 performance target.

A transient native work item is a small, short-lived device-resident operation created inside a larger public workflow. It may operate on a small rowset, a small geometry batch, a tiny group, a scalar allocation count, or intermediate native metadata. It is not user-visible by itself, but its composition cost is user-visible.

Transient native work must use private native carriers as execution currency:

  • NativeFrameState for frame and geometry state

  • NativeRowSet for row flow

  • NativeRelation for pair flow

  • NativeGrouped for grouped work

  • NativeExpression for admitted private scalar/vector predicates

  • NativeExportBoundary for explicit public compatibility export

The design target is:

  • zero hidden materialization inside admitted native transient stages

  • device-resident metadata and row-flow by default

  • batched scalar reads when a host allocation size is still unavoidable

  • shape-level batching for many tiny operations instead of Python loops over groups, fragments, or relation slices

  • reusable scratch or arena allocation for temporary device buffers

  • shape canaries that measure reusable work-item classes, not workflow scripts

Throughput gates remain necessary for large kernels, but they are not sufficient evidence of generalized performance. A GPU implementation that wins a large kernel benchmark can still fail ADR-0045 if normal public composition pays too many scalar fences, launches, allocations, or compatibility exports.

Budget Rules

Every new or expanded native transient shape must declare a budget covering:

  • runtime D2H transfer count and bytes

  • materialization count

  • scalar allocation fences

  • launch or stage count when observable

  • device residency of inputs and outputs

  • public export boundary, if any

  • shape canary command and expected scale

Default budget stance:

  • Hidden materialization in admitted native stages is forbidden.

  • Device row positions are preferred over host row positions.

  • Scalar size reads should be batched across independent count-scatter totals.

  • Public GeoDataFrame, GeoSeries, pandas, or Shapely construction is an export boundary, not transient execution currency.

  • Workflow canaries may measure impact, but production code must target the reusable transient shape.

Consequences

  • ADR-0044 work is reprioritized around composition latency, not only carrier coverage.

  • Small-operation canaries become first-class acceptance signals.

  • Generic primitives such as packed scalar reads, batched count-scatter totals, device rowset takes, grouped-offset caches, and scratch allocation can produce broad gains without workflow-specific branches.

  • Some current helpers that synchronously return Python sizes remain debt until they are replaced by device-planned allocation or batched scalar resolution.

  • Launch overhead and allocation churn become reviewable performance bugs even when total D2H bytes are small.

Alternatives Considered

  • Keep transient work as an implementation detail of ADR-0044. Rejected. The repeated small-operation cost is a distinct design center and needs explicit budget rules.

  • Optimize each workflow canary directly. Rejected. That would repeat the benchmark-specific path ADR-0043 warned against.

  • Focus only on large-kernel throughput tiers. Rejected. Throughput gates do not catch public composition latency.

  • Build a broad public lazy planner to fuse transient work. Rejected. ADR-0043 documents why broad public interception is not the right next step. ADR-0045 keeps the scope to admitted private native shapes.