Provenance Tagging and Rewrite System
Context
vibeSpatial evaluates eagerly: each spatial operation materializes its result
before the next one runs. This means common patterns like
gdf.buffer(100).intersects(other) perform expensive geometry construction
only to answer a yes/no question that dwithin could answer directly.
The staged fusion strategy (ADR-0009) handles intra-pipeline optimization on device-local chains but cannot see across user-visible operation boundaries. A lightweight metadata system that travels with intermediate results closes this gap.
Decision
Attach a frozen ProvenanceTag to GeometryArray results that records what
operation created them and with what parameters. A declarative registry of
RewriteRule definitions maps (producer, consumer) pairs to attempt functions
that check preconditions and substitute cheaper equivalents.
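
The data model above can be sketched roughly as follows. This is an illustration under stated assumptions, not the landed code: only the names ProvenanceTag and RewriteRule come from this ADR, while the field names (operation, params, source), the REGISTRY dict, and the helpers register and try_rewrite are hypothetical. The demo rule is a toy modelled on the buffer/intersects pattern from the Context section and returns a description of the cheaper call rather than running a real kernel.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Mapping, Optional, Tuple


@dataclass(frozen=True)
class ProvenanceTag:
    """What produced a result and with which parameters (field names assumed)."""
    operation: str
    params: Mapping[str, Any] = field(default_factory=dict)
    source: Any = None  # strong reference to the producer's input


# An attempt function returns the rewritten result, or None to decline.
AttemptFn = Callable[[ProvenanceTag, Tuple[Any, ...]], Optional[Any]]


@dataclass(frozen=True)
class RewriteRule:
    name: str
    producer: str
    consumer: str
    attempt: AttemptFn


# Hypothetical registry keyed by (producer, consumer).
REGISTRY: Dict[Tuple[str, str], RewriteRule] = {}


def register(rule: RewriteRule) -> None:
    REGISTRY[(rule.producer, rule.consumer)] = rule


def try_rewrite(tag: Optional[ProvenanceTag], consumer: str,
                args: Tuple[Any, ...]) -> Optional[Any]:
    """Return a rewritten result, or None so the caller runs the original op."""
    if tag is None:
        return None  # untagged input: nothing to match
    rule = REGISTRY.get((tag.operation, consumer))
    if rule is None:
        return None  # no rule for this (producer, consumer) pair
    return rule.attempt(tag, args)


def _attempt_buffer_intersects(tag: ProvenanceTag,
                               args: Tuple[Any, ...]) -> Optional[Any]:
    """Toy attempt function: decline unless the buffer used round caps."""
    if tag.params.get("cap_style") != "round":
        return None  # precondition failed: fall through silently
    (other,) = args
    # A real rule would invoke the cheaper kernel; this only describes the call.
    return ("dwithin", tag.source, other, tag.params["distance"])


register(RewriteRule("demo", "buffer", "intersects", _attempt_buffer_intersects))
```

In this sketch a None return is the single signal for "run the original operation", which keeps the dispatch side free of rule-specific branches.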
Key properties:
tags are immutable frozen dataclasses, zero-cost when no rewrite matches
rules are pure data; adding a new rule is a dataclass + attempt function
rewrites are observable via RewriteEvent deque and JSONL event log
precondition failures fall through silently to the original operation
provenance propagates through GeometryArray.copy() and __init__ so it survives pandas Series wrapping
rewrites are globally toggleable via VIBESPATIAL_PROVENANCE_REWRITES env var (0/false/no/off to disable; default enabled) or the set_provenance_rewrites(bool | None) programmatic override for A/B benchmarking; provenance_rewrites_enabled() reads: explicit override > env var > True default
each RewriteEvent carries elapsed_seconds wall-clock timing of the rewritten computation for performance analysis
Rewrite rules:
| Rule | Pattern | Rewrite | Constraint |
|---|---|---|---|
| R1 | | | point-only, round cap/join |
| R2 | | | point-only, round cap/join |
| R5 | | identity | always valid |
| R6 | | | positive radii, same style, point-only |
| R7 | | identity | always valid |
Consequences
Users who write inefficient but obvious code get automatic speedups.
Every rewrite is logged in the dispatch event stream, so profiling and debugging remain transparent.
The tag carries a strong reference to the source GeometryArray, so the source stays alive for as long as the tagged result does; this is acceptable because the buffer result is typically larger than its source.
New rewrite rules require only a rule definition and attempt function in
provenance.py, not changes to dispatch logic.
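
The logging consequence above (every rewrite recorded, with wall-clock timing) might look like the following sketch. RewriteEvent and elapsed_seconds are named in this ADR; the other fields, the bounded deque size, and the helpers run_rewrite and to_jsonl are assumptions, and the real event schema may differ.

```python
import json
import time
from collections import deque
from dataclasses import asdict, dataclass
from typing import Any, Callable, Deque


@dataclass(frozen=True)
class RewriteEvent:
    rule: str                # hypothetical field
    producer: str            # hypothetical field
    consumer: str            # hypothetical field
    elapsed_seconds: float   # wall-clock time of the rewritten computation


# Assumption: a bounded in-memory buffer so long sessions do not grow it forever.
EVENTS: Deque[RewriteEvent] = deque(maxlen=1024)


def run_rewrite(rule: str, producer: str, consumer: str,
                compute: Callable[[], Any]) -> Any:
    """Run the rewritten computation, time it, and record an event."""
    start = time.perf_counter()
    result = compute()
    elapsed = time.perf_counter() - start
    EVENTS.append(RewriteEvent(rule, producer, consumer, elapsed))
    return result


def to_jsonl(event: RewriteEvent) -> str:
    """Serialize one event as a single JSON line (no trailing newline)."""
    return json.dumps(asdict(event))
```

Appending to_jsonl(event) plus a newline to a log file yields a JSONL stream that can be joined against benchmark timings.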
Alternatives Considered
Full lazy evaluation graph: Maximum optimization power but fundamentally changes GeoPandas-compatible eager semantics. Deferred to future work.
Deferred execution context manager: Opt-in lazy block where operations return plan nodes. Worth prototyping later but higher complexity.
Declarative pipeline API: New surface (gpd.compile([...])) that is not GeoPandas-compatible. Outside the current scope.
Acceptance Notes
The landed implementation covers the data model, registry, event logging, and five rewrite rules (R1, R2, R5, R6, R7) with full test coverage.