Overlay Reconstruction¶

o17.5.3 fixes the constructive assembly shape before full overlay kernels land.

For debugging workflow, pinned-pair isolation, and regression strategy, see docs/testing/overlay-debugging-playbook.md.

Request Signals¶

union reconstruction
difference reconstruction
symmetric difference
overlay assembly
face labeling
cccl

Open First¶

docs/architecture/overlay-reconstruction.md
docs/architecture/segment-primitives.md
src/vibespatial/overlay/reconstruction.py
tests/test_overlay_reconstruction.py

Verify¶

uv run pytest tests/test_overlay_reconstruction.py tests/test_segment_primitives.py
uv run python scripts/check_docs.py --check

Risks¶

Monolithic overlay kernels would hide the exact stage boundaries needed for CCCL.
If stable ordering is not explicit, later dissolve and overlay assembly will drift from pandas-facing semantics.
Union, difference, and symmetric difference should not each invent separate topology assembly logic.

Intent¶

Define one staged reconstruction plan from classified segments to geometry output buffers so later constructive kernels share the same graph.

Options Considered¶

One bespoke kernel per overlay operation. Fast to sketch, but it duplicates topology assembly and destroys reuse.
Keep using host-side Shapely for all reconstruction. Correct on CPU, but it provides no owned assembly seam for GPU work.
Shared staged reconstruction. Classify segments once, emit nodes once, build directed edges once, label faces once, then apply operation-specific selection at the end.

Decision¶

Use option 3.

The shared plan is:

classify candidate segment pairs
emit nodes and split segments at intersections
stable-sort directed half-edges
walk rings and open chains
label faces and chains by source coverage
select union/difference/symmetric-difference outputs
emit geometry buffers in deterministic order

CCCL Mapping¶

The intended GPU execution is:

compaction of ambiguous segment rows
device-side split-event emission for endpoints, touches, proper crosses, and overlaps
stable sort for half-edge grouping
prefix sums for segment splitting and output sizing
reduce-by-key for face labeling and chain aggregation
scatter/gather for output restoration

That keeps the expensive topology decisions localized and reusable across overlay operations.

Consequences¶

public overlay() now chooses an execution family before heavy work starts (clip_rewrite, broadcast_right_intersection, broadcast_right_difference, coverage_union, grouped_union, or generic_reconstruction) and records that family in dispatch telemetry
every constructive family still lowers through one canonical NativeTabularResult boundary, so planner selection changes execution shape without reintroducing host-side composition trees
Phase 5 now has one reconstruction graph instead of operation-specific glue
o17.9.6.2 lands the device split-event and directed-edge primitive that feeds the later half-edge graph work
o17.9.6.3 adds canonical node ids, deterministic next_edge traversal, and bounded-face labeling on top of that directed-edge table
o17.9.6.4 is allowed to route simple axis-aligned rectangle batches into a dedicated fast path when the generic shared graph would have the wrong performance shape for pairwise box intersections
o17.5.5 can build dissolve on top of the same union reconstruction
o17.9.6.6 reuses the same bounded-face labels for union, difference, symmetric_difference, and geometry-only identity selectors instead of forking new topology assembly code
current GPU overlay output support remains polygon-only at the input seam; mixed-family constructive overlay stays explicitly unsupported until a later a later change widens the kernel contract
later GPU overlay work has an explicit CCCL-friendly assembly seam
Memory-Safe Difference Batching: overlay difference splits large workloads into VRAM-safe batches to prevent OOM at scale. The strategy:
1. Estimate per-pair byte cost from right-side coordinate density.
2. Query free VRAM via cupy.cuda.Device().mem_info.
3. Budget = free_bytes * _VRAM_BUDGET_FRACTION (0.3, conservative because segmented_union and binary_constructive each allocate comparable working memory).
4. Compute groups per batch from budget / (avg_pairs * bytes_per_pair), clamped to [_MIN_GROUPS_PER_BATCH=64, _MAX_GROUPS_PER_BATCH=10K].
5. Below _PAIR_THRESHOLD (200K total pairs), skip batching entirely and process in a single dispatch (lower overhead).
6. Each batch gathers its left/right slices, runs the GPU overlay difference kernel, and appends results. Final assembly concatenates batch outputs with consistent index tracking.
many-vs-one (N-vs-1) overlay uses a three-tier strategy that bypasses the full reconstruction graph for the common broadcast_right workload shape: (1) containment bypass identifies polygons fully inside the clip polygon and returns them unchanged, (2) batched Sutherland-Hodgman clip handles boundary-crossing simple polygons on GPU, (3) only complex remainder polygons fall through to the full per-group overlay pipeline