Segment Intersection Primitives¶

o17.5.1 establishes the first constructive-kernel primitive layer: direct segment extraction from owned buffers, cheap candidate generation, and robust intersection classification.

Request Signals¶

segment intersection
overlay primitives
degeneracy
collinear overlap
zero-length segment
cccl

Open First¶

docs/architecture/segment-primitives.md
docs/architecture/robustness.md
src/vibespatial/spatial/segment_primitives.py
tests/test_segment_primitives.py

Verify¶

uv run pytest tests/test_segment_primitives.py tests/test_segment_filters.py
uv run pytest tests/test_segment_primitives.py -q --run-gpu
uv run vsbench run segment-intersections --scale 10k
uv run python scripts/check_architecture_lints.py --all
uv run python scripts/check_docs.py --check

Risks¶

Wrong segment classification poisons every later overlay kernel.
Exact arithmetic on every row would destroy GPU throughput.
Routing candidate generation through materialized Shapely geometry would lock Phase 5 to the host path we are trying to replace.

Intent¶

Choose a segment-intersection primitive shape that preserves robustness while staying compatible with later CCCL-backed GPU execution.

Options Considered¶

Delegate segment intersection to Shapely/GEOS. Correct on CPU, but it hides the primitive boundaries and forces future GPU work to reverse-engineer the operation graph.
Run exact scalar arithmetic on every candidate pair. Robust, but throughput-hostile and a poor fit for SIMT hardware.
Stage the work: direct segment extraction, vectorized fast-path orientation, ambiguous-row compaction, and exact fallback only on compacted rows. This matches the repo’s robustness policy and maps onto reusable primitives.

Decision¶

Use option 3.

The landed primitive is a fully GPU-native 3-kernel pipeline:

extracts segments from owned coordinate buffers via GPU count-scatter kernels
generates candidate pairs via O(n log n) sort-sweep with P95 OOM protection
classifies pairs on GPU with fast orientation filter and Shewchuk adaptive refinement
no host round-trip for ambiguous rows – all refinement stays on device

The kernel registers as KernelClass.PREDICATE (not CONSTRUCTIVE) because on-GPU Shewchuk adaptive refinement handles precision internally, removing the need for forced fp64 compute on consumer GPUs.

The implementation has three GPU kernel stages:

segment extraction via NVRTC count-scatter kernels (per-family dispatch)
candidate generation via CCCL radix sort + binary search sweep
classification with warp-cooperative MBR skip and on-GPU Shewchuk refinement
result buffers stay on device for later reconstruction work

CCCL Mapping¶

The GPU path uses a three-tier primitive strategy (ADR-0033):

Segment extraction (Tier 1 NVRTC): two-pass count-scatter kernel with per-family dispatch extracts segments directly from owned coordinate buffers.
Candidate generation (CCCL Tier 3a + CuPy Tier 2): O(n log n) sort-sweep using CCCL radix_sort on right-segment x-midpoints, then CCCL lower_bound/upper_bound binary search per left segment. A P95 half-width strategy prevents OOM from outlier segments: the binary-search window uses the 95th-percentile right half-width (tight windows for 95% of segments), with a separate brute-force MBR pass for the <=5% outlier right segments whose half-width exceeds P95. Main and outlier candidates merge via cp.concatenate on device. When right.count < 20 or P95 == max, the outlier pass is skipped.
Fast classification (Tier 1 NVRTC): warp-cooperative MBR skip, fast orientation filter, Shewchuk adaptive refinement on GPU. No ambiguous rows are sent to the host; all refinement stays on device.
Compaction (CuPy Tier 2): compact_indices via CuPy flatnonzero for MBR overlap filtering and ambiguous-row isolation.

Degeneracy Semantics¶

strict interior crossing: proper
shared endpoints or point-on-segment hits: touch
collinear shared span: overlap
otherwise: disjoint

The exact fallback path must cover:

collinear overlap
shared vertices
zero-length segments
polygon ring-edge corner cases

Consequences¶

Phase 5 now has a reusable segment primitive instead of ad hoc scalar logic
the owned-buffer path stays visible from extraction through classification
later overlay work can consume candidate pairs and classifications directly