Segment Intersection Primitives¶
o17.5.1 establishes the first constructive-kernel primitive layer: direct
segment extraction from owned buffers, cheap candidate generation, and robust
intersection classification.
Request Signals¶
segment intersection
overlay primitives
degeneracy
collinear overlap
zero-length segment
cccl
Open First¶
docs/architecture/segment-primitives.md
docs/architecture/robustness.md
src/vibespatial/spatial/segment_primitives.py
tests/test_segment_primitives.py
Verify¶
uv run pytest tests/test_segment_primitives.py tests/test_segment_filters.pyuv run pytest tests/test_segment_primitives.py -q --run-gpuuv run python scripts/benchmark_segment_intersections.py --rows 10000uv run python scripts/check_architecture_lints.py --alluv run python scripts/check_docs.py --check
Risks¶
Wrong segment classification poisons every later overlay kernel.
Exact arithmetic on every row would destroy GPU throughput.
Routing candidate generation through materialized Shapely geometry would lock Phase 5 to the host path we are trying to replace.
Intent¶
Choose a segment-intersection primitive shape that preserves robustness while staying compatible with later CCCL-backed GPU execution.
Options Considered¶
Delegate segment intersection to Shapely/GEOS. Correct on CPU, but it hides the primitive boundaries and forces future GPU work to reverse-engineer the operation graph.
Run exact scalar arithmetic on every candidate pair. Robust, but throughput-hostile and a poor fit for SIMT hardware.
Stage the work: direct segment extraction, vectorized fast-path orientation, ambiguous-row compaction, and exact fallback only on compacted rows. This matches the repo’s robustness policy and maps onto reusable primitives.
Decision¶
Use option 3.
The landed primitive is a fully GPU-native 3-kernel pipeline:
extracts segments from owned coordinate buffers via GPU count-scatter kernels
generates candidate pairs via O(n log n) sort-sweep with P95 OOM protection
classifies pairs on GPU with fast orientation filter and Shewchuk adaptive refinement
no host round-trip for ambiguous rows – all refinement stays on device
The kernel registers as KernelClass.PREDICATE (not CONSTRUCTIVE) because
on-GPU Shewchuk adaptive refinement handles precision internally, removing the
need for forced fp64 compute on consumer GPUs.
The implementation has three GPU kernel stages:
segment extraction via NVRTC count-scatter kernels (per-family dispatch)
candidate generation via CCCL radix sort + binary search sweep
classification with warp-cooperative MBR skip and on-GPU Shewchuk refinement
result buffers stay on device for later reconstruction work
CCCL Mapping¶
The GPU path uses a three-tier primitive strategy (ADR-0033):
Segment extraction (Tier 1 NVRTC): two-pass count-scatter kernel with per-family dispatch extracts segments directly from owned coordinate buffers.
Candidate generation (CCCL Tier 3a + CuPy Tier 2): O(n log n) sort-sweep using CCCL
radix_sorton right-segment x-midpoints, then CCCLlower_bound/upper_boundbinary search per left segment. A P95 half-width strategy prevents OOM from outlier segments: the binary-search window uses the 95th-percentile right half-width (tight windows for 95% of segments), with a separate brute-force MBR pass for the <=5% outlier right segments whose half-width exceeds P95. Main and outlier candidates merge viacp.concatenateon device. When right.count < 20 or P95 == max, the outlier pass is skipped.Fast classification (Tier 1 NVRTC): warp-cooperative MBR skip, fast orientation filter, Shewchuk adaptive refinement on GPU. No ambiguous rows are sent to the host; all refinement stays on device.
Compaction (CuPy Tier 2):
compact_indicesvia CuPyflatnonzerofor MBR overlap filtering and ambiguous-row isolation.
Degeneracy Semantics¶
strict interior crossing:
propershared endpoints or point-on-segment hits:
touchcollinear shared span:
overlapotherwise:
disjoint
The exact fallback path must cover:
collinear overlap
shared vertices
zero-length segments
polygon ring-edge corner cases
Consequences¶
Phase 5 now has a reusable segment primitive instead of ad hoc scalar logic
the owned-buffer path stays visible from extraction through classification
later overlay work can consume candidate pairs and classifications directly