Spatial Query And Join Assembly¶
Request Signals¶
sindex query
sjoin
dwithin
nearest
pandas semantics
join assembly
spatial index
cccl
Open First¶
docs/architecture/spatial-joins.md
src/vibespatial/spatial/query.py
src/vibespatial/api/sindex.py
src/vibespatial/api/tools/sjoin.py
docs/architecture/binary-predicates.md
Verify¶
uv run pytest tests/test_spatial_query.pyuv run pytest tests/upstream/geopandas/tests/test_sindex.py -quv run pytest tests/upstream/geopandas/tools/tests/test_sjoin.py -k "predicate or nearest"uv run python scripts/benchmark_spatial_query.py --rows 20000 --overlap-ratio 0.2uv run python scripts/profile_fixture_spatial_query.py --fixture polygons-regular-grid-rows100000 --operation query --ensureuv run python scripts/check_docs.py --check
Risks¶
Query/index speedups are meaningless if join assembly breaks index names, suffixes, or geometry-column retention.
A nearest implementation without a bounded-search fast path will not scale.
Falling back to Shapely/STRtree for unsupported inputs must stay explicit enough that future GPU work is not hidden.
Intent¶
Define how repo-owned spatial query primitives plug into vendored GeoPandas
sindex and sjoin surfaces while preserving pandas-compatible output rules.
Decision¶
Land a repo-owned spatial-query engine and keep the vendored DataFrame-level join assembly.
This means:
the repo-owned spatial-query engine defines how future GPU-friendly query,
dwithin, and bounded-nearest work should be stagedregular-grid, axis-aligned rectangle polygon indexes may register an explicit GPU point-query specialization instead of paying the generic bbox-tiling cost
vendored
SpatialIndexstays on STRtree by default for current host execution because the owned host path is not yet performance-competitivesjoincontinues to rely on vendored_frame_join(...)so suffixes, index restoration, and geometry-column rules stay GeoPandas-compatibledwithinuses distance-expanded bounds as the coarse pass, then exactshapely.dwithinon the compacted candidate setbounded nearest uses exact-distance reduction over compacted candidates
unsupported geometry families fall back to the original Shapely/STRtree path
Query Strategy¶
The owned engine provides:
build and cache a flat spatial index over owned geometry buffers
allow the generic flat-index order build to use GPU morton-key sort when the runtime selection explicitly requests GPU execution
use bbox candidate generation for scalar and bulk query
refine candidate rows with the exact binary-predicate engine
support
indices,dense, andsparseoutput formats
For benchmark-style joins where the tree is a regular grid of non-overlapping rectangles and the query side is points, the engine may skip generic candidate generation entirely:
detect the regular-grid rectangle layout while building the flat index
launch a point-to-cell GPU kernel that emits up to four polygon hits per point for edge and corner cases
return polygon row ids directly because point-vs-rectangle
intersectssemantics match the cell lookup exactlycompact duplicate polygon ids before any downstream dissolve/materialization work so repeated point hits do not dominate CPU join assembly
The planned integration seam makes this query stack prefer index-driven coarse filtering where possible and routes point-vs-polygon refinement through the dedicated point predicate pipeline instead of always using the generic binary-predicate path.
This maps cleanly onto a future GPU/CCCL path:
bounds compare: transform-style primitive
candidate compaction:
DeviceSelectexact predicate refine on compacted rows
scatter back into requested output representation
Profile public-path query changes against cached GeoParquet fixtures in
.benchmark_fixtures/ before trusting synthetic-only kernel rails. The fixture
workflow is the source of truth for host/device traversal because it exercises
read_parquet -> sindex -> query through the public GeoPandas surface.
Nearest Strategy¶
Nearest uses a three-tier dispatch for point-point data:
Zero-copy GPU grid nearest — extracts coords directly from Shapely arrays (bypassing
_to_owned), builds a uniform-grid spatial hash on device, and runs ring-expansion search entirely on the GPU. Handles both bounded and unbounded nearest. Falls through for non-point inputs or degenerate grids.Indexed GPU nearest — uses
_to_ownedarrays with a sorted-X sweep and CCCL lower/upper bound for tie counting. Fallback when the grid path declines.STRtree host path — used for non-point or mixed-geometry inputs, or when the GPU is unavailable.
The bounded path (max_distance set) can also use distance-expanded bounds to
prune candidates, then exact distance reduction per input geometry.
Pandas Semantics¶
GeoPandas-compatible result assembly remains in the vendored layer:
howsemantics forleft,right, andinnerindex restoration and unnamed-index handling
overlapping-column suffix policy
geometry-column retention rules
optional distance-column insertion for nearest joins
Consequences¶
o17.4.3now has one owned spatial-query engine that future GPU work can targettargeted upstream
sindexandsjoinfamilies validate the result semanticsthe default host adapter avoids a measured regression by staying on STRtree for now
future GPU work can replace query/nearest internals without re-implementing DataFrame join behavior