Mixed Geometries¶
GeoPandas permits mixed geometry families in one array. This doc defines the default storage and execution strategy before owned geometry buffers land.
Intent¶
Choose a mixed-geometry handling strategy that minimizes divergence without making the canonical buffer model overly fragmented or wasteful.
Request Signals¶
mixed geometry
tagged union
sort partition
soa
geometry families
buffer layout
divergence
Open First¶
docs/architecture/mixed-geometries.md
src/vibespatial/testing/mixed_layouts.py
scripts/benchmark_mixed_layouts.py
docs/implementation-order.md
Verify¶
uv run pytest tests/test_mixed_layouts.py
uv run python scripts/benchmark_mixed_layouts.py --scales 100000 1000000
uv run python scripts/check_docs.py --check
Risks¶
Canonical storage and execution strategy are easy to conflate; they should not be the same decision.
GeometryCollections remain a pathological mixed-type case and should not force the common fast path.
A storage model that optimizes memory at ingest can still lose badly once divergent kernels dominate runtime.
Candidates¶
The four candidate approaches are:
A. separate typed arrays: split by family, execute family-specific kernels, then rejoin
B. tagged union: one mixed array with family tags and offsets, execute in original order
C. sort-partition: keep one logical array, but sort or partition into homogeneous chunks for execution
D. promote to common type: coerce everything to one family such as polygon
Decision¶
Use a hybrid strategy:
canonical storage: dense tagged representation with family tags and child-relative offsets
execution default for truly mixed inputs: sort-partition by coarse family (point, line, polygon)
execution fast path for near-homogeneous inputs: direct tagged execution without repartition
reject common-type promotion as a default strategy
do not make permanently separated typed arrays the canonical user-visible storage
This means candidate B (tagged union) wins as the storage model and candidate C
(sort-partition) wins as the mixed-execution model. Candidate A (separate typed
arrays) remains a useful internal cache or kernel-local staging shape, not the
primary persisted array contract.
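As a rough illustration of the tagged storage model chosen above, the sketch below keeps one family tag and one child-relative offset per row, with each family's payload in its own child list. The class name, the family codes, and the use of plain Python lists in place of real buffers are all assumptions made for illustration; none of them come from the repo.

```python
from dataclasses import dataclass, field

# Hypothetical family codes; the real tag encoding is not specified in this doc.
FAMILY_POINT, FAMILY_LINE, FAMILY_POLYGON = 0, 1, 2


@dataclass
class TaggedGeometryArray:
    """Illustrative dense tagged layout: one tag and one offset per row."""

    tags: list = field(default_factory=list)     # family code per row
    offsets: list = field(default_factory=list)  # index into that family's child list
    children: dict = field(
        default_factory=lambda: {FAMILY_POINT: [], FAMILY_LINE: [], FAMILY_POLYGON: []}
    )

    def append(self, family, payload):
        # The offset is relative to the family's own child list, not the row index.
        self.tags.append(family)
        self.offsets.append(len(self.children[family]))
        self.children[family].append(payload)

    def __getitem__(self, row):
        # Original row order is preserved: tag -> child list -> offset.
        return self.children[self.tags[row]][self.offsets[row]]

    def __len__(self):
        return len(self.tags)


arr = TaggedGeometryArray()
arr.append(FAMILY_POINT, (1.0, 2.0))
arr.append(FAMILY_POLYGON, [(0, 0), (1, 0), (1, 1), (0, 0)])
arr.append(FAMILY_POINT, (3.0, 4.0))
```

Because the per-family child lists already exist inside this layout, a family-specific kernel can read one child list directly, which is roughly how candidate A can survive as an internal view without becoming the user-visible contract.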
Why¶
Tagged storage preserves original ordering and keeps API semantics simple.
Sort-partition execution removes the worst warp divergence when the array is materially mixed.
Permanently separate typed arrays force split and rejoin logic into every consumer and complicate row-wise pandas alignment.
Promotion to polygon-like common type is semantically wrong for points and lines and can distort both memory and algorithm choice.
Benchmark Method¶
The current benchmark in scripts/benchmark_mixed_layouts.py measures:
metadata and payload byte estimates from synthetic geometry families
partitioning or reorder cost at 100K and 1M rows
warp-purity proxy on the original order as a divergence signal
Representative mixes (point / line / polygon share, in percent):
point-dominated: 90 / 8 / 2
polygon-dominated: 5 / 15 / 80
mixed: 40 / 30 / 30
These are design-stage layout benchmarks, not final kernel throughput numbers. Actual GPU throughput still needs validation in later benchmark rails.
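The doc does not spell out how the warp-purity proxy is computed. One plausible reading, sketched below, splits the rows into consecutive warp-sized chunks and averages the share of the most common family in each chunk. The function name, the warp width of 32, and the definition itself are assumptions, not the benchmark script's actual code.

```python
from collections import Counter


def warp_purity(tags, warp_size=32):
    """Mean share of the most common family per consecutive warp-sized chunk.

    Assumed definition for illustration; the real proxy may differ.
    """
    if not tags:
        return 1.0  # an empty array has nothing to diverge on
    shares = []
    for start in range(0, len(tags), warp_size):
        chunk = tags[start:start + warp_size]
        dominant_count = Counter(chunk).most_common(1)[0][1]
        shares.append(dominant_count / len(chunk))
    return sum(shares) / len(shares)


# Two families strictly interleaved, then the same rows grouped by family.
interleaved = [0, 1] * 32        # 64 rows, every warp is half and half
partitioned = sorted(interleaved)  # 32 rows of family 0, then 32 of family 1
```

Under this definition the interleaved order scores 0.5 and the partitioned order scores 1.0, which is the effect sort-partition execution is meant to buy.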
Results¶
Measured on this repo checkout with the synthetic payload model:
| Dataset | Scale | Tagged purity | Tagged prep ms | Sort-partition prep ms | Tagged payload MB | Sort-partition MB | Recommendation |
|---|---|---|---|---|---|---|---|
| point-dominated | 100K | 0.900 | 0.16 | 0.71 | 3.43 | 4.73 | direct tagged execution |
| point-dominated | 1M | 0.900 | 1.74 | 7.89 | 34.28 | 47.28 | direct tagged execution |
| polygon-dominated | 100K | 0.800 | 0.15 | 0.38 | 19.84 | 21.14 | tagged with optional late partitioning |
| polygon-dominated | 1M | 0.800 | 1.57 | 7.02 | 198.40 | 211.40 | tagged with optional late partitioning |
| mixed | 100K | 0.436 | 0.10 | 0.27 | 11.58 | 12.88 | sort-partition execution |
| mixed | 1M | 0.435 | 1.08 | 5.82 | 115.80 | 128.80 | sort-partition execution |
Interpretation:
Metadata overhead is not the deciding factor between tagged and separated layouts.
Divergence risk becomes the real problem once warp purity drops well below the 0.70 to 0.80 range.
Sort-partition adds modest metadata and reorder cost relative to the payload sizes at 100K and 1M, which makes it a good execution-time trade when the mix is genuinely heterogeneous.
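To put a number on the extra memory sort-partition carries, the sketch below derives per-row and relative overhead from the payload columns of the results table. It assumes the table's MB means 10^6 bytes, which the doc does not state; the helper name is hypothetical.

```python
# (dataset, rows, tagged payload MB, sort-partition MB), copied from the results table.
RESULTS = [
    ("point-dominated", 100_000, 3.43, 4.73),
    ("point-dominated", 1_000_000, 34.28, 47.28),
    ("polygon-dominated", 100_000, 19.84, 21.14),
    ("polygon-dominated", 1_000_000, 198.40, 211.40),
    ("mixed", 100_000, 11.58, 12.88),
    ("mixed", 1_000_000, 115.80, 128.80),
]


def overhead(rows, tagged_mb, partitioned_mb, bytes_per_mb=1_000_000):
    """Return (extra bytes per row, extra memory as a fraction of tagged payload)."""
    extra_mb = partitioned_mb - tagged_mb
    return extra_mb * bytes_per_mb / rows, extra_mb / tagged_mb


for name, rows, tagged_mb, partitioned_mb in RESULTS:
    per_row, relative = overhead(rows, tagged_mb, partitioned_mb)
    print(f"{name:18s} {rows:>9,d} rows: {per_row:5.1f} B/row extra, {relative:6.1%} of payload")
```

Under the decimal-MB assumption, every row of the table works out to about 13 extra bytes per geometry. The relative cost therefore depends on how heavy the payload is: roughly 7% for the polygon-dominated mix, 11% for the mixed case, and 38% for the point-dominated one, where direct tagged execution is recommended anyway.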
Thresholds¶
Use these provisional thresholds until adaptive runtime work lands:
dominant-family share >= 88%: execute directly from tagged storage
dominant-family share 70% to < 88%: default to tagged execution, allow kernel-specific late partitioning if profiling shows divergence pain
dominant-family share < 70% and row count >= 10K: partition before execution
row count < 10K: prefer tagged execution unless a kernel proves otherwise
o17.2.10 should eventually replace these fixed thresholds with observed
runtime-driven switching.
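The provisional thresholds above can be read as a small decision function. The sketch below is one such reading; the function name and the returned labels are hypothetical, and checking the row count before the share is an assumption, since the doc does not say which rule takes precedence for small inputs.

```python
def choose_strategy(dominant_share, row_count):
    """Map the provisional thresholds to an execution strategy label.

    dominant_share is the fraction of rows in the most common family (0.0 to 1.0).
    """
    if row_count < 10_000:
        # Assumed precedence: small inputs stay on tagged execution regardless of mix.
        return "tagged"
    if dominant_share >= 0.88:
        return "tagged"
    if dominant_share >= 0.70:
        # Tagged by default; a kernel may still partition late if profiling warrants it.
        return "tagged-with-optional-late-partition"
    return "sort-partition"


# The three benchmark mixes at 1M rows, using their dominant-family shares.
point_dominated = choose_strategy(0.90, 1_000_000)
polygon_dominated = choose_strategy(0.80, 1_000_000)
mixed = choose_strategy(0.40, 1_000_000)
```

With these inputs the function reproduces the recommendations in the results table: direct tagged execution, tagged with optional late partitioning, and sort-partition execution respectively.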
Buffer Implications¶
o17.2.1 should assume:
one logical mixed array may contain multiple geometry families
the canonical metadata must include at least family tag and family-relative offset
partitioned execution should be able to materialize permutation buffers without copying full payload data
row-order restoration must be cheap and explicit
GeometryCollections can stay on a slow path or explicit fallback path early
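The permutation-buffer and row-order requirements above can be sketched as follows: a stable sort over the family tags yields a permutation of row indices, kernels run in that partitioned order, and each result is scattered back to its original row. All names here are hypothetical, and a Python loop stands in for what would be batched per-family kernel launches.

```python
def partition_permutation(tags):
    """Stable permutation of row indices that groups rows by family tag."""
    return sorted(range(len(tags)), key=lambda row: tags[row])


def run_partitioned(tags, rows, kernels):
    """Apply a per-family kernel in partitioned order, then restore row order."""
    perm = partition_permutation(tags)
    results = [None] * len(rows)
    for row in perm:
        # Only indices are reordered; the payload in `rows` is never copied or moved.
        # Writing to results[row] is the explicit row-order restoration step.
        results[row] = kernels[tags[row]](rows[row])
    return results


# Hypothetical example: family 0 is a point, family 1 is a line; count vertices.
tags = [1, 0, 1, 0]
rows = [
    [(0, 0), (1, 1)],
    (5.0, 5.0),
    [(0, 0), (2, 0), (2, 2)],
    (7.0, 1.0),
]
kernels = {0: lambda point: 1, 1: lambda line: len(line)}

perm = partition_permutation(tags)
vertex_counts = run_partitioned(tags, rows, kernels)
```

Here `perm` is `[1, 3, 0, 2]` and `vertex_counts` comes back in the original row order as `[2, 1, 3, 1]`. Keeping the permutation as a separate index buffer is what lets the canonical tagged storage stay untouched while execution sees homogeneous runs.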
Rejections¶
Reject as defaults:
sparse-union-like promotion of all rows to the same payload shape
permanent split-by-family storage as the only canonical representation
Both can still exist as specialized internal views, but neither should define the Phase 2 buffer contract.
Next Consumers¶
o17.2.1 should use this doc as the storage-layout decision input.
o17.2.10 should treat the thresholds here as the first adaptive-policy baseline.
o17.6.1 should plan to preserve pandas row order even when execution partitions by family.
Verification¶
uv run pytest tests/test_mixed_layouts.py
uv run python scripts/benchmark_mixed_layouts.py --scales 100000 1000000
uv run python scripts/check_docs.py --check