Mixed Geometries¶

GeoPandas permits mixed geometry families in one array. This doc defines the default storage and execution strategy before owned geometry buffers land.

Intent¶

Choose a mixed-geometry handling strategy that minimizes divergence without making the canonical buffer model overly fragmented or wasteful.

Request Signals¶

mixed geometry
tagged union
sort partition
soa
geometry families
buffer layout
divergence

Open First¶

docs/architecture/mixed-geometries.md
src/vibespatial/testing/mixed_layouts.py
scripts/benchmark_mixed_layouts.py
docs/implementation-order.md

Verify¶

uv run pytest tests/test_mixed_layouts.py
uv run python scripts/benchmark_mixed_layouts.py --scales 100000 1000000
uv run python scripts/check_docs.py --check

Risks¶

Canonical storage and execution strategy are easy to conflate; they should not be the same decision.
GeometryCollections remain a pathological mixed-type case and should not force the common fast path.
A storage model that optimizes memory at ingest can still lose badly once divergent kernels dominate runtime.

Candidates¶

The four candidate approaches are:

separate typed arrays: split by family, execute family-specific kernels, then rejoin
tagged union: one mixed array with family tags and offsets, execute in original order
sort-partition: keep one logical array, but sort or partition to homogeneous chunks for execution
promote to common type: coerce everything to one family such as polygon

Decision¶

Use a hybrid strategy:

canonical storage: dense tagged representation with family tags and child-relative offsets
execution default for truly mixed inputs: sort-partition by coarse family (point, line, polygon)
execution fast path for near-homogeneous inputs: direct tagged execution without repartition
reject common-type promotion as a default strategy
do not make permanently separated typed arrays the canonical user-visible storage

This means candidate B wins as the storage model and candidate C wins as the mixed-execution model. Candidate A remains a useful internal cache or kernel-local staging shape, not the primary persisted array contract.

Why¶

Tagged storage preserves original ordering and keeps API semantics simple.
Sort-partition execution removes the worst warp divergence when the array is materially mixed.
Permanently separate typed arrays force split and rejoin logic into every consumer and complicate row-wise pandas alignment.
Promotion to polygon-like common type is semantically wrong for points and lines and can distort both memory and algorithm choice.

Benchmark Method¶

The current benchmark in scripts/benchmark_mixed_layouts.py measures:

metadata and payload byte estimates from synthetic geometry families
partitioning or reorder cost at 100K and 1M rows
warp-purity proxy on the original order as a divergence signal

Representative mixes:

point-dominated: 90 / 8 / 2
polygon-dominated: 5 / 15 / 80
mixed: 40 / 30 / 30

These are design-stage layout benchmarks, not final kernel throughput numbers. Actual GPU throughput still needs validation in later benchmark rails.

Results¶

Measured on this repo checkout with the synthetic payload model:

Dataset	Scale	Tagged purity	Tagged prep ms	Sort-partition prep ms	Tagged payload MB	Sort-partition MB	Recommendation
point-dominated	100K	0.900	0.16	0.71	3.43	4.73	direct tagged execution
point-dominated	1M	0.900	1.74	7.89	34.28	47.28	direct tagged execution
polygon-dominated	100K	0.800	0.15	0.38	19.84	21.14	tagged with optional late partitioning
polygon-dominated	1M	0.800	1.57	7.02	198.40	211.40	tagged with optional late partitioning
mixed	100K	0.436	0.10	0.27	11.58	12.88	sort-partition execution
mixed	1M	0.435	1.08	5.82	115.80	128.80	sort-partition execution

Interpretation:

Metadata overhead is not the deciding factor between tagged and separated layouts.
Divergence risk becomes the real problem once warp purity drops well below the 0.70 to 0.80 range.
Sort-partition adds modest metadata and reorder cost relative to the payload sizes at 100K and 1M, which makes it a good execution-time trade when the mix is genuinely heterogeneous.

Thresholds¶

Use these provisional thresholds until adaptive runtime work lands:

dominant-family share >= 88%: execute directly from tagged storage
dominant-family share 70% to < 88%: default to tagged execution, allow kernel-specific late partitioning if profiling shows divergence pain
dominant-family share < 70% and row count >= 10K: partition before execution
row count < 10K: prefer tagged execution unless a kernel proves otherwise

o17.2.10 should eventually replace these fixed thresholds with observed runtime-driven switching.

Buffer Implications¶

o17.2.1 should assume:

one logical mixed array may contain multiple geometry families
the canonical metadata must include at least family tag and family-relative offset
partitioned execution should be able to materialize permutation buffers without copying full payload data
row-order restoration must be cheap and explicit
GeometryCollections can stay on a slow path or explicit fallback path early

Rejections¶

Reject as defaults:

sparse-union-like promotion of all rows to the same payload shape
permanent split-by-family storage as the only canonical representation

Both can still exist as specialized internal views, but neither should define the Phase 2 buffer contract.

Next Consumers¶

o17.2.1 should use this doc as the storage-layout decision input.
o17.2.10 should treat the thresholds here as the first adaptive-policy baseline.
o17.6.1 should plan to preserve pandas row order even when execution partitions by family.

Verification¶

uv run pytest tests/test_mixed_layouts.py
uv run python scripts/benchmark_mixed_layouts.py --scales 100000 1000000
uv run python scripts/check_docs.py --check