Arrow And GeoParquet IO¶
Intent¶
Define the repo-owned Arrow, GeoParquet, and WKB IO boundary around owned geometry buffers while keeping GPU-native formats as the design center.
Request Signals¶
io arrow
geoparquet
wkb
geoarrow
parquet
arrow bridge
io decode
io encode
Open First¶
docs/architecture/io-arrow.md
src/vibespatial/io/geoarrow.py
src/vibespatial/io/geoparquet.py
src/vibespatial/io/wkb.py
Verify¶
uv run pytest tests/test_io_arrow.py
uv run python scripts/benchmark_io_arrow.py --suite smoke
uv run python scripts/check_docs.py --check
Risks¶
Repeatedly rebuilding Shapely-heavy intermediate state in the Arrow path destroys throughput.
Silent host decode hides missing GPU paths.
WKB compatibility bridge becoming the de facto layout instead of GeoArrow.
Decision¶
Treat GeoArrow as the canonical geometry interchange surface for owned buffers.
Route GeoPandas to_arrow, from_arrow, to_parquet, and read_parquet through repo-owned adapters instead of calling vendored helpers directly.
Keep a real optional pylibcudf GeoParquet scan path for unfiltered scans, but fall back explicitly when that runtime or a GPU-side bbox filter path is unavailable.
Model bbox pushdown at the adapter layer from GeoParquet covering metadata or point encoding so later GPU scanners can reuse the same decision logic.
Treat WKB as a compatibility bridge, not a canonical layout, and keep its encode/decode path explicit.
Adopt aligned GeoArrow buffers zero-copy and normalize only when the incoming layout does not match the canonical owned schema.
Performance Notes¶
Arrow and GeoParquet should converge on owned buffers instead of repeatedly rebuilding Shapely-heavy intermediate state.
The fastest long-term path is device-side GeoArrow and WKB codecs plus a GPU Parquet scanner; today the repo-owned adapters make the fallback visible instead of silently hiding a host path.
GeoParquet scans without bbox filters can already target a pylibcudf reader when that dependency is present.
Covering-based bbox pruning should stay outside geometry decode so row-group selection can reject work before expensive geometry materialization.
The current planner compares loop and vectorized row-group pruning and uses the vectorized strategy once row-group counts are large enough to justify it.
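The loop-versus-vectorized pruning choice can be sketched as follows. This is an illustrative sketch only: the threshold value and function names are assumptions, not the repo's actual planner API.

```python
import numpy as np

VECTORIZE_THRESHOLD = 64  # assumed row-group count cutoff, not the repo's value

def prune_loop(mins, maxs, bbox):
    """Per-row-group Python loop: cheap when there are only a few row groups."""
    xmin, ymin, xmax, ymax = bbox
    return [
        i
        for i in range(len(mins))
        if mins[i][0] <= xmax and maxs[i][0] >= xmin
        and mins[i][1] <= ymax and maxs[i][1] >= ymin
    ]

def prune_vectorized(mins, maxs, bbox):
    """One NumPy mask over all row-group bounds at once."""
    mins, maxs = np.asarray(mins), np.asarray(maxs)
    xmin, ymin, xmax, ymax = bbox
    mask = (
        (mins[:, 0] <= xmax) & (maxs[:, 0] >= xmin)
        & (mins[:, 1] <= ymax) & (maxs[:, 1] >= ymin)
    )
    return np.flatnonzero(mask).tolist()

def prune_row_groups(mins, maxs, bbox):
    """Pick the strategy by row-group count, mirroring the planner's idea."""
    fn = prune_vectorized if len(mins) >= VECTORIZE_THRESHOLD else prune_loop
    return fn(mins, maxs, bbox)
```

Both strategies return the same row-group indices; only the crossover point matters for performance.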
GeoArrow import and export should prefer shared buffer views over eager host copies whenever dtypes and shapes already match owned-buffer requirements.
Host geometry objects should stay lazily materialized; GeoArrow adoption must not construct Shapely objects unless a caller explicitly requests them.
GeoParquet scans should decode native GeoArrow family columns directly into owned buffers after scan instead of bouncing through Shapely.
Chunked GeoParquet scans should concatenate owned-buffer batches, not materialized geometry objects.
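Owned-buffer concatenation for an offsets-based family can be sketched like this; the function name and the (coords, offsets) tuple shape are illustrative assumptions, not the repo's real helpers.

```python
import numpy as np

def concat_linestring_batches(batches):
    """Concatenate (coords, offsets) linestring batches at the buffer layer;
    no geometry objects are ever materialized.

    batches: list of (coords[N, 2] float64, offsets[M + 1] int64) pairs,
    where offsets index into coords in GeoArrow style.
    """
    coords = np.concatenate([c for c, _ in batches])
    out = [np.asarray(batches[0][1], dtype=np.int64)]
    base = len(batches[0][0])
    for c, off in batches[1:]:
        # Drop the leading 0 and shift by the coordinates consumed so far.
        out.append(np.asarray(off[1:], dtype=np.int64) + base)
        base += len(c)
    return coords, np.concatenate(out)
```

The key property is that concatenation cost scales with buffer sizes, not with per-geometry object construction.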
Current Behavior¶
GeoDataFrame.to_arrow, GeoDataFrame.from_arrow, GeoSeries.to_arrow, GeoSeries.from_arrow, GeoDataFrame.to_parquet, and geopandas.read_parquet now dispatch through repo-owned wrappers.
Owned GeoArrow and WKB bridge helpers exist as first-class repo APIs.
Dispatch and fallback events make the current host/device choice observable.
Repo-owned WKB bridges now use a staged native path for supported families:
one header scan separates native rows from the explicit fallback pool
point, linestring, polygon, multipoint, multilinestring, and multipolygon rows use family-specialized native decode or encode
malformed, unsupported, or non-little-endian rows compact into explicit fallback instead of forcing the whole batch through Shapely
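The staged header scan can be sketched in a few lines. The function name and the exact header checks are assumptions standing in for the repo's real bridge; only well-formed WKB header layout (1 byte order byte, 4-byte little-endian geometry type) is taken from the WKB format itself.

```python
import struct

NATIVE_WKB_TYPES = {1, 2, 3, 4, 5, 6}  # point .. multipolygon

def stage_wkb_headers(rows):
    """One pass over WKB headers: little-endian rows with a native
    geometry type go to the native pool; malformed, big-endian, or
    unsupported rows compact into the explicit fallback pool."""
    native, fallback = [], []
    for i, wkb in enumerate(rows):
        if wkb is None or len(wkb) < 5 or wkb[0] != 0x01:
            fallback.append(i)
            continue
        (geom_type,) = struct.unpack_from("<I", wkb, 1)
        (native if geom_type in NATIVE_WKB_TYPES else fallback).append(i)
    return native, fallback
```

Partitioning row indices up front is what lets the family-specialized codecs run over dense native pools instead of branching per row.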
geopandas.read_parquet(..., bbox=...) now builds a repo-owned metadata summary when pyarrow metadata is available, selects row groups before the table read, and passes those row groups into the host read path instead of decoding the full dataset first.
Repo-owned GeoArrow bridges now distinguish:
copy: always normalize into fresh owned buffers
auto: share aligned buffers, normalize only when required
share: require a fully aligned layout and fail otherwise
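The three adoption modes can be sketched as one function. The alignment test here (float64, C-contiguous) is a stand-in assumption for the real owned-schema check, and the function name is hypothetical.

```python
import numpy as np

def adopt_geoarrow_buffer(buf, mode="auto"):
    """Sketch of the copy / auto / share adoption modes."""
    aligned = buf.dtype == np.float64 and buf.flags["C_CONTIGUOUS"]
    if mode == "copy":
        # Always normalize into fresh owned buffers.
        return np.array(buf, dtype=np.float64, copy=True, order="C")
    if mode == "share":
        # Require a fully aligned layout and fail otherwise.
        if not aligned:
            raise ValueError("share requires a fully aligned layout")
        return buf
    # auto: share aligned buffers, normalize only when required.
    return buf if aligned else np.ascontiguousarray(buf, dtype=np.float64)
```

The point of the explicit modes is that callers can assert zero-copy behavior (share) or full isolation (copy) instead of relying on whatever auto happens to do.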
Repo-owned read_geoparquet_owned(...) now provides the scan-engine seam for o17.6.20:
backend selection: pylibcudf or pyarrow
row-group chunk planning from metadata summaries
direct GeoArrow-family decode into owned buffers
chunk concatenation at the owned-buffer layer
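The backend-selection half of that seam can be sketched as follows; the function name, return values, and log message are assumptions, not the repo's real API. The intent it illustrates is from the doc: prefer the device reader, and make the host fallback visible rather than silent.

```python
import importlib.util

def select_scan_backend(prefer_device: bool = True) -> str:
    """Prefer the pylibcudf device reader when the runtime is importable;
    otherwise fall back explicitly (and observably) to the pyarrow host path."""
    if prefer_device and importlib.util.find_spec("pylibcudf") is not None:
        return "pylibcudf"
    if prefer_device:
        # An explicit, visible fallback event rather than a silent host decode.
        print("fallback: pylibcudf unavailable, using pyarrow host scan")
    return "pyarrow"
```

Keeping this decision in one seam is what lets row-group planning and owned-buffer decode stay backend-agnostic.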
The pylibcudf GeoParquet device path now decodes all native GeoArrow families (point, linestring, polygon, multipoint, multilinestring, multipolygon) into device-resident owned buffers without forcing host family payload materialization first.
The pylibcudf GeoParquet device path now also decodes WKB point-only, linestring-only, and mixed point/linestring columns into device-resident owned buffers without a Shapely round-trip.
Polygon-family WKB device decode is still follow-on work; the current staged WKB bridge remains the contract to port for polygon, multipoint, multilinestring, and multipolygon WKB rows.
Repo-owned native GeoArrow codecs now provide family-specialized encode and decode for homogeneous geometry columns:
point, linestring, polygon, multilinestring, and multipolygon extension arrays decode through dedicated family builders
homogeneous exports encode directly from owned buffers to native GeoArrow arrays instead of routing through the generic host bridge
mixed-family exports stay on explicit WKB fallback until partition-and-restore mixed codecs land
successful homogeneous native export no longer records a fallback event on the public GeoPandas Arrow surface
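For the simplest family, a native export can be sketched like this: point coordinates go from the owned buffer straight into GeoArrow-style child arrays with no geometry objects in between. The separated x/y layout and the function name are illustrative assumptions; an interleaved layout works the same way.

```python
import numpy as np

def encode_points_native(coords):
    """Encode an owned (N, 2) float64 point buffer into GeoArrow-style
    separated child arrays without constructing Shapely objects."""
    coords = np.ascontiguousarray(coords, dtype=np.float64)
    # Column slices are views into the owned buffer, not per-point copies.
    return {"x": coords[:, 0], "y": coords[:, 1]}
```

This is why homogeneous exports can skip the generic host bridge entirely: the work is a buffer reinterpretation, not a per-geometry conversion.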
Measured Local Baseline¶
Host-only validation on this machine already shows why native GeoArrow decode must be the design center even before GPU throughput is measured:
100K point rows, GeoArrow GeoParquet decode: about 37.0M rows/s
100K point rows, WKB GeoParquet decode: about 170K rows/s
20K polygon rows, GeoArrow GeoParquet decode: about 3.05M rows/s
20K polygon rows, WKB GeoParquet decode: about 64.6K rows/s
That is roughly 218x faster on the point case and 47x faster on the polygon
case, which validates the native scan-engine direction before pylibcudf
throughput is available locally.
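Those multipliers follow directly from the measured rates:

```python
# Speedup ratios implied by the measured decode rates above.
point_speedup = 37.0e6 / 170e3     # GeoArrow vs WKB, 100K point rows
polygon_speedup = 3.05e6 / 64.6e3  # GeoArrow vs WKB, 20K polygon rows
print(round(point_speedup), round(polygon_speedup))  # 218 47
```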
The new family-specialized codec benchmarks also show the bridge structure is paying off before device kernels land:
100K point rows, native GeoArrow encode: about 98.9M rows/s
100K point rows, host bridge encode: about 11.2M rows/s
20K polygon rows, native GeoArrow decode: about 5.68M rows/s
20K polygon rows, host bridge decode: about 4.62M rows/s
That is about 8.8x faster on point encode and about 1.23x faster on polygon
decode. The remaining bottleneck is the pylibcudf -> pyarrow -> owned bridge,
which is now isolated behind the family codec boundary instead of being mixed
into the public adapter layer.
The staged WKB bridge now shows the same pattern on the compatibility path:
1M point rows, native WKB decode: about 1.54M rows/s
1M point rows, host WKB decode bridge: about 177K rows/s
1M point rows, native WKB encode: about 5.37M rows/s
1M point rows, host WKB encode bridge: about 145K rows/s
That is about 8.7x faster on decode and about 37x faster on encode while
keeping unsupported rows isolated in an explicit fallback pool. The remaining
work for o17.6.22 is no longer bridge shape; it is moving the same staged
scan, partition, size, and scatter contract onto CCCL-backed device passes.