Arrow And GeoParquet IO¶
Intent¶
Define the repo-owned Arrow, GeoParquet, and WKB IO boundary around owned geometry buffers while keeping GPU-native formats as the design center.
Request Signals¶
io arrow
geoparquet
wkb
geoarrow
parquet
arrow bridge
io decode
io encode
Open First¶
docs/architecture/io-arrow.md
src/vibespatial/io/geoarrow.py
src/vibespatial/io/geoparquet.py
src/vibespatial/io/wkb.py
Verify¶
uv run pytest tests/test_io_arrow.py
uv run python scripts/benchmark_io_arrow.py --suite smoke
uv run python scripts/check_docs.py --check
Risks¶
Repeatedly rebuilding Shapely-heavy intermediate state in the Arrow path destroys throughput.
Silent host decode hides missing GPU paths.
WKB compatibility bridge becoming the de facto layout instead of GeoArrow.
Decision¶
Treat GeoArrow as the canonical geometry interchange surface for owned buffers.
Route GeoPandas `to_arrow`, `from_arrow`, `to_parquet`, and `read_parquet` through repo-owned adapters instead of calling vendored helpers directly.
Keep a real optional `pylibcudf` GeoParquet scan path for unfiltered scans, but fall back explicitly when that runtime or a GPU-side bbox filter path is unavailable.
Model bbox pushdown at the adapter layer from GeoParquet covering metadata or point encoding so later GPU scanners can reuse the same decision logic.
Treat WKB as a compatibility bridge, not a canonical layout, and keep its encode/decode path explicit.
Adopt aligned GeoArrow buffers zero-copy and normalize only when the incoming layout does not match the canonical owned schema.
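Because WKB is kept as an explicit bridge rather than a canonical layout, its encode/decode step can be shown concretely. The sketch below encodes and decodes a canonical little-endian 2D WKB point per the OGC WKB layout; the helper names are illustrative, not the repo's actual API:

```python
import struct

WKB_POINT = 1  # OGC WKB geometry type id for Point

def encode_wkb_point(x: float, y: float) -> bytes:
    # byte-order flag (1 = little-endian), uint32 type id, two float64 coords
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)

def decode_wkb_point(buf: bytes) -> tuple[float, float]:
    byte_order, type_id = struct.unpack_from("<BI", buf, 0)
    if byte_order != 1 or type_id != WKB_POINT:
        # explicit rejection instead of a silent host fallback
        raise ValueError("non-canonical WKB record; route to compatibility path")
    x, y = struct.unpack_from("<dd", buf, 5)
    return x, y
```

A round trip such as `decode_wkb_point(encode_wkb_point(1.5, -2.0))` returns `(1.5, -2.0)`; anything non-little-endian or non-point fails loudly, matching the "explicit bridge" decision above.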
Performance Notes¶
Arrow and GeoParquet should converge on owned buffers instead of repeatedly rebuilding Shapely-heavy intermediate state.
The fastest long-term path is device-side GeoArrow and WKB codecs plus a GPU Parquet scanner; today the repo-owned adapters make the fallback visible instead of silently hiding a host path.
GeoParquet scans without bbox filters can already target a `pylibcudf` reader when that dependency is present.
Covering-based bbox pruning should stay outside geometry decode so row-group selection can reject work before expensive geometry materialization.
The current planner compares loop and vectorized row-group pruning and uses the vectorized strategy once row-group counts are large enough to justify it.
GeoArrow import and export should prefer shared buffer views over eager host copies whenever dtypes and shapes already match owned-buffer requirements.
Host geometry objects should stay lazily materialized; GeoArrow adoption must not construct Shapely objects unless a caller explicitly requests them.
GeoParquet scans should decode native GeoArrow family columns directly into owned buffers after scan instead of bouncing through Shapely.
Chunked GeoParquet scans should concatenate owned-buffer batches, not materialized geometry objects.
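The vectorized row-group pruning mentioned above can be sketched with a single boolean mask over per-group covering boxes. This assumes each row group exposes a `(minx, miny, maxx, maxy)` covering in its metadata summary; the function name is illustrative:

```python
import numpy as np

def select_row_groups(covering: np.ndarray,
                      bbox: tuple[float, float, float, float]) -> np.ndarray:
    """covering: (n_row_groups, 4) array of [minx, miny, maxx, maxy]."""
    qminx, qminy, qmaxx, qmaxy = bbox
    # standard bbox-overlap test, evaluated for all row groups at once,
    # before any geometry bytes are decoded
    keep = (
        (covering[:, 0] <= qmaxx)
        & (covering[:, 2] >= qminx)
        & (covering[:, 1] <= qmaxy)
        & (covering[:, 3] >= qminy)
    )
    return np.flatnonzero(keep)
```

For small group counts a plain Python loop over the same test can win on constant factors, which is why the planner compares both strategies and only switches to the vectorized form at larger row-group counts.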
Current Behavior¶
`GeoDataFrame.to_arrow`, `GeoDataFrame.from_arrow`, `GeoSeries.to_arrow`, `GeoSeries.from_arrow`, `GeoDataFrame.to_parquet`, and `geopandas.read_parquet` now dispatch through repo-owned wrappers.
Owned GeoArrow and WKB bridge helpers exist as first-class repo APIs.
Dispatch and fallback events make the current host/device choice observable.
Repo-owned WKB bridges now use a staged native path for supported families:
one header scan separates native rows from the explicit fallback pool
point, linestring, polygon, multipoint, multilinestring, and multipolygon rows use family-specialized native decode or encode
homogeneous Arrow WKB point, uniform-linestring, and uniform-polygon batches now take raw-buffer fast paths ahead of the generic GPU bridge and bulk-promote to device when a GPU runtime is available
malformed, unsupported, or non-little-endian rows compact into explicit fallback instead of forcing the whole batch through Shapely
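The single header scan above can be illustrated with a minimal pass that only reads each record's 5-byte WKB header and routes rows into native or fallback index pools; family ids follow the OGC WKB type codes, and the function name is illustrative:

```python
import struct

NATIVE_FAMILIES = {1, 2, 3, 4, 5, 6}  # Point ... MultiPolygon

def partition_rows(wkb_rows: list[bytes]) -> tuple[list[int], list[int]]:
    native, fallback = [], []
    for i, buf in enumerate(wkb_rows):
        # truncated or big-endian records go straight to the fallback pool
        if len(buf) < 5 or buf[0] != 1:
            fallback.append(i)
            continue
        (type_id,) = struct.unpack_from("<I", buf, 1)
        # EWKB flag bits or Z/M/ZM ids fall outside NATIVE_FAMILIES and
        # therefore classify as explicit compatibility work
        (native if type_id in NATIVE_FAMILIES else fallback).append(i)
    return native, fallback
```

Only the rows in the fallback pool ever touch the Shapely-backed path; everything else proceeds to the family-specialized native decode without re-reading headers.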
`geopandas.read_parquet(..., bbox=...)` now builds a repo-owned metadata summary when pyarrow metadata is available, selects row groups before the table read, and passes those row groups into the host read path instead of decoding the full dataset first.
Repo-owned GeoArrow bridges now distinguish:
`copy`: always normalize into fresh owned buffers
`auto`: share aligned buffers, normalize only when required
`share`: require a fully aligned layout and fail otherwise
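The three adoption modes can be sketched against a plain float64 coordinate buffer standing in for an aligned GeoArrow layout; the function name and the simplified alignment check are illustrative, not the repo's actual bridge:

```python
import numpy as np

def adopt_coords(arr: np.ndarray, mode: str) -> np.ndarray:
    aligned = arr.dtype == np.float64 and arr.flags["C_CONTIGUOUS"]
    if mode == "copy":
        # always fresh owned buffers, even when the input was already aligned
        return np.ascontiguousarray(arr, dtype=np.float64).copy()
    if mode == "share":
        if not aligned:
            raise ValueError("share requires a fully aligned layout")
        return arr  # zero-copy view of the caller's buffer
    if mode == "auto":
        # share when possible, normalize only when required
        return arr if aligned else np.ascontiguousarray(arr, dtype=np.float64)
    raise ValueError(f"unknown adoption mode: {mode}")
```

The useful property is that `share` turns a would-be hidden copy into a hard error, while `auto` keeps the zero-copy fast path without forbidding normalization.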
Repo-owned GeoParquet export now also accepts grouped native constructive results as an explicit terminal boundary, so grouped dissolve-style outputs can write directly without first materializing an intermediate GeoDataFrame.
Geometry-only native results can also write directly to GeoParquet, so geometry-producing pipelines do not need to rebuild a temporary GeoDataFrame just to hit the writer boundary.
Row-preserving native clip results can also write directly to GeoParquet, so constructive filter pipelines do not need to materialize a public GeoDataFrame before the terminal write boundary.
Point-only row-preserving clip results now also lower directly into the shared native tabular boundary, so simple clip producer paths no longer need to materialize a temporary public spatial object before Arrow-family export.
More generally, default `clip(..., keep_geom_type=False)` producer paths now lower row-preserving non-point results directly into the shared native tabular boundary too; only the stricter `keep_geom_type=True` compatibility cases still need the public clip materializer today.
Deferred overlay constructive results can also write directly to GeoParquet, so union, identity, and symmetric-difference native paths no longer need to collapse into an intermediate public frame just to hit the writer boundary.
Overlay pairwise and left-row constructive fragments now also project attributes directly into the shared native attribute-table boundary, so union/intersection/difference producer paths no longer need to build pandas attribute fragments before native terminal export.
Deferred spatial-join export results can also write directly to GeoParquet, so native join pairs and join-context assembly can stay deferred until the terminal write boundary instead of rebuilding a public frame first.
Join export now also produces Arrow-backed native attribute payloads before the sink boundary, so `sjoin` and `sjoin_nearest` no longer need to build a joined pandas frame just to cross into Arrow-family writers.
Native join, clip, and overlay exports now converge on a shared `NativeTabularResult` boundary of attribute columns plus native geometry before any terminal sink runs.
That same shared boundary now lowers directly to Arrow too, so GeoParquet and other Arrow-family sinks do not need to rebuild a temporary GeoDataFrame-shaped export just to cross the terminal format boundary.
The low-level GeoPandas Arrow helper for `geometry_encoding="geoarrow"` now also delegates to that shared native tabular boundary, so GeoArrow export no longer keeps a separate helper-local DeviceGeometryArray materialization branch alongside the repo-owned adapter.
Repo-owned GeoArrow export now also keeps the promotable single/multi mixes (`Point`/`MultiPoint`, `LineString`/`MultiLineString`, `Polygon`/`MultiPolygon`) on the native adapter instead of dropping those columns into the host `construct_geometry_array(...)` bridge.
The shared native boundary now also owns Parquet and Feather terminal emission, so Arrow-family write sinks no longer depend on GeoDataFrame assembly when a native result is already available.
`NativeTabularResult` now accepts a shared attribute payload abstraction, so Arrow-family sinks can lower Arrow-backed attribute tables directly instead of requiring pandas frames as the only internal attribute representation.
The GeoParquet writer consumes that shared native tabular boundary directly, so writer-local payload assembly is no longer the place where native result semantics live.
Native GeoParquet payload writes now also keep Arrow-backed attributes plus owned geometry on the device writer until the sink actually declines a feature, so the public `to_parquet` boundary no longer eagerly materializes a temporary `GeoSeries` just to discover that the native writer would have accepted the payload.
Shared native Arrow export now follows the same rule for WKB payloads: when owned geometry is already available, it encodes directly from the owned buffers instead of rebuilding a temporary `GeoSeries` before Arrow emission.
Public host-originated `GeoDataFrame.to_arrow` and `GeoSeries.to_arrow` exports now record an explicit GeoArrow compatibility-writer boundary, while device-backed CPU misses stay in `io_write` coverage as real acceleration gaps instead of being hidden by the compatibility bucket.
Terminal GeoParquet compatibility decisions such as non-filesystem sinks or non-native compression now record an explicit CPU dispatch at the sink boundary instead of a fallback event, so strict-native mode still rejects hidden mid-pipeline fallback without forbidding explicit compatibility export.
Repo-owned `read_geoparquet_owned(...)` now provides the scan-engine seam for o17.6.20:
backend selection: `pylibcudf` or `pyarrow`
row-group chunk planning from metadata summaries
direct GeoArrow-family decode into owned buffers
chunk concatenation at the owned-buffer layer
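Concatenation at the owned-buffer layer means joining flat coordinate arrays and rebasing Arrow-style offsets, never concatenating geometry objects. A minimal sketch for a linestring-like layout, assuming each chunk's offsets start at 0 (function name illustrative):

```python
import numpy as np

def concat_chunks(chunks: list[tuple[np.ndarray, np.ndarray]]
                  ) -> tuple[np.ndarray, np.ndarray]:
    """Each chunk is (coords of shape (n, 2), offsets of shape (k + 1,))."""
    coords_parts, offset_parts = [], []
    base = 0
    for i, (coords, offsets) in enumerate(chunks):
        coords_parts.append(coords)
        # drop the redundant leading 0 of every chunk after the first,
        # then shift by the number of coordinates already accumulated
        part = offsets if i == 0 else offsets[1:]
        offset_parts.append(part + base)
        base += coords.shape[0]
    return np.concatenate(coords_parts), np.concatenate(offset_parts)
```

Because only two `concatenate` calls run per column, batch merging stays O(total coordinates) with no per-geometry Python objects in the loop.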
The `pylibcudf` GeoParquet device path now decodes all native GeoArrow families (point, linestring, polygon, multipoint, multilinestring, multipolygon) into device-resident owned buffers without forcing host family payload materialization first.
The `pylibcudf` GeoParquet device path now also decodes canonical WKB point, linestring, polygon, multipoint, multilinestring, and multipolygon columns into device-resident owned buffers without a Shapely round-trip.
Mixed canonical WKB columns now keep the same GPU-first contract too: point-only, linestring-only, and point/linestring columns still use the lightweight `pylibcudf` helpers, while heavier or broader family mixes route through the staged GPU WKB decode pipeline after the same header scan.
Non-canonical WKB is still explicit compatibility work: big-endian 2D records, EWKB SRID-annotated 2D rows, Z/M/ZM or other non-2D type ids, and families outside the owned native result model now classify into explicit compatibility buckets instead of silently hiding a host decode.
Repo-owned native GeoArrow codecs now provide family-specialized encode and decode for homogeneous geometry columns:
point, linestring, polygon, multilinestring, and multipolygon extension arrays decode through dedicated family builders
device-backed homogeneous exports now rebuild public Arrow arrays directly from the repo-owned device codec instead of routing through the generic host bridge first
unsupported device-backed geometry mixes now drop to the repo-owned WKB compatibility bridge instead of forcing a host `construct_wkb_array(...)` materialization step
mixed-family GeoArrow export still ends in WKB for truly unsupported mixes until partition-and-restore mixed codecs land
successful homogeneous native export no longer records a fallback event on the public GeoPandas Arrow surface
The verified `pylibcudf` transport matrix is now checked in explicitly: local paths, `bytes`, `BytesIO`, `DeviceBuffer`, multi-source scans, row-group selection, filters, and `ChunkedParquetReader` are confirmed in pylibcudf-capabilities.md.
Local partitioned-directory GeoParquet reads and normalized `file://` public reads are now explicitly verified to stay on the `pylibcudf` scan backend.
Measured Local Baseline¶
Host-only validation on this machine already shows why native GeoArrow decode must be the design center even before GPU throughput is measured:
100K point rows, GeoArrow GeoParquet decode: about 37.0M rows/s
100K point rows, WKB GeoParquet decode: about 170K rows/s
20K polygon rows, GeoArrow GeoParquet decode: about 3.05M rows/s
20K polygon rows, WKB GeoParquet decode: about 64.6K rows/s
That is roughly 218x better on the point case and 47x better on the polygon
case, which validates the native scan-engine direction before pylibcudf
throughput is available locally.
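The headline ratios follow directly from the quoted throughputs and can be recomputed in a couple of lines:

```python
# Speedup ratios from the measured decode throughputs quoted above.
point_ratio = 37.0e6 / 170e3     # GeoArrow vs WKB decode, point case
polygon_ratio = 3.05e6 / 64.6e3  # GeoArrow vs WKB decode, polygon case
print(round(point_ratio))    # 218
print(round(polygon_ratio))  # 47
```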
The new family-specialized codec benchmarks also show the bridge structure is paying off before device kernels land:
100K point rows, native GeoArrow encode: about 98.9M rows/s
100K point rows, host bridge encode: about 11.2M rows/s
20K polygon rows, native GeoArrow decode: about 5.68M rows/s
20K polygon rows, host bridge decode: about 4.62M rows/s
That is about 8.8x faster on point encode and about 1.23x faster on polygon
decode. The remaining bottleneck is the pylibcudf -> pyarrow -> owned bridge,
which is now isolated behind the family codec boundary instead of being mixed
into the public adapter layer.
The staged WKB bridge now shows the same pattern on the compatibility path:
1M point rows, native WKB decode: about 1.54M rows/s
1M point rows, host WKB decode bridge: about 177K rows/s
1M point rows, native WKB encode: about 5.37M rows/s
1M point rows, host WKB encode bridge: about 145K rows/s
That is about 8.7x faster on decode and about 37x faster on encode while
keeping unsupported rows isolated in an explicit fallback pool. The remaining
work for o17.6.22 is no longer bridge shape; it is moving the same staged
scan, partition, size, and scatter contract onto CCCL-backed device passes.