GeoParquet Scan Engine¶
Context¶
o17.6.19 added metadata-first row-group pruning and o17.6.18 made aligned
GeoArrow adoption cheap. The remaining gap is the scan engine itself: one
contract that can use pylibcudf when available, plan chunk boundaries from
row-group metadata, and assemble owned buffers without routing geometry through
Shapely.
Decision¶
Adopt a backend-neutral GeoParquet scan engine with chunk planning and direct Arrow-to-owned geometry decode.
row-group selection remains the planner’s responsibility
scan execution chooses pylibcudf when available and otherwise uses pyarrow
supported geometry encodings decode directly into owned buffers after scan
chunked scans concatenate owned-buffer batches instead of rebuilding geometry objects between chunks
WKB remains supported, but as an explicit slower bridge relative to native GeoArrow encodings
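The backend choice described above can be sketched as a small probe over importable modules. This is a minimal illustration under assumptions, not the project's actual selection code: the helper name `select_scan_backend` and the `BACKEND_PREFERENCE` tuple are hypothetical, and only the preference order (pylibcudf first, pyarrow as fallback) comes from the decision itself.

```python
import importlib.util

# Preference order taken from the decision: pylibcudf first, pyarrow as fallback.
BACKEND_PREFERENCE = ("pylibcudf", "pyarrow")


def select_scan_backend(preference=BACKEND_PREFERENCE):
    """Return the name of the first importable backend (hypothetical helper)."""
    for name in preference:
        # find_spec returns None when a top-level module is not installed,
        # so the probe does not import the backend itself.
        if importlib.util.find_spec(name) is not None:
            return name
    raise RuntimeError(f"no GeoParquet scan backend available; tried {preference}")
```

Calling `select_scan_backend()` on a host without pylibcudf would return `"pyarrow"` if pyarrow is installed. Keeping the probe separate from the scan call is one way to let the planner and decoders stay unaware of which backend ran.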
Consequences¶
the fast path is now parquet scan -> Arrow geometry decode -> owned buffers instead of parquet scan -> Shapely objects -> owned buffers
o17.6.21 can replace the host Arrow family decoders with device kernels without changing the scan-engine boundary
chunked out-of-core execution now has a stable owned-buffer contract
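To make the owned-buffer contract for chunked scans concrete, the sketch below concatenates two linestring batches by extending flat coordinate buffers and rebasing offsets, without building any geometry objects in between. The `OwnedLineStrings` layout and `concat_owned` helper are assumptions made for illustration; the real owned-buffer types are not shown in this record.

```python
from array import array
from dataclasses import dataclass


@dataclass
class OwnedLineStrings:
    """Hypothetical owned-buffer layout: flat coordinates plus geometry offsets."""
    x: array        # float64 x coordinate of every vertex
    y: array        # float64 y coordinate of every vertex
    offsets: array  # int64; geometry i spans vertices offsets[i]:offsets[i + 1]


def concat_owned(batches):
    """Concatenate per-chunk batches without materializing geometry objects."""
    x, y = array("d"), array("d")
    offsets = array("q", [0])
    for batch in batches:
        base = len(x)  # vertices already emitted by earlier chunks
        x.extend(batch.x)
        y.extend(batch.y)
        # Skip each batch's leading 0 and shift the rest by the running base.
        offsets.extend(off + base for off in batch.offsets[1:])
    return OwnedLineStrings(x, y, offsets)


chunk_a = OwnedLineStrings(
    array("d", [0.0, 1.0, 2.0]), array("d", [0.0, 1.0, 0.0]), array("q", [0, 3])
)
chunk_b = OwnedLineStrings(
    array("d", [5.0, 6.0, 7.0, 8.0]), array("d", [5.0, 5.0, 6.0, 6.0]), array("q", [0, 2, 4])
)
merged = concat_owned([chunk_a, chunk_b])
print(list(merged.offsets))  # [0, 3, 5, 7]
```

The only per-chunk work is a buffer copy and an offset shift, which is the property that lets a later device-side decoder slot in behind the same boundary.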
Alternatives Considered¶
keep pylibcudf as an unfiltered table reader that immediately converts back to host GeoPandas objects
wait for full device-side family decoders before adding any owned-buffer scan engine
keep WKB as the default decode path for all scanned GeoParquet datasets
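One way to see why WKB is treated as a bridge rather than the default is that every value has to be parsed individually before its coordinates can land in flat buffers. The sketch below decodes 2D WKB points with the standard library only. It is an illustration of the per-value cost, not the project's WKB bridge, and the helper name `decode_wkb_points` is hypothetical.

```python
import struct
from array import array


def decode_wkb_points(wkb_values):
    """Parse 2D WKB points one value at a time into flat x/y buffers."""
    xs, ys = array("d"), array("d")
    for blob in wkb_values:
        # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
        endian = "<" if blob[0] == 1 else ">"
        (geom_type,) = struct.unpack_from(endian + "I", blob, 1)
        if geom_type != 1:  # 1 is the WKB code for Point
            raise ValueError(f"only 2D points handled in this sketch, got type {geom_type}")
        x, y = struct.unpack_from(endian + "dd", blob, 5)
        xs.append(x)
        ys.append(y)
    return xs, ys


wkb = [
    struct.pack("<BIdd", 1, 1, 1.5, 2.5),   # little-endian point
    struct.pack(">BIdd", 0, 1, -3.0, 4.0),  # big-endian point
]
xs, ys = decode_wkb_points(wkb)
print(list(xs), list(ys))  # [1.5, -3.0] [2.5, 4.0]
```

With a native GeoArrow encoding the coordinates already sit in contiguous arrays, so the equivalent step is a buffer handoff rather than a loop over header bytes.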
Acceptance Notes¶
This landing provides the scan-engine boundary, chunk planning, and direct
GeoArrow-family owned decode on the host validation path. pylibcudf is still
an optional backend, and device-side family decode remains follow-on work in
o17.6.21.
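Chunk planning from row-group metadata can be sketched as grouping the planner's selected row groups under a row budget. This is a minimal sketch under assumptions: `plan_chunks` and `max_rows_per_chunk` are hypothetical names, and the greedy policy shown here is one plausible choice rather than the policy this landing implements.

```python
def plan_chunks(row_group_rows, selected, max_rows_per_chunk):
    """Group selected row-group indices into chunks bounded by a row budget."""
    chunks, current, current_rows = [], [], 0
    for index in selected:
        rows = row_group_rows[index]
        # Close the current chunk if adding this row group would exceed the budget.
        if current and current_rows + rows > max_rows_per_chunk:
            chunks.append(current)
            current, current_rows = [], 0
        current.append(index)
        current_rows += rows
    if current:
        chunks.append(current)
    return chunks


row_group_rows = [100, 250, 400, 50, 300]  # rows per row group, from file metadata
selected = [0, 1, 3, 4]                    # row groups that survived pruning
plan = plan_chunks(row_group_rows, selected, max_rows_per_chunk=400)
print(plan)  # [[0, 1, 3], [4]]
```

With pyarrow, the row counts could be read from `pyarrow.parquet.ParquetFile(path).metadata.row_group(i).num_rows`, and each planned chunk passed to `ParquetFile.read_row_groups`. A single row group larger than the budget still becomes its own chunk, since row groups are not split in this sketch.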