GeoParquet Scan Engine

Context

o17.6.19 added metadata-first row-group pruning and o17.6.18 made aligned GeoArrow adoption cheap. The remaining gap is the scan engine itself: one contract that can use pylibcudf when available, plan chunk boundaries from row-group metadata, and assemble owned buffers without routing geometry through Shapely.

Decision

Adopt a backend-neutral GeoParquet scan engine with chunk planning and direct Arrow-to-owned geometry decode.

  • row-group selection remains the planner’s responsibility

  • scan execution chooses pylibcudf when available and otherwise uses pyarrow

  • supported geometry encodings decode directly into owned buffers after scan

  • chunked scans concatenate owned-buffer batches instead of rebuilding geometry objects between chunks

  • WKB remains supported, but as an explicit slower bridge relative to native GeoArrow encodings
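The backend choice and chunk planning described above can be sketched as follows. This is a minimal illustration, not the engine's actual code: `choose_backend`, `ScanChunk`, `plan_chunks`, and the `max_rows` budget are hypothetical names, and the planner is assumed to hand over a mapping of already-selected row-group indices to row counts taken from Parquet metadata.

```python
from dataclasses import dataclass
from importlib.util import find_spec


def choose_backend() -> str:
    """Return "pylibcudf" when that module is importable, otherwise "pyarrow"."""
    # find_spec only checks importability; it does not import the module.
    return "pylibcudf" if find_spec("pylibcudf") is not None else "pyarrow"


@dataclass(frozen=True)
class ScanChunk:
    row_groups: tuple[int, ...]  # row-group indices read together
    num_rows: int                # total rows across those row groups


def plan_chunks(row_group_rows: dict[int, int], max_rows: int) -> list[ScanChunk]:
    """Greedily pack selected row groups into chunks of at most max_rows rows.

    A row group larger than max_rows becomes its own chunk, because a row
    group is treated here as the smallest unit that can be read.
    """
    chunks: list[ScanChunk] = []
    current: list[int] = []
    rows = 0
    for index in sorted(row_group_rows):
        n = row_group_rows[index]
        # Close the current chunk before it would exceed the row budget.
        if current and rows + n > max_rows:
            chunks.append(ScanChunk(tuple(current), rows))
            current, rows = [], 0
        current.append(index)
        rows += n
    if current:
        chunks.append(ScanChunk(tuple(current), rows))
    return chunks
```

Because planning works only on metadata, chunk boundaries are fixed before any geometry column is read, and the same plan can be handed to either backend.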

Consequences

  • the fast path is now parquet scan -> Arrow geometry decode -> owned buffers instead of parquet scan -> Shapely objects -> owned buffers

  • o17.6.21 can replace the host Arrow family decoders with device kernels without changing the scan-engine boundary

  • chunked out-of-core execution now has a stable owned-buffer contract
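To illustrate what an owned-buffer contract makes possible for chunked scans, the sketch below concatenates per-chunk batches by appending coordinates and rebasing offsets, with no geometry objects rebuilt in between. The `OwnedLineBuffers` layout is an assumption modelled on GeoArrow linestrings with separated x/y coordinates; the engine's real buffer types are not shown in this record.

```python
from dataclasses import dataclass


@dataclass
class OwnedLineBuffers:
    # Hypothetical layout: geometry i owns coordinates
    # offsets[i]:offsets[i + 1] in the x and y arrays.
    offsets: list[int]
    x: list[float]
    y: list[float]


def concat_owned(batches: list[OwnedLineBuffers]) -> OwnedLineBuffers:
    """Concatenate batches whose offsets each start at 0."""
    offsets = [0]
    x: list[float] = []
    y: list[float] = []
    for batch in batches:
        base = len(x)  # coordinates already emitted by earlier batches
        # Skip each batch's leading 0 and shift the rest by the running base.
        offsets.extend(base + o for o in batch.offsets[1:])
        x.extend(batch.x)
        y.extend(batch.y)
    return OwnedLineBuffers(offsets, x, y)
```

The same rebasing applies per nesting level for polygon and multi-geometry families; only the number of offset arrays changes.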

Alternatives Considered

  • keep pylibcudf as an unfiltered table reader that immediately converts back to host GeoPandas objects

  • wait for full device-side family decoders before adding any owned-buffer scan engine

  • keep WKB as the default decode path for all scanned GeoParquet datasets
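For context on the last alternative, the sketch below decodes 2D WKB points with the standard library only. It is an illustration of the per-value work WKB implies (byte-order flag, type code, then coordinates for every blob), not the engine's WKB bridge; `decode_wkb_points` is a hypothetical helper and handles only plain 2D Point geometries.

```python
import struct


def decode_wkb_points(blobs: list[bytes]) -> tuple[list[float], list[float]]:
    """Decode 2D WKB Point blobs into separate x and y coordinate lists."""
    xs: list[float] = []
    ys: list[float] = []
    for blob in blobs:
        # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
        order = "<" if blob[0] == 1 else ">"
        # Bytes 1..4 hold the geometry type; 1 means Point.
        (geom_type,) = struct.unpack_from(order + "I", blob, 1)
        if geom_type != 1:
            raise ValueError(f"unsupported WKB geometry type: {geom_type}")
        # Bytes 5..20 hold x and y as two float64 values.
        x, y = struct.unpack_from(order + "dd", blob, 5)
        xs.append(x)
        ys.append(y)
    return xs, ys
```

Native GeoArrow encodings already store coordinates and offsets as columnar arrays, so they skip this per-blob header parsing.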

Acceptance Notes

This landing provides the scan-engine boundary, chunk planning, and direct decode of GeoArrow-family encodings into owned buffers on the host validation path. pylibcudf remains an optional backend, and device-side family decode is follow-on work in o17.6.21.