Batch-First Shapefile Ingest

Context

Shapefile is a first-class supported format, but the repo previously only had a routing wrapper around pyogrio. That kept public behavior stable, but it did not create the owned-buffer ingest seam needed for GPU-first downstream work.

The container side of Shapefile is inherently host-oriented:

- .shp record layout
- .shx sidecar index
- .dbf attributes
- OGR driver behavior

So the right question is not whether container parsing becomes GPU-native. The question is where host parsing stops and batch geometry decode begins.
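To make the host-oriented nature of the container concrete, here is a minimal sketch of reading a .shx sidecar index with only the standard library. It follows the published ESRI Shapefile layout: a 100-byte header, then one 8-byte big-endian entry per feature giving the record offset and content length in 16-bit words. The helper name `read_shx_index` is hypothetical and is not part of this repo or of pyogrio.

```python
import struct

SHX_HEADER_BYTES = 100  # fixed-size header shared by .shp and .shx
SHX_FILE_CODE = 9994    # big-endian magic number stored at byte 0


def read_shx_index(buf: bytes) -> list[tuple[int, int]]:
    """Return (record_offset_bytes, content_length_bytes) per feature.

    Hypothetical helper for illustration. Offsets point at the record
    header inside the .shp file; lengths exclude that 8-byte header.
    """
    if len(buf) < SHX_HEADER_BYTES:
        raise ValueError("truncated .shx header")
    (file_code,) = struct.unpack_from(">i", buf, 0)
    if file_code != SHX_FILE_CODE:
        raise ValueError(f"not a shapefile index: file code {file_code}")
    body = memoryview(buf)[SHX_HEADER_BYTES:]
    if len(body) % 8:
        raise ValueError("index body is not a whole number of 8-byte entries")
    # Both fields are stored in 16-bit words, so double them to get bytes.
    return [(off * 2, length * 2) for off, length in struct.iter_unpack(">ii", body)]
```

The walk is sequential, byte-order-sensitive and tiny compared with the geometry payload, which is why this ADR leaves it on the host and draws the seam further downstream.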

Decision

Adopt a batch-first owned ingest path for Shapefile:

- use pyogrio.read_arrow(...) as the explicit host container parser
- keep attributes in a columnar Arrow table
- land geometry through Arrow geoarrow.wkb batches into owned buffers using the repo-owned native WKB decoder
- add a raw Arrow-binary point fast path so homogeneous point workloads do not materialize Python WKB objects before decode
- expose the owned path explicitly as read_shapefile_owned(...)
- do not switch the public geopandas.read_file(..., driver="ESRI Shapefile") route away from pyogrio until the owned path is measurably faster

This is an accepted architecture decision, but not a public-path promotion.
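The raw Arrow-binary point fast path can be sketched as follows. This is not the repo's decoder: it is a pure-Python illustration of the buffer arithmetic, assuming an Arrow binary column exposed as an offsets sequence plus one contiguous data buffer, little-endian 2D WKB points, and no null rows (the validity bitmap is ignored). The name `decode_point_batch` is hypothetical. A native implementation would do the same walk without creating a Python float per coordinate.

```python
import struct
from array import array

WKB_POINT_LE = struct.Struct("<BIdd")  # byte order, geometry type, x, y
POINT_HEADER = (1, 1)  # 1 = little-endian marker, 1 = WKB 2D Point


def decode_point_batch(offsets, data):
    """Decode an Arrow-binary-style (offsets, data) pair of WKB points.

    Returns (xs, ys) as array('d'), or None when the batch is not made
    of homogeneous little-endian 2D points and needs the general decoder.
    """
    n = len(offsets) - 1
    size = WKB_POINT_LE.size  # 21 bytes per 2D point
    # Cheap homogeneity check: every row must be exactly one point wide.
    if any(offsets[i + 1] - offsets[i] != size for i in range(n)):
        return None
    start = offsets[0]
    view = memoryview(data)[start:start + n * size]
    xs, ys = array("d"), array("d")
    # Walk the shared buffer directly; no per-row bytes objects are sliced out.
    for order, gtype, x, y in WKB_POINT_LE.iter_unpack(view):
        if (order, gtype) != POINT_HEADER:
            return None
        xs.append(x)
        ys.append(y)
    return xs, ys
```

Returning None instead of raising keeps the fast path optional: mixed or non-point batches fall back to the general WKB decoder.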

Consequences

- the repo now has a real Shapefile ingest seam that stops before GeoDataFrame materialization
- geometry and attributes can be benchmarked separately
- downstream GPU work can consume owned buffers directly from Shapefile reads
- the current public path remains on pyogrio because the end-to-end owned batch path is still slower on this machine
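One way to benchmark the stages separately is sketched below. This ADR does not record how its measurements were taken, so the best-of-N timing policy and the helper names `rows_per_second` and `benchmark_stages` are assumptions chosen for illustration.

```python
import time


def rows_per_second(stage, n_rows, repeats=5):
    """Best-of-N throughput of a zero-argument callable, in rows/s."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        stage()
        best = min(best, time.perf_counter() - t0)
    return n_rows / best if best > 0 else float("inf")


def benchmark_stages(stages, n_rows, repeats=5):
    """Time each named stage on its own so their costs are not conflated."""
    return {
        name: rows_per_second(fn, n_rows, repeats)
        for name, fn in stages.items()
    }
```

In practice the `stages` mapping would hold closures for the container parse, the geometry decode and the full ingest over the same file, each reported in rows/s so the results are comparable.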

Alternatives Considered

- leave Shapefile entirely behind a pyogrio GeoDataFrame adapter
- materialize Shapely geometries first and convert to owned buffers later
- treat the current owned path as “good enough” and promote it despite slower measured throughput

Acceptance Notes

Measured local results:

- point-heavy Shapefile at 10K rows:
  - pyogrio.read_dataframe: about 1.18M rows/s
  - pyogrio.read_arrow container parse: about 3.97M rows/s
  - repo-owned native WKB decode only: about 1.65M rows/s
  - full owned batch ingest: about 925K rows/s
- point-heavy Shapefile at 100K rows after the raw Arrow-binary point fast path:
  - pyogrio.read_dataframe: about 1.25M rows/s
  - pyogrio.read_arrow container parse: about 4.41M rows/s
  - repo-owned native WKB decode only: about 1.56M rows/s
  - full owned batch ingest: about 4.10M rows/s
- point-heavy Shapefile at 1M rows:
  - pyogrio.read_dataframe: about 1.12M rows/s
  - pyogrio.read_arrow container parse: about 4.48M rows/s
  - repo-owned native WKB decode only: about 1.43M rows/s
  - full owned batch ingest: about 4.08M rows/s
- line-heavy Shapefile at 10K rows:
  - pyogrio.read_dataframe: about 1.10M rows/s
  - pyogrio.read_arrow container parse: about 3.32M rows/s
  - repo-owned native WKB decode only: about 611K rows/s
  - full owned batch ingest: about 3.20M rows/s
- line-heavy Shapefile at 1M rows:
  - pyogrio.read_dataframe: about 1.02M rows/s
  - pyogrio.read_arrow container parse: about 3.73M rows/s
  - repo-owned native WKB decode only: about 617K rows/s
  - full owned batch ingest: about 3.40M rows/s
- polygon-heavy Shapefile at 5K rows:
  - pyogrio.read_dataframe: about 858K rows/s
  - pyogrio.read_arrow container parse: about 2.37M rows/s
  - repo-owned native WKB decode only: about 502K rows/s
  - full owned batch ingest: about 2.29M rows/s
- polygon-heavy Shapefile at 250K rows:
  - pyogrio.read_dataframe: about 893K rows/s
  - pyogrio.read_arrow container parse: about 2.51M rows/s
  - repo-owned native WKB decode only: about 452K rows/s
  - full owned batch ingest: about 2.24M rows/s

These numbers close the decision. The published floors were:

- >= 1.5x over the current host baseline for point-heavy and line-heavy ingest at 1M rows
- >= 1.1x over the current host baseline for polygon-heavy ingest at 250K rows

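Taking pyogrio.read_dataframe at the same row count as the host baseline, the full owned batch ingest clears those floors comfortably:

- points at 1M rows: about 3.63x
- lines at 1M rows: about 3.32x
- polygons at 250K rows: about 2.51x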
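The floor check can be reproduced from the figures listed in the measured results. The sketch below assumes the host baseline is pyogrio.read_dataframe at the same row count, which is the reading consistent with the quoted ratios. Because the inputs are rounded throughput figures, the recomputed ratios can differ from the quoted ones in the last digit.

```python
# Throughput in rows/s, copied from the measured local results.
# "baseline" is pyogrio.read_dataframe (assumed), "owned" is full owned batch ingest.
MEASURED = {
    "points @ 1M": {"baseline": 1.12e6, "owned": 4.08e6, "floor": 1.5},
    "lines @ 1M": {"baseline": 1.02e6, "owned": 3.40e6, "floor": 1.5},
    "polygons @ 250K": {"baseline": 893e3, "owned": 2.24e6, "floor": 1.1},
}


def speedups(measured):
    """Return {workload: (ratio, clears_floor)} for each measured workload."""
    out = {}
    for name, m in measured.items():
        ratio = m["owned"] / m["baseline"]
        out[name] = (ratio, ratio >= m["floor"])
    return out


for name, (ratio, ok) in speedups(MEASURED).items():
    print(f"{name}: {ratio:.2f}x, clears floor: {ok}")
```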

The remaining ceiling is now outside per-feature decode. Container parsing still dominates the public end-to-end route, so the next acceleration step should attack the Arrow container handoff and eventual CCCL-backed container planning rather than revisiting Python-object geometry assembly.