# Batch-First Shapefile Ingest
## Context
Shapefile is a first-class supported format, but the repo previously only had a
routing wrapper around pyogrio. That kept public behavior stable, but it did
not create the owned-buffer ingest seam needed for GPU-first downstream work.
The container side of Shapefile is inherently host-oriented:

- `.shp` record layout
- `.shx` sidecar index
- `.dbf` attributes
- OGR driver behavior
So the right question is not whether container parsing becomes GPU-native. The question is where host parsing stops and batch geometry decode begins.
## Decision
Adopt a batch-first owned ingest path for Shapefile:

- use `pyogrio.read_arrow(...)` as the explicit host container parser
- keep attributes in a columnar Arrow table
- land geometry through Arrow `geoarrow.wkb` batches into owned buffers using the repo-owned native WKB decoder
- add a raw Arrow-binary point fast path so homogeneous point workloads do not materialize Python WKB objects before decode
- expose the owned path explicitly as `read_shapefile_owned(...)`
- do not switch the public `geopandas.read_file(..., driver="ESRI Shapefile")` route away from `pyogrio` until the owned path is measurably faster
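The decode seam in this decision can be sketched in miniature: homogeneous WKB point records are decoded straight into one owned, flat coordinate buffer, with no per-feature Python geometry objects. The snippet below is a simplified stdlib stand-in for the repo-owned native decoder, not its actual implementation; `decode_wkb_points` and `wkb_point` are illustrative names.

```python
import struct
from array import array

def decode_wkb_points(wkb_records):
    """Decode homogeneous WKB point records into one owned, flat
    coordinate buffer (x0, y0, x1, y1, ...) with no per-feature
    geometry objects."""
    coords = array("d")
    for rec in wkb_records:
        fmt = "<" if rec[0] == 1 else ">"        # byte 0: 1 = little-endian
        (geom_type,) = struct.unpack_from(fmt + "I", rec, 1)
        if geom_type != 1:                       # WKB geometry type 1 = Point
            raise ValueError("point fast path only handles WKB points")
        coords.extend(struct.unpack_from(fmt + "2d", rec, 5))
    return coords

def wkb_point(x, y):
    """Build one little-endian WKB point record (21 bytes)."""
    return struct.pack("<BIdd", 1, 1, x, y)

buf = decode_wkb_points([wkb_point(1.0, 2.0), wkb_point(3.5, -4.25)])
print(list(buf))  # [1.0, 2.0, 3.5, -4.25]
```

In the real path, the records would come from the `geoarrow.wkb` column of the `pyogrio.read_arrow(...)` table rather than a Python list, and the raw Arrow-binary fast path would skip the per-record Python loop entirely.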
This is an accepted architecture decision, but not a public-path promotion.
## Consequences
- the repo now has a real Shapefile ingest seam that stops before GeoDataFrame materialization
- geometry and attributes can be benchmarked separately
- downstream GPU work can consume owned buffers directly from Shapefile reads
- the current public path remains on `pyogrio` because the end-to-end owned batch path is still slower on this machine
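Because the seam stops before GeoDataFrame materialization, the container parse and the geometry decode can be timed as separate callables. A minimal per-stage throughput harness might look like the following; the helper name and the stand-in workload are hypothetical, not part of the repo.

```python
import time

def rows_per_second(stage_fn, n_rows, repeats=3):
    """Best-of-N timing for one ingest stage, reported as rows/s."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        stage_fn()
        best = min(best, time.perf_counter() - start)
    return n_rows / best

# Stand-in stage: a real harness would pass the container parse and the
# owned WKB decode as separate callables over the same fixture file.
rate = rows_per_second(lambda: sum(range(10_000)), n_rows=10_000)
```

Timing each stage in isolation is what makes the "container parse vs. decode vs. full ingest" breakdown in the acceptance notes possible.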
## Alternatives Considered
- leave Shapefile entirely behind a pyogrio GeoDataFrame adapter
- materialize Shapely geometries first and convert to owned buffers later
- treat the current owned path as “good enough” and promote it despite slower measured throughput
## Acceptance Notes
Measured local results:

- point-heavy Shapefile at 10K rows:
    - `pyogrio.read_dataframe`: about 1.18M rows/s
    - `pyogrio.read_arrow` container parse: about 3.97M rows/s
    - repo-owned native WKB decode only: about 1.65M rows/s
    - full owned batch ingest: about 925K rows/s
- point-heavy Shapefile at 100K rows, after the raw Arrow-binary point fast path:
    - `pyogrio.read_dataframe`: about 1.25M rows/s
    - `pyogrio.read_arrow` container parse: about 4.41M rows/s
    - repo-owned native WKB decode only: about 1.56M rows/s
    - full owned batch ingest: about 4.10M rows/s
- point-heavy Shapefile at 1M rows:
    - `pyogrio.read_dataframe`: about 1.12M rows/s
    - `pyogrio.read_arrow` container parse: about 4.48M rows/s
    - repo-owned native WKB decode only: about 1.43M rows/s
    - full owned batch ingest: about 4.08M rows/s
- line-heavy Shapefile at 10K rows:
    - `pyogrio.read_dataframe`: about 1.10M rows/s
    - `pyogrio.read_arrow` container parse: about 3.32M rows/s
    - repo-owned native WKB decode only: about 611K rows/s
    - full owned batch ingest: about 3.20M rows/s
- line-heavy Shapefile at 1M rows:
    - `pyogrio.read_dataframe`: about 1.02M rows/s
    - `pyogrio.read_arrow` container parse: about 3.73M rows/s
    - repo-owned native WKB decode only: about 617K rows/s
    - full owned batch ingest: about 3.40M rows/s
- polygon-heavy Shapefile at 5K rows:
    - `pyogrio.read_dataframe`: about 858K rows/s
    - `pyogrio.read_arrow` container parse: about 2.37M rows/s
    - repo-owned native WKB decode only: about 502K rows/s
    - full owned batch ingest: about 2.29M rows/s
- polygon-heavy Shapefile at 250K rows:
    - `pyogrio.read_dataframe`: about 893K rows/s
    - `pyogrio.read_arrow` container parse: about 2.51M rows/s
    - repo-owned native WKB decode only: about 452K rows/s
    - full owned batch ingest: about 2.24M rows/s
These numbers close the decision. The published floors were:

- `>= 1.5x` over the current host baseline for point-heavy and line-heavy ingest at 1M rows
- `>= 1.1x` over the current host baseline for polygon-heavy ingest at 250K rows
The implemented path clears those floors comfortably:

- points: about 3.63x
- lines: about 3.32x
- polygons: about 2.51x
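The ratios above follow directly from the measured figures; a quick arithmetic check (throughputs in M rows/s, taken from the rounded numbers reported above, so the computed ratios can differ from the quoted ones in the second decimal):

```python
# Speedup of full owned batch ingest over the pyogrio.read_dataframe
# baseline, from the measured throughputs (M rows/s) reported above.
measured = {
    "points @ 1M rows":     (1.12, 4.08),
    "lines @ 1M rows":      (1.02, 3.40),
    "polygons @ 250K rows": (0.893, 2.24),
}
floors = {
    "points @ 1M rows": 1.5,
    "lines @ 1M rows": 1.5,
    "polygons @ 250K rows": 1.1,
}

speedups = {}
for name, (baseline, owned) in measured.items():
    speedups[name] = owned / baseline
    # Every workload must clear its published floor.
    assert speedups[name] >= floors[name]

print({name: round(ratio, 1) for name, ratio in speedups.items()})
```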
The remaining ceiling is now outside per-feature decode. Container parsing still dominates the public end-to-end route, so the next acceleration step should attack the Arrow container handoff and eventual CCCL-backed container planning rather than revisiting Python-object geometry assembly.