Staged GeoJSON Ingest

Context

GeoJSON is a first-class hybrid format, but until now the repo had only a routed pyogrio adapter for it. That kept public behavior stable, but it did not create the staged geometry-ingest seam needed for GPU-native work. o17.6.23 needs a real tokenizer-plus-assembly pipeline and an honest comparison of that approach against the existing host adapter and possible pylibcudf use.

Decision

Adopt a staged GeoJSON ingest design with three layers:

- streaming FeatureCollection tokenizer for container structure
- repo-owned geometry assembly directly into owned buffers
- optional pylibcudf exploration for per-feature JSON batches after feature boundaries are already isolated
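The second layer can be pictured with a minimal sketch. It assumes an "owned buffer" means flat x and y arrays plus an offsets array marking each geometry's coordinate range. The names `OwnedLineBuffers` and `assemble_linestrings` are hypothetical and are not the repo's actual API; the point is only that coordinates go straight into flat storage without intermediate geometry objects.

```python
import json
from array import array
from dataclasses import dataclass, field


@dataclass
class OwnedLineBuffers:
    """Hypothetical owned layout: flat coordinates plus per-geometry offsets."""

    x: array = field(default_factory=lambda: array("d"))
    y: array = field(default_factory=lambda: array("d"))
    # offsets[i]:offsets[i + 1] is the coordinate range of geometry i
    offsets: array = field(default_factory=lambda: array("q", [0]))


def assemble_linestrings(features):
    """Write LineString coordinates directly into flat buffers."""
    buffers = OwnedLineBuffers()
    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] != "LineString":
            raise ValueError(f"unsupported geometry type: {geometry['type']}")
        for position in geometry["coordinates"]:
            buffers.x.append(float(position[0]))
            buffers.y.append(float(position[1]))
        # Close this geometry by recording where the next one starts.
        buffers.offsets.append(len(buffers.x))
    return buffers


doc = json.loads(
    '{"type": "FeatureCollection", "features": ['
    '{"type": "Feature", "properties": {}, "geometry": '
    '{"type": "LineString", "coordinates": [[0, 0], [1, 1], [2, 0]]}},'
    '{"type": "Feature", "properties": {}, "geometry": '
    '{"type": "LineString", "coordinates": [[5, 5], [6, 6]]}}]}'
)
owned = assemble_linestrings(doc["features"])
```

Here `json.loads` stands in for the tokenizer layer only to keep the example short; the assembly step does not care how the features were isolated.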

The default implementation for this decision is the streaming tokenizer plus native geometry assembly. read_geojson_owned(..., prefer="auto") now selects gpu-byte-classify when a GPU runtime is available (producing device-resident geometry), falling back to fast-json on CPU-only hosts. Public geopandas.read_file(..., driver="GeoJSON") stays on pyogrio for now; the new staged path is exposed as an owned-ingest API and benchmark surface until it is semantically complete enough to replace the public host route.
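The `prefer="auto"` rule described above amounts to a small decision function. The sketch below is an assumption about its shape, not the repo's code: `resolve_strategy` and its `gpu_available` flag are hypothetical names, and only the two strategy names mentioned in this document are used.

```python
def resolve_strategy(prefer: str, gpu_available: bool) -> str:
    """Hypothetical helper mirroring the documented prefer="auto" rule."""
    if prefer != "auto":
        # Explicit strategies are opt-in and pass through unchanged.
        return prefer
    # auto: device-resident output when a GPU runtime exists, else the host path.
    return "gpu-byte-classify" if gpu_available else "fast-json"
```

Keeping the explicit strategies as pass-through values matches the way the experimental paths are described here: they run only when asked for by name.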

Follow-on evaluation also added a structural feature-span tokenizer as a separate strategy. It makes the future CCCL boundary more explicit, but in pure Python it is slower than both full-json native ingest and the older stream path. So it remains an opt-in strategy and a design seam for later GPU work, not the host default.
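To make the idea of a structural feature-span tokenizer concrete, here is a simplified pure-Python sketch. It is an illustration under stated assumptions, not the repo's tokenizer: `feature_spans` is a hypothetical name, and the scan assumes the first `"features"` key in the text is the top-level one and that its value is an array of objects. It tracks string and escape state so that braces inside property strings do not confuse the depth counter.

```python
import json


def feature_spans(text: str):
    """Yield (start, end) slices for each object in the "features" array."""
    key = text.find('"features"')
    if key < 0:
        return
    pos = text.find("[", key)
    if pos < 0:
        return
    depth = 0
    in_string = False
    escaped = False
    start = -1
    for i in range(pos + 1, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string only escapes and the closing quote matter.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                yield start, i + 1
        elif ch == "]" and depth == 0:
            return  # end of the features array


text = (
    '{"type": "FeatureCollection", "features": ['
    '{"type": "Feature", "properties": {"name": "a } \\" {"}, '
    '"geometry": {"type": "Point", "coordinates": [1.0, 2.0]}}, '
    '{"type": "Feature", "properties": {}, '
    '"geometry": {"type": "Point", "coordinates": [3.0, 4.0]}}]}'
)
spans = list(feature_spans(text))
points = [json.loads(text[s:e])["geometry"]["coordinates"] for s, e in spans]
```

A character-at-a-time loop like this is exactly the kind of work that is slow in pure Python, which is consistent with the structural strategy staying opt-in on the host.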

A second follow-on added an explicit pylibcudf strategy. It keeps the same host feature-span splitter for the outer FeatureCollection, then uses pylibcudf for JSON-path extraction, family-local JSON parsing, and typed coordinate column recovery. That gives the repo a real GPU-assisted tokenizer path, but it is still not the host-default winner because the current owned buffer contract remains host-materialized.

A third follow-on prototyped device-side rowization of the full parsed features array via pylibcudf plus interleave_columns. It is kept only as an explicit experimental strategy. Measured sweeps showed that it is much slower than the current host-split GPU path even for tiny homogeneous point inputs, and it still fails on heterogeneous feature schemas.

A fourth follow-on prototyped wildcard-array extraction from the full FeatureCollection using $.features[*].geometry.type and $.features[*].geometry.coordinates. This is the cleaner splitter-free design seam because it avoids host feature splitting entirely and can assemble owned buffers for homogeneous point, line, and polygon families from typed concatenated columns. It also remains explicit-only because current JSONPath wildcard extraction is still dramatically slower than the host-split GPU path.
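The two wildcard paths select one value per feature. The sketch below is a pure-Python stand-in for that selection, not pylibcudf code, and the helper names `wildcard_extract` and `assemble_points` are hypothetical. It shows why a homogeneous point family can be assembled from concatenated columns with no per-feature splitting.

```python
import json
from array import array


def wildcard_extract(doc):
    """Stand-in for $.features[*].geometry.type and ...geometry.coordinates."""
    types = [f["geometry"]["type"] for f in doc["features"]]
    coords = [f["geometry"]["coordinates"] for f in doc["features"]]
    return types, coords


def assemble_points(types, coords):
    """Build flat x/y buffers, but only for a homogeneous Point family."""
    families = set(types)
    if families != {"Point"}:
        raise ValueError(f"expected only Point features, got {sorted(families)}")
    x = array("d", (float(c[0]) for c in coords))
    y = array("d", (float(c[1]) for c in coords))
    return x, y


doc = json.loads(
    '{"type": "FeatureCollection", "features": ['
    '{"type": "Feature", "properties": {}, '
    '"geometry": {"type": "Point", "coordinates": [1.5, 2.5]}},'
    '{"type": "Feature", "properties": {}, '
    '"geometry": {"type": "Point", "coordinates": [3.5, 4.5]}}]}'
)
types, coords = wildcard_extract(doc)
x, y = assemble_points(types, coords)
```

The homogeneity check mirrors the restriction stated above: the splitter-free design only applies when every feature belongs to the same geometry family.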

The host-split GPU path is now hybrid on purpose:

- point, multipoint, linestring, and multilinestring use coordinates-only parsing on device
- polygon and multipolygon keep full-geometry parsing on device because coordinates-only parsing collapses ring structure
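The ring-structure point can be illustrated in plain Python. This is a host-side illustration of the general problem, not the device code, and the helper names are hypothetical. Flattening a polygon's coordinates into one run of positions discards the boundary between the exterior ring and its holes, so a separate ring-offset level has to be recovered from the full geometry.

```python
def flatten_positions(coordinates):
    """Flatten any nesting of GeoJSON coordinates into a list of [x, y]."""
    if coordinates and isinstance(coordinates[0], (int, float)):
        return [list(coordinates[:2])]
    flat = []
    for item in coordinates:
        flat.extend(flatten_positions(item))
    return flat


def ring_offsets(polygon_coordinates):
    """Record where each ring starts and ends in the flattened positions."""
    offsets = [0]
    for ring in polygon_coordinates:
        offsets.append(offsets[-1] + len(ring))
    return offsets


polygon = [
    [[0, 0], [4, 0], [4, 4], [0, 4], [0, 0]],  # exterior ring
    [[1, 1], [2, 1], [2, 2], [1, 1]],          # hole
]
flat = flatten_positions(polygon)  # one run of positions, ring boundary lost
offsets = ring_offsets(polygon)    # keeps the exterior/hole split
```

For a linestring the flat run is already the whole answer, which is why the simpler families can use coordinates-only parsing.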

Consequences

- GeoJSON now has a real staged owned-ingest seam instead of only a routed host adapter
- geometry assembly no longer depends on Shapely objects for the staged path
- the earlier bottleneck was isolated to Python-side tokenization, which ADR-0038 resolved with GPU byte-classification (12 NVRTC kernels, 1.8 s for 2.16 GB / 7.2M polygons); CPU property extraction (9.2 s) is now the remaining bottleneck
- the repo now has a real GPU-assisted GeoJSON tokenizer path instead of only a hypothetical seam, which means future work can optimize the device stages instead of starting from scratch
- the repo also has an experimental device-rowization prototype, but it is intentionally not on the default pylibcudf path because the current interleave_columns approach is dramatically slower than host-span planning
- the repo now also has an experimental wildcard-array GPU path that bypasses host feature splitting entirely for homogeneous families, but it is still not promoted because JSONPath wildcard extraction is the new bottleneck
- property materialization is now lazy on the owned batch, which avoids paying host-side property decode for geometry-only ingest paths
- auto now prefers gpu-byte-classify when a GPU is available for zero-copy device-resident output, falling back to fast-json on CPU-only hosts
- the public read_file behavior stays unchanged, which avoids regressions while the owned path matures
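Lazy property materialization, as described in the list above, can be sketched with a cached attribute. The `OwnedBatch` class below is hypothetical and only assumes that undecoded property JSON is kept per feature; the `decode_count` field exists purely to make the laziness observable.

```python
import json
from functools import cached_property


class OwnedBatch:
    """Hypothetical batch: geometry is eager, properties decode on first use."""

    def __init__(self, x, y, raw_properties):
        self.x = x
        self.y = y
        self._raw_properties = raw_properties  # undecoded JSON text per feature
        self.decode_count = 0

    @cached_property
    def properties(self):
        # Runs once; later accesses reuse the cached list.
        self.decode_count += 1
        return [json.loads(raw) for raw in self._raw_properties]


batch = OwnedBatch([1.0, 3.0], [2.0, 4.0], ['{"name": "a"}', '{"name": "b"}'])
decodes_before_access = batch.decode_count
names = [p["name"] for p in batch.properties]
names_again = [p["name"] for p in batch.properties]
```

A geometry-only consumer that never touches `properties` pays no JSON decode cost at all, which is the behavior the consequence describes.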

Alternatives Considered

- keep GeoJSON entirely behind pyogrio
- use json.loads of the full FeatureCollection as the permanent design center
- treat pylibcudf as the mandatory container parser for standard FeatureCollection GeoJSON

Acceptance Notes

Benchmarks on this machine show:

point-heavy GeoJSON at 100K rows:
- pyogrio: about 300K rows/s
- full json.loads plus native assembly: about 651K rows/s
- staged stream-native: about 331K rows/s
- structural tokenizer-native: about 126K rows/s
- pylibcudf GPU tokenizer-native: about 147K rows/s
focused wildcard-array rejection sweep:
- 100 point rows: about 199 rows/s wildcard-array vs about 4.05K rows/s for the current host-split GPU path
- 1K point rows: about 274 rows/s wildcard-array vs about 33.2K rows/s for the current host-split GPU path
- 5K point rows: about 275 rows/s wildcard-array vs about 94.0K rows/s for the current host-split GPU path
focused rowization rejection sweep:
- 100 point rows: about 92 rows/s rowized vs about 4.00K rows/s for the current host-split GPU path
- 1K point rows: about 120 rows/s rowized vs about 33.1K rows/s for the current host-split GPU path
polygon-heavy GeoJSON at 20K rows:
- pyogrio: about 161K rows/s
- full json.loads plus native assembly: about 288K rows/s
- staged stream-native: about 147K rows/s
- structural tokenizer-native: about 52K rows/s
- pylibcudf GPU tokenizer-native: about 52K rows/s

Additional sweeps at 10K and 500K point rows and at 5K polygon rows kept the same ranking. That is enough to validate the staged direction, keep full-json native as the host winner, and justify focusing the next acceleration step on CCCL-backed tokenization and direct device-to-owned buffer writeout rather than on geometry assembly alone.
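The ranking is easier to read as ratios against pyogrio. The short script below only restates the approximate throughputs quoted above and divides them by the pyogrio figure; the dictionary keys are informal labels for the listed strategies, not API names.

```python
point_rows_per_s = {
    "pyogrio": 300_000,
    "full-json native": 651_000,
    "stream-native": 331_000,
    "structural tokenizer-native": 126_000,
    "pylibcudf tokenizer-native": 147_000,
}
polygon_rows_per_s = {
    "pyogrio": 161_000,
    "full-json native": 288_000,
    "stream-native": 147_000,
    "structural tokenizer-native": 52_000,
    "pylibcudf tokenizer-native": 52_000,
}


def relative_to(baseline, table):
    """Express each throughput as a multiple of the baseline's throughput."""
    base = table[baseline]
    return {name: round(rate / base, 2) for name, rate in table.items()}


point_ratios = relative_to("pyogrio", point_rows_per_s)
polygon_ratios = relative_to("pyogrio", polygon_rows_per_s)
for name, ratio in point_ratios.items():
    print(f"points   {name}: {ratio}x pyogrio")
for name, ratio in polygon_ratios.items():
    print(f"polygons {name}: {ratio}x pyogrio")
```

On these figures full-json native is roughly 2.2x pyogrio for points and 1.8x for polygons, while both tokenizer-native strategies stay below half of pyogrio's throughput.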