File Format IO¶
Intent¶
Define how file-based vector formats should route through the repo while keeping GPU-native formats primary and legacy formats explicit.
Request Signals¶
io file
geojson
shapefile
read_file
to_file
file format
gdal
Open First¶
docs/architecture/io-files.md
src/vibespatial/io/file.py
tests/test_io_file.py
Verify¶
uv run pytest tests/test_io_file.py
uv run python scripts/benchmark_io_file.py --suite smoke
uv run python scripts/check_docs.py --check
Risks¶
Legacy GDAL formats masquerading as native hides work that bypasses the GPU stack.
GeoJSON geometry ingest is now GPU-accelerated (ADR-0038); remaining bottleneck is CPU property extraction.
Shapefile adapter losing speed leadership if the raw Arrow-binary fast path regresses.
Decision¶
GeoJSON is a first-class hybrid path. Files >10 MB auto-route to GPU byte-classification (ADR-0038); smaller files and filtered reads use pyogrio.
Shapefile is a first-class hybrid path with pyogrio-first routing.
Other GDAL vector formats stay behind an explicit legacy fallback adapter.
Public `geopandas.read_file` and `GeoDataFrame.to_file` should dispatch through repo-owned wrappers so the chosen path is observable.
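The routing contract above can be sketched as a small classifier. This is a minimal illustration, not the actual vibespatial API: the function and event names (`classify_path`, `"dispatch"`, `"fallback"`) are assumptions made for the example.

```python
import os

# Hypothetical sketch of the repo-owned routing policy; names are
# illustrative, not the real vibespatial wrapper API.
NATIVE_SUFFIXES = {".geojson": "geojson", ".json": "geojson", ".shp": "shapefile"}

def classify_path(path):
    """Return (adapter, events) for a vector file path.

    GeoJSON and Shapefile are first-class hybrid paths and emit only a
    dispatch event; every other GDAL format routes through the explicit
    legacy fallback adapter and emits a fallback event as well, so work
    that bypasses the GPU-oriented stack stays observable.
    """
    fmt = NATIVE_SUFFIXES.get(os.path.splitext(path)[1].lower())
    if fmt is not None:
        return fmt, [("dispatch", fmt)]
    return "gdal-legacy", [("dispatch", "gdal-legacy"), ("fallback", "gdal-legacy")]
```

The extra fallback event for legacy formats is deliberate: it is the observable signal named in the performance contract below.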
Performance Notes¶
GeoJSON files >10 MB now auto-route to GPU byte-classification (ADR-0038) for geometry, with pyogrio as fallback. Shapefile stays pyogrio-first.
The pyogrio bias is retained for Shapefile and small GeoJSON because that path keeps us closer to Arrow- and columnar-oriented follow-on work.
Legacy GDAL formats should not masquerade as native; the extra explicit fallback event is part of the performance contract because it exposes work that still bypasses the GPU-oriented stack.
Current Behavior¶
`geopandas.read_file` now classifies GeoJSON, Shapefile, and legacy GDAL paths through one repo-owned router. `GeoDataFrame.to_file` uses the same routing policy. GeoJSON and Shapefile record dispatch events without fallback events.
Legacy formats such as GPKG emit explicit fallback events.
Repo-owned GeoJSON ingest now also has an internal staged owned path:

- `auto` prefers `gpu-byte-classify` when a GPU runtime is available, producing device-resident geometry via NVRTC kernels. On CPU-only hosts, `auto` falls back to `fast-json`: `orjson` for parsing (when available, otherwise CPython `json`) plus vectorized per-family coordinate extraction directly into numpy owned buffers. The `fast-json` path is 2.4-2.6x faster than the previous `full-json` default, and 3.5-3.9x faster than `pyogrio`.
- `prefer="chunked"` splits the features array into byte-range chunks, parses each chunk with orjson, and extracts coordinates with vectorized numpy. It is slightly slower than single-pass `fast-json` but reduces peak memory.
- `prefer="full-json"` remains available as the legacy host path using `json.loads` plus per-element native geometry assembly.
- `prefer="pylibcudf"` uses host feature-span discovery plus `pylibcudf` bulk JSON-path extraction and family-local GPU parsing; it is now slower than `fast-json` because host-side span discovery dominated the GPU savings.
- `prefer="pylibcudf-arrays"` exposes a cleaner splitter-free GPU prototype that extracts `$.features[*].geometry.type` and `$.features[*].geometry.coordinates` directly from the full `FeatureCollection` and assembles owned buffers from concatenated typed columns.
- `prefer="pylibcudf-rowized"` exposes an experimental device-rowization prototype for homogeneous feature arrays, but it is intentionally not the default GPU route.
- The GPU path uses coordinates-only parsing for point/line families and full-geometry parsing for polygon families, because coordinates-only parsing loses ring structure for polygons.
- Property dictionaries are materialized lazily on the owned batch, so geometry-only callers do not pay host-side property decode by default.
- `prefer="gpu-byte-classify"` uses 12 NVRTC kernels for GPU byte classification, structural scanning, geometry type detection, coordinate extraction, and ASCII-to-fp64 parsing directly on device-resident file bytes. It supports homogeneous and mixed Point, LineString, and Polygon files. Type detection scans for `"type":` keys at geometry depth, classifies per feature, then partitions into family-local decode batches (per io-acceleration.md policy). Property extraction stays on CPU via orjson (hybrid design per ADR-0038). Geometry parse: 1.8s for 2.16 GB / 7.2M polygons (32x vs pyogrio). Total read including properties: 11.7s (4.9x vs pyogrio). File-to-device transfer uses kvikio when installed (parallel POSIX reads with pinned bounce buffers, no GDS required), falling back to `cp.asarray` otherwise. Thread count is tunable via `KVIKIO_NTHREADS`.
- The `read_file` GPU path auto-selects `gpu-byte-classify` for GeoJSON files >10 MB when a CUDA device is available, before falling back to the pyogrio GPU WKB path.
- The stream tokenizer and structural feature-span tokenizer remain available as explicit strategies.
- assemble geometry directly into owned buffers without Shapely objects
- keep property rows separate from geometry assembly
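The staged `auto` selection described above reduces to a small decision function. This is a sketch under assumptions: the helper name and signature are invented for illustration, and the real selector also accounts for filtered reads (which stay on pyogrio).

```python
# Illustrative sketch of the staged "auto" strategy choice; the function
# name and parameters are assumptions, not the real module API.
GPU_SIZE_THRESHOLD = 10 * 1024 * 1024  # files >10 MB auto-route to the GPU path

def choose_geojson_strategy(file_size, gpu_available, orjson_available=True):
    if gpu_available and file_size > GPU_SIZE_THRESHOLD:
        return "gpu-byte-classify"  # NVRTC byte-classification kernels on device bytes
    if orjson_available:
        return "fast-json"          # orjson parse + vectorized coordinate extraction
    return "full-json"              # CPython json.loads legacy host path
```

Small files never pay GPU launch and transfer overhead, and CPU-only hosts degrade gracefully through the host strategies in documented preference order.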
Repo-owned Shapefile ingest now also has an internal batch-first owned path:

- `read_shapefile_owned(...)` uses `pyogrio.read_arrow(...)` for host container parsing of `.shp`/`.shx`/`.dbf`
- geometry lands through Arrow `geoarrow.wkb` batches into owned buffers via the repo-owned native WKB decoder
- homogeneous point Shapefiles now use a raw Arrow-binary point fast path before any `to_pylist()` materialization
- attributes stay in a columnar Arrow table instead of materializing a GeoDataFrame during ingest
- the public `read_file(..., driver="ESRI Shapefile")` route stays on `pyogrio` until the owned batch path is measurably faster end-to-end
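The WKB decode step in the owned path works on the standard little-endian byte layout. A minimal stand-in for the repo-owned decoder, using per-blob `struct` calls where the real fast path operates on raw Arrow binary buffers with numpy:

```python
import struct
from array import array

def decode_wkb_points(wkb_blobs):
    """Decode little-endian WKB Point blobs into one flat (x, y) buffer.

    Sketch only: the repo-owned native decoder consumes the same byte
    layout (1-byte order flag, uint32 geometry type, two float64s) but
    vectorizes over whole Arrow buffers instead of looping per blob.
    """
    coords = array("d")
    for blob in wkb_blobs:
        byte_order, geom_type = struct.unpack_from("<BI", blob, 0)
        assert byte_order == 1 and geom_type == 1, "expects little-endian WKB Point"
        coords.extend(struct.unpack_from("<2d", blob, 5))
    return coords
```

The flat coordinate buffer is the "owned buffer" shape: no Shapely objects, no per-feature Python geometry construction.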
Measured Local Baseline¶
On this machine the fast-json strategy is the clear GeoJSON ingest winner.
It uses orjson for parsing and vectorized per-family coordinate extraction
directly into numpy arrays, eliminating the old per-feature assembly loop.
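The shape of that fast path can be sketched with the stdlib. This is not the real implementation: the actual strategy uses orjson and numpy, and `array('d')` stands in here for the numpy owned buffer; the function name is invented for the example.

```python
import json
from array import array

def extract_point_coords(geojson_bytes):
    """Host fast-path sketch: parse once, then bulk-extract coordinates.

    Stand-in for the fast-json strategy (which uses orjson + numpy).
    The key property is the absence of any per-feature geometry-object
    assembly: coordinates land directly in one flat owned buffer.
    """
    features = json.loads(geojson_bytes)["features"]
    flat = array("d")  # flat owned buffer: [x0, y0, x1, y1, ...]
    for feature in features:
        flat.extend(feature["geometry"]["coordinates"])
    return flat
```

In the real path the per-feature `extend` loop is replaced by vectorized numpy extraction over each geometry family, which is where the remaining speedup over `full-json` comes from.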
point-heavy GeoJSON at 100K rows:

- pyogrio: about 300K rows/s
- fast-json (orjson + vectorized): about 1,141K rows/s
- full `json.loads` plus native assembly (old default): about 451K rows/s
- pylibcudf GPU tokenizer-native: about 542K rows/s
- staged stream-native: about 331K rows/s
- structural tokenizer-native: about 126K rows/s

point-heavy GeoJSON at 1M rows:

- pyogrio: about 290K rows/s
- fast-json (orjson + vectorized): about 1,041K rows/s
- full `json.loads` plus native assembly: about 439K rows/s

polygon-heavy GeoJSON at 20K rows:

- pyogrio: about 161K rows/s
- fast-json (orjson + vectorized): about 745K rows/s
- full `json.loads` plus native assembly: about 282K rows/s
- pylibcudf GPU tokenizer-native: about 307K rows/s
The fast-json path achieves 3.5-3.9x speedup over pyogrio and
2.4-2.6x over the old full-json default. The pylibcudf GPU path is now
slower than fast-json because host-side span discovery and PyArrow string
column construction dominated GPU compute savings. The remaining bottleneck is
orjson.loads() itself, which for future work could be addressed by:
- CCCL-backed byte classification and span planning on GPU
- `simdjson` integration for ~2-4x faster host parsing
- direct-to-device coordinate decode bypassing Python objects entirely
Current Shapefile numbers on this machine now clear the published ingest floors:
point-heavy Shapefile at 10K rows:

- `pyogrio.read_dataframe`: about 1.18M rows/s
- `pyogrio.read_arrow` container parse: about 3.97M rows/s
- repo-owned native WKB decode only: about 1.65M rows/s
- full owned batch ingest: about 925K rows/s

point-heavy Shapefile at 100K rows after the raw Arrow-binary point fast path:

- `pyogrio.read_dataframe`: about 1.25M rows/s
- `pyogrio.read_arrow` container parse: about 4.41M rows/s
- repo-owned native WKB decode only: about 1.56M rows/s
- full owned batch ingest: about 4.10M rows/s

point-heavy Shapefile at 1M rows:

- `pyogrio.read_dataframe`: about 1.12M rows/s
- `pyogrio.read_arrow` container parse: about 4.48M rows/s
- repo-owned native WKB decode only: about 1.43M rows/s
- full owned batch ingest: about 4.08M rows/s

line-heavy Shapefile at 10K rows:

- `pyogrio.read_dataframe`: about 1.10M rows/s
- `pyogrio.read_arrow` container parse: about 3.32M rows/s
- repo-owned native WKB decode only: about 611K rows/s
- full owned batch ingest: about 3.20M rows/s

line-heavy Shapefile at 1M rows:

- `pyogrio.read_dataframe`: about 1.02M rows/s
- `pyogrio.read_arrow` container parse: about 3.73M rows/s
- repo-owned native WKB decode only: about 617K rows/s
- full owned batch ingest: about 3.40M rows/s

polygon-heavy Shapefile at 5K rows:

- `pyogrio.read_dataframe`: about 858K rows/s
- `pyogrio.read_arrow` container parse: about 2.37M rows/s
- repo-owned native WKB decode only: about 502K rows/s
- full owned batch ingest: about 2.29M rows/s

polygon-heavy Shapefile at 250K rows:

- `pyogrio.read_dataframe`: about 893K rows/s
- `pyogrio.read_arrow` container parse: about 2.51M rows/s
- repo-owned native WKB decode only: about 452K rows/s
- full owned batch ingest: about 2.24M rows/s
The main change was shifting non-point families off the generic per-row Arrow WKB bridge and onto uniform raw-buffer fast paths. That turned the owned path from “points only” into a broad Shapefile ingest win:
- point-heavy ingest now runs about 3.63x faster than the current host baseline at 1M rows
- line-heavy ingest now runs about 3.32x faster than the current host baseline at 1M rows
- polygon-heavy ingest now runs about 2.51x faster than the current host baseline at 250K rows
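These headline ratios follow directly from the tables above: full owned-batch ingest throughput divided by the `pyogrio.read_dataframe` host baseline (small differences in the last digit come from rounding the published throughputs).

```python
# Speedup = owned-batch ingest throughput / pyogrio.read_dataframe baseline,
# using the rounded "about" figures from the measured-baseline tables above.
speedups = {
    "point @ 1M rows":     4.08 / 1.12,   # reported as about 3.63x
    "line @ 1M rows":      3.40 / 1.02,   # reported as about 3.32x
    "polygon @ 250K rows": 2.24 / 0.893,  # reported as about 2.51x
}
```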
The GPU byte-classification path (ADR-0038) now handles geometry extraction in
1.8s for the 2.16 GB Florida.geojson benchmark (32x vs pyogrio). The
remaining bottleneck is CPU property extraction at 9.2s: 7.2M
orjson.loads() calls dominated by Python interpreter overhead (function call
dispatch, dict construction), not JSON parsing throughput. POC evaluation of
parallel orjson, multiprocessing, pylibcudf, and coordinate stripping showed
no meaningful improvement (see ADR-0038 Consequences). The next acceleration
step would be a native (Rust/C) columnar property extractor, which is deferred
because it changes the pure-Python build story and the cost is already lazy.
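"The cost is already lazy" refers to the deferred property decode on the owned batch. A minimal sketch of that pattern, with illustrative class and attribute names (not the actual vibespatial types):

```python
import json

class LazyProperties:
    """Sketch of lazy property materialization on an owned batch.

    The batch keeps the raw per-feature property byte spans and decodes
    them only on first access, so geometry-only callers never pay the
    per-feature decode cost. Names here are illustrative.
    """

    def __init__(self, raw_spans):
        self._raw_spans = raw_spans  # undecoded JSON byte strings
        self._decoded = None

    @property
    def rows(self):
        if self._decoded is None:  # decode exactly once, on demand
            self._decoded = [json.loads(span) for span in self._raw_spans]
        return self._decoded
```

Until `rows` is touched, the batch holds only raw bytes; a native columnar extractor would replace the decode loop without changing this access contract.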