# File Format IO

## Intent
Define how file-based vector formats should route through the repo while keeping GPU-native formats primary and legacy formats explicit.
## Request Signals
- io file
- geojson
- shapefile
- read_file
- to_file
- file format
- gdal
## Open First
- `docs/architecture/io-files.md`
- `src/vibespatial/io/file.py`
- `tests/test_io_file.py`
## Verify
- `uv run pytest tests/test_io_file.py`
- `uv run python scripts/benchmark_io_file.py --suite smoke`
- `uv run python scripts/check_docs.py --check`
## Risks
- Legacy GDAL formats masquerading as native hides work that bypasses the GPU stack.
- GeoJSON geometry ingest is now GPU-accelerated (ADR-0038); the remaining hybrid seam is the staged host property decode.
- The Shapefile adapter loses its speed leadership if the raw Arrow-binary fast path regresses.
## Decision
- GeoJSON is a first-class hybrid path. Unfiltered public reads auto-route to the repo-owned GPU byte-classify path whenever a GPU runtime is available, because `read + downstream GPU consume` is the planning objective for `read_file`; filtered reads still use pyogrio.
- Shapefile is a first-class hybrid path. Eligible public reads prefer the repo-owned native plan: direct SHP GPU decode first, Arrow/WKB fallback second.
- Promoted pyogrio-backed vector containers such as GeoPackage, FileGDB, GML, GPX, TopoJSON, GeoJSON-Seq, and FlatGeobuf stay hybrid rather than being treated as canonical GPU-native formats.
- Untargeted legacy GDAL vector formats stay behind an explicit fallback adapter.
- Public `geopandas.read_file` and `GeoDataFrame.to_file` should dispatch through repo-owned wrappers so the chosen path is observable.
- On the `pyogrio` write path, terminal export should prefer the shared native tabular Arrow boundary over rebuilding a GeoDataFrame-shaped host export. Public device-backed GeoJSON, Shapefile, GeoPackage, and FlatGeobuf writes may use that sink when request semantics match pyogrio exactly; CPU, append, and legacy-metadata cases remain explicit compatibility paths. Fiona remains a host boundary.
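The "observable dispatch" requirement above can be illustrated with a small decorator that records which IO path served each public call. This is a minimal sketch: `observable`, `DISPATCH_EVENTS`, and `read_geojson_gpu` are hypothetical names, not the repo's actual wrappers.

```python
import functools

# Hypothetical event sink; the real repo presumably routes these events
# into its own dispatch-accounting machinery.
DISPATCH_EVENTS: list[dict] = []

def observable(selected_path: str):
    """Record which IO path served a public call before running it."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            DISPATCH_EVENTS.append({"fn": fn.__name__, "selected_path": selected_path})
            return fn(*args, **kwargs)
        return wrapper
    return deco

@observable("hybrid")
def read_geojson_gpu(path):
    # Stand-in for a repo-owned GPU read branch.
    return f"gpu-read:{path}"

print(read_geojson_gpu("florida.geojson"))
print(DISPATCH_EVENTS[-1]["selected_path"])  # hybrid
```

The point is only that path selection leaves an explicit trace, so fallback events (the "masquerading" risk above) are visible rather than silent.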
## Performance Notes
- GeoJSON public `read_file(...)` now prefers pipeline-optimal routing over a coarse file-size heuristic. The 10k bar is parity-or-better on the public `read_file + first GPU stage` path, not isolated parser throughput.
- `fast-json` remains the measured standalone GeoJSON parser winner, so its benchmark rails still act as the host baseline.
- Legacy GDAL formats should not masquerade as native; explicit fallback events expose work that still bypasses the GPU-oriented stack.
## Current Behavior
- `geopandas.read_file` now classifies GeoJSON, Shapefile, and legacy GDAL paths through one repo-owned router.
- `read_vector_file_native(...)` is now the shared native file-read surface for promoted vector formats. It returns a `NativeTabularResult` at the read boundary and lets public `read_file(...)` materialize a `GeoDataFrame` only at the explicit compatibility/export step.
- The repo-owned file router already makes the promoted read-boundary classification explicit through `plan_vector_file_io(...).selected_path`:
  - `hybrid`: GeoJSON, Shapefile, WKT, CSV, KML, OSM PBF, GeoPackage, FileGDB, FlatGeobuf, GML, GPX, TopoJSON, GeoJSON-Seq
  - `fallback`: untargeted legacy GDAL formats
- Repo-owned `to_file(..., engine="pyogrio")` writes through `pyogrio.write_arrow(...)` from the shared native tabular boundary whenever a native result is available. Public device-backed GeoJSON, Shapefile, GeoPackage, and FlatGeobuf writes now use the same native Arrow/WKB sink and force device WKB encode so small exports do not fall through the generic host encoder threshold.
- Public CPU-backed `GeoDataFrame.to_file(..., engine="pyogrio")` stays on `pyogrio.write_dataframe(...)` for compatibility-sensitive cases such as append mode, timezone-preserving datetime fields, unsupported metadata combinations, and other legacy driver semantics.
- Repo-owned `read_file` GPU branches that can already produce owned geometry plus columnar attributes now lower through the shared `NativeTabularResult` boundary before terminal `GeoDataFrame` materialization. That now includes the pyogrio Arrow + native WKB bridge, direct WKT/CSV/KML readers, both Shapefile GPU paths, the GeoJSON byte-classify path after property extraction, and the OSM PBF hybrid path after protobuf/tag extraction.
- The OSM PBF public boundary now uses a bounded, lossless tag projection instead of widening every observed tag key into its own eager object column. Common OSM keys stay first-class and the remainder stays in `other_tags`, avoiding the previous Florida-scale `2843`-column host explosion.
- Public OSM standard layers (`points`, `lines`, `multilinestrings`, `multipolygons`, `other_relations`) now prefer `pyogrio` container reads through the shared native boundary. Those supported-layer scans run in parallel for the default public `read_file("*.osm.pbf")` path, so the user-facing wall time is no longer dominated by five serial OSM driver passes. Layers with native-supported geometry stay on Arrow + GPU WKB; `other_relations` skips the unsupported owned WKB decode and uses an explicit compatibility bridge because real PBF data still carries `GeometryCollection`. The repo-owned hybrid OSM parser remains the path for `layer="all"` and the full-data native contract.
- Default public `read_file("*.osm.pbf")` now combines those supported public layers by default instead of forcing the full mixed all-data parser into one eager frame. Small node-only fixtures and other empty supported-layer cases explicitly fall back to the full native parser so data is not lost.
- OSM PBF native reads now keep tag projection lazy until explicit export. The low-level reader still returns host-resident tag dicts today, but the shared native file boundary no longer eagerly rebuilds a giant pandas object table.
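The bounded, lossless tag projection described above can be sketched as follows. The `COMMON_KEYS` subset and the `project_tags` helper are illustrative assumptions; the repo's actual key list and column machinery are not shown here.

```python
# Common OSM keys become first-class columns; every other observed tag stays
# inside a single other_tags column, so column count stays bounded while no
# tag data is dropped. COMMON_KEYS is an illustrative subset.
COMMON_KEYS = ("name", "highway", "building", "amenity")

def project_tags(tag_dicts):
    columns = {k: [] for k in COMMON_KEYS}
    other_tags = []
    for tags in tag_dicts:
        for k in COMMON_KEYS:
            columns[k].append(tags.get(k))
        other_tags.append({k: v for k, v in tags.items() if k not in COMMON_KEYS})
    columns["other_tags"] = other_tags
    return columns

cols = project_tags([{"name": "Main St", "highway": "residential", "lanes": "2"}])
print(cols["highway"])     # ['residential']
print(cols["other_tags"])  # [{'lanes': '2'}]
```

Under this shape, a Florida-scale file with thousands of distinct tag keys still projects to `len(COMMON_KEYS) + 1` columns instead of one eager object column per key.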
- Full-data OSM native reads now normalize through an internal partitioned bundle before any public layer projection or `GeoDataFrame` assembly. That keeps parser-shaped node/way/relation output out of the public boundary while preserving a reusable full-data seam for future views.
- Large geometry-column CSV now prefers a `pylibcudf`/`libcudf` table parse before native GPU WKT/WKB decode. Public `GeoSeries.from_wkt(...)` also uses the GPU WKT parser for large clean arrays, so the common `pd.read_csv` plus WKT constructor idiom no longer stays pure Shapely at Florida scale.
- FlatGeobuf now defaults to the repo-owned direct FlatBuffer GPU decoder for eligible local unfiltered public `read_file(...)`/`read_vector_file_native(...)` calls and uses a typed dense-property extractor for common numeric plus repeated-string schemas. Explicit `engine="pyogrio"` and container-shaped requests stay on the shared Arrow + GPU WKB native boundary.
- GeoJSONSeq now routes eligible local unfiltered public reads through the GPU GeoJSON parser by rewriting newline-delimited feature records into a FeatureCollection byte payload. Filtered or explicit pyogrio-shaped requests stay on the shared Arrow + GPU WKB native boundary.
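The GeoJSONSeq rewrite above can be sketched in a few lines: join the newline-delimited records into one `FeatureCollection` byte payload so a FeatureCollection-shaped parser can consume them. This is a simplified assumption-level sketch; real GeoJSON text sequences may carry RFC 8142 RS (`0x1E`) separators and blank lines, which are handled here only minimally.

```python
def seq_to_feature_collection(payload: bytes) -> bytes:
    """Rewrite newline-delimited GeoJSON features into one FeatureCollection."""
    records = [
        line.strip().lstrip(b"\x1e")        # drop whitespace, then any RS marker
        for line in payload.splitlines()
        if line.strip(b"\x1e \t")           # skip blank / separator-only lines
    ]
    return b'{"type":"FeatureCollection","features":[' + b",".join(records) + b"]}"

seq = b'{"type":"Feature","geometry":{"type":"Point","coordinates":[0,1]},"properties":{}}\n'
print(seq_to_feature_collection(seq).decode())
```

Because the rewrite is byte-level, the downstream parser never needs to know the input arrived as a sequence rather than a collection.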
- GeoJSON remains an explicit hybrid compatibility boundary because property extraction is still host-side even though geometry decode is GPU-native. Public `read_file(...)` now tries that repo-owned GPU path for all eligible unfiltered GeoJSON reads, including explicit `engine="pyogrio"` when the request shape stays native-compatible. If the GPU parser fails, the public boundary falls back to repo-owned `fast-json` before reaching vendored pyogrio. The shared native read boundary preserves properties lazily, accepts both filesystem and in-memory RFC 7946 sources, and treats `track_properties=False` as an explicit geometry-only contract.
- Shapefile remains an explicit hybrid compatibility boundary because the container and DBF attribute story are still legacy-oriented even when geometry decode is native. Public automatic reads prefer direct SHP, while explicit `engine="pyogrio"` reads stay on the Arrow/WKB bridge for pyogrio-shaped requests. Untargeted legacy GDAL formats stay explicit compatibility.
- Public GeoPackage reads and the promoted pyogrio-backed vector-container family now keep `mask` and safe `layer` filters on the shared native Arrow/WKB boundary whenever the request stays native-compatible; invalid `bbox` plus `mask` fails before dispatch accounting. The public boundary asks pyogrio for datetime strings so naive datetime fields and timezone-aware roundtrips survive without forced UTC Arrow timestamps. Unsupported public geometry families such as `Point Z` and `Unknown` still route through explicit compatibility.
- Repo-owned GeoJSON ingest now also has an internal staged owned path:
  - `auto` now has two explicit objectives: `pipeline` prefers `gpu-byte-classify` when a GPU runtime is available so the first downstream GPU consumer does not pay an immediate promotion, while `standalone` prefers `fast-json`, which is still the measured isolated ingest winner on this machine. On CPU-only hosts both objectives fall back to `fast-json`.
  - `fast-json`: `orjson` (when available, otherwise CPython `json`) plus vectorized per-family coordinate extraction into numpy owned buffers. That path is 2.4-2.6x faster than the previous `full-json` default and 3.5-3.9x faster than `pyogrio`.
  - `prefer="chunked"` splits the features array into byte-range chunks, parses each chunk with orjson, and extracts coordinates with vectorized numpy. It is slightly slower than single-pass `fast-json` but reduces peak memory.
  - `prefer="full-json"` remains available as the legacy host path using `json.loads` plus per-element native geometry assembly.
  - `prefer="pylibcudf"` uses host feature-span discovery plus `pylibcudf` bulk JSON-path extraction and family-local GPU parsing; it is now slower than `fast-json` because host-side span discovery dominated the GPU savings.
  - `prefer="pylibcudf-arrays"` exposes a cleaner splitter-free GPU prototype that extracts `$.features[*].geometry.type` and `$.features[*].geometry.coordinates` directly from the full `FeatureCollection` and assembles owned buffers from concatenated typed columns.
  - `prefer="pylibcudf-rowized"` exposes an experimental device-rowization prototype for homogeneous feature arrays, but it is intentionally not the default GPU route.
  - The GPU path uses coordinates-only parsing for point/line families and full-geometry parsing for polygon families, because coordinates-only parsing loses ring structure for polygons.
  - Property dictionaries are materialized lazily on the owned batch, so geometry-only callers do not pay host-side property decode by default.
  - `prefer="gpu-byte-classify"` uses 12 NVRTC kernels for GPU byte classification, structural scanning, geometry type detection, coordinate extraction, and ASCII-to-fp64 parsing directly on device-resident file bytes. It supports homogeneous and mixed Point, LineString, and Polygon files. Type detection scans for `"type":` keys at geometry depth, classifies per feature, then partitions into family-local decode batches (per io-acceleration.md policy). The GPU path now also captures `$.properties` object spans while structural state is still on device, so the host only decodes the small property-object payloads instead of reparsing full feature JSON. Geometry parse: 1.8s for 2.16 GB / 7.2M polygons (32x vs pyogrio). Total public read including properties: 6.7s on the April 20, 2026 local Florida run. File-to-device transfer uses kvikio when installed (parallel POSIX reads with pinned bounce buffers, no GDS required), falling back to `cp.asarray` otherwise. Thread count is tunable via `KVIKIO_NTHREADS`.
  - The public `read_file` GeoJSON path now auto-selects `gpu-byte-classify` for eligible unfiltered reads whenever a CUDA device is available, because the planning objective is end-to-end pipeline shape rather than naked file parse speed.
  - The stream tokenizer and structural feature-span tokenizer remain available as explicit strategies: they assemble geometry directly into owned buffers without Shapely objects and keep property rows separate from geometry assembly.
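The `fast-json` idea above can be sketched for the simplest homogeneous Point case: one bulk parse plus vectorized coordinate extraction into an owned numpy buffer, with no Shapely objects. This sketch substitutes the stdlib `json` module for `orjson` and ignores mixed-family files and lazy property handling; it is not the repo's parser.

```python
import json
import numpy as np

def fast_json_points(payload: bytes) -> np.ndarray:
    """Parse a Point-only FeatureCollection into an owned (n, 2) float64 buffer."""
    features = json.loads(payload)["features"]   # orjson.loads in the real path
    coords = np.empty((len(features), 2), dtype=np.float64)
    # Bulk assignment from a list of [x, y] pairs; no per-feature geometry objects.
    coords[:] = [f["geometry"]["coordinates"] for f in features]
    return coords

doc = {"type": "FeatureCollection", "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [x, x + 1]},
     "properties": {}}
    for x in range(3)
]}
pts = fast_json_points(json.dumps(doc).encode())
print(pts.shape)  # (3, 2)
```

The per-family split in the real path exists because this flat `(n, 2)` shape only works when every feature shares one coordinate arity; polygons need ring offsets as well.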
- Repo-owned Shapefile ingest now also has an internal native-first owned path:
  - `read_shapefile_owned(...)` now prefers direct SHP binary decode on GPU plus the GPU DBF parser for plain local-file reads with no bbox, column projection, row window, or pyogrio-specific kwargs.
  - When the request needs those pyogrio container features, it falls back to `pyogrio.read_arrow(...)` plus the repo-owned native WKB decoder.
  - Attributes stay columnar through the read boundary instead of materializing a GeoDataFrame during ingest.
  - The direct SHP path and the Arrow-WKB fallback now share the same public owned/native read boundary instead of reader-local frame assembly.
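The direct SHP decode idea can be sketched on CPU for the simplest Point-only case: skip the 100-byte main file header, then walk fixed-size records (8-byte big-endian record header, 4-byte little-endian shape type, two float64 coordinates). This is a structural illustration under stated assumptions, not the GPU decoder; real `.shp` files also need header/bbox validation and non-point record layouts.

```python
import struct

def decode_point_shp(buf: bytes) -> list[tuple[float, float]]:
    coords, offset = [], 100            # main file header is 100 bytes
    while offset < len(buf):
        # Record header: big-endian record number + content length in 16-bit words.
        (content_words,) = struct.unpack_from(">i", buf, offset + 4)
        # Record content: little-endian shape type + x + y.
        shape_type, x, y = struct.unpack_from("<idd", buf, offset + 8)
        assert shape_type == 1          # 1 == Point in the shapefile spec
        coords.append((x, y))
        offset += 8 + content_words * 2  # advance past header + content
    return coords

# Build a two-record Point .shp body by hand to exercise the walker:
# Point content = 4-byte shape type + 16 bytes of coordinates = 10 words.
rec = lambda n, x, y: struct.pack(">ii", n, 10) + struct.pack("<idd", 1, x, y)
buf = bytes(100) + rec(1, 1.0, 2.0) + rec(2, 3.0, 4.0)
print(decode_point_shp(buf))  # [(1.0, 2.0), (3.0, 4.0)]
```

Because Point records are fixed-width, this walk is trivially parallelizable, which is what makes the uniform raw-buffer fast paths described above attractive on GPU.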
## Measured Local Baseline
On this machine `fast-json` is still the clear standalone GeoJSON ingest winner. The public-path KPI is different: GeoJSON is measured on `read_file(...)` + first downstream GPU stage, with 10k as the minimum acceptance scale, because that determines whether we read on CPU and immediately pay a promotion tax.
Point-heavy GeoJSON at `100K` rows:

- `pyogrio`: about `300K` rows/s
- `fast-json` (orjson + vectorized): about `1,141K` rows/s
- full `json.loads` plus native assembly (old default): about `451K` rows/s
- `pylibcudf` GPU tokenizer-native: about `542K` rows/s
- staged stream-native: about `331K` rows/s
- structural tokenizer-native: about `126K` rows/s
Point-heavy GeoJSON at `1M` rows:

- `pyogrio`: about `290K` rows/s
- `fast-json` (orjson + vectorized): about `1,041K` rows/s
- full `json.loads` plus native assembly: about `439K` rows/s
Polygon-heavy GeoJSON at `20K` rows:

- `pyogrio`: about `161K` rows/s
- `fast-json` (orjson + vectorized): about `745K` rows/s
- full `json.loads` plus native assembly: about `282K` rows/s
- `pylibcudf` GPU tokenizer-native: about `307K` rows/s

The `fast-json` path achieves a `3.5-3.9x` speedup over `pyogrio` and `2.4-2.6x` over the old `full-json` default. The `pylibcudf` GPU path is now slower than `fast-json` because host-side span discovery and `PyArrow` string construction dominated the GPU compute savings. The remaining bottleneck is `orjson.loads()` itself, which future work could address by:

- CCCL-backed byte classification and span planning on GPU
- `simdjson` integration for ~2-4x faster host parsing
- direct-to-device coordinate decode bypassing Python objects entirely
Current Shapefile numbers on this machine now clear the published ingest floors:
Point-heavy Shapefile at `10K` rows:

- `pyogrio.read_dataframe`: about `1.18M` rows/s
- `pyogrio.read_arrow` container parse: about `3.97M` rows/s
- repo-owned native WKB decode only: about `1.65M` rows/s
- full owned batch ingest: about `925K` rows/s
Point-heavy Shapefile at `100K` rows after the raw Arrow-binary point fast path:

- `pyogrio.read_dataframe`: about `1.25M` rows/s
- `pyogrio.read_arrow` container parse: about `4.41M` rows/s
- repo-owned native WKB decode only: about `1.56M` rows/s
- full owned batch ingest: about `4.10M` rows/s
Point-heavy Shapefile at `1M` rows:

- `pyogrio.read_dataframe`: about `1.12M` rows/s
- `pyogrio.read_arrow` container parse: about `4.48M` rows/s
- repo-owned native WKB decode only: about `1.43M` rows/s
- full owned batch ingest: about `4.08M` rows/s
Line-heavy Shapefile at `10K` rows:

- `pyogrio.read_dataframe`: about `1.10M` rows/s
- `pyogrio.read_arrow` container parse: about `3.32M` rows/s
- repo-owned native WKB decode only: about `611K` rows/s
- full owned batch ingest: about `3.20M` rows/s
Line-heavy Shapefile at `1M` rows:

- `pyogrio.read_dataframe`: about `1.02M` rows/s
- `pyogrio.read_arrow` container parse: about `3.73M` rows/s
- repo-owned native WKB decode only: about `617K` rows/s
- full owned batch ingest: about `3.40M` rows/s
Polygon-heavy Shapefile at `5K` rows:

- `pyogrio.read_dataframe`: about `858K` rows/s
- `pyogrio.read_arrow` container parse: about `2.37M` rows/s
- repo-owned native WKB decode only: about `502K` rows/s
- full owned batch ingest: about `2.29M` rows/s

Polygon-heavy Shapefile at `250K` rows:

- `pyogrio.read_dataframe`: about `893K` rows/s
- `pyogrio.read_arrow` container parse: about `2.51M` rows/s
- repo-owned native WKB decode only: about `452K` rows/s
- full owned batch ingest: about `2.24M` rows/s

The main change was shifting non-point families off the generic per-row Arrow WKB bridge and onto uniform raw-buffer fast paths. That turned the owned path from “points only” into a broad Shapefile ingest win:
- point-heavy ingest now runs about `3.63x` faster than the current host baseline at `1M` rows
- line-heavy ingest now runs about `3.32x` faster than the current host baseline at `1M` rows
- polygon-heavy ingest now runs about `2.51x` faster than the current host baseline at `250K` rows
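The headline multipliers can be cross-checked against the rounded throughput tables above. Note the ratios computed from rounded figures drift slightly from the in-text multipliers, which come from the unrounded measured runs.

```python
# (full owned batch ingest, pyogrio.read_dataframe host baseline), rows/s,
# taken from the rounded benchmark tables above.
pairs = {
    "point@1M":     (4.08e6, 1.12e6),
    "line@1M":      (3.40e6, 1.02e6),
    "polygon@250K": (2.24e6, 0.893e6),
}
for name, (owned, host) in pairs.items():
    print(f"{name}: {owned / host:.2f}x")
```

The rounded inputs reproduce each multiplier to within about one percent.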
The GPU byte-classification path (ADR-0038) now handles geometry extraction in 1.8s for the 2.16 GB Florida.geojson benchmark, and the staged property-object decode keeps the full public Florida read near 6.7s. The next acceleration step would be a native columnar property extractor, deferred because it changes the pure-Python build story.