# ADR-0038: GPU Byte-Classification GeoJSON Parser

## Context
Reading large GeoJSON files is the dominant bottleneck in vibeSpatial’s end-to-end workflow. The Florida.geojson benchmark (2.16 GB, 7.2M polygons) takes 57.7s via pyogrio, which uses CPU-bound JSON parsing internally. For GPU-first spatial analytics, this I/O cost dwarfs all subsequent GPU operations (reproject 0.4s, spatial query 0.5s).
A POC (examples/poc_gpu_geojson.py) demonstrated that GPU byte
classification + ASCII-to-fp64 parsing can extract coordinates in ~1.7s
(34x faster than pyogrio). This ADR covers wiring that approach into
vibeSpatial’s I/O architecture.
## Design constraints

- **Geometry on GPU, properties on CPU** — vibeSpatial's GPU memory policy reserves device memory for geometry operations. Property data (strings, mixed types) stays on the host.
- **Point, LineString, and Polygon** — supports homogeneous and mixed files of these three types. Multi-geometry types (MultiPoint, MultiLineString, MultiPolygon) are deferred.
- **No chunking** — files must fit in GPU memory (~3x file size peak). Chunked processing for files exceeding GPU memory is deferred to v2.
## Decision

### Hybrid GPU/CPU pipeline
The parser uses 12 NVRTC kernels (all Tier 1 per ADR-0033) for geometry extraction, with CPU-side orjson for property extraction:
**GPU pipeline** (1.8s for the 2.16 GB polygon file):

```text
S0    kvikio / cp.asarray                             [file → device]
S1b   quote_toggle → uint8 cumsum → parity            [string awareness]
S2    compute_depth_deltas → int32 cumsum → depth     [structural depth]
S3    find_coord_key → flatnonzero → positions        [pattern match]
S3.5  find_type_key + classify_type_value → tags      [geometry type detection]
S3b   coord_span_end                                  [per-feature depth scan]
S3c   count_rings_and_coords + scatter_ring_offsets   [GeoArrow offsets]
S4    find_number_boundaries + mark_coord_spans       [coord-only numbers]
S5    parse_ascii_floats → d_coords                   [ASCII → fp64]
S6    x = d_coords[0::2], y = d_coords[1::2]          [zero-copy views]
S7    family-aware assembly (homogeneous or mixed)    [OwnedGeometryArray]
S8    find_feature_boundaries → D→H copy              [for CPU properties]
```
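The S6 step relies on strided views rather than new allocations. A minimal sketch, assuming `d_coords` stands in for the flat interleaved fp64 buffer produced by S5:

```python
import cupy as cp

# Sketch of S6: split the interleaved [x0, y0, x1, y1, ...] buffer into
# x and y without copying. `d_coords` is a stand-in for the fp64 output
# of the S5 parse_ascii_floats kernel.
d_coords = cp.asarray([101.5, 27.3, 101.6, 27.4, 101.7, 27.5], dtype=cp.float64)

x = d_coords[0::2]  # strided view over even indices (x values)
y = d_coords[1::2]  # strided view over odd indices (y values)

# Both views alias the parent buffer, so no extra device memory is allocated.
assert x.base is d_coords and y.base is d_coords
```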
**CPU property extraction** (9.2s, lazy):

For each feature: slice `host_bytes` → `orjson.loads` → extract `"properties"`.
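A minimal sketch of this CPU-side step, assuming the S8 device-to-host copy yields per-feature byte offsets (`extract_properties`, `feature_starts`, and `feature_ends` are illustrative names, not the parser's actual API):

```python
import orjson

def extract_properties(host_bytes: bytes, feature_starts, feature_ends):
    """Decode the "properties" object of each feature on the CPU.

    feature_starts / feature_ends are illustrative names for the
    per-feature byte offsets copied device-to-host in S8.
    """
    props = []
    for start, end in zip(feature_starts, feature_ends):
        feature = orjson.loads(host_bytes[start:end])  # one small dict per feature
        props.append(feature.get("properties", {}))
    return props
```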
### Memory optimization
The initial implementation hit OOM on a 24 GB GPU because cp.cumsum
of 2.16 billion int32 values requires 8.64 GB. Two optimizations
resolved this:
1. **uint8 parity** — Quote state only needs even/odd (0/1), not the full cumsum value. `cp.cumsum(toggle, dtype=cp.uint8) & 1` uses 2.16 GB instead of 8.64 GB. Parity is correct after uint8 overflow because 256 is even.
2. **Fused depth delta kernel** — A single `compute_depth_deltas` kernel takes raw bytes + quote parity and outputs int8 deltas directly, avoiding materialization of `d_classes` (2.16 GB), `outside_string` (2.16 GB), and boolean intermediates from `cp.where`.
Peak GPU memory is ~3x file size (~6.5 GB for 2.16 GB file).
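A minimal sketch of the uint8 parity trick, with `raw` standing in for the file bytes (escaped-quote handling omitted for brevity):

```python
import cupy as cp
import numpy as np

# `raw` stands in for the file bytes; the real pipeline stages them on the
# device via kvikio / cp.asarray (S0).
raw = b'{"name": "a [b] c", "coordinates": [1.0, 2.0]}'
data = cp.asarray(np.frombuffer(raw, dtype=np.uint8))

# 1 at every double-quote byte, 0 elsewhere.
toggle = (data == ord('"')).astype(cp.uint8)

# Inclusive prefix sum in uint8: the running count wraps at 256, which is
# even, so (cumsum & 1) is still the correct inside/outside-string parity
# while allocating 1 byte per input byte instead of 4.
inside_string = cp.cumsum(toggle, dtype=cp.uint8) & 1
```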
### Integration points

- `io_geojson.py` — New `"gpu-byte-classify"` strategy in `plan_geojson_ingest()`, routed in `read_geojson_owned()`.
- `io_file.py` — `_try_gpu_read_file()` auto-routes GeoJSON files to the GPU path for eligible unfiltered public reads whenever a CUDA runtime is available. Filtered/container-shaped requests stay on the explicit compatibility boundary.
- NVRTC warmup — All parser kernels are registered via `request_nvrtc_warmup()` per ADR-0034. First-run compilation adds ~12s; subsequent runs are cached.
### Depth semantics
CuPy’s cumsum produces an inclusive prefix sum, so depth values at
brackets include the bracket’s own delta:
- At an opening `[`: depth = parent_depth + 1
- At a closing `]`: depth = parent_depth (the opening's +1 and this bracket's own -1 have both been applied)

This means a ring-closing `]` has depth == coord_depth (not
coord_depth + 1) and a pair-closing `]` has depth == coord_depth + 1.
Feature boundaries are at depth 3 (open) / depth 2 (close), not 2/1,
because FeatureCollection adds an extra nesting level.
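A small worked example of this convention (NumPy keeps it runnable without a GPU; CuPy's cumsum has the same inclusive semantics). Here coord_depth is taken to be the depth at the coordinates array's outermost `[`:

```python
import numpy as np

# Bare Polygon coordinates fragment; depth is 0 before the first bracket.
frag = np.frombuffer(b"[[[1.0,2.0],[3.0,4.0]]]", dtype=np.uint8)

deltas = np.zeros(frag.size, dtype=np.int8)
deltas[frag == ord("[")] = 1
deltas[frag == ord("]")] = -1

# Inclusive prefix sum: each bracket's own delta is already applied.
depth = np.cumsum(deltas, dtype=np.int32)

# Opening brackets land at depth 1 (coord_depth), 2 (ring), 3 (pair).
# Closing brackets: pair ']' -> 2 == coord_depth + 1,
#                   ring ']' -> 1 == coord_depth,
#                   outer ']' -> 0.
for ch, d in zip(frag.tobytes().decode(), depth):
    if ch in "[]":
        print(ch, int(d))
```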
## Consequences

### Measured performance (RTX 4090, i9-13900K)
| Step | GeoPandas | vibeSpatial | Speedup |
|---|---|---|---|
| Read GeoJSON | 57.7s | 11.7s | 4.9x |
| Reproject to UTM | 8.2s | 0.4s | 21x |
| Select within 1km | 0.2s | 0.5s | — |
| Write GeoParquet | 0.2s | 0.1s | 2x |
| End-to-end | 66.3s | 12.7s | 5.2x |
GPU geometry parse: 1.8s (32x vs pyogrio). CPU property extraction: 9.2s (lazy, only when accessed).
April 20, 2026 update: the public read_file(...) route now selects the GPU
byte-classify path for eligible unfiltered GeoJSON whenever CUDA is available,
and staged property-object decode brings the local Florida public read to
6.7s. The original table above is retained as the first accepted ADR result
for this parser family.
### Property extraction is the remaining bottleneck
The 9.2s CPU property loop is 7.2M calls to orjson.loads() on ~35-byte
property objects. The time is dominated by Python interpreter overhead
(function call dispatch, dict construction, list.append), not JSON
parsing throughput.
Strategies evaluated and rejected:
| Strategy | Time | Why it doesn't help |
|---|---|---|
| ThreadPool orjson (8 workers) | 10.0s | GIL held during Python object construction |
| Multiprocessing (8 workers) | 12.8s | 2.16 GB of raw bytes copied to each worker via IPC |
| Bulk orjson (single parse) | 17.9s | 1 GB FeatureCollection → 7.2M Python dicts = catastrophic |
| pylibcudf batched | 8.6s | CPU loop to extract + re-serialize property substrings dominates |
| Strip coords + bulk parse | >14 min | Coordinates are only 54% of the file; the 1 GB remainder is still too large |
Key finding: For this dataset, coordinate bytes are 54% of the file
(not ~90% as initially estimated). Structural JSON overhead ("type":,
"Feature", "geometry":, braces, commas) accounts for 34%. Actual
property data is only ~12% of file bytes.
### Deferred: native property parser
A Rust/C extension that walks feature bytes and extracts property values directly into columnar arrays (bypassing Python dict construction) could plausibly achieve ~1s for property extraction. This is deferred because:
- **Pure Python policy** — vibeSpatial's current codebase is pure Python + CUDA kernels (via NVRTC strings). Adding a compiled Rust/C extension changes the build story and CI matrix.
- **Diminishing returns** — The 9.2s property cost only matters when properties are accessed. Geometry-only workflows (the primary GPU target) see only the 1.8s read time.
- **Lazy evaluation** — Properties are loaded via a closure; the cost is deferred until `batch.properties` or `GeoDataFrame` construction (see the sketch below). Workflows that filter first (spatial query → subset → access properties) parse far fewer features.
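A minimal sketch of that lazy-closure pattern; all names (`make_lazy_properties`, `feature_starts`, `feature_ends`) are illustrative rather than the parser's actual API:

```python
import orjson

def make_lazy_properties(host_bytes, feature_starts, feature_ends):
    """Return a closure over the host bytes; nothing is parsed until it is called."""
    def load_properties(indices=None):
        # Geometry-only workflows never call this, so they never pay the CPU cost.
        idx = range(len(feature_starts)) if indices is None else indices
        return [
            orjson.loads(host_bytes[feature_starts[i]:feature_ends[i]]).get("properties", {})
            for i in idx
        ]
    return load_properties

# Filter-first workflow: only the spatially selected features are decoded.
# load_properties = make_lazy_properties(host_bytes, starts, ends)
# props = load_properties(selected_indices)  # indices from a prior spatial query
```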
If property extraction becomes a measured bottleneck in production
workflows, the recommended path is a purpose-built columnar JSON
property extractor — either as a Rust pyo3 extension or as additional
NVRTC kernels that output property strings into a device buffer for
pylibcudf.io.json.read_json_from_string_column.
## Alternatives Considered

### pylibcudf get_json_object on full file
Using plc.json.get_json_object(file_column, "$.features[*].properties")
to extract all properties on GPU. Rejected because it requires the
entire 2.16 GB file as a single GPU string column, and the JSONPath
evaluation on 7.2M features causes OOM even on a 24 GB card.
### cuDF read_json on the full file
cuDF’s JSON reader could parse the entire FeatureCollection on GPU, producing a columnar table with both geometry and property columns. Rejected because: (a) it would parse coordinates redundantly (we already extract them faster with our kernels), (b) GeoJSON’s nested coordinate arrays don’t map cleanly to cuDF’s flat column model, and (c) cuDF is a much heavier dependency than pylibcudf.
### Host-only fast path (simdjson / orjson bulk)
Keep everything on CPU but use SIMD-accelerated JSON parsing. Already benchmarked: simdjson and orjson bulk parse are slower than the per-feature loop for this workload shape (many small features).
## References

- POC: `examples/poc_gpu_geojson.py`
- Property extraction POC: `examples/poc_property_extraction.py`
- Geometry stripping POC: `examples/poc_strip_geometry.py`
- Implementation: `src/vibespatial/io/geojson_gpu.py`
- Tests: `tests/test_geojson_gpu.py`
- ADR-0002: Precision policy (fp64 for I/O)
- ADR-0033: Kernel tier classification
- ADR-0034: NVRTC warmup