vibespatial.io.gpu_parse¶
GPU text-parsing primitives for structured format readers.
This package provides composable, format-agnostic building blocks for GPU-accelerated parsing of structured text formats (GeoJSON, WKT, CSV, KML, etc.). Each primitive maps to one or more NVRTC kernels that operate on device-resident byte arrays.
Modules¶
- structural
Quote-state and bracket-depth computation.
- numeric
Number boundary detection and ASCII-to-number conversion.
- pattern
Byte-pattern matching and span detection.
Typical pipeline¶
A GPU text parser composes these primitives in sequence:
d_bytes = read_file_to_device(path)
# Stage 1: structural analysis
d_qp = quote_parity(d_bytes)
d_depth = bracket_depth(d_bytes, d_qp)
# Stage 2: locate structural markers
d_hits = pattern_match(d_bytes, b'"coordinates":', d_qp)
d_positions = cp.flatnonzero(d_hits).astype(cp.int64)
# Stage 3: define value spans
d_span_ends = span_boundaries(d_depth, d_positions, len(d_bytes),
skip_bytes=14)
d_mask = mark_spans(d_positions + 14, d_span_ends, len(d_bytes))
# Stage 4: extract numbers within spans
d_is_start, d_is_end = number_boundaries(d_bytes, d_qp)
d_starts, d_ends = extract_number_positions(d_is_start, d_is_end,
d_mask=d_mask)
d_values = parse_ascii_floats(d_bytes, d_starts, d_ends)
All operations run on the GPU with zero host materialization until
the caller explicitly requests results via .get() or
cp.asnumpy().
Submodules¶
- vibespatial.io.gpu_parse.indexing
- vibespatial.io.gpu_parse.indexing_kernels
- vibespatial.io.gpu_parse.numeric
- vibespatial.io.gpu_parse.numeric_kernels
- vibespatial.io.gpu_parse.pattern
- vibespatial.io.gpu_parse.pattern_kernels
- vibespatial.io.gpu_parse.properties
- vibespatial.io.gpu_parse.properties_kernels
- vibespatial.io.gpu_parse.structural
- vibespatial.io.gpu_parse.structural_kernels
- vibespatial.io.gpu_parse.transform
Functions¶
|
Convert boundary masks to compact position arrays. |
|
Identify start and end positions of numeric tokens. |
|
Parse ASCII numeric tokens to float64 values on GPU. |
|
Parse ASCII integer tokens to int64 values on GPU. |
|
Create a per-byte region mask from start/end position pairs. |
|
Find all occurrences of a byte pattern in the input. |
|
Find span end positions by scanning bracket depth. |
|
Compute per-byte nesting depth via delta kernel + prefix sum. |
|
Compute per-byte quote-parity mask via toggle + cumulative sum. |
Package Contents¶
- vibespatial.io.gpu_parse.extract_number_positions(d_is_start: cupy.ndarray, d_is_end: cupy.ndarray, d_mask: cupy.ndarray | None = None) tuple[cupy.ndarray, cupy.ndarray]¶
Convert boundary masks to compact position arrays.
Takes the per-byte start/end masks from
number_boundariesand produces compact int64 position arrays suitable forparse_ascii_floatsorparse_ascii_ints.Optionally filters by a region mask so that only numbers within specific spans (e.g., coordinate spans in GeoJSON, value columns in CSV) are included.
Parameters¶
- d_is_startcp.ndarray
Device-resident uint8 array, shape
(n_bytes,). Per-byte number-start indicators fromnumber_boundaries.- d_is_endcp.ndarray
Device-resident uint8 array, shape
(n_bytes,). Per-byte number-end indicators fromnumber_boundaries.- d_maskcp.ndarray or None, default None
Optional device-resident uint8 region mask, shape
(n_bytes,). If provided, only number boundaries whered_mask[i] == 1are included. WhenNone, all detected boundaries are returned.
Returns¶
- d_startscp.ndarray
Device-resident int64 array, shape
(n_numbers,). Byte offsets of the first character of each detected number (inclusive).- d_endscp.ndarray
Device-resident int64 array, shape
(n_numbers,). Byte offsets one past the last character of each detected number (exclusive). This is computed asflatnonzero(d_is_end) + 1so that the range[start, end)spans the full token.
Notes¶
When
d_maskis provided, the function computes element-wise multiplication of both boundary masks with the region mask before extracting positions. This avoids materializing filtered intermediate arrays.The returned arrays are always contiguous int64 arrays suitable for direct kernel parameter passing.
Examples¶
>>> # d_is_start marks positions [3, 7, 15] >>> # d_is_end marks positions [5, 10, 18] >>> # d_mask is 1 only in [0..12] >>> # Result: d_starts=[3, 7], d_ends=[6, 11]
- vibespatial.io.gpu_parse.number_boundaries(d_bytes: cupy.ndarray, d_quote_parity: cupy.ndarray) tuple[cupy.ndarray, cupy.ndarray]¶
Identify start and end positions of numeric tokens.
A per-byte GPU kernel classifies each byte position as either the start of a numeric token, the end of a numeric token, or neither. Bytes inside quoted strings (
d_quote_parity == 1) are always classified as neither.- Start detection
A byte is a number start if it is a numeric-initial character (
0-9,-,+) AND the preceding byte is a separator (,,[, space, tab, newline, carriage return).- End detection
A byte is a number end if it is a numeric character (
0-9,.,e,E,-,+) AND the following byte is a separator (,,], space, tab, newline, carriage return).
Parameters¶
- d_bytescp.ndarray
Device-resident uint8 array of raw file bytes, shape
(n,).- d_quote_paritycp.ndarray
Device-resident uint8 parity mask from
structural.quote_parity(), shape(n,).
Returns¶
- d_is_startcp.ndarray
Device-resident uint8 array, shape
(n,). Elementiis 1 if byteiis the first byte of a numeric token, else 0.- d_is_endcp.ndarray
Device-resident uint8 array, shape
(n,). Elementiis 1 if byteiis the last byte of a numeric token, else 0.
Notes¶
The returned arrays are byte-level masks, not position arrays. Use
extract_number_positionsto convert them to compact int64 position arrays suitable forparse_ascii_floats.The boundary heuristic is designed for JSON/CSV numeric formats. It handles:
Integers:
123,-42Decimals:
3.14,-0.001Scientific notation:
1.5e10,-2.3E-4Leading sign:
+1.0,-1.0
Examples¶
>>> # Input: [1.5, -2.3] >>> # ^ ^ ^ ^ (start, end pairs)
- vibespatial.io.gpu_parse.parse_ascii_floats(d_bytes: cupy.ndarray, d_starts: cupy.ndarray, d_ends: cupy.ndarray) cupy.ndarray¶
Parse ASCII numeric tokens to float64 values on GPU.
Each CUDA thread processes one token defined by the half-open byte range
[d_starts[i], d_ends[i]). The kernel implements a character-by-character state machine supporting:Optional leading sign (
+or-)Integer part (digits before decimal point)
Optional fractional part (
.followed by digits)Optional scientific notation exponent (
e/E, optional sign, digits)
Parameters¶
- d_bytescp.ndarray
Device-resident uint8 array of raw file bytes, shape
(n_bytes,).- d_startscp.ndarray
Device-resident int64 array, shape
(n_numbers,). Elementiis the byte offset of the first character of thei-th numeric token (inclusive).- d_endscp.ndarray
Device-resident int64 array, shape
(n_numbers,). Elementiis the byte offset one past the last character of thei-th numeric token (exclusive).
Returns¶
- cp.ndarray
Device-resident float64 array, shape
(n_numbers,). Each element is the parsed floating-point value. Invalid tokens produce0.0(not NaN) — callers should validate input boundaries.
Notes¶
The start/end convention is half-open:
[start, end). This matches the output ofextract_number_positions, where ends are already incremented by 1 from thed_is_endmask positions.The kernel does not handle
NaN,Infinity, or hexadecimal float literals. These are not valid in JSON or standard CSV.Examples¶
>>> # d_bytes contains b"[1.5, -2.3e4]" >>> # d_starts = [1, 6], d_ends = [4, 12] >>> # result = [1.5, -23000.0]
- vibespatial.io.gpu_parse.parse_ascii_ints(d_bytes: cupy.ndarray, d_starts: cupy.ndarray, d_ends: cupy.ndarray) cupy.ndarray¶
Parse ASCII integer tokens to int64 values on GPU.
Each CUDA thread processes one token defined by the half-open byte range
[d_starts[i], d_ends[i]). The kernel implements a simple character-by-character accumulator supporting:Optional leading sign (
+or-)Decimal digits (
0-9)
Fractional parts and exponent notation are not supported. If a non-digit character (other than a leading sign) is encountered, accumulation stops at that position.
Parameters¶
- d_bytescp.ndarray
Device-resident uint8 array of raw file bytes, shape
(n_bytes,).- d_startscp.ndarray
Device-resident int64 array, shape
(n_numbers,). Elementiis the byte offset of the first character of thei-th integer token (inclusive).- d_endscp.ndarray
Device-resident int64 array, shape
(n_numbers,). Elementiis the byte offset one past the last character of thei-th integer token (exclusive).
Returns¶
- cp.ndarray
Device-resident int64 array, shape
(n_numbers,). Each element is the parsed integer value. Tokens that contain no valid digits produce0. Overflow wraps silently (int64 range:-2^63to2^63 - 1).
Notes¶
This function does NOT exist in the current geojson_gpu.py pipeline. It is a new primitive for formats that contain integer fields (e.g., feature IDs in GeoJSON, integer attributes in CSV, SRID values in WKT).
The start/end convention is half-open:
[start, end), consistent withparse_ascii_floats.Examples¶
>>> # d_bytes contains b"SRID=4326;POINT(1 2)" >>> # d_starts = [5], d_ends = [9] >>> # result = [4326]
- vibespatial.io.gpu_parse.mark_spans(d_starts: cupy.ndarray, d_ends: cupy.ndarray, n_bytes: int) cupy.ndarray¶
Create a per-byte region mask from start/end position pairs.
For each
(d_starts[i], d_ends[i])pair, sets all bytes in the half-open range[d_starts[i], d_ends[i])to 1 in the output mask. All other positions are 0.This is used to create coordinate-span masks that filter number detection to only relevant regions of the file.
Parameters¶
- d_startscp.ndarray
Device-resident int64 array of span start positions, shape
(n_spans,). Each element is an inclusive byte offset.- d_endscp.ndarray
Device-resident int64 array of span end positions, shape
(n_spans,). Each element is an exclusive byte offset.- n_bytesint
Total number of bytes in the input. The output mask has this length.
Returns¶
- cp.ndarray
Device-resident uint8 array, shape
(n_bytes,). Elementiis 1 if byteifalls within any span, else 0. Overlapping spans are handled correctly (union semantics).
Notes¶
This is a generalization of the
mark_coord_spanskernel from the GeoJSON parser. That kernel reads start positions fromcoord_positionsand offsets them by 14 bytes (the length of"coordinates":). This function takes pre-computed start/end arrays directly.The kernel launches one thread per span (not per byte). Each thread writes 1 to all bytes in its span via a serial loop. For large numbers of short spans, this is efficient because the write pattern is coalesced within each span. For very large spans (>1M bytes each), a per-byte kernel with binary search over sorted starts would be more efficient, but in practice coordinate spans are small relative to file size.
Examples¶
>>> # d_starts = [10, 50], d_ends = [25, 60], n_bytes = 100 >>> # Result: 0s except positions [10..24] and [50..59] are 1
- vibespatial.io.gpu_parse.pattern_match(d_bytes: cupy.ndarray, pattern: bytes, d_quote_parity: cupy.ndarray | None = None, *, quote_check_offset: int = -1) cupy.ndarray¶
Find all occurrences of a byte pattern in the input.
A per-byte GPU kernel tests whether the substring starting at each position matches the given pattern. Optionally validates that the match is outside a quoted string by checking the quote parity at a specific offset within the pattern.
Parameters¶
- d_bytescp.ndarray
Device-resident uint8 array of raw file bytes, shape
(n,).- patternbytes
The byte pattern to search for. Must be non-empty. Maximum length 256 bytes. The pattern is compiled into the NVRTC kernel as a constant array for optimal access.
- d_quote_paritycp.ndarray or None, default None
Device-resident uint8 parity mask from
structural.quote_parity(), shape(n,). When provided, matches inside quoted strings are suppressed based on thequote_check_offsetparameter. WhenNone, no quote filtering is applied.- quote_check_offsetint, default -1
Byte offset within the pattern at which to check quote parity. A match is suppressed if
d_quote_parity[pos + quote_check_offset] != 0. A value of-1means: use the last byte of the pattern (len(pattern) - 1).For JSON key patterns like
"coordinates":, the check offset should point to the colon (last byte), because inside a real key the opening and closing quotes cancel to parity 0. Inside a string value, parity would be 1 (odd), suppressing the match.
Returns¶
- cp.ndarray
Device-resident uint8 array, shape
(n,). Elementiis 1 if the pattern matches starting at byte offseti, else 0. Positions where the pattern would extend past the end of the input are always 0.
Notes¶
This is a generalization of the
find_coord_keyandfind_type_keykernels from the GeoJSON parser. Those search for specific 14-byte and 7-byte patterns respectively. This function parameterizes the pattern and supports arbitrary lengths.The kernel is generated at runtime via NVRTC with the pattern bytes embedded as a compile-time constant. A kernel cache keyed on the pattern bytes avoids redundant compilations.
For multi-criteria matching (e.g., pattern match AND depth check), combine the output with depth-based filtering after the call:
hits = pattern_match(d_bytes, b'"type":', d_qp) # Further filter by depth hits = hits * (d_depth == 4).view(cp.uint8)
Examples¶
>>> # Input: {"coordinates": [1,2], "coord": 3} >>> # Pattern: b'"coordinates":' >>> # Result: 1 at position 1, 0 elsewhere
- vibespatial.io.gpu_parse.span_boundaries(d_depth: cupy.ndarray, d_starts: cupy.ndarray, n_bytes: int, *, skip_bytes: int = 0) cupy.ndarray¶
Find span end positions by scanning bracket depth.
For each start position, scans forward through the depth array to find the byte position where the nesting depth drops below the depth at the scan start. This identifies the end of a bracket-delimited span (e.g., the closing
]of a JSON"coordinates"array, the closing)of a WKT geometry).Parameters¶
- d_depthcp.ndarray
Device-resident int32 depth array from
structural.bracket_depth(), shape(n_bytes,).- d_startscp.ndarray
Device-resident int64 array of span start positions, shape
(n_spans,). Each position is the byte offset of the structural marker that begins the span (e.g., the first byte of"coordinates":).- n_bytesint
Total number of bytes in the input. Used as the upper bound for forward scanning.
- skip_bytesint, default 0
Number of bytes to skip past each start position before beginning the depth scan. For example, when starting from the position of
"coordinates":,skip_bytes=14skips past the key pattern to reach the opening bracket of the value.The scan then advances through any whitespace to find the first bracket, records its depth, and continues until depth drops below that level.
Returns¶
- cp.ndarray
Device-resident int64 array, shape
(n_spans,). Elementiis the byte offset one past the closing bracket of thei-th span. If the closing bracket is not found beforen_bytes, the value isn_bytes.
Notes¶
This is a generalization of the
coord_span_endkernel from the GeoJSON parser. That kernel hard-codesskip_bytes=14for the"coordinates":pattern length.The scan algorithm:
Start at
d_starts[i] + skip_bytesSkip forward while depth does not change (whitespace between key and opening bracket)
Record
start_depth = d_depth[pos]at the opening bracketScan forward while
d_depth[pos] >= start_depthReturn
pos(one past the closing bracket)
Examples¶
>>> # Input: "coordinates": [[1,2],[3,4]] >>> # ^pos=0 ^depth=5 ^end >>> # d_starts = [0], skip_bytes = 14 >>> # Result: [end_position]
- vibespatial.io.gpu_parse.bracket_depth(d_bytes: cupy.ndarray, d_quote_parity: cupy.ndarray, *, open_chars: str = '{[', close_chars: str = '}]') cupy.ndarray¶
Compute per-byte nesting depth via delta kernel + prefix sum.
Produces an int32 array where each position holds the cumulative bracket depth at that byte offset. The algorithm is:
A per-byte kernel emits
+1for open-bracket characters,-1for close-bracket characters, and0for all other bytes. Brackets inside quoted strings (whered_quote_parityis 1) are treated as0.An
int32cumulative sum over the delta array yields the running depth at each byte position.
The open/close character sets are parameterizable so that the same primitive works across formats:
JSON:
open_chars="{[", close_chars="}]"WKT:
open_chars="(", close_chars=")"XML:
open_chars="<", close_chars=">"
Parameters¶
- d_bytescp.ndarray
Device-resident uint8 array of raw file bytes, shape
(n,).- d_quote_paritycp.ndarray
Device-resident uint8 parity mask from
quote_parity(), shape(n,). Positions with parity 1 (inside string) are excluded from depth computation.- open_charsstr, default
"{[" Characters that increment depth. Each character is treated independently. Maximum 8 characters.
- close_charsstr, default
"}]" Characters that decrement depth. Must have the same length as
open_chars. Each character is treated independently.
Returns¶
- cp.ndarray
Device-resident int32 array of shape
(n,). Elementiholds the cumulative nesting depth at byte offseti. The depth is inclusive: at an opening bracket, the depth already includes the+1delta from that bracket. At a closing bracket, the depth includes the-1delta.
Raises¶
- ValueError
If
len(open_chars) != len(close_chars)or either exceeds 8 characters.
Notes¶
The intermediate delta array uses
int8dtype (1 byte per position) to minimize memory before the cumsum materializes theint32depth array.For JSON documents, the depth structure follows:
Depth 0: outside the root object
Depth 1: inside
FeatureCollection { }Depth 2: inside
"features": [ ]Depth 3: inside each
Feature { }Depth 4+: nested geometry objects and coordinate arrays
Examples¶
>>> # Input: {"a": [1, 2]} >>> # Depth: 1 1 1 1 1 2 2 2 2 1 0
- vibespatial.io.gpu_parse.quote_parity(d_bytes: cupy.ndarray) cupy.ndarray¶
Compute per-byte quote-parity mask via toggle + cumulative sum.
Marks each byte position as inside (1) or outside (0) a quoted string literal. The algorithm is:
A per-byte kernel emits 1 at unescaped quote characters (
"), 0 elsewhere. Escaped quotes (preceded by an odd number of backslashes) emit 0.A
uint8cumulative sum over the toggle array yields a monotonically increasing counter. A bitwise AND with 1 extracts the low bit, producing the parity: 0 = outside string, 1 = inside string.
Using
uint8cumsum instead ofint32saves 4x memory (2.16 GB vs 8.64 GB for a 2 GB input file). Parity remains correct after uint8 overflow because 256 is even.Parameters¶
- d_bytescp.ndarray
Device-resident uint8 array of raw file bytes, shape
(n,).
Returns¶
- cp.ndarray
Device-resident uint8 array of shape
(n,). Each element is 0 (outside quoted string) or 1 (inside quoted string).
Notes¶
The quote character is always ASCII
"(0x22). This primitive does not support single-quoted strings. Backslash-escaped quotes (\") are handled correctly by counting consecutive preceding backslashes: a quote preceded by an odd number of backslashes is escaped and does not toggle parity.This is the first stage in any GPU text-parsing pipeline. The resulting parity mask is consumed by
bracket_depth,number_boundaries, andpattern_matchto filter out bytes that appear inside string literals.Examples¶
>>> # Input: {"key": "val"} >>> # Bytes: { " k e y " : " v a l " } >>> # Parity: 0 0 1 1 1 0 0 0 0 1 1 1 0 0