vibespatial.io.gpu_parse

GPU text-parsing primitives for structured format readers.

This package provides composable, format-agnostic building blocks for GPU-accelerated parsing of structured text formats (GeoJSON, WKT, CSV, KML, etc.). Each primitive maps to one or more NVRTC kernels that operate on device-resident byte arrays.

Modules

structural

Quote-state and bracket-depth computation.

numeric

Number boundary detection and ASCII-to-number conversion.

pattern

Byte-pattern matching and span detection.

Typical pipeline

A GPU text parser composes these primitives in sequence:

d_bytes = read_file_to_device(path)

# Stage 1: structural analysis
d_qp = quote_parity(d_bytes)
d_depth = bracket_depth(d_bytes, d_qp)

# Stage 2: locate structural markers
d_hits = pattern_match(d_bytes, b'"coordinates":', d_qp)
d_positions = cp.flatnonzero(d_hits).astype(cp.int64)

# Stage 3: define value spans
d_span_ends = span_boundaries(d_depth, d_positions, len(d_bytes),
                              skip_bytes=14)
d_mask = mark_spans(d_positions + 14, d_span_ends, len(d_bytes))

# Stage 4: extract numbers within spans
d_is_start, d_is_end = number_boundaries(d_bytes, d_qp)
d_starts, d_ends = extract_number_positions(d_is_start, d_is_end,
                                             d_mask=d_mask)
d_values = parse_ascii_floats(d_bytes, d_starts, d_ends)

All operations run on the GPU with zero host materialization until the caller explicitly requests results via .get() or cp.asnumpy().

Submodules

Functions

extract_number_positions(→ tuple[cupy.ndarray, ...)

Convert boundary masks to compact position arrays.

number_boundaries(→ tuple[cupy.ndarray, cupy.ndarray])

Identify start and end positions of numeric tokens.

parse_ascii_floats(→ cupy.ndarray)

Parse ASCII numeric tokens to float64 values on GPU.

parse_ascii_ints(→ cupy.ndarray)

Parse ASCII integer tokens to int64 values on GPU.

mark_spans(→ cupy.ndarray)

Create a per-byte region mask from start/end position pairs.

pattern_match(→ cupy.ndarray)

Find all occurrences of a byte pattern in the input.

span_boundaries(→ cupy.ndarray)

Find span end positions by scanning bracket depth.

bracket_depth(→ cupy.ndarray)

Compute per-byte nesting depth via delta kernel + prefix sum.

quote_parity(→ cupy.ndarray)

Compute per-byte quote-parity mask via toggle + cumulative sum.

Package Contents

vibespatial.io.gpu_parse.extract_number_positions(d_is_start: cupy.ndarray, d_is_end: cupy.ndarray, d_mask: cupy.ndarray | None = None) tuple[cupy.ndarray, cupy.ndarray]

Convert boundary masks to compact position arrays.

Takes the per-byte start/end masks from number_boundaries and produces compact int64 position arrays suitable for parse_ascii_floats or parse_ascii_ints.

Optionally filters by a region mask so that only numbers within specific spans (e.g., coordinate spans in GeoJSON, value columns in CSV) are included.

Parameters

d_is_startcp.ndarray

Device-resident uint8 array, shape (n_bytes,). Per-byte number-start indicators from number_boundaries.

d_is_endcp.ndarray

Device-resident uint8 array, shape (n_bytes,). Per-byte number-end indicators from number_boundaries.

d_maskcp.ndarray or None, default None

Optional device-resident uint8 region mask, shape (n_bytes,). If provided, only number boundaries where d_mask[i] == 1 are included. When None, all detected boundaries are returned.

Returns

d_startscp.ndarray

Device-resident int64 array, shape (n_numbers,). Byte offsets of the first character of each detected number (inclusive).

d_endscp.ndarray

Device-resident int64 array, shape (n_numbers,). Byte offsets one past the last character of each detected number (exclusive). This is computed as flatnonzero(d_is_end) + 1 so that the range [start, end) spans the full token.

Notes

When d_mask is provided, the function computes element-wise multiplication of both boundary masks with the region mask before extracting positions. This avoids materializing filtered intermediate arrays.

The returned arrays are always contiguous int64 arrays suitable for direct kernel parameter passing.

Examples

>>> # d_is_start marks positions [3, 7, 15]
>>> # d_is_end marks positions [5, 10, 18]
>>> # d_mask is 1 only in [0..12]
>>> # Result: d_starts=[3, 7], d_ends=[6, 11]
vibespatial.io.gpu_parse.number_boundaries(d_bytes: cupy.ndarray, d_quote_parity: cupy.ndarray) tuple[cupy.ndarray, cupy.ndarray]

Identify start and end positions of numeric tokens.

A per-byte GPU kernel classifies each byte position as either the start of a numeric token, the end of a numeric token, or neither. Bytes inside quoted strings (d_quote_parity == 1) are always classified as neither.

Start detection

A byte is a number start if it is a numeric-initial character (0-9, -, +) AND the preceding byte is a separator (,, [, space, tab, newline, carriage return).

End detection

A byte is a number end if it is a numeric character (0-9, ., e, E, -, +) AND the following byte is a separator (,, ], space, tab, newline, carriage return).

Parameters

d_bytescp.ndarray

Device-resident uint8 array of raw file bytes, shape (n,).

d_quote_paritycp.ndarray

Device-resident uint8 parity mask from structural.quote_parity(), shape (n,).

Returns

d_is_startcp.ndarray

Device-resident uint8 array, shape (n,). Element i is 1 if byte i is the first byte of a numeric token, else 0.

d_is_endcp.ndarray

Device-resident uint8 array, shape (n,). Element i is 1 if byte i is the last byte of a numeric token, else 0.

Notes

The returned arrays are byte-level masks, not position arrays. Use extract_number_positions to convert them to compact int64 position arrays suitable for parse_ascii_floats.

The boundary heuristic is designed for JSON/CSV numeric formats. It handles:

  • Integers: 123, -42

  • Decimals: 3.14, -0.001

  • Scientific notation: 1.5e10, -2.3E-4

  • Leading sign: +1.0, -1.0

Examples

>>> # Input: [1.5, -2.3]
>>> #         ^ ^   ^ ^   (start, end pairs)
vibespatial.io.gpu_parse.parse_ascii_floats(d_bytes: cupy.ndarray, d_starts: cupy.ndarray, d_ends: cupy.ndarray) cupy.ndarray

Parse ASCII numeric tokens to float64 values on GPU.

Each CUDA thread processes one token defined by the half-open byte range [d_starts[i], d_ends[i]). The kernel implements a character-by-character state machine supporting:

  • Optional leading sign (+ or -)

  • Integer part (digits before decimal point)

  • Optional fractional part (. followed by digits)

  • Optional scientific notation exponent (e/E, optional sign, digits)

Parameters

d_bytescp.ndarray

Device-resident uint8 array of raw file bytes, shape (n_bytes,).

d_startscp.ndarray

Device-resident int64 array, shape (n_numbers,). Element i is the byte offset of the first character of the i-th numeric token (inclusive).

d_endscp.ndarray

Device-resident int64 array, shape (n_numbers,). Element i is the byte offset one past the last character of the i-th numeric token (exclusive).

Returns

cp.ndarray

Device-resident float64 array, shape (n_numbers,). Each element is the parsed floating-point value. Invalid tokens produce 0.0 (not NaN) — callers should validate input boundaries.

Notes

The start/end convention is half-open: [start, end). This matches the output of extract_number_positions, where ends are already incremented by 1 from the d_is_end mask positions.

The kernel does not handle NaN, Infinity, or hexadecimal float literals. These are not valid in JSON or standard CSV.

Examples

>>> # d_bytes contains b"[1.5, -2.3e4]"
>>> # d_starts = [1, 6],  d_ends = [4, 12]
>>> # result = [1.5, -23000.0]
vibespatial.io.gpu_parse.parse_ascii_ints(d_bytes: cupy.ndarray, d_starts: cupy.ndarray, d_ends: cupy.ndarray) cupy.ndarray

Parse ASCII integer tokens to int64 values on GPU.

Each CUDA thread processes one token defined by the half-open byte range [d_starts[i], d_ends[i]). The kernel implements a simple character-by-character accumulator supporting:

  • Optional leading sign (+ or -)

  • Decimal digits (0-9)

Fractional parts and exponent notation are not supported. If a non-digit character (other than a leading sign) is encountered, accumulation stops at that position.

Parameters

d_bytescp.ndarray

Device-resident uint8 array of raw file bytes, shape (n_bytes,).

d_startscp.ndarray

Device-resident int64 array, shape (n_numbers,). Element i is the byte offset of the first character of the i-th integer token (inclusive).

d_endscp.ndarray

Device-resident int64 array, shape (n_numbers,). Element i is the byte offset one past the last character of the i-th integer token (exclusive).

Returns

cp.ndarray

Device-resident int64 array, shape (n_numbers,). Each element is the parsed integer value. Tokens that contain no valid digits produce 0. Overflow wraps silently (int64 range: -2^63 to 2^63 - 1).

Notes

This function does NOT exist in the current geojson_gpu.py pipeline. It is a new primitive for formats that contain integer fields (e.g., feature IDs in GeoJSON, integer attributes in CSV, SRID values in WKT).

The start/end convention is half-open: [start, end), consistent with parse_ascii_floats.

Examples

>>> # d_bytes contains b"SRID=4326;POINT(1 2)"
>>> # d_starts = [5],  d_ends = [9]
>>> # result = [4326]
vibespatial.io.gpu_parse.mark_spans(d_starts: cupy.ndarray, d_ends: cupy.ndarray, n_bytes: int) cupy.ndarray

Create a per-byte region mask from start/end position pairs.

For each (d_starts[i], d_ends[i]) pair, sets all bytes in the half-open range [d_starts[i], d_ends[i]) to 1 in the output mask. All other positions are 0.

This is used to create coordinate-span masks that filter number detection to only relevant regions of the file.

Parameters

d_startscp.ndarray

Device-resident int64 array of span start positions, shape (n_spans,). Each element is an inclusive byte offset.

d_endscp.ndarray

Device-resident int64 array of span end positions, shape (n_spans,). Each element is an exclusive byte offset.

n_bytesint

Total number of bytes in the input. The output mask has this length.

Returns

cp.ndarray

Device-resident uint8 array, shape (n_bytes,). Element i is 1 if byte i falls within any span, else 0. Overlapping spans are handled correctly (union semantics).

Notes

This is a generalization of the mark_coord_spans kernel from the GeoJSON parser. That kernel reads start positions from coord_positions and offsets them by 14 bytes (the length of "coordinates":). This function takes pre-computed start/end arrays directly.

The kernel launches one thread per span (not per byte). Each thread writes 1 to all bytes in its span via a serial loop. For large numbers of short spans, this is efficient because the write pattern is coalesced within each span. For very large spans (>1M bytes each), a per-byte kernel with binary search over sorted starts would be more efficient, but in practice coordinate spans are small relative to file size.

Examples

>>> # d_starts = [10, 50], d_ends = [25, 60], n_bytes = 100
>>> # Result: 0s except positions [10..24] and [50..59] are 1
vibespatial.io.gpu_parse.pattern_match(d_bytes: cupy.ndarray, pattern: bytes, d_quote_parity: cupy.ndarray | None = None, *, quote_check_offset: int = -1) cupy.ndarray

Find all occurrences of a byte pattern in the input.

A per-byte GPU kernel tests whether the substring starting at each position matches the given pattern. Optionally validates that the match is outside a quoted string by checking the quote parity at a specific offset within the pattern.

Parameters

d_bytescp.ndarray

Device-resident uint8 array of raw file bytes, shape (n,).

patternbytes

The byte pattern to search for. Must be non-empty. Maximum length 256 bytes. The pattern is compiled into the NVRTC kernel as a constant array for optimal access.

d_quote_paritycp.ndarray or None, default None

Device-resident uint8 parity mask from structural.quote_parity(), shape (n,). When provided, matches inside quoted strings are suppressed based on the quote_check_offset parameter. When None, no quote filtering is applied.

quote_check_offsetint, default -1

Byte offset within the pattern at which to check quote parity. A match is suppressed if d_quote_parity[pos + quote_check_offset] != 0. A value of -1 means: use the last byte of the pattern (len(pattern) - 1).

For JSON key patterns like "coordinates":, the check offset should point to the colon (last byte), because inside a real key the opening and closing quotes cancel to parity 0. Inside a string value, parity would be 1 (odd), suppressing the match.

Returns

cp.ndarray

Device-resident uint8 array, shape (n,). Element i is 1 if the pattern matches starting at byte offset i, else 0. Positions where the pattern would extend past the end of the input are always 0.

Notes

This is a generalization of the find_coord_key and find_type_key kernels from the GeoJSON parser. Those search for specific 14-byte and 7-byte patterns respectively. This function parameterizes the pattern and supports arbitrary lengths.

The kernel is generated at runtime via NVRTC with the pattern bytes embedded as a compile-time constant. A kernel cache keyed on the pattern bytes avoids redundant compilations.

For multi-criteria matching (e.g., pattern match AND depth check), combine the output with depth-based filtering after the call:

hits = pattern_match(d_bytes, b'"type":', d_qp)
# Further filter by depth
hits = hits * (d_depth == 4).view(cp.uint8)

Examples

>>> # Input: {"coordinates": [1,2], "coord": 3}
>>> # Pattern: b'"coordinates":'
>>> # Result: 1 at position 1, 0 elsewhere
vibespatial.io.gpu_parse.span_boundaries(d_depth: cupy.ndarray, d_starts: cupy.ndarray, n_bytes: int, *, skip_bytes: int = 0) cupy.ndarray

Find span end positions by scanning bracket depth.

For each start position, scans forward through the depth array to find the byte position where the nesting depth drops below the depth at the scan start. This identifies the end of a bracket-delimited span (e.g., the closing ] of a JSON "coordinates" array, the closing ) of a WKT geometry).

Parameters

d_depthcp.ndarray

Device-resident int32 depth array from structural.bracket_depth(), shape (n_bytes,).

d_startscp.ndarray

Device-resident int64 array of span start positions, shape (n_spans,). Each position is the byte offset of the structural marker that begins the span (e.g., the first byte of "coordinates":).

n_bytesint

Total number of bytes in the input. Used as the upper bound for forward scanning.

skip_bytesint, default 0

Number of bytes to skip past each start position before beginning the depth scan. For example, when starting from the position of "coordinates":, skip_bytes=14 skips past the key pattern to reach the opening bracket of the value.

The scan then advances through any whitespace to find the first bracket, records its depth, and continues until depth drops below that level.

Returns

cp.ndarray

Device-resident int64 array, shape (n_spans,). Element i is the byte offset one past the closing bracket of the i-th span. If the closing bracket is not found before n_bytes, the value is n_bytes.

Notes

This is a generalization of the coord_span_end kernel from the GeoJSON parser. That kernel hard-codes skip_bytes=14 for the "coordinates": pattern length.

The scan algorithm:

  1. Start at d_starts[i] + skip_bytes

  2. Skip forward while depth does not change (whitespace between key and opening bracket)

  3. Record start_depth = d_depth[pos] at the opening bracket

  4. Scan forward while d_depth[pos] >= start_depth

  5. Return pos (one past the closing bracket)

Examples

>>> # Input: "coordinates": [[1,2],[3,4]]
>>> #        ^pos=0          ^depth=5      ^end
>>> # d_starts = [0], skip_bytes = 14
>>> # Result: [end_position]
vibespatial.io.gpu_parse.bracket_depth(d_bytes: cupy.ndarray, d_quote_parity: cupy.ndarray, *, open_chars: str = '{[', close_chars: str = '}]') cupy.ndarray

Compute per-byte nesting depth via delta kernel + prefix sum.

Produces an int32 array where each position holds the cumulative bracket depth at that byte offset. The algorithm is:

  1. A per-byte kernel emits +1 for open-bracket characters, -1 for close-bracket characters, and 0 for all other bytes. Brackets inside quoted strings (where d_quote_parity is 1) are treated as 0.

  2. An int32 cumulative sum over the delta array yields the running depth at each byte position.

The open/close character sets are parameterizable so that the same primitive works across formats:

  • JSON: open_chars="{[", close_chars="}]"

  • WKT: open_chars="(", close_chars=")"

  • XML: open_chars="<", close_chars=">"

Parameters

d_bytescp.ndarray

Device-resident uint8 array of raw file bytes, shape (n,).

d_quote_paritycp.ndarray

Device-resident uint8 parity mask from quote_parity(), shape (n,). Positions with parity 1 (inside string) are excluded from depth computation.

open_charsstr, default "{["

Characters that increment depth. Each character is treated independently. Maximum 8 characters.

close_charsstr, default "}]"

Characters that decrement depth. Must have the same length as open_chars. Each character is treated independently.

Returns

cp.ndarray

Device-resident int32 array of shape (n,). Element i holds the cumulative nesting depth at byte offset i. The depth is inclusive: at an opening bracket, the depth already includes the +1 delta from that bracket. At a closing bracket, the depth includes the -1 delta.

Raises

ValueError

If len(open_chars) != len(close_chars) or either exceeds 8 characters.

Notes

The intermediate delta array uses int8 dtype (1 byte per position) to minimize memory before the cumsum materializes the int32 depth array.

For JSON documents, the depth structure follows:

  • Depth 0: outside the root object

  • Depth 1: inside FeatureCollection { }

  • Depth 2: inside "features": [ ]

  • Depth 3: inside each Feature { }

  • Depth 4+: nested geometry objects and coordinate arrays

Examples

>>> # Input: {"a": [1, 2]}
>>> # Depth:  1 1 1 1 1 2 2 2 2 1 0
vibespatial.io.gpu_parse.quote_parity(d_bytes: cupy.ndarray) cupy.ndarray

Compute per-byte quote-parity mask via toggle + cumulative sum.

Marks each byte position as inside (1) or outside (0) a quoted string literal. The algorithm is:

  1. A per-byte kernel emits 1 at unescaped quote characters ("), 0 elsewhere. Escaped quotes (preceded by an odd number of backslashes) emit 0.

  2. A uint8 cumulative sum over the toggle array yields a monotonically increasing counter. A bitwise AND with 1 extracts the low bit, producing the parity: 0 = outside string, 1 = inside string.

Using uint8 cumsum instead of int32 saves 4x memory (2.16 GB vs 8.64 GB for a 2 GB input file). Parity remains correct after uint8 overflow because 256 is even.

Parameters

d_bytescp.ndarray

Device-resident uint8 array of raw file bytes, shape (n,).

Returns

cp.ndarray

Device-resident uint8 array of shape (n,). Each element is 0 (outside quoted string) or 1 (inside quoted string).

Notes

The quote character is always ASCII " (0x22). This primitive does not support single-quoted strings. Backslash-escaped quotes (\") are handled correctly by counting consecutive preceding backslashes: a quote preceded by an odd number of backslashes is escaped and does not toggle parity.

This is the first stage in any GPU text-parsing pipeline. The resulting parity mask is consumed by bracket_depth, number_boundaries, and pattern_match to filter out bytes that appear inside string literals.

Examples

>>> # Input: {"key": "val"}
>>> # Bytes:  { " k e y " :   " v a l " }
>>> # Parity: 0 0 1 1 1 0 0 0 0 1 1 1 0 0