vibespatial.io.gpu_parse.structural¶
GPU structural scanning primitives for text parsing.
Provides quote-state detection and bracket-depth computation via per-byte delta kernels and CuPy prefix sums. These are the foundational building blocks for any structured-text GPU parser: the quote-parity mask identifies which bytes are inside string literals, and the bracket-depth array encodes the hierarchical nesting structure of the document.
Both functions operate on device-resident byte arrays and return device-resident results with zero host materialization.
Attributes¶
Functions¶
| quote_parity(d_bytes) | Compute per-byte quote-parity mask via toggle + cumulative sum. |
| bracket_depth(d_bytes, d_quote_parity, *, open_chars, close_chars) | Compute per-byte nesting depth via delta kernel + prefix sum. |
Module Contents¶
- vibespatial.io.gpu_parse.structural.cp = None¶
- vibespatial.io.gpu_parse.structural.KERNEL_PARAM_I64¶
- vibespatial.io.gpu_parse.structural.quote_parity(d_bytes: cupy.ndarray) -> cupy.ndarray¶
Compute per-byte quote-parity mask via toggle + cumulative sum.
Marks each byte position as inside (1) or outside (0) a quoted string literal. The algorithm is:
1. A per-byte kernel emits 1 at unescaped quote characters ("), 0 elsewhere. Escaped quotes (preceded by an odd number of backslashes) emit 0.
2. A uint8 cumulative sum over the toggle array yields a running count of toggles, which wraps modulo 256. A bitwise AND with 1 extracts the low bit, producing the parity: 0 = outside string, 1 = inside string.
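The two steps above can be sketched on the CPU in plain Python. This is a minimal reference sketch, not the library's kernel: the name quote_parity_reference is hypothetical, and it assumes an inclusive running sum, so an opening quote itself receives parity 1 and a closing quote parity 0.

```python
from itertools import accumulate

QUOTE = 0x22      # ASCII double quote
BACKSLASH = 0x5C  # ASCII backslash


def quote_parity_reference(data: bytes) -> list[int]:
    """Hypothetical CPU reference for the toggle + running-sum scheme."""
    toggles = []
    for i, byte in enumerate(data):
        if byte != QUOTE:
            toggles.append(0)
            continue
        # Count the run of backslashes immediately before this quote.
        run = 0
        j = i - 1
        while j >= 0 and data[j] == BACKSLASH:
            run += 1
            j -= 1
        # An odd run means the quote is escaped and must not toggle.
        toggles.append(0 if run % 2 else 1)
    # Inclusive running sum, then keep only the low bit.
    return [total & 1 for total in accumulate(toggles)]


parity = quote_parity_reference(b'{"k": "v"}')
```

On the GPU the per-byte loop would be one thread per byte and the running sum a device-wide prefix sum; the sketch only fixes the expected values for small inputs.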
Using a uint8 cumsum instead of int32 needs a quarter of the memory (2.16 GB vs 8.64 GB for a 2 GB input file). Parity remains correct after uint8 overflow because 256 is even: wrapping subtracts a multiple of 256, which never changes the low bit.
Parameters¶
- d_bytes : cp.ndarray
  Device-resident uint8 array of raw file bytes, shape (n,).
Returns¶
- cp.ndarray
  Device-resident uint8 array of shape (n,). Each element is 0 (outside quoted string) or 1 (inside quoted string).
Notes¶
The quote character is always ASCII " (0x22). This primitive does not support single-quoted strings. Backslash-escaped quotes (\") are handled by counting consecutive preceding backslashes: a quote preceded by an odd number of backslashes is escaped and does not toggle parity.

This is the first stage in any GPU text-parsing pipeline. The resulting parity mask is consumed by bracket_depth, number_boundaries, and pattern_match to filter out bytes that appear inside string literals.
Examples¶
>>> # Input:  {"key": "val"}
>>> # Bytes:  { " k e y " : " v a l " }   (space omitted)
>>> # Parity: 0 1 1 1 1 0 0 1 1 1 1 0 0
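The claim that parity survives uint8 overflow can be checked with a small plain-Python sketch. It emulates an 8-bit accumulator with a mask rather than using a real uint8 array; the helper names are hypothetical.

```python
def wrapped_parity(toggles: list[int]) -> list[int]:
    """Parity from an accumulator that wraps like uint8 (modulo 256)."""
    total = 0
    out = []
    for t in toggles:
        total = (total + t) & 0xFF  # emulate uint8 wraparound
        out.append(total & 1)
    return out


def exact_parity(toggles: list[int]) -> list[int]:
    """Parity from an unbounded Python integer accumulator."""
    total = 0
    out = []
    for t in toggles:
        total += t
        out.append(total & 1)
    return out


# 400 toggles, so the emulated 8-bit counter wraps past 255 at least once.
toggles = [1, 0, 0] * 400
same = wrapped_parity(toggles) == exact_parity(toggles)
```

The two results agree because each wrap removes exactly 256 from the counter, and 256 has a zero low bit.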
- vibespatial.io.gpu_parse.structural.bracket_depth(d_bytes: cupy.ndarray, d_quote_parity: cupy.ndarray, *, open_chars: str = '{[', close_chars: str = '}]') -> cupy.ndarray¶
Compute per-byte nesting depth via delta kernel + prefix sum.
Produces an int32 array where each position holds the cumulative bracket depth at that byte offset. The algorithm is:
1. A per-byte kernel emits +1 for open-bracket characters, -1 for close-bracket characters, and 0 for all other bytes. Brackets inside quoted strings (where d_quote_parity is 1) are treated as 0.
2. An int32 cumulative sum over the delta array yields the running depth at each byte position.
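The delta-then-prefix-sum steps can be sketched on the CPU as follows. This is an illustrative reference in plain Python, not the GPU kernel: bracket_depth_reference is a hypothetical name, and the parity list is written by hand under the assumption that an opening quote has parity 1 and a closing quote parity 0.

```python
from itertools import accumulate


def bracket_depth_reference(data: bytes, parity: list[int],
                            open_chars: str = "{[",
                            close_chars: str = "}]") -> list[int]:
    """Hypothetical CPU reference: per-byte deltas, then an inclusive prefix sum."""
    opens = set(open_chars.encode("ascii"))
    closes = set(close_chars.encode("ascii"))
    deltas = []
    for byte, inside in zip(data, parity):
        if inside:
            deltas.append(0)   # bracket inside a string literal is ignored
        elif byte in opens:
            deltas.append(1)
        elif byte in closes:
            deltas.append(-1)
        else:
            deltas.append(0)
    return list(accumulate(deltas))


data = b'{"a": [1, 2]}'
# Hand-written parity for this input: only the bytes `"a` are marked inside.
parity = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
depth = bracket_depth_reference(data, parity)
```

Because the sum is inclusive, the depth at the opening `{` is already 1 and the depth at the final `}` is 0.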
The open/close character sets are parameterizable so that the same primitive works across formats:
- JSON: open_chars="{[", close_chars="}]"
- WKT: open_chars="(", close_chars=")"
- XML: open_chars="<", close_chars=">"
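To illustrate the parameterization, here is a small CPU sketch that applies the WKT character set to a polygon string. The helper depth_with is hypothetical, and quote masking is left out only because this sample contains no quoted strings.

```python
from itertools import accumulate


def depth_with(data: bytes, open_chars: str, close_chars: str) -> list[int]:
    """Hypothetical helper: inclusive depth for an arbitrary bracket set."""
    # Map each bracket byte to its delta; every other byte contributes 0.
    delta = {b: 1 for b in open_chars.encode("ascii")}
    delta.update({b: -1 for b in close_chars.encode("ascii")})
    return list(accumulate(delta.get(b, 0) for b in data))


wkt_depth = depth_with(b"POLYGON ((0 0, 1 1))", "(", ")")
```

The coordinate text sits at depth 2, which is how a WKT parser built on this primitive could tell ring contents apart from the geometry keyword at depth 0.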
Parameters¶
- d_bytes : cp.ndarray
  Device-resident uint8 array of raw file bytes, shape (n,).
- d_quote_parity : cp.ndarray
  Device-resident uint8 parity mask from quote_parity(), shape (n,). Positions with parity 1 (inside string) are excluded from depth computation.
- open_chars : str, default "{["
  Characters that increment depth. Each character is treated independently. Maximum 8 characters.
- close_chars : str, default "}]"
  Characters that decrement depth. Must have the same length as open_chars. Each character is treated independently.
Returns¶
- cp.ndarray
  Device-resident int32 array of shape (n,). Element i holds the cumulative nesting depth at byte offset i. The depth is inclusive: at an opening bracket, the depth already includes the +1 delta from that bracket. At a closing bracket, the depth includes the -1 delta.
Raises¶
- ValueError
  If len(open_chars) != len(close_chars) or either exceeds 8 characters.
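The documented argument checks could look roughly like the sketch below. The function names, the constant, and the error messages are all hypothetical; only the two conditions come from the description above.

```python
MAX_BRACKET_CHARS = 8  # limit stated in the parameter description


def check_bracket_chars(open_chars: str, close_chars: str) -> None:
    """Hypothetical validation mirroring the documented ValueError conditions."""
    if len(open_chars) != len(close_chars):
        raise ValueError("open_chars and close_chars must have the same length")
    if len(open_chars) > MAX_BRACKET_CHARS:
        raise ValueError("at most 8 bracket characters are supported")


def is_rejected(open_chars: str, close_chars: str) -> bool:
    """Return True when the pair would raise ValueError."""
    try:
        check_bracket_chars(open_chars, close_chars)
    except ValueError:
        return True
    return False
```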
Notes¶
The intermediate delta array uses int8 dtype (1 byte per position) to minimize memory before the cumsum materializes the int32 depth array.

For a GeoJSON FeatureCollection, the depth structure is:
- Depth 0: outside the root object
- Depth 1: inside FeatureCollection { }
- Depth 2: inside "features": [ ]
- Depth 3: inside each Feature { }
- Depth 4+: nested geometry objects and coordinate arrays
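As an illustration of how these depth levels might be used, the CPU sketch below locates the opening brace of each Feature by selecting `{` bytes whose inclusive depth is 3. The sample document and variable names are made up for the example, and quote masking is skipped only because the sample has no brackets inside string literals.

```python
from itertools import accumulate

doc = (b'{"type":"FeatureCollection","features":['
       b'{"type":"Feature"},{"type":"Feature"}]}')

# Per-byte deltas for the JSON bracket set; all other bytes contribute 0.
DELTA = {ord("{"): 1, ord("["): 1, ord("}"): -1, ord("]"): -1}
depth = list(accumulate(DELTA.get(b, 0) for b in doc))

# With inclusive depth, the "{" that opens a Feature is a byte where the
# running depth has just reached 3.
feature_starts = [i for i, b in enumerate(doc)
                  if b == ord("{") and depth[i] == 3]
```

The same filter on the matching `}` bytes would use depth 2, since an inclusive depth at a closing bracket already reflects the -1 delta.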
Examples¶
>>> # Input: {"a": [1, 2]}
>>> # Bytes: { " a " : [ 1 , 2 ] }   (spaces omitted)
>>> # Depth: 1 1 1 1 1 2 2 2 2 1 0