vibespatial.io.gpu_parse.structural

GPU structural scanning primitives for text parsing.

Provides quote-state detection and bracket-depth computation via per-byte delta kernels and CuPy prefix sums. These are the foundational building blocks for any structured-text GPU parser: the quote-parity mask identifies which bytes are inside string literals, and the bracket-depth array encodes the hierarchical nesting structure of the document.

Both functions operate on device-resident byte arrays and return device-resident results with zero host materialization.

Attributes

Functions

quote_parity(→ cupy.ndarray)

Compute per-byte quote-parity mask via toggle + cumulative sum.

bracket_depth(→ cupy.ndarray)

Compute per-byte nesting depth via delta kernel + prefix sum.

Module Contents

vibespatial.io.gpu_parse.structural.cp = None
vibespatial.io.gpu_parse.structural.KERNEL_PARAM_I64
vibespatial.io.gpu_parse.structural.quote_parity(d_bytes: cupy.ndarray) cupy.ndarray

Compute per-byte quote-parity mask via toggle + cumulative sum.

Marks each byte position as inside (1) or outside (0) a quoted string literal. The algorithm is:

  1. A per-byte kernel emits 1 at unescaped quote characters ("), 0 elsewhere. Escaped quotes (preceded by an odd number of backslashes) emit 0.

  2. A uint8 cumulative sum over the toggle array yields a monotonically increasing counter. A bitwise AND with 1 extracts the low bit, producing the parity: 0 = outside string, 1 = inside string.

Using uint8 cumsum instead of int32 saves 4x memory (2.16 GB vs 8.64 GB for a 2 GB input file). Parity remains correct after uint8 overflow because 256 is even.

Parameters

d_bytescp.ndarray

Device-resident uint8 array of raw file bytes, shape (n,).

Returns

cp.ndarray

Device-resident uint8 array of shape (n,). Each element is 0 (outside quoted string) or 1 (inside quoted string).

Notes

The quote character is always ASCII " (0x22). This primitive does not support single-quoted strings. Backslash-escaped quotes (\") are handled correctly by counting consecutive preceding backslashes: a quote preceded by an odd number of backslashes is escaped and does not toggle parity.

This is the first stage in any GPU text-parsing pipeline. The resulting parity mask is consumed by bracket_depth, number_boundaries, and pattern_match to filter out bytes that appear inside string literals.

Examples

>>> # Input: {"key": "val"}
>>> # Bytes:  { " k e y " :   " v a l " }
>>> # Parity: 0 0 1 1 1 0 0 0 0 1 1 1 0 0
vibespatial.io.gpu_parse.structural.bracket_depth(d_bytes: cupy.ndarray, d_quote_parity: cupy.ndarray, *, open_chars: str = '{[', close_chars: str = '}]') cupy.ndarray

Compute per-byte nesting depth via delta kernel + prefix sum.

Produces an int32 array where each position holds the cumulative bracket depth at that byte offset. The algorithm is:

  1. A per-byte kernel emits +1 for open-bracket characters, -1 for close-bracket characters, and 0 for all other bytes. Brackets inside quoted strings (where d_quote_parity is 1) are treated as 0.

  2. An int32 cumulative sum over the delta array yields the running depth at each byte position.

The open/close character sets are parameterizable so that the same primitive works across formats:

  • JSON: open_chars="{[", close_chars="}]"

  • WKT: open_chars="(", close_chars=")"

  • XML: open_chars="<", close_chars=">"

Parameters

d_bytescp.ndarray

Device-resident uint8 array of raw file bytes, shape (n,).

d_quote_paritycp.ndarray

Device-resident uint8 parity mask from quote_parity(), shape (n,). Positions with parity 1 (inside string) are excluded from depth computation.

open_charsstr, default "{["

Characters that increment depth. Each character is treated independently. Maximum 8 characters.

close_charsstr, default "}]"

Characters that decrement depth. Must have the same length as open_chars. Each character is treated independently.

Returns

cp.ndarray

Device-resident int32 array of shape (n,). Element i holds the cumulative nesting depth at byte offset i. The depth is inclusive: at an opening bracket, the depth already includes the +1 delta from that bracket. At a closing bracket, the depth includes the -1 delta.

Raises

ValueError

If len(open_chars) != len(close_chars) or either exceeds 8 characters.

Notes

The intermediate delta array uses int8 dtype (1 byte per position) to minimize memory before the cumsum materializes the int32 depth array.

For JSON documents, the depth structure follows:

  • Depth 0: outside the root object

  • Depth 1: inside FeatureCollection { }

  • Depth 2: inside "features": [ ]

  • Depth 3: inside each Feature { }

  • Depth 4+: nested geometry objects and coordinate arrays

Examples

>>> # Input: {"a": [1, 2]}
>>> # Depth:  1 1 1 1 1 2 2 2 2 1 0