vibespatial.io.gpu_parse.pattern

GPU pattern matching and span detection primitives.

Provides byte-pattern search with optional quote-state filtering, depth-based span boundary detection, and region mask generation. These primitives enable GPU parsers to locate structural markers (JSON keys, WKT keywords, XML tags) and define the byte ranges they govern.

Typical pipeline:

  1. pattern_match — find all occurrences of a byte pattern

  2. span_boundaries — for each match, scan depth to find the end

  3. mark_spans — create a per-byte region mask from start/end pairs

All functions operate on device-resident arrays with zero host materialization.

Attributes

KERNEL_PARAM_I64

Functions

pattern_match(→ cupy.ndarray)

Find all occurrences of a byte pattern in the input.

span_boundaries(→ cupy.ndarray)

Find span end positions by scanning bracket depth.

mark_spans(→ cupy.ndarray)

Create a per-byte region mask from start/end position pairs.

Module Contents

vibespatial.io.gpu_parse.pattern.KERNEL_PARAM_I64

vibespatial.io.gpu_parse.pattern.pattern_match(d_bytes: cupy.ndarray, pattern: bytes, d_quote_parity: cupy.ndarray | None = None, *, quote_check_offset: int = -1) → cupy.ndarray

Find all occurrences of a byte pattern in the input.

A per-byte GPU kernel tests whether the substring starting at each position matches the given pattern. Optionally validates that the match is outside a quoted string by checking the quote parity at a specific offset within the pattern.

Parameters

d_bytes : cp.ndarray

Device-resident uint8 array of raw file bytes, shape (n,).

pattern : bytes

The byte pattern to search for. Must be non-empty and at most 256 bytes long. The pattern is compiled into the NVRTC kernel as a constant array for fast access.

d_quote_parity : cp.ndarray or None, default None

Device-resident uint8 parity mask from structural.quote_parity(), shape (n,). When provided, matches inside quoted strings are suppressed based on the quote_check_offset parameter. When None, no quote filtering is applied.

quote_check_offset : int, default -1

Byte offset within the pattern at which to check quote parity. A match is suppressed if d_quote_parity[pos + quote_check_offset] != 0. A value of -1 means: use the last byte of the pattern (len(pattern) - 1).

For JSON key patterns like "coordinates":, the check offset should point to the colon (last byte), because inside a real key the opening and closing quotes cancel to parity 0. Inside a string value, parity would be 1 (odd), suppressing the match.

Returns

cp.ndarray

Device-resident uint8 array, shape (n,). Element i is 1 if the pattern matches starting at byte offset i, else 0. Positions where the pattern would extend past the end of the input are always 0.
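The matching and quote-filtering rules described above can be mirrored on the CPU, which is handy for checking GPU output in tests. This is a minimal NumPy sketch, not the library's NVRTC kernel. The names pattern_match_ref and quote_parity_ref are hypothetical, and the parity convention used here (running count of double-quote bytes up to and including each position, modulo 2, with escapes ignored) is an assumption about what structural.quote_parity() produces.

```python
import numpy as np


def quote_parity_ref(data: bytes) -> np.ndarray:
    """Assumed parity: count of '"' bytes at or before i, mod 2 (escapes ignored)."""
    buf = np.frombuffer(data, dtype=np.uint8)
    return (np.cumsum(buf == ord('"')) % 2).astype(np.uint8)


def pattern_match_ref(data, pattern, quote_parity=None, quote_check_offset=-1):
    """CPU reference: 1 where `pattern` starts, 0 elsewhere (hypothetical helper)."""
    if not 0 < len(pattern) <= 256:
        raise ValueError("pattern must be 1..256 bytes long")
    buf = np.frombuffer(data, dtype=np.uint8)
    n, m = buf.size, len(pattern)
    hits = np.zeros(n, dtype=np.uint8)
    if m > n:
        return hits  # pattern cannot fit anywhere
    # One row per candidate start position; rows that would run past the
    # end of the input do not exist, so those positions stay 0.
    windows = np.lib.stride_tricks.sliding_window_view(buf, m)
    matched = (windows == np.frombuffer(pattern, dtype=np.uint8)).all(axis=1)
    if quote_parity is not None:
        # -1 means "check parity at the last byte of the pattern".
        off = m - 1 if quote_check_offset == -1 else quote_check_offset
        matched &= quote_parity[off : off + matched.size] == 0
    hits[: matched.size] = matched
    return hits
```

With the input b'"POINT" POINT' and the pattern b'POINT', the unfiltered call reports both occurrences, while passing the parity mask keeps only the unquoted one.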

Notes

This is a generalization of the find_coord_key and find_type_key kernels from the GeoJSON parser. Those search for specific 14-byte and 7-byte patterns respectively. This function parameterizes the pattern and supports arbitrary lengths.

The kernel is generated at runtime via NVRTC with the pattern bytes embedded as a compile-time constant. A kernel cache keyed on the pattern bytes avoids redundant compilations.

For multi-criteria matching (e.g., pattern match AND depth check), combine the output with depth-based filtering after the call:

hits = pattern_match(d_bytes, b'"type":', d_qp)
# Further filter by depth
hits = hits * (d_depth == 4).view(cp.uint8)
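pattern_match returns a per-byte mask, whereas span_boundaries expects int64 start offsets, so the filtered mask has to be converted to positions at some point. One way to do that is flatnonzero. The sketch below uses NumPy as a stand-in so it runs without a GPU; cupy.flatnonzero behaves the same way on device arrays. The helper name hits_to_starts is hypothetical and is not part of this module.

```python
import numpy as np  # stand-in for cupy so the sketch runs on a CPU


def hits_to_starts(hits, depth=None, required_depth=None):
    """Turn a 0/1 hit mask into int64 start offsets (hypothetical helper).

    If both `depth` and `required_depth` are given, only hits located at
    that nesting depth are kept, mirroring the depth filter shown above.
    """
    keep = hits.astype(bool)
    if depth is not None and required_depth is not None:
        keep &= depth == required_depth
    return np.flatnonzero(keep).astype(np.int64)
```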

Examples

>>> # Input: {"coordinates": [1,2], "coord": 3}
>>> # Pattern: b'"coordinates":'
>>> # Result: 1 at position 1, 0 elsewhere
vibespatial.io.gpu_parse.pattern.span_boundaries(d_depth: cupy.ndarray, d_starts: cupy.ndarray, n_bytes: int, *, skip_bytes: int = 0) → cupy.ndarray

Find span end positions by scanning bracket depth.

For each start position, scans forward through the depth array to find the byte position where the nesting depth drops below the depth recorded at the span's opening bracket. This identifies the end of a bracket-delimited span (e.g., the closing ] of a JSON "coordinates" array, the closing ) of a WKT geometry).

Parameters

d_depth : cp.ndarray

Device-resident int32 depth array from structural.bracket_depth(), shape (n_bytes,).

d_starts : cp.ndarray

Device-resident int64 array of span start positions, shape (n_spans,). Each position is the byte offset of the structural marker that begins the span (e.g., the first byte of "coordinates":).

n_bytes : int

Total number of bytes in the input. Used as the upper bound for forward scanning.

skip_bytes : int, default 0

Number of bytes to skip past each start position before beginning the depth scan. For example, when starting from the position of "coordinates":, skip_bytes=14 skips past the key pattern to reach the opening bracket of the value.

The scan then advances through any whitespace to find the first bracket, records its depth, and continues until depth drops below that level.

Returns

cp.ndarray

Device-resident int64 array, shape (n_spans,). Element i is the byte offset one past the closing bracket of the i-th span. If the closing bracket is not found before n_bytes, the value is n_bytes.

Notes

This is a generalization of the coord_span_end kernel from the GeoJSON parser. That kernel hard-codes skip_bytes=14 for the "coordinates": pattern length.

The scan algorithm:

  1. Start at d_starts[i] + skip_bytes

  2. Skip forward while depth does not change (whitespace between key and opening bracket)

  3. Record start_depth = d_depth[pos] at the opening bracket

  4. Scan forward while d_depth[pos] >= start_depth

  5. Return pos (one past the closing bracket)
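The five steps above can be sketched as a serial CPU loop. This is an illustrative NumPy reference under stated assumptions, not the GPU kernel. The names bracket_depth_ref and span_boundaries_ref are hypothetical. The depth convention is assumed: an opening bracket already carries the incremented depth, and a closing bracket still carries the depth of the span it closes, which is what makes step 5 land one past the closing bracket. Brackets inside quoted strings are not handled here. "Depth does not change" in step 2 is interpreted relative to the depth at d_starts[i], which presumes the marker sits outside the bracket it introduces.

```python
import numpy as np


def bracket_depth_ref(data: bytes) -> np.ndarray:
    """Assumed convention: opens counted inclusively, closes exclusively."""
    buf = np.frombuffer(data, dtype=np.uint8)
    opens = np.isin(buf, list(b"[({")).astype(np.int32)
    closes = np.isin(buf, list(b"])}")).astype(np.int32)
    return (np.cumsum(opens) - np.cumsum(closes) + closes).astype(np.int32)


def span_boundaries_ref(depth, starts, n_bytes, skip_bytes=0):
    """Serial reference for the scan described above (hypothetical helper)."""
    ends = np.empty(len(starts), dtype=np.int64)
    for i, start in enumerate(starts):
        base = depth[start]            # depth at the marker itself (assumption)
        pos = int(start) + skip_bytes  # step 1
        # Step 2: skip whitespace until the depth changes.
        while pos < n_bytes and depth[pos] == base:
            pos += 1
        if pos < n_bytes:
            start_depth = depth[pos]   # step 3: depth at the opening bracket
            # Step 4: stay inside the span while depth has not dropped.
            while pos < n_bytes and depth[pos] >= start_depth:
                pos += 1
        ends[i] = min(pos, n_bytes)    # step 5, clamped to n_bytes
    return ends
```

For the input {"coordinates": [[1,2],[3,4]], "k": 1} with a start at offset 1 and skip_bytes=14, the sketch returns 29, the offset of the comma directly after the closing bracket.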

Examples

>>> # Input: "coordinates": [[1,2],[3,4]]
>>> #        ^pos=0          ^depth=5      ^end
>>> # d_starts = [0], skip_bytes = 14
>>> # Result: [end_position]
vibespatial.io.gpu_parse.pattern.mark_spans(d_starts: cupy.ndarray, d_ends: cupy.ndarray, n_bytes: int) → cupy.ndarray

Create a per-byte region mask from start/end position pairs.

For each (d_starts[i], d_ends[i]) pair, sets all bytes in the half-open range [d_starts[i], d_ends[i]) to 1 in the output mask. All other positions are 0.

This is used to create coordinate-span masks that filter number detection to only relevant regions of the file.

Parameters

d_starts : cp.ndarray

Device-resident int64 array of span start positions, shape (n_spans,). Each element is an inclusive byte offset.

d_ends : cp.ndarray

Device-resident int64 array of span end positions, shape (n_spans,). Each element is an exclusive byte offset.

n_bytes : int

Total number of bytes in the input. The output mask has this length.

Returns

cp.ndarray

Device-resident uint8 array, shape (n_bytes,). Element i is 1 if byte i falls within any span, else 0. Overlapping spans produce their union: a byte covered by more than one span is still 1.
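The half-open, union semantics can be illustrated with a few lines of NumPy. This is a CPU reference sketch for testing, not the GPU kernel, and mark_spans_ref is a hypothetical name.

```python
import numpy as np


def mark_spans_ref(starts, ends, n_bytes):
    """Set mask[start:end] = 1 for every span; overlaps simply rewrite 1s."""
    mask = np.zeros(n_bytes, dtype=np.uint8)
    for start, end in zip(starts, ends):
        mask[start:end] = 1  # half-open range [start, end)
    return mask
```

Because each span only ever writes the value 1, the order in which spans are processed does not affect the result.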

Notes

This is a generalization of the mark_coord_spans kernel from the GeoJSON parser. That kernel reads start positions from coord_positions and offsets them by 14 bytes (the length of "coordinates":). This function takes pre-computed start/end arrays directly.

The kernel launches one thread per span (not per byte). Each thread writes 1 to every byte in its span via a serial loop. This works well for large numbers of short spans, because each thread writes only a short contiguous run of bytes. For very large spans (>1M bytes each), a per-byte kernel with a binary search over sorted starts would be more efficient, but in practice coordinate spans are small relative to file size.
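The per-byte alternative mentioned above can be sketched as follows. This is an illustration of the idea in NumPy, not something this module is documented to provide, and mark_spans_per_byte is a hypothetical name. Each byte position binary-searches the sorted starts for the last span that begins at or before it. A running maximum over the sorted ends is an addition made here so that overlapping or nested spans still yield the union.

```python
import numpy as np


def mark_spans_per_byte(starts, ends, n_bytes):
    """Per-byte formulation: one independent lookup per output position."""
    starts = np.asarray(starts, dtype=np.int64)
    ends = np.asarray(ends, dtype=np.int64)
    if starts.size == 0:
        return np.zeros(n_bytes, dtype=np.uint8)
    order = np.argsort(starts, kind="stable")
    sorted_starts = starts[order]
    # Largest end among all spans starting at or before each sorted span.
    max_end = np.maximum.accumulate(ends[order])
    pos = np.arange(n_bytes, dtype=np.int64)
    # Index of the last span whose start is <= pos, or -1 if there is none.
    idx = np.searchsorted(sorted_starts, pos, side="right") - 1
    covered = idx >= 0
    covered[covered] = max_end[idx[covered]] > pos[covered]
    return covered.astype(np.uint8)
```

On a GPU each position would map to one thread, so the work per thread is a logarithmic search rather than a loop whose length depends on the span size.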

Examples

>>> # d_starts = [10, 50], d_ends = [25, 60], n_bytes = 100
>>> # Result: 0s except positions [10..24] and [50..59] are 1