vibespatial.io.gpu_parse.pattern
GPU pattern matching and span detection primitives.
Provides byte-pattern search with optional quote-state filtering, depth-based span boundary detection, and region mask generation. These primitives enable GPU parsers to locate structural markers (JSON keys, WKT keywords, XML tags) and define the byte ranges they govern.
Typical pipeline:
1. pattern_match: find all occurrences of a byte pattern
2. span_boundaries: for each match, scan depth to find the end
3. mark_spans: create a per-byte region mask from start/end pairs
All functions operate on device-resident arrays with zero host materialization.
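The three stages can be sketched end to end on the CPU. The block below is only an illustration of how the outputs feed each other: `find_all`, `span_end`, and `region_mask` are hypothetical stand-ins written in pure Python, not the library's GPU functions, and the bracket counting ignores quoted strings for brevity.

```python
def find_all(data: bytes, pattern: bytes) -> list[int]:
    """Stage 1 stand-in: start offset of every match of `pattern`."""
    out, i = [], data.find(pattern)
    while i != -1:
        out.append(i)
        i = data.find(pattern, i + 1)
    return out


def span_end(data: bytes, pos: int) -> int:
    """Stage 2 stand-in: offset one past the ']' closing the first '[' at or after pos."""
    level = 0
    for i in range(pos, len(data)):
        if data[i] == ord("["):
            level += 1
        elif data[i] == ord("]"):
            level -= 1
            if level == 0:
                return i + 1
    return len(data)  # unterminated span: fall back to the input length


def region_mask(starts, ends, n: int) -> bytearray:
    """Stage 3 stand-in: 1 for every byte inside any half-open [start, end) range."""
    mask = bytearray(n)
    for s, e in zip(starts, ends):
        mask[s:e] = b"\x01" * (e - s)
    return mask


data = b'{"a": {"coordinates": [[1,2],[3,4]]}, "b": {"coordinates": [5,6]}}'
key = b'"coordinates":'

starts = find_all(data, key)
ends = [span_end(data, s + len(key)) for s in starts]
mask = region_mask(starts, ends, len(data))
spans = [data[s:e] for s, e in zip(starts, ends)]
```

Each stage consumes only the arrays produced by the previous one, which is what lets the GPU versions keep every intermediate on the device.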
Attributes
- KERNEL_PARAM_I64
Functions
- pattern_match: Find all occurrences of a byte pattern in the input.
- span_boundaries: Find span end positions by scanning bracket depth.
- mark_spans: Create a per-byte region mask from start/end position pairs.
Module Contents
- vibespatial.io.gpu_parse.pattern.KERNEL_PARAM_I64
- vibespatial.io.gpu_parse.pattern.pattern_match(d_bytes: cupy.ndarray, pattern: bytes, d_quote_parity: cupy.ndarray | None = None, *, quote_check_offset: int = -1) -> cupy.ndarray
Find all occurrences of a byte pattern in the input.
A per-byte GPU kernel tests whether the substring starting at each position matches the given pattern. Optionally validates that the match is outside a quoted string by checking the quote parity at a specific offset within the pattern.
Parameters
- d_bytes : cp.ndarray
  Device-resident uint8 array of raw file bytes, shape (n,).
- pattern : bytes
  The byte pattern to search for. Must be non-empty. Maximum length 256 bytes. The pattern is compiled into the NVRTC kernel as a constant array for optimal access.
- d_quote_parity : cp.ndarray or None, default None
  Device-resident uint8 parity mask from structural.quote_parity(), shape (n,). When provided, matches inside quoted strings are suppressed based on the quote_check_offset parameter. When None, no quote filtering is applied.
- quote_check_offset : int, default -1
  Byte offset within the pattern at which to check quote parity. A match is suppressed if d_quote_parity[pos + quote_check_offset] != 0. A value of -1 means: use the last byte of the pattern (len(pattern) - 1).
  For JSON key patterns like "coordinates":, the check offset should point to the colon (last byte), because inside a real key the opening and closing quotes cancel to parity 0. Inside a string value, parity would be 1 (odd), suppressing the match.
Returns
- cp.ndarray
  Device-resident uint8 array, shape (n,). Element i is 1 if the pattern matches starting at byte offset i, else 0. Positions where the pattern would extend past the end of the input are always 0.
Notes
This is a generalization of the find_coord_key and find_type_key kernels from the GeoJSON parser. Those search for specific 14-byte and 7-byte patterns respectively. This function parameterizes the pattern and supports any length up to the 256-byte limit.
The kernel is generated at runtime via NVRTC with the pattern bytes embedded as a compile-time constant. A kernel cache keyed on the pattern bytes avoids redundant compilations.
For multi-criteria matching (e.g., pattern match AND depth check), combine the output with depth-based filtering after the call:

    hits = pattern_match(d_bytes, b'"type":', d_qp)
    # Further filter by depth
    hits = hits * (d_depth == 4).view(cp.uint8)
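The caching idea described in the notes above can be illustrated without a GPU. The sketch below is an assumption-laden illustration, not the library's code: `build_kernel_source` is a hypothetical helper, the CUDA text it returns is an invented minimal kernel without quote filtering, and only the source string is cached here, whereas the real module would cache the compiled NVRTC kernel.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # cache keyed on the pattern bytes
def build_kernel_source(pattern: bytes) -> str:
    """Return illustrative CUDA source with `pattern` baked in as a constant array."""
    if not 0 < len(pattern) <= 256:
        raise ValueError("pattern must be 1..256 bytes long")
    m = len(pattern)
    literal = ", ".join(str(b) for b in pattern)
    return f"""
__constant__ unsigned char PATTERN[{m}] = {{{literal}}};

extern "C" __global__
void pattern_match(const unsigned char* bytes, unsigned char* hits, long long n) {{
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Positions too close to the end can never match.
    unsigned char ok = (i + {m} <= n);
    for (int k = 0; ok && k < {m}; ++k) {{
        ok = (bytes[i + k] == PATTERN[k]);
    }}
    hits[i] = ok;
}}
"""


src = build_kernel_source(b'"type":')
again = build_kernel_source(b'"type":')  # served from the cache, no rebuild
```

Because `bytes` objects are hashable, the pattern itself can serve as the cache key, so repeated searches for the same key pay the generation cost once.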
Examples
>>> # Input: {"coordinates": [1,2], "coord": 3}
>>> # Pattern: b'"coordinates":'
>>> # Result: 1 at position 1, 0 elsewhere
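The example above, and the effect of quote filtering, can be reproduced with a pure-Python reference. This is a sketch of the documented semantics only: `pattern_match_ref` and `quote_parity_ref` are hypothetical names, and the parity convention (1 from an opening quote up to, but not including, its closing quote) is an assumption about what structural.quote_parity() produces.

```python
def quote_parity_ref(data: bytes) -> list[int]:
    """Assumed parity convention: toggles on each unescaped double quote."""
    parity, escaped, out = 0, False, []
    for b in data:
        if escaped:
            escaped = False
        elif b == 0x5C:      # backslash escapes the next byte
            escaped = True
        elif b == 0x22:      # unescaped double quote
            parity ^= 1
        out.append(parity)
    return out


def pattern_match_ref(data, pattern, parity=None, quote_check_offset=-1):
    """Per-position match flags, mirroring the documented return value."""
    if not 0 < len(pattern) <= 256:
        raise ValueError("pattern must be 1..256 bytes long")
    m = len(pattern)
    offset = m - 1 if quote_check_offset == -1 else quote_check_offset
    hits = [0] * len(data)
    for i in range(len(data) - m + 1):   # tail positions stay 0
        if data[i:i + m] != pattern:
            continue
        if parity is not None and parity[i + offset] != 0:
            continue                     # check byte sits inside a quoted string
        hits[i] = 1
    return hits


doc = b'{"coordinates": [1,2], "coord": 3}'
key_hits = pattern_match_ref(doc, b'"coordinates":', quote_parity_ref(doc))

wkt = b'POINT(1 2) "POINT"'
unfiltered = pattern_match_ref(wkt, b"POINT")
filtered = pattern_match_ref(wkt, b"POINT", quote_parity_ref(wkt))
```

In the second input the quoted occurrence of POINT is reported without a parity mask and suppressed with one, because the parity at its last byte is odd.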
- vibespatial.io.gpu_parse.pattern.span_boundaries(d_depth: cupy.ndarray, d_starts: cupy.ndarray, n_bytes: int, *, skip_bytes: int = 0) -> cupy.ndarray
Find span end positions by scanning bracket depth.
For each start position, scans forward through the depth array to find the byte position where the nesting depth drops below the depth at the scan start. This identifies the end of a bracket-delimited span (e.g., the closing ] of a JSON "coordinates" array, the closing ) of a WKT geometry).
Parameters
- d_depth : cp.ndarray
  Device-resident int32 depth array from structural.bracket_depth(), shape (n_bytes,).
- d_starts : cp.ndarray
  Device-resident int64 array of span start positions, shape (n_spans,). Each position is the byte offset of the structural marker that begins the span (e.g., the first byte of "coordinates":).
- n_bytes : int
  Total number of bytes in the input. Used as the upper bound for forward scanning.
- skip_bytes : int, default 0
  Number of bytes to skip past each start position before beginning the depth scan. For example, when starting from the position of "coordinates":, skip_bytes=14 skips past the key pattern to reach the opening bracket of the value.
  The scan then advances through any whitespace to find the first bracket, records its depth, and continues until depth drops below that level.
Returns
- cp.ndarray
  Device-resident int64 array, shape (n_spans,). Element i is the byte offset one past the closing bracket of the i-th span. If the closing bracket is not found before n_bytes, the value is n_bytes.
Notes
This is a generalization of the coord_span_end kernel from the GeoJSON parser. That kernel hard-codes skip_bytes=14 for the "coordinates": pattern length.
The scan algorithm:
1. Start at d_starts[i] + skip_bytes
2. Skip forward while depth does not change (whitespace between key and opening bracket)
3. Record start_depth = d_depth[pos] at the opening bracket
4. Scan forward while d_depth[pos] >= start_depth
5. Return pos (one past the closing bracket)
Examples
>>> # Input:  "coordinates": [[1,2],[3,4]]
>>> #         ^pos=0         ^depth=5     ^end
>>> # d_starts = [0], skip_bytes = 14
>>> # Result: [end_position]
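The scan algorithm in the notes can be traced with a small CPU reference. Two assumptions are made here and are not confirmed by this page: the depth convention of structural.bracket_depth() (an opening bracket already carries the incremented depth, a closing bracket still carries it, and the decrement shows up on the next byte), and the reading of "depth does not change" as "equal to the depth at the start position". `bracket_depth_ref` and `span_boundaries_ref` are hypothetical names, and the depth helper ignores quoted strings.

```python
def bracket_depth_ref(data: bytes) -> list[int]:
    """Assumed convention: brackets themselves are counted as inside their span."""
    depth, out = 0, []
    for b in data:
        if b in b"[{(":
            depth += 1
            out.append(depth)
        elif b in b"]})":
            out.append(depth)   # closer still belongs to the span it ends
            depth -= 1
        else:
            out.append(depth)
    return out


def span_boundaries_ref(depth, starts, n_bytes, skip_bytes=0):
    ends = []
    for start in starts:
        base = depth[start]
        pos = start + skip_bytes                  # step 1
        while pos < n_bytes and depth[pos] == base:
            pos += 1                              # step 2: skip to the opener
        if pos < n_bytes:
            start_depth = depth[pos]              # step 3
            while pos < n_bytes and depth[pos] >= start_depth:
                pos += 1                          # step 4
        ends.append(min(pos, n_bytes))            # step 5
    return ends


doc = b'{"coordinates": [[1,2],[3,4]], "x": 1}'
depth = bracket_depth_ref(doc)
ends = span_boundaries_ref(depth, [1], len(doc), skip_bytes=14)
value = doc[1 + 14:ends[0]].strip()

cut = b'"k": [1, 2'   # no closing bracket
cut_ends = span_boundaries_ref(bracket_depth_ref(cut), [0], len(cut), skip_bytes=4)
```

Under a different depth convention the comparison in step 4 would need to change accordingly, so the sketch should be read as one consistent interpretation rather than a description of the kernel.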
- vibespatial.io.gpu_parse.pattern.mark_spans(d_starts: cupy.ndarray, d_ends: cupy.ndarray, n_bytes: int) -> cupy.ndarray
Create a per-byte region mask from start/end position pairs.
For each (d_starts[i], d_ends[i]) pair, sets all bytes in the half-open range [d_starts[i], d_ends[i]) to 1 in the output mask. All other positions are 0.
This is used to create coordinate-span masks that filter number detection to only relevant regions of the file.
Parameters
- d_starts : cp.ndarray
  Device-resident int64 array of span start positions, shape (n_spans,). Each element is an inclusive byte offset.
- d_ends : cp.ndarray
  Device-resident int64 array of span end positions, shape (n_spans,). Each element is an exclusive byte offset.
- n_bytes : int
  Total number of bytes in the input. The output mask has this length.
Returns
- cp.ndarray
  Device-resident uint8 array, shape (n_bytes,). Element i is 1 if byte i falls within any span, else 0. Overlapping spans are handled correctly (union semantics).
Notes
This is a generalization of the mark_coord_spans kernel from the GeoJSON parser. That kernel reads start positions from coord_positions and offsets them by 14 bytes (the length of "coordinates":). This function takes pre-computed start/end arrays directly.
The kernel launches one thread per span (not per byte). Each thread writes 1 to all bytes in its span via a serial loop. For large numbers of short spans, this is efficient because each thread writes one short, contiguous run of bytes. For very large spans (>1M bytes each), a per-byte kernel with binary search over sorted starts would be more efficient, but in practice coordinate spans are small relative to file size.
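The per-byte alternative mentioned in the note above can be sketched as follows. This is a hypothetical CPU illustration (`mark_spans_per_byte` is not part of the module): each byte position does a binary search over the sorted starts, and a running maximum of the ends is an added assumption that preserves union semantics when spans overlap.

```python
from bisect import bisect_right
from itertools import accumulate


def mark_spans_per_byte(starts, ends, n_bytes):
    """One lookup per byte instead of one serial loop per span."""
    order = sorted(range(len(starts)), key=starts.__getitem__)
    sorted_starts = [starts[k] for k in order]
    # max_end[j] is the furthest end among the first j + 1 spans by start.
    max_end = list(accumulate((ends[k] for k in order), max))
    mask = bytearray(n_bytes)
    for i in range(n_bytes):   # on a GPU, one thread would own each i
        j = bisect_right(sorted_starts, i) - 1   # last span starting at or before i
        if j >= 0 and i < max_end[j]:
            mask[i] = 1
    return mask


mask = mark_spans_per_byte([50, 10], [60, 25], 100)
overlap = mark_spans_per_byte([0, 3], [5, 8], 10)
```

The work per byte is logarithmic in the number of spans and independent of span length, which is why this shape suits a few very long spans better than the one-thread-per-span loop.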
Examples
>>> # d_starts = [10, 50], d_ends = [25, 60], n_bytes = 100
>>> # Result: 0s except positions [10..24] and [50..59] are 1
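The example above can be checked with a pure-Python reference that mirrors the one-loop-per-span strategy from the notes. `mark_spans_ref` is a hypothetical name, and clamping the ranges to [0, n_bytes) is an assumption added for safety rather than documented behavior.

```python
def mark_spans_ref(starts, ends, n_bytes):
    """Set mask[s:e] = 1 for each half-open [s, e) pair; overlaps simply union."""
    mask = bytearray(n_bytes)
    for s, e in zip(starts, ends):   # the GPU version assigns one thread per pair
        s, e = max(s, 0), min(e, n_bytes)
        if e > s:
            mask[s:e] = b"\x01" * (e - s)
    return mask


mask = mark_spans_ref([10, 50], [25, 60], 100)
ones = [i for i, v in enumerate(mask) if v]
```

Writing 1 is idempotent, so overlapping spans need no synchronization: whichever span writes a byte last leaves the same value.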