vibespatial.io.gpu_parse.numeric¶
GPU numeric parsing primitives for text formats.
Provides boundary detection and ASCII-to-number conversion for numeric values embedded in structured text. The pipeline is:
1. number_boundaries: per-byte kernel classifies number start/end
2. extract_number_positions: compact boundary masks to position arrays
3. parse_ascii_floats / parse_ascii_ints: per-number parallel parse
All functions operate on device-resident arrays with zero host materialization.
Attributes¶
Functions¶
| number_boundaries | Identify start and end positions of numeric tokens. |
| parse_ascii_floats | Parse ASCII numeric tokens to float64 values on GPU. |
| parse_ascii_ints | Parse ASCII integer tokens to int64 values on GPU. |
| extract_number_positions | Convert boundary masks to compact position arrays. |
Module Contents¶
- vibespatial.io.gpu_parse.numeric.cp = None¶
- vibespatial.io.gpu_parse.numeric.KERNEL_PARAM_I64¶
- vibespatial.io.gpu_parse.numeric.number_boundaries(d_bytes: cupy.ndarray, d_quote_parity: cupy.ndarray) tuple[cupy.ndarray, cupy.ndarray]¶
Identify start and end positions of numeric tokens.
A per-byte GPU kernel classifies each byte position as either the start of a numeric token, the end of a numeric token, or neither. Bytes inside quoted strings (d_quote_parity == 1) are always classified as neither.
- Start detection
A byte is a number start if it is a numeric-initial character (0-9, -, +) AND the preceding byte is a separator (comma, [, space, tab, newline, carriage return).
- End detection
A byte is a number end if it is a numeric character (0-9, ., e, E, -, +) AND the following byte is a separator (comma, ], space, tab, newline, carriage return).
Parameters¶
- d_bytes : cp.ndarray
Device-resident uint8 array of raw file bytes, shape (n,).
- d_quote_parity : cp.ndarray
Device-resident uint8 parity mask from structural.quote_parity(), shape (n,).
Returns¶
- d_is_start : cp.ndarray
Device-resident uint8 array, shape (n,). Element i is 1 if byte i is the first byte of a numeric token, else 0.
- d_is_end : cp.ndarray
Device-resident uint8 array, shape (n,). Element i is 1 if byte i is the last byte of a numeric token, else 0.
Notes¶
The returned arrays are byte-level masks, not position arrays. Use extract_number_positions to convert them to compact int64 position arrays suitable for parse_ascii_floats.
The boundary heuristic is designed for JSON/CSV numeric formats. It handles:
- Integers: 123, -42
- Decimals: 3.14, -0.001
- Scientific notation: 1.5e10, -2.3E-4
- Leading sign: +1.0, -1.0
Examples¶
>>> # Input: [1.5, -2.3]
>>> #         ^ ^  ^  ^  (start, end pairs)
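The start and end rules above can be sketched on the CPU with NumPy. This is only an illustrative reference, not the library's kernel: `number_boundaries_reference` is a hypothetical name, the parity mask is written by hand rather than produced by `structural.quote_parity()`, and the treatment of the first and last byte (which have no neighbour) is an assumption.

```python
import numpy as np

# Character classes copied from the start/end rules described above.
START_CHARS = np.frombuffer(b"0123456789-+", dtype=np.uint8)
NUMERIC_CHARS = np.frombuffer(b"0123456789.eE-+", dtype=np.uint8)
SEPARATORS_BEFORE = np.frombuffer(b",[ \t\n\r", dtype=np.uint8)
SEPARATORS_AFTER = np.frombuffer(b",] \t\n\r", dtype=np.uint8)


def number_boundaries_reference(data: bytes, quote_parity: np.ndarray):
    """Hypothetical CPU reference: return (is_start, is_end) uint8 masks."""
    b = np.frombuffer(data, dtype=np.uint8)
    outside_quotes = quote_parity == 0

    # ASSUMPTION: the first byte has no predecessor and the last byte has no
    # successor, so neither can be classified as a boundary here.
    prev_is_sep = np.zeros(b.shape, dtype=bool)
    prev_is_sep[1:] = np.isin(b[:-1], SEPARATORS_BEFORE)
    next_is_sep = np.zeros(b.shape, dtype=bool)
    next_is_sep[:-1] = np.isin(b[1:], SEPARATORS_AFTER)

    is_start = np.isin(b, START_CHARS) & prev_is_sep & outside_quotes
    is_end = np.isin(b, NUMERIC_CHARS) & next_is_sep & outside_quotes
    return is_start.astype(np.uint8), is_end.astype(np.uint8)


data = b'["a 7 b", 1.5, -2.3]'
# Hand-written parity mask: 1 on the quoted string (bytes 1..7).
# ASSUMPTION: the real quote_parity() may mark the quote characters differently.
parity = np.zeros(len(data), dtype=np.uint8)
parity[1:8] = 1

is_start, is_end = number_boundaries_reference(data, parity)
print(np.flatnonzero(is_start).tolist(), np.flatnonzero(is_end).tolist())
# [10, 15] [12, 18]
```

The `7` inside the quoted string is surrounded by spaces, so it would satisfy both rules on its own; the parity mask is what keeps it out of the result.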
- vibespatial.io.gpu_parse.numeric.parse_ascii_floats(d_bytes: cupy.ndarray, d_starts: cupy.ndarray, d_ends: cupy.ndarray) cupy.ndarray¶
Parse ASCII numeric tokens to float64 values on GPU.
Each CUDA thread processes one token defined by the half-open byte range [d_starts[i], d_ends[i]). The kernel implements a character-by-character state machine supporting:
- Optional leading sign (+ or -)
- Integer part (digits before decimal point)
- Optional fractional part (. followed by digits)
- Optional scientific notation exponent (e/E, optional sign, digits)
Parameters¶
- d_bytes : cp.ndarray
Device-resident uint8 array of raw file bytes, shape (n_bytes,).
- d_starts : cp.ndarray
Device-resident int64 array, shape (n_numbers,). Element i is the byte offset of the first character of the i-th numeric token (inclusive).
- d_ends : cp.ndarray
Device-resident int64 array, shape (n_numbers,). Element i is the byte offset one past the last character of the i-th numeric token (exclusive).
Returns¶
- cp.ndarray
Device-resident float64 array, shape (n_numbers,). Each element is the parsed floating-point value. Invalid tokens produce 0.0 (not NaN), so callers should validate input boundaries before parsing.
Notes¶
The start/end convention is half-open: [start, end). This matches the output of extract_number_positions, where ends are already incremented by 1 from the d_is_end mask positions.
The kernel does not handle NaN, Infinity, or hexadecimal float literals. These are not valid in JSON or standard CSV.
Examples¶
>>> # d_bytes contains b"[1.5, -2.3e4]"
>>> # d_starts = [1, 6], d_ends = [4, 12]
>>> # result = [1.5, -23000.0]
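The per-token state machine described above (sign, integer part, fraction, exponent) might look roughly like the following scalar Python sketch, which plays the role of a single CUDA thread. `parse_ascii_float` is a hypothetical name, and the digit-accumulation strategy and the rule for what counts as an invalid token are assumptions; the actual kernel may differ in both.

```python
def parse_ascii_float(data: bytes, start: int, end: int) -> float:
    """Hypothetical reference parser for the half-open range [start, end)."""
    i = start

    # State 1: optional leading sign.
    negative = False
    if i < end and data[i] in b"+-":
        negative = data[i] == ord("-")
        i += 1

    # State 2: integer digits, collected into one integer mantissa.
    mantissa = 0
    frac_digits = 0
    saw_digit = False
    while i < end and 48 <= data[i] <= 57:
        mantissa = mantissa * 10 + (data[i] - 48)
        saw_digit = True
        i += 1

    # State 3: optional fraction; remember how many digits follow the point.
    if i < end and data[i] == ord("."):
        i += 1
        while i < end and 48 <= data[i] <= 57:
            mantissa = mantissa * 10 + (data[i] - 48)
            frac_digits += 1
            saw_digit = True
            i += 1

    # State 4: optional exponent with its own optional sign.
    exponent = 0
    if saw_digit and i < end and data[i] in b"eE":
        i += 1
        exp_negative = False
        if i < end and data[i] in b"+-":
            exp_negative = data[i] == ord("-")
            i += 1
        while i < end and 48 <= data[i] <= 57:
            exponent = exponent * 10 + (data[i] - 48)
            i += 1
        if exp_negative:
            exponent = -exponent

    # ASSUMPTION: "invalid" here means no digits at all; such tokens yield 0.0.
    if not saw_digit:
        return 0.0

    # Apply the net power of ten in a single multiply or divide.
    power = exponent - frac_digits
    value = float(mantissa)
    value = value * 10.0 ** power if power >= 0 else value / 10.0 ** (-power)
    return -value if negative else value


data = b"[1.5, -2.3e4]"
starts, ends = [1, 6], [4, 12]
values = [parse_ascii_float(data, s, e) for s, e in zip(starts, ends)]
print(values)  # [1.5, -23000.0]
```

Collecting the digits as an integer and scaling once at the end avoids the rounding drift that comes from repeatedly multiplying a running fraction by 0.1; whether the GPU kernel makes the same choice is not stated in this page.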
- vibespatial.io.gpu_parse.numeric.parse_ascii_ints(d_bytes: cupy.ndarray, d_starts: cupy.ndarray, d_ends: cupy.ndarray) cupy.ndarray¶
Parse ASCII integer tokens to int64 values on GPU.
Each CUDA thread processes one token defined by the half-open byte range [d_starts[i], d_ends[i]). The kernel implements a simple character-by-character accumulator supporting:
- Optional leading sign (+ or -)
- Decimal digits (0-9)
Fractional parts and exponent notation are not supported. If a non-digit character (other than a leading sign) is encountered, accumulation stops at that position.
Parameters¶
- d_bytes : cp.ndarray
Device-resident uint8 array of raw file bytes, shape (n_bytes,).
- d_starts : cp.ndarray
Device-resident int64 array, shape (n_numbers,). Element i is the byte offset of the first character of the i-th integer token (inclusive).
- d_ends : cp.ndarray
Device-resident int64 array, shape (n_numbers,). Element i is the byte offset one past the last character of the i-th integer token (exclusive).
Returns¶
- cp.ndarray
Device-resident int64 array, shape (n_numbers,). Each element is the parsed integer value. Tokens that contain no valid digits produce 0. Overflow wraps silently (int64 range: -2^63 to 2^63 - 1).
Notes¶
This function does NOT exist in the current geojson_gpu.py pipeline. It is a new primitive for formats that contain integer fields (e.g., feature IDs in GeoJSON, integer attributes in CSV, SRID values in WKT).
The start/end convention is half-open: [start, end), consistent with parse_ascii_floats.
Examples¶
>>> # d_bytes contains b"SRID=4326;POINT(1 2)"
>>> # d_starts = [5], d_ends = [9]
>>> # result = [4326]
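The accumulator behaviour described for this function (optional sign, digits, stop at the first non-digit, silent wraparound) can be illustrated with a scalar Python sketch. `parse_ascii_int` is a hypothetical name, and the two's-complement wrap shown here is an assumption about how "wraps silently" plays out for int64.

```python
def parse_ascii_int(data: bytes, start: int, end: int) -> int:
    """Hypothetical reference parser for the half-open range [start, end)."""
    i = start

    # Optional leading sign.
    negative = False
    if i < end and data[i] in b"+-":
        negative = data[i] == ord("-")
        i += 1

    # Accumulate digits; stop at the first non-digit character.
    acc = 0
    while i < end and 48 <= data[i] <= 57:
        acc = acc * 10 + (data[i] - 48)
        i += 1
    if negative:
        acc = -acc

    # ASSUMPTION: emulate int64 two's-complement wraparound on overflow.
    return (acc + 2**63) % 2**64 - 2**63


data = b"SRID=4326;POINT(1 2)"
print(parse_ascii_int(data, 5, 9))       # 4326
print(parse_ascii_int(b"12.5", 0, 4))    # 12 (accumulation stops at '.')
print(parse_ascii_int(b"abc", 0, 3))     # 0 (no valid digits)
```

Because values past the int64 range do not raise an error, a caller that cannot rule out very long digit runs would presumably need to check token length before trusting the result.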
- vibespatial.io.gpu_parse.numeric.extract_number_positions(d_is_start: cupy.ndarray, d_is_end: cupy.ndarray, d_mask: cupy.ndarray | None = None) tuple[cupy.ndarray, cupy.ndarray]¶
Convert boundary masks to compact position arrays.
Takes the per-byte start/end masks from number_boundaries and produces compact int64 position arrays suitable for parse_ascii_floats or parse_ascii_ints.
Optionally filters by a region mask so that only numbers within specific spans (e.g., coordinate spans in GeoJSON, value columns in CSV) are included.
Parameters¶
- d_is_start : cp.ndarray
Device-resident uint8 array, shape (n_bytes,). Per-byte number-start indicators from number_boundaries.
- d_is_end : cp.ndarray
Device-resident uint8 array, shape (n_bytes,). Per-byte number-end indicators from number_boundaries.
- d_mask : cp.ndarray or None, default None
Optional device-resident uint8 region mask, shape (n_bytes,). If provided, only number boundaries where d_mask[i] == 1 are included. When None, all detected boundaries are returned.
Returns¶
- d_starts : cp.ndarray
Device-resident int64 array, shape (n_numbers,). Byte offsets of the first character of each detected number (inclusive).
- d_ends : cp.ndarray
Device-resident int64 array, shape (n_numbers,). Byte offsets one past the last character of each detected number (exclusive). This is computed as flatnonzero(d_is_end) + 1 so that the range [start, end) spans the full token.
Notes¶
When d_mask is provided, the function computes element-wise multiplication of both boundary masks with the region mask before extracting positions. This avoids materializing filtered intermediate arrays.
The returned arrays are always contiguous int64 arrays suitable for direct kernel parameter passing.
Examples¶
>>> # d_is_start marks positions [3, 7, 15]
>>> # d_is_end marks positions [5, 10, 18]
>>> # d_mask is 1 only in [0..12]
>>> # Result: d_starts=[3, 7], d_ends=[6, 11]
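The mask-multiply and flatnonzero steps described above can be reproduced on the CPU with NumPy. `extract_number_positions_reference` is a hypothetical name used for illustration only; the documented function runs on CuPy arrays, and this sketch simply follows the formula given in the Returns section.

```python
import numpy as np


def extract_number_positions_reference(is_start, is_end, mask=None):
    """Hypothetical CPU reference: masks in, half-open (starts, ends) out."""
    if mask is not None:
        # Element-wise multiply zeroes out boundaries outside the region.
        is_start = is_start * mask
        is_end = is_end * mask
    starts = np.flatnonzero(is_start).astype(np.int64)
    # +1 converts "index of last byte" into an exclusive end offset.
    ends = np.flatnonzero(is_end).astype(np.int64) + 1
    return starts, ends


n = 20
is_start = np.zeros(n, dtype=np.uint8)
is_start[[3, 7, 15]] = 1
is_end = np.zeros(n, dtype=np.uint8)
is_end[[5, 10, 18]] = 1
mask = np.zeros(n, dtype=np.uint8)
mask[:13] = 1  # region covers bytes 0..12

starts, ends = extract_number_positions_reference(is_start, is_end, mask)
print(starts.tolist(), ends.tolist())  # [3, 7] [6, 11]
```

One consequence of filtering the two masks independently is that a region edge falling inside a token would keep one boundary and drop the other, leaving starts and ends with different lengths; region masks presumably need to cover whole tokens.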