vibespatial.io.csv_gpu¶

GPU CSV reader – structural analysis and spatial geometry extraction.

GPU-accelerated CSV reader with two stages:

Stage 1: Structural Analysis (csv_structural_analysis)

Given a device-resident byte array containing a CSV file, identifies:

Quote parity – a CSV-specific quote toggle kernel that emits 1 at each " character without backslash-escape checking (CSV uses "" doubled-quote escaping, not backslash escaping). Doubled quotes naturally cancel in the cumulative-sum parity computation.
Row boundary detection – marks the positions of \n characters where quote parity is 0 (outside quoted fields). Handles both \n and \r\n line endings.
Column boundary detection – marks the positions of the delimiter character where quote parity is 0. The delimiter is configurable (comma, tab, pipe) and is passed as a kernel parameter.
Column count verification – pure CuPy helper that counts delimiters per row and verifies a consistent column count.
Header parsing – CPU-side helper (small data, one-time D->H copy) that splits the first row by delimiter to extract column names and identify spatial columns by name heuristics.
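
The boundary-detection steps above can be sketched on the CPU. This is a minimal pure-Python reference, not the module's NVRTC kernels: the function name structural_scan is hypothetical, and itertools.accumulate stands in for the CuPy cumulative sum.

```python
from itertools import accumulate


def structural_scan(data: bytes, delimiter: bytes = b","):
    """CPU reference for the three passes: quote toggle, parity scan, boundaries."""
    quote, newline, delim = ord('"'), ord("\n"), delimiter[0]
    # Pass 1: emit 1 at every double-quote byte (no backslash handling).
    toggles = [1 if b == quote else 0 for b in data]
    # Pass 2: inclusive cumulative sum reduced mod 2 gives the per-byte parity.
    parity = [s & 1 for s in accumulate(toggles)]
    # Pass 3: keep only newline and delimiter positions whose parity is 0.
    row_ends = [i for i, b in enumerate(data) if b == newline and parity[i] == 0]
    delimiters = [i for i, b in enumerate(data) if b == delim and parity[i] == 0]
    return parity, row_ends, delimiters


# The second row holds a quoted field containing both a comma and a newline.
sample = b'id,note\n1,"a,b\nc"\n2,plain\n'
parity, row_ends, delimiters = structural_scan(sample)
print(row_ends)    # [7, 17, 25]
print(delimiters)  # [2, 9, 19]
```

The comma at byte 12 and the newline at byte 14 sit inside the quoted field, so neither is reported as a boundary.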

Stage 2: Spatial Column Extraction (read_csv_gpu)

Given the structural analysis result, extracts spatial geometry:

Lat/lon mode: extracts numeric lat/lon columns, parses with parse_ascii_floats, assembles as Point OwnedGeometryArray.
WKT mode: extracts WKT column, concatenates with newline separators, delegates to read_wkt_gpu.
WKB mode: extracts hex-encoded WKB column (auto-detected), decodes hex to binary on CPU, delegates to decode_wkb_arrow_array_owned which has a GPU fast path via pylibcudf.
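
As an illustration of lat/lon mode, the sketch below derives per-row field spans from the boundary positions and parses them as floats. It is a CPU analogue under stated assumptions: field_spans is a hypothetical helper, Python's float() stands in for parse_ascii_floats, and the input has no quoted fields and uses \n line endings.

```python
def field_spans(row_ends, delimiters, n_columns, column):
    """Return (start, end) byte spans of one column for every row."""
    per_row = n_columns - 1  # delimiters per row when column counts are consistent
    spans = []
    for r, row_end in enumerate(row_ends):
        row_start = row_ends[r - 1] + 1 if r > 0 else 0
        row_delims = delimiters[r * per_row:(r + 1) * per_row]
        start = row_start if column == 0 else row_delims[column - 1] + 1
        end = row_end if column == n_columns - 1 else row_delims[column]
        spans.append((start, end))
    return spans


data = b"name,lat,lon\nAlice,40.7,-74.0\nBob,34.0,-118.2\n"
row_ends = [i for i, b in enumerate(data) if b == ord("\n")]
delimiters = [i for i, b in enumerate(data) if b == ord(",")]

# Column 1 is lat and column 2 is lon; [1:] skips the header row.
lat_spans = field_spans(row_ends, delimiters, 3, 1)[1:]
lon_spans = field_spans(row_ends, delimiters, 3, 2)[1:]
lats = [float(data[s:e].decode("ascii")) for s, e in lat_spans]
lons = [float(data[s:e].decode("ascii")) for s, e in lon_spans]
print(lats, lons)  # [40.7, 34.0] [-74.0, -118.2]
```

With \r\n line endings the last field of each row would also need its trailing \r trimmed; that step is omitted here.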

All structural kernels are integer-only byte classification (no floating-point computation), so no PrecisionPlan is needed per ADR-0002 – same rationale as gpu_parse/structural.py. Coordinate parsing delegates to parse_ascii_floats which always produces fp64 – storage precision is always fp64 per ADR-0002.

Tier classification (ADR-0033):

Quote toggle: Tier 1 (custom NVRTC – byte classification)
Row end detection: Tier 1 (custom NVRTC – byte + parity check)
Delimiter detection: Tier 1 (custom NVRTC – byte + parity check)
Parity cumsum: Tier 2 (CuPy cumsum)
Position extraction: Tier 2 (CuPy flatnonzero)
Column count verification: Tier 2 (CuPy element-wise)
Header parsing: CPU (small data, one-time)
Field span extraction: Tier 2 (CuPy index arithmetic)
Quote stripping: Tier 2 (CuPy element-wise byte comparison)
Numeric parsing: delegates to gpu_parse.parse_ascii_floats (Tier 1)
WKT field concatenation: Tier 2 (CuPy scatter/copy)
WKT parsing: delegates to wkt_gpu.read_wkt_gpu
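
The quote-stripping and WKT-concatenation steps listed above can be pictured with a small host-side sketch. The helper names strip_quotes and concat_wkt are hypothetical, and the real implementation works on device byte arrays rather than Python bytes objects.

```python
def strip_quotes(field: bytes) -> bytes:
    """Drop one pair of enclosing double quotes, if present."""
    if len(field) >= 2 and field[:1] == b'"' and field[-1:] == b'"':
        return field[1:-1]
    return field


def concat_wkt(fields):
    """Join per-row WKT fields into one newline-separated buffer."""
    return b"\n".join(strip_quotes(f) for f in fields)


# WKT fields are usually quoted in CSV because they contain commas.
fields = [b'"POINT (1 2)"', b'"LINESTRING (0 0, 1 1)"', b"POINT (3 4)"]
wkt_buffer = concat_wkt(fields)
print(wkt_buffer.decode("ascii"))
```

The resulting buffer holds one geometry per line, which is the shape of input that the WKT mode description says is handed to read_wkt_gpu.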

Attributes¶

`cp`
`KERNEL_PARAM_I64`

Classes¶

`CsvStructuralResult`	Result of CSV structural analysis.
`CsvGpuResult`	Result of GPU CSV spatial reading.

Functions¶

`csv_structural_analysis`(→ CsvStructuralResult)	Perform GPU-accelerated structural analysis of a CSV file.
`read_csv_gpu`(→ CsvGpuResult)	Read a CSV file on GPU and extract spatial geometry.

Module Contents¶

vibespatial.io.csv_gpu.cp = None¶

vibespatial.io.csv_gpu.KERNEL_PARAM_I64¶

class vibespatial.io.csv_gpu.CsvStructuralResult¶

Result of CSV structural analysis.

All device arrays remain on GPU. Only column_names and spatial_columns are host-side Python objects (derived from the small header row).

Attributes¶

d_row_ends (cp.ndarray): int64 positions of row-ending \n characters, shape (n_rows_total,). Includes the header row end at index 0 if has_header was True.
d_delimiters (cp.ndarray): int64 positions of all delimiter characters outside quotes, shape (n_delimiters_total,).
d_quote_parity (cp.ndarray): uint8 per-byte quote state, shape (n_bytes,). 0 = outside quoted field, 1 = inside.
n_rows (int): Number of data rows (excluding header if present).
n_columns (int): Number of columns (fields per row).
column_names (list[str]): Column names from the header row. Empty list if no header.
spatial_columns (dict[str, int]): Detected spatial columns. Keys are "lat", "lon", or "geom" mapped to 0-based column indices.
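
To make the spatial_columns mapping concrete, here is a minimal sketch of header parsing with name heuristics. The name sets and the function detect_spatial_columns are hypothetical; the module's actual heuristics are not documented here and may recognise different spellings.

```python
# Hypothetical name sets, for illustration only.
LAT_NAMES = {"lat", "latitude", "y"}
LON_NAMES = {"lon", "lng", "long", "longitude", "x"}
GEOM_NAMES = {"geom", "geometry", "wkt", "wkb"}


def detect_spatial_columns(header: bytes, delimiter: str = ","):
    """Return (column_names, spatial_columns) for one header row."""
    names = [n.strip().strip('"') for n in header.decode("utf-8").split(delimiter)]
    found = {}
    for index, name in enumerate(names):
        key = name.lower()
        # setdefault keeps the first matching column for each role.
        if key in LAT_NAMES:
            found.setdefault("lat", index)
        elif key in LON_NAMES:
            found.setdefault("lon", index)
        elif key in GEOM_NAMES:
            found.setdefault("geom", index)
    return names, found


names, spatial = detect_spatial_columns(b"name,lat,lon")
print(names)    # ['name', 'lat', 'lon']
print(spatial)  # {'lat': 1, 'lon': 2}
```

The plain split assumes header names do not themselves contain the delimiter.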

d_row_ends: cupy.ndarray¶

d_delimiters: cupy.ndarray¶

d_quote_parity: cupy.ndarray¶

n_rows: int¶

n_columns: int¶

column_names: list[str]¶

spatial_columns: dict[str, int]¶

vibespatial.io.csv_gpu.csv_structural_analysis(d_bytes: cupy.ndarray, delimiter: str = ',', has_header: bool = True) → CsvStructuralResult¶

Perform GPU-accelerated structural analysis of a CSV file.

Identifies row boundaries, column boundaries, and header metadata for a device-resident CSV byte stream. All structural scanning runs on the GPU; only the small header row is copied to host for column name extraction.

Parameters¶

d_bytes (cp.ndarray): Device-resident uint8 array of raw CSV file bytes, shape (n,).
delimiter (str, default ","): Single-character field delimiter. Common values: "," (comma), "\t" (tab), "|" (pipe).
has_header (bool, default True): If True, the first row is treated as a header containing column names. If False, columns are named col_0, col_1, etc.

Returns¶

CsvStructuralResult: Frozen dataclass with device-resident boundary arrays and host-side column metadata.

Raises¶

ValueError: If the delimiter is not a single ASCII character, or if column counts are inconsistent across rows.
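
The two error conditions can be sketched as follows. All three helper names are hypothetical, and counting delimiters per row with a binary search over the sorted positions is only one possible approach, not necessarily what the module's CuPy helper does.

```python
from bisect import bisect_left


def validate_delimiter(delimiter: str) -> int:
    """Return the delimiter's byte value, or raise for anything but one ASCII char."""
    if len(delimiter) != 1 or ord(delimiter) > 127:
        raise ValueError(f"delimiter must be a single ASCII character: {delimiter!r}")
    return ord(delimiter)


def delimiters_per_row(row_ends, delimiters):
    """Count delimiters in each row from two sorted position lists."""
    cumulative = [bisect_left(delimiters, end) for end in row_ends]
    return [c - p for c, p in zip(cumulative, [0] + cumulative[:-1])]


def verify_column_count(counts) -> int:
    """Return the column count, or raise if any row disagrees with the first."""
    if not counts:
        return 0
    for row, count in enumerate(counts):
        if count != counts[0]:
            raise ValueError(f"row {row} has {count + 1} columns, expected {counts[0] + 1}")
    return counts[0] + 1


counts = delimiters_per_row([12, 29, 45], [4, 8, 18, 23, 33, 38])
print(counts, verify_column_count(counts))  # [2, 2, 2] 3
```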

Notes¶

The CSV quoting rules follow RFC 4180:

Fields containing the delimiter, newline, or double-quote are enclosed in double quotes.
Literal double quotes inside quoted fields are escaped as "" (two consecutive double-quote characters).
Newline characters inside quoted fields are NOT row boundaries.

The quote parity algorithm exploits the fact that "" escaping naturally cancels in a cumulative-sum toggle: each " flips parity, so two consecutive " characters flip it twice, returning to the original state. This is simpler than backslash-escape detection (used by JSON) because CSV has no backslash escaping.
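
The cancellation can be checked by hand on a single quoted field. The snippet below is a pure-Python illustration with itertools.accumulate standing in for the GPU cumulative sum; the variable names are ad hoc.

```python
from itertools import accumulate

# One quoted field containing an escaped pair "" twice, followed by a second field.
field = b'"say ""hi"", ok",next'

toggles = [1 if b == ord('"') else 0 for b in field]
parity = [s % 2 for s in accumulate(toggles)]  # inclusive scan, mod 2

comma_positions = [i for i, b in enumerate(field) if b == ord(",")]
outside = [i for i in comma_positions if parity[i] == 0]

print(comma_positions)  # [11, 16]
print(outside)          # [16]
print(parity[5:7])      # [0, 1]
```

Parity drops to 0 on the first quote of each doubled pair and returns to 1 on the second. Because no byte lies between the two quotes, that transient 0 can never coincide with a delimiter or newline, so only the comma at byte 16 is treated as a field boundary.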

Examples¶

>>> import cupy as cp
>>> from vibespatial.io.csv_gpu import csv_structural_analysis
>>> csv_bytes = b'name,lat,lon\nAlice,40.7,-74.0\nBob,34.0,-118.2\n'
>>> d_bytes = cp.frombuffer(csv_bytes, dtype=cp.uint8)
>>> result = csv_structural_analysis(d_bytes)
>>> result.n_rows
2
>>> result.n_columns
3
>>> result.column_names
['name', 'lat', 'lon']
>>> result.spatial_columns
{'lat': 1, 'lon': 2}

class vibespatial.io.csv_gpu.CsvGpuResult¶

Result of GPU CSV spatial reading.

Attributes¶

geometry (OwnedGeometryArray): Device-resident geometry array. For lat/lon mode, contains Point geometries. For WKT mode, contains whatever geometry types appear in the WKT column.
n_rows (int): Number of data rows read.
attributes (dict[str, list[str]] or None): Non-spatial columns extracted as host-resident string lists. Keys are column names, values are per-row string values. None when there are no non-spatial columns.

geometry: vibespatial.geometry.owned.OwnedGeometryArray¶

n_rows: int¶

attributes: dict[str, list[str]] | None = None¶

vibespatial.io.csv_gpu.read_csv_gpu(d_bytes: cupy.ndarray, *, delimiter: str = ',', lat_col: str | None = None, lon_col: str | None = None, geom_col: str | None = None) → CsvGpuResult¶

Read a CSV file on GPU and extract spatial geometry.

Performs structural analysis followed by spatial column extraction and geometry assembly. Supports two modes:

Lat/lon mode: When lat_col and lon_col are specified (or auto-detected), extracts numeric latitude and longitude columns and assembles them as Point geometries.
WKT/WKB mode: When geom_col is specified (or auto-detected), extracts the geometry column. Hex-encoded WKB is auto-detected and decoded via the GPU WKB pipeline. WKT text is parsed via the GPU WKT parser.

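Geometry parsing and assembly stay on the device. The D->H transfers are the small header row (for column name extraction), the structural metadata in the OwnedGeometryArray (offsets, validity; KB-scale), and, where applicable, the non-spatial attribute columns returned as host-resident strings and the hex-encoded WKB column, which is decoded to binary on the CPU.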

Parameters¶

d_bytes (cp.ndarray): Device-resident uint8 array of raw CSV file bytes, shape (n,).
delimiter (str, default ","): Single-character field delimiter.
lat_col (str or None, default None): Name of the latitude column. If None, auto-detected from header names.
lon_col (str or None, default None): Name of the longitude column. If None, auto-detected from header names.
geom_col (str or None, default None): Name of the geometry column (WKT or hex-encoded WKB). If None, auto-detected from header names. The format (WKT vs hex WKB) is auto-detected from field content.
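
One way to picture content-based format detection is the heuristic below. It is an assumption for illustration, not the module's documented rule: a field is treated as hex WKB if it has even length, contains only hex digits, and begins with a byte-order byte of 00 or 01. Both function names are hypothetical.

```python
import string
import struct

HEX_DIGITS = frozenset(string.hexdigits.encode("ascii"))


def looks_like_hex_wkb(field: bytes) -> bool:
    """Heuristic check: even length, all hex digits, leading byte-order byte."""
    return (
        len(field) >= 10  # byte order (1 byte) + geometry type (4 bytes), in hex
        and len(field) % 2 == 0
        and field[:2] in (b"00", b"01")
        and all(b in HEX_DIGITS for b in field)
    )


def decode_hex_wkb(field: bytes) -> bytes:
    """Host-side hex-to-binary decode."""
    return bytes.fromhex(field.decode("ascii"))


# Little-endian WKB for POINT (1 2): byte order, type 1, x = 1.0, y = 2.0.
wkb_hex = b"01" + b"01000000" + b"000000000000F03F" + b"0000000000000040"

print(looks_like_hex_wkb(wkb_hex))         # True
print(looks_like_hex_wkb(b"POINT (1 2)"))  # False
print(struct.unpack("<BIdd", decode_hex_wkb(wkb_hex)))  # (1, 1, 1.0, 2.0)
```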

Returns¶

CsvGpuResult: Frozen dataclass with geometry (OwnedGeometryArray), n_rows (int), and attributes (dict[str, list[str]] or None).

Raises¶

ValueError: If no spatial columns can be identified (neither a lat/lon pair nor a WKT or hex-encoded WKB geometry column), or if specified column names are not found in the header.

Notes¶

GIS convention: x = longitude, y = latitude. The Point geometry array stores longitude in the x coordinate and latitude in y.
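
The axis order is easy to get backwards when the CSV lists latitude first, so a tiny illustration may help. It uses plain Python lists and WKT strings rather than an OwnedGeometryArray, whose internal layout is not described here.

```python
# Values as they appear in the CSV, where the lat column precedes the lon column.
lats = [40.7, 34.0]
lons = [-74.0, -118.2]

# x comes from longitude, y from latitude.
points_xy = list(zip(lons, lats))
wkt = [f"POINT ({x} {y})" for x, y in points_xy]
print(wkt)  # ['POINT (-74.0 40.7)', 'POINT (-118.2 34.0)']
```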

Examples¶

>>> import cupy as cp
>>> from vibespatial.io.csv_gpu import read_csv_gpu
>>> csv_bytes = b'name,lat,lon\nAlice,40.7,-74.0\nBob,34.0,-118.2\n'
>>> d_bytes = cp.frombuffer(csv_bytes, dtype=cp.uint8)
>>> result = read_csv_gpu(d_bytes)
>>> result.n_rows
2
>>> result.geometry.row_count
2