vibespatial.io.csv_gpu

GPU CSV reader – structural analysis and spatial geometry extraction.

GPU-accelerated CSV reader with two stages:

Stage 1: Structural Analysis (csv_structural_analysis)

Given a device-resident byte array containing a CSV file, identifies:

  1. Quote parity – a CSV-specific quote toggle kernel that emits 1 at each " character without backslash-escape checking (CSV uses "" doubled-quote escaping, not backslash escaping). Doubled quotes naturally cancel in the cumulative-sum parity computation.

  2. Row boundary detection – mark positions of \n characters where quote parity is 0 (outside quoted fields). Handles both \n and \r\n line endings.

  3. Column boundary detection – mark positions of the delimiter character where quote parity is 0. The delimiter is configurable (comma, tab, pipe) and passed as a kernel parameter.

  4. Column count verification – pure CuPy helper that counts delimiters per row and verifies a consistent column count.

  5. Header parsing – CPU-side helper (small data, one-time D->H copy) that splits the first row by delimiter to extract column names and identify spatial columns by name heuristics.
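
The first four steps can be illustrated with a small CPU reference. This is only a sketch under stated assumptions: NumPy stands in for CuPy and for the NVRTC kernels, the helper name structural_scan is hypothetical, and the input is assumed to end with a newline.

```python
import numpy as np

def structural_scan(data: bytes, delimiter: str = ","):
    """CPU reference for steps 1-4 (hypothetical helper, not the module's API)."""
    b = np.frombuffer(data, dtype=np.uint8)
    # Step 1: 1 at every '"' byte; the running sum modulo 2 is the quote parity.
    toggle = (b == ord('"')).astype(np.int64)
    parity = (np.cumsum(toggle) & 1).astype(np.uint8)
    outside = parity == 0
    # Step 2: '\n' outside quotes ends a row. With \r\n the '\r' stays inside
    # the row and has to be trimmed later, during field extraction.
    row_ends = np.flatnonzero((b == ord("\n")) & outside)
    # Step 3: delimiter bytes outside quotes separate columns.
    delims = np.flatnonzero((b == ord(delimiter)) & outside)
    # Step 4: count the delimiters that fall before each row end, then difference.
    before = np.searchsorted(delims, row_ends)
    per_row = np.diff(before, prepend=0)
    if per_row.size and not np.all(per_row == per_row[0]):
        raise ValueError("inconsistent column count across rows")
    n_columns = int(per_row[0]) + 1 if per_row.size else 0
    return parity, row_ends, delims, n_columns

# A quoted comma, doubled quotes, a \r\n ending, and a quoted newline.
sample = b'name,lat,lon\n"Smith, ""Al""",40.7,-74.0\r\n"multi\nline",1,2\n'
parity, row_ends, delims, n_columns = structural_scan(sample)
```

The comma and the newline that sit inside quoted fields are skipped because their parity is 1, so the sample yields three row ends and two delimiters per row.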

Stage 2: Spatial Column Extraction (read_csv_gpu)

Given the structural analysis result, extracts spatial geometry:

  • Lat/lon mode: extracts numeric lat/lon columns, parses with parse_ascii_floats, assembles as Point OwnedGeometryArray.

  • WKT mode: extracts WKT column, concatenates with newline separators, delegates to read_wkt_gpu.

  • WKB mode: extracts hex-encoded WKB column (auto-detected), decodes hex to binary on CPU, delegates to decode_wkb_arrow_array_owned which has a GPU fast path via pylibcudf.
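
The host-side hex step in WKB mode might look roughly like the sketch below. The detection heuristic and both function names are assumptions made for illustration; the module's actual auto-detection rule is not documented on this page, and the real decode is delegated to decode_wkb_arrow_array_owned.

```python
import string
import struct

_HEX = frozenset(string.hexdigits.encode("ascii"))  # byte values of 0-9, a-f, A-F

def looks_like_hex_wkb(field: bytes) -> bool:
    """Hypothetical heuristic: even length, hex digits only, byte-order flag 00 or 01."""
    return (
        len(field) >= 10              # 1 byte-order byte + 4 type bytes = 10 hex chars
        and len(field) % 2 == 0
        and field[:2] in (b"00", b"01")
        and all(c in _HEX for c in field)
    )

def decode_hex_fields(fields):
    """Convert hex text to raw WKB bytes on the host."""
    return [bytes.fromhex(f.decode("ascii")) for f in fields]

# Little-endian WKB Point: byte order 1, geometry type 1, then x and y as doubles.
point_hex = struct.pack("<BIdd", 1, 1, -74.0, 40.7).hex().encode("ascii")
raw = decode_hex_fields([point_hex])[0]
byte_order, geom_type, x, y = struct.unpack("<BIdd", raw)
```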

All structural kernels are integer-only byte classification (no floating-point computation), so no PrecisionPlan is needed per ADR-0002 – same rationale as gpu_parse/structural.py. Coordinate parsing delegates to parse_ascii_floats which always produces fp64 – storage precision is always fp64 per ADR-0002.

Tier classification (ADR-0033):
  • Quote toggle: Tier 1 (custom NVRTC – byte classification)

  • Row end detection: Tier 1 (custom NVRTC – byte + parity check)

  • Delimiter detection: Tier 1 (custom NVRTC – byte + parity check)

  • Parity cumsum: Tier 2 (CuPy cumsum)

  • Position extraction: Tier 2 (CuPy flatnonzero)

  • Column count verification: Tier 2 (CuPy element-wise)

  • Header parsing: CPU (small data, one-time)

  • Field span extraction: Tier 2 (CuPy index arithmetic)

  • Quote stripping: Tier 2 (CuPy element-wise byte comparison)

  • Numeric parsing: delegates to gpu_parse.parse_ascii_floats (Tier 1)

  • WKT field concatenation: Tier 2 (CuPy scatter/copy)

  • WKT parsing: delegates to wkt_gpu.read_wkt_gpu
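
As an illustration of the Tier 2 steps "field span extraction" and "quote stripping", the sketch below derives per-row byte spans for one column purely from index arithmetic. NumPy stands in for CuPy, and the helper names field_spans and trim_cr_and_quotes are hypothetical. The sample contains no quoted delimiters or newlines, so parity masking is skipped for brevity.

```python
import numpy as np

def field_spans(row_ends, delims, n_columns, col):
    """Start (inclusive) and end (exclusive) offsets of column `col` in every row."""
    n_rows = row_ends.size
    # Row r begins one byte after the previous row end; row 0 begins at offset 0.
    row_starts = np.concatenate(([0], row_ends[:-1] + 1))
    # With a verified column count, delimiters form an (n_rows, n_columns - 1) grid.
    grid = delims.reshape(n_rows, n_columns - 1)
    starts = row_starts if col == 0 else grid[:, col - 1] + 1
    ends = row_ends if col == n_columns - 1 else grid[:, col]
    return starts, ends

def trim_cr_and_quotes(b, starts, ends):
    """Drop a trailing '\\r' and one pair of enclosing quotes (no "" unescaping)."""
    starts, ends = starts.copy(), ends.copy()
    last = np.maximum(ends - 1, 0)
    has_cr = (ends > starts) & (b[last] == ord("\r"))
    ends[has_cr] -= 1
    last = np.maximum(ends - 1, 0)
    quoted = (ends - starts >= 2) & (b[starts] == ord('"')) & (b[last] == ord('"'))
    starts[quoted] += 1
    ends[quoted] -= 1
    return starts, ends

sample = b'id,lat,lon\n"a",40.7,-74.0\r\n"b",34.0,-118.2\r\n'
b = np.frombuffer(sample, dtype=np.uint8)
row_ends = np.flatnonzero(b == ord("\n"))
delims = np.flatnonzero(b == ord(","))

def column(col):
    s, e = trim_cr_and_quotes(b, *field_spans(row_ends, delims, 3, col))
    return [sample[i:j] for i, j in zip(s.tolist(), e.tolist())]

id_fields = column(0)
lon_fields = column(2)
```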

Attributes

Classes

CsvStructuralResult

Result of CSV structural analysis.

CsvGpuResult

Result of GPU CSV spatial reading.

Functions

csv_structural_analysis(→ CsvStructuralResult)

Perform GPU-accelerated structural analysis of a CSV file.

read_csv_gpu(→ CsvGpuResult)

Read a CSV file on GPU and extract spatial geometry.

Module Contents

vibespatial.io.csv_gpu.cp = None
vibespatial.io.csv_gpu.KERNEL_PARAM_I64
class vibespatial.io.csv_gpu.CsvStructuralResult

Result of CSV structural analysis.

All device arrays remain on GPU. Only column_names and spatial_columns are host-side Python objects (derived from the small header row).

Attributes

d_row_ends : cp.ndarray

int64 positions of row-ending \n characters, shape (n_rows_total,). Includes the header row end at index 0 if has_header was True.

d_delimiters : cp.ndarray

int64 positions of all delimiter characters outside quotes, shape (n_delimiters_total,).

d_quote_parity : cp.ndarray

uint8 per-byte quote state, shape (n_bytes,). 0 = outside quoted field, 1 = inside.

n_rows : int

Number of data rows (excluding header if present).

n_columns : int

Number of columns (fields per row).

column_names : list[str]

Column names from the header row. Empty list if no header.

spatial_columns : dict[str, int]

Detected spatial columns. Keys are "lat", "lon", or "geom" mapped to 0-based column indices.

d_row_ends: cupy.ndarray
d_delimiters: cupy.ndarray
d_quote_parity: cupy.ndarray
n_rows: int
n_columns: int
column_names: list[str]
spatial_columns: dict[str, int]
vibespatial.io.csv_gpu.csv_structural_analysis(d_bytes: cupy.ndarray, delimiter: str = ',', has_header: bool = True) → CsvStructuralResult

Perform GPU-accelerated structural analysis of a CSV file.

Identifies row boundaries, column boundaries, and header metadata for a device-resident CSV byte stream. All structural scanning runs on the GPU; only the small header row is copied to host for column name extraction.
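
The host-side header step could be sketched as follows. The name sets used for spatial-column detection and the helper names are assumptions for illustration only; the module's actual heuristics are not listed on this page.

```python
import csv
from typing import Optional

# Assumed name sets, for illustration only.
LAT_NAMES = {"lat", "latitude"}
LON_NAMES = {"lon", "lng", "longitude"}
GEOM_NAMES = {"geom", "geometry", "wkt", "wkb"}

def column_names(header_row: Optional[bytes], n_columns: int, delimiter: str = ",") -> list:
    """Split the header row, or synthesize col_0, col_1, ... when there is none."""
    if header_row is None:
        return [f"col_{i}" for i in range(n_columns)]
    text = header_row.decode("utf-8").rstrip("\r\n")
    # csv.reader handles quoted header names containing the delimiter.
    return next(csv.reader([text], delimiter=delimiter))

def detect_spatial(names) -> dict:
    """Map 'lat' / 'lon' / 'geom' to the first matching 0-based column index."""
    found = {}
    for index, name in enumerate(names):
        key = name.strip().lower()
        if key in LAT_NAMES:
            found.setdefault("lat", index)
        elif key in LON_NAMES:
            found.setdefault("lon", index)
        elif key in GEOM_NAMES:
            found.setdefault("geom", index)
    return found

names = column_names(b"name,lat,lon\n", 3)
spatial = detect_spatial(names)
```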

Parameters

d_bytes : cp.ndarray

Device-resident uint8 array of raw CSV file bytes, shape (n,).

delimiter : str, default ","

Single-character field delimiter. Common values: "," (comma), "\t" (tab), "|" (pipe).

has_header : bool, default True

If True, the first row is treated as a header containing column names. If False, columns are named col_0, col_1, etc.

Returns

CsvStructuralResult

Frozen dataclass with device-resident boundary arrays and host-side column metadata.

Raises

ValueError

If the delimiter is not a single ASCII character, or if column counts are inconsistent across rows.

Notes

The CSV quoting rules follow RFC 4180:

  • Fields containing the delimiter, newline, or double-quote are enclosed in double quotes.

  • Literal double quotes inside quoted fields are escaped as "" (two consecutive double-quote characters).

  • Newline characters inside quoted fields are NOT row boundaries.

The quote parity algorithm exploits the fact that "" escaping naturally cancels in a cumulative-sum toggle: each " flips parity, so two consecutive " characters flip it twice, returning to the original state. This is simpler than backslash-escape detection (used by JSON) because CSV has no backslash escaping.
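
A short trace makes the cancellation concrete. This is a pure-Python illustration of the cumulative-sum toggle, not the kernel itself; quote_parity is a hypothetical helper name.

```python
from itertools import accumulate

def quote_parity(text: str) -> list:
    """Running count of '"' characters modulo 2 (1 = inside a quoted field)."""
    toggles = [1 if ch == '"' else 0 for ch in text]
    return [count & 1 for count in accumulate(toggles)]

# The field "a""b" (a quoted value containing one literal quote), then ,c
field = '"a""b",c'
trace = quote_parity(field)
# chars :  "  a  "  "  b  "  ,  c
# parity:  1  1  0  1  1  0  0  0
```

The doubled quote flips parity to 0 and immediately back to 1, so b is still classified as inside the field, while the comma after the closing quote has parity 0 and counts as a real delimiter.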

Examples

>>> import cupy as cp
>>> from vibespatial.io.csv_gpu import csv_structural_analysis
>>> csv_bytes = b'name,lat,lon\nAlice,40.7,-74.0\nBob,34.0,-118.2\n'
>>> d_bytes = cp.frombuffer(csv_bytes, dtype=cp.uint8)
>>> result = csv_structural_analysis(d_bytes)
>>> result.n_rows
2
>>> result.n_columns
3
>>> result.column_names
['name', 'lat', 'lon']
>>> result.spatial_columns
{'lat': 1, 'lon': 2}
class vibespatial.io.csv_gpu.CsvGpuResult

Result of GPU CSV spatial reading.

Attributes

geometry : OwnedGeometryArray

Device-resident geometry array. For lat/lon mode, contains Point geometries. For WKT and WKB modes, contains whatever geometry types were in the geometry column.

n_rows : int

Number of data rows read.

attributes : dict[str, list[str]] or None

Non-spatial columns extracted as host-resident string lists. Keys are column names, values are per-row string values. None when there are no non-spatial columns.

geometry: vibespatial.geometry.owned.OwnedGeometryArray
n_rows: int
attributes: dict[str, list[str]] | None = None
vibespatial.io.csv_gpu.read_csv_gpu(d_bytes: cupy.ndarray, *, delimiter: str = ',', lat_col: str | None = None, lon_col: str | None = None, geom_col: str | None = None) → CsvGpuResult

Read a CSV file on GPU and extract spatial geometry.

Performs structural analysis followed by spatial column extraction and geometry assembly. Supports two modes:

  1. Lat/lon mode: When lat_col and lon_col are specified (or auto-detected), extracts numeric latitude and longitude columns and assembles them as Point geometries.

  2. WKT/WKB mode: When geom_col is specified (or auto-detected), extracts the geometry column. Hex-encoded WKB is auto-detected and decoded via the GPU WKB pipeline. WKT text is parsed via the GPU WKT parser.
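
For the WKT path, the extracted fields are concatenated with newline separators before being handed to the WKT parser. A minimal sketch of that scatter/copy step is shown below, assuming field spans are already known; NumPy stands in for CuPy and concat_fields is a hypothetical helper name.

```python
import numpy as np

def concat_fields(b, starts, ends):
    """Copy each [start, end) span into one buffer, one newline after every field."""
    lengths = ends - starts
    # Output offset of each field: previous lengths plus one separator apiece.
    out_starts = np.concatenate(([0], np.cumsum(lengths + 1)[:-1]))
    out = np.full(int((lengths + 1).sum()), ord("\n"), dtype=np.uint8)
    # For every payload byte: which field it belongs to and its offset within it.
    row = np.repeat(np.arange(lengths.size), lengths)
    first = np.repeat(np.cumsum(lengths) - lengths, lengths)
    within = np.arange(int(lengths.sum())) - first
    out[out_starts[row] + within] = b[starts[row] + within]
    return out

buf = b"xxPOINT(1 2)yyyPOINT(3 4)z"
b = np.frombuffer(buf, dtype=np.uint8)
starts = np.array([2, 15], dtype=np.int64)
ends = np.array([12, 25], dtype=np.int64)
joined = concat_fields(b, starts, ends).tobytes()
```

Pre-filling the output with newline bytes means only the payload bytes need to be scattered; the separator positions are simply left untouched.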

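Structural analysis, field extraction, and geometry assembly run on the device. The D->H transfers are the small header row (for column name extraction) and the structural metadata in the OwnedGeometryArray (offsets, validity – KB-scale). Two steps described elsewhere on this page touch the host: the hex-to-binary decode in WKB mode, and the non-spatial attributes, which are returned as host-resident string lists.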

Parameters

d_bytes : cp.ndarray

Device-resident uint8 array of raw CSV file bytes, shape (n,).

delimiter : str, default ","

Single-character field delimiter.

lat_col : str or None, default None

Name of the latitude column. If None, auto-detected from header names.

lon_col : str or None, default None

Name of the longitude column. If None, auto-detected from header names.

geom_col : str or None, default None

Name of the geometry column (WKT or hex-encoded WKB). If None, auto-detected from header names. The format (WKT vs hex WKB) is auto-detected from field content.

Returns

CsvGpuResult

Frozen dataclass with geometry (OwnedGeometryArray), n_rows (int), and attributes (dict[str, list[str]] or None).

Raises

ValueError

If no spatial columns can be identified (neither a lat/lon pair nor a WKT/WKB geometry column), or if specified column names are not found in the header.

Notes

GIS convention: x = longitude, y = latitude. The Point geometry array stores longitude in the x coordinate and latitude in y.
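
The axis order is easy to get backwards, so here is a tiny illustration of the convention. It uses a plain NumPy (n, 2) array as a stand-in; the internal layout of OwnedGeometryArray is not described on this page and is not implied by this sketch.

```python
import numpy as np

# Parsed columns in CSV order (lat first, lon second), as fp64.
lat = np.array([40.7, 34.0], dtype=np.float64)
lon = np.array([-74.0, -118.2], dtype=np.float64)

# Point coordinates in GIS order: x = longitude, y = latitude.
xy = np.column_stack((lon, lat))
```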

Examples

>>> import cupy as cp
>>> from vibespatial.io.csv_gpu import read_csv_gpu
>>> csv_bytes = b'name,lat,lon\nAlice,40.7,-74.0\nBob,34.0,-118.2\n'
>>> d_bytes = cp.frombuffer(csv_bytes, dtype=cp.uint8)
>>> result = read_csv_gpu(d_bytes)
>>> result.n_rows
2
>>> result.geometry.row_count
2