vibespatial.io.csv_gpu¶
GPU CSV reader – structural analysis and spatial geometry extraction.
GPU-accelerated CSV reader with two stages:
Stage 1: Structural Analysis (csv_structural_analysis)
Given a device-resident byte array containing a CSV file, identifies:
Quote parity – a CSV-specific quote toggle kernel that emits 1 at each " character, without backslash-escape checking (CSV uses "" doubled-quote escaping, not backslash escaping). Doubled quotes naturally cancel in the cumulative-sum parity computation.
Row boundary detection – marks positions of \n characters where quote parity is 0 (outside quoted fields). Handles both \n and \r\n line endings.
Column boundary detection – marks positions of the delimiter character where quote parity is 0. The delimiter is configurable (comma, tab, pipe) and passed as a kernel parameter.
Column count verification – pure CuPy helper that counts delimiters per row and verifies a consistent column count.
Header parsing – CPU-side helper (small data, one-time D->H copy) that splits the first row by delimiter to extract column names and identify spatial columns by name heuristics.
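The structural pass described above can be mimicked on the CPU to see what each output array contains. The following is a minimal pure-Python sketch, not the module's NVRTC kernels: the name `scan_structure` is hypothetical, and `itertools.accumulate` stands in for the CuPy cumulative sum.

```python
from itertools import accumulate

QUOTE = ord('"')
NEWLINE = ord("\n")


def scan_structure(data: bytes, delimiter: str = ","):
    """CPU analogue of the structural pass (hypothetical helper).

    Returns (parity, row_ends, delimiters) for a CSV byte string.
    """
    delim = ord(delimiter)
    # Quote toggle: emit 1 at every '"' byte, 0 elsewhere.
    toggles = [1 if b == QUOTE else 0 for b in data]
    # Inclusive cumulative sum, reduced mod 2, gives per-byte quote parity.
    parity = [s & 1 for s in accumulate(toggles)]
    # A newline or delimiter is structural only where parity is 0.
    row_ends = [i for i, b in enumerate(data) if b == NEWLINE and parity[i] == 0]
    delimiters = [i for i, b in enumerate(data) if b == delim and parity[i] == 0]
    return parity, row_ends, delimiters


# The first row holds a quoted field with an embedded comma and newline.
data = b'a,"x,\ny",c\r\nd,e,f\n'
parity, row_ends, delimiters = scan_structure(data)
```

In this sketch a `\r\n` ending is marked at its `\n` byte; trimming the preceding `\r` from the last field is assumed to happen later, during field extraction.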
Stage 2: Spatial Column Extraction (read_csv_gpu)
Given the structural analysis result, extracts spatial geometry:
Lat/lon mode: extracts numeric lat/lon columns, parses with parse_ascii_floats, assembles as Point OwnedGeometryArray.
WKT mode: extracts WKT column, concatenates with newline separators, delegates to read_wkt_gpu.
WKB mode: extracts hex-encoded WKB column (auto-detected), decodes hex to binary on CPU, delegates to decode_wkb_arrow_array_owned, which has a GPU fast path via pylibcudf.
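To make the WKB mode concrete, here is a small standard-library sketch of detecting a hex-encoded WKB field and decoding it on the CPU. The detection rule and both function names are assumptions for illustration; the module's actual auto-detection heuristic is not documented here, and real decoding is delegated to decode_wkb_arrow_array_owned.

```python
import string
import struct

HEX_DIGITS = set(string.hexdigits.encode("ascii"))


def looks_like_hex_wkb(field: bytes) -> bool:
    """Heuristic check (an assumption, not the module's detector)."""
    # WKB starts with a byte-order flag: 00 (big-endian) or 01 (little-endian),
    # followed by a 4-byte geometry type, so at least 10 hex characters.
    return (
        len(field) >= 10
        and len(field) % 2 == 0
        and field[:2] in (b"00", b"01")
        and all(b in HEX_DIGITS for b in field)
    )


def decode_hex_point(field: bytes):
    """Decode a hex-encoded 2D WKB Point into an (x, y) tuple."""
    raw = bytes.fromhex(field.decode("ascii"))
    order = "<" if raw[0] == 1 else ">"
    (geom_type,) = struct.unpack_from(order + "I", raw, 1)
    if geom_type != 1:
        raise ValueError(f"expected WKB Point (type 1), got {geom_type}")
    return struct.unpack_from(order + "dd", raw, 5)


# Little-endian WKB for POINT (1 2): order flag, type 1, then two fp64 values.
hex_point = b"0101000000" + b"00" * 6 + b"F03F" + b"00" * 7 + b"40"
xy = decode_hex_point(hex_point)
```

A WKT field such as `POINT (1 2)` fails the check immediately because it contains non-hex characters, which is what makes content-based detection cheap.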
All structural kernels are integer-only byte classification (no
floating-point computation), so no PrecisionPlan is needed per ADR-0002
– same rationale as gpu_parse/structural.py. Coordinate parsing
delegates to parse_ascii_floats which always produces fp64 – storage
precision is always fp64 per ADR-0002.
- Tier classification (ADR-0033):
Quote toggle: Tier 1 (custom NVRTC – byte classification)
Row end detection: Tier 1 (custom NVRTC – byte + parity check)
Delimiter detection: Tier 1 (custom NVRTC – byte + parity check)
Parity cumsum: Tier 2 (CuPy cumsum)
Position extraction: Tier 2 (CuPy flatnonzero)
Column count verification: Tier 2 (CuPy element-wise)
Header parsing: CPU (small data, one-time)
Field span extraction: Tier 2 (CuPy index arithmetic)
Quote stripping: Tier 2 (CuPy element-wise byte comparison)
Numeric parsing: delegates to gpu_parse.parse_ascii_floats (Tier 1)
WKT field concatenation: Tier 2 (CuPy scatter/copy)
WKT parsing: delegates to wkt_gpu.read_wkt_gpu
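The "field span extraction" and "quote stripping" steps in the list above reduce to index arithmetic over the two boundary arrays. The sketch below shows one way that arithmetic can work on the CPU; `field_span` and `strip_field` are hypothetical helpers, and the layout assumption (every row owns exactly `n_columns - 1` delimiters) relies on the column count having been verified first.

```python
def field_span(row_ends, delimiters, n_columns, row, col):
    """Return (start, end) byte offsets of one field, end exclusive.

    Assumes every row has exactly n_columns fields, so each row owns
    n_columns - 1 consecutive entries of `delimiters`.
    """
    per_row = n_columns - 1
    row_start = 0 if row == 0 else row_ends[row - 1] + 1
    first = row * per_row  # index of this row's first delimiter
    start = row_start if col == 0 else delimiters[first + col - 1] + 1
    end = row_ends[row] if col == per_row else delimiters[first + col]
    return start, end


def strip_field(data: bytes, start: int, end: int) -> bytes:
    """Drop a trailing carriage return and one pair of outer quotes."""
    field = data[start:end]
    if field.endswith(b"\r"):
        field = field[:-1]
    if len(field) >= 2 and field[:1] == b'"' and field[-1:] == b'"':
        field = field[1:-1]
    return field


data = b'name,lat,lon\r\n"Alice",40.7,-74.0\r\n'
row_ends = [13, 33]          # offsets of the two unquoted \n bytes
delimiters = [4, 8, 21, 26]  # offsets of the four unquoted commas

# Extract all three fields of the first data row (row index 1).
fields = [strip_field(data, *field_span(row_ends, delimiters, 3, 1, c)) for c in range(3)]
```

This sketch only removes the outer quotes; collapsing `""` back to `"` inside a field is left out.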
Attributes¶
Classes¶
CsvStructuralResult | Result of CSV structural analysis.
CsvGpuResult | Result of GPU CSV spatial reading.
Functions¶
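csv_structural_analysis(d_bytes, delimiter, has_header) | Perform GPU-accelerated structural analysis of a CSV file.
read_csv_gpu(d_bytes, *, delimiter, lat_col, lon_col, geom_col) | Read a CSV file on GPU and extract spatial geometry.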
Module Contents¶
- vibespatial.io.csv_gpu.cp = None¶
- vibespatial.io.csv_gpu.KERNEL_PARAM_I64¶
- class vibespatial.io.csv_gpu.CsvStructuralResult¶
Result of CSV structural analysis.
All device arrays remain on GPU. Only column_names and spatial_columns are host-side Python objects (derived from the small header row).
Attributes¶
- d_row_ends : cp.ndarray
int64 positions of row-ending \n characters, shape (n_rows_total,). Includes the header row end at index 0 if has_header was True.
- d_delimiters : cp.ndarray
int64 positions of all delimiter characters outside quotes, shape (n_delimiters_total,).
- d_quote_parity : cp.ndarray
uint8 per-byte quote state, shape (n_bytes,). 0 = outside quoted field, 1 = inside.
- n_rows : int
Number of data rows (excluding header if present).
- n_columns : int
Number of columns (fields per row).
- column_names : list[str]
Column names from the header row. Empty list if no header.
- spatial_columns : dict[str, int]
Detected spatial columns. Keys are "lat", "lon", or "geom", mapped to 0-based column indices.
- d_row_ends: cupy.ndarray¶
- d_delimiters: cupy.ndarray¶
- d_quote_parity: cupy.ndarray¶
- n_rows: int¶
- n_columns: int¶
- column_names: list[str]¶
- spatial_columns: dict[str, int]¶
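As an illustration of how `spatial_columns` could be derived from the header row, the sketch below applies a simple name lookup. The candidate name sets and the function name are assumptions made for this example; the module's actual heuristics are not listed on this page.

```python
# Candidate names are illustrative assumptions, not the module's actual lists.
LAT_NAMES = {"lat", "latitude", "y"}
LON_NAMES = {"lon", "lng", "long", "longitude", "x"}
GEOM_NAMES = {"geom", "geometry", "wkt", "wkb"}

ROLES = (("lat", LAT_NAMES), ("lon", LON_NAMES), ("geom", GEOM_NAMES))


def detect_spatial_columns(column_names):
    """Map "lat" / "lon" / "geom" to the first matching 0-based column index."""
    found = {}
    for index, name in enumerate(column_names):
        key = name.strip().strip('"').lower()
        for role, candidates in ROLES:
            if key in candidates and role not in found:
                found[role] = index
    return found


# The header is small, so a plain host-side split is enough for this sketch.
header = b"name,lat,lon"
column_names = header.decode("utf-8").split(",")
spatial = detect_spatial_columns(column_names)
```

The naive `split(",")` ignores delimiters inside quoted header names; a fuller version would reuse the quote-aware delimiter positions for the first row.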
- vibespatial.io.csv_gpu.csv_structural_analysis(d_bytes: cupy.ndarray, delimiter: str = ',', has_header: bool = True) → CsvStructuralResult¶
Perform GPU-accelerated structural analysis of a CSV file.
Identifies row boundaries, column boundaries, and header metadata for a device-resident CSV byte stream. All structural scanning runs on the GPU; only the small header row is copied to host for column name extraction.
Parameters¶
- d_bytes : cp.ndarray
Device-resident uint8 array of raw CSV file bytes, shape (n,).
- delimiter : str, default ","
Single-character field delimiter. Common values: "," (comma), "\t" (tab), "|" (pipe).
- has_header : bool, default True
If True, the first row is treated as a header containing column names. If False, columns are named col_0, col_1, etc.
Returns¶
- CsvStructuralResult
Frozen dataclass with device-resident boundary arrays and host-side column metadata.
Raises¶
- ValueError
If the delimiter is not a single ASCII character, or if column counts are inconsistent across rows.
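The inconsistent-column-count error can be understood as a per-row delimiter count. Below is a minimal CPU sketch under the assumption that both boundary arrays are sorted lists of byte offsets; `verify_column_count` is a hypothetical name, and the module itself performs this check with CuPy.

```python
from bisect import bisect_left


def verify_column_count(row_ends, delimiters):
    """Return the common column count, or raise ValueError on ragged rows."""
    counts = []
    lo = 0
    for end in row_ends:
        # Index of the first delimiter at or after this row end, i.e. the
        # number of delimiters that come before it.
        hi = bisect_left(delimiters, end)
        counts.append(hi - lo + 1)  # k delimiters separate k + 1 fields
        lo = hi
    if not counts:
        return 0
    expected = counts[0]
    for row, count in enumerate(counts):
        if count != expected:
            raise ValueError(f"row {row} has {count} columns, expected {expected}")
    return expected


# Boundaries for b'name,lat,lon\r\n"Alice",40.7,-74.0\r\n'
n_columns = verify_column_count([13, 33], [4, 8, 21, 26])
```

Because only unquoted delimiters appear in the input, a comma inside a quoted field never inflates a row's count.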
Notes¶
The CSV quoting rules follow RFC 4180:
Fields containing the delimiter, newline, or double-quote are enclosed in double quotes.
Literal double quotes inside quoted fields are escaped as "" (two consecutive double-quote characters).
Newline characters inside quoted fields are NOT row boundaries.
The quote parity algorithm exploits the fact that "" escaping naturally cancels in a cumulative-sum toggle: each " flips parity, so two consecutive " characters flip it twice, returning to the original state. This is simpler than backslash-escape detection (used by JSON) because CSV has no backslash escaping.
Examples¶
>>> import cupy as cp
>>> from vibespatial.io.csv_gpu import csv_structural_analysis
>>> csv_bytes = b'name,lat,lon\nAlice,40.7,-74.0\nBob,34.0,-118.2\n'
>>> d_bytes = cp.frombuffer(csv_bytes, dtype=cp.uint8)
>>> result = csv_structural_analysis(d_bytes)
>>> result.n_rows
2
>>> result.n_columns
3
>>> result.column_names
['name', 'lat', 'lon']
>>> result.spatial_columns
{'lat': 1, 'lon': 2}
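The RFC 4180 quoting rules in the Notes can also be checked on the host with Python's standard-library `csv` module, whose default dialect uses the same doubled-quote escaping. This is only a sanity-check sketch and is unrelated to the GPU code path.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")  # default dialect doubles quotes
writer.writerow(['say "hi"', "x,y", "two\nlines", "plain"])
encoded = buffer.getvalue()

# Every quoted field contributes an even number of '"' characters, so a
# running quote toggle is back at 0 when the row-ending newline is reached.
quote_count = encoded.count('"')

# Reading the text back recovers the original field values.
decoded = next(csv.reader(io.StringIO(encoded)))
```

The encoded row contains one newline inside a quoted field and one at the end; only the second is a row boundary, which is exactly the distinction the parity array captures.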
- class vibespatial.io.csv_gpu.CsvGpuResult¶
Result of GPU CSV spatial reading.
Attributes¶
- geometry : OwnedGeometryArray
Device-resident geometry array. For lat/lon mode, contains Point geometries. For WKT mode, contains whatever types were in the WKT column.
- n_rows : int
Number of data rows read.
- attributes : dict[str, list[str]] or None
Non-spatial columns extracted as host-resident string lists. Keys are column names, values are per-row string values. None when there are no non-spatial columns.
- n_rows: int¶
- attributes: dict[str, list[str]] | None = None¶
- vibespatial.io.csv_gpu.read_csv_gpu(d_bytes: cupy.ndarray, *, delimiter: str = ',', lat_col: str | None = None, lon_col: str | None = None, geom_col: str | None = None) → CsvGpuResult¶
Read a CSV file on GPU and extract spatial geometry.
Performs structural analysis followed by spatial column extraction and geometry assembly. Supports two modes:
Lat/lon mode: When lat_col and lon_col are specified (or auto-detected), extracts numeric latitude and longitude columns and assembles them as Point geometries.
WKT/WKB mode: When geom_col is specified (or auto-detected), extracts the geometry column. Hex-encoded WKB is auto-detected and decoded via the GPU WKB pipeline. WKT text is parsed via the GPU WKT parser.
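For the WKT path, the module overview describes concatenating the extracted fields with newline separators before handing them to read_wkt_gpu. The sketch below shows that packing step on the CPU using a prefix sum over field sizes, which is the same shape of computation a scatter/copy on the device would use. The function name is hypothetical, and terminating every field (including the last) with `\n` is an assumption of this example.

```python
from itertools import accumulate


def concat_with_newlines(fields):
    """Pack byte fields into one newline-separated buffer (CPU sketch).

    Returns the buffer and the start offset of each field within it.
    """
    # Each field occupies its own length plus one separator byte.
    sizes = [len(f) + 1 for f in fields]
    # Exclusive prefix sum: where each field starts in the output buffer.
    starts = list(accumulate(sizes, initial=0))[:-1]
    out = bytearray(sum(sizes))
    for field, start in zip(fields, starts):
        out[start:start + len(field)] = field
        out[start + len(field)] = ord("\n")
    return bytes(out), starts


# Fields are assumed to have had their outer CSV quotes stripped already.
fields = [b"POINT (1 2)", b"LINESTRING (0 0, 1 1)", b"POINT (3 4)"]
wkt_blob, starts = concat_with_newlines(fields)
```

Because each output offset depends only on the prefix sum, every field can be copied independently, which is what makes the step parallel-friendly.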
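Geometry computation is device-resident, with one exception noted in the module overview: hex-encoded WKB is decoded from hex to binary on the CPU before delegation to decode_wkb_arrow_array_owned. Otherwise the only D->H transfers for geometry are the small header row (for column name extraction) and the structural metadata in the OwnedGeometryArray (offsets, validity; KB-scale). Non-spatial attributes, when returned, are host-resident string lists (see CsvGpuResult.attributes).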
Parameters¶
- d_bytes : cp.ndarray
Device-resident uint8 array of raw CSV file bytes, shape (n,).
- delimiter : str, default ","
Single-character field delimiter.
- lat_col : str or None, default None
Name of the latitude column. If None, auto-detected from header names.
- lon_col : str or None, default None
Name of the longitude column. If None, auto-detected from header names.
- geom_col : str or None, default None
Name of the geometry column (WKT or hex-encoded WKB). If None, auto-detected from header names. The format (WKT vs hex WKB) is auto-detected from field content.
Returns¶
- CsvGpuResult
Frozen dataclass with geometry (OwnedGeometryArray), n_rows (int), and attributes (dict[str, list[str]] or None).
Raises¶
- ValueError
If no spatial columns can be identified (neither lat/lon pair nor WKT geometry column), or if specified column names are not found in the header.
Notes¶
GIS convention: x = longitude, y = latitude. The Point geometry array stores longitude in the x coordinate and latitude in y.
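The axis order is easy to get wrong, so here is a tiny CPU sketch of the convention stated above. `assemble_points` is a hypothetical helper; in the module, parsing is delegated to parse_ascii_floats and the result is an OwnedGeometryArray rather than a Python list.

```python
def assemble_points(lat_fields, lon_fields):
    """Parse ASCII lat/lon fields and return (x, y) pairs with x = longitude."""
    if len(lat_fields) != len(lon_fields):
        raise ValueError("lat and lon columns must have the same length")
    # Python floats are IEEE-754 double precision (fp64).
    lats = [float(f.decode("ascii")) for f in lat_fields]
    lons = [float(f.decode("ascii")) for f in lon_fields]
    # Longitude goes first: it is the x coordinate of each Point.
    return list(zip(lons, lats))


# Column order in the CSV is lat, lon; the output order is (lon, lat).
points = assemble_points([b"40.7", b"34.0"], [b"-74.0", b"-118.2"])
```

Note that the CSV column order (lat before lon) does not affect the output: the swap happens by column role, not by position.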
Examples¶
>>> import cupy as cp
>>> from vibespatial.io.csv_gpu import read_csv_gpu
>>> csv_bytes = b'name,lat,lon\nAlice,40.7,-74.0\nBob,34.0,-118.2\n'
>>> d_bytes = cp.frombuffer(csv_bytes, dtype=cp.uint8)
>>> result = read_csv_gpu(d_bytes)
>>> result.n_rows
2
>>> result.geometry.row_count
2