Synthetic Data

Use the synthetic generator for benchmarks, smoke tests, and future regression corpora instead of checked-in external datasets.

Intent

Describe the repo-local synthetic geometry generator, its current output contract, and the verification path for extending it.

Request Signals

  • synthetic data

  • generator

  • benchmark data

  • regression corpus

  • fixture data

  • seeded geometry

  • license-free

Open First

  • docs/testing/synthetic-data.md

  • src/vibespatial/testing/synthetic.py

  • tests/test_synthetic_data.py

  • docs/testing/performance-tiers.md

Verify

  • uv run pytest tests/test_synthetic_data.py

  • uv run python scripts/check_docs.py --check

Risks

  • Large preset sizes can exhaust memory if generated eagerly in broad test runs.

  • Optional GeoParquet export depends on pyarrow and should stay out of default smoke paths.

  • Synthetic shapes can drift away from benchmark policy if dataset families are added ad hoc.

Current Contract

The bootstrap generator currently provides deterministic Shapely-backed datasets for:

  • points: uniform, clustered, grid, along-line

  • lines: random-walk, grid, river

  • polygons: regular-grid, convex-hull, star

  • multi geometries: MultiPoint, MultiLineString, MultiPolygon

  • mixed arrays: configurable point/line/polygon ratios

  • invalid shapes: bowtie-like, duplicate-vertex, repeated-segment, and NaN coordinate cases

Scale presets are defined in SCALE_PRESETS:

  • 1K

  • 10K

  • 100K

  • 1M

  • 10M

Outputs

  • SyntheticDataset.to_geoseries()

  • SyntheticDataset.to_geodataframe()

  • SyntheticDataset.write_geojson(path)

  • SyntheticDataset.write_geoparquet(path) when parquet dependencies exist

The current bootstrap implementation is Shapely-first so benchmarks and tests can start immediately. Owned device-oriented geometry buffers should become the primary output once Phase 2 geometry-buffer work lands.

Pytest Integration

The repo-level synthetic_dataset fixture accepts a SyntheticSpec and returns a generated dataset for point, line, or polygon families. Use it for narrow tests; larger benchmark suites should call the generator directly so they can document scale and distribution choices explicitly.

GPU kernel tests should cover null, empty, and mixed-geometry cases in the same file so early kernels do not accidentally specialize to clean homogeneous inputs. Outside the vendored upstream tree, do not check in external data files under tests/; build repo-local fixtures from this generator instead.

Verification

Use this narrow gate when changing the synthetic generator:

uv run pytest tests/test_synthetic_data.py
uv run pytest tests/test_runtime_harness.py tests/test_geopandas_shim.py
uv run python scripts/check_docs.py --check