Synthetic Data¶
Use the synthetic generator for benchmarks, smoke tests, and future regression corpora instead of checked-in external datasets.
Intent¶
Describe the repo-local synthetic geometry generator, its current output contract, and the verification path for extending it.
Request Signals¶
synthetic data
generator
benchmark data
regression corpus
fixture data
seeded geometry
license-free
Open First¶
docs/testing/synthetic-data.md
src/vibespatial/testing/synthetic.py
tests/test_synthetic_data.py
docs/testing/performance-tiers.md
Verify¶
uv run pytest tests/test_synthetic_data.py
uv run python scripts/check_docs.py --check
Risks¶
Large preset sizes can exhaust memory if generated eagerly in broad test runs.
Optional GeoParquet export depends on pyarrow and should stay out of default smoke paths.
Synthetic shapes can drift away from benchmark policy if dataset families are added ad hoc.
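One way to limit the memory risk noted above is to yield coordinates in fixed-size chunks instead of materializing a large preset at once. The sketch below uses only the standard library. `iter_point_chunks` and its parameters are hypothetical and are not part of `src/vibespatial/testing/synthetic.py`.

```python
import random
from typing import Iterator


def iter_point_chunks(
    total: int, chunk_size: int, seed: int = 0
) -> Iterator[list[tuple[float, float]]]:
    """Yield seeded uniform points in chunks (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    rng = random.Random(seed)  # local RNG so global random state is untouched
    remaining = total
    while remaining > 0:
        n = min(chunk_size, remaining)
        # Only one chunk of coordinates is alive at a time.
        yield [(rng.random(), rng.random()) for _ in range(n)]
        remaining -= n


# Count points while holding at most 1,000 of them in memory.
count = sum(len(chunk) for chunk in iter_point_chunks(10_000, 1_000, seed=42))
```

Because the generator owns a single seeded `random.Random`, the concatenated chunks depend only on `seed` and `total`, not on when the caller consumes them.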
Current Contract¶
The bootstrap generator currently provides deterministic Shapely-backed datasets for:
points: uniform, clustered, grid, along-line
lines: random-walk, grid, river
polygons: regular-grid, convex-hull, star
multi geometries: MultiPoint, MultiLineString, MultiPolygon
mixed arrays: configurable point/line/polygon ratios
invalid shapes: bowtie-like, duplicate-vertex, repeated-segment, and NaN coordinate cases
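To illustrate what two of the invalid families above mean geometrically, the sketch below builds a bowtie-like ring and a ring with a NaN coordinate as plain coordinate tuples and checks them without Shapely. The helper names are hypothetical, and the generator's own construction may differ.

```python
import math

Point = tuple[float, float]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Twice the signed area of triangle abc; the sign gives the turn direction."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if the two segments properly cross at an interior point."""
    d1, d2 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    d3, d4 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def has_nan(ring: list[Point]) -> bool:
    """True if any coordinate in the ring is NaN."""
    return any(math.isnan(x) or math.isnan(y) for x, y in ring)


# Bowtie: visiting the corners in this order makes edge 0-1 cross edge 2-3.
bowtie = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
nan_ring = [(0.0, 0.0), (1.0, 0.0), (float("nan"), 1.0), (0.0, 0.0)]

bowtie_self_intersects = segments_cross(bowtie[0], bowtie[1], bowtie[2], bowtie[3])
```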
Scale presets are defined in SCALE_PRESETS:
1K
10K
100K
1M
10M
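Determinism at a given scale can be pictured as follows. This is a minimal sketch, not the repository's implementation: `EXAMPLE_SCALE_PRESETS` mirrors the labels above under the assumption that each label maps to that many rows, and `clustered_points` is a hypothetical stand-in for the clustered point family.

```python
import random

# Assumed mapping from preset label to row count; the real SCALE_PRESETS may differ.
EXAMPLE_SCALE_PRESETS = {
    "1K": 1_000,
    "10K": 10_000,
    "100K": 100_000,
    "1M": 1_000_000,
    "10M": 10_000_000,
}


def clustered_points(n, seed, centers=4, spread=0.05):
    """Return n (x, y) tuples scattered around seeded cluster centers."""
    rng = random.Random(seed)  # all randomness flows from this one seed
    hubs = [(rng.random(), rng.random()) for _ in range(centers)]
    points = []
    for _ in range(n):
        cx, cy = rng.choice(hubs)
        points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
    return points


a = clustered_points(EXAMPLE_SCALE_PRESETS["1K"], seed=7)
b = clustered_points(EXAMPLE_SCALE_PRESETS["1K"], seed=7)
same_seed_matches = a == b  # identical seed and size reproduce the dataset
```

Keeping the seed and the preset label together in a benchmark's name or metadata makes a run reproducible from its description alone.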
Outputs¶
SyntheticDataset.to_geoseries()
SyntheticDataset.to_geodataframe()
SyntheticDataset.write_geojson(path)
SyntheticDataset.write_geoparquet(path) when parquet dependencies exist
The current bootstrap implementation is Shapely-first so benchmarks and tests can start immediately. Owned device-oriented geometry buffers should become the primary output once Phase 2 geometry-buffer work lands.
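Since GeoParquet export is only available when parquet dependencies exist, callers may want to check for `pyarrow` before choosing an output format. The sketch below shows one way to do that with the standard library. `has_module` and `choose_export_format` are hypothetical helpers, and the fallback policy is an assumption rather than documented behavior.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def choose_export_format(preferred: str = "geoparquet") -> str:
    """Fall back to GeoJSON when pyarrow is unavailable (assumed policy)."""
    if preferred == "geoparquet" and not has_module("pyarrow"):
        return "geojson"
    return preferred


export_format = choose_export_format()
```

Inside a pytest suite, `pytest.importorskip("pyarrow")` expresses the same guard by skipping the test when the module is missing, which keeps parquet coverage out of default smoke paths.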
Pytest Integration¶
The repo-level synthetic_dataset fixture accepts a SyntheticSpec and
returns a generated dataset for point, line, or polygon families. Use it
for narrow tests; larger benchmark suites should call the generator directly so
they can document scale and distribution choices explicitly.
GPU kernel tests should cover null, empty, and mixed-geometry cases in the same
file so early kernels do not accidentally specialize to clean homogeneous
inputs. Outside the vendored upstream tree, do not check in external data files
under tests/; build repo-local fixtures from this generator instead.
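The coverage rule above can be checked mechanically. The sketch below represents test inputs as WKT strings, with `None` standing for a null geometry, and reports which of the required categories a case list is missing. `missing_edge_cases` is a hypothetical helper, not an existing fixture in this repository.

```python
from typing import Optional


def geometry_type(wkt: str) -> str:
    """Leading WKT keyword, e.g. 'POINT' for 'POINT (0 0)'."""
    return wkt.strip().split("(")[0].split()[0].upper()


def missing_edge_cases(cases: list[Optional[str]]) -> set[str]:
    """Return which of {'null', 'empty', 'mixed'} the case list does not cover."""
    missing = set()
    if not any(c is None for c in cases):
        missing.add("null")
    present = [c for c in cases if c is not None]
    if not any(c.strip().upper().endswith("EMPTY") for c in present):
        missing.add("empty")
    # "mixed" means at least two distinct geometry types appear together.
    if len({geometry_type(c) for c in present}) < 2:
        missing.add("mixed")
    return missing


clean_only = ["POINT (0 0)", "POINT (1 1)"]
robust = ["POINT (0 0)", None, "POLYGON EMPTY", "LINESTRING (0 0, 1 1)"]
```

Here `missing_edge_cases(clean_only)` flags all three categories, while `missing_edge_cases(robust)` returns an empty set. A check like this could run once per kernel test file to catch suites that only exercise clean homogeneous inputs.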
Verification¶
Use this narrow gate when changing the synthetic generator:
uv run pytest tests/test_synthetic_data.py
uv run pytest tests/test_runtime_harness.py tests/test_geopandas_shim.py
uv run python scripts/check_docs.py --check