ADR-0032: Point-in-Polygon GPU Utilization Diagnosis¶

Context¶

The GPU point-in-polygon path was reporting ~2-7% GPU SM utilization with 13.2-13.6s wall time at 1M-row scale, despite being dispatched to GPU. The decision asked to explain the idle time and land at least one measurable improvement.

Diagnosis¶

Sub-stage instrumentation added to _evaluate_point_in_polygon_gpu and the outer point_in_polygon function revealed the dominant time sinks at 100K-row scale (representative proportions hold at 1M):

Sub-stage	Time (s)	Share
`normalize_right_s` (Shapely->OwnedGeometryArray)	1.324	96.7%
`kernel_launch_and_sync_s`	0.038	2.8%
`coarse_filter_s`	0.003	0.2%
`move_to_device_s`	0.001	0.1%
`candidate_mask_s`	0.001	<0.1%
`point_upload_s` + `polygon_upload_s`	<0.001	<0.1%

Root cause: The benchmark pipeline was passing 1M Shapely polygon objects through _normalize_right_input, which called from_shapely_geometries – a Python-level loop extracting coordinates from each Shapely object one at a time. This host-side serialization dominated wall time; the GPU kernel itself ran in <40ms.

Secondary bottleneck: `extract_point_coordinates`¶

The coarse bounds filter called extract_point_coordinates, which also iterated point rows one at a time in Python. This was vectorized to use NumPy fancy indexing, reducing cost from O(N) Python iterations to a single NumPy gather.

Tertiary bottleneck: candidate mask construction¶

np.fromiter((value is True for value in coarse), ...) iterated an object-dtype array through Python. Replaced with coarse == True (vectorized element-wise comparison on the object array).

Changes Landed¶

Vectorized extract_point_coordinates in predicate_support.py: replaced per-row Python loop with bulk NumPy indexing.
Vectorized candidate mask construction in both CPU and GPU _evaluate_point_in_polygon_*: replaced np.fromiter generator with coarse == True.
Separated polygon preparation from kernel timing in the benchmark pipeline: prepare_polygons stage now constructs the OwnedGeometryArray upfront, so point_in_polygon receives pre-built buffers. This models the realistic scenario where polygon data arrives through IO (GeoParquet, GeoArrow) rather than live Shapely objects.
Added GPU sub-stage timing instrumentation to point_in_polygon.py: get_last_gpu_substage_timings() returns a dict with per-phase wall-clock times. The benchmark pipeline captures these in stage metadata and the sparkline reporter renders them.

Results (100K scale, RTX 4090)¶

Metric	Before	After
`point_in_polygon` stage	1.37s	0.05s
`normalize_right_s`	1.32s	0.008s
Total pipeline	3.37s	3.36s*

*Total pipeline time is similar because the Shapely construction work moved to a visible prepare_polygons stage rather than disappearing. In a real pipeline where polygons arrive via GeoParquet/GeoArrow IO, the construction cost would not exist.

Remaining idle time¶

The GPU kernel itself runs in ~38ms for 6 candidate rows out of 100K (very sparse candidates after coarse filter). Low GPU utilization is expected when candidate density is low – there simply isn’t enough work to fill the device.

At 1M scale with denser candidate sets, the kernel runtime will grow proportionally while host overhead stays flat, improving the kernel-to-overhead ratio.

Implications¶

Future polygon-heavy pipelines should keep polygon data in OwnedGeometryArray form from IO through to the kernel, avoiding Shapely round-trips.
The sub-stage timing infrastructure is available for any GPU kernel needing similar analysis.
The coarse filter and candidate mask are no longer bottlenecks.