ADR-0032: Point-in-Polygon GPU Utilization Diagnosis

Context

The GPU point-in-polygon path was reporting ~2-7% GPU SM utilization with 13.2-13.6s wall time at 1M-row scale, despite being dispatched to GPU. The decision asked to explain the idle time and land at least one measurable improvement.

Diagnosis

Sub-stage instrumentation added to _evaluate_point_in_polygon_gpu and the outer point_in_polygon function revealed the dominant time sinks at 100K-row scale (representative proportions hold at 1M):

Sub-stage

Time (s)

Share

normalize_right_s (Shapely->OwnedGeometryArray)

1.324

96.7%

kernel_launch_and_sync_s

0.038

2.8%

coarse_filter_s

0.003

0.2%

move_to_device_s

0.001

0.1%

candidate_mask_s

0.001

<0.1%

point_upload_s + polygon_upload_s

<0.001

<0.1%

Root cause: The benchmark pipeline was passing 1M Shapely polygon objects through _normalize_right_input, which called from_shapely_geometries – a Python-level loop extracting coordinates from each Shapely object one at a time. This host-side serialization dominated wall time; the GPU kernel itself ran in <40ms.

Secondary bottleneck: extract_point_coordinates

The coarse bounds filter called extract_point_coordinates, which also iterated point rows one at a time in Python. This was vectorized to use NumPy fancy indexing, reducing cost from O(N) Python iterations to a single NumPy gather.

Tertiary bottleneck: candidate mask construction

np.fromiter((value is True for value in coarse), ...) iterated an object-dtype array through Python. Replaced with coarse == True (vectorized element-wise comparison on the object array).

Changes Landed

  1. Vectorized extract_point_coordinates in predicate_support.py: replaced per-row Python loop with bulk NumPy indexing.

  2. Vectorized candidate mask construction in both CPU and GPU _evaluate_point_in_polygon_*: replaced np.fromiter generator with coarse == True.

  3. Separated polygon preparation from kernel timing in the benchmark pipeline: prepare_polygons stage now constructs the OwnedGeometryArray upfront, so point_in_polygon receives pre-built buffers. This models the realistic scenario where polygon data arrives through IO (GeoParquet, GeoArrow) rather than live Shapely objects.

  4. Added GPU sub-stage timing instrumentation to point_in_polygon.py: get_last_gpu_substage_timings() returns a dict with per-phase wall-clock times. The benchmark pipeline captures these in stage metadata and the sparkline reporter renders them.

Results (100K scale, RTX 4090)

Metric

Before

After

point_in_polygon stage

1.37s

0.05s

normalize_right_s

1.32s

0.008s

Total pipeline

3.37s

3.36s*

*Total pipeline time is similar because the Shapely construction work moved to a visible prepare_polygons stage rather than disappearing. In a real pipeline where polygons arrive via GeoParquet/GeoArrow IO, the construction cost would not exist.

Remaining idle time

The GPU kernel itself runs in ~38ms for 6 candidate rows out of 100K (very sparse candidates after coarse filter). Low GPU utilization is expected when candidate density is low – there simply isn’t enough work to fill the device.

At 1M scale with denser candidate sets, the kernel runtime will grow proportionally while host overhead stays flat, improving the kernel-to-overhead ratio.

Implications

  • Future polygon-heavy pipelines should keep polygon data in OwnedGeometryArray form from IO through to the kernel, avoiding Shapely round-trips.

  • The sub-stage timing infrastructure is available for any GPU kernel needing similar analysis.

  • The coarse filter and candidate mask are no longer bottlenecks.