ADR-0032: Point-in-Polygon GPU Utilization Diagnosis¶
Context¶
The GPU point-in-polygon path was reporting ~2-7% GPU SM utilization with 13.2-13.6s wall time at 1M-row scale, despite being dispatched to GPU. The decision asked to explain the idle time and land at least one measurable improvement.
Diagnosis¶
Sub-stage instrumentation added to _evaluate_point_in_polygon_gpu and
the outer point_in_polygon function revealed the dominant time sinks
at 100K-row scale (representative proportions hold at 1M):
Sub-stage |
Time (s) |
Share |
|---|---|---|
|
1.324 |
96.7% |
|
0.038 |
2.8% |
|
0.003 |
0.2% |
|
0.001 |
0.1% |
|
0.001 |
<0.1% |
|
<0.001 |
<0.1% |
Root cause: The benchmark pipeline was passing 1M Shapely polygon
objects through _normalize_right_input, which called
from_shapely_geometries – a Python-level loop extracting coordinates
from each Shapely object one at a time. This host-side serialization
dominated wall time; the GPU kernel itself ran in <40ms.
Secondary bottleneck: extract_point_coordinates¶
The coarse bounds filter called extract_point_coordinates, which also
iterated point rows one at a time in Python. This was vectorized to use
NumPy fancy indexing, reducing cost from O(N) Python iterations to a
single NumPy gather.
Tertiary bottleneck: candidate mask construction¶
np.fromiter((value is True for value in coarse), ...) iterated an
object-dtype array through Python. Replaced with coarse == True
(vectorized element-wise comparison on the object array).
Changes Landed¶
Vectorized
extract_point_coordinatesinpredicate_support.py: replaced per-row Python loop with bulk NumPy indexing.Vectorized candidate mask construction in both CPU and GPU
_evaluate_point_in_polygon_*: replacednp.fromitergenerator withcoarse == True.Separated polygon preparation from kernel timing in the benchmark pipeline:
prepare_polygonsstage now constructs the OwnedGeometryArray upfront, sopoint_in_polygonreceives pre-built buffers. This models the realistic scenario where polygon data arrives through IO (GeoParquet, GeoArrow) rather than live Shapely objects.Added GPU sub-stage timing instrumentation to
point_in_polygon.py:get_last_gpu_substage_timings()returns a dict with per-phase wall-clock times. The benchmark pipeline captures these in stage metadata and the sparkline reporter renders them.
Results (100K scale, RTX 4090)¶
Metric |
Before |
After |
|---|---|---|
|
1.37s |
0.05s |
|
1.32s |
0.008s |
Total pipeline |
3.37s |
3.36s* |
*Total pipeline time is similar because the Shapely construction work
moved to a visible prepare_polygons stage rather than disappearing.
In a real pipeline where polygons arrive via GeoParquet/GeoArrow IO,
the construction cost would not exist.
Remaining idle time¶
The GPU kernel itself runs in ~38ms for 6 candidate rows out of 100K (very sparse candidates after coarse filter). Low GPU utilization is expected when candidate density is low – there simply isn’t enough work to fill the device.
At 1M scale with denser candidate sets, the kernel runtime will grow proportionally while host overhead stays flat, improving the kernel-to-overhead ratio.
Implications¶
Future polygon-heavy pipelines should keep polygon data in OwnedGeometryArray form from IO through to the kernel, avoiding Shapely round-trips.
The sub-stage timing infrastructure is available for any GPU kernel needing similar analysis.
The coarse filter and candidate mask are no longer bottlenecks.