Benchmarking CUDA, CuPy, and Triton Kernel Optimizations for 3D Point Cloud Segmentation: An Empirical Comparison of Latency, Memory Efficiency, and GPU Utilization

Yuhan Li; Mingzhuo Yu

doi:10.69987/JACS.2026.60503

Authors

Yuhan Li, Computer Science, Northeastern University, MA, USA
Mingzhuo Yu, Computer Science, Northeastern University, MA, USA


Keywords:

GPU kernel optimization, point cloud segmentation primitives, CUDA benchmarking, parallel computing performance

Abstract

Real-time 3D point cloud segmentation pipelines depend on several latency-sensitive GPU primitives, yet the practical performance tradeoffs among different GPU programming abstractions remain insufficiently quantified for these kernels. This paper presents a controlled empirical benchmark of three implementation pathways (native CUDA kernels, CuPy-accelerated vectorized routines, and Triton kernels) across three representative segmentation primitives: farthest point sampling (FPS), ball query neighborhood search, and sparse convolution. Experiments are conducted on NVIDIA A100 and RTX 4090 GPUs using SemanticKITTI and ScanNet to cover both outdoor LiDAR and indoor RGB-D spatial statistics. Performance is evaluated in terms of kernel execution latency, peak GPU memory consumption, achieved memory-bandwidth utilization, and streaming multiprocessor occupancy, with profiling data collected via NVIDIA Nsight Compute. Additional ablations examine the impact of shared memory tiling, kernel fusion, and AoS-versus-SoA data layout transformations. Results show that hand-tuned CUDA kernels deliver the lowest latency on irregular-access operations such as ball query (1.87 ms at 120K input points and 4096 query centers on A100), while Triton remains competitive on regular compute patterns such as sparse convolution with substantially less implementation code. CuPy performs well for rapid prototyping on FPS and sparse convolution, but drops to roughly 55% of CUDA performance on irregular ball query because its practical high-level formulation relies on chunked dense distance blocks rather than the explicit spatial indexing used in the lower-level implementations. These findings provide quantitative guidance for selecting a GPU implementation pathway under different performance, development-effort, and deployment constraints.
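
To make the phrase "chunked dense distance blocks" concrete, the following is a minimal illustrative sketch of how a high-level array formulation of ball query might look. It is not the implementation benchmarked in this paper. It is written with NumPy as a CPU stand-in for an array library such as CuPy, and the function name `ball_query_chunked`, its parameters, the default `chunk_size`, and the `-1` padding convention are all assumptions made for illustration.

```python
import numpy as np


def ball_query_chunked(points, centers, radius, max_neighbors, chunk_size=256):
    """Hypothetical dense ball query.

    points:  (N, 3) array of input points
    centers: (M, 3) array of query centers
    Returns an (M, max_neighbors) int64 array holding, for each center, the
    lowest-index points within `radius`, padded with -1.
    """
    num_points = points.shape[0]
    num_centers = centers.shape[0]
    k = min(max_neighbors, num_points)
    out = np.full((num_centers, max_neighbors), -1, dtype=np.int64)
    r2 = radius * radius
    p2 = np.sum(points * points, axis=1)                  # (N,)

    for start in range(0, num_centers, chunk_size):
        block = centers[start:start + chunk_size]         # (C, 3)
        c2 = np.sum(block * block, axis=1)                # (C,)
        # Dense (C, N) block of squared distances: every center in the chunk
        # is compared against every point, with no spatial index.
        d2 = c2[:, None] + p2[None, :] - 2.0 * (block @ points.T)
        inside = d2 <= r2                                 # (C, N) boolean mask
        # A stable argsort on the negated mask moves in-radius points to the
        # front of each row while preserving their original index order.
        order = np.argsort(~inside, axis=1, kind="stable")[:, :k]
        valid = np.take_along_axis(inside, order, axis=1)
        out[start:start + block.shape[0], :k] = np.where(valid, order, -1)
    return out
```

In this sketch the work per chunk scales with C × N regardless of how many points actually fall inside each ball, and the chunk size exists only to bound the temporary distance block: with 120K points and a chunk of 256 centers, a single float32 block already occupies about 123 MB. A kernel that uses explicit spatial indexing can skip most of those comparisons, which is consistent with the gap on ball query described in the abstract.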