Faster sorting with SIMD CUDA intrinsics (2024)

The discussion centers on CUDA warp-level synchronization and exchange primitives and questions the common perception of what counts as SIMD in CUDA. It emphasizes that most CUDA SIMD intrinsics operate on 8- or 16-bit values packed into a 32-bit word, which limits their usefulness outside specific applications such as video processing. Commenters are skeptical that the new DPX instructions on the Hopper architecture will have much performance impact, and they are curious how GPU sorting implementations compare with traditional CPU radix sort algorithms.
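To make the idea of a warp-level exchange primitive concrete, here is a minimal sketch of a bitonic sort over the 32 values held by one warp, using `__shfl_xor_sync` to swap registers between lanes. It is not taken from the linked article. The names `warp_bitonic_sort`, `sort_one_warp`, and `sort32_on_gpu` are hypothetical, and the sketch assumes a single fully active warp of 32 threads.

```cuda
#include <cuda_runtime.h>

// Sorts 32 values, one per lane, so that lane 0 ends up with the smallest.
// Assumes all 32 lanes of the warp are active (full mask).
__device__ unsigned warp_bitonic_sort(unsigned v)
{
    const unsigned lane = threadIdx.x & 31u;
    for (unsigned k = 2; k <= 32; k <<= 1) {          // size of the bitonic blocks
        for (unsigned j = k >> 1; j > 0; j >>= 1) {   // partner distance
            // Read the partner lane's value directly from its register.
            unsigned other = __shfl_xor_sync(0xffffffffu, v, j);
            bool ascending = (lane & k) == 0;  // sort direction of this block
            bool lower     = (lane & j) == 0;  // this lane is the lower partner
            // The lower lane keeps the min in an ascending block, the max otherwise.
            v = (ascending == lower) ? min(v, other) : max(v, other);
        }
    }
    return v;
}

__global__ void sort_one_warp(unsigned *data)
{
    data[threadIdx.x] = warp_bitonic_sort(data[threadIdx.x]);
}

// Hypothetical host helper: sorts 32 values in place on the GPU.
// Returns false if any CUDA call fails (for example, no device is present).
bool sort32_on_gpu(unsigned *host)
{
    unsigned *dev = nullptr;
    const size_t bytes = 32 * sizeof(unsigned);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;
    bool ok = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        sort_one_warp<<<1, 32>>>(dev);   // exactly one warp
        ok = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dev);
    return ok;
}
```

The whole network runs in registers with no shared memory and no `__syncthreads()`, which is the appeal of the shuffle-based approach. The packed SIMD intrinsics mentioned above (for example `__vminu4` and `__vmaxu4`, which take the per-byte minimum and maximum of two 32-bit words) could replace the `min`/`max` step only when the keys are 8 or 16 bits wide.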
