This post covers implementing long convolutions as polynomial multiplications, using the Fast Fourier Transform (FFT) to obtain large speedups on parallel architectures. It highlights the overlap-add and overlap-save methods, which break a long convolution into smaller block-wise FFT convolutions that map well onto GPUs. It also points to cuFFTDx, NVIDIA's library of device-side FFT primitives that can be called directly inside CUDA kernels, and to tcFFT, which uses Tensor Cores to raise FFT throughput, reflecting a broader trend in GPU-accelerated digital signal processing.
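To make the core idea concrete, here is a minimal CPU-side sketch (not the post's GPU implementation) of convolution via frequency-domain polynomial multiplication, plus an overlap-add variant that processes a long signal in blocks. The function names `fft_convolve` and `overlap_add_convolve` and the block size are illustrative choices, not anything defined in the post.

```python
import numpy as np

def fft_convolve(x, h):
    """Full linear convolution of x and h via pointwise multiplication in
    the frequency domain (equivalent to multiplying two polynomials)."""
    n = len(x) + len(h) - 1               # length of the linear convolution
    nfft = 1 << (n - 1).bit_length()      # next power of two for the FFT
    X = np.fft.rfft(x, nfft)
    H = np.fft.rfft(h, nfft)
    return np.fft.irfft(X * H, nfft)[:n]  # back to time domain, trim padding

def overlap_add_convolve(x, h, block=4096):
    """Overlap-add: split the long signal into blocks, convolve each block
    with the (short) kernel via FFT, and accumulate the overlapping tails."""
    n = len(x) + len(h) - 1
    y = np.zeros(n)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        y_seg = fft_convolve(seg, h)
        y[start:start + len(y_seg)] += y_seg  # tails overlap and add up
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(100_000)   # long input signal
    h = rng.standard_normal(1_024)     # convolution kernel
    y_fft = overlap_add_convolve(x, h)
    y_ref = np.convolve(x, h)          # direct reference implementation
    print(np.allclose(y_fft, y_ref))   # True, up to floating-point error
```

On a GPU, the same block-wise structure is what makes overlap-add and overlap-save attractive: each block's FFT, pointwise multiply, and inverse FFT can run independently, which is where device-level FFT libraries such as cuFFTDx or Tensor-Core-based approaches like tcFFT come in.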