DeepSeek has open-sourced DeepGEMM, a library of optimized FP8 GEMM (general matrix multiplication) kernels. One notable optimization came from comparing the SASS (NVIDIA's native GPU assembly) emitted by NVCC compiler versions 12.2 and 12.3, where an interleaved pattern of flipped instruction bits was observed. Applying similar modifications to specific bits in FFMA (fused multiply-add) instructions, particularly bits that affect warp-level parallelism, produced performance gains of more than 10% in some cases. The change creates more opportunities to overlap matrix-multiply instructions with the FFMA instructions that accumulate their results, which raises throughput for FP8 GEMM kernels. The release reflects a broader trend toward FP8 precision for AI workloads in machine learning and deep learning frameworks.
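To make the idea of an interleaved bit modification concrete, here is a minimal Python sketch. It is not DeepGEMM's actual script: the bit position, the opcode test, and the reading of "interleaved" as "every other matching instruction" are all assumptions chosen for illustration, and the names `flip_bit_interleaved` and `looks_like_ffma` are hypothetical. The only hardware fact relied on is that recent NVIDIA GPUs use fixed-width 128-bit instruction words.

```python
from typing import Callable, List

INSTRUCTION_BITS = 128   # fixed-width instruction words on recent NVIDIA GPUs
ASSUMED_CONTROL_BIT = 5  # placeholder position, NOT the real SASS encoding


def flip_bit_interleaved(
    words: List[int],
    is_target: Callable[[int], bool],
    bit: int = ASSUMED_CONTROL_BIT,
) -> List[int]:
    """Return a copy of `words` with `bit` toggled on every other target instruction."""
    if not 0 <= bit < INSTRUCTION_BITS:
        raise ValueError("bit index out of range")
    mask = 1 << bit
    patched = []
    seen = 0  # number of target instructions encountered so far
    for word in words:
        if is_target(word):
            if seen % 2 == 0:  # toggle the 1st, 3rd, 5th, ... target instruction
                word ^= mask
            seen += 1
        patched.append(word)
    return patched


def looks_like_ffma(word: int) -> bool:
    # Hypothetical opcode test (low 4 bits equal 0x3); real decoding differs.
    return word & 0xF == 0x3


# Toy "program": four words match the hypothetical FFMA test, one (0x07) does not.
program = [0x03, 0x13, 0x07, 0x23, 0x33]
patched = flip_bit_interleaved(program, looks_like_ffma)
print([hex(w) for w in patched])
```

Because the toggle is an XOR and the chosen bit does not overlap the bits used by the opcode test, running the function a second time on its own output restores the original words. A real patcher would also have to locate the instruction stream inside the compiled binary and decode opcodes properly, which this sketch leaves out.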