deepseek-ai/DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepGEMM is a CUDA library for efficient FP8 matrix multiplications (GEMMs) with fine-grained scaling, supporting both normal and Mixture-of-Experts (MoE) GEMMs. The library is lightweight: it compiles its kernels at runtime, is optimized for Hopper tensor cores, and keeps its core kernel to roughly 300 lines while matching or exceeding the performance of expert-tuned libraries.
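
To make "fine-grained scaling" concrete, the following pure-Python sketch quantizes a vector in blocks, with one scale factor per block, and applies those scales when accumulating a dot product. It is an illustration only, not DeepGEMM's API or kernel: the block size of 128, the helper names (`round_to_e4m3`, `quantize_blocks`, `scaled_dot`), and the simplified E4M3 rounding (no subnormals, no NaN handling) are all assumptions made for this example.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def round_to_e4m3(x: float) -> float:
    """Roughly emulate E4M3 rounding (hypothetical helper, simplified).

    Keeps 4 significant bits (1 implicit + 3 explicit mantissa bits) and
    clamps to +/-448. Subnormals and the minimum exponent are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= |m| < 1
    q = round(m * 16) / 16        # quantize the mantissa to multiples of 1/16
    y = math.ldexp(q, e)
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, y))


def quantize_blocks(values, block=128):
    """Quantize `values` with one scale per `block` consecutive elements.

    Returns (quantized, scales); original value ~= quantized * scale.
    The block size of 128 is an assumption for this sketch.
    """
    quantized, scales = [], []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        amax = max(abs(v) for v in chunk)
        # Map the largest magnitude in the block onto the FP8 maximum.
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(round_to_e4m3(v / scale) for v in chunk)
    return quantized, scales


def scaled_dot(qa, sa, qb, sb, block=128):
    """Dot product of two block-quantized vectors.

    Each block's partial sum is accumulated in higher precision and then
    multiplied by the two per-block scales before being added to the total.
    """
    total = 0.0
    for i, (scale_a, scale_b) in enumerate(zip(sa, sb)):
        lo = i * block
        partial = sum(x * y for x, y in zip(qa[lo:lo + block], qb[lo:lo + block]))
        total += partial * scale_a * scale_b
    return total
```

Because each block carries its own scale, an outlier in one block does not reduce the precision available to values in the other blocks, which is the motivation for scaling at a finer granularity than one factor per tensor.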