GPU Computing
DeepGEMM is a CUDA library offering efficient FP8 matrix multiplications with fine-grained scaling, supporting both normal and Mixture-of-Experts (MoE) grouped GEMMs. The lightweight library matches or exceeds the performance of expert-tuned libraries, featuring runtime compilation and Hopper tensor core optimization, while keeping the core kernel to roughly 300 lines.
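To make "fine-grained scaling" concrete, here is a PyTorch-only sketch (not DeepGEMM's API) that gives every 128-element block its own FP8 scale and shows the reference math such a scaled FP8 GEMM computes; the block size and helper names are purely illustrative.

```python
import torch  # requires a recent PyTorch with torch.float8_e4m3fn support

def quantize_fp8_blockwise(x: torch.Tensor, block: int = 128):
    """Illustrative fine-grained quantization: split the last dimension into
    `block`-wide groups and keep one FP32 scale per group, so quantization
    error stays local to each block instead of one scale per whole tensor."""
    g = x.view(*x.shape[:-1], -1, block)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 448.0  # e4m3 max ≈ 448
    q = (g / scale).to(torch.float8_e4m3fn)
    return q.view_as(x), scale.squeeze(-1)

def dequantize_fp8_blockwise(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
    g = q.to(torch.float32).view(*q.shape[:-1], -1, block)
    return (g * scale.unsqueeze(-1)).view_as(q)

a = torch.randn(64, 256)
b = torch.randn(128, 256)  # "NT" layout: B stored as (n, k)
aq, a_scale = quantize_fp8_blockwise(a)
bq, b_scale = quantize_fp8_blockwise(b)

# Reference of what a scaled FP8 GEMM produces: dequantize, multiply, accumulate.
out = dequantize_fp8_blockwise(aq, a_scale) @ dequantize_fp8_blockwise(bq, b_scale).T
print((out - a @ b.T).abs().max())  # small error thanks to per-block scales
```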
DeepEP is a communication library optimized for Mixture-of-Experts (MoE) and expert parallelism, providing high-throughput, low-latency all-to-all GPU kernels for MoE dispatch and combine. The library supports both intranode and internode communication, offering specialized kernels for asymmetric-domain bandwidth forwarding (e.g., NVLink domain to RDMA domain) and low-latency inference decoding, with support for FP8 and RDMA networks.
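For context, the dispatch/combine pattern that DeepEP's kernels accelerate across GPUs and nodes looks roughly like this single-device PyTorch sketch; everything here is illustrative and none of it is DeepEP's API.

```python
import torch

# Toy MoE routing on one device: "dispatch" tokens to their top-k experts,
# run the experts, then "combine" the outputs weighted by the gate scores.
num_tokens, hidden, num_experts, top_k = 8, 16, 4, 2
x = torch.randn(num_tokens, hidden)
gate_logits = torch.randn(num_tokens, num_experts)
weights, expert_ids = gate_logits.softmax(-1).topk(top_k, dim=-1)

# Dispatch: route one copy of each token to each of its selected experts.
flat_experts = expert_ids.flatten()              # (num_tokens * top_k,)
flat_tokens = x.repeat_interleave(top_k, dim=0)  # matching token copies
expert_out = torch.empty_like(flat_tokens)
for e in range(num_experts):
    mask = flat_experts == e
    expert_out[mask] = flat_tokens[mask] * (e + 1)  # stand-in for expert FFN e

# Combine: weight each expert output by its gate score and sum per token.
combined = (expert_out.view(num_tokens, top_k, hidden)
            * weights.unsqueeze(-1)).sum(dim=1)
print(combined.shape)  # (num_tokens, hidden)
```

In expert-parallel training or inference, the dispatch and combine steps become all-to-all exchanges between GPUs holding different experts, which is the traffic DeepEP's NVLink and RDMA kernels are built to move efficiently.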
FlashMLA is a high-performance MLA (Multi-head Latent Attention) decoding kernel optimized for Hopper GPUs, reaching up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound ones. The implementation supports BF16 and a paged KV cache, and requires CUDA 12.3+ and PyTorch 2.0+.
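A decode-time call might look roughly like the sketch below, loosely following the usage pattern shown in the repository's README; the tensor shapes, block size, and head counts here are assumptions and may need adjusting for a real deployment.

```python
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Assumed decode setup: one query token per step, paged KV cache with 64-token blocks.
b, s_q, h_q, h_kv = 4, 1, 128, 1
d, dv, block_size, num_blocks = 576, 512, 64, 256

q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
kvcache = torch.randn(num_blocks, block_size, h_kv, d, dtype=torch.bfloat16, device="cuda")
block_table = torch.arange(b * 8, dtype=torch.int32, device="cuda").view(b, 8)  # logical->physical pages
cache_seqlens = torch.full((b,), 500, dtype=torch.int32, device="cuda")

# Plan the work split across SMs once per step, then reuse it for every layer.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)
out, lse = flash_mla_with_kvcache(q, kvcache, block_table, cache_seqlens, dv,
                                  tile_scheduler_metadata, num_splits, causal=True)
```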
xAI's Grok 3 matches or exceeds models from established labs such as OpenAI and Google DeepMind, a result that reinforces the 'Bitter Lesson': general methods that scale with compute tend to beat hand-tuned algorithmic cleverness. The shift in emphasis from pre-training to post-training has leveled the playing field for newcomers while underscoring how critical GPU access has become.
A detailed account of Fly.io's venture into GPU infrastructure reveals challenges in meeting market demands, as developers primarily seek LLM APIs rather than raw GPU access. Despite significant investment in GPU machines and security measures, the project faced technical hurdles with Nvidia drivers and virtualization, while market trends shifted towards API-based AI solutions.
NVIDIA engineers used the DeepSeek-R1 model with inference-time scaling to automatically generate optimized GPU attention kernels, in some cases surpassing human-engineered solutions. The experiment shows how a model can spend extra compute at inference time to generate and verify many candidate kernels and select the best one for a complex programming task.
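The underlying loop is simple to sketch: sample many candidate kernels, filter them with a verifier, and keep the fastest. The Python below is a hypothetical outline of that generate-verify-select loop; the callables stand in for model sampling, compilation plus correctness checks, and benchmarking, and are not NVIDIA's actual tooling.

```python
import time
from typing import Callable, Optional, Tuple

def generate_best_kernel(
    prompt: str,
    propose: Callable[[str], str],             # e.g. a sampling call to a reasoning model
    compile_and_check: Callable[[str], bool],  # build the kernel and run numerical checks
    benchmark: Callable[[str], float],         # measured latency in milliseconds
    num_candidates: int = 32,
    time_budget_s: float = 900.0,
) -> Tuple[Optional[str], float]:
    """Sample candidate kernels, keep only those that compile and pass the
    verifier, and return the fastest one found within the compute budget."""
    best_code, best_latency = None, float("inf")
    deadline = time.monotonic() + time_budget_s
    for _ in range(num_candidates):
        if time.monotonic() > deadline:        # more inference compute = more attempts
            break
        code = propose(prompt)
        if not compile_and_check(code):        # discard unbuildable or incorrect kernels
            continue
        latency = benchmark(code)
        if latency < best_latency:             # keep the fastest correct candidate so far
            best_code, best_latency = code, latency
    return best_code, best_latency
```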