Writing an LLM from scratch, part 8 -- trainable self-attention

A detailed explanation of implementing trainable self-attention in LLMs, focusing on scaled dot product attention and matrix projections. The article breaks down how attention scores are calculated through query, key, and value matrices, demonstrating how five matrix multiplications can efficiently process token relationships.

GitHub - deepseek-ai/DeepGEMM: DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

DeepGEMM is a CUDA library offering efficient FP8 matrix multiplications with fine-grained scaling, supporting both normal and Mix-of-Experts GEMMs. The lightweight library matches or exceeds performance of expert-tuned libraries, featuring runtime compilation and Hopper tensor core optimization, while maintaining a simple ~300-line core kernel.

Matrix Operations

Writing an LLM from scratch, part 8 -- trainable self-attention

GitHub - deepseek-ai/DeepGEMM: DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling