Deep Learning

GitHub - deepseek-ai/profile-data: Analyze computation-communication overlap in V3/R1.

DeepSeek shares detailed profiling data from its training and inference framework to show how communication is overlapped with computation, with traces captured using the PyTorch Profiler. The profiled configurations include EP64/TP1 for DualPipe training with MoE layers and EP32/TP1 for prefilling, and the traces illustrate balanced expert routing and micro-batch-based overlap.

RoboPianist: Dexterous Piano Playing with Deep Reinforcement Learning

Researchers developed a deep reinforcement learning system that trains anthropomorphic robot hands to play piano, using the MuJoCo physics engine and MIDI files for simulation. The system achieves high performance by incorporating human fingering patterns and energy optimization, improving significantly over baseline methods with an average F1 score of 0.79 across test pieces.

GitHub - deepseek-ai/DeepGEMM: DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

DeepGEMM is a CUDA library offering efficient FP8 matrix multiplications with fine-grained scaling, supporting both normal and Mixture-of-Experts (MoE) grouped GEMMs. The lightweight library matches or exceeds the performance of expert-tuned libraries, featuring runtime compilation and Hopper tensor core optimization, while keeping its core kernel to roughly 300 lines.
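
To make "fine-grained scaling" concrete, here is a minimal pure-Python sketch of the general idea: each small block of values gets its own scale factor, so one outlier cannot wipe out the precision of values elsewhere in the tensor. This is not DeepGEMM's code or API. The function names and the block size of 4 are hypothetical, and integer rounding is used as a stand-in for real FP8 rounding; only the E4M3 maximum of 448 is taken from the FP8 format itself.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def quantize_blocks(values, block_size):
    """Quantize a flat list using one scale factor per block (hypothetical helper)."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Map the block's largest magnitude onto the FP8 range.
        scale = FP8_E4M3_MAX / amax if amax > 0 else 1.0
        scales.append(scale)
        # Assumption: round() to an integer grid stands in for FP8 rounding.
        codes.extend(round(v * scale) for v in block)
    return codes, scales


def roundtrip(values, block_size):
    """Quantize, then dequantize, to expose the error introduced by scaling."""
    codes, scales = quantize_blocks(values, block_size)
    return [code / scales[i // block_size] for i, code in enumerate(codes)]


# Four small values followed by a block that contains a large outlier.
data = [0.5, -0.25, 0.9, 0.1, 1000.0, 3.0, -7.0, 2.0]

restored_tensor = roundtrip(data, block_size=len(data))  # one scale for everything
restored_block = roundtrip(data, block_size=4)           # one scale per 4 values

print("per-tensor:", restored_tensor[:4])
print("per-block: ", restored_block[:4])
```

With a single per-tensor scale, the outlier forces the four small values to round to zero; with per-block scales they survive almost unchanged, which is the motivation for scaling at a finer granularity.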

GitHub - therealoliver/Deepdive-llama3-from-scratch: Achieve the llama3 inference step-by-step, grasp the core concepts, master the process derivation, implement the code.

A comprehensive guide detailing the implementation of Llama3 from scratch, covering model architecture, attention mechanisms, and optimization techniques like KV-Cache, with detailed code explanations and mathematical derivations.
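
As a rough illustration of the KV-Cache idea mentioned above, the sketch below caches the key and value vectors of earlier tokens so that each decoding step only attends from the newest query instead of recomputing everything. It is not taken from the linked repository: the `KVCache` class and `decode_step` function are hypothetical names, a single attention head is assumed, and the query/key/value vectors are given directly rather than produced by a real model's projections and positional encoding.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]


class KVCache:
    """Stores the keys and values of all tokens seen so far (hypothetical class)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step(cache, query, key, value):
    """Add the new token's key/value to the cache, then attend with its query."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)


# Toy per-token vectors (assumed inputs; a real model would compute these).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
cached_outputs = [decode_step(cache, q, k, v) for q, k, v in zip(queries, keys, values)]

# Reference: recompute attention over the full prefix at every step.
full_outputs = [
    attend(queries[t], keys[: t + 1], values[: t + 1]) for t in range(len(queries))
]
```

Both paths produce the same outputs; the cached version simply avoids rebuilding the keys and values of earlier tokens at every step, trading memory for compute.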