2025-02-17

GitHub - deepseek-ai/DeepEP: DeepEP: an efficient expert-parallel communication library

DeepEP is a communication library optimized for Mixture-of-Experts (MoE) and expert parallelism, providing high-throughput, low-latency all-to-all GPU kernels for MoE dispatch and combine. The library supports both intranode and internode communication, offering specialized kernels for asymmetric-domain bandwidth forwarding and low-latency kernels for inference decoding, with support for FP8 and RDMA networks.
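
For orientation, the sketch below shows the dispatch/combine communication pattern that such kernels accelerate, written with plain torch.distributed collectives. This is not DeepEP's API; the function and variable names are illustrative, and the library replaces these generic all-to-all calls with fused NVLink/RDMA kernels.

```python
import torch
import torch.distributed as dist

def dispatch_and_combine(tokens: torch.Tensor,       # (num_tokens, hidden)
                         expert_rank: torch.Tensor,  # (num_tokens,) destination rank per token
                         world_size: int) -> torch.Tensor:
    """Send each token to the rank hosting its routed expert, then return results."""
    # Group tokens by destination rank so each rank's slice is contiguous.
    order = torch.argsort(expert_rank)
    send = tokens[order].contiguous()
    send_counts = torch.bincount(expert_rank, minlength=world_size)

    # Exchange counts so every rank knows how many tokens it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # Dispatch: all-to-all moves each token to the rank that owns its expert.
    recv = send.new_empty((int(recv_counts.sum()), tokens.shape[1]))
    dist.all_to_all_single(recv, send,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())

    processed = recv  # placeholder: the local expert FFNs would run here

    # Combine: the reverse all-to-all returns expert outputs to the source ranks.
    back = torch.empty_like(send)
    dist.all_to_all_single(back, processed,
                           output_split_sizes=send_counts.tolist(),
                           input_split_sizes=recv_counts.tolist())

    # Undo the sort so outputs line up with the original token order.
    out = torch.empty_like(tokens)
    out[order] = back
    return out
```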

Related articles

Netboot Windows 11 with iSCSI and iPXE

An in-depth guide demonstrates how to netboot Windows 11 using iSCSI and iPXE, enabling Windows to run from a NAS instead of local storage. The solution allows gaming on Windows while maintaining Linux as the primary OS, providing a practical workaround for AAA games that restrict virtual machine usage.

GitHub - deepseek-ai/profile-data: Analyze computation-communication overlap in V3/R1.

Detailed profiling data from DeepSeek's V3/R1 training and inference framework is shared, highlighting communication-computation overlap strategies with PyTorch Profiler visualizations. The framework implements DualPipe with MoE layers across different configurations, including EP64/TP1 for training and EP32/TP1 for prefilling, demonstrating balanced routing and micro-batch optimization techniques.
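
The repository ships pre-recorded traces viewable in Chrome's tracing tools; the snippet below is only a generic sketch of how such PyTorch Profiler traces are captured and exported, with a hypothetical toy model standing in for a real DualPipe/MoE step.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical stand-ins for a real training step.
model = torch.nn.Linear(16, 16)
batches = [torch.randn(8, 16) for _ in range(5)]

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)  # capture GPU kernels too

with profile(activities=activities, record_shapes=True) as prof:
    for batch in batches:
        loss = model(batch).sum()
        loss.backward()

# The exported Chrome trace can be opened in chrome://tracing or Perfetto to
# check whether communication kernels overlap with expert computation.
prof.export_chrome_trace("trace.json")
```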

The FFT Strikes Back: An Efficient Alternative to Self-Attention

FFTNet introduces a novel approach to sequence processing using Fast Fourier Transform, achieving O(n log n) complexity compared to traditional self-attention's quadratic complexity. The framework employs spectral filtering and modReLU activation to efficiently capture long-range dependencies, demonstrating superior performance on Long Range Arena and ImageNet benchmarks.
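
As a rough illustration of the idea (not the paper's architecture), the toy layer below mixes tokens in the frequency domain: an FFT along the sequence axis, a learnable elementwise spectral filter, a modReLU-style magnitude gate, and an inverse FFT, giving O(n log n) token mixing. All names, shapes, and initializations here are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class SpectralMixer(nn.Module):
    """Toy FFT-based token mixer: O(n log n) in sequence length."""
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        n_freq = seq_len // 2 + 1
        # Learnable complex spectral filter, stored as (real, imag) parts.
        self.filter = nn.Parameter(torch.randn(n_freq, dim, 2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_freq, dim))  # modReLU bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, dim)
        f = torch.fft.rfft(x, dim=1)                 # to frequency domain along tokens
        f = f * torch.view_as_complex(self.filter)   # spectral filtering
        # modReLU-style gate: scale each complex coefficient by its magnitude.
        mag = f.abs()
        f = f * (torch.relu(mag + self.bias) / (mag + 1e-8))
        return torch.fft.irfft(f, n=x.shape[1], dim=1)  # back to token domain

mixer = SpectralMixer(seq_len=128, dim=64)
out = mixer(torch.randn(2, 128, 64))  # (batch, seq_len, dim) in and out
```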

GitHub - Hawzen/hdp: What would happen if we didn't use TCP or UDP?

An experiment explores the feasibility of creating and transmitting custom network protocols across different operating systems and the internet, revealing significant challenges with OS compatibility and network infrastructure limitations. Results demonstrate that while custom protocols can work locally, they face major obstacles when traversing NAT gateways, firewalls, and cloud providers, ultimately suggesting TCP/UDP remain the most practical choices.
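
To make the setup concrete, here is a minimal raw-socket sketch of the kind of experiment described: a payload carried over a non-TCP/non-UDP IP protocol number. It is a generic illustration rather than the hdp implementation, requires root privileges, and uses protocol number 253, which is reserved for experimentation (RFC 3692).

```python
import socket

EXPERIMENTAL_PROTO = 253  # IANA: reserved for experimentation and testing

def send_custom(dest_ip: str, payload: bytes) -> None:
    # SOCK_RAW with a non-TCP/UDP protocol: the kernel builds the IP header,
    # and the payload rides directly on top of IP.
    with socket.socket(socket.AF_INET, socket.SOCK_RAW, EXPERIMENTAL_PROTO) as s:
        s.sendto(payload, (dest_ip, 0))

def receive_custom(bufsize: int = 65535) -> bytes:
    # The receiver sees the full IP datagram, header included, and must parse it.
    with socket.socket(socket.AF_INET, socket.SOCK_RAW, EXPERIMENTAL_PROTO) as s:
        packet, _addr = s.recvfrom(bufsize)
        ihl = (packet[0] & 0x0F) * 4   # IP header length in bytes
        return packet[ihl:]            # strip the header to recover the payload

# NAT gateways, firewalls, and many cloud networks drop unknown protocol
# numbers, which is exactly the obstacle such experiments run into.
```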

Rob Ricci (@ricci@discuss.systems)

A Mastodon server dedicated to computer systems research and professional discussions, focusing on operating systems, distributed systems, networks, and databases within the fediverse ecosystem.

GitHub - deepseek-ai/FlashMLA

FlashMLA is a high-performance Multi-head Latent Attention (MLA) decoding kernel optimized for Hopper GPUs, achieving up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound scenarios. The implementation supports BF16 and a paged KV cache, and requires CUDA 12.3+ and PyTorch 2.0+.
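
For context on the paged KV cache, the sketch below shows how a block-table layout maps a sequence's logical token positions onto fixed-size physical blocks. The names and the 64-token block size are illustrative assumptions, not FlashMLA's interface; the actual kernel performs this gather fused with attention on the GPU instead of materializing a contiguous tensor as done here.

```python
import torch

BLOCK_SIZE = 64  # tokens per physical cache block (assumed)

def gather_kv(kv_blocks: torch.Tensor,     # (num_blocks, BLOCK_SIZE, heads, dim)
              block_table: torch.Tensor,   # (blocks_for_seq,) physical block ids
              seq_len: int) -> torch.Tensor:
    """Reassemble one sequence's contiguous KV tensor from its paged blocks."""
    parts = []
    for pos, block_id in enumerate(block_table.tolist()):
        start = pos * BLOCK_SIZE
        take = min(BLOCK_SIZE, seq_len - start)
        if take <= 0:
            break
        parts.append(kv_blocks[block_id, :take])
    return torch.cat(parts, dim=0)         # (seq_len, heads, dim)
```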

OpenBSD Innovations

A comprehensive chronicle of OpenBSD's software innovations and security features, detailing the project's significant contributions to operating system security, including privilege separation, ASLR, stack protection, and numerous system hardening measures.