2025-02-17

GitHub - deepseek-ai/DeepEP: DeepEP: an efficient expert-parallel communication library

DeepEP is a communication library optimized for Mixture-of-Experts (MoE) and expert parallelism, providing high-throughput, low-latency all-to-all GPU kernels for MoE dispatch and combine. The library supports both intranode and internode communication, offering specialized kernels for asymmetric-domain bandwidth forwarding and low-latency kernels for inference decoding, with support for FP8 and RDMA networks.
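
For orientation, the sketch below shows the dispatch/combine communication pattern that such kernels accelerate, written with plain torch.distributed collectives. This is not DeepEP's API; the function and variable names are illustrative, and the library replaces these generic all-to-all calls with fused NVLink/RDMA kernels.

```python
import torch
import torch.distributed as dist

def dispatch_and_combine(tokens: torch.Tensor,       # (num_tokens, hidden)
                         expert_rank: torch.Tensor,  # (num_tokens,) destination rank per token
                         world_size: int) -> torch.Tensor:
    """Send each token to the rank hosting its routed expert, then return results."""
    # Group tokens by destination rank so each rank's slice is contiguous.
    order = torch.argsort(expert_rank)
    send = tokens[order].contiguous()
    send_counts = torch.bincount(expert_rank, minlength=world_size)

    # Exchange counts so every rank knows how many tokens it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # Dispatch: all-to-all moves each token to the rank that owns its expert.
    recv = send.new_empty((int(recv_counts.sum()), tokens.shape[1]))
    dist.all_to_all_single(recv, send,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())

    processed = recv  # placeholder: the local expert FFNs would run here

    # Combine: the reverse all-to-all returns expert outputs to the source ranks.
    back = torch.empty_like(send)
    dist.all_to_all_single(back, processed,
                           output_split_sizes=send_counts.tolist(),
                           input_split_sizes=recv_counts.tolist())

    # Undo the sort so outputs line up with the original token order.
    out = torch.empty_like(tokens)
    out[order] = back
    return out
```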

Related articles

Netboot Windows 11 with iSCSI and iPXE

An in-depth guide demonstrates how to netboot Windows 11 using iSCSI and iPXE, enabling Windows to run from a NAS instead of local storage. The solution allows gaming on Windows while maintaining Linux as the primary OS, providing a practical workaround for AAA games that restrict virtual machine usage.

GitHub - deepseek-ai/profile-data: Analyze computation-communication overlap in V3/R1.

Detailed profiling data from DeepSeek's V3/R1 training and inference framework is shared, highlighting communication-computation overlap strategies with PyTorch Profiler visualizations. The framework implements DualPipe with MoE layers across different configurations, including EP64/TP1 for training and EP32/TP1 for prefilling, demonstrating balanced routing and micro-batch optimization techniques.
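
The repository ships pre-recorded traces viewable in Chrome's tracing tools; the snippet below is only a generic sketch of how such PyTorch Profiler traces are captured and exported, with a hypothetical toy model standing in for a real DualPipe/MoE step.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical stand-ins for a real training step.
model = torch.nn.Linear(16, 16)
batches = [torch.randn(8, 16) for _ in range(5)]

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)  # capture GPU kernels too

with profile(activities=activities, record_shapes=True) as prof:
    for batch in batches:
        loss = model(batch).sum()
        loss.backward()

# The exported Chrome trace can be opened in chrome://tracing or Perfetto to
# check whether communication kernels overlap with expert computation.
prof.export_chrome_trace("trace.json")
```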

The FFT Strikes Back: An Efficient Alternative to Self-Attention

FFTNet introduces a novel approach to sequence processing using Fast Fourier Transform, achieving O(n log n) complexity compared to traditional self-attention's quadratic complexity. The framework employs spectral filtering and modReLU activation to efficiently capture long-range dependencies, demonstrating superior performance on Long Range Arena and ImageNet benchmarks.
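
As a rough illustration of the idea (not the paper's architecture), the toy layer below mixes tokens in the frequency domain: an FFT along the sequence axis, a learnable elementwise spectral filter, a modReLU-style magnitude gate, and an inverse FFT, giving O(n log n) token mixing. All names, shapes, and initializations here are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class SpectralMixer(nn.Module):
    """Toy FFT-based token mixer: O(n log n) in sequence length."""
    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        n_freq = seq_len // 2 + 1
        # Learnable complex spectral filter, stored as (real, imag) parts.
        self.filter = nn.Parameter(torch.randn(n_freq, dim, 2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_freq, dim))  # modReLU bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, dim)
        f = torch.fft.rfft(x, dim=1)                 # to frequency domain along tokens
        f = f * torch.view_as_complex(self.filter)   # spectral filtering
        # modReLU-style gate: scale each complex coefficient by its magnitude.
        mag = f.abs()
        f = f * (torch.relu(mag + self.bias) / (mag + 1e-8))
        return torch.fft.irfft(f, n=x.shape[1], dim=1)  # back to token domain

mixer = SpectralMixer(seq_len=128, dim=64)
out = mixer(torch.randn(2, 128, 64))  # (batch, seq_len, dim) in and out
```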

GitHub - Hawzen/hdp: What would happen if we didn't use TCP or UDP?

An experiment explores the feasibility of creating and transmitting custom network protocols across different operating systems and the internet, revealing significant challenges with OS compatibility and network infrastructure limitations. Results demonstrate that while custom protocols can work locally, they face major obstacles when traversing NAT gateways, firewalls, and cloud providers, ultimately suggesting TCP/UDP remain the most practical choices.
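
To make the setup concrete, here is a minimal raw-socket sketch of the kind of experiment described: a payload carried over a non-TCP/non-UDP IP protocol number. It is a generic illustration rather than the hdp implementation, requires root privileges, and uses protocol number 253, which is reserved for experimentation (RFC 3692).

```python
import socket

EXPERIMENTAL_PROTO = 253  # IANA: reserved for experimentation and testing

def send_custom(dest_ip: str, payload: bytes) -> None:
    # SOCK_RAW with a non-TCP/UDP protocol: the kernel builds the IP header,
    # and the payload rides directly on top of IP.
    with socket.socket(socket.AF_INET, socket.SOCK_RAW, EXPERIMENTAL_PROTO) as s:
        s.sendto(payload, (dest_ip, 0))

def receive_custom(bufsize: int = 65535) -> bytes:
    # The receiver sees the full IP datagram, header included, and must parse it.
    with socket.socket(socket.AF_INET, socket.SOCK_RAW, EXPERIMENTAL_PROTO) as s:
        packet, _addr = s.recvfrom(bufsize)
        ihl = (packet[0] & 0x0F) * 4   # IP header length in bytes
        return packet[ihl:]            # strip the header to recover the payload

# NAT gateways, firewalls, and many cloud networks drop unknown protocol
# numbers, which is exactly the obstacle such experiments run into.
```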

Rob Ricci (@ricci@discuss.systems)

A Mastodon server dedicated to computer systems research and professional discussions, focusing on operating systems, distributed systems, networks, and databases within the fediverse ecosystem.

GitHub - deepseek-ai/FlashMLA

FlashMLA is a high-performance Multi-head Latent Attention (MLA) decoding kernel optimized for Hopper GPUs, achieving up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound scenarios. The implementation supports BF16 and a paged KV cache, and requires CUDA 12.3+ and PyTorch 2.0+.
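
For context on the paged KV cache, the sketch below shows how a block-table layout maps a sequence's logical token positions onto fixed-size physical blocks. The names and the 64-token block size are illustrative assumptions, not FlashMLA's interface; the actual kernel performs this gather fused with attention on the GPU instead of materializing a contiguous tensor as done here.

```python
import torch

BLOCK_SIZE = 64  # tokens per physical cache block (assumed)

def gather_kv(kv_blocks: torch.Tensor,     # (num_blocks, BLOCK_SIZE, heads, dim)
              block_table: torch.Tensor,   # (blocks_for_seq,) physical block ids
              seq_len: int) -> torch.Tensor:
    """Reassemble one sequence's contiguous KV tensor from its paged blocks."""
    parts = []
    for pos, block_id in enumerate(block_table.tolist()):
        start = pos * BLOCK_SIZE
        take = min(BLOCK_SIZE, seq_len - start)
        if take <= 0:
            break
        parts.append(kv_blocks[block_id, :take])
    return torch.cat(parts, dim=0)         # (seq_len, heads, dim)
```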

OpenBSD Innovations

A comprehensive chronicle of OpenBSD's software innovations and security features, detailing the project's significant contributions to operating system security, including privilege separation, ASLR, stack protection, and numerous system hardening measures.