An in-depth guide demonstrates how to netboot Windows 11 using iSCSI and iPXE, enabling Windows to run from a NAS instead of local storage. The solution allows gaming on Windows while maintaining Linux as the primary OS, providing a practical workaround for AAA games that restrict virtual machine usage.
Fire-Flyer File System (3FS) is a high-performance distributed file system optimized for AI workloads, featuring strong consistency and a disaggregated architecture. The system reaches an aggregate read throughput of 6.6 TiB/s across 180 storage nodes, while supporting diverse workloads from data preparation to inference caching.
Detailed profiling data from a training and inference framework is shared, highlighting communication-computation overlap strategies with PyTorch Profiler visualizations. The framework implements DualPipe with MoE layers across different configurations, including EP64/TP1 for training and EP32/TP1 for prefilling, demonstrating balanced routing and micro-batch optimization techniques.
FFTNet introduces a novel approach to sequence processing using the Fast Fourier Transform, achieving O(n log n) complexity versus the O(n²) complexity of standard self-attention. The framework employs spectral filtering and a modReLU activation to capture long-range dependencies efficiently, with reported gains on the Long Range Arena and ImageNet benchmarks.
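To make the FFT-based mixing idea concrete, here is a minimal NumPy sketch, assuming a fixed (non-learned) filter and a scalar bias. The function names `mod_relu` and `spectral_mix` are hypothetical and are not taken from FFTNet; in the real model the filter would be learned.

```python
import numpy as np

def mod_relu(z, bias):
    """modReLU: apply a ReLU to the magnitude of a complex value, keep its phase."""
    mag = np.abs(z)
    # Small epsilon avoids division by zero for zero-magnitude entries.
    scale = np.maximum(mag + bias, 0.0) / (mag + 1e-8)
    return z * scale

def spectral_mix(x, filt, bias):
    """Mix a real sequence x of shape (n, d) along the sequence axis.

    filt is a hypothetical per-frequency filter of shape (n, d).
    """
    X = np.fft.fft(x, axis=0)          # to the frequency domain, O(n log n) per channel
    Y = mod_relu(X * filt, bias)       # filter, then nonlinearity on the spectrum
    return np.fft.ifft(Y, axis=0).real # back to the sequence domain
```

With an all-ones filter and zero bias the function returns its input (up to rounding), which is a handy sanity check before plugging in a non-trivial filter.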
DeepGEMM is a CUDA library offering efficient FP8 matrix multiplications with fine-grained scaling, supporting both normal and Mixture-of-Experts (MoE) GEMMs. The lightweight library matches or exceeds the performance of expert-tuned libraries, featuring runtime compilation and Hopper tensor core optimization, while maintaining a simple ~300-line core kernel.
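The idea behind fine-grained scaling can be sketched in NumPy: rather than one scale per matrix, each small block along the reduction dimension gets its own scale, so a single outlier only costs precision within its block. This is an illustration only and not DeepGEMM's kernel or API. The block size of 128, the helper names, and the use of `np.round` as a stand-in for a real FP8 cast are all assumptions.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format

def quantize_blocks(x, block):
    """Scale each run of `block` values along the last axis independently.

    Returns (q, scales). np.round stands in for a real FP8 cast (assumption).
    The last axis length must be a multiple of `block`.
    """
    rows, k = x.shape
    xb = x.reshape(rows, k // block, block)
    amax = np.abs(xb).max(axis=-1, keepdims=True)
    scales = np.where(amax > 0, amax / FP8_E4M3_MAX, 1.0)
    return np.round(xb / scales), scales

def scaled_gemm(a, b, block=128):
    """Approximate a @ b from block-quantized operands.

    Each K-block's partial product is rescaled by the two block scales
    before being accumulated in higher precision.
    """
    qa, sa = quantize_blocks(a, block)    # (m, kb, block), (m, kb, 1)
    qb, sb = quantize_blocks(b.T, block)  # (n, kb, block), (n, kb, 1)
    out = np.zeros((a.shape[0], b.shape[1]))
    for j in range(qa.shape[1]):
        partial = qa[:, j, :] @ qb[:, j, :].T         # (m, n) in quantized units
        out += partial * sa[:, j, :] * sb[:, j, :].T  # undo both scales
    return out
```

Comparing `scaled_gemm(a, b)` against `a @ b` on random inputs gives a quick feel for how much error the per-block scheme introduces.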
An experiment explores the feasibility of creating and transmitting custom network protocols across different operating systems and the internet, revealing significant challenges with OS compatibility and network infrastructure limitations. Results demonstrate that while custom protocols can work locally, they face major obstacles when traversing NAT gateways, firewalls, and cloud providers, ultimately suggesting TCP/UDP remain the most practical choices.
A Mastodon server dedicated to computer systems research and professional discussions, focusing on operating systems, distributed systems, networks, and databases within the fediverse ecosystem.
A detailed walkthrough on building a BitTorrent client in Go, covering core concepts from parsing torrent files to downloading pieces from peers using TCP connections and managing concurrency with channels.
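Parsing a torrent file starts with decoding bencode, the format .torrent files are stored in. The walkthrough itself uses Go; the sketch below is a minimal Python decoder written for illustration, with hypothetical function names, and it skips the stricter validation a real client would want.

```python
def bdecode(data: bytes):
    """Decode a complete bencoded byte string into Python values."""
    value, pos = _decode(data, 0)
    if pos != len(data):
        raise ValueError("trailing data after bencoded value")
    return value

def _decode(data: bytes, i: int):
    """Decode one value starting at offset i; return (value, next_offset)."""
    c = data[i:i + 1]
    if c == b"i":                      # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                      # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = _decode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":                      # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = _decode(data, i)
            result[key], i = _decode(data, i)
        return result, i + 1
    if c.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", i)
        length = int(data[i:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"unexpected byte at offset {i}")
```

Keys and strings come back as `bytes`, since fields such as the piece hashes are raw binary rather than text.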
FlashMLA is a high-performance MLA decoding kernel optimized for Hopper GPUs, achieving up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in compute-bound configurations. The implementation supports BF16 and a paged KV cache, and requires CUDA 12.3+ and PyTorch 2.0+.
A comprehensive chronicle of OpenBSD's software innovations and security features, detailing the project's significant contributions to operating system security, including privilege separation, ASLR, stack protection, and numerous system hardening measures.