Performance Optimization

The FFT Strikes Back: An Efficient Alternative to Self-Attention

FFTNet introduces a novel approach to sequence processing based on the Fast Fourier Transform, achieving O(n log n) complexity versus the quadratic cost of standard self-attention. The framework employs spectral filtering and a modReLU activation to capture long-range dependencies efficiently, demonstrating superior performance on the Long Range Arena and ImageNet benchmarks.
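
A rough sketch of the core idea (not the paper's implementation): mix tokens with an FFT along the sequence, apply a learnable element-wise spectral filter and a modReLU nonlinearity, then transform back. The filter values and bias below are random placeholders.

```python
import numpy as np

def modrelu(z, bias):
    """modReLU: ReLU applied to the magnitude of a complex value, phase preserved."""
    mag = np.abs(z)
    return np.maximum(mag + bias, 0.0) * np.exp(1j * np.angle(z))

def fft_token_mixing(x, spectral_filter, bias):
    """O(n log n) global token mixing along the sequence axis.

    x               : (seq_len, d_model) real-valued token embeddings
    spectral_filter : (seq_len // 2 + 1, d_model) complex learnable filter
    bias            : modReLU bias (scalar or broadcastable)
    """
    X = np.fft.rfft(x, axis=0)                    # to the frequency domain
    X = modrelu(X * spectral_filter, bias)        # spectral filtering + nonlinearity
    return np.fft.irfft(X, n=x.shape[0], axis=0)  # back to the token domain

# Toy usage with random "learned" parameters.
rng = np.random.default_rng(0)
seq_len, d_model = 128, 16
x = rng.standard_normal((seq_len, d_model))
filt = rng.standard_normal((seq_len // 2 + 1, d_model)) \
     + 1j * rng.standard_normal((seq_len // 2 + 1, d_model))
print(fft_token_mixing(x, filt, bias=-0.1).shape)  # (128, 16)
```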

GitHub - deepseek-ai/DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

DeepGEMM is a CUDA library offering efficient FP8 matrix multiplications with fine-grained scaling, supporting both normal and Mixture-of-Experts (MoE) GEMMs. The lightweight library matches or exceeds the performance of expert-tuned libraries, featuring runtime compilation and Hopper tensor core optimization while keeping the core kernel to roughly 300 lines.
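
As an illustration of what fine-grained scaling means in principle (this is not DeepGEMM's API or kernel code; the 128-wide block and the FP8 E4M3 range are assumptions used only for the sketch), a GEMM with per-block scales can be simulated in plain numpy:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (assumed for the sketch)
BLOCK = 128           # granularity of the scaling factors (assumed)

def quantize_blockwise(a):
    """Scale rows of `a` in blocks of BLOCK columns, one scale per block.

    Values are only clipped to the FP8 range here (still float32), which is
    enough to show where per-block scales come from and how they are reused.
    """
    m, k = a.shape
    assert k % BLOCK == 0
    blocks = a.reshape(m, k // BLOCK, BLOCK)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(blocks / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scales

def gemm_with_finegrained_scales(a, b):
    """C = A @ B computed block by block, re-applying the scales per block.

    Partial products are accumulated in a higher-precision buffer (float64
    here), mirroring the promotion of low-precision partial sums.
    """
    qa, sa = quantize_blockwise(a)      # (m, k/B, B), (m, k/B, 1)
    qb, sb = quantize_blockwise(b.T)    # (n, k/B, B), (n, k/B, 1)
    c = np.zeros((qa.shape[0], qb.shape[0]), dtype=np.float64)
    for blk in range(qa.shape[1]):
        partial = qa[:, blk, :] @ qb[:, blk, :].T   # low-precision product
        c += partial * sa[:, blk] * sb[:, blk].T    # per-block rescaling
    return c

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 256)).astype(np.float32)
B = rng.standard_normal((256, 32)).astype(np.float32)
err = np.abs(gemm_with_finegrained_scales(A, B) - A @ B).max()
print(f"max abs error vs. plain GEMM: {err:.3e}")
```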

GitHub - deepseek-ai/DeepEP: an efficient expert-parallel communication library

DeepEP is a communication library optimized for Mixture-of-Experts (MoE) and expert parallelism, providing high-throughput GPU kernels and low-latency operations. The library supports both intranode and internode communication, offering specialized kernels for asymmetric-domain bandwidth forwarding and low-latency inference decoding, with support for FP8 data formats and RDMA networks.
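
Logically, the dispatch and combine steps that DeepEP accelerates look like the single-process sketch below; the router, gates, and experts are made up for illustration, and the real library replaces the gathers and scatters with high-throughput all-to-all communication kernels across GPUs and nodes:

```python
import numpy as np

def dispatch_and_combine(tokens, router_logits, experts, top_k=2):
    """Single-process sketch of the MoE dispatch -> expert compute -> combine flow.

    tokens        : (num_tokens, d_model)
    router_logits : (num_tokens, num_experts) scores from a gating network
    experts       : list of callables, one per expert
    """
    # Pick the top-k experts per token and softmax-normalize their gate weights.
    topk_idx = np.argsort(router_logits, axis=1)[:, -top_k:]          # (T, k)
    topk_gates = np.take_along_axis(router_logits, topk_idx, axis=1)
    topk_gates = np.exp(topk_gates) / np.exp(topk_gates).sum(1, keepdims=True)

    output = np.zeros_like(tokens)
    for e, expert in enumerate(experts):
        # "Dispatch": gather every (token, slot) routed to expert e.
        token_ids, slots = np.nonzero(topk_idx == e)
        if token_ids.size == 0:
            continue
        expert_out = expert(tokens[token_ids])                        # expert compute
        # "Combine": scatter the gate-weighted results back to their tokens.
        output[token_ids] += topk_gates[token_ids, slots][:, None] * expert_out
    return output

rng = np.random.default_rng(2)
T, D, E = 8, 4, 4
experts = [lambda x, w=rng.standard_normal((D, D)): x @ w for _ in range(E)]
out = dispatch_and_combine(rng.standard_normal((T, D)),
                           rng.standard_normal((T, E)), experts)
print(out.shape)  # (8, 4)
```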

Ask HN: Is anybody building an alternative transformer?

Various alternatives to the Transformer architecture are being explored, with Mamba showing promise through faster inference and lower compute costs while performing on par with Transformers at scales up to about 7B parameters. Researchers are investigating recurrent architectures, state-space models, and efficient attention mechanisms, while debating the future direction of foundation models.
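
For context, a minimal sketch of the linear state-space recurrence that this family of models builds on is shown below; Mamba adds input-dependent (selective) parameters and a hardware-aware parallel scan on top of this, which the sketch does not attempt to show:

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Discrete linear state-space model: x_t = A x_{t-1} + B u_t, y_t = C x_t.

    Each step costs O(1) in the sequence length with a fixed-size state,
    which is what makes this family attractive next to quadratic attention.
    """
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B * u_t   # constant-time state update per token
        ys.append(C @ x)      # readout
    return np.array(ys)

rng = np.random.default_rng(3)
n = 16
A = 0.9 * np.eye(n) + 0.01 * rng.standard_normal((n, n))
y = ssm_scan(rng.standard_normal(100), A, rng.standard_normal(n), rng.standard_normal(n))
print(y.shape)  # (100,)
```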

Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling | NVIDIA Technical Blog

NVIDIA engineers used the DeepSeek-R1 model with inference-time scaling to automatically generate optimized GPU attention kernels, in some cases surpassing human-engineered solutions. The experiment demonstrates how a model can spend additional compute at inference time to generate and evaluate many candidate solutions and select the best one for a complex programming task.
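
The approach can be pictured as a generate-verify-select loop, sketched below; this is not NVIDIA's code, and the generation and benchmarking functions are simulated stand-ins for calls to the model and to a real build-and-verification harness:

```python
import random

def generate_kernel_candidate(prompt, feedback):
    """Stand-in for a call to the code-generating model (DeepSeek-R1 in the post).
    Returns a labeled dummy string so the loop is runnable end to end."""
    return f"// candidate kernel for: {prompt} (feedback: {feedback or 'none'})"

def verify_and_benchmark(kernel_source, rng):
    """Stand-in for compiling the candidate, checking it against a reference
    implementation, and timing it. Outcomes are simulated with random draws."""
    passes = rng.random() > 0.4         # pretend ~60% of candidates verify
    runtime_ms = rng.uniform(0.5, 2.0)  # pretend runtime in milliseconds
    return passes, runtime_ms

def best_of_n_kernel_search(prompt, num_candidates=16, seed=0):
    """Generate-verify-select: spend extra inference-time compute on many
    candidates, keep only verified ones, and return the fastest."""
    rng = random.Random(seed)
    best_source, best_ms, feedback = None, float("inf"), ""
    for _ in range(num_candidates):
        candidate = generate_kernel_candidate(prompt, feedback)
        ok, runtime_ms = verify_and_benchmark(candidate, rng)
        if not ok:
            feedback = "previous candidate failed verification"
            continue
        if runtime_ms < best_ms:
            best_source, best_ms = candidate, runtime_ms
        feedback = f"best verified runtime so far: {best_ms:.2f} ms"
    return best_source, best_ms

src, ms = best_of_n_kernel_search("fused attention kernel for Hopper")
print(f"selected candidate runs in {ms:.2f} ms (simulated)")
```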

League of Legends data scraping the hard and tedious way for fun

A developer reverse-engineered League of Legends' replay system to extract high-fidelity gameplay data by decrypting game packets and emulating game engine functions, achieving better performance than existing approaches. The work demonstrates methods for accessing detailed match data including precise player positions, ability usage, and damage calculations that are not available through official APIs.

We Replaced Our React Frontend with Go and WebAssembly - Dagger

Dagger successfully replaced their React frontend with Go and WebAssembly to unify their terminal and web UI codebases, resulting in improved performance and development efficiency. The migration involved overcoming WebAssembly's 2GB memory limit and optimizing large data processing, while demonstrating the viability of Go for complex web applications.