2025-02-17

0+0 > 0: C++ thread-local storage performance

An in-depth analysis of thread-local storage (TLS) performance in C++, examining how different implementations and contexts affect access speed. Core findings show that TLS access is fastest in executables without constructors, while shared libraries and constructors significantly degrade performance due to complex initialization and addressing mechanisms.

Original archive.is archive.ph web.archive.org

Log in to get one-click access to archived versions of this article.

read comments on news aggregators:

Related articles

Zen 5's AVX-512 Frequency Behavior

AMD's Zen 5 processor introduces full-width AVX-512 datapaths with impressive performance at high clock speeds, demonstrating significant improvements over Intel's Skylake-X implementation. The architecture employs sophisticated IPC throttling and clock management techniques to handle heavy AVX-512 workloads, maintaining optimal performance while avoiding fixed frequency offsets.

3,200% CPU Utilization

An in-depth analysis of a critical Java performance issue where unprotected concurrent TreeMap modifications led to 3,200% CPU utilization. The investigation revealed how thread interleaving can create infinite loops in red-black trees, with experiments across multiple programming languages demonstrating similar vulnerabilities.

Electronic Arts

Electronic Arts maintains a robust open-source presence with multiple Command & Conquer game repositories and development tools in C++. The organization actively manages various technical projects including game modding support, rendering frameworks, and Kubernetes deployment tools.

Why We Designed TigerBeetle's Docs from Scratch | TigerBeetle Blog

TigerBeetle rebuilt their documentation site from scratch, moving away from Docusaurus to achieve better performance, simplicity, and integration with their zero-dependency philosophy. The new implementation uses Zig and Pandoc, resulting in a 10x reduction in footprint while maintaining functionality and adding features like integrated search and offline capabilities.

Xcode constantly phones home

The article discusses performance issues with Xcode builds caused by unnecessary connections to Apple's servers during the 'Gather provisioning inputs' phase. The author discovers that blocking certain Apple domains through Little Snitch significantly improves build times while exploring Xcode's seemingly unnecessary tracking and analytics connections.

The Best Way to Use Text Embeddings Portably is With Parquet and Polars

A deep dive into efficient storage and retrieval of text embeddings using Parquet files and polars library, demonstrated through Magic: The Gathering card analysis. The article explores alternatives to vector databases for smaller datasets, highlighting how combining Parquet files with polars offers zero-copy operations and fast similarity searches.

What's New in Emacs 30.1?

Emacs 30.1 introduces significant improvements including a new completion preview mode, tree-sitter sexp command enhancements, and better touch screen support. The release also features native JSON improvements, buffer-local file watching, and automated org protocol registration.

Sublinear Time Algorithms

Sublinear time algorithms represent a paradigm shift in computational efficiency, allowing processing of extremely large datasets by reading only a fraction of the input. While exact deterministic sublinear algorithms exist for some problems, most solutions require randomization and approximation techniques, with applications spanning optimization, property testing, and distribution analysis.

GitHub - deepseek-ai/FlashMLA

FlashMLA is a high-performance MLA decoding kernel optimized for Hopper GPUs, achieving up to 3000 GB/s in memory-bound configurations and 580 TFLOPS in computation-bound scenarios. The implementation supports BF16 and paged kvcache, requiring CUDA 12.3+ and PyTorch 2.0+.