Lock-Free Data Structures
Atomic operations, CAS (Compare-And-Swap), and concurrent queue implementations
Lock-Free Data Structures
Every synchronization primitive we’ve studied so far—mutexes, semaphores, condition variables—solves the concurrency problem by making threads wait. A thread that can’t acquire a lock goes to sleep. The scheduler will eventually wake it up. This blocking approach is simple, correct, and general: you can protect any data structure with a mutex.
But blocking has costs. Context switches are expensive. Under high contention, threads spend more time sleeping and waking than doing useful work. In latency-sensitive systems (real-time trading, operating system kernels, high-frequency data pipelines), the unpredictability of blocking synchronization is unacceptable. And for some problems—particularly in multi-core systems with many CPUs—you simply cannot afford to serialize access through a lock.
Lock-free data structures take a different approach: instead of using locks to ensure that only one thread accesses a data structure at a time, they use atomic hardware instructions to make concurrent access safe. A thread never blocks; instead, if it detects that another thread modified the data structure in a conflicting way, it retries the operation.
Overview
Lock-free data structures achieve thread safety without locks by using atomic operations provided by the hardware—instructions like Compare-And-Swap (CAS), Fetch-And-Add, Load-Linked/Store-Conditional. These instructions are implemented in hardware and guarantee that certain read-modify-write sequences happen atomically, even across multiple CPU cores.
The key property of lock-free algorithms is progress guarantee: some thread will make progress in a finite number of steps, regardless of what other threads are doing. This is stronger than lock-based algorithms, where if one thread holding a lock is preempted, all other threads waiting for that lock are blocked indefinitely. Lock-free algorithms are immune to priority inversion and scheduler latency.
─────────────────────────────────────────────
When to Use / When Not to Use
Use lock-free data structures when: you need maximum throughput and minimum latency in a concurrent system, you’re on a multi-core system where lock contention is a bottleneck, you’re writing kernel code or real-time systems where blocking is unacceptable, or you’re implementing high-performance queues and counters.
Practical use cases:
- High-performance message queues (Disruptor pattern in LMAX trading systems)
- Non-blocking counters and accumulators
- Lock-free stacks and linked lists
- Hazard pointer-based memory management for read-heavy data
Do not use lock-free data structures when: your team lacks experience debugging memory order issues (the bugs are subtle and extremely hard to reproduce), your data structure is simple enough that a mutex is faster and clearer, or your platform lacks atomic instructions (though this is rare on modern hardware).
The complexity cost: A mutex-protected data structure with a simple implementation will often outperform a poorly implemented lock-free version. Lock-free algorithms require careful reasoning about memory ordering, handle ABA problems, and can have worse worst-case performance than lock-based alternatives due to retry loops. Benchmark before committing.
Architecture or Flow Diagram
graph TD
A[Thread A: CAS ptr from OLD to NEW] --> B{CAS succeeds?}
B -->|yes| C[Node linked, operation complete]
B -->|no| D[Re-read ptr<br/>Retry with updated value]
A2[Thread B: CAS ptr from OLD to NEW] --> B2{CAS succeeds?}
B2 -->|yes| E[Node linked, operation complete]
B2 -->|no| F[Re-read ptr<br/>Retry with updated value]
D --> A
F --> A2
Two threads attempt to modify the head pointer simultaneously. Only the one with the correct “old” value succeeds. The loser retries with the new value it observed, which may have changed from another thread’s update.
Core Concepts
Atomic Operations and Memory Order
Modern CPUs provide atomic instructions that perform read-modify-write operations atomically:
-
CAS (Compare-And-Swap): Reads a memory location, compares it to an expected value, and if equal, writes a new value. Returns true if the swap happened, false otherwise. This is the workhorse of lock-free algorithms.
-
Fetch-And-Add: Atomically increment a value and return the old value.
-
Load-Linked / Store-Conditional (LL/SC): ARM’s alternative to CAS. Load-Linked loads a value and starts a transaction; Store-Conditional attempts to store, failing if the location was modified by another core. More flexible than CAS but less widely available.
Memory ordering is the critical subtlety. CPUs can reorder memory operations for performance (store buffering, write combining). A CAS that succeeds on one core may not be immediately visible to other cores due to store buffers. To write correct lock-free algorithms, you must specify memory ordering semantics:
- Seq_cst (sequentially consistent): The strongest guarantee—all threads see all operations in the same total order. Easy to reason about but slowest.
- Acq_rel (acquire/release): A write to a location becomes visible to other threads at the release (e.g., when a lock is released), and subsequent reads see the writes that preceded that release.
- Relaxed: Only atomicity is guaranteed, not ordering. Fastest but hardest to reason about.
// Seq_cst example (C11 atomics)
atomic_int counter = ATOMIC_VAR_INIT(0);
// This has full memory barrier semantics
int old = atomic_fetch_add_explicit(&counter, 1, memory_order_seq_cst);
The ABA Problem
The ABA problem is the most notorious challenge in lock-free programming. Consider a lock-free stack push operation:
// Thread A: pop head
node_t *head = atomic_load(&stack_head);
node_t *next = head->next;
// Context switch here
// Thread B: pop head twice, push one back (reuses the node)
// head is now back at the same address with new next
node_t *new_head = next; // This is the same address as head!
// Thread A resumes: CAS from head to next
// CAS sees head still equals head, but the stack state has changed!
atomic_compare_exchange_strong(&stack_head, &head, next); // WRONG!
The classic solution is tagged pointers or versioned references: add a counter that increments each time the pointer is modified. The CAS checks both the pointer and the version number, so a recycled node with the same address but a higher version will be detected.
typedef struct {
void *ptr;
uintptr_t tag;
} tagged_ptr_t;
atomic_uintptr_t head; // The tag is stored in the upper bits on 64-bit systems
// ABA-safe CAS
int tagged_cas(atomic_uintptr_t *addr, tagged_ptr_t *old, tagged_ptr_t *new) {
uintptr_t old_val = (old->tag << 48) | (uintptr_t)old->ptr;
uintptr_t new_val = (new->tag << 48) | (uintptr_t)new->ptr;
return atomic_compare_exchange_strong(addr, &old_val, new_val);
}
Michael-Scott Lock-Free Queue
The most widely used lock-free queue algorithm, introduced by Maged Michael and Michael Scott in 1996, handles the ABA problem through careful pointer management and a dummy node at the head.
typedef struct node_t {
void *data;
_Atomic struct node_t *next;
} node_t;
typedef struct {
_Atomic struct node_t *head; // Dummy node, never NULL
_Atomic struct node_t *tail;
} mpmc_queue_t;
void queue_init(mpmc_queue_t *q) {
node_t *dummy = malloc(sizeof(node_t));
dummy->data = NULL;
dummy->next = NULL;
atomic_store(&q->head, dummy);
atomic_store(&q->tail, dummy);
}
void queue_push(mpmc_queue_t *q, void *data) {
node_t *node = malloc(sizeof(node_t));
node->data = data;
node->next = NULL;
while (1) {
node_t *tail = atomic_load(&q->tail);
node_t *next = atomic_load(&tail->next);
if (tail == atomic_load(&q->tail)) { // Re-check tail hasn't moved
if (next == NULL) {
// Try to append at tail
if (atomic_compare_exchange_strong(&tail->next, &next, node)) {
// Success: now update tail
atomic_compare_exchange_strong(&q->tail, &tail, node);
return;
}
} else {
// Tail fell behind, advance it
atomic_compare_exchange_strong(&q->tail, &tail, next);
}
}
}
}
void *queue_pop(mpmc_queue_t *q) {
while (1) {
node_t *head = atomic_load(&q->head);
node_t *tail = atomic_load(&q->tail);
node_t *next = atomic_load(&head->next);
if (head == atomic_load(&q->head)) { // Re-check
if (head == tail) {
if (next == NULL) return NULL; // Empty
atomic_compare_exchange_strong(&q->tail, &tail, next);
} else {
void *data = next->data;
if (atomic_compare_exchange_strong(&q->head, &head, next)) {
free(head); // Free old dummy
return data;
}
}
}
}
}
Production Failure Scenarios + Mitigations
The LMAX Disruptor Incident
LMAX’s Disruptor—a high-performance inter-thread messaging library—had a subtle bug where a memory fence operation was accidentally removed during an optimization. On x86, stores can be reordered by the store buffer, meaning that a value written to a ring buffer slot might not be visible to the reading thread even after the sequence number was updated. The result was intermittent corrupted reads where consumers saw stale data.
Mitigation: Always use explicit memory orderings in lock-free code. Do not rely on default atomic behavior without understanding your hardware’s memory model. On x86, use memory_order_seq_cst for stores that must be visible to other cores before dependent stores.
ABA Corruption in Linux Kernel’s RCU Implementation
The Linux kernel’s RCU (Read-Copy-Update) mechanism had a production incident where the ABA problem in a linked list traversal caused a use-after-free. A reader traversing a list saw a node’s next pointer, was preempted, and upon resumption found the pointer had been reused for a new node. The RCU grace period mechanism was supposed to prevent reuse until all readers finished, but a bug in the grace period tracking allowed a node to be freed while a reader still held a reference to it.
Mitigation: In production lock-free code, rigorous memory reclamation strategies are essential. Use hazard pointers (each thread declares which nodes it is accessing, preventing their reclamation), RCU (kernel), or epoch-based reclamation (userspace). Never assume that CAS success means the node is safe to reclaim.
Python’s Queue Module Memory Corruption
Python’s queue.Queue uses a mutex + condition variable approach internally. However, some third-party “lock-free” Python extensions claiming high performance had an ABA bug in their CAS loop: they used a Python object reference directly as the CAS value but didn’t account for the fact that Python can GC and reuse an object at the same address within a single CAS operation window. The result was data corruption that manifested as intermittent crashes in production.
Mitigation: In managed memory environments (Python, Java, Go), ensure your lock-free primitives account for garbage collection or use higher-level abstractions that do. Never use raw pointer addresses as CAS values in languages with GC unless the runtime provides a safe mechanism (like Java’s AtomicMarkableReference).
Trade-off Table
| Aspect | Lock-Based | Lock-Free (CAS) | Lock-Free (LL/SC) |
|---|---|---|---|
| Latency (uncontended) | Low | Very low | Very low |
| Latency (contended) | High (context switches) | Medium (retry loops) | Medium |
| Throughput (high contention) | Poor | Good | Good |
| Implementation complexity | Low | High | Very high |
| Memory reclamation | Simple (one thread at a time) | Hard (ABA problem) | Hard |
| Progress guarantee | Blocked thread = no progress | Always some thread makes progress | Always some thread makes progress |
| Hardware requirements | None | CAS instruction required | LL/SC required (ARM, Power) |
Implementation Snippets
C: Lock-Free Counter with Fetch-And-Add
#include <stdatomic.h>
#include <stdio.h>
_Atomic long long counter = 0;
// Increment without any lock
void increment() {
atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}
// Get current value
long long get() {
return atomic_load_explicit(&counter, memory_order_relaxed);
}
C++11: Lock-Free Stack (Treiber Stack)
#include <atomic>
#include <memory>
#include <iostream>
template<typename T>
struct node {
T data;
node* next;
node(T d) : data(std::move(d)), next(nullptr) {}
};
template<typename T>
class lock_free_stack {
std::atomic<node<T>*> head;
public:
void push(T data) {
node<T>* new_node = new node<T>(std::move(data));
new_node->next = head.load(std::memory_order_relaxed);
// CAS until we successfully link the new node
while (!head.compare_exchange_weak(
new_node->next,
new_node,
std::memory_order_release,
std::memory_order_relaxed
)) {
// new_node->next is updated by compare_exchange_weak on failure
// Retry with updated new_node->next
}
}
std::optional<T> pop() {
node<T>* old_head = head.load(std::memory_order_acquire);
while (old_head && !head.compare_exchange_weak(
old_head,
old_head->next,
std::memory_order_acquire,
std::memory_order_relaxed
)) {
// Retry with updated old_head
}
if (!old_head) return std::nullopt;
T result = std::move(old_head->data);
delete old_head; // Memory reclamation: safe because no other thread holds a reference
return result;
}
};
Python: Using ctypes for Atomic CAS
import ctypes
import threading
class LockFreeCounter:
def __init__(self):
# 64-bit atomic counter using ctypes
self._value = threading.Lock() # Fallback to mutex if no atomic support
self._count = 0
def increment(self):
# Using a simple mutex fallback for demonstration
# Production: use ctypes with inline assembly or cffi for real atomic CAS
with self._value:
self._count += 1
def get(self):
with self._value:
return self._count
Observability Checklist
- CAS failure rate: Count how often
compare_exchange_strongreturns false (retry needed). High failure rate (> 10%) indicates high contention or poor algorithm design - Retry loop iteration count: Track how many iterations of a retry loop are needed per operation. Growing retries indicate contention
- Memory barrier cost: On high-core-count systems, seq_cst barriers can be expensive; measure with perf counters
- Hazard pointer scans: If using hazard pointer reclamation, measure the cost of scanning hazard pointer arrays on each operation
- ABA collision rate: Track how often the CAS fails due to tag mismatch (ABA detection). High rate indicates insufficient tag bits
- Cache line contention: Lock-free data structures sharing cache lines between atomic variables can cause false sharing; measure cache miss rates
Security/Compliance Notes
Linearizability and data security: Lock-free algorithms are typically linearizable (each operation appears to happen atomically at some point between invocation and response). For security-sensitive data, this ordering guarantee matters: if you’re using lock-free structures to manage authentication tokens or session state, ensure the memory ordering semantics are strong enough to prevent one thread from observing partially updated state.
Memory reclamation exploits: Lock-free algorithms that use hazard pointers or epoch-based reclamation must carefully track which threads are still accessing data before reclaiming it. A bug in the reclamation logic can cause use-after-free, which is a memory corruption vulnerability. Use well-tested libraries (like Facebook’s Folly for C++) rather than implementing your own.
Side-channel timing leaks: Lock-free data structures with secret-dependent access patterns (e.g., a lock-free hash table whose probing pattern depends on key value) can leak sensitive information through timing side channels. This is especially relevant in cryptography or authentication systems.
For compliance: if your lock-free data structure processes PII or financial data, ensure that the linearizability guarantee holds under all contention scenarios. A dropped update in a lock-free system is not just a performance bug—it can be a data integrity violation.
Common Pitfalls / Anti-patterns
1. Ignoring memory ordering. Using memory_order_relaxed when you need memory_order_seq_cst will cause your algorithm to fail on weakly ordered architectures (ARM, POWER) while working correctly on x86. Always reason about what ordering your algorithm requires and specify it explicitly.
2. ABA problem causing data corruption. Failing to handle the ABA problem will silently corrupt data on some workload patterns. Use tagged pointers, versioned references, or a memory reclamation scheme (hazard pointers, RCU, epochs) that prevents reuse until all possible readers have finished.
3. Assuming hardware CAS is always available. Some embedded processors lack CAS instructions. Use compile-time detection (__sync_bool_compare_and_set or C++11 atomics) and fall back to mutex-based implementations for platforms without hardware atomic support.
4. Writing lock-free algorithms without hazard pointer reclamation. You cannot safely free a node after a CAS succeeds—the old value’s node might still be referenced by another thread that read it before the CAS. You need a memory reclamation strategy. The most common mistake is calling free() immediately after a successful CAS, causing use-after-free in the other thread.
5. Over-using retry loops under high contention. If the CAS keeps failing due to high contention, retry loops can cause livelock (wasting CPU with no progress). Use exponential back-off in retry loops for high-contention scenarios, and benchmark against lock-based alternatives.
Quick Recap Checklist
- Lock-free data structures use atomic hardware instructions (CAS, Fetch-And-Add) instead of locks
- CAS is the fundamental building block: atomically compare a value and swap if it matches
- The ABA problem (detecting reused memory locations) requires tagged pointers or versioned references
- Memory ordering semantics must be explicitly specified to ensure correctness on weakly ordered architectures
- Lock-free algorithms require a memory reclamation strategy (hazard pointers, RCU, epoch-based) to safely free nodes
- CAS retry loops under high contention can cause livelock; use back-off strategies
- Lock-free is not always faster than lock-based: benchmark your specific workload before committing
- Implementation complexity is high; use well-tested library implementations where possible
Interview Questions
The ABA problem occurs when a thread reads location A and sees value V, performs some operations (a context switch might occur here), and then tries to CAS from V to W—but between the read and the CAS, another thread changed A from V to some other value and then back to V. The CAS sees A as still being V (the original value it expected) and succeeds, even though the location was modified twice. The "A" is the same address but the state beneath it changed.
Solutions: Tagged pointers (store a version counter alongside the pointer in a single atomic variable; the CAS checks both the pointer and the tag), double-width CAS (if your architecture supports 128-bit CAS, store the pointer and tag together), or memory reclamation schemes like hazard pointers that prevent reuse until all potential readers have finished. Tagging is the most widely used approach: on a 64-bit system, you store the tag in the upper 16 bits of the 64-bit atomic word and pack pointer + tag into one machine word.
memory_order_relaxed guarantees only atomicity—no reordering by the compiler or hardware, and no visibility guarantee across threads. It is the fastest but hardest to reason about. Use only for simple counters where ordering doesn't matter.
memory_order_acquire is used when loading a value. It ensures that all subsequent reads and writes in this thread happen after the atomic load. It prevents the hardware and compiler from reordering any operations that appear after the load to before it. If another thread released a lock with a memory_order_release, the acquire sees all writes that happened before the release.
memory_order_seq_cst (sequentially consistent) is the strongest guarantee. All threads see all seq_cst operations in the same total order. It implies full memory barriers on both sides and is the default for C11 atomics. Use this when in doubt—it is the safest and easiest to reason about, at the cost of some performance on weakly ordered architectures.
Because another thread might still be holding a reference to the node you want to free. Consider a pop() operation on a lock-free stack: Thread A reads the head pointer (pointing to Node X), then gets preempted before it can complete the CAS. Thread B completes two pop operations, which unlinks Node X and makes it eligible for reclamation. If Thread A resumes and its CAS succeeds (linking Node X's next into head), but then frees Node X, the free happens while Thread B might still be trying to read Node X's data from a stale local reference. This is a classic use-after-free.
You need a memory reclamation strategy: hazard pointers (each thread declares which nodes it is accessing), epoch-based reclamation (garbage collect nodes after all threads have passed a grace period), or RCU (Read-Copy-Update, used in the Linux kernel). None of these are trivial; they are one of the most complex aspects of practical lock-free programming.
Lock-free algorithms provide wait-freedom or lock-freedom depending on the implementation. Wait-freedom guarantees that every thread makes progress in a bounded number of steps regardless of other threads' actions—no thread can be delayed indefinitely by another. Lock-freedom is weaker: it guarantees that at least some thread makes progress in a bounded number of steps, but a specific thread might be delayed. A retry loop in a CAS-based algorithm is lock-free (the retry makes progress) but not necessarily wait-free (a specific thread might retry many times). Mutex-based algorithms are neither—if a thread holding a lock is preempted, all other threads waiting for that lock are blocked indefinitely with no progress guarantee.
A lock-based data structure often outperforms a lock-free one in these scenarios:
- Low contention: The mutex is rarely contested, so the lock is acquired without blocking. The lock-free CAS's retry loop adds overhead for no benefit.
- Simple data structures: A mutex protecting a simple structure (like a hash map) is straightforward; a lock-free implementation is complex and likely slower.
- Single-threaded or low-core-count systems: On a single-core system, lock-free algorithms offer no advantage over mutexes, and the retry overhead is pure cost.
- Cache-sensitive access patterns: Lock-free algorithms that use separate atomic variables for different parts of the data structure can suffer from false sharing (different atomic variables sharing a cache line), causing worse cache behavior than a single lock protecting all data.
Always benchmark. The theoretical advantage of lock-free only translates to real performance when contention is high and the data structure is simple enough for the lock-free implementation to be tractable.
CAS atomically reads a memory location, compares it to an expected value, and writes a new value only if it matched—all in one atomic instruction. LL/SC works differently: a load-linked reads a value and starts a transaction; a store-conditional attempts to write but fails if the memory location was modified by another core since the load-linked. ARM and POWER use LL/SC; x86 and SPARC use CAS. LL/SC is more flexible because it can implement arbitrary atomic operations (like atomic increment of an arbitrary value, not just swap) but CAS is simpler and typically faster on x86. LL/SC can have "lost updates" due to unrelated writes (not just the targeted location)—if a cache line is invalidated by another core's unrelated write, the SC fails even though the specific location wasn't modified.
Hazard pointers solve memory reclamation by having each thread declare which nodes it is currently accessing—their "hazard" pointers. Before accessing a node, a thread writes a pointer to it in its hazard pointer slot. After a CAS removes a node from the data structure, the removing thread cannot immediately free it because a reader thread might still be holding a reference to that node in its hazard pointer. When a thread finishes with a node, it clears its hazard pointer. A background thread periodically scans hazard pointer arrays and frees any nodes that are no longer referenced by any hazard pointer and have been unlinked. The key guarantee: a node is never reclaimed while any thread could still access it. This is widely used in production lock-free code (e.g., Meta's folly library). The overhead is low—O(1) per operation if the hazard pointer array is small.
RCU is a synchronization mechanism optimized for read-heavy workloads. Readers proceed without any locking—they simply read the data. Writers make a copy of the data structure, modify the copy, then atomically update the pointer that readers follow. The key is the "grace period": before reclaiming old nodes, the writer waits until all readers that might have been referencing the old data have completed. This is guaranteed because readers checkpoint at kernel preemption points—if a reader started before the update, it will finish before the grace period ends. Linux's RCU is used for the kernel's filesystem dentry cache, routing table, and reference counted objects. It achieves near-zero overhead for readers (no atomic operations) at the cost of complex writer logic and memory overhead (old versions must be retained until grace period). Variants include sleepable RCU (SRCU) for contexts where sleeping is allowed.
Epoch-based reclamation organizes time into epochs. Threads commit their operations within an epoch; at epoch boundaries, all threads that have finished their work in that epoch are counted, and any nodes that were unlinked in that epoch become safe to reclaim. Like hazard pointers, it prevents reclaiming nodes that might still be accessed. The advantage over hazard pointers is lower per-operation overhead (no per-access hazard pointer writes needed). The disadvantage is that a thread blocking indefinitely can stall the epoch advancement, preventing reclamation of any nodes. RCU has a similar blocking issue (grace periods cannot complete). Hazard pointers have less blocking but higher space usage. Choose based on workload: for very low-contention reads with infrequent updates, epoch is efficient; for high-contention scenarios, hazard pointers may be safer.
Lock-freedom guarantees that some thread will make progress in a bounded number of steps—the system as a whole moves forward even if a specific thread stalls. Retry loops in CAS-based algorithms are lock-free but not wait-free: a high-priority thread might starve a lower-priority one if the higher-priority keeps winning CAS retries. Wait-freedom is stronger: every thread makes progress in a bounded number of steps, meaning no thread can be delayed indefinitely by another. Wait-free algorithms are hard to implement and often slower in the common case; they are used in real-time systems where bounded latency is mandatory. Practical lock-free code (like the Michael-Scott queue) is typically lock-freedom only. Implementing wait-freedom for complex data structures like queues is an active research area.
Fetch-and-add (`atomic_fetch_add`) atomically increments a value and returns the old value before the increment. Unlike CAS which is a conditional write, fetch-and-add is unconditional—it always succeeds. It is used in lock-free counters (the simplest lock-free structure), generating unique sequence numbers, and implementing more complex operations like the LCRQ (Low-contention TSX-based lock-free queue). Fetch-and-add is also the basis of fetch-and-add based lock-free stacks (Treiber stack uses CAS, but a simpler variant could use FAA). On x86, fetch-and-add is implemented as `lock xadd` instruction. Its advantage over CAS loops for counters is that it never retries—it always succeeds. The return value is useful for bookkeeping (determining which thread got which sequence number, watermark tracking, etc.).
In a lock-free stack (Treiber stack), memory reclamation is simpler because by the time a CAS returns success, the popped node is no longer reachable—the head pointer now points to the next node, and no other thread can hold a reference to the popped node (assuming no interior references). With careful synchronization (like using a `retire` function that waits for in-flight readers), you can reclaim immediately or after a short grace period. A lock-free queue is harder: the Michael-Scott queue's dummy node pattern means that after a pop, the old dummy node is no longer referenced by any thread, but intermediate nodes can still be referenced by threads that were interrupted mid-operation during an earlier pop. This is why the Michael-Scott queue in the article needs the `free(head)` call carefully placed—only after the CAS succeeds does the old head become safe to reclaim. Incorrect placement causes use-after-free.
SPSC queues have one thread producing and one consuming—they are much simpler to implement because producer and consumer never truly conflict: the producer only writes to the tail, the consumer only reads from the head. A ring buffer with head and tail indices (each atomic for different reasons) works well for SPSC, and the producer never needs to retry if it checks the distance from consumer correctly. MPMC queues allow multiple producers and consumers concurrently, which introduces all the hard problems: CAS retry loops, ABA prevention, memory reclamation complexity. Michael-Scott is the canonical MPMC queue. SPSC is used in audio pipelines, network packet processing (NIC to memory), and any producer-consumer asymmetry where the pattern is truly one-to-one. Implementing a correct, high-performance MPMC queue remains challenging and is typically delegated to battle-tested libraries like Boost.Lockfree or Folly.
TSX (Transaction Synchronization Extensions) on x86 allows CPUs to execute a code region as a transaction that either commits entirely or aborts. If any memory location accessed within the transaction is modified by another core (cache line invalidation), the CPU aborts the transaction and the code falls back to a non-transactional path. This enables "optimistic concurrency" for lock-free code: instead of a CAS retry loop with possible starvation, you wrap the operation in a transaction and if it conflicts, simply retry. TSX can simplify lock-free algorithms by eliminating the need for explicit retry loops. However, TSX has limitations: transactions abort on any external interrupt, can have high abort rates under contention, and is not available on all CPUs (or may be disabled due to security issues like TSX Asynchronous Abort). The LCRQ queue uses TSX for high-performance single-producer single-consumer rings.
For a high-throughput message queue, hazard pointers have the advantage of per-message reclamation timing being driven by the thread finishing with that message—latency is bounded by the reader's poll interval. If the queue is in active use, nodes get reclaimed as readers complete. Epoch-based reclamation batches reclamation: nodes accumulate in a retired list until the epoch advances, adding latency before reclamation. Under high throughput, hazard pointers reclaim more promptly, reducing memory footprint. However, hazard pointers require scanning a global array of hazard pointers on each reclaim cycle, which can be costly if many threads exist. Epoch-based reclamation has fixed overhead per operation but the scan is simpler. For a queue where memory pressure is a concern (long-running, high message rate), hazard pointers may reclaim faster. For simpler implementations with moderate throughput, epoch is easier to implement correctly.
x86 has a strong memory model—writes are not reordered with other writes, and reads are not reordered with other reads, except for store buffering. This means that some lock-free patterns that work on x86 will silently break on ARM (which is weakly ordered). For example, a store followed by a CAS in a release sequence might be reordered on ARM, causing another core to see the CAS before the store. On x86, you often get away with `memory_order_relaxed` in scenarios where ARM would require `memory_order_seq_cst`. However, x86's store buffer can make a thread's own writes invisible to itself for a short window, causing issues if the code reads its own written value expecting to see it immediately. For lock-free code intended to be portable, never assume x86 behavior—always specify explicit memory orderings and test on a weakly ordered architecture like ARM.
A compiler barrier prevents the compiler from reordering code across the barrier—it tells the compiler "do not move memory operations across this line." A hardware memory barrier prevents the CPU from reordering memory operations. In lock-free code, you often need both. If you have a CAS in C11 with `memory_order_relaxed`, the compiler might legally move the CAS before a store that logically must precede it for correctness. The hardware doesn't need to be involved if the compiler reorders. Use compiler barriers (`__asm__ volatile("" ::: "memory")` in GCC) when you need to ensure ordering of non-atomic operations around atomic ones. Use hardware barriers (C11 `memory_order_acquire/release`) when the CPU might reorder memory operations. Most C11 atomics include compiler barriers as part of their semantics, but when mixing inline assembly with atomics, you need explicit barriers.
On x86, `memory_order_acquire` (used for loads) generates a `mov` instruction with no extra overhead—x86's TSO (total store order) means loads are already ordered with respect to subsequent loads. `memory_order_seq_cst` for stores generates a `lock` prefix (implicit full barrier), which is more expensive than a plain store. For loads, seq_cst is the same cost as acquire on x86. The practical impact: a lock-free algorithm using seq_cst for both loads and stores will be slower than one using acquire/release for the communication patterns that don't require total store order. The worst case is using seq_cst for every atomic operation when only a few synchronization points truly need it. Profiling shows 10-30% overhead for seq_cst vs acquire/release on lock-free queues under contention. Always use the weakest memory order that guarantees correctness.
The LMAX Disruptor uses a single-producer single-consumer ring buffer with a different approach to synchronization. Instead of CAS on every enqueue, the producer writes data into the next available slot (a simple write) and then publishes by advancing a sequence counter with a `store` barrier. Consumers read the sequence counter and then read data from the buffer slot. The key insight is that the single producer never has contention—it has one slot to write to before publishing. Multiple consumers can read concurrently without contention because they read from different positions guided by their own sequence numbers. This eliminates the CAS retry loops that plague general-purpose MPMC queues under high contention. The tradeoff is that it only supports one producer (for true one-to-many fan-out), and the hot slot (the producer's current position) requires careful padding to avoid false sharing with other cache lines.
At minimum, you need either CAS (compare-and-swap) or LL/SC (load-linked/store-conditional). Without any atomic instruction, implementing lock-free synchronization is impossible—you cannot safely modify shared data across threads. Some CPUs (very rare in modern systems) have only `atomic_swap` without CAS; this can implement CAS but less efficiently. On x86, CAS is available as `lock cmpxchg`. On ARMv8, CAS is available as `cas` or `ldaxr`/`stlxr` pair. If your target architecture lacks atomic instructions entirely, you must use mutexes—the lock-free approach fundamentally requires hardware support. Note that not all lock-free algorithms need CAS—some need only load-linked/store-conditional, and some need only fetch-and-add (for counters). The absolute minimum is one atomic instruction that can do conditional write.
Further Reading
- Concurrency Fundamentals — The problem space and why synchronization is needed
- Mutex Implementation — How mutexes are implemented in userspace and kernel
- Semaphores — Counting semaphores for resource management
- Readers-Writer Locks — Optimizing for read-heavy workloads
- Lock-Free Structures — Advanced techniques for highly concurrent systems
Conclusion
Lock-free data structures change how you think about concurrency. Instead of coordinating through waiting, lock-free approaches let threads make progress independently, using atomic hardware instructions to serialize conflicting updates. This gives better latency and throughput under contention—but getting it right requires careful attention to memory ordering, ABA prevention, and memory reclamation.
The world of concurrent programming spans from simple mutexes to exotic wait-free algorithms. Lock-free structures sit toward the advanced end. They’re the right tool when blocking costs are genuinely unacceptable and your team can implement and verify the implementation correctly. For most applications, a well-designed lock-based structure with good contention characteristics will beat a poorly implemented lock-free alternative every time.
For continued learning, explore how real systems apply these concepts: the Linux kernel’s RCU implementation, the LMAX Disruptor pattern for high-throughput messaging, and research on hazard pointer reclamation schemes for userspace lock-free programming.
Category
Related Posts
ASLR & Stack Protection
Address Space Layout Randomization, stack canaries, and exploit mitigation techniques
Assembly Language Basics: Writing Code the CPU Understands
Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.
Boolean Logic & Gates
Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.