Memory Hierarchy: Registers to Storage
Understanding the memory hierarchy from registers to storage drives — speed vs capacity trade-offs, cache behavior, and why locality determines performance.
Memory Hierarchy: Registers to Storage
Every computation your machine performs lives and dies by the memory hierarchy. The CPU does not fetch data directly from your SSD — it pulls from a cascading ladder of memory types, each offering different trade-offs between speed, capacity, and cost. Understanding this hierarchy is not academic window dressing. It is the difference between code that flies and code that crawls.
When a processor needs to execute an instruction, it expects the operand to be available within a single cycle. DRAM cannot deliver that. Neither can your NVMe drive. The gap between CPU speed and memory speed has widened year over year, and the hierarchy exists precisely because no single memory technology can satisfy both the bandwidth demands and the capacity demands of a modern system simultaneously.
Introduction
When to Use / When Not to Use
The memory hierarchy is not something you “use” in the sense of calling an API. It is a structural reality you work with or against. You exploit it through your code’s locality patterns — how your data accesses cluster in time and space.
When you benefit from it:
- Sequential data access (scanning arrays, reading files) — hardware prefetchers detect stream patterns and stage data before you ask
- Hot data fitting in cache (working sets under a few MB) — cache hit rates approach 99% for well-structured access
- Recursive algorithms with stack-scoped data — L1 cache backs the call stack directly
When it fights you:
- Random access scattered across large address spaces — every miss costs a DRAM round-trip (~100ns)
- Data sets larger than last-level cache (LLC) — thrashing destroys throughput
- Poorly localized access in nested loops (e.g., column-major access to row-major data) — cache line utilization drops below 20%
Architecture or Flow Diagram
The hierarchy is a layered model where each layer serves the layer above it as a faster, smaller backing store.
flowchart TD
CPU["CPU Core<br/>1-2 ns access"]
L1["L1 Cache<br/>~1-2 cycles<br/>32-64 KB"]
L2["L2 Cache<br/>~3-10 cycles<br/>256 KB - 1 MB"]
L3["L3 Cache<br/>~20-40 cycles<br/>Shared 8-64 MB"]
DRAM["DRAM<br/>~100 ns<br/>8-64 GB"]
SSD["NVMe SSD<br/>~100 μs<br/>256 GB - 2 TB"]
HDD["HDD<br/>~10 ms<br/>1-10 TB"]
CPU --> L1
L1 --> L2
L2 --> L3
L3 --> DRAM
DRAM --> SSD
SSD --> HDD
subgraph Locality["Temporal & Spatial Locality"]
L1L2["Data reused within<br/>L1/L2 residency time"]
DRAML["Data cached from DRAM<br/>before CPU asks"]
end
Each layer fills itself through a combination of temporal locality (reusing recently accessed data) and spatial locality (prefetching adjacent addresses).
Core Concepts
Registers: The Zero Layer
CPU registers are the only memory the ALU operates on directly. x86-64 has 16 general-purpose registers, each 64 bits wide. ARM has 31 general-purpose registers. Access is sub-nanosecond, and the operand bottleneck is purely in the fetch-decode-execute pipeline — not memory latency.
The compiler’s register allocator is the first layer of the memory hierarchy management. When you declare a variable in C, the compiler decides whether it lives in a register for the duration of a basic block or gets spilled to the stack.
Cache Hierarchy: L1, L2, L3
Modern CPUs use a multi-level cache hierarchy. The Intel Core i7-12700 has:
- L1d: 48 KB per core, ~4 cycle latency
- L2: 512 KB per core, ~12 cycle latency
- L3: 25 MB shared, ~20-40 cycle latency
AMD’s Zen 3 architecture uses similar tiers but with different latency characteristics due to the Infinity Fabric interconnect.
Cache lines are the unit of transfer — typically 64 bytes on x86 and ARM. When the CPU reads one byte from DRAM, the memory controller pulls in the entire cache line containing that address. This is why sequential access to array elements is dramatically faster than strided access — each DRAM fetch services multiple array elements in one shot.
Main Memory: DRAM
Dynamic RAM stores each bit as a charge on a capacitor, requiring periodic refresh (~64 ms interval). The cell design is minimal — one capacitor and one transistor per bit — enabling massive density. But reading requires detecting tiny charge differences, which introduces the ~100 ns read latency.
DDR5 DRAM transfers 64 bytes per cycle at 4800 MT/s, yielding ~384 GB/s theoretical bandwidth on a single channel. Dual-channel configurations double that. Memory bandwidth is often the ceiling for streaming workloads.
Storage Hierarchy: SSD and HDD
NVMe SSDs using NAND flash sit in the hierarchy as persistent storage with DRAM-like interface latency but flash-like write characteristics. NAND flash pages (~16 KB) can only be written after being erased (in blocks of ~256 pages). This erase-before-write behavior is hidden by the SSD controller’s firmware, which manages mapping between logical block addresses and physical flash pages.
HDDs remain relevant for cold storage due to cost per terabyte. The disk’s mechanical nature — rotating platters at 5400-7200 RPM and actuator arm movement — produces seek times measured in milliseconds. Sequential throughput is respectable (~100-200 MB/s), but random I/O is atrocious (~100 IOPS).
Production Failure Scenarios
Cache Thrashing in High-Concurrency Workloads
Failure: Multiple threads accessing data that maps to the same cache set. Cache associativity limits how many unique addresses can be cached simultaneously. When threads fight over the same sets, effective cache hit rates collapse to near zero despite having adequate overall capacity.
Mitigation: Use numactl or libnuma to bind threads to specific NUMA nodes, reducing cross-node memory traffic. Profile with perf stat -e cache-misses,cache-references to identify hot lines. Pad critical data structures to spread them across cache sets.
Row Hammer and Bit Flips
Failure: Repeated activation and deactivation of DRAM rows causes electromagnetic interference that flips bits in adjacent rows. attackers can deliberately trigger row hammer to corrupt memory, bypassing software protections.
Mitigation: Enable Target Row Refresh (TRR) in the DRAM controller. Use memory that supports on-chip ECC. Some workloads require disabling DRAM auto-refresh in favor of software-managed refresh cycles. Cloud providers offer instance types with ECC memory as a standard feature.
SSD Write Amplification
Failure: The SSD controller may write more flash than the host requests due to garbage collection and wear leveling. A 4 KB write can trigger writing multiple 256 KB erase blocks, dramatically accelerating flash wear. In write-heavy workloads, SSDs can fail within months.
Mitigation: Monitor SMART attributes (nvme smart-log /dev/nvme0n1). Use filesystems with discard mount option for thin-provisioned SSDs. Choose enterprise SSDs with higher write endurance ratings (DWPD — drive writes per day). For extreme write workloads, consider persistent memory (Intel Optane) as a write buffer.
Trade-off Table
| Property | Registers | L1/L2 Cache | DRAM | NVMe SSD | HDD |
|---|---|---|---|---|---|
| Access Time | ~0.25 ns | 1-15 ns | ~100 ns | ~100 μs | ~10 ms |
| Capacity | 64 bytes/core | 32 KB - 64 MB | 8-64 GB | 256 GB - 2 TB | 1-10 TB |
| Bandwidth | ~1000 GB/s | 100-500 GB/s | 50-100 GB/s | 3-7 GB/s | 100-200 MB/s |
| Volatility | Volatile | Volatile | Volatile | Persistent | Persistent |
| Cost/GB | N/A (integrated) | N/A (integrated) | $3-5/GB | $0.08-0.15/GB | $0.02-0.04/GB |
| Power/GB | Negligible | Negligible | ~0.5 W/GB | ~3-5 W/TB | ~5-10 W/TB |
Implementation Snippets
Cache Line Access Pattern (C)
#include <stdint.h>
#include <time.h>
#define N 1024 * 1024
#define STRIDE 16
// This demonstrates the dramatic performance difference
// between sequential access (cache-friendly) and strided access
double benchmark_array_access(int64_t* arr, int stride, int n) {
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
int64_t sum = 0;
for (int i = 0; i < n; i += stride) {
sum += arr[i]; // Sequential: high cache hit rate
}
clock_gettime(CLOCK_MONOTONIC, &end);
double elapsed = (end.tv_sec - start.tv_sec) +
(end.tv_nsec - start.tv_nsec) / 1e9;
return elapsed;
}
// On a modern CPU with 64-byte cache lines:
// Sequential access (stride=1): ~0.5-1 ns per element
// Strided access (stride=16): ~50-100 ns per element (DRAM round-trip)
Cache Size Detection (Python)
#!/usr/bin/env python3
"""Detect L1, L2, L3 cache sizes on Linux via sysfs."""
import os
import pathlib
def get_cache_size(level: int, cache_type: str = "Size") -> int:
"""Read cache size in bytes from sysfs."""
base = pathlib.Path(f"/sys/devices/system/cpu/cpu0/cache")
for entry in base.iterdir():
if entry.is_dir():
index_file = entry / "level"
type_file = entry / "type"
if index_file.exists() and type_file.exists():
level_val = int(index_file.read_text().strip())
type_val = type_file.read_text().strip()
if level_val == level and type_val == cache_type:
size_str = (entry / "size").read_text().strip()
# Size is reported as "32 KB" or similar
multiplier = 1024 if "KB" in size_str else 1
if "MB" in size_str:
multiplier = 1024 * 1024
return int(size_str.rstrip(" KMGT")) * multiplier
return 0
# Example output:
# L1 dcache: 32768 bytes (32 KB)
# L2 cache: 524288 bytes (512 KB)
# L3 cache: 16777216 bytes (16 MB)
Observability Checklist
- Cache hit/miss ratios:
perf stat -e cache-misses,cache-references ./program - Last-level cache (LLC) miss rate: Intel PCM, AMD μProf, or Linux
perf stat - Memory bandwidth utilization:
intel_pcm_bandwidthorlikwid - DRAM page activation rates:
ipmctlfor Intel Optane DCPMM - NVMe SMART health:
nvme smart-log /dev/nvme0n1 --namespace-id=1 - Disk I/O latency distribution:
iostat -x 1orblktrace - Page fault rate:
vmstat 1—incolumn for interrupts from I/O - NUMA topology:
lscpuornumactl --hardware - Core to memory controller latency:
perf sched latencyfor scheduling events
Common Pitfalls / Anti-Patterns
Cache Timing Side Channels: Spectre, Meltdown, and subsequent variants exploit the memory hierarchy to leak data across security boundaries. The attack triggers speculative execution to cache a target line, then measures access time to infer whether the line was cached — revealing a bit of secret data. Mitigations include:
- IBRS/STIBP (Indirect Branch Restricted Speculation / Speculative Tracking Indirect Branch Prediction): kernel isolation of branch prediction state
- L1D flush on kernel entry:
l1d_flush=onkernel parameter forces L1D cache flush on context switch - Software mitigations:
__attribute__((__target__("arch=zen3")))for AMD Zen 3 with direct-walk optimization disabled
Memory Disclosure via Cold Boot Attacks: RAM retains data for seconds to minutes after power loss, enabling an attacker with physical access to read sensitive data from DRAM. Mitigations include full-disk encryption (LUKS with TPM unlock) and setting memory wipe on shutdown.
PCIe bus snooping: Memory traffic on the memory bus can be observed via PCIe traffic analysis. Use memory encryption features like AMD SEV (Secure Encrypted Virtualization) for sensitive workloads.
Common Pitfalls / Anti-patterns
Pitfall: Confusing cache associativity with cache size. A 1 MB L2 cache with 8-way set associativity has 8192 sets. If your working set maps to the same set index, you effectively have only 8 cache lines despite 1 MB of total capacity. This is the “cache set collision” problem.
Pitfall: Ignoring write-through cache behavior in multi-core systems. When core 0 writes to a cache line that is shared (read-only or read-write state) with core 1’s L1 cache, the cache coherence protocol invalidates core 1’s copy. Subsequent reads by core 1 cause expensive cache line evictions and re-fetches. This is why false sharing — two unrelated variables on the same cache line, written by different threads — devastates multi-threaded performance.
Pitfall: Using malloc/free on large objects as if they are free.
The allocator rounds up to the nearest cache line or page. Allocating and freeing many small objects fragments the heap and increases TLB pressure.
Anti-pattern: Blindly using memset on large arrays.
memset touches every byte sequentially, which plays perfectly into hardware prefetching. But if the array is larger than available cache, the write stream saturates memory bandwidth and starves other memory operations.
Quick Recap Checklist
- The memory hierarchy exists because no single memory technology satisfies both speed and capacity requirements simultaneously
- Cache lines (64 bytes) are the fundamental unit of cache transfer — sequential access maximizes line utilization
- L1/L2/L3 caches are SRAM-based, fast, small, and per-core or shared; DRAM is slower, larger, and system-wide
- Temporal locality = reusing recently accessed data; spatial locality = accessing nearby addresses
- NUMA (Non-Uniform Memory Access) means memory attached to distant CPUs has higher latency
- Row hammer exploits electromagnetic interference between DRAM rows to flip bits
- False sharing (unrelated variables on the same cache line, written by different threads) collapses multi-core performance
perf statwith cache-misses and cache-references events reveals how well your workload utilizes the cache hierarchy- SSD write amplification can accelerate flash wear beyond specifications — monitor SMART health attributes
- Speculative execution side channels (Spectre/Meltdown) exploit cache timing to leak data across privilege boundaries
Memory Hierarchy Trade-off Analysis
| Strategy | When It Wins | When It Loses |
|---|---|---|
| Sequential prefetch | Array scans, file I/O, streaming | Random access, pointer chasing, trees |
Software prefetch (__builtin_prefetch) | Known access patterns, loop nest optimization | Unpredictable branches, too early (evicted before use) |
| Huge pages (2 MB) | Large working sets (databases, JVM), TLB pressure | Small workloads, increased internal fragmentation |
| Write-back vs write-through | Write-back: repeated writes to same line; write-through: durability-critical | Write-through: read-heavy or bandwidth-limited workloads |
| NUMA-aware allocation | Multi-socket servers, large in-memory databases | Single-socket systems, small workloads |
| Cache line padding | Multi-threaded counters, hot data structures | Memory-bounded workloads, embedded systems |
Interview Questions
The memory controller fetches an entire cache line (64 bytes on x86/ARM) per DRAM access. Sequential access means each subsequent element is already in the cache line fetched for the previous element — one DRAM round-trip services dozens of array elements. Random access across a large array means each element lives on a different cache line, requiring a full DRAM round-trip (~100 ns) per element. At 100 ns per access, a random walk over 1 million elements takes 100 ms, while sequential access might take 5-10 ms for the same workload.
False sharing occurs when two threads modify variables that happen to reside on the same cache line, but those variables are otherwise logically unrelated. The cache coherence protocol (e.g., MESI) invalidates one thread's cache line copy every time the other thread writes, forcing constant cache line transfers between cores. Detection involves profiling with hardware counters: on Intel, look for llc_misses alongside llc_references that do not correspond to actual sharing patterns in your data structures. Tools like perf stat -e cache-misses,cache-references show the raw numbers. Fixing it involves padding critical data structures so each variable occupies its own cache line.
Temporal locality means recently accessed data will be accessed again soon — the basis for cache retention. Spatial locality means accessing one address predicts you'll access nearby addresses — the basis for cache line prefetching. Modern CPUs hardware prefetchers are excellent at exploiting spatial locality: detecting sequential streams and loading adjacent cache lines before they're requested. However, temporal locality for pointer-chasing workloads (linked lists, trees, graphs) defeats hardware prefetchers because there's no detectable pattern — only software prefetching with explicit __builtin_prefetch() hints or data structure redesign (e.g., struct of arrays instead of array of structs) can help. The CPU's ability to keep a cache line hot in L1 depends on how recently it was used versus how many other lines compete for the limited ways — highly associative caches and careful data layout determine whether temporal locality translates to actual hits.
A cache line is the unit of data transfer between cache and main memory (typically 64 bytes on x86/ARM). Cache line alignment means structuring your data so that hot data fits within cache line boundaries.
Consider a struct:
struct Foo {
char a; // 1 byte
long b; // 8 bytes
char c; // 1 byte
};
If this struct isn't padded, accessing b might span two cache lines — one cache miss loads 64 bytes of which only 8 are useful, and a second cache line must be fetched for the remaining 8 bytes. Proper alignment with padding ensures b sits on a single cache line.
Alignment also matters for vectorization: AVX-512 loads 64 bytes at a time, so data must be 64-byte aligned for peak throughput. Misaligned loads may require two memory transactions instead of one.
Key rules: Put frequently accessed data together, align hot structs to cache line size, use padding to prevent hot fields from spanning cache lines.
Physics and economics prevent a single large fast cache. SRAM cell size is ~6 transistors vs 1 transistor per DRAM cell — a 1 MB L1 cache costs roughly the same die area as 100 MB of DRAM. Larger caches require more set associative lookup, longer wire delays to tag and data arrays, and higher latency. The multi-level hierarchy balances these constraints: small L1 catches most hot accesses at lowest latency; larger L2 absorbs working set blips; L3 shares capacity across cores for less frequently accessed data. Each level multiplies the latency by roughly 3-5x, which is why missing L1 is 10-20x slower than hitting L1.
On multi-socket systems, each CPU socket has its own local DRAM attached via its own memory controller. Accessing memory attached to a remote socket requires crossing the inter-socket interconnect (Intel QPI, AMD Infinity Fabric), adding 30-50% latency. NUMA-aware applications use libnuma or numactl to allocate memory on the same node where the thread runs. The mbind() and set_mempolicy() system calls give fine-grained control over memory placement. Running PostgreSQL or in-memory databases on a NUMA system without NUMA awareness often yields 2-4x worse performance compared to NUMA-aware configurations.
Cache associativity determines how many different cache lines can occupy a given set index. A 4-way set associative cache means each set has 4 slots (ways) — if you access 5 addresses that map to the same set, one line must be evicted regardless of total cache capacity. Direct-mapped caches have 1 way per set (simple but prone to collisions). Fully associative caches have unlimited ways (used in TLBs). Higher associativity reduces thrashing at the cost of more complex hardware for the replacement policy. Most L1/L2 caches are 8-way or 16-way set associative.
Hardware prefetchers monitor memory access patterns and trigger speculative loads when they detect regular strides. A simple next-line prefetcher activates when it sees sequential accesses to consecutive cache lines — it loads the next line before your code requests it. Stride prefetchers (Intel, AMD) track the distance between consecutive accesses to the same stream and prefetch ahead by that stride. However, pointer-based data structures (linked lists, trees, graphs) defeat hardware prefetchers because they have irregular access patterns — only software prefetching or careful data layout can help.
Cache set thrashing occurs when a workload's working set maps to the same cache set index repeatedly, causing constant eviction even though total cache usage is far below capacity. Each cache set has a fixed number of ways (slots). If your working set has more than N addresses that map to the same set (where N is the associativity), the (N+1)th line evicts the first — regardless of recency. Diagnosing it: Profile with perf stat -e l2_lines_in.all,l2_lines_out.all,l2_reject.l2_hits. High l2_lines_out.all with low l2_lines_in.all suggests the L2 cache is thrashing. Mitigation: change the data layout (different stride, power-of-two plus one for hash table probing), increase cache associativity (not possible in software), or use huge pages to reduce TLB pressure and improve spatial locality.
A cold L1 hit takes ~4 cycles (1 ns at 4GHz) — among the fastest possible operations. But a cold start (first access to a line not in L1) requires fetching from L2 (~12 cycles) or L3 (~20-50 cycles) or DRAM (~100 ns = 300+ cycles). For a hot loop accessing data already resident in L1, the CPU can sustain near-peak throughput. For a cold loop that touches new data on each iteration, the CPU is effectively running at memory bandwidth speed, not compute speed. The performance difference can be 50-100x: a loop fitting in L1 runs at full clock speed; a loop thrashing DRAM runs at DRAM bandwidth divided by working set size.
ILP (Instruction-Level Parallelism) means the CPU has multiple instructions in flight simultaneously in its pipeline, limited by data dependencies and branch mispredictions. Memory-level parallelism (MLP) means the CPU can issue multiple memory operations in parallel while waiting for slower operations to complete. On modern out-of-order cores, if a load misses L1, the CPU can issue a dependent computation a few cycles later (exploiting ILP) while the DRAM request is in flight (MLP). But if the working set doesn't fit in L2/L3 and all accesses are serially dependent (a linked list walk), neither ILP nor MLP can hide the latency.
Larger cache lines (128 bytes on IBM POWER, 64 bytes on x86/ARM) reduce tag array overhead (fewer entries to track for the same capacity), improve spatial locality for streaming workloads, and reduce the number of memory transactions for sequential access. Smaller cache lines (32 bytes on some ARM, older architectures) reduce the cost of a cache miss for random access workloads and improve latency. The choice reflects a trade-off: streaming workloads prefer larger lines; random access workloads prefer smaller lines.
Write-back caches delay writes to main memory until eviction — a write to the same cache line multiple times costs only one DRAM write instead of multiple. This dramatically reduces memory bandwidth for write-heavy workloads. Write-through caches write to both the cache and main memory immediately, so memory always holds the latest value. With write-back, a crash may lose writes buffered in the cache; with write-through, writes are durable as soon as the store instruction completes. For performance, write-back typically delivers 2-10x better write bandwidth than write-through on repeated writes to the same line.
SIMD (Single Instruction Multiple Data) instructions like AVX-512 process 64 bytes per instruction (512 bits / 8 bits). This is exactly one cache line per vector operation on x86 with 64-byte lines. For peak throughput, the data must be contiguous in memory and aligned to cache line boundaries — misaligned loads cause extra memory transactions. Loop vectorization by the compiler works best when the loop iterates over arrays in sequential order with no conditional branches depending on data values. If your data is scattered (struct of arrays vs array of structs), vectorization achieves low arithmetic intensity and the CPU spends more time waiting for data than computing.
A cache miss occurs when the CPU requests data that is not present in the cache. The three C's classify misses: Compulsory (cold) misses occur on first access to data that has never been loaded — unavoidable even with infinite cache. Capacity misses occur when the working set exceeds cache size; the line was evicted to make room for other data. Conflict misses occur in set-associative or direct-mapped caches when multiple addresses map to the same set and evict each other before capacity is exhausted. Reducing compulsory misses requires prefetching; capacity misses require larger caches or smaller working sets; conflict misses require higher associativity or better data layout.
Disabling L1 effectively forces every access to hit L2 or beyond. With L1 latency ~4 cycles and L2 ~12 cycles, you get a ~3x slowdown per access. But L2 capacity is limited — a working set that fits in L1 (32-48 KB) may overflow L2 (256-512 KB), causing DRAM hits. With L1 working: ~1ns per access. Without L1 (L2 only): ~3-5ns per access. Without L1 or L2 (DRAM): ~100ns per access. The real-world slowdown for a bandwidth-bound workload is 50-100x, not 3x, because DRAM bandwidth is finite and the CPU's out-of-order engine cannot hide 100ns of latency without independent memory operations in flight.
AMD Zen 3 has a unified 32 KB L1 instruction cache and 32 KB L1 data cache per core with ~4 cycle latency, same as Intel Skylake. L2 is 512 KB per core on both (~12 cycles). Key differences: Zen 3 L3 is 32 MB per CCD (Core Complex Die), shared among 4 cores, ~40 cycles. Intel Skylake's L3 is 1.25 MB per core (up to 2.5 MB on some SKUs), ~50 cycles. Zen 3's larger shared L3 provides better data sharing across cores in a CCX but introduces inter-CCX latency (~70 ns cross-socket). Intel's per-core L3 reduces shared capacity but eliminates cross-CCX latency. For workloads that fit in L3, Intel's architecture can be faster; for large shared working sets, Zen 3 wins.
Modern memory controllers (integrated into the CPU) handle multiple outstanding requests simultaneously using a queue. When the CPU issues a load that misses L1 and L2, the request goes to the memory controller's queue. If the queue has multiple requests, the controller can batch them into a single DRAM burst (precharge + activate + read commands optimized for sequential addresses). This queuing allows the out-of-order engine to issue many memory operations before stalling, exploiting memory-level parallelism. However, if the queue fills (too many outstanding misses), the CPU backpressures. On a random-access workload with deep call stacks, the queue saturates and effective latency increases because requests wait in queue before being serviced.
ECC (Error-Correcting Code) memory detects and corrects single-bit flips by storing extra parity bits (8 bits per 64 bytes = ~1.25% overhead). CPU caches (L1/L2/L3) use parity or ECC internally to detect (and in some cases correct) transient errors caused by cosmic ray strikes or electrical noise. When a cache detects an ECC error, it can correct single-bit errors on eviction (write-back) and signal a machine check exception for uncorrectable errors. ECC memory at the DIMM level protects against errors that occur between the CPU and memory — including bus errors, row hammer bit flips, and DRAM cell failures. Cloud providers like AWS offer ECC as standard on memory-optimized instances (r5, x1e).
Memory channels are independent 64-bit buses between the memory controller and DIMM slots. A single-channel configuration connects one 64-bit bus to one or more DIMMs, delivering the raw bandwidth of one channel (e.g., ~50 GB/s for DDR5-4800). Dual-channel mode connects two independent 64-bit buses simultaneously, effectively doubling the bus width to 128 bits — the memory controller can service two requests in parallel, nearly doubling effective bandwidth to ~96 GB/s for the same DDR5-4800 chips. The key insight is that dual channel is not just twice the capacity but twice the peak bandwidth, because the memory controller's request queue can interleave requests across both channels for independent addresses. Mismatched DIMM sizes or configurations (e.g., 8 GB + 16 GB) may fall back to asymmetric dual channel or single-channel mode, losing half the theoretical bandwidth. Memory bandwidth is often the ceiling for streaming workloads (video encoding, scientific computing) — a 2x bandwidth gain from dual channel can yield nearly 2x throughput for memory-bandwidth-bound loops.
Further Reading
CPU Cache Architecture Deep Dive
Modern CPUs implement a hierarchical cache architecture with distinct characteristics:
L1 Cache (Level 1)
- Split into L1 instruction cache (L1I) and L1 data cache (L1D)
- Size: 32-48 KB per core on Intel/AMD
- Latency: ~4 cycles
- Associativity: 8-way set associative typical
- Cache line: 64 bytes
L2 Cache
- Unified (instruction + data)
- Size: 256 KB - 1 MB per core
- Latency: ~12 cycles (Intel Haswell)
- Often inclusive of L1 (meaning L1 lines are also in L2)
L3 Cache (Last Level Cache)
- Shared across all cores on the socket
- Size: 8-64 MB depending on SKU
- Latency: 20-50 cycles
- Usually exclusive (L1/L2 not duplicated here)
- On AMD Zen 2/3: separate CCX-chips means L3 is per-CCX, not unified across socket
Cache Hierarchies in ARM ARM Cortex-A72 (used in AWS Graviton2, Apple A-series): 64-byte lines, 3-level hierarchy with L1 32-48 KB per core, L2 512 KB per core, L3 up to 4 MB cluster-level.
Memory Latency Numbers (2024)
| Component | Latency | Bandwidth (single channel) |
|---|---|---|
| L1 cache | ~1 ns (4 cycles @ 4GHz) | ~1 TB/s |
| L2 cache | ~3-5 ns (12 cycles) | ~500 GB/s |
| L3 cache | ~10-20 ns | ~200-400 GB/s |
| DRAM | ~100 ns | ~50 GB/s (DDR5-4800) |
| NVMe SSD | ~100 μs | ~3-7 GB/s |
| HDD | ~10 ms | ~100-200 MB/s |
Cache Line Alignment and Performance
Misaligned access across cache line boundaries can cause a single access to trigger two cache line loads. Compiler auto-vectorization (SIMD) works best when data is cache-line-aligned and accessed in sequential patterns.
// Cache-friendly: sequential access, aligned structures
struct packet {
uint64_t timestamp;
uint32_t flags;
uint32_t length;
uint8_t data[60]; // 64-byte cache line size
} __attribute__((aligned(64))); // Force cache line alignment
Key Takeaways
- L1/L2/L3 caches are transparent to software but their behavior determines effective memory latency
- Cache lines are 64 bytes on x86/ARM; accessing adjacent data in sequence maximizes line utilization
- Multi-level cache hierarchy reflects physics: larger caches are slower
- NUMA topology means remote DRAM access costs 30-50% more than local
- Row hammer is a real hardware vulnerability affecting DRAM; TRR and ECC provide mitigation
Conclusion
The memory hierarchy is a structural reality every program operates within — registers, caches, DRAM, and storage each offer different trade-offs between speed, capacity, and cost. No single memory technology can satisfy both CPU speed demands and the capacity needs of modern workloads simultaneously.
Sequential data access exploits spatial locality, letting one DRAM fetch service multiple array elements via the 64-byte cache line. Temporal locality means recently accessed data stays close to the CPU in faster, smaller caches. When working sets exceed cache capacity, the result is thrashing — performance collapses as the system spends more time moving data than processing it.
False sharing between threads on the same cache line can degrade multi-core performance by 10-100x. NUMA awareness matters for workloads spanning multiple CPU sockets. Speculative execution side channels (Spectre/Meltdown) exploit the hierarchy to leak data across security boundaries.
For your next step, explore paging and page tables to understand how the OS manages memory at the page level, or virtual memory to see how disk extends physical memory capacity when working sets exceed available RAM.
Category
Related Posts
ASLR & Stack Protection
Address Space Layout Randomization, stack canaries, and exploit mitigation techniques
Assembly Language Basics: Writing Code the CPU Understands
Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.
Boolean Logic & Gates
Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.