How Computers Work
Understanding the fundamental hardware architecture of computers — CPU, memory, I/O devices, and the bus architecture that connects them.
Introduction
Every click, every calculation, every character you type goes through a chain of hardware components working together. Understanding how computers work matters for debugging production issues, reasoning about performance, and designing systems that don’t fall over when you least expect it.
The modern computer has a few major parts: the CPU runs code, memory holds data while you work, and input/output devices let the machine talk to the outside world. The system bus connects everything, letting these components share data.
When to Use This Knowledge
Apply your understanding of computer architecture when:
- Debugging performance issues that don’t match software profiling
- Writing high-performance code where memory access patterns matter
- Designing systems that interact with hardware devices
- Troubleshooting mysterious kernel panics or driver failures
- Making procurement decisions about hardware specs
Skip the deep dive if:
- You’re doing pure web development with managed runtimes
- You’re just scripting with automatic memory management
Architecture Diagram
Here’s how the major components interconnect:
graph TB
subgraph "CPU"
A[Control Unit]
B[Arithmetic Logic Unit]
C[Registers]
A --> B
B --> C
end
subgraph "Memory Hierarchy"
D[L1 Cache]
E[L2 Cache]
F[L3 Cache]
G[Main Memory RAM]
H[Virtual Memory Swap]
D --> E
E --> F
F --> G
G --> H
end
subgraph "I/O Devices"
I[Disk Controller]
J[Network Controller]
K[USB Controller]
L[Graphics GPU]
end
M[System Bus] --> A
M --> D
M --> G
M --> I
M --> J
M --> K
M --> L
Core Concepts
The CPU: Brain of the Operation
The CPU executes instructions in a relentless fetch-decode-execute cycle. Modern CPUs contain:
Control Unit (CU) — Directs the orchestra. It fetches instructions from memory, decodes what they mean, and tells the ALU what to do. Think of it as the conductor who reads the sheet music and gestures to each section.
Arithmetic Logic Unit (ALU) — Does the actual math. Addition, subtraction, bitwise operations, comparisons — all happen here. It’s the strongest member of the team but only does exactly what it’s told.
Registers — Ultra-fast memory locations inside the CPU itself. General-purpose registers hold data being worked on. Special registers like the Program Counter (PC) track where the CPU is in the instruction stream, and the Stack Pointer (SP) manages the call stack.
The Cache Hierarchy — Main memory is slow. The CPU is fast. The solution: cache memory. L1 is smallest and fastest (typically 32-64KB per core), L2 is larger and slower, and L3 is shared across cores. When the CPU needs data, it checks L1 first, then L2, then L3, then main memory. Each miss costs orders of magnitude more time.
// Visualizing cache impact - accessing memory in patterns
// This code demonstrates why memory access patterns matter
#include <stdio.h>
#include <time.h>
#define SIZE 10000
// Slow: jumping around memory (cache misses)
double slow_access(double matrix[SIZE][SIZE]) {
double sum = 0.0;
for (int j = 0; j < SIZE; j++) {
for (int i = 0; i < SIZE; i++) {
sum += matrix[i][j]; // Column-major access on row-major data
}
}
return sum;
}
// Fast: sequential memory access (cache hits)
double fast_access(double matrix[SIZE][SIZE]) {
double sum = 0.0;
for (int i = 0; i < SIZE; i++) {
for (int j = 0; j < SIZE; j++) {
sum += matrix[i][j]; // Row-major access on row-major data
}
}
return sum;
}
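The gap between the two loops can be made visible without a profiler. The sketch below models a tiny direct-mapped cache and counts misses for both traversal orders. The direct-mapped design is a simplification (real caches are set-associative), and the cache size, matrix size, and function names are illustrative assumptions rather than measurements of real hardware.

```python
LINE_SIZE = 64      # bytes per cache line (typical value, assumed)
NUM_LINES = 512     # 512 lines * 64 B = 32 KiB, a plausible L1 size
ELEM_SIZE = 8       # sizeof(double)

def count_misses(addresses):
    """Count misses for a stream of byte addresses in a direct-mapped cache."""
    slots = [None] * NUM_LINES          # which memory line each slot holds
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE        # memory line containing this byte
        slot = line % NUM_LINES         # direct-mapped: exactly one candidate slot
        if slots[slot] != line:
            slots[slot] = line          # evict whatever was there
            misses += 1
    return misses

def row_major(n):
    """Addresses of matrix[i][j] visited row by row (the fast loop)."""
    return ((i * n + j) * ELEM_SIZE for i in range(n) for j in range(n))

def column_major(n):
    """Addresses of matrix[i][j] visited column by column (the slow loop)."""
    return ((i * n + j) * ELEM_SIZE for j in range(n) for i in range(n))

N = 256
row_misses = count_misses(row_major(N))
col_misses = count_misses(column_major(N))
print(f"row-major misses:    {row_misses} of {N * N} accesses")
print(f"column-major misses: {col_misses} of {N * N} accesses")
```

In this model the row-major walk misses once per 64-byte line, so one access in eight. The column-major walk misses every time, because rows 16 apart map to the same slot and evict each other before the next column is visited.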
Memory: The Workspace
RAM (Random Access Memory) — Volatile storage that holds data and code while the system runs. When power cuts, data vanishes. RAM is organized in rows and columns, with each cell containing a capacitor and a transistor. Reading or writing any cell takes the same time — hence “random access.”
Virtual Memory — The operating system’s sleight of hand. It makes every process believe it has the entire address space to itself. In reality, physical RAM is shared, and some data gets “paged out” to disk when pressure mounts. This mapping between virtual and physical addresses is handled by the Memory Management Unit (MMU).
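The translation step can be sketched in a few lines. This is a minimal model, assuming 4 KiB pages and a single-level page table held in a dictionary; real MMUs use multi-level tables in hardware, and the names `translate`, `PageFault`, and `page_table` are hypothetical.

```python
PAGE_SIZE = 4096  # 4 KiB pages, a common default (assumed here)

class PageFault(Exception):
    """Raised when a virtual page has no resident physical frame."""

def translate(vaddr, page_table):
    """Map a virtual address to a physical address via a flat page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # virtual page number + offset
    frame = page_table.get(vpn)
    if frame is None:
        # A real OS would now load the page from disk and retry the access.
        raise PageFault(f"virtual page {vpn} is not resident")
    return frame * PAGE_SIZE + offset        # offset is unchanged by translation

# Hypothetical mapping: virtual page 0 -> frame 5, virtual page 1 -> frame 2
page_table = {0: 5, 1: 2}
print(hex(translate(0x1004, page_table)))    # page 1, offset 4, lands in frame 2
```

Only the page number is translated; the offset within the page passes through untouched, which is why page size determines how many address bits the MMU has to look up.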
Memory-Mapped I/O — Some hardware devices are accessed by reading and writing to specific memory addresses. The CPU doesn’t distinguish between RAM and a device register — it just reads or writes an address. The hardware routes the access to the appropriate device.
The System Bus: Information Highway
The bus is a collection of wires carrying data, addresses, and control signals. Three main buses exist in typical architectures:
Data Bus — The wide highway for actual data transfer. A 64-bit data bus moves 8 bytes at a time.
Address Bus — Specifies where the data should go. The width of this bus determines how much memory the system can address. A 32-bit address bus can handle 4GB of memory; a 64-bit bus can handle way more than any physically installed RAM.
Control Bus — Carries timing and control signals: “read now,” “write now,” “memory is ready,” “device is busy.”
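The bus widths quoted above follow from simple arithmetic, sketched here. The helper names are illustrative, and the model assumes byte-addressable memory.

```python
GIB = 1024 ** 3

def addressable_bytes(address_bits):
    """Each extra address line doubles the number of distinct addresses."""
    return 2 ** address_bits

def bytes_per_transfer(data_bits):
    """A data bus moves one bit per line, so width / 8 bytes per transfer."""
    return data_bits // 8

print(addressable_bytes(32) // GIB, "GiB reachable with 32 address bits")
print(addressable_bytes(64) // GIB, "GiB reachable with 64 address bits")
print(bytes_per_transfer(64), "bytes per transfer on a 64-bit data bus")
```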
Production Failure Scenarios
Scenario 1: Cache Line Contention in Multi-threaded Systems
What happened: A trading platform experienced periodic latency spikes every 60 seconds despite having ample CPU headroom. Investigation revealed that threads on different cores were accessing data that shared the same cache line.
Root cause: Modern CPUs maintain cache coherence across cores, but when multiple cores write to the same cache line, they engage in “cache line ping-pong” — constantly invalidating and refreshing each other’s copies. This is called False Sharing.
Mitigation:
// Bad: Two threads fight over the same cache line
struct {
long counter_a; // Thread A writes this
long counter_b; // Thread B writes this
// These sit in the same 64-byte cache line!
} shared;
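// Good: Align and pad each counter to its own cache line.
// _Alignas requires C11; 64 bytes is the typical x86 cache line size.
struct {
    _Alignas(64) long counter_a;    // Start on a cache-line boundary
    char pad[64 - sizeof(long)];    // Fill the rest of the cache line
} thread_a_data;
struct {
    _Alignas(64) long counter_b;
    char pad[64 - sizeof(long)];
} thread_b_data;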
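Whether two fields share a cache line comes down to their byte offsets. The following sketch mirrors the idea with `ctypes`, using a single padded struct for simplicity. It assumes 64-byte lines and a struct that starts on a line boundary; the class and helper names are hypothetical.

```python
import ctypes

CACHE_LINE = 64  # bytes; typical for x86-64, assumed here

class Shared(ctypes.Structure):
    # Both counters packed next to each other
    _fields_ = [("counter_a", ctypes.c_int64),
                ("counter_b", ctypes.c_int64)]

class Padded(ctypes.Structure):
    # 56 bytes of padding push counter_b to offset 64
    _fields_ = [("counter_a", ctypes.c_int64),
                ("pad", ctypes.c_char * 56),
                ("counter_b", ctypes.c_int64)]

def line_of(struct_type, field):
    """Index of the cache line a field falls in, relative to the struct start."""
    return getattr(struct_type, field).offset // CACHE_LINE

for struct_type in (Shared, Padded):
    a = line_of(struct_type, "counter_a")
    b = line_of(struct_type, "counter_b")
    verdict = "same line (false sharing risk)" if a == b else "separate lines"
    print(f"{struct_type.__name__}: counter_a in line {a}, counter_b in line {b}: {verdict}")
```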
Scenario 2: NUMA-Aware Memory Allocation Failures
What happened: A database server running on a 4-socket NUMA system showed half the expected performance. The database was spawning worker threads that allocated memory locally, but queries that joined across partitions caused remote memory accesses.
Root cause: Non-Uniform Memory Access (NUMA) means memory attached to CPU 0 is faster to access from CPU 0 than from CPU 3. The database wasn’t NUMA-aware.
Mitigation:
# Check NUMA topology on Linux
numactl --hardware
# Shows: available: 4 nodes (0-3), and which node each CPU/memory bank belongs to
# Run a process with memory locality preferences
numactl --membind=0 --cpunodebind=0 mydatabase
Scenario 3: I/O Device Interrupt Storms
What happened: A web server became unresponsive every 15 minutes. The kernel showed thousands of interrupts per second from a malfunctioning network card.
Root cause: A network card stuck in a loop was generating interrupts for every incoming packet, starving the CPU of cycles for actual work. The system spent more time handling interrupts than doing useful work.
Mitigation: Use interrupt coalescing settings on network cards, configure interrupt affinity to spread handling across cores, and monitor /proc/interrupts for anomalies.
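Monitoring /proc/interrupts for anomalies mostly means comparing two snapshots. Here is a minimal sketch of that comparison. The sample text is made up for illustration, and the 10,000 interrupts per second threshold is an assumption, not a recommended value.

```python
def parse_interrupts(text):
    """Return {irq_label: total count across CPUs} from /proc/interrupts text."""
    lines = text.strip().splitlines()
    num_cpus = len(lines[0].split())        # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for field in rest.split()[:num_cpus]:
            if not field.isdigit():         # reached the description columns
                break
            counts.append(int(field))
        totals[label.strip()] = sum(counts)
    return totals

def interrupt_rates(before, after, seconds):
    """Interrupts per second for each IRQ between two snapshots."""
    return {irq: (after[irq] - before.get(irq, 0)) / seconds for irq in after}

# Made-up snapshots taken one second apart
SNAPSHOT_T0 = """
           CPU0       CPU1
 16:       1000        200   IO-APIC   16-fasteoi   eth0
 17:         50         50   IO-APIC   17-fasteoi   ahci
"""
SNAPSHOT_T1 = """
           CPU0       CPU1
 16:      51000        200   IO-APIC   16-fasteoi   eth0
 17:         60         55   IO-APIC   17-fasteoi   ahci
"""

STORM_THRESHOLD = 10_000  # interrupts/sec; hypothetical cutoff
rates = interrupt_rates(parse_interrupts(SNAPSHOT_T0),
                        parse_interrupts(SNAPSHOT_T1), seconds=1)
storming = [irq for irq, rate in rates.items() if rate > STORM_THRESHOLD]
print("suspect IRQs:", storming)
```

On a live system the two snapshots would come from reading /proc/interrupts twice with a sleep in between. Note that all of IRQ 16's growth lands on CPU0 in the sample, which is also the pattern that interrupt affinity tuning addresses.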
Trade-off Table
| Aspect | CPU-Intensive Workload | I/O-Intensive Workload | Balanced Workload |
|---|---|---|---|
| Core Count vs. Clock Speed | Fewer, faster cores (3-4 GHz) | Many, slower cores can be better | Moderate counts, good single-thread performance |
| Memory | Less RAM needed, fast L3 critical | Large RAM to absorb I/O bursts | Size for working set, fast memory still matters |
| Storage | NVMe for fast swap/page handling | RAID of fast SSDs, focus on IOPS | Single fast NVMe, or modest RAID |
| Network | 10GbE if data transfer is bottleneck | Multiple 1GbE channels, or 25GbE | 10GbE generally sufficient |
| Cache Behavior | Minimize misses, keep data hot | Large caches help absorb bursts | Profile to find actual hot set |
Implementation Snippets
Reading CPU Information in Linux
#!/bin/bash
# cpu_info.sh - Gather CPU information for diagnostics
echo "=== CPU Information ==="
cat /proc/cpuinfo | grep "model name" | head -1
echo "Core count: $(nproc)"
echo ""
echo "=== Cache Sizes ==="
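# getconf exposes cache sizes under these glibc variable names
for name in LEVEL1_DCACHE_SIZE LEVEL1_ICACHE_SIZE LEVEL2_CACHE_SIZE LEVEL3_CACHE_SIZE; do
    size=$(getconf "$name" 2>/dev/null || echo "N/A")
    echo "$name: $size bytes"
done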
echo ""
echo "=== Current CPU Frequencies ==="
cpufreq-info 2>/dev/null | grep "current policy" || cat /proc/cpuinfo | grep "cpu MHz"
echo ""
echo "=== Memory Information ==="
free -h
echo ""
echo "=== NUMA Topology ==="
numactl --hardware 2>/dev/null || echo "NUMA tools not available"
Memory-Mapped I/O Access Pattern
#!/usr/bin/env python3
"""
Demonstrates memory-mapped I/O for high-performance file access.
Maps a file directly into the process's virtual address space.
"""
import mmap
import os
def memory_mapped_read(filepath: str, pattern: bytes) -> list[int]:
    """
    Find all occurrences of a byte pattern in a file using memory mapping.
    This is significantly faster than reading in chunks for large files.
    """
    offsets = []
    with open(filepath, 'rb') as f:
        # Memory-map the file (read-only)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Search for pattern
            start = 0
            while True:
                pos = mm.find(pattern, start)
                if pos == -1:
                    break
                offsets.append(pos)
                start = pos + 1
    return offsets

# Usage
if __name__ == "__main__":
    import sys
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <filepath>")
        sys.exit(1)
    # Find null bytes as an example
    null_offsets = memory_mapped_read(sys.argv[1], b'\x00')
    print(f"Found {len(null_offsets)} null bytes in {sys.argv[1]}")
Observability Checklist
When debugging computer architecture issues, monitor these:
CPU Metrics:
- Per-core utilization (mpstat -P ALL 1)
- Context switches per second (vmstat 1)
- Interrupt rate (cat /proc/interrupts)
- CPU frequency scaling states (cpufreq-info)
- Cache hit/miss ratios (CPU performance counters)
Memory Metrics:
- Available memory (free -h)
- Swap usage and swap-in/swap-out rates (vmstat 1)
- NUMA hit/miss ratios (numastat)
- Memory pressure indicators (cat /proc/pressure/*)
- Huge pages utilization (cat /proc/meminfo | grep -i huge)
I/O Metrics:
- Disk I/O operations per second (iostat -xz 1)
- I/O wait percentage (high value indicates storage bottleneck)
- Network interface errors (ip -s link show)
- Interrupt distribution across cores (cat /proc/irq/*/smp_affinity)
Alerting Thresholds:
- CPU: Alert if any core sustains >90% for >5 minutes
- Memory: Alert if available <20% of total
- I/O wait: Alert if sustained >20%
- Context switches: Alert if >50,000/sec sustained
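The thresholds above translate directly into a small check. This is a sketch only: the metric names and the shape of the sample dictionary are assumptions, and "sustained" is approximated as "true for every sample in a window" rather than tied to a specific monitoring system.

```python
THRESHOLDS = {
    "core_util_pct": 90.0,          # alert when any core exceeds this
    "mem_available_pct": 20.0,      # alert when available memory drops below this
    "iowait_pct": 20.0,             # alert when I/O wait exceeds this
    "ctx_switches_per_sec": 50_000, # alert when context switches exceed this
}

def check_sample(sample):
    """Return the names of the alerts that a single metrics sample triggers."""
    alerts = []
    if max(sample["core_util_pct"]) > THRESHOLDS["core_util_pct"]:
        alerts.append("cpu")
    if sample["mem_available_pct"] < THRESHOLDS["mem_available_pct"]:
        alerts.append("memory")
    if sample["iowait_pct"] > THRESHOLDS["iowait_pct"]:
        alerts.append("iowait")
    if sample["ctx_switches_per_sec"] > THRESHOLDS["ctx_switches_per_sec"]:
        alerts.append("context_switches")
    return alerts

def sustained(samples, name):
    """True only if every sample in the window triggers the named alert."""
    return bool(samples) and all(name in check_sample(s) for s in samples)

# Hypothetical sample: one hot core, low memory, healthy I/O
sample = {
    "core_util_pct": [35.0, 97.5, 40.0, 22.0],
    "mem_available_pct": 12.0,
    "iowait_pct": 3.0,
    "ctx_switches_per_sec": 8_000,
}
print(check_sample(sample))
```

Checking the maximum across cores rather than the average matters here: a single saturated core can bottleneck a system whose overall utilization looks comfortable.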
Security Notes
Spectre and Meltdown Vulnerabilities — Modern CPUs’ speculative execution creates side channels that can leak data across security boundaries. Kernel page table isolation (KPTI) and speculative store bypass disable features mitigate these at a performance cost.
DMA Attacks — Devices with direct memory access can read/write arbitrary physical memory. Enable IOMMU (Input-Output Memory Management Unit) in BIOS and verify intel_iommu=on or amd_iommu=on in kernel parameters.
Rowhammer — Repeated memory access can flip bits in adjacent rows. Enable target row refresh (TRR) and consider ECC memory for high-security environments.
Firmware Security — The CPU microcode can be updated to patch vulnerabilities. Keep firmware updated and verify boot integrity with Secure Boot.
Common Pitfalls / Anti-patterns
Anti-pattern: Ignoring NUMA topology
Placing workloads on the wrong NUMA node causes 2-10x memory latency increases. Always use numactl or taskset for performance-critical applications on multi-socket systems.
Anti-pattern: Overcommitting Memory
Promising more memory to processes than physically available leads to thrashing. Monitor vm.overcommit_memory settings and the ratio of swap to RAM.
Anti-pattern: Assuming Cache Coherence is Free
Multi-core systems must maintain cache coherence through the cache coherency protocol. This creates implicit serialization points that can destroy parallelism.
Anti-pattern: Neglecting Interrupt Affinity
Default interrupt routing can concentrate all I/O handling on CPU 0, creating a bottleneck. Configure irqbalance or manually set smp_affinity for critical interrupts.
Quick Recap Checklist
- The CPU operates in a fetch-decode-execute cycle, with registers providing the fastest storage
- Cache memory bridges the speed gap between CPU and main memory; cache misses cost hundreds of cycles
- Virtual memory extends physical RAM using disk, but page faults are expensive
- The system bus has separate data, address, and control lines connecting all components
- Multi-socket systems have NUMA characteristics — local memory is faster than remote memory
- Cache line contention between cores causes “false sharing” and performance degradation
- I/O device performance depends heavily on interrupt handling efficiency
- Monitor CPU, memory, and I/O metrics holistically — bottlenecks often appear where you don’t expect them
Interview Questions
When a CPU needs data that's not in L1 cache, it checks L2, then L3. If the data isn't in any cache level (a last-level cache miss), the CPU must fetch it from main memory. This takes approximately 100-300 nanoseconds, compared to 1 nanosecond for an L1 hit. The CPU typically stalls during this retrieval, though modern CPUs employ out-of-order execution to do other useful work while waiting. The data then moves up the cache hierarchy, potentially evicting other data to make room.
In von Neumann architecture, code and data share the same memory and bus. The CPU fetches instructions and data over the same pathway, creating the famous "von Neumann bottleneck" — the CPU can either fetch an instruction or access data, but not both simultaneously. Harvard architecture has separate memory spaces and buses for instructions and data, allowing simultaneous access. Modern CPUs use a modified Harvard architecture internally with separate L1 instruction and data caches, while presenting a unified memory view to software.
A memory barrier is an instruction that enforces ordering constraints on memory operations. Without barriers, CPUs and compilers may reorder memory accesses for performance optimization. A store barrier ensures that writes issued before the barrier become visible before any writes issued after it. A load barrier ensures that loads issued after the barrier are not satisfied ahead of loads issued before it. In C/C++, barriers are available through the GCC builtin __sync_synchronize() and through C11 atomics with memory_order specifications. They're essential in multi-threaded code for establishing happens-before relationships.
Interrupt latency is the time between hardware interrupt assertion and the first instruction of the interrupt handler. Causes include:
- Current code holding interrupts disabled — longer critical sections reduce responsiveness
- Cache misses — the interrupt handler code must be loaded into cache
- Multi-core coordination — some interrupts must be handled on specific cores
- Device driver design — poorly written drivers may delay interrupt acknowledgment
Mitigations include using threaded interrupt handlers, assigning interrupts to specific cores with IRQ affinity, keeping interrupt handlers short, and using MSI-X interrupts for better scaling.
Virtual memory creates an abstraction layer between logical addresses used by programs and physical addresses in RAM. The Memory Management Unit (MMU) uses a page table to translate virtual addresses to physical addresses. Pages not in RAM are marked as "not present" in the page table; accessing them triggers a page fault, causing the OS to suspend the process and load the needed page from disk.
Costs include:
- Translation overhead — every memory access requires an address translation
- Page table size — can be large; mitigated by multi-level page tables and TLB
- Page fault latency — disk access takes milliseconds vs. nanoseconds for RAM
- TLB misses — translation lookaside buffer misses require page table walks
The von Neumann bottleneck is the limitation that both instructions and data share the same memory and bus. The CPU can either fetch an instruction or access data, but not simultaneously. As CPU speeds increased faster than memory speeds, this became a significant constraint.
Modern mitigations include:
- Cache hierarchies — multiple levels of fast cache memory between CPU and main memory
- Separate instruction and data caches — Harvard architecture at L1 level allows simultaneous access
- Out-of-order execution — CPU can do other useful work while waiting for memory
- Prefetching — speculatively loading data before it's needed
- Multi-threading — switching between threads during memory latency
L1 Cache: Smallest (typically 32-64KB per core), fastest (roughly 4-5 cycle latency on current cores). Split into separate instruction and data caches in most modern CPUs. Located closest to the core.
L2 Cache: Larger (typically 256KB-1MB per core), slower (10-20 cycle latency). Can be per-core or shared between pairs of cores. May be inclusive or exclusive of L1.
L3 Cache: Largest (typically 8-64MB shared), slowest cache level (40-60 cycle latency). Shared across all cores on the chip. Serves as a buffer between fast core caches and main memory.
The inclusion hierarchy matters: inclusive L3 means L1/L2 data is also in L3 (wasteful but simple); exclusive means data exists in only one level (efficient but complex to manage).
NUMA (Non-Uniform Memory Access) occurs in multi-socket systems where each CPU has its own local memory. Accessing local memory is fast (100-200ns); accessing memory attached to another CPU socket is slower (300-500ns) due to the interconnect.
Performance implications:
- Memory-intensive workloads should be pinned to local NUMA nodes
- Database systems and HPC applications are particularly sensitive
- Virtual machines can be allocated memory from specific NUMA nodes
- Kernel NUMA balancing can help but adds overhead
- Tools: numactl --hardware shows topology; numastat shows memory hit/miss per node
Cache coherency (MESI/MOESI protocols) ensures that any memory location read by any core returns the most recently written value by any core. It's about what value you read.
Memory ordering defines the order in which memory operations from one thread become visible to other threads. x86 has strong ordering (stores aren't reordered with other stores); ARM has weak ordering (loads/stores can be reordered).
Example: On weak ordering (ARM), store x = 1; store y = 1; might be observed as y = 1; x = 1; by another core. On strong ordering (x86), the stores appear in program order.
Caches implement coherency: when core A writes to a cache line that core B has in shared state, core B's copy is invalidated. The cache hierarchy enforces coherency.
Memory ordering is enforced by memory barrier instructions (dmb, dsb on ARM; mfence on x86). These prevent the CPU or compiler from reordering loads/stores across the barrier.
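The invalidation behavior described above can be traced with a toy model of one cache line under MESI. This sketch tracks only the per-core state letters (Modified, Exclusive, Shared, Invalid); it ignores data, write-backs, and bus timing, and the class name is hypothetical.

```python
class CacheLine:
    """MESI state of a single cache line as seen by each core (toy model)."""

    def __init__(self, num_cores):
        self.state = ["I"] * num_cores

    def read(self, core):
        if self.state[core] != "I":
            return                       # hit in M, E, or S: nothing changes
        holders = [c for c, s in enumerate(self.state) if s != "I"]
        for c in holders:
            self.state[c] = "S"          # M/E holders downgrade to Shared
        # Alone with the line: Exclusive. Otherwise: Shared.
        self.state[core] = "S" if holders else "E"

    def write(self, core):
        for c in range(len(self.state)):
            if c != core:
                self.state[c] = "I"      # every other copy is invalidated
        self.state[core] = "M"

line = CacheLine(num_cores=2)
line.read(0);  print("core 0 reads :", line.state)
line.read(1);  print("core 1 reads :", line.state)
line.write(0); print("core 0 writes:", line.state)
line.read(1);  print("core 1 reads :", line.state)
```

The write step is where false sharing hurts: every write by one core forces the other core's copy to Invalid, so its next access has to fetch the line again.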
DMA (Direct Memory Access) allows I/O devices to transfer data directly to/from memory without CPU intervention. The CPU initiates the transfer, then the DMA controller handles the actual data movement while the CPU continues with other work.
The DMA workflow:
- CPU programs the DMA controller with source address, destination address, and transfer count
- CPU starts the transfer and continues with other work
- DMA controller reads from source (device or memory) and writes to destination
- DMA controller raises an interrupt when transfer completes
Without DMA, the CPU would need to copy each byte: read from device, write to memory, repeat. This "programmed I/O" consumes all CPU cycles for the transfer. With DMA, the CPU is free to execute other instructions while the DMA controller handles the transfer at memory bus speeds.
Typical uses: disk I/O, network packet reception, audio playback, GPU texture transfers. High-bandwidth devices have their own DMA channels. Modern systems have multiple DMA controllers with scatter-gather support for non-contiguous buffers.
Memory-mapped I/O (MMIO): Device registers appear in the system's physical address space as if they were regular memory. The CPU reads/writes to these addresses to communicate with the device.
Example: On ARM, peripheral registers might be at addresses 0x40000000-0x4FFFFFFF. Reading from 0x40000004 might return a device status register. The same load/store instructions used for memory access also access devices.
Port-mapped I/O (PMIO): Devices have a separate address space from memory, accessed through special instructions. On x86, IN and OUT instructions read/write to port addresses (0-65535). A control signal on the bus, rather than the address itself, tells the hardware whether a cycle targets memory or an I/O port.
Comparison:
- MMIO: Uses standard load/store instructions, simpler programming model, flexible address space
- PMIO: Requires special instructions, separate address space, still used for legacy ISA devices
- MMIO is more common in modern systems; PMIO is mainly on x86 and some embedded processors
The disadvantage of MMIO is that incorrect pointer access can corrupt device registers rather than just memory. Device drivers must carefully validate pointers before dereferencing MMIO addresses.
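The routing that makes MMIO work can be modeled as an address decoder. The sketch below reuses the example peripheral range 0x40000000-0x4FFFFFFF from above. The device, its status value, and the class names are hypothetical, and RAM is modeled as a dictionary for brevity.

```python
class StatusDevice:
    """Hypothetical peripheral with a status register at offset 4."""

    def __init__(self):
        self.writes = []                  # log of (offset, value) register writes

    def read(self, offset):
        return 0x1 if offset == 4 else 0  # 0x1 stands in for "device ready"

    def write(self, offset, value):
        self.writes.append((offset, value))

class Bus:
    MMIO_BASE = 0x4000_0000
    MMIO_END = 0x4FFF_FFFF

    def __init__(self, device):
        self.ram = {}
        self.device = device

    def _is_mmio(self, addr):
        return self.MMIO_BASE <= addr <= self.MMIO_END

    def load(self, addr):
        # Same "instruction" for both cases; the address decides the target.
        if self._is_mmio(addr):
            return self.device.read(addr - self.MMIO_BASE)
        return self.ram.get(addr, 0)

    def store(self, addr, value):
        if self._is_mmio(addr):
            self.device.write(addr - self.MMIO_BASE, value)
        else:
            self.ram[addr] = value

bus = Bus(StatusDevice())
bus.store(0x1000, 42)                     # ordinary RAM write
bus.store(0x4000_0000, 0xFF)              # lands in a device register instead
print(bus.load(0x1000), bus.load(0x4000_0004), bus.device.writes)
```

The second store shows the hazard mentioned above: a stray pointer into the MMIO range changes device state rather than memory.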
The TLB is a hardware cache that stores recent virtual-to-physical address translations. Every memory access requires address translation through the MMU, which involves walking multi-level page tables—expensive (4-10 memory accesses). The TLB provides a fast-path lookup for the most common translations.
TLB lookup:
- CPU presents virtual address to TLB
- If address matches a TLB entry (TLB hit): translation available in 1 cycle
- If no match (TLB miss): MMU walks the page tables, which takes several memory accesses and can cost 100+ cycles, then caches the result in the TLB
TLB entries typically contain: virtual page number, physical page number, protection bits (read/write/execute), valid bit, and sometimes an ASID (Address Space ID) to identify which process's mapping.
TLB size: typically 64-1024 entries. With 4KB pages, that's only 256KB-4MB of address space cached. TLB misses are expensive—a page table walk can take 100+ cycles.
Some processors have separate instruction (ITLB) and data (DTLB) TLBs. Others have a unified TLB. Large working sets (like database servers) often suffer from TLB pressure, leading to performance degradation.
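The address space a TLB can cover without missing, often called TLB reach, is just entries times page size. A quick sketch of that arithmetic, using the entry counts quoted above and adding 2 MiB huge pages for comparison (the helper name is illustrative):

```python
KIB, MIB, GIB = 1024, 1024 ** 2, 1024 ** 3

def tlb_reach(entries, page_size):
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size

print(tlb_reach(64, 4 * KIB) // KIB, "KiB with 64 entries and 4 KiB pages")
print(tlb_reach(1024, 4 * KIB) // MIB, "MiB with 1024 entries and 4 KiB pages")
print(tlb_reach(1024, 2 * MIB) // GIB, "GiB with 1024 entries and 2 MiB pages")
```

This is why huge pages are a common remedy for TLB pressure: the same number of entries covers far more memory.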
A microprocessor (like Intel Core i7, AMD Ryzen, ARM Cortex-A) is just the CPU—it has registers, ALU, control unit, cache, but no peripherals or memory. It requires external RAM, ROM, and I/O devices to form a complete system. Used in PCs, servers, smartphones.
A microcontroller (like Arduino's ATmega328P, STM32, PIC) integrates the CPU, memory (flash, SRAM), and peripherals onto a single chip. It can connect directly to sensors, actuators, and other components without additional chips. Used in appliances, embedded systems, IoT devices.
Key differences:
- Integration: Microcontroller = CPU + memory + peripherals on one chip
- Cost: Microcontrollers can be under $1; microprocessors require supporting chips
- Power: Microcontrollers use milliwatts; microprocessors can use 50-100W
- Peripherals: Microcontrollers include timers, ADC, DAC, communication (UART, SPI, I2C), PWM
- Development: Microcontrollers can often run from internal ROM; microprocessors require external boot devices
The choice depends on the application: a microwave needs a microcontroller; a smartphone needs a microprocessor with multiple cores, GPU, and cellular modem.
ECC (Error-Correcting Code) memory detects and corrects single-bit errors and detects (but cannot correct) double-bit errors. It uses extra bits (check bits) stored alongside the data to validate and correct.
How ECC works:
- For 64-bit data, ECC adds 8 check bits (72 bits total per memory word)
- The check bits encode parity information across overlapping subsets of the data bits
- On a read, the check bits are recomputed and compared with the stored ones; for a single-bit error, the pattern of mismatches (the syndrome) identifies which bit to flip back
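The correction mechanism is easier to see at small scale. The sketch below uses a Hamming(7,4) code, which protects 4 data bits with 3 check bits. This is a scaled-down stand-in chosen for illustration, not the 72-bit code used on ECC DIMMs, and it omits the extra parity bit needed for double-error detection.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d3, d5, d6, d7 = data_bits
    p1 = d3 ^ d5 ^ d7          # check bit covering positions 1, 3, 5, 7
    p2 = d3 ^ d6 ^ d7          # check bit covering positions 2, 3, 6, 7
    p4 = d5 ^ d6 ^ d7          # check bit covering positions 4, 5, 6, 7
    return [p1, p2, d3, p4, d5, d6, d7]

def decode(codeword):
    """Return (data_bits, error_position); position 0 means no error detected."""
    bits = list(codeword)
    syndrome = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            syndrome ^= position   # XOR of the positions that hold a 1
    if syndrome:
        bits[syndrome - 1] ^= 1    # the syndrome is the position of the bad bit
    return [bits[2], bits[4], bits[5], bits[6]], syndrome

data = [1, 0, 1, 1]
word = encode(data)
word[4] ^= 1                       # simulate a soft error at position 5
recovered, position = decode(word)
print("recovered:", recovered, "error at position:", position)
```

Because each position is covered by a unique combination of check bits, the set of failing checks spells out the position of the flipped bit in binary.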
Types of errors:
- Soft errors: Caused by cosmic rays or electrical noise—temporary, random bit flips
- Hard errors: Caused by manufacturing defects or physical damage to chips—persistent
ECC is necessary in:
- Servers and workstations: Data corruption can be catastrophic in commercial environments
- Mission-critical systems: Medical devices, aerospace, financial systems where errors are unacceptable
- High-availability systems: ECC prevents silent data corruption
Regular desktop RAM (non-ECC) cannot detect errors—it just passes through data. Some errors go undetected, causing corruption that manifests as crashes, security vulnerabilities, or data loss.
The memory hierarchy exploits the principle of locality: programs tend to access the same data repeatedly (temporal locality) and access nearby data (spatial locality). Faster, smaller memory sits closer to the CPU; slower, larger memory sits further away.
The levels (fastest to slowest):
- Registers: 1 cycle access, 1-2 KB total
- L1 Cache: roughly 4-5 cycles, 32-64 KB per core
- L2 Cache: 10-20 cycles, 256KB-1MB per core
- L3 Cache: 40-60 cycles, 8-64MB shared
- Main Memory (RAM): 100-300 cycles (tens of nanoseconds up to roughly 100 ns)
- Secondary Storage: tens of microseconds (SSD) to milliseconds (HDD)
Data flows up and down the hierarchy. When the CPU needs data, it checks L1; on miss, L2; on miss, L3; on miss, main memory. The data is then cached at all levels for future use. When memory is written, the write may go to just L1 (write-back) or all levels (write-through), depending on the policy.
The illusion: programs see memory as fast as L1 but large as disk. Without caching, programs would run 100x slower waiting for main memory.
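How much the hierarchy helps can be estimated with the standard average memory access time calculation. The latencies below are illustrative values in the spirit of the ranges listed above, and the hit rates are invented for the example rather than measured.

```python
def average_access_time(levels, memory_cycles):
    """
    levels: list of (latency_cycles, hit_rate), ordered from L1 outward.
    Each level costs its own latency plus, on a miss, the cost of the next level.
    """
    cost = memory_cycles
    for latency, hit_rate in reversed(levels):
        cost = latency + (1.0 - hit_rate) * cost
    return cost

# Assumed latencies (cycles) and hypothetical hit rates
levels = [
    (4, 0.95),    # L1
    (15, 0.80),   # L2
    (50, 0.50),   # L3
]
MEMORY_CYCLES = 200

with_caches = average_access_time(levels, MEMORY_CYCLES)
without_caches = average_access_time([], MEMORY_CYCLES)
print(f"average access with caches: {with_caches:.2f} cycles")
print(f"every access from RAM:      {without_caches:.0f} cycles")
```

The result is dominated by the L1 hit rate: because most accesses never leave L1, the slow levels contribute only a small weighted share of the average.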
The system bus connects CPU, memory, and I/O devices, carrying data, addresses, and control signals. It's the communication backbone of the computer.
Three bus types:
- Data bus: Carries actual data. Width (bits) determines how much data can be transferred per cycle. 64-bit data bus can move 8 bytes simultaneously.
- Address bus: Carries memory addresses for read/write operations. Width determines maximum addressable memory: 32-bit address bus can address 4GB.
- Control bus: Carries timing and control signals: read/write enable, interrupt acknowledge, bus request/grant, clock signals.
Bus architectures:
- Front-side bus (FSB): Traditional shared bus connecting CPU, memory, and I/O (older systems)
- Point-to-point interconnects: Modern systems use dedicated links (Intel QPI, AMD HyperTransport, ARM CCI). No shared bus bottleneck.
- PCI Express: Point-to-point serial bus for expansion cards and GPUs. Replaced parallel PCI/AGP.
- DDR memory bus: Dedicated channel to RAM modules.
The bus width and clock speed determine bandwidth. A 64-bit bus at 100MHz can theoretically transfer 800 MB/s.
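The bandwidth figure follows from width times clock rate. A small sketch of the calculation; the `transfers_per_clock` parameter is an added assumption to cover double-data-rate buses, which move data on both clock edges.

```python
def peak_bandwidth(bus_width_bits, clock_hz, transfers_per_clock=1):
    """Theoretical peak in bytes per second; real buses deliver less."""
    return (bus_width_bits // 8) * clock_hz * transfers_per_clock

sdr = peak_bandwidth(64, 100_000_000)
ddr = peak_bandwidth(64, 100_000_000, transfers_per_clock=2)
print(f"64-bit bus at 100 MHz:             {sdr / 1e6:.0f} MB/s")
print(f"same bus with two transfers/clock: {ddr / 1e6:.0f} MB/s")
```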
A watchdog timer is a hardware timer that resets the system if the software fails to periodically "feed" (reset) it. If the software gets stuck (hang, infinite loop, crash), it stops feeding the watchdog, and the watchdog resets the system.
Operation:
- Software initializes watchdog with timeout period (e.g., 1 second)
- Software periodically resets the timer before it expires
- If software hangs: timer expires, generates reset signal
- System reboots and hopefully recovers
Use cases in embedded systems:
- Industrial control: Unattended systems must recover from faults
- Safety systems: Medical devices, automotive controllers
- Remote systems: Solar-powered devices in inaccessible locations
- Consumer appliances: Smart TVs, routers that run for months without restart
The watchdog must be reset by the main loop or critical tasks. If interrupts are disabled for too long, the watchdog may fire even though the software is running correctly. Some systems use windowed watchdogs that require resets within a specific time window (not too early, not too late).
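The feed-or-reset logic can be modeled in software to make the timing concrete. This sketch takes timestamps as explicit arguments so the behavior is deterministic. A real watchdog is a hardware counter, and the class and method names here are hypothetical.

```python
class Watchdog:
    """Software model of a simple (non-windowed) watchdog timer."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_fed = now

    def feed(self, now):
        """Called by healthy software to restart the countdown."""
        self.last_fed = now

    def expired(self, now):
        """True once a full timeout has passed without a feed."""
        return now - self.last_fed >= self.timeout

wd = Watchdog(timeout=1.0)

# Healthy main loop: feeds every 0.5 s, so the timer never expires
for t in (0.5, 1.0, 1.5):
    wd.feed(t)
    print(f"t={t:.1f}s fed, expired={wd.expired(t)}")

# Software hangs after t=1.5 s and stops feeding
for t in (2.0, 2.5, 3.0):
    print(f"t={t:.1f}s hung, expired={wd.expired(t)}")
```

In the hung phase the model reports expiry from t=2.5 s onward, one full timeout after the last feed. On hardware that is the moment the reset line would be asserted.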
Volatile memory loses its contents when power is removed. It requires power to maintain the stored data. Examples:
- SRAM (Static RAM): Uses flip-flops (6 transistors per bit). Fast (1 cycle), expensive. Used for CPU registers and L1/L2 cache.
- DRAM (Dynamic RAM): Uses a capacitor + transistor per bit. Slower (needs refresh), cheaper. Used for main memory (RAM).
Non-volatile memory retains data without power. Examples:
- Flash: EEPROM variant. Blocks can be electrically erased and reprogrammed. Used for SSDs, USB drives, firmware storage.
- ROM (Mask ROM): Programmed during manufacturing. Cannot be modified.
- PROM: Programmable once after manufacture, typically with a device programmer. Cannot be erased.
- EPROM: Erasable with UV light, then reprogrammable.
- EEPROM: Electrically erasable. Used for BIOS/UEFI chips, embedded firmware.
- NVRAM: Non-volatile SRAM (battery-backed or magnetic). Used for routers, game consoles.
Trade-offs: Volatile memory is faster to read and write; non-volatile memory retains data when powered off, and flash is cheaper per bit than DRAM. Systems use both: SRAM for caches, DRAM for active data, flash for persistent storage, and battery-backed NVRAM for small amounts of data that must survive power loss.
CPU performance is measured by Instructions Per Cycle (IPC) and clock frequency. The product gives instructions per second: Performance = IPC × Frequency.
However, real-world performance depends on workload:
- Compute-bound workloads: Benefit from higher frequency and more cores
- Memory-bound workloads: Limited by memory bandwidth and cache effectiveness
- Branch-heavy workloads: Sensitive to branch prediction accuracy
Key metrics:
- Instructions per cycle (IPC): Higher is better. Indicates how efficiently the CPU executes instructions.
- Cache hit rates: L1/L2/L3 hit rates determine memory stall time.
- Branch misprediction rate: 5-10% is acceptable; higher hurts performance.
- Memory bandwidth: GB/s for sustained data transfer.
- Latency: How long specific operations take (memory load latency, cache miss latency).
Tools:
- perf stat: Hardware performance counters
- vtune (Intel): Detailed microarchitectural analysis
- arm-pmu (ARM): Similar analysis on ARM
- Microbenchmarks isolate specific operations
Registers are the CPU's internal storage locations, directly addressable by instructions. They're part of the CPU silicon, accessible in a single cycle. Modern x86 has 16 general-purpose registers plus special registers (RIP, RSP, RFLAGS, etc.). ARM64 has 31 general-purpose registers.
Registers hold:
- Data being actively processed (operands for ALU)
- Addresses for memory access
- Control values (flags, status)
- Return addresses for function calls
Cache lines are the unit of data transfer between CPU cache and main memory. When the CPU accesses memory, an entire cache line (typically 64 bytes) is loaded into L1 cache, not just the requested byte. This exploits spatial locality—accessing one byte means nearby bytes are likely accessed soon.
Key differences:
- Access: Registers are addressed by instruction operands (e.g., mov rax, rbx). Cache is transparently managed; software doesn't explicitly address cache lines.
- Size: Registers are typically 64 bits (8 bytes). Cache lines are 64 bytes.
- Management: Compiler allocates registers. Hardware manages cache automatically (except for software hints like prefetch).
- Visibility: Registers are architecturally visible (programmers see them). Cache is invisible (transparent to software).
Further Reading
- What Every Programmer Should Know About Memory — Ulrich Drepper’s comprehensive overview of memory hierarchies
- Intel 64 and IA-32 Architectures Software Developer’s Manual — Complete x86 reference
- ARM Architecture Reference Manual — Comprehensive ARM documentation
- Running Linux Memory — Linux kernel memory management documentation
- CPU Cache Flushing Fallacies — Common misconceptions about cache behavior
Conclusion
Understanding how computers work at the hardware level provides the foundation for reasoning about performance, security, and system design. The fetch-decode-execute cycle, cache hierarchies, and virtual memory subsystems are not abstract concepts but practical realities that directly impact application behavior.
The concepts covered here—CPU architecture, memory hierarchy, and I/O systems—connect directly to how operating systems manage hardware resources. When you’re debugging a performance issue or designing a system that scales, this low-level understanding becomes your most powerful tool.
Continue your journey into computer architecture by exploring number systems and data representation to understand how bits become meaningful data, or dive into boolean logic and gates to see how simple transistors become functional units.