How Computers Work

Understanding the fundamental hardware architecture of computers — CPU, memory, I/O devices, and the bus architecture that connects them.

published: reading time: 27 min read author: GeekWorkBench

Introduction

Every click, every calculation, every character you type goes through a chain of hardware components working together. Understanding how computers work matters for debugging production issues, reasoning about performance, and designing systems that don’t fall over when you least expect it.

The modern computer has a few major parts: the CPU runs code, memory holds data while you work, and input/output devices let the machine talk to the outside world. The system bus connects everything, letting these components share data.

When to Use This Knowledge

Apply your understanding of computer architecture when:

  • Debugging performance issues that don’t match software profiling
  • Writing high-performance code where memory access patterns matter
  • Designing systems that interact with hardware devices
  • Troubleshooting mysterious kernel panics or driver failures
  • Making procurement decisions about hardware specs

Skip the deep dive if:

  • You’re doing pure web development with managed runtimes
  • You’re just scripting with automatic memory management

Architecture Diagram

Here’s how the major components interconnect:

graph TB
    subgraph "CPU"
        A[Control Unit]
        B[Arithmetic Logic Unit]
        C[Registers]
        A --> B
        B --> C
    end

    subgraph "Memory Hierarchy"
        D[L1 Cache]
        E[L2 Cache]
        F[L3 Cache]
        G[Main Memory RAM]
        H[Virtual Memory Swap]
        D --> E
        E --> F
        F --> G
        G --> H
    end

    subgraph "I/O Devices"
        I[Disk Controller]
        J[Network Controller]
        K[USB Controller]
        L[Graphics GPU]
    end

    M[System Bus] --> A
    M --> D
    M --> G
    M --> I
    M --> J
    M --> K
    M --> L

Core Concepts

The CPU: Brain of the Operation

The CPU executes instructions in a relentless fetch-decode-execute cycle. Modern CPUs contain:

Control Unit (CU) — Directs the orchestra. It fetches instructions from memory, decodes what they mean, and tells the ALU what to do. Think of it as the conductor who reads the sheet music and gestures to each section.

Arithmetic Logic Unit (ALU) — Does the actual math. Addition, subtraction, bitwise operations, comparisons — all happen here. It’s the strongest member of the team but only does exactly what it’s told.

Registers — Ultra-fast memory locations inside the CPU itself. General-purpose registers hold data being worked on. Special registers like the Program Counter (PC) track where the CPU is in the instruction stream, and the Stack Pointer (SP) manages the call stack.
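To make the fetch-decode-execute cycle concrete, here is a minimal sketch of a toy CPU. The four-instruction machine (LOAD, ADD, JNZ, HALT), the register names, and the `run` function are all invented for illustration and do not correspond to any real instruction set.

```python
# Toy CPU: a hypothetical 4-instruction machine, not any real ISA.
def run(program, max_steps=1000):
    """Execute `program` (a list of tuples) and return the final registers."""
    regs = {"r0": 0, "r1": 0, "r2": 0}   # general-purpose registers
    pc = 0                                # program counter
    for _ in range(max_steps):
        instr = program[pc]               # fetch
        pc += 1                           # PC now points at the next instruction
        op, *args = instr                 # decode
        if op == "LOAD":                  # execute: LOAD reg, immediate
            regs[args[0]] = args[1]
        elif op == "ADD":                 # ADD dst, src  (dst = dst + src)
            regs[args[0]] += regs[args[1]]
        elif op == "JNZ":                 # JNZ reg, target  (jump if reg != 0)
            if regs[args[0]] != 0:
                pc = args[1]
        elif op == "HALT":
            return regs
        else:
            raise ValueError(f"unknown opcode: {op}")
    raise RuntimeError("step limit reached without HALT")

# Sum 3 + 2 + 1 by counting r1 down to zero.
program = [
    ("LOAD", "r0", 0),     # 0: accumulator
    ("LOAD", "r1", 3),     # 1: loop counter
    ("LOAD", "r2", -1),    # 2: constant -1
    ("ADD", "r0", "r1"),   # 3: r0 += r1
    ("ADD", "r1", "r2"),   # 4: r1 -= 1
    ("JNZ", "r1", 3),      # 5: loop back while r1 != 0
    ("HALT",),             # 6
]
result = run(program)
print(result)  # {'r0': 6, 'r1': 0, 'r2': -1}
```

The program counter is the only thing that decides what runs next, which is why a jump is nothing more than a write to `pc`.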

The Cache Hierarchy — Main memory is slow. The CPU is fast. The solution: cache memory. L1 is smallest and fastest (typically 32-64KB per core), L2 is larger and slower, and L3 is larger still and usually shared across cores. When the CPU needs data, it checks L1 first, then L2, then L3, then main memory. Each step down costs several times more than the one before; a miss that goes all the way to main memory costs on the order of 100 times an L1 hit.

// Visualizing cache impact - accessing memory in patterns
// This code demonstrates why memory access patterns matter

#include <stdio.h>
#include <time.h>

#define SIZE 10000

// Slow: jumping around memory (cache misses)
double slow_access(double matrix[SIZE][SIZE]) {
    double sum = 0.0;
    for (int j = 0; j < SIZE; j++) {
        for (int i = 0; i < SIZE; i++) {
            sum += matrix[i][j];  // Column-major access on row-major data
        }
    }
    return sum;
}

// Fast: sequential memory access (cache hits)
double fast_access(double matrix[SIZE][SIZE]) {
    double sum = 0.0;
    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++) {
            sum += matrix[i][j];  // Row-major access on row-major data
        }
    }
    return sum;
}
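The difference between the two traversal orders can be estimated without allocating a large matrix. The sketch below counts misses in a toy direct-mapped cache; the 64-byte line, the 512-line capacity, and the helper names are assumptions chosen for illustration, and real caches are set-associative with hardware prefetching, so actual ratios will differ.

```python
LINE_BYTES = 64      # assumed cache line size
ELEM_BYTES = 8       # sizeof(double)

def count_misses(addresses, num_lines=512):
    """Count misses in a toy direct-mapped cache with `num_lines` lines."""
    tags = [None] * num_lines
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES         # which memory line holds this byte
        slot = line % num_lines           # direct-mapped: exactly one candidate slot
        if tags[slot] != line:
            tags[slot] = line             # fill the slot, evicting the old line
            misses += 1
    return misses

def row_major(n):
    """Byte addresses touched when the inner loop walks along a row."""
    return (ELEM_BYTES * (i * n + j) for i in range(n) for j in range(n))

def column_major(n):
    """Byte addresses touched when the inner loop walks down a column."""
    return (ELEM_BYTES * (i * n + j) for j in range(n) for i in range(n))

N = 256
row_misses = count_misses(row_major(N))
col_misses = count_misses(column_major(N))
print(f"row-major misses:    {row_misses}")
print(f"column-major misses: {col_misses}")
```

In this model the row-major walk misses once per 64-byte line (one access in eight), while the column-major walk misses on every access because each step lands on a different line that evicts one it will need again.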

Memory: The Workspace

RAM (Random Access Memory) — Volatile storage that holds data and code while the system runs. When power cuts, data vanishes. Main memory is DRAM, organized in rows and columns, with each cell made of a capacitor and a transistor. Reading or writing any cell takes roughly the same time regardless of its address, hence “random access.”

Virtual Memory — The operating system’s sleight of hand. It makes every process believe it has the entire address space to itself. In reality, physical RAM is shared, and some data gets “paged out” to disk when pressure mounts. This mapping between virtual and physical addresses is handled by the Memory Management Unit (MMU).
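The translation step the MMU performs can be sketched in a few lines. This is a simplified single-level model, assuming 4 KiB pages; the `translate` function, the `PageFault` exception, and the example mapping are hypothetical, and real hardware uses multi-level tables walked in silicon.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages

class PageFault(Exception):
    """Raised when a virtual page has no resident physical frame."""

def translate(page_table, vaddr):
    """Map a virtual address to a physical one via a single-level table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)    # virtual page number + offset in page
    frame = page_table.get(vpn)               # None models a "not present" entry
    if frame is None:
        raise PageFault(f"virtual page {vpn} is not resident")
    return frame * PAGE_SIZE + offset         # the offset is carried over unchanged

# Hypothetical mapping: virtual page 0 -> frame 5, virtual page 1 -> frame 2.
page_table = {0: 5, 1: 2}
print(hex(translate(page_table, 0x1234)))  # 0x2234
```

An operating system would catch the fault, load the page from disk, update the table, and retry the access.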

Memory-Mapped I/O — Some hardware devices are accessed by reading and writing to specific memory addresses. The CPU doesn’t distinguish between RAM and a device register — it just reads or writes an address. The hardware routes the access to the appropriate device.

The System Bus: Information Highway

The bus is a collection of wires carrying data, addresses, and control signals. Three main buses exist in typical architectures:

Data Bus — The wide highway for actual data transfer. A 64-bit data bus moves 8 bytes at a time.

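Address Bus — Specifies where the data should go. The width of this bus determines how much memory the system can address. A 32-bit address bus can address 2^32 bytes (4GB) of byte-addressable memory; a 64-bit address bus could in principle address 2^64 bytes (16 EiB), far more than any machine physically installs, which is why real CPUs wire up fewer physical address bits.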

Control Bus — Carries timing and control signals: “read now,” “write now,” “memory is ready,” “device is busy.”
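The relationship between bus width and capacity is simple arithmetic. The following sketch works it through, assuming byte-addressable memory; the function names are illustrative.

```python
def addressable_bytes(address_bits):
    """Distinct byte addresses an address bus of this width can express."""
    return 2 ** address_bits

def bytes_per_transfer(data_bits):
    """Bytes moved by one transfer on a data bus of this width."""
    return data_bits // 8

def format_size(n):
    """Render a byte count using binary units (truncates non-exact values)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    index = 0
    while n >= 1024 and index < len(units) - 1:
        n //= 1024
        index += 1
    return f"{n} {units[index]}"

print(format_size(addressable_bytes(32)))  # 4 GiB
print(format_size(addressable_bytes(64)))  # 16 EiB
print(bytes_per_transfer(64))              # 8
```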

Production Failure Scenarios

Scenario 1: Cache Line Contention in Multi-threaded Systems

What happened: A trading platform experienced periodic latency spikes every 60 seconds despite having ample CPU headroom. Investigation revealed that threads on different cores were accessing data that shared the same cache line.

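Root cause: Modern CPUs maintain cache coherence across cores at cache-line granularity. When threads on different cores write to different variables that happen to sit in the same cache line, the line bounces between cores (“cache line ping-pong”), with each write invalidating the other core’s copy. This is called false sharing, because the threads never actually share any data.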

Mitigation:

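#include <stdalign.h>  // alignas (C11)

// Bad: Two threads fight over the same cache line
struct {
    long counter_a;  // Thread A writes this
    long counter_b;  // Thread B writes this
    // These sit in the same 64-byte cache line!
} shared;

// Good: Align and pad so each counter owns a full cache line.
// Assumes 64-byte lines. Padding alone is not enough: without alignas,
// the two structs could still be placed so that they straddle one line.
struct {
    alignas(64) long counter_a;
    char pad[64 - sizeof(long)];  // Fill the rest of the cache line
} thread_a_data;

struct {
    alignas(64) long counter_b;
    char pad[64 - sizeof(long)];
} thread_b_data;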
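Whether two fields land on the same cache line can be checked from their byte offsets. The sketch below mirrors an unpadded and a padded layout with `ctypes`, assuming a 64-byte line and a structure that starts on a line boundary; the class and helper names are illustrative.

```python
import ctypes

CACHE_LINE = 64  # assumed cache line size in bytes

class Shared(ctypes.Structure):
    """Both counters packed together, as in the problematic layout."""
    _fields_ = [("counter_a", ctypes.c_int64),
                ("counter_b", ctypes.c_int64)]

class Padded(ctypes.Structure):
    """Each counter followed by padding that fills out its 64-byte line."""
    _fields_ = [("counter_a", ctypes.c_int64),
                ("pad_a", ctypes.c_char * 56),
                ("counter_b", ctypes.c_int64),
                ("pad_b", ctypes.c_char * 56)]

def line_index(struct_type, field):
    """Cache line (relative to the struct start) that holds `field`."""
    return getattr(struct_type, field).offset // CACHE_LINE

for layout in (Shared, Padded):
    a = line_index(layout, "counter_a")
    b = line_index(layout, "counter_b")
    verdict = "same line" if a == b else "separate lines"
    print(f"{layout.__name__}: counter_a on line {a}, counter_b on line {b} ({verdict})")
```

The offsets only tell the whole story if the structure itself begins on a cache-line boundary, which is why alignment matters as much as padding.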

Scenario 2: NUMA-Aware Memory Allocation Failures

What happened: A database server running on a 4-socket NUMA system showed half the expected performance. The database was spawning worker threads that allocated memory locally, but queries that joined across partitions caused remote memory accesses.

Root cause: Non-Uniform Memory Access (NUMA) means memory attached to CPU 0 is faster to access from CPU 0 than from CPU 3. The database wasn’t NUMA-aware.

Mitigation:

# Check NUMA topology on Linux
numactl --hardware
# Shows: available: 4 nodes (0-3), and which node each CPU/memory bank belongs to

# Run a process with memory locality preferences
numactl --membind=0 --cpunodebind=0 mydatabase
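A rough model shows why locality matters so much. The sketch below computes the average memory latency as a weighted mix of local and remote accesses; the 100 ns and 300 ns figures are illustrative assumptions, not measurements from the system in this scenario.

```python
def average_latency_ns(local_fraction, local_ns=100.0, remote_ns=300.0):
    """Weighted average latency for a given share of local accesses.

    The default latencies are illustrative assumptions, not measured values.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

# All accesses local, versus data spread evenly over 4 nodes (1/4 local).
print(average_latency_ns(1.0))   # 100.0
print(average_latency_ns(0.25))  # 250.0
```

Under these assumed numbers, spreading data evenly over four nodes makes the average access 2.5 times slower than keeping it local.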

Scenario 3: I/O Device Interrupt Storms

What happened: A web server became unresponsive every 15 minutes. The kernel showed thousands of interrupts per second from a malfunctioning network card.

Root cause: A network card stuck in a loop was generating interrupts for every incoming packet, starving the CPU of cycles for actual work. The system spent more time handling interrupts than doing useful work.

Mitigation: Use interrupt coalescing settings on network cards, configure interrupt affinity to spread handling across cores, and monitor /proc/interrupts for anomalies.
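Monitoring /proc/interrupts usually means sampling it twice and comparing the counters. The sketch below parses two snapshots and reports the fastest-growing interrupt line; the sample text is invented for illustration, and the parser assumes the usual layout of a CPU header row followed by one row per interrupt.

```python
def parse_interrupts(text):
    """Return {irq_label: total_count_across_cpus} from /proc/interrupts text."""
    lines = text.strip().splitlines()
    num_cpus = len(lines[0].split())           # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for field in rest.split()[:num_cpus]:
            if not field.isdigit():
                break                           # reached the description columns
            counts.append(int(field))
        totals[label.strip()] = sum(counts)
    return totals

def busiest(before, after, interval_s):
    """Return (label, interrupts_per_second) for the fastest-growing IRQ."""
    rates = {k: (after[k] - before.get(k, 0)) / interval_s for k in after}
    label = max(rates, key=rates.get)
    return label, rates[label]

# Invented snapshots taken one second apart (not real output).
SAMPLE_T0 = """\
           CPU0       CPU1
  0:         34          0   IO-APIC    2-edge      timer
 24:       1000       2000   PCI-MSI    eth0
NMI:          5          7   Non-maskable interrupts
"""
SAMPLE_T1 = """\
           CPU0       CPU1
  0:         40          0   IO-APIC    2-edge      timer
 24:      51000       2000   PCI-MSI    eth0
NMI:          5          7   Non-maskable interrupts
"""

label, rate = busiest(parse_interrupts(SAMPLE_T0), parse_interrupts(SAMPLE_T1), 1.0)
print(f"IRQ {label}: {rate:.0f} interrupts/sec")
```

On a live system the two snapshots would come from reading /proc/interrupts a fixed interval apart. In this sample all of the growth lands on CPU0, which is also the pattern that interrupt affinity tuning is meant to fix.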

Trade-off Table

| Aspect | CPU-Intensive Workload | I/O-Intensive Workload | Balanced Workload |
| --- | --- | --- | --- |
| Core Count vs. Clock Speed | Fewer, faster cores (3-4 GHz) | Many, slower cores can be better | Moderate counts, good single-thread performance |
| Memory | Less RAM needed, fast L3 critical | Large RAM to absorb I/O bursts | Size for working set, fast memory still matters |
| Storage | NVMe for fast swap/page handling | RAID of fast SSDs, focus on IOPS | Single fast NVMe, or modest RAID |
| Network | 10GbE if data transfer is bottleneck | Multiple 1GbE channels, or 25GbE | 10GbE generally sufficient |
| Cache Behavior | Minimize misses, keep data hot | Large caches help absorb bursts | Profile to find actual hot set |

Implementation Snippets

Reading CPU Information in Linux

#!/bin/bash
# cpu_info.sh - Gather CPU information for diagnostics

echo "=== CPU Information ==="
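grep -m1 "model name" /proc/cpuinfo
echo "Core count: $(nproc)"
echo ""

echo "=== Cache Sizes ==="
# getconf (glibc) names the cache variables LEVEL1_DCACHE_SIZE, LEVEL2_CACHE_SIZE, etc.
for name in LEVEL1_DCACHE_SIZE LEVEL1_ICACHE_SIZE LEVEL2_CACHE_SIZE LEVEL3_CACHE_SIZE; do
    size=$(getconf "$name" 2>/dev/null || echo "N/A")
    echo "$name: $size bytes"
done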
echo ""

echo "=== Current CPU Frequencies ==="
cpufreq-info 2>/dev/null | grep "current policy" || grep "cpu MHz" /proc/cpuinfo
echo ""

echo "=== Memory Information ==="
free -h
echo ""

echo "=== NUMA Topology ==="
numactl --hardware 2>/dev/null || echo "NUMA tools not available"

Memory-Mapped File Access Pattern

#!/usr/bin/env python3
"""
Demonstrates memory-mapped file I/O: the file is mapped directly into the
process's virtual address space and searched in place. This is file mapping,
not the device-register MMIO described earlier, though both rely on the MMU.
"""

import mmap

def memory_mapped_read(filepath: str, pattern: bytes) -> list[int]:
    """
    Find all occurrences of a byte pattern in a file using memory mapping.
    For large files this avoids copying chunks into Python buffers, which is
    often faster than chunked reads. Note: mmap raises ValueError on empty files.
    """
    offsets = []

    with open(filepath, 'rb') as f:
        # Memory-map the file (read-only)
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Search for pattern
            start = 0
            while True:
                pos = mm.find(pattern, start)
                if pos == -1:
                    break
                offsets.append(pos)
                start = pos + 1

    return offsets

# Usage
if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <filepath>")
        sys.exit(1)

    # Find null bytes as an example
    null_offsets = memory_mapped_read(sys.argv[1], b'\x00')
    print(f"Found {len(null_offsets)} null bytes in {sys.argv[1]}")

Observability Checklist

When debugging computer architecture issues, monitor these:

CPU Metrics:

  • Per-core utilization (mpstat -P ALL 1)
  • Context switches per second (vmstat 1)
  • Interrupt rate (cat /proc/interrupts)
  • CPU frequency scaling states (cpufreq-info)
  • Cache hit/miss ratios (CPU performance counters)

Memory Metrics:

  • Available memory (free -h)
  • Swap usage and swap-in/swap-out rates (vmstat 1)
  • NUMA hit/miss ratios (numastat)
  • Memory pressure indicators (cat /proc/pressure/*)
  • Huge pages utilization (cat /proc/meminfo | grep -i huge)

I/O Metrics:

  • Disk I/O operations per second (iostat -xz 1)
  • I/O wait percentage (high value indicates storage bottleneck)
  • Network interface errors (ip -s link show)
  • Interrupt distribution across cores (cat /proc/irq/*/smp_affinity)

Alerting Thresholds:

  • CPU: Alert if any core sustains >90% for >5 minutes
  • Memory: Alert if available <20% of total
  • I/O wait: Alert if sustained >20%
  • Context switches: Alert if >50,000/sec sustained

Security Notes

Spectre and Meltdown Vulnerabilities — Modern CPUs’ speculative execution creates side channels that can leak data across security boundaries. Kernel page table isolation (KPTI) and speculative store bypass disable features mitigate these at a performance cost.

DMA Attacks — Devices with direct memory access can read/write arbitrary physical memory. Enable IOMMU (Input-Output Memory Management Unit) in BIOS and verify intel_iommu=on or amd_iommu=on in kernel parameters.

Rowhammer — Repeated memory access can flip bits in adjacent rows. Enable target row refresh (TRR) and consider ECC memory for high-security environments.

Firmware Security — The CPU microcode can be updated to patch vulnerabilities. Keep firmware updated and verify boot integrity with Secure Boot.

Common Pitfalls / Anti-patterns

Anti-pattern: Ignoring NUMA topology. Placing a workload on the wrong NUMA node can raise its memory latency severalfold. For performance-critical applications on multi-socket systems, bind both CPU and memory with numactl (taskset pins CPUs only).

Anti-pattern: Overcommitting memory. Promising more memory to processes than is physically available leads to thrashing. Monitor the vm.overcommit_memory setting and the ratio of swap to RAM.

Anti-pattern: Assuming cache coherence is free. Multi-core systems keep caches coherent through a coherence protocol, and every contended cache line becomes an implicit serialization point that can destroy parallelism.

Anti-pattern: Neglecting interrupt affinity. Default interrupt routing can concentrate all I/O handling on CPU 0, creating a bottleneck. Run irqbalance or manually set smp_affinity for critical interrupts.

Quick Recap Checklist

  • The CPU operates in a fetch-decode-execute cycle, with registers providing the fastest storage
  • Cache memory bridges the speed gap between CPU and main memory; misses that go all the way to main memory cost hundreds of cycles
  • Virtual memory extends physical RAM using disk, but page faults are expensive
  • The system bus has separate data, address, and control lines connecting all components
  • Multi-socket systems have NUMA characteristics — local memory is faster than remote memory
  • Cache line contention between cores causes “false sharing” and performance degradation
  • I/O device performance depends heavily on interrupt handling efficiency
  • Monitor CPU, memory, and I/O metrics holistically — bottlenecks often appear where you don’t expect them

Interview Questions

1. What happens when a CPU cache miss occurs?

When a CPU needs data that's not in L1 cache, it checks L2, then L3. If the data isn't in any cache level (a last-level cache miss), the CPU must fetch it from main memory. This takes on the order of 100 nanoseconds (a few hundred cycles), compared to about 1 nanosecond for an L1 hit. Instructions that depend on the data cannot finish until it arrives, though out-of-order execution lets the core do other independent work while waiting. The fetched line is then installed in the cache hierarchy, potentially evicting other data to make room.

2. Explain the difference between von Neumann and Harvard architectures.

In von Neumann architecture, code and data share the same memory and bus. The CPU fetches instructions and data over the same pathway, creating the famous "von Neumann bottleneck" — the CPU can either fetch an instruction or access data, but not both simultaneously. Harvard architecture has separate memory spaces and buses for instructions and data, allowing simultaneous access. Modern CPUs use a modified Harvard architecture internally with separate L1 instruction and data caches, while presenting a unified memory view to software.

3. What is a memory barrier (fence) and why is it needed?

A memory barrier is an instruction that enforces ordering constraints on memory operations. Without barriers, CPUs and compilers may reorder memory accesses for performance. A store barrier ensures that writes issued before it become visible to other cores before any writes issued after it. A load barrier ensures that loads issued before it complete before any loads issued after it. In C/C++, GCC and Clang offer the __sync_synchronize() builtin as a full barrier, and C11's <stdatomic.h> (or C++11's <atomic>) provides atomics with memory_order specifications. Barriers are essential in multi-threaded code for establishing happens-before relationships.

4. What causes interrupt latency, and how can it be mitigated?

Interrupt latency is the time between hardware interrupt assertion and the first instruction of the interrupt handler. Causes include:

  • Current code holding interrupts disabled — longer critical sections reduce responsiveness
  • Cache misses — the interrupt handler code must be loaded into cache
  • Multi-core coordination — some interrupts must be handled on specific cores
  • Device driver design — poorly written drivers may delay interrupt acknowledgment

Mitigations include using threaded interrupt handlers, assigning interrupts to specific cores with IRQ affinity, keeping interrupt handlers short, and using MSI-X interrupts for better scaling.

5. How does virtual memory work, and what are its costs?

Virtual memory creates an abstraction layer between logical addresses used by programs and physical addresses in RAM. The Memory Management Unit (MMU) uses a page table to translate virtual addresses to physical addresses. Pages not in RAM are marked as "not present" in the page table; accessing them triggers a page fault, causing the OS to suspend the process and load the needed page from disk.

Costs include:

  • Translation overhead — every memory access requires an address translation
  • Page table size — can be large; mitigated by multi-level page tables and TLB
  • Page fault latency — disk access takes milliseconds vs. nanoseconds for RAM
  • TLB misses — translation lookaside buffer misses require page table walks

6. What is the von Neumann bottleneck and how do modern CPUs mitigate it?

The von Neumann bottleneck is the limitation that both instructions and data share the same memory and bus. The CPU can either fetch an instruction or access data, but not simultaneously. As CPU speeds increased faster than memory speeds, this became a significant constraint.

Modern mitigations include:

  • Cache hierarchies — multiple levels of fast cache memory between CPU and main memory
  • Separate instruction and data caches — Harvard architecture at L1 level allows simultaneous access
  • Out-of-order execution — CPU can do other useful work while waiting for memory
  • Prefetching — speculatively loading data before it's needed
  • Multi-threading — switching between threads during memory latency

7. Explain the differences between L1, L2, and L3 caches.

L1 Cache: Smallest (typically 32-64KB per core), fastest (roughly 3-5 cycle load latency on current designs). Split into separate instruction and data caches in most modern CPUs. Located closest to the core.

L2 Cache: Larger (typically 256KB-1MB per core), slower (10-20 cycle latency). Can be per-core or shared between pairs of cores. May be inclusive or exclusive of L1.

L3 Cache: Largest (typically 8-64MB shared), slowest cache level (40-60 cycle latency). Shared across all cores on the chip. Serves as a buffer between fast core caches and main memory.

The inclusion hierarchy matters: inclusive L3 means L1/L2 data is also in L3 (wasteful but simple); exclusive means data exists in only one level (efficient but complex to manage).

8. What is NUMA and how does it affect performance?

NUMA (Non-Uniform Memory Access) occurs in multi-socket systems where each CPU has its own local memory. Accessing local memory is fast (100-200ns); accessing memory attached to another CPU socket is slower (300-500ns) due to the interconnect.

Performance implications:

  • Memory-intensive workloads should be pinned to local NUMA nodes
  • Database systems and HPC applications are particularly sensitive
  • Virtual machines can be allocated memory from specific NUMA nodes
  • Kernel NUMA balancing can help but adds overhead
  • Tools: numactl --hardware shows topology; numastat shows memory hit/miss per node

9. What are the key differences between CPU cache coherency protocols and memory ordering models?

Cache coherency (MESI/MOESI protocols) ensures that any memory location read by any core returns the most recently written value by any core. It's about what value you read.

Memory ordering defines the order in which memory operations from one thread become visible to other threads. x86 has strong ordering (stores aren't reordered with other stores); ARM has weak ordering (loads/stores can be reordered).

Example: On weak ordering (ARM), store x = 1; store y = 1; might be observed as y = 1; x = 1; by another core. On strong ordering (x86), the stores appear in program order.

Caches implement coherency: when core A writes to a cache line that core B has in shared state, core B's copy is invalidated. The cache hierarchy enforces coherency.

Memory ordering is enforced by memory barrier instructions (dmb, dsb on ARM; mfence on x86). These prevent the CPU or compiler from reordering loads/stores across the barrier.
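The invalidation behavior described above can be sketched as a small state machine. This is a simplified MESI model for a single cache line, with write-backs and bus traffic omitted; the `Line` class and its methods are illustrative, not a description of any specific CPU.

```python
class Line:
    """MESI state of one cache line as seen by each core (simplified)."""

    def __init__(self, num_cores):
        self.state = ["I"] * num_cores    # every core starts Invalid
        self.invalidations = 0

    def read(self, core):
        if self.state[core] != "I":
            return                         # hit: M, E and S copies are readable
        others = [c for c, s in enumerate(self.state) if s != "I"]
        for c in others:
            self.state[c] = "S"            # M/E holders downgrade to Shared
        self.state[core] = "S" if others else "E"   # sole holder gets Exclusive

    def write(self, core):
        for c, s in enumerate(self.state):
            if c != core and s != "I":
                self.state[c] = "I"        # invalidate every other copy
                self.invalidations += 1
        self.state[core] = "M"             # writer now holds the only valid copy

line = Line(num_cores=2)
line.read(0)
print(line.state)   # ['E', 'I']
line.read(1)
print(line.state)   # ['S', 'S']
line.write(0)
print(line.state)   # ['M', 'I']
line.write(1)
print(line.state)   # ['I', 'M']
```

Alternating writes from two cores invalidate the other copy every time, which is the ping-pong pattern behind false sharing.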

10. What is Direct Memory Access (DMA) and how does it improve I/O performance?

DMA (Direct Memory Access) allows I/O devices to transfer data directly to/from memory without CPU intervention. The CPU initiates the transfer, then the DMA controller handles the actual data movement while the CPU continues with other work.

The DMA workflow:

  1. CPU programs the DMA controller with source address, destination address, and transfer count
  2. CPU starts the transfer and continues with other work
  3. DMA controller reads from source (device or memory) and writes to destination
  4. DMA controller raises an interrupt when transfer completes

Without DMA, the CPU would need to copy each byte: read from device, write to memory, repeat. This "programmed I/O" consumes all CPU cycles for the transfer. With DMA, the CPU is free to execute other instructions while the DMA controller handles the transfer at memory bus speeds.

Typical uses: disk I/O, network packet reception, audio playback, GPU texture transfers. High-bandwidth devices have their own DMA channels. Modern systems have multiple DMA controllers with scatter-gather support for non-contiguous buffers.

11. Explain the difference between memory-mapped I/O and port-mapped I/O.

Memory-mapped I/O (MMIO): Device registers appear in the system's physical address space as if they were regular memory. The CPU reads/writes to these addresses to communicate with the device.

Example: On ARM, peripheral registers might be at addresses 0x40000000-0x4FFFFFFF. Reading from 0x40000004 might return a device status register. The same load/store instructions used for memory access also access devices.

Port-mapped I/O (PMIO): Devices have a separate address space from memory, accessed through special instructions. On x86, IN and OUT instructions read/write to port addresses (0-65535). A control signal on the bus, rather than the address itself, tells the hardware whether an access targets memory or an I/O port.

Comparison:

  • MMIO: Uses standard load/store instructions, simpler programming model, flexible address space
  • PMIO: Requires special instructions, separate address space, still used for legacy ISA devices
  • MMIO is more common in modern systems; PMIO is mainly on x86 and some embedded processors

The disadvantage of MMIO is that incorrect pointer access can corrupt device registers rather than just memory. Device drivers must carefully validate pointers before dereferencing MMIO addresses.

12. How does a Translation Lookaside Buffer (TLB) improve memory access performance?

The TLB is a hardware cache that stores recent virtual-to-physical address translations. Every memory access requires address translation through the MMU; without a cached translation, the MMU must walk the multi-level page tables, which costs one memory access per level (four or five on x86-64). The TLB provides a fast-path lookup for the most common translations.

TLB lookup:

  1. CPU presents virtual address to TLB
  2. If address matches a TLB entry (TLB hit): translation is available within about a cycle
  3. If no match (TLB miss): the MMU walks the page tables, which can take tens to hundreds of cycles depending on whether the table entries are cached, then stores the result in the TLB

TLB entries typically contain: virtual page number, physical page number, protection bits (read/write/execute), valid bit, and sometimes an ASID (Address Space ID) to identify which process's mapping.

TLB size: typically 64-1024 entries. With 4KB pages, that covers only 256KB-4MB of address space, so programs with larger working sets miss regularly. Larger pages (huge pages) extend the reach of each entry.

Some processors have separate instruction (ITLB) and data (DTLB) TLBs. Others have a unified TLB. Large working sets (like database servers) often suffer from TLB pressure, leading to performance degradation.
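TLB pressure is easy to reproduce in a small simulation. The sketch below models a TLB with least-recently-used replacement, assuming 4 KiB pages; the `TLB` class, the LRU policy, and the `scan` helper are illustrative choices, since real replacement policies vary by processor.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KiB pages

class TLB:
    """Toy fully-associative TLB with LRU replacement."""

    def __init__(self, entries):
        self.entries = entries
        self.cache = OrderedDict()        # vpn -> frame, least recent first
        self.hits = 0
        self.misses = 0

    def lookup(self, vaddr, page_table):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.cache:
            self.hits += 1
            self.cache.move_to_end(vpn)           # mark as most recently used
        else:
            self.misses += 1
            self.cache[vpn] = page_table[vpn]     # stands in for the page walk
            if len(self.cache) > self.entries:
                self.cache.popitem(last=False)    # evict least recently used
        return self.cache[vpn] * PAGE_SIZE + offset

def scan(entries, pages=8, passes=2):
    """Touch one address in each of `pages` pages, `passes` times over."""
    page_table = {vpn: vpn + 100 for vpn in range(pages)}   # hypothetical frames
    tlb = TLB(entries)
    for _ in range(passes):
        for vpn in range(pages):
            tlb.lookup(vpn * PAGE_SIZE, page_table)
    return tlb.hits, tlb.misses

print(scan(entries=8))  # working set fits: second pass hits every time
print(scan(entries=4))  # working set too large: LRU evicts pages before reuse
```

When the working set exceeds the TLB by even a little, a cyclic access pattern under LRU can miss on every access, which is the cliff large-memory workloads tend to hit.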

13. What is the difference between a microcontroller and a microprocessor?

A microprocessor (like Intel Core i7, AMD Ryzen, ARM Cortex-A) is just the CPU—it has registers, ALU, control unit, cache, but no peripherals or memory. It requires external RAM, ROM, and I/O devices to form a complete system. Used in PCs, servers, smartphones.

A microcontroller (like Arduino's ATmega328P, STM32, PIC) integrates the CPU, memory (flash, SRAM), and peripherals onto a single chip. It can connect directly to sensors, actuators, and other components without additional chips. Used in appliances, embedded systems, IoT devices.

Key differences:

  • Integration: Microcontroller = CPU + memory + peripherals on one chip
  • Cost: Microcontrollers can be under $1; microprocessors require supporting chips
  • Power: Microcontrollers use milliwatts; microprocessors can use 50-100W
  • Peripherals: Microcontrollers include timers, ADC, DAC, communication (UART, SPI, I2C), PWM
  • Development: Microcontrollers can often run from internal ROM; microprocessors require external boot devices

The choice depends on the application: a microwave needs a microcontroller; a smartphone needs a microprocessor with multiple cores, GPU, and cellular modem.

14. What is ECC memory and when is it necessary?

ECC (Error-Correcting Code) memory detects and corrects single-bit errors and detects (but cannot correct) double-bit errors. It uses extra bits (check bits) stored alongside the data to validate and correct.

How ECC works:

  • For each 64-bit data word, ECC adds 8 check bits (72 bits stored per word)
  • The check bits encode parity over overlapping subsets of the data bits
  • On every read, the memory controller recomputes the check bits and compares them with the stored ones; the pattern of mismatches (the syndrome) identifies which single bit flipped so it can be corrected
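The correction mechanism can be seen at small scale with a Hamming(7,4) code, which protects 4 data bits with 3 check bits. This is a scaled-down illustration only: ECC memory uses a wider code over 64 data bits plus an extra parity bit for double-error detection, and the function names here are invented for the example.

```python
def encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4       # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4       # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4       # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7

def correct(codeword):
    """Return (data_bits, error_position); position 0 means no error found."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]        # recheck parity group 1
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]        # recheck parity group 2
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]        # recheck parity group 3
    syndrome = s1 + 2 * s2 + 4 * s3       # reads as the 1-based flipped position
    if syndrome:
        c[syndrome - 1] ^= 1              # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome

data = [1, 0, 1, 1]
word = encode(data)
corrupted = list(word)
corrupted[4] ^= 1                          # simulate a single-bit soft error
recovered, position = correct(corrupted)
print(recovered == data, position)         # True 5
```

Each parity group covers a different subset of positions, so the three recomputed parities together spell out the position of the flipped bit in binary.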

Types of errors:

  • Soft errors: Caused by cosmic rays or electrical noise—temporary, random bit flips
  • Hard errors: Caused by manufacturing defects or physical damage to chips—persistent

ECC is necessary in:

  • Servers and workstations: Data corruption can be catastrophic in commercial environments
  • Mission-critical systems: Medical devices, aerospace, financial systems where errors are unacceptable
  • High-availability systems: ECC prevents silent data corruption

Regular desktop RAM (non-ECC) cannot detect errors—it just passes through data. Some errors go undetected, causing corruption that manifests as crashes, security vulnerabilities, or data loss.

15. How does the memory hierarchy work together to provide the illusion of fast, large memory?

The memory hierarchy exploits the principle of locality: programs tend to access the same data repeatedly (temporal locality) and access nearby data (spatial locality). Faster, smaller memory sits closer to the CPU; slower, larger memory sits further away.

The levels (fastest to slowest):

  • Registers: about 1 cycle, 1-2 KB total
  • L1 Cache: about 3-5 cycles, 32-64 KB per core
  • L2 Cache: 10-20 cycles, 256KB-1MB per core
  • L3 Cache: 40-60 cycles, 8-64MB shared
  • Main Memory (RAM): 100-300 cycles (on the order of 100 ns), gigabytes
  • Secondary Storage (SSD/HDD): tens of microseconds (SSD) to milliseconds (HDD), terabytes

Data flows up and down the hierarchy. When the CPU needs data, it checks L1; on miss, L2; on miss, L3; on miss, main memory. The data is then cached at all levels for future use. When memory is written, the write may go to just L1 (write-back) or all levels (write-through), depending on the policy.

The illusion: programs see memory that is nearly as fast as L1 but as large as disk. Without caching, every access would pay main-memory latency, and memory-bound programs could run up to around 100x slower.
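The effect of the hierarchy can be put into numbers with an average memory access time (AMAT) calculation. The sketch below is a simple model; the latencies and hit rates are assumed values for illustration, not measurements.

```python
def amat(levels, memory_latency):
    """Average memory access time in cycles.

    levels: list of (lookup_latency_cycles, hit_rate) pairs, L1 outward.
    memory_latency: cycles for an access that misses every cache level.
    """
    total = 0.0
    reach = 1.0                       # fraction of accesses reaching this level
    for latency, hit_rate in levels:
        total += reach * latency      # every access that gets here pays the lookup
        reach *= 1.0 - hit_rate       # only the misses continue outward
    return total + reach * memory_latency

# Assumed figures: L1 4 cycles / 95% hits, L2 14 / 80%, L3 50 / 50%, DRAM 200.
hierarchy = [(4, 0.95), (14, 0.80), (50, 0.50)]
with_caches = amat(hierarchy, 200)
without_caches = amat([], 200)
print(f"with caches:    {with_caches:.1f} cycles")
print(f"without caches: {without_caches:.1f} cycles")
```

With these assumed rates the average access costs about 6 cycles instead of 200, even though only the L1 is actually fast.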

16. What is the purpose of a system bus and what are the different types of buses in a computer?

The system bus connects CPU, memory, and I/O devices, carrying data, addresses, and control signals. It's the communication backbone of the computer.

Three bus types:

  • Data bus: Carries actual data. Width (bits) determines how much data can be transferred per cycle. 64-bit data bus can move 8 bytes simultaneously.
  • Address bus: Carries memory addresses for read/write operations. Width determines maximum addressable memory: 32-bit address bus can address 4GB.
  • Control bus: Carries timing and control signals: read/write enable, interrupt acknowledge, bus request/grant, clock signals.

Bus architectures:

  • Front-side bus (FSB): Traditional shared bus connecting CPU, memory, and I/O (older systems)
  • Point-to-point interconnects: Modern systems use dedicated links (Intel QPI, AMD HyperTransport, ARM CCI). No shared bus bottleneck.
  • PCI Express: Point-to-point serial bus for expansion cards and GPUs. Replaced parallel PCI/AGP.
  • DDR memory bus: Dedicated channel to RAM modules.

The bus width and clock speed determine bandwidth. A 64-bit bus at 100MHz can theoretically transfer 800 MB/s.
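The bandwidth figure follows directly from width times transfer rate. A minimal sketch, where the function name and the `transfers_per_clock` parameter are illustrative (the latter models double-data-rate signaling):

```python
def peak_bandwidth_mb_s(width_bits, clock_mhz, transfers_per_clock=1):
    """Theoretical peak bandwidth in MB/s (1 MB = 10**6 bytes)."""
    bytes_per_transfer = width_bits // 8
    return bytes_per_transfer * clock_mhz * transfers_per_clock

# 64-bit bus at 100 MHz, one transfer per clock.
print(peak_bandwidth_mb_s(64, 100))        # 800

# Hypothetical 64-bit channel at 1600 MHz with two transfers per clock.
print(peak_bandwidth_mb_s(64, 1600, 2))    # 25600
```

These are ceilings; protocol overhead, arbitration, and refresh cycles keep sustained throughput below the theoretical peak.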

17. How does a watchdog timer work and why is it used in embedded systems?

A watchdog timer is a hardware timer that resets the system if the software fails to periodically "feed" (reset) it. If the software gets stuck (hang, infinite loop, crash), it stops feeding the watchdog, and the watchdog resets the system.

Operation:

  1. Software initializes watchdog with timeout period (e.g., 1 second)
  2. Software periodically resets the timer before it expires
  3. If software hangs: timer expires, generates reset signal
  4. System reboots and hopefully recovers

Use cases in embedded systems:

  • Industrial control: Unattended systems must recover from faults
  • Safety systems: Medical devices, automotive controllers
  • Remote systems: Solar-powered devices in inaccessible locations
  • Consumer appliances: Smart TVs, routers that run for months without restart

The watchdog must be reset by the main loop or critical tasks. If interrupts are disabled for too long, the watchdog may fire even though the software is running correctly. Some systems use windowed watchdogs that require resets within a specific time window (not too early, not too late).
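The feed-or-reset behavior can be simulated with a logical clock. In the sketch below, the `Watchdog` class, the tick-based timing, and the `first_reset_tick` helper are all hypothetical; a real watchdog is a hardware counter driven by its own clock source.

```python
class Watchdog:
    """Counts down in simulated ticks and expires if not fed in time."""

    def __init__(self, timeout_ticks):
        self.timeout = timeout_ticks
        self.remaining = timeout_ticks

    def feed(self):
        self.remaining = self.timeout      # software proves it is still alive

    def tick(self):
        """Advance one tick; return True if the timer has expired."""
        self.remaining -= 1
        return self.remaining <= 0         # hardware would assert reset here

def first_reset_tick(total_ticks, hang_at, timeout_ticks=5, feed_every=2):
    """Tick at which the watchdog fires, or None if it never does.

    The simulated main loop feeds the watchdog every `feed_every` ticks
    until it hangs at tick `hang_at` (None means it never hangs).
    """
    dog = Watchdog(timeout_ticks)
    for t in range(total_ticks):
        healthy = hang_at is None or t < hang_at
        if healthy and t % feed_every == 0:
            dog.feed()
        if dog.tick():
            return t
    return None

print(first_reset_tick(100, hang_at=None))  # None: healthy loop keeps feeding
print(first_reset_tick(100, hang_at=10))    # 12: fires 5 ticks after the last feed
```

The timeout has to be longer than the worst-case gap between feeds, otherwise a healthy but busy system resets itself.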

18. What is the difference between volatile and non-volatile memory?

Volatile memory loses its contents when power is removed. It requires power to maintain the stored data. Examples:

  • SRAM (Static RAM): Uses flip-flops (6 transistors per bit). Fast (1 cycle), expensive. Used for CPU registers and L1/L2 cache.
  • DRAM (Dynamic RAM): Uses a capacitor + transistor per bit. Slower (needs refresh), cheaper. Used for main memory (RAM).

Non-volatile memory retains data without power. Examples:

  • Flash: EEPROM variant. Blocks can be electrically erased and reprogrammed. Used for SSDs, USB drives, firmware storage.
  • ROM (Mask ROM): Programmed during manufacturing. Cannot be modified.
  • PROM: Programmable once after manufacture (one-time programmable); cannot be erased.
  • EPROM: Erasable with UV light, then reprogrammable.
  • EEPROM: Electrically erasable. Used for BIOS/UEFI chips, embedded firmware.
  • NVRAM: Non-volatile SRAM (battery-backed or magnetic). Used for routers, game consoles.

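Trade-offs: Volatile memory is faster and tolerates effectively unlimited writes; non-volatile memory retains data when powered off and, in the case of flash, is much cheaper per bit than DRAM. Systems use both: SRAM for caches, DRAM for active data, flash for persistent storage, and battery-backed NVRAM for small amounts of critical data that must survive power loss.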

19. How do you measure CPU performance and what metrics matter most?

CPU performance is measured by Instructions Per Cycle (IPC) and clock frequency. The product gives instructions per second: Performance = IPC × Frequency.
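As a worked example of the formula, the sketch below derives IPC from instruction and cycle counts of the kind a hardware counter tool reports. The counts and the 3 GHz clock are made-up inputs, and the function names are illustrative.

```python
def ipc(instructions, cycles):
    """Instructions retired per clock cycle."""
    return instructions / cycles

def instructions_per_second(ipc_value, frequency_hz):
    """Performance = IPC x Frequency."""
    return ipc_value * frequency_hz

# Made-up counter readings: 8 billion instructions in 4 billion cycles.
measured_ipc = ipc(8_000_000_000, 4_000_000_000)
throughput = instructions_per_second(measured_ipc, 3_000_000_000)
print(measured_ipc)   # 2.0
print(throughput)     # 6000000000.0
```

Two CPUs at the same clock can differ widely in throughput if one sustains a higher IPC, which is why frequency alone is a poor comparison.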

However, real-world performance depends on workload:

  • Compute-bound workloads: Benefit from higher frequency and more cores
  • Memory-bound workloads: Limited by memory bandwidth and cache effectiveness
  • Branch-heavy workloads: Sensitive to branch prediction accuracy

Key metrics:

  • Instructions per cycle (IPC): Higher is better. Indicates how efficiently the CPU executes instructions.
  • Cache hit rates: L1/L2/L3 hit rates determine memory stall time.
  • Branch misprediction rate: Modern predictors usually keep this to a few percent or less on typical code; sustained rates of 5-10% noticeably hurt performance.
  • Memory bandwidth: GB/s for sustained data transfer.
  • Latency: How long specific operations take (memory load latency, cache miss latency).

Tools:

  • perf stat: Hardware performance counters
  • vtune (Intel): Detailed microarchitectural analysis
  • ARM PMU counters (for example via perf): Similar analysis on ARM
  • Microbenchmarks isolate specific operations

20. What is the difference between a register and a cache line?

Registers are the CPU's internal storage locations, directly addressable by instructions. They're part of the CPU silicon, accessible in a single cycle. Modern x86 has 16 general-purpose registers plus special registers (RIP, RSP, RFLAGS, etc.). ARM64 has 31 general-purpose registers.

Registers hold:

  • Data being actively processed (operands for ALU)
  • Addresses for memory access
  • Control values (flags, status)
  • Return addresses for function calls

Cache lines are the unit of data transfer between CPU cache and main memory. When the CPU accesses memory, an entire cache line (typically 64 bytes) is loaded into L1 cache, not just the requested byte. This exploits spatial locality—accessing one byte means nearby bytes are likely accessed soon.

Key differences:

  • Access: Registers are addressed by instruction operands (e.g., mov rax, rbx). Cache is transparently managed—software doesn't explicitly address cache lines.
  • Size: Registers are typically 64 bits (8 bytes). Cache lines are 64 bytes.
  • Management: Compiler allocates registers. Hardware manages cache automatically (except for software hints like prefetch).
  • Visibility: Registers are architecturally visible (programmers see them). Cache is invisible (transparent to software).

Conclusion

Understanding how computers work at the hardware level provides the foundation for reasoning about performance, security, and system design. The fetch-decode-execute cycle, cache hierarchies, and virtual memory subsystems are not abstract concepts but practical realities that directly impact application behavior.

The concepts covered here—CPU architecture, memory hierarchy, and I/O systems—connect directly to how operating systems manage hardware resources. When you’re debugging a performance issue or designing a system that scales, this low-level understanding becomes your most powerful tool.

Continue your journey into computer architecture by exploring number systems and data representation to understand how bits become meaningful data, or dive into boolean logic and gates to see how simple transistors become functional units.
