Building a Simple CPU Simulator: Implementing the Fetch-Decode-Execute Cycle

Implement a functional CPU simulator in Python that demonstrates the fetch-decode-execute cycle, register file, ALU operations, and memory access.

published: reading time: 33 min read author: GeekWorkBench

Introduction

Building a CPU simulator is one of the most educational projects in computer science. By implementing one, you see how instructions actually flow through a processor, how the fetch-decode-execute cycle works, and why real architectures make the design decisions they do. Writing a simulator also develops intuition for debugging systems-level code—state management, error handling, trace output—since these same challenges appear in kernel debugging and emulator development.

The simulator we’ll build implements a simple RISC-like instruction set with a register file, ALU, and memory subsystem. It shows the fundamental concepts used in actual CPU design while staying small enough to understand completely.

When to Use / When Not to Use

When to Use a CPU Simulator:

  • Teaching and learning computer architecture fundamentals
  • Testing instruction set changes before hardware implementation
  • Writing and debugging code for a new or custom ISA
  • Researching security vulnerabilities in a controlled environment
  • Creating test harnesses for compiler backends

When Not to Use:

  • Production performance requirements—interpretive simulation is slow
  • Whencycle-accurate timing matters—our simulator is functional, not timing-accurate
  • As a replacement for actual hardware testing before deployment
  • When you need memory hierarchy simulation (caches, TLBs) in detail

CPU Architecture

flowchart TB
    subgraph "Fetch Unit"
        A["Program Counter<br/>PC"] --> B["Instruction<br/>Memory"]
        B --> C["Instruction<br/>Register"]
    end

    subgraph "Decode Unit"
        C --> D["Control Unit"]
        D --> E["Register File<br/>Read Ports"]
        D --> F["Immediate<br/>Generator"]
    end

    subgraph "Execute Unit"
        E --> G["ALU"]
        F --> G
        E --> H["Branch<br/>Comparator"]
    end

    subgraph "Memory Unit"
        G --> I["Data<br/>Memory"]
        I --> J["Write Back<br/>Mux"]
    end

    subgraph "Write Back"
        J --> K["Register File<br/>Write Port"]
        K --> E
        G --> K
    end

    subgraph "PC Update"
        H --> L["PC Src<br/>Mux"]
        L --> A
        G --> L
    end

Core Concepts

The Fetch-Decode-Execute Cycle

The fundamental operation of any CPU is the fetch-decode-execute cycle, repeated billions of times per second:

  1. Fetch: Read the next instruction from memory at the address in the Program Counter (PC)
  2. Decode: Examine the instruction bits to determine what operation to perform
  3. Execute: Perform the computation—ALU operation, memory access, or register transfer
  4. Write Back: Store the result back to registers or memory
  5. Update PC: Increment PC (or set to branch target if taken)

Our simulator implements this cycle in software, using Python data structures to model the hardware components.

Register File Design

The register file is a small, fast memory within the CPU that holds working values. Our simulator uses a simple array-based implementation:

class RegisterFile:
    """Simulates the register file with read and write ports."""

    def __init__(self, num_registers=8):
        self.num_registers = num_registers
        self.registers = [0] * num_registers
        self.pc = 0  # Program counter is separate

    def read(self, reg_num):
        """Read a register value."""
        if 0 <= reg_num < self.num_registers:
            return self.registers[reg_num]
        raise ValueError(f"Invalid register number: {reg_num}")

    def write(self, reg_num, value):
        """Write a value to a register."""
        if 0 <= reg_num < self.num_registers:
            self.registers[reg_num] = value
        else:
            raise ValueError(f"Invalid register number: {reg_num}")

    def dump(self):
        """Return current register state for debugging."""
        return {
            f"R{i}": self.registers[i]
            for i in range(self.num_registers)
        }

ALU Operations

The Arithmetic Logic Unit (ALU) performs computational operations. Our simple simulator supports:

OperationOpcodeDescription
ADD0x0R[rd] = R[rs1] + R[rs2]
SUB0x1R[rd] = R[rs1] - R[rs2]
AND0x2R[rd] = R[rs1] & R[rs2]
OR0x3R[rd] = R[rs1] | R[rs2]
XOR0x4R[rd] = R[rs1] ^ R[rs2]
SLT0x5R[rd] = R[rs1] < R[rs2] (signed)
SLL0x6R[rd] = R[rs1] << R[rs2]
SRL0x7R[rd] = R[rs1] >> R[rs2] (logical)
ADDI0x8R[rd] = R[rs1] + immediate
ANDI0x9R[rd] = R[rs1] & immediate
ORI0xAR[rd] = R[rs1] | immediate

Instruction Format

Our simulator uses a simple 32-bit R-type/I-type instruction format:

flowchart LR
    subgraph "R-Type [opcode 0-7]"
        A["rd<br/>[5 bits]"] --> B["rs1<br/>[5 bits]"] --> C["rs2<br/>[5 bits]"] --> D["opcode<br/>[7 bits]"]
    end

    subgraph "I-Type [opcode 8-A]"
        E["rd<br/>[5 bits]"] --> F["rs1<br/>[5 bits]"] --> G["immediate<br/>[12 bits]"] --> H["opcode<br/>[7 bits]"]
    end

    subgraph "J-Type [opcode B: jump]"
        I["rd<br/>[5 bits]"] --> J["target<br/>[20 bits]"] --> K["opcode<br/>[7 bits]"]
    end

Implementation Snippets

Complete CPU Simulator

#!/usr/bin/env python3
"""
Simple CPU Simulator implementing fetch-decode-execute cycle.
Supports R-type, I-type, and J-type instructions.
"""

from enum import IntEnum
from dataclasses import dataclass
from typing import List, Optional


class Opcode(IntEnum):
    """Opcodes for our simple instruction set."""
    ADD = 0x0
    SUB = 0x1
    AND = 0x2
    OR = 0x3
    XOR = 0x4
    SLT = 0x5
    SLL = 0x6
    SRL = 0x7
    ADDI = 0x8
    ANDI = 0x9
    ORI = 0xA
    J = 0xB
    BEQZ = 0xC
    LOAD = 0xD
    STORE = 0xE
    HALT = 0xF


@dataclass
class Instruction:
    """Parsed instruction with all fields."""
    opcode: Opcode
    rd: int
    rs1: int
    rs2: int
    immediate: int = 0


class Memory:
    """Simulates instruction and data memory."""

    def __init__(self, size=1024):
        self.size = size
        self.data = [0] * size

    def load_byte(self, address: int) -> int:
        if 0 <= address < self.size:
            return self.data[address] & 0xFF
        raise MemoryError(f"Invalid load address: {address}")

    def store_byte(self, address: int, value: int):
        if 0 <= address < self.size:
            self.data[address] = value & 0xFF
        else:
            raise MemoryError(f"Invalid store address: {address}")

    def load_word(self, address: int) -> int:
        """Load 4 bytes as a little-endian word."""
        if address + 3 < self.size:
            return (self.data[address] |
                    self.data[address + 1] << 8 |
                    self.data[address + 2] << 16 |
                    self.data[address + 3] << 24)
        raise MemoryError(f"Invalid word load address: {address}")

    def store_word(self, address: int, value: int):
        """Store 4 bytes as a little-endian word."""
        if address + 3 < self.size:
            self.data[address] = value & 0xFF
            self.data[address + 1] = (value >> 8) & 0xFF
            self.data[address + 2] = (value >> 16) & 0xFF
            self.data[address + 3] = (value >> 24) & 0xFF
        else:
            raise MemoryError(f"Invalid word store address: {address}")

    def load_instruction(self, address: int) -> int:
        """Load a 32-bit instruction (4 bytes little-endian)."""
        return self.load_word(address)


class RegisterFile:
    """Simulates the register file."""

    def __init__(self, num_registers=8):
        self.num_registers = num_registers
        self.registers = [0] * num_registers
        self.pc = 0  # Program counter
        self.registers[0] = 0  # R0 is hardwired to zero

    def read(self, reg_num: int) -> int:
        if reg_num == 0:
            return 0  # R0 is always zero
        if 0 < reg_num < self.num_registers:
            return self.registers[reg_num]
        raise ValueError(f"Invalid register: {reg_num}")

    def write(self, reg_num: int, value: int):
        if 0 < reg_num < self.num_registers:
            self.registers[reg_num] = value
        # R0 writes are ignored (it's hardwired to zero)

    def dump(self) -> dict:
        return {
            f"R{i}": self.registers[i]
            for i in range(min(8, self.num_registers))
        }


class ALU:
    """Arithmetic Logic Unit for the CPU simulator."""

    @staticmethod
    def execute(opcode: Opcode, a: int, b: int) -> int:
        """Execute ALU operation and return result."""
        ops = {
            Opcode.ADD: lambda x, y: x + y,
            Opcode.SUB: lambda x, y: x - y,
            Opcode.AND: lambda x, y: x & y,
            Opcode.OR: lambda x, y: x | y,
            Opcode.XOR: lambda x, y: x ^ y,
            Opcode.SLT: lambda x, y: 1 if x < y else 0,
            Opcode.SLL: lambda x, y: (x << y) & 0xFFFFFFFF,
            Opcode.SRL: lambda x, y: x >> y,
        }

        if opcode in ops:
            return ops[opcode](a, b)

        # For immediate operations, 'b' contains the immediate value
        # handled in execute_instruction
        raise ValueError(f"Unsupported ALU opcode: {opcode}")


class CPUSimulator:
    """Complete CPU simulator with fetch-decode-execute cycle."""

    def __init__(self, memory_size=1024):
        self.memory = Memory(memory_size)
        self.registers = RegisterFile()
        self.alu = ALU()
        self.halted = False
        self.cycles = 0
        self.trace_enabled = False

    def load_program(self, instructions: List[int], address=0):
        """Load a list of 32-bit instructions into memory."""
        for i, instr in enumerate(instructions):
            addr = address + i * 4
            self.memory.store_word(addr, instr)

    def fetch(self) -> int:
        """Fetch instruction from memory at PC."""
        if self.trace_enabled:
            print(f"[FETCH] PC={self.registers.pc:04x} ", end="")
        instr = self.memory.load_instruction(self.registers.pc)
        if self.trace_enabled:
            print(f"Instruction=0x{instr:08x}")
        return instr

    def decode(self, instruction: int) -> Instruction:
        """Decode a 32-bit instruction."""
        opcode = Opcode(instruction & 0xF)
        rd = (instruction >> 4) & 0x7
        rs1 = (instruction >> 7) & 0x7
        rs2 = (instruction >> 10) & 0x7
        immediate = (instruction >> 10) & 0xFFF  # 12-bit immediate

        # Sign-extend immediate if bit 11 is set
        if immediate & 0x800:
            immediate |= 0xFFFFF000

        return Instruction(opcode=opcode, rd=rd, rs1=rs1, rs2=rs2,
                          immediate=immediate)

    def execute(self, instr: Instruction) -> bool:
        """
        Execute a decoded instruction.
        Returns True if we should advance PC, False for branch/jump.
        """
        op = instr.opcode
        pc_src = True  # True = increment PC, False = use calculated address

        # Handle immediate operands for I-type
        if op in (Opcode.ADDI, Opcode.ANDI, Opcode.ORI):
            operand2 = instr.immediate
        else:
            operand2 = self.registers.read(instr.rs2)

        operand1 = self.registers.read(instr.rs1)

        # ALU operations
        if op in (Opcode.ADD, Opcode.SUB, Opcode.AND, Opcode.OR,
                  Opcode.XOR, Opcode.SLT, Opcode.SLL, Opcode.SRL):
            result = self.alu.execute(op, operand1, operand2)
            self.registers.write(instr.rd, result)

        # Immediate operations
        elif op == Opcode.ADDI:
            result = operand1 + operand2
            self.registers.write(instr.rd, result)

        elif op == Opcode.ANDI:
            result = operand1 & operand2
            self.registers.write(instr.rd, result)

        elif op == Opcode.ORI:
            result = operand1 | operand2
            self.registers.write(instr.rd, result)

        # Jump (unconditional)
        elif op == Opcode.J:
            target = instr.immediate & 0xFFFFFFFC  # Word-aligned target
            self.registers.pc = target - 4  # Will be incremented after
            if self.trace_enabled:
                print(f"[JUMP]  PC={self.registers.pc:04x}")
            pc_src = False

        # Branch if equal zero
        elif op == Opcode.BEQZ:
            if operand1 == 0:
                offset = instr.immediate & 0xFFFFFFFC
                self.registers.pc = self.registers.pc + offset - 4
                if self.trace_enabled:
                    print(f"[BRANCH] Taken to PC={self.registers.pc:04x}")
                pc_src = False

        # Load from memory
        elif op == Opcode.LOAD:
            addr = operand1 + instr.immediate
            value = self.memory.load_word(addr)
            self.registers.write(instr.rd, value)
            if self.trace_enabled:
                print(f"[LOAD]  R{instr.rd}=MEM[{addr:04x}]={value}")

        # Store to memory
        elif op == Opcode.STORE:
            addr = operand1 + instr.immediate
            value = self.registers.read(instr.rd)
            self.memory.store_word(addr, value)
            if self.trace_enabled:
                print(f"[STORE] MEM[{addr:04x}]=R{instr.rd}={value}")

        # Halt
        elif op == Opcode.HALT:
            self.halted = True
            if self.trace_enabled:
                print("[HALT]")
            pc_src = False

        return pc_src

    def step(self) -> bool:
        """
        Execute one complete fetch-decode-execute cycle.
        Returns True if CPU is still running, False if halted.
        """
        if self.halted:
            return False

        self.cycles += 1

        # Fetch
        instruction = self.fetch()

        # Decode
        decoded = self.decode(instruction)

        if self.trace_enabled:
            print(f"[DECODE] op={decoded.opcode.name} rd=R{decoded.rd} "
                  f"rs1=R{decoded.rs1} rs2=R{decoded.rs2} "
                  f"imm={decoded.immediate}")

        # Execute
        pc_update = self.execute(decoded)

        # Update PC
        if pc_update:
            self.registers.pc += 4

        if self.trace_enabled:
            regs = self.registers.dump()
            print(f"[STATE] Cycles={self.cycles} Regs={regs}")

        return not self.halted

    def run(self, max_cycles=10000):
        """Run until HALT or max cycles reached."""
        while not self.halted and self.cycles < max_cycles:
            if not self.step():
                break
        return self.cycles


def assemble(instr_str: str) -> int:
    """Simple assembler: convert instruction string to 32-bit binary."""
    parts = instr_str.replace(',', ' ').split()
    if not parts:
        return 0

    op = parts[0].upper()

    # R-type: OP rd, rs1, rs2
    r_ops = {'ADD', 'SUB', 'AND', 'OR', 'XOR', 'SLT', 'SLL', 'SRL'}
    # I-type: OP rd, rs1, imm
    i_ops = {'ADDI', 'ANDI', 'ORI'}
    # J-type: J target
    # Branch: BEQZ rs1, offset

    if op == 'ADD':
        rd, rs1, rs2 = int(parts[1][1]), int(parts[2][1]), int(parts[3][1])
        return (Opcode.ADD << 0) | (rd << 4) | (rs1 << 7) | (rs2 << 10)

    elif op == 'ADDI':
        rd, rs1, imm = int(parts[1][1]), int(parts[2][1]), int(parts[3])
        return (Opcode.ADDI << 0) | (rd << 4) | (rs1 << 7) | ((imm & 0xFFF) << 10)

    elif op == 'J':
        target = int(parts[1])
        return (Opcode.J << 0) | ((target & 0xFFFFFF) << 7)

    elif op == 'BEQZ':
        rs1, offset = int(parts[1][1]), int(parts[2])
        return (Opcode.BEQZ << 0) | (rs1 << 7) | ((offset & 0xFFF) << 10)

    elif op == 'HALT':
        return (Opcode.HALT << 0)

    return 0

Test Program: Sum 1 to N

def test_sum_to_n():
    """Test program: calculate sum of 1 + 2 + ... + 10 = 55"""

    # Assembly program:
    # R1 = 10        ; counter
    # R2 = 0         ; accumulator
    # R3 = 1         ; constant 1
    # R4 = 0         ; temporary for addition
    # loop:
    #   ADD R4, R2, R1   ; R4 = R2 + R1
    #   ADDI R2, R4, 0   ; R2 = R4
    #   ADDI R1, R1, -1  ; R1--
    #   BEQZ R1, done     ; if R1 == 0, exit
    #   J loop
    # done:
    #   HALT

    program = """
        ADDI R1, R0, 10      ; R1 = 10 (counter)
        ADDI R2, R0, 0       ; R2 = 0 (accumulator)
        ADDI R3, R0, 1      ; R3 = 1 (constant)
    loop:
        ADD  R4, R2, R1     ; R4 = R2 + R1
        ADDI R2, R4, 0      ; R2 = R4
        ADDI R1, R1, -1     ; R1--
        BEQZ R1, done       ; if R1 == 0, exit
        J loop
    done:
        HALT
    """

    # Assemble the program
    instructions = []
    for line in program.strip().split('\n'):
        line = line.split(';')[0].strip()  # Remove comments
        if line:
            instructions.append(assemble(line))

    # Load and run
    cpu = CPUSimulator()
    cpu.load_program(instructions)
    cpu.trace_enabled = True

    cycles = cpu.run()
    print(f"\n[RESULT] Program completed in {cycles} cycles")
    print(f"[RESULT] Sum in R2 = {cpu.registers.read(2)}")
    print(f"[RESULT] Expected: 55")


def test_memory_access():
    """Test memory load/store operations."""

    program = """
        ADDI R1, R0, 0       ; R1 = base address
        ADDI R2, R0, 42      ; R2 = 42
        STORE R2, R1, 0      ; MEM[0] = 42
        ADDI R3, R0, 100     ; R3 = 100
        STORE R3, R1, 4      ; MEM[4] = 100
        LOAD R4, R1, 0       ; R4 = MEM[0]
        LOAD R5, R1, 4       ; R5 = MEM[4]
        ADD  R6, R4, R5      ; R6 = R4 + R5 = 142
        HALT
    """

    instructions = []
    for line in program.strip().split('\n'):
        line = line.split(';')[0].strip()
        if line:
            instructions.append(assemble(line))

    cpu = CPUSimulator()
    cpu.load_program(instructions)
    cpu.trace_enabled = True

    cycles = cpu.run()
    print(f"\n[RESULT] Memory test completed in {cycles} cycles")
    print(f"[RESULT] R6 = {cpu.registers.read(6)} (expected 142)")


if __name__ == "__main__":
    test_sum_to_n()
    print("\n" + "="*50 + "\n")
    test_memory_access()

Production Failure Scenarios

Scenario 1: Infinite Loops from Branch Prediction Errors

Failure: Simulator enters infinite loop due to incorrect branch resolution logic.

Example bug:

# BUG: PC update happens incorrectly
elif op == Opcode.BEQZ:
    if operand1 == 0:
        self.registers.pc = self.registers.pc + offset  # Missing -4!

Mitigation:

  • Add cycle counter with hard timeout: run(max_cycles=10000)
  • Implement trace mode that prints every instruction
  • Verify program correctness against known-good reference implementation

Scenario 2: Memory Access Violations

Failure: Loading or storing beyond memory bounds causes crash or silent corruption.

Example bug:

# BUG: No bounds checking in memory access
def load_word(self, address: int) -> int:
    return (self.data[address] |           # Crashes if address out of bounds
            self.data[address + 1] << 8 |
            self.data[address + 2] << 16 |
            self.data[address + 3] << 24)

Mitigation:

  • Always validate addresses before access
  • Implement memory protection with clear error messages
  • Use Python’s bounds checking with try/except for graceful handling

Scenario 3: Signed/Unsigned Confusion in Comparisons

Failure: Sign bit treated incorrectly causes wrong branch decisions.

# BUG: Treating signed immediate as unsigned
immediate = (instruction >> 10) & 0xFFF  # Always positive!

# If we want -1 (0xFFF), this gives 4095 instead!

Mitigation:

  • Explicitly implement signed and unsigned operations
  • Use Python’s unlimited integer precision but simulate 32-bit wraparound
  • Test with edge cases: -1, 0, 1, 0x7FFFFFFF, 0x80000000

Trade-off Table

AspectFunctional SimulatorTiming-AccurateCycle-Accurate
SpeedVery fastMediumSlow
AccuracyInstruction correctness onlyExecution orderFull timing
ComplexityLowHighVery high
Use CaseEducation, ISA designPerformance explorationHardware validation
Cache modelingNoneSimplifiedFull hierarchy
Power modelingNoneSimplifiedAccurate

Observability Checklist

When building and debugging CPU simulators:

  • Instruction trace: Log every instruction with PC, opcode, and operands
  • Register state dump: Print all registers after each cycle
  • Memory access log: Track all loads and stores with addresses and values
  • Cycle counter: Count executed cycles for performance analysis
  • Breakpoint support: Halt on specific PC values or conditions
  • Memory dump: Hex dump of memory regions for debugging
  • Statistics: Count of each instruction type, branches taken, etc.
  • Visualization: State diagram or pipeline visualization for education

Common Pitfalls / Anti-Patterns

Security Considerations

  1. Memory Safety: Our simulator implements bounds checking but production emulators must also check:

    • Alignment requirements (word accesses must be word-aligned)
    • Privilege levels (user mode vs kernel mode simulation)
    • Valid memory regions (no mapping arbitrary addresses)
  2. Instruction Validation: Reject invalid opcode values before attempting execution:

    if opcode > Opcode.HALT:
        raise InstructionError(f"Invalid opcode: {opcode}")
  3. Denial of Service: Infinite loops can hang simulators. Always implement:

    • Maximum cycle limits
    • Maximum memory access limits
    • Instruction count limits per program

Compliance Notes

  • ISA Compliance Testing: Simulators used for hardware validation must pass:

    • riscv-tests for RISC-V
    • Intel Architecture Test Suite (ARTS)
    • ARM Architecture Compliance Suite
  • Floating-Point: Production simulators must implement IEEE 754 compliance for FP operations

Common Pitfalls / Anti-patterns

  1. PC Update Ordering: Forgetting that PC increments after fetch, not before. The instruction at PC is the one fetched; PC+4 is the next sequential instruction.

  2. R0 Hardwiring: In RISC architectures, R0 (or XZR in ARM64) must always read as zero. Forgetting this breaks many programs that assume ADDI R1, R0, 5 puts 5 in R1.

  3. Immediate Sign Extension: 12-bit immediates must be sign-extended to full register width. Missing sign extension causes ADDI R1, R1, -1 to actually add 4095.

  4. Memory Endianness: Our simulator uses little-endian (least significant byte at lowest address). This must match the host machine for debugging or external tools.

  5. Word Alignment: Instructions must be word-aligned (multiple of 4 bytes). Jump targets must be masked to ensure this.

  6. Register-Register vs Register-Immediate: Using the wrong operand interpretation—immediate instructions use the immediate field, not register R2.

  7. Branch Delay Slots: Some architectures (MIPS) have a delay slot where the instruction after branch executes regardless. Our simulator doesn’t model this, but real ISAs may.

Quick Recap Checklist

  • The fetch-decode-execute cycle is the fundamental CPU operation: fetch instruction, decode it, execute it, update state
  • Register files provide fast storage with read/write ports; R0 is typically hardwired to zero in RISC designs
  • The ALU performs arithmetic and logical operations; everything else is either memory access or control flow
  • Memory systems in simulators need bounds checking, alignment validation, and error handling
  • Instruction formats (R-type, I-type, J-type) pack different fields into the same 32-bit word
  • Branch and jump instructions modify PC directly; all other instructions increment PC by 4
  • Sign extension converts smaller immediates to full register width while preserving signed value
  • Infinite loops are a real risk in simulators—always implement cycle or instruction limits
  • Testing with known programs (sum to N, Fibonacci, memory tests) catches bugs early
  • The simulator concept extends to full system simulation with devices, interrupts, and exceptions

Interview Questions

1. Explain the fetch-decode-execute cycle and why it must exist.

The fetch-decode-execute cycle is the fundamental operation of any CPU. Fetch reads the next instruction from memory at the address in the Program Counter (PC) register. Decode interprets the instruction bits to determine the opcode and operands—what operation to perform and what data to use. Execute performs the actual computation: ALU operations, memory accesses, or register transfers. Finally, the CPU updates its state (registers, memory, PC) and repeats.

The cycle must exist because instructions are stored in memory as data, and the CPU needs a systematic way to retrieve and process them. There's no way to "just know" what an instruction should do—the encoding must be interpreted. This sequential, disciplined approach ensures deterministic behavior: given the same initial state and program, the CPU always produces the same results.

2. What is the purpose of the Program Counter (PC), and why does it increment by 4 in a 32-bit architecture?

The Program Counter (PC) holds the memory address of the next instruction to execute. After an instruction is fetched, the PC must be updated to point to what comes next. In a 32-bit architecture with byte-addressable memory, each instruction is 4 bytes (32 bits), so the next instruction is at PC + 4.

It's crucial to understand that the PC points to the next instruction before fetch, but after fetch, the CPU has already read the instruction at that address. For sequential execution, we increment PC by 4. For branch or jump instructions, we set PC to a different value—the branch target instead of PC+4.

Note: Some architectures (like MIPS) use a "branch delay slot"—the instruction immediately after a branch executes regardless. Our simple simulator doesn't model this, but real RISC processors do.

3. Why do we need sign extension for immediate values, and how does it work?

Immediate values in instructions are encoded in fewer bits than the full register width. For example, our I-type instructions use 12 bits for immediates, but registers are 32 bits. When we load this 12-bit value into a 32-bit register, we must decide what goes in the upper 20 bits.

Sign extension copies the sign bit (most significant bit of the immediate) into all the upper bits. If the 12-bit immediate is 0x800 (-1 in 12-bit two's complement), we sign-extend it to 0xFFFFF800 (still -1 in 32-bit). If it were 0x001 (positive 1), we extend to 0x00000001.

This preserves the arithmetic meaning: -1 + 5 should still equal 4, not 4097. Unsigned immediates (like bitwise operations) still need sign extension to fill the register correctly—they just happen to work the same way because the upper bits end up as zeros for positive values.

4. What is the difference between R-type, I-type, and J-type instructions in our simulator?

The instruction type determines how the 32-bit encoding is interpreted.

R-type (Register-type) instructions have two source registers and a destination register—all operands are registers. The format is [opcode(4)][rd(3)][rs1(3)][rs2(3)][unused(19)]. Examples: ADD, SUB, AND, OR, XOR, SLT.

I-type (Immediate-type) instructions have one source register, one immediate value (constant), and a destination register. The format is [opcode(4)][rd(3)][rs1(3)][immediate(22)]. The immediate is sign-extended. Examples: ADDI, ANDI, ORI, LOAD.

J-type (Jump-type) instructions provide an absolute target address for jumps or a PC-relative offset for branches. STORE also uses the I-type format with register value to store. Examples: J (jump), BEQZ (branch if equal to zero).

The opcode field determines which type to use and how the CPU should interpret the remaining bits.

5. How would you extend this simulator to support function calls and returns?

Function calls require three components: a way to save the return address, a mechanism to allocate stack space, and a way to restore state on return.

Call instruction: Use a J-type with a link register (like ARM's LR). CALL target would be: save PC+4 to a designated register (R7), then jump to target. The RET instruction would jump to the address in that register.

Stack management: Reserve a register as a stack pointer (SP). ALLOCATE frame subtracts from SP. DEALLOCATE frame adds back to SP. Push/pop operations adjust SP and store/load registers at [SP].

Callee-saved registers: On entry to a function, save any callee-saved registers (R5, R6 in our 8-register design) to the stack. Restore them before return.

A complete calling convention would also define parameter passing (first few parameters in R1-R3), return values (R1), and stack frame layout. This is exactly how real RISC architectures implement function calls.

6. How would you implement an interrupt handling mechanism in the simulator?

Interrupts require several additions to the basic simulator design:

Interrupt controller: A module that receives hardware interrupt signals (timer, I/O, external) and determines if they should be presented to the CPU based on interrupt enable flags.

Interrupt enable/disable: Add status flags that can globally enable/disable interrupts and per-interrupt-type masks.

Vectored interrupts: When an interrupt fires, the CPU must:

  • Finish the current instruction (atomicity)
  • Push PC and status registers to stack
  • Load new PC from interrupt vector table
  • Switch to kernel/supervisor mode

Return from interrupt: An RTI instruction pops the saved state and resumes execution.

7. What is the difference between Harvard and von Neumann architecture?

In von Neumann architecture, instructions and data share the same memory and bus. The same pathway carries both code and data, creating the "von Neumann bottleneck"—the CPU can either fetch an instruction or access data, but not both simultaneously.

In Harvard architecture, instruction memory and data memory are completely separate. The CPU can fetch an instruction and access data simultaneously on separate buses, doubling effective bandwidth.

Modern CPUs use a modified Harvard architecture with separate L1 instruction and data caches, while presenting a unified view of memory to software. This gets close to Harvard performance while maintaining software compatibility.

DSPs and microcontrollers often use true Harvard architecture for real-time performance in constrained environments.

8. How do you handle exceptions and error conditions in a CPU simulator?

Exceptions require a mechanism to detect the error condition and transfer control to an exception handler:

Exception types:

  • Undefined instruction: Opcode not recognized
  • Illegal memory access: Address out of bounds or alignment fault
  • Arithmetic overflow: Signed/unsigned overflow detection
  • Divide by zero: Division instruction with zero divisor

Handling mechanism:

if (is_undefined_opcode(instr)):
    raise_exception(EXCEPTION_UNDEFINED_INSTRUCTION, saved_pc)
elif is_illegal_memory_access(address):
    raise_exception(EXCEPTION_MEMORY_FAULT, saved_pc)
elif is_overflow(result):
    raise_exception(EXCEPTION_ARITHMETIC_OVERFLOW, saved_pc)

def raise_exception(type, pc): # Save current state push_to_stack(pc) push_to_stack(processor_status) # Load exception vector new_pc = exception_vector_table[type] jump_to_handler(new_pc)

9. How would you implement branch prediction in a CPU simulator?

Branch prediction reduces pipeline stalls by guessing the direction of branches before they're resolved. For a simulator, you can model several strategies:

Always-taken (forward not taken): Simplest — predict backward branches are taken, forward branches are not. Useful baseline.

1-bit saturating counter: Each branch has a counter that predicts taken if 1, not taken if 0. On misprediction, flip the counter. Simple but 2 mispredictions on loop exit.

2-bit saturating counter: Requires two mispredictions to change prediction. Better for loops — only mispredicts on loop exit, not entry.

Two-level adaptive (gshare): Uses global branch history register (BHR) to index into a pattern history table (PHT). Captures correlation between branches.

On a misprediction:

  1. Flush the pipeline of instructions after the branch
  2. Update the branch predictor state
  3. Fetch from the correct target address

Simulator implementation: Track predictor state per branch address, and measure misprediction rate to evaluate predictor effectiveness.

10. What is the difference between a functional simulator and a cycle-accurate simulator?

A functional simulator (like the one we built) models the behavior of a CPU without regard to timing. It executes instructions correctly but doesn't model how long each instruction takes, pipeline effects, or resource conflicts. The simulator is fast—millions of instructions per second—but not suitable for performance optimization.

A cycle-accurate simulator models the exact cycle-by-cycle behavior of the hardware. Every pipeline stage, every bus transaction, every cache access is modeled with its timing. This enables accurate performance prediction but at significant complexity and speed cost—complex cycle-accurate models may simulate only thousands of cycles per second.

Comparison:

  • Functional: Used for ISA design exploration, compiler toolchain development, debugging. Speed >> accuracy for correctness testing.
  • Cycle-accurate: Used for microarchitecture research, performance optimization, hardware validation. Accuracy >> speed.

Many industry simulators sit between these extremes: "timingSimple" models in gem5, for example, add approximate timing to functional simulation without full cycle accuracy.

11. How would you extend the simulator to support out-of-order execution?

Out-of-order execution requires several additions:

Instruction Window: A buffer that holds instructions that have been decoded but not yet executed. Instructions are dispatched to execution units as their dependencies are satisfied, not in program order.

Register Renaming: To avoid WAR (Write After Read) and WAW (Write After Write) hazards, physical registers replace architectural registers. Each instruction gets a destination physical register; older values are preserved until no longer needed.

Reservation Stations: Each execution unit has a queue of pending instructions. When operands are available and the unit is free, the instruction executes.

Reorder Buffer (ROB): Instructions are tracked in program order through the ROB. When an instruction completes, it's retired in program order—updating architectural state in the correct sequence. Speculative results are held until the branch they depend on is resolved.

The data flow: Fetch → Decode → Rename → Dispatch to reservation station → Execute → Write result → ROB Retire

Complex parts: tracking dependencies, managing the ROB size, handling mispredictions (flush the pipeline), recovering from exceptions.

12. What is the purpose of a scoreboard in CPU simulation and how does it work?

A scoreboard is a data structure used in cycle-accurate simulators to track the state of instructions in flight, manage dependencies, and detect hazards. It was popularized by the CDC 6600 scoreboard.

The scoreboard maintains:

  • Instruction status table: For each instruction in flight, which stage it is in (Issue, Read Operands, Execute, Write Result)
  • Functional unit status: Which unit is busy, what instruction it contains, how long until result
  • Register result status: Which instruction will write to each register, and when
  • Physical register file: Current values and validity flags

On each cycle, the scoreboard:

  1. Determines which instructions can issue (no structural hazard)
  2. ChecksRAW hazards (instruction reading register that will be written by earlier instruction)
  3. Monitors execution completion and triggers result writes
  4. Manages write-back order to prevent WAW and WAR hazards

The scoreboard stalls issue when a hazard exists, allowing in-order retirement semantics while enabling out-of-order execution.

13. How would you implement a cache hierarchy in the CPU simulator?

Implementing a cache hierarchy involves adding cache structures and modifying memory access:

Cache structure:

class Cache:
    def __init__(self, size, block_size, associativity):
        self.size = size
        self.block_size = block_size
        self.associativity = associativity
        self.lines = [None] * (size // block_size)
        self.hits = 0
        self.misses = 0

Memory access flow:

  1. Check L1 cache for the address
  2. On hit: return data, update LRU (Least Recently Used)
  3. On miss: check L2 cache
  4. On L2 hit: return data, update L2 and L1
  5. On L2 miss: access main memory, fill both caches

Cache line states: Valid, Modified (dirty), Owned, Invalid. This is needed for cache coherence protocols (MESI, MOESI).

Write policies:

  • Write-back: Write to cache only, write to memory on eviction
  • Write-through: Write to cache and memory simultaneously
  • Write-allocate: On write miss, allocate the cache line
  • No-write-allocate: On write miss, write directly to memory
14. What is the difference between a unified cache and separate I-cache/D-cache?

A unified cache stores both instructions and data in the same cache structure. A separate cache has distinct L1 instruction cache (I-cache) and L1 data cache (D-cache).

Advantages of separate caches:

  • Harvard architecture: Can fetch instruction and access data simultaneously on different buses
  • No conflict misses: Instructions and data don't compete for the same cache space
  • Simpler timing: Different cache parameters for instructions vs data

Advantages of unified cache:

  • Flexibility: All resources available for either instructions or data as needed
  • Higher hit rates: No wasted space if one cache is underutilized while the other is full
  • Simpler design: One cache structure instead of two

Modern CPUs typically use separate L1 caches (I-cache and D-cache) because the fetch and data access are independent critical paths. L2 and L3 are usually unified because by that point the distinction is less important and the cache needs to handle both equally.

15. How does a TLB (Translation Lookaside Buffer) work in a simulator?

A TLB is a cache of virtual-to-physical address translations. In a simulator:

TLB structure:

class TLB:
    def __init__(self, entries=64):
        self.entries = entries  # Typically 64-128
        self.tags = [None] * entries  # Virtual page numbers
        self.translations = [None] * entries  # Physical page numbers
        self.valid = [False] * entries
        self.lru = list(range(entries))  # LRU ordering

Lookup flow:

  1. Extract virtual page number from virtual address
  2. Search TLB for matching tag
  3. On hit: combine physical page number with offset
  4. On miss: walk page tables in simulated memory, add to TLB, evict LRU entry

TLB attributes:

  • ASID (Address Space ID): Identifies which process's address space (prevents flushing on context switch)
  • Protection bits: Read/write/execute permissions
  • Valid bit: Entry is valid

TLB misses are expensive—on a real CPU, they require a page table walk (10-100 cycles). Simulating this accurately adds significant complexity.

16. What is the difference between a microscalar and a superscalar processor?

A scalar processor can complete at most one instruction per cycle. A superscalar processor can complete more than one instruction per cycle by having multiple execution units and dispatching instructions to them in parallel.

A microscalar processor extends scalar by having multiple pipeline lanes but limited parallelism—typically issue width of 2-4 and fewer execution units than a full superscalar.

Superscalar characteristics:

  • Issue width: Number of instructions that can be dispatched per cycle (2, 4, 8)
  • Execution units: ALU, load/store, FP units that can operate in parallel
  • Register files with multiple ports: Needed to support multiple reads/writes per cycle
  • Out-of-order execution: To keep the execution units busy despite hazards

Modern desktop and server CPUs are superscalar (Intel Haswell: 4-wide, AMD Zen: 6-wide). Simulating a superscalar is significantly more complex than scalar because you must model multiple instructions per cycle and their interactions.

17. How do you simulate branch prediction in a CPU simulator?

Branch prediction simulation models how a CPU predicts branch outcomes to avoid pipeline stalls. The simulator needs to track:

Branch History Table (BHT):

class BranchPredictor:
    def __init__(self, size=512):
        self.size = size
        # 2-bit saturating counters: 00=strong not-taken, 01=weak not-taken
        # 10=weak taken, 11=strong taken
        self.counters = [0] * size
def predict(self, pc):
    idx = pc % self.size
    return self.counters[idx] >= 2  # Taken if counter >= 2

def update(self, pc, taken):
    idx = pc % self.size
    if taken and self.counters[idx] < 3:
        self.counters[idx] += 1
    elif not taken and self.counters[idx] > 0:
        self.counters[idx] -= 1</code></pre>
<p><strong>Branch Target Buffer (BTB)</strong>: Caches the destination address of branches to enable faster fetch on subsequent executions.</p>
<p><strong>Prediction flow in simulator</strong>:</p>
<ol>
  <li>When fetching a branch instruction, check BTB for target address</li>
  <li>Use BHT to predict taken/not-taken</li>
  <li>If mispredicted (actual != predicted), flush pipeline and correct PC</li>
  <li>Update BHT with actual outcome</li>
</ol>
<p>More sophisticated predictors: local history (correlates with recent local branches), global history (correlates with overall branch behavior), tournament predictors (choose between local and global).</p>
18. What is the purpose of a pipeline register and how does it affect simulation speed?

Pipeline registers (also called pipeline latches or flip-flops) are storage elements placed between pipeline stages. They capture the output of one stage at the end of a cycle and hold it as input to the next stage at the beginning of the next cycle.

Pipeline registers enable:

  • Timing closure: Each stage has one cycle to complete its work
  • Clocking: All stages synchronize on the same clock edge
  • Isolation: Stages don't interfere with each other mid-cycle

In a cycle-accurate simulator, pipeline registers are modeled explicitly:

class PipelineRegister:
    def __init__(self, name):
        self.name = name
        self.value = None
def clock(self):
    # On clock edge, capture input
    pass

def update(self, new_value):
    # Set value for next cycle
    self.value = new_value</code></pre>
<p>Pipeline registers affect simulation speed because they determine how frequently the simulator must evaluate the pipeline. Deeply pipelined processors (20+ stages) require more frequent updates than simple ones. Cycle-accurate simulators update state every simulated cycle, which is slower than functional simulation but necessary for timing accuracy.</p>
19. How would you add support for exceptions and interrupts to the simulator?

Adding exception and interrupt support requires several additions:

Exception types:

  • Synchronous: Generated by instruction (undefined opcode, overflow, page fault)
  • Asynchronous: External events (timer interrupt, I/O completion)

Exception handling flow:

  1. Detect exception condition (invalid opcode, alignment fault, etc.)
  2. Save current PC to EPC (Exception Program Counter) register
  3. Set cause register with exception type
  4. Disable further interrupts
  5. Transfer to exception handler address (from vector table)

Interrupt controller:

class InterruptController:
    def __init__(self):
        self.pending = []
        self.enabled = True
def raise_interrupt(self, source):
    if self.enabled:
        self.pending.append(source)

def check_interrupts(self, cpu):
    if self.pending and self.enabled:
        # Highest priority interrupt
        irq = self.pending.pop(0)
        cpu.handle_interrupt(irq)</code></pre>
<p>RTI (Return from Interrupt) instruction restores saved state and re-enables interrupts. The key complexity is preserving enough state to resume execution after handling.</p>
20. What is the difference between soft float and hard float emulation in a simulator?

Hard float (hardware floating-point): The simulator leverages the host CPU's floating-point instructions. This is fast but may have subtle differences in rounding, exception handling, and NaN propagation compared to the target architecture.

Soft float: The simulator implements floating-point operations entirely in software. Each FP operation becomes a function call that emulates the operation using integer arithmetic. This is slow (100x slower than hard float) but bit-exact with the target architecture.

Soft float implementation:

def soft_float_add(a_bits, b_bits):
    # Extract sign, exponent, mantissa
    sign_a = (a_bits >> 31) & 1
    exp_a = (a_bits >> 23) & 0xFF
    mantissa_a = a_bits & 0x7FFFFF
# Perform addition in software
# Handle special cases (NaN, infinity, denormals)
# Rounding according to IEEE 754 rules
return result_bits</code></pre>
<p>When to use each:</p>
<ul>
  <li><strong>Soft float</strong>: When bit-exact behavior is required (testing, certification, debugging)</li>
  <li><strong>Hard float</strong>: When speed is critical and minor behavioral differences are acceptable</li>
</ul>
<p>gem5 simulator supports both, selectable via configuration. The RISC-V GNU toolchain supports both soft and hard float ABIs.</p>

Further Reading

Conclusion

Building a CPU simulator gives you concrete insight into how hardware actually works. The fetch-decode-execute cycle, register files, ALU operations, and memory access—all the abstract concepts from computer architecture become tangible when you implement them yourself.

The simulator we built demonstrates the essential concepts without getting bogged down in timing accuracy or cache modeling. It handles real instructions, executes branches correctly, and can run actual programs. Extensions like function calls, interrupts, and exception handling would turn this into a full system simulator.

Take the next step by exploring instruction set architecture to understand what instructions real processors provide, or assembly language basics to see how those instructions are actually written.

Category

Related Posts

ASLR & Stack Protection

Address Space Layout Randomization, stack canaries, and exploit mitigation techniques

#operating-systems #aslr-stack-protection #computer-science

Assembly Language Basics: Writing Code the CPU Understands

Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.

#operating-systems #assembly-language-basics #computer-science

Boolean Logic & Gates

Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.

#operating-systems #boolean-logic-gates #computer-science