Build Your Own OS

A hands-on project guide to building a minimal operating system from scratch: boot loader, kernel, scheduler, and file system.

Reading time: 26 min read | Author: GeekWorkBench

Introduction

Building your own operating system is the ultimate learning exercise in computer science. It forces you to understand what the “magic” of an OS actually does: how does a computer start? How does code run after power-on? How does the CPU switch between tasks? Every abstraction that seems “obvious” in modern operating systems becomes a concrete implementation choice you must make.

This is not an academic exercise—understanding OS fundamentals makes you a better systems programmer, debugger, and architect. Even if you never ship your OS, the mental model you’ll develop applies to kernel development, embedded systems, firmware, and understanding why things break in production.

When to Use / When Not to Use

Building an OS from scratch is appropriate when:

  • Deep learning is the goal — You want to understand every layer of the stack
  • Embedded constraints require minimal footprint — IoT devices with 64KB RAM
  • Security-critical systems — You need a TCB you can fully verify
  • Academic or personal projects — Self-directed learning with tangible outcomes

This approach is NOT appropriate when:

  • You need production-ready functionality — Use Linux, FreeRTOS, or Zephyr
  • Time constraints are tight — OS development takes months for even minimal features
  • Your team lacks low-level expertise — Missteps can brick hardware or cause data loss
  • Requirements are unclear — Start with requirements before architecture

Architecture or Flow Diagram

flowchart TB
    subgraph "Boot Phase"
        HW[Hardware Reset]
        BIOS[BIOS / UEFI Firmware]
        MBR[MBR Boot Sector<br/>512 bytes @ 0x7C00]
        BOOT[Bootloader Stage 2<br/>GRUB or custom]
    end

    subgraph "Protected Mode"
        PM[Protected Mode Entry<br/>32-bit GDT setup]
        KERNEL[Kernel Loader<br/>Parse multiboot header]
        KERNEL_MAIN[Kernel Main<br/>Initialize subsystems]
    end

    subgraph "Kernel Subsystems"
        GDT[GDT & IDT Setup]
        MEM[Memory Manager<br/>Paging / Physical alloc]
        IRQ[Interrupt Handler<br/>PIC / APIC setup]
        SCHED[Scheduler<br/>Context switching]
        FS[Simple File System<br/>FAT12 / custom]
        DRIVERS[Basic Drivers<br/>VGA, keyboard, ATA]
    end

    HW --> BIOS
    BIOS --> MBR
    MBR --> BOOT
    BOOT --> PM
    PM --> KERNEL
    KERNEL --> KERNEL_MAIN
    KERNEL_MAIN --> GDT
    GDT --> MEM
    MEM --> IRQ
    IRQ --> SCHED
    SCHED --> FS
    FS --> DRIVERS

    style KERNEL_MAIN stroke:#ff6b6b,stroke-width:3px
    style SCHED stroke:#ffa94d,stroke-width:3px

Core Concepts

Boot Loader (x86 BIOS/MBR)

The first sector (512 bytes) must end with the boot signature 0xAA55 (bytes 0x55, 0xAA at offsets 510 and 511) and must contain everything needed to load the second stage:

; boot.asm - Simple 16-bit boot sector
; Assemble: nasm -f bin boot.asm -o boot.bin

BITS 16

ORG 0x7C00        ; BIOS loads us here

start:
    ; Set up segments and a stack that grows down from 0x7C00
    cli
    cld                 ; lodsb in print_string must count upward
    xor ax, ax
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov sp, 0x7C00
    sti
    push dx             ; BIOS passes the boot drive in DL; save it

    ; Set video mode
    mov ah, 0x00
    mov al, 0x03    ; 80x25 color text
    int 0x10

    ; Print message
    mov si, msg
    call print_string

    ; Load kernel from disk
    pop dx          ; DL = boot drive saved above (0x00 floppy, 0x80 hard disk)
    mov ah, 0x02    ; BIOS disk read
    mov al, 10      ; Sectors to read
    mov ch, 0       ; Cylinder 0
    mov cl, 2       ; Start from sector 2
    mov dh, 0       ; Head 0
    mov bx, 0x1000  ; Load to ES:BX = 0x0000:0x1000 (ES is still 0)
    int 0x13
    jc disk_error

    ; Jump to the code just loaded at 0x0000:0x1000
    jmp 0x0000:0x1000

print_string:
    lodsb
    or al, al
    jz .done
    mov ah, 0x0E
    int 0x10
    jmp print_string
.done:
    ret

disk_error:
    mov si, err_msg
    call print_string
    jmp $

msg db 'Loading OS...', 0
err_msg db 'Disk read error!', 0

; Fill remainder with zeros and boot signature
times 510-($-$$) db 0
dw 0xAA55
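
The signature requirement can be checked on the host before an image is ever booted. The sketch below is a minimal illustration in C; the function name has_boot_signature is an assumption made for this example and is not part of the boot sector above.

```c
#include <stdint.h>

#define SECTOR_SIZE 512

/* Returns 1 if the 512-byte sector ends with the bytes 0x55, 0xAA
 * (the little-endian encoding of the word 0xAA55), otherwise 0. */
int has_boot_signature(const uint8_t sector[SECTOR_SIZE])
{
    return sector[510] == 0x55 && sector[511] == 0xAA;
}
```

A build script could read boot.bin into a buffer and refuse to continue if this check fails, which catches a misplaced times directive early.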

Protected Mode Transition

Switching from 16-bit real mode to 32-bit protected mode:

; switch_to_pm.asm - Enter protected mode

BITS 16

switch_to_protected_mode:
    ; Disable interrupts
    cli

    ; Load GDT
    lgdt [gdt_descriptor]

    ; Enable A20 line (access extra memory)
    in al, 0x92
    or al, 2
    out 0x92, al

    ; Set CR0 bit to enter protected mode
    mov eax, cr0
    or eax, 1
    mov cr0, eax

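    ; Far jump to flush the prefetch queue and load CS with the code selector.
    ; "dword" forces a 32-bit offset so the target may lie above 64 KB.
    jmp dword 0x08:flush_pm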

BITS 32

flush_pm:
    ; Set up data segments
    mov ax, 0x10    ; Data segment selector
    mov ds, ax
    mov ss, ax
    mov es, ax
    mov fs, ax
    mov gs, ax

    ; Set stack
    mov esp, 0x90000

    ; Jump to kernel C code
    extern kernel_main
    call kernel_main
    jmp $

; GDT
gdt_start:
    ; Null descriptor (required)
    dq 0

    ; Code segment descriptor
    dw 0xFFFF       ; Limit
    dw 0x0000       ; Base (low)
    db 0x00         ; Base (middle)
    db 10011010b    ; Access byte
    db 11001111b    ; Flags + Limit (high)
    db 0x00         ; Base (high)

    ; Data segment descriptor
    dw 0xFFFF
    dw 0x0000
    db 0x00
    db 10010010b
    db 11001111b
    db 0x00

gdt_end:

gdt_descriptor:
    dw gdt_end - gdt_start - 1
    dd gdt_start
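
The hand-written descriptor bytes above are easy to get wrong, so it can help to see how they are packed. The following C sketch builds the same 8-byte descriptor from its fields; gdt_encode is a hypothetical helper name, and the layout follows the standard x86 segment descriptor format.

```c
#include <stdint.h>

/* Pack base, 20-bit limit, access byte, and 4-bit flags (G, D/B, L, AVL)
 * into one 64-bit GDT descriptor. */
uint64_t gdt_encode(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFF);             /* limit bits 15:0  */
    d |= (uint64_t)(base & 0xFFFFFF) << 16;      /* base bits 23:0   */
    d |= (uint64_t)access << 40;                 /* access byte      */
    d |= (uint64_t)((limit >> 16) & 0xF) << 48;  /* limit bits 19:16 */
    d |= (uint64_t)(flags & 0xF) << 52;          /* flags nibble     */
    d |= (uint64_t)((base >> 24) & 0xFF) << 56;  /* base bits 31:24  */
    return d;
}
```

With base 0, limit 0xFFFFF, access 0x9A, and flags 0xC this yields 0x00CF9A000000FFFF, which matches the code segment bytes in the listing (0xFFFF, 0x0000, 0x00, 10011010b, 11001111b, 0x00) read as a little-endian quadword.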

Simple Kernel in C

A minimal kernel with basic output and memory management:

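/* kernel.c - Minimal 32-bit kernel */

#include <stddef.h>
#include <stdint.h>

#define VIDEO_MEMORY 0xB8000
#define VGA_COLS 80
#define VGA_ROWS 25
#define WHITE_ON_BLACK 0x0F

/* Next character cell to write, so successive prints do not overwrite */
static int cursor = 0;

/* VGA text mode functions */
void clear_screen(void)
{
    volatile uint16_t *video = (volatile uint16_t *)VIDEO_MEMORY;
    for (int i = 0; i < VGA_COLS * VGA_ROWS; i++) {
        video[i] = (WHITE_ON_BLACK << 8) | ' ';
    }
    cursor = 0;
}

void print_string(const char *str)
{
    volatile uint16_t *video = (volatile uint16_t *)VIDEO_MEMORY;
    for (int i = 0; str[i] != '\0'; i++) {
        if (str[i] == '\n') {
            /* Move to the start of the next row */
            cursor += VGA_COLS - (cursor % VGA_COLS);
        } else {
            video[cursor++] = (WHITE_ON_BLACK << 8) | (uint8_t)str[i];
        }
        if (cursor >= VGA_COLS * VGA_ROWS) {
            cursor = 0;  /* No scrolling yet: wrap to the top */
        }
    }
}

/* Basic memory management: a bump allocator that never frees.
 * The heap starts at 2 MB so it does not overlap the kernel image,
 * which kernel.ld places at 1 MB (0x100000). */
#define KERNEL_HEAP_START 0x200000
#define KERNEL_HEAP_SIZE 0x100000

static uintptr_t next_free = KERNEL_HEAP_START;

void *kmalloc(size_t size)
{
    /* Round up so every allocation stays 8-byte aligned */
    size = (size + 7) & ~(size_t)7;
    if (size > KERNEL_HEAP_START + KERNEL_HEAP_SIZE - next_free) {
        return NULL;  /* Out of memory; next_free is left unchanged */
    }
    void *ptr = (void *)next_free;
    next_free += size;
    return ptr;
}

/* Kernel main entry */
void kernel_main(void)
{
    clear_screen();
    print_string("Welcome to MyOS!\n");

    /* Test memory allocation */
    char *test = kmalloc(100);
    if (test == NULL) {
        print_string("kmalloc failed");
        return;
    }
    test[0] = 'H';
    test[1] = 'i';
    test[2] = '\0';
    print_string(test);
}

/* Placeholder interrupt handlers: hang so a fault is visible */
void isr0(void) { while(1); }
void isr1(void) { while(1); }
/* ... more ISRs ... */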
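
Each VGA text cell is a 16-bit value: the attribute byte in the high half and the character in the low half. The sketch below isolates that encoding and the row/column arithmetic so they can be tested on the host; vga_entry and vga_index are illustrative names, not functions from kernel.c.

```c
#include <stdint.h>

#define VGA_COLS 80

/* Combine a character and an attribute byte into one VGA text cell */
uint16_t vga_entry(char c, uint8_t attr)
{
    return (uint16_t)(((uint16_t)attr << 8) | (uint8_t)c);
}

/* Linear cell index for a (row, col) position on an 80-column screen */
int vga_index(int row, int col)
{
    return row * VGA_COLS + col;
}
```

Writing vga_entry('A', 0x0F) to cell vga_index(1, 0) would place a white-on-black 'A' at the start of the second row.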

Simple Round-Robin Scheduler

/* scheduler.c - Round-robin task scheduler */

#include <stdint.h>

#define MAX_TASKS 64
#define TASK_STACK_SIZE 4096

struct task {
    uint32_t id;
    uint32_t esp;          /* Stack pointer */
    uint32_t eip;          /* Instruction pointer */
    uint32_t priority;
    enum { TASK_RUNNING, TASK_READY, TASK_BLOCKED, TASK_DEAD } state;
    uint8_t stack[TASK_STACK_SIZE];
};

static struct task tasks[MAX_TASKS];
static int num_tasks = 0;
static int current_task = 0;

/* Implemented in assembly (not shown): saves the current stack pointer to
 * *old_esp, loads new_esp, and returns on the new stack. */
extern void switch_task(uint32_t *old_esp, uint32_t new_esp);

/* Write one byte to an I/O port */
static inline void outb(uint16_t port, uint8_t value)
{
    __asm__ volatile ("outb %0, %1" : : "a"(value), "Nd"(port));
}

/* TSS: with software switching only ss0/esp0 are used, to locate the
 * kernel stack when an interrupt arrives in ring 3 */
struct tss_entry {
    uint16_t link, :16;
    uint32_t esp0;
    uint16_t ss0, :16;
    uint32_t esp1, ss1, esp2, ss2;
    uint32_t cr3;
    uint32_t eip, eflags, eax, ecx, edx, ebx;
    uint32_t esp, ebp, esi, edi;
    uint16_t es, :16, cs, :16, ss, :16, ds, :16;
    uint16_t fs, :16, gs, :16, ldt, :16;
    uint16_t trap, iomap_base;   /* Completes the 104-byte layout */
} __attribute__((packed));

static struct tss_entry tss;

int create_task(void (*entry)(void), uint32_t priority)
{
    if (num_tasks >= MAX_TASKS) return -1;

    struct task *t = &tasks[num_tasks++];
    t->id = num_tasks;
    t->state = TASK_READY;
    t->priority = priority;

    /* Start at the top of the stack; the stack grows down */
    t->esp = (uint32_t)&t->stack[TASK_STACK_SIZE];

    /* Push the entry point as a fake return address. This frame must match
     * what switch_task pops; a version that also restores callee-saved
     * registers needs matching placeholder slots here. */
    t->esp -= sizeof(uint32_t);
    *(uint32_t *)t->esp = (uint32_t)entry;

    return t->id;
}

void schedule(void)
{
    if (num_tasks == 0) return;  /* Nothing to run; avoids modulo by zero */

    /* Find next ready task */
    int prev = current_task;
    do {
        current_task = (current_task + 1) % num_tasks;
    } while (tasks[current_task].state != TASK_READY
             && current_task != prev);

    if (current_task != prev) {
        tasks[prev].state = (tasks[prev].state == TASK_RUNNING)
                           ? TASK_READY : tasks[prev].state;
        tasks[current_task].state = TASK_RUNNING;
        switch_task(&tasks[prev].esp, tasks[current_task].esp);
    }
}

/* Called from timer interrupt (IRQ0) */
void timer_interrupt_handler(void)
{
    /* Acknowledge PIC before switching, or no further IRQ0 arrives */
    outb(0x20, 0x20);

    schedule();
}
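
The selection step of round-robin scheduling can be separated from the context switch and exercised on the host. The following is a minimal sketch under that assumption; pick_next and enum task_state are names chosen for this example, mirroring the states used in scheduler.c.

```c
enum task_state { TASK_RUNNING, TASK_READY, TASK_BLOCKED, TASK_DEAD };

/* Scan circularly from the slot after `current` and return the index of
 * the first READY task. Returns `current` if no other task is ready,
 * or -1 if there are no tasks at all. */
int pick_next(const enum task_state *states, int count, int current)
{
    if (count <= 0) return -1;
    for (int step = 1; step < count; step++) {
        int candidate = (current + step) % count;
        if (states[candidate] == TASK_READY) return candidate;
    }
    return current;
}
```

Keeping this logic pure makes it straightforward to test fairness properties without booting the kernel; the caller decides whether a switch is needed by comparing the result with the current index.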

Simple FAT12 File System

/* fs_fat12.c - Minimal FAT12 file system */

#include <stdint.h>

#define SECTOR_SIZE 512
#define FAT_START 1
#define ROOT_START 19
#define DATA_START 33

struct fat12_boot {
    uint8_t jmp[3];
    uint8_t OEM[8];
    uint16_t bytes_per_sector;
    uint8_t sectors_per_cluster;
    uint16_t reserved_sectors;
    uint8_t num_fats;
    uint16_t root_entries;
    uint16_t total_sectors;
    uint8_t media_type;
    uint16_t sectors_per_fat;
    uint16_t sectors_per_track;
    uint16_t num_heads;
    uint32_t hidden_sectors;
    uint32_t large_sector_count;
    uint8_t drive_number;
    uint8_t reserved;
    uint8_t boot_signature;
    uint32_t volume_id;
    uint8_t volume_label[11];
    uint8_t fs_type[8];
} __attribute__((packed));

struct fat12_entry {
    uint8_t filename[8];
    uint8_t ext[3];
    uint8_t attributes;
    uint8_t reserved[10];
    uint16_t time;
    uint16_t date;
    uint16_t start_cluster;
    uint32_t file_size;
} __attribute__((packed));

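/* Geometry of a standard 1.44 MB floppy: 9 sectors per FAT copy */
#define FAT12_SECTORS 9
#define FAT12_MAX_SIZE (FAT12_SECTORS * SECTOR_SIZE)

/* Disk driver hook (not shown): read `count` sectors starting at `lba` */
extern void read_sectors(uint32_t lba, uint32_t count, uint8_t *dst);

/* Read a file's cluster chain from FAT12 (assumes 1 sector per cluster) */
int read_file(uint16_t cluster, uint8_t *buffer, int max_size)
{
    int bytes_read = 0;
    static uint8_t fat[FAT12_MAX_SIZE];  /* static: too large for a kernel stack */

    /* Read the first FAT copy into memory */
    read_sectors(FAT_START, FAT12_SECTORS, fat);

    /* 0x002-0xFEF are data clusters; 0xFF7 is bad, 0xFF8+ is end-of-chain */
    while (cluster >= 2 && cluster < 0xFF0) {
        /* Stop before a whole-sector read would overrun the buffer */
        if (bytes_read + SECTOR_SIZE > max_size) break;

        /* Read data cluster (cluster numbering starts at 2) */
        uint32_t data_sector = DATA_START + (cluster - 2);
        read_sectors(data_sector, 1, buffer + bytes_read);
        bytes_read += SECTOR_SIZE;

        /* Each FAT12 entry is 12 bits: entry n starts at byte n + n/2 */
        uint16_t fat_index = cluster + (cluster / 2);
        if (fat_index + 1 >= FAT12_MAX_SIZE) break;  /* Corrupt chain */
        uint16_t pair = fat[fat_index] | (fat[fat_index + 1] << 8);

        /* Odd entries use the high 12 bits, even entries the low 12 bits */
        cluster = (cluster & 1) ? (pair >> 4) : (pair & 0x0FFF);
    }

    return bytes_read;
}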
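
Directory entries store names in the space-padded 8.3 form held by the filename and ext fields of struct fat12_entry. A lookup routine usually needs to turn that into a conventional string first. The sketch below shows one way to do it; fat_format_name is a hypothetical helper, and it assumes names without embedded spaces.

```c
#include <stdint.h>

/* Convert a space-padded 8.3 name ("KERNEL  " + "BIN") into a
 * NUL-terminated string ("KERNEL.BIN"). `out` must hold 13 bytes:
 * 8 name characters, a dot, 3 extension characters, and the NUL. */
void fat_format_name(const uint8_t name[8], const uint8_t ext[3], char out[13])
{
    int n = 0;
    for (int i = 0; i < 8 && name[i] != ' '; i++) {
        out[n++] = (char)name[i];
    }
    if (ext[0] != ' ') {
        out[n++] = '.';
        for (int i = 0; i < 3 && ext[i] != ' '; i++) {
            out[n++] = (char)ext[i];
        }
    }
    out[n] = '\0';
}
```

A root directory scan could call this on each entry and compare the result with the requested file name before passing start_cluster to read_file.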

Production Failure Scenarios

Scenario 1: Stack Overflow Corrupts Memory

Problem: Kernel stack is too small or grows into other regions, causing mysterious crashes.

Mitigation:

  • Define proper stack sizes with guard pages
  • Use __attribute__((interrupt)) for ISR functions (handles stack frame)
  • Check stack pointer against bounds before function calls
  • For recursive code, ensure termination conditions are always met

Scenario 2: Triple Fault During Protected Mode Transition

Problem: CPU triple-faults (reset) during mode transition, making debugging difficult.

Mitigation:

  • Ensure GDT is correctly aligned and loaded
  • Verify CS descriptor has proper access byte (0x9A for code)
  • Write progress markers to the serial port before each risky step, so the last marker shows where the fault occurred
  • Build with debugging symbols and attach GDB to QEMU's GDB stub (-s -S)

Scenario 3: Paging Enabled Incorrectly

Problem: Enabling paging with incorrect page tables causes immediate crash.

Mitigation:

  • Identity-map the kernel first (virtual == physical) before enabling
  • Set CR3 to point to valid page directory
  • Enable paging with mov cr0, eax followed immediately by a jmp to flush pipeline
  • Use QEMU with -d int to trace interrupts and exceptions during boot

Trade-off Table

Component         | Bare Metal                  | Emulator (QEMU)         | Real Hardware
Development Speed | Slowest (flash/reset cycle) | Fast (snapshot/restore) | Slow
Debugging         | Limited (serial output)     | Full GDB support        | Limited
Risk              | Can brick device            | None                    | Potential for damage
Realism           | True hardware               | May differ from real HW | Perfect
CI/CD             | Difficult                   | Easy                    | Difficult

Implementation Snippet: Linking the Kernel

/* kernel.ld - Linker script for bare-metal kernel */

OUTPUT_FORMAT("elf32-i386")
ENTRY(kernel_main)

SECTIONS
{
    . = 0x100000;  /* Kernel load address */

    .text : {
        *(.text .text.*)
        *(.rodata .rodata.*)
    }

    .data : {
        *(.data .data.*)
    }

    .bss : {
        *(.bss .bss.*)
        *(COMMON)
    }

    /DISCARD/ : {
        *(.comment)
        *(.eh_frame)
    }
}

Building and running:

#!/bin/bash
# build.sh - Build and run MyOS

set -e

# Assemble boot sector
nasm -f bin boot.asm -o boot.bin

# Assemble protected mode switch
nasm -f elf32 switch_to_pm.asm -o switch_to_pm.o

# Compile kernel and scheduler
gcc -m32 -ffreestanding -O2 -c kernel.c -o kernel.o
gcc -m32 -ffreestanding -O2 -c scheduler.c -o scheduler.o

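# Link the kernel objects. boot.bin is a flat binary, not an object file,
# so it is written to the disk image below instead of being linked.
ld -T kernel.ld -m elf_i386 switch_to_pm.o kernel.o scheduler.o -o kernel.bin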

# Create floppy image with FAT12
mkdir -p tmp
dd if=/dev/zero of=myos.img bs=512 count=2880
./mkfat.sh myos.img  # Create FAT12 filesystem
dd if=boot.bin of=myos.img conv=notrunc

# Run in QEMU
qemu-system-i386 -fda myos.img -display curses
# Or with GDB debugging:
# qemu-system-i386 -fda myos.img -s -S &
# gdb -ex 'target remote localhost:1234' -ex 'break kernel_main'

Observability Checklist

For OS development, instrument these early:

  • Serial output — The most reliable debug channel before VGA/text is set up
  • QEMU logging: -d int logs interrupts and exceptions; -D log.txt writes the log to a file
  • GDB stub — QEMU’s -gdb tcp::1234 enables full source-level debugging
  • QMP (QEMU Machine Protocol) — Programmatic control and inspection
  • Build logs: make V=1 to see all command invocations during build
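
Serial output on a PC-compatible UART is configured by writing a clock divisor, which is 115200 divided by the desired baud rate. The sketch below computes that divisor on the host; uart_divisor is a name invented for this example, and the port writes that would load the value into the UART are deliberately left out.

```c
#include <stdint.h>

#define UART_MAX_BAUD 115200u

/* Returns the 16-bit divisor for the requested baud rate, or 0 if the
 * rate is zero, too fast, not an exact divisor of 115200, or would need
 * a divisor that does not fit in 16 bits. */
uint16_t uart_divisor(uint32_t baud)
{
    if (baud == 0 || baud > UART_MAX_BAUD) return 0;
    if (UART_MAX_BAUD % baud != 0) return 0;
    if (UART_MAX_BAUD / baud > 0xFFFFu) return 0;
    return (uint16_t)(UART_MAX_BAUD / baud);
}
```

For example, 38400 baud gives a divisor of 3 and 9600 baud gives 12.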

Security Considerations

  • UEFI secure boot — Custom OS may not boot on new hardware without signing
  • DMA attacks — Without IOMMU, any device can access all memory
  • No ASLR — Simple kernels typically don’t implement address space layout randomization
  • No userspace isolation — Bugs in kernel code affect entire system

Common Pitfalls / Anti-patterns

  1. Starting without a bootable environment — Set up QEMU + GDB first, verify it works
  2. Skipping protected mode — 16-bit real mode limitations will cripple your kernel
  3. Not handling interrupts early — Unhandled interrupts cause immediate triple-fault
  4. Assuming hardware behaves correctly — Test on real hardware early; emulators differ
  5. Not using a linker script — You’ll spend days debugging random crashes without proper memory layout

Quick Recap Checklist

  • Boot process: BIOS/UEFI -> MBR -> Bootloader -> Kernel
  • 16-bit real mode transitions to 32-bit protected mode via GDT
  • VFS abstraction separates file system implementation from system calls
  • A simple scheduler needs: task structures, context switching, timer interrupts
  • Use QEMU for development; test on real hardware for validation
  • Debug output early and often—serial console is your friend
  • Start minimal: get to “Hello World” in protected mode before adding features

Real-World Case Study: xv6 Educational OS

MIT’s xv6 is a teaching OS derived from Unix v6, rewritten for modern x86. It demonstrates:

  1. Minimal but complete - ~8,000 lines covering process management, scheduler, syscalls, file system
  2. Pedagogical clarity - Each component is small enough to understand in isolation
  3. Modernized implementation - While inspired by v6, xv6 uses modern structures like spinlocks and a proper file system layer

Studying xv6 before building your own OS provides a reference implementation that shows how components interact without the complexity of production kernels.

Advanced Topic: UEFI Boot Process

Modern systems boot via UEFI (Unified Extensible Firmware Interface) instead of legacy BIOS:

UEFI advantages:

  • Support for GPT partitioned disks (no MBR limitations)
  • Driver support (can access disks before loading OS)
  • Secure Boot (verifies bootloader signatures)
  • Network boot via PXE
  • Modular shell and applications

UEFI boot sequence:

  1. Power-on → UEFI initialization → Load bootloader from ESP (EFI System Partition)
  2. Bootloader (GRUB2, systemd-boot) loads kernel with UEFI runtime services
  3. Kernel calls ExitBootServices() to release firmware control
  4. OS has full control of hardware

Your OS can be built for UEFI by: (1) building the loader as a PE32+ EFI application with an efi_main entry point, (2) linking it with a UEFI toolchain such as GNU-EFI, (3) placing the resulting .efi file at EFI/BOOT/BOOTX64.EFI on the ESP.

Interview Questions

1. What is a multiboot header and why does it matter for Multiboot-compliant bootloaders?

The multiboot header is a small structure that GRUB (and other Multiboot-compliant bootloaders) look for in the first 8KB of a kernel image, aligned on a 4-byte boundary. Its flags tell the bootloader what the kernel needs: page-aligned modules, memory information in the boot information structure, and a preferred video mode. If your kernel has this header, GRUB can boot it without custom boot sector code. The header contains the magic number 0x1BADB002, a flags word, and a checksum chosen so that magic + flags + checksum is zero modulo 2^32 (for flags 0x00000003 the checksum is 0xE4524FFB). GRUB validates these fields before transferring control.
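
The Multiboot (v1) checksum is defined so that magic, flags, and checksum sum to zero in 32-bit arithmetic. The sketch below computes and validates it; the function names are assumptions for this example, while the magic value comes from the Multiboot specification.

```c
#include <stdint.h>

#define MULTIBOOT_HEADER_MAGIC 0x1BADB002u

/* Checksum that makes magic + flags + checksum wrap to zero */
uint32_t multiboot_checksum(uint32_t flags)
{
    return (uint32_t)(0u - (MULTIBOOT_HEADER_MAGIC + flags));
}

/* Returns 1 if the three header fields form a valid Multiboot v1 header */
int multiboot_header_valid(uint32_t magic, uint32_t flags, uint32_t checksum)
{
    return magic == MULTIBOOT_HEADER_MAGIC
        && (uint32_t)(magic + flags + checksum) == 0;
}
```

With flags set to 0x00000003 (page-align modules and request memory information) the checksum works out to 0xE4524FFB.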

2. What is the difference between BIOS interrupt 0x13 and accessing ATA/IDE directly?

BIOS int 0x13 is a legacy BIOS service for disk I/O. It works in real mode with 16-bit addressing and has limited capabilities (no DMA, limited sector counts). Once you switch to protected mode, BIOS interrupts are unavailable. ATA/IDE registers (I/O ports 0x1F0-0x1F7, 0x3F6-0x3F7) can be accessed directly from protected mode, enabling DMA transfers, LBA48 addressing (>137GB), and full speed. Modern systems use AHCI for SATA: a memory-mapped controller interface located through a PCI base address register rather than fixed I/O ports, with native command queuing and MSI interrupts.
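
The classic int 0x13 read call addresses sectors by cylinder, head, and sector (CHS), whereas file systems and ATA drivers usually think in linear block addresses (LBA). The conversion can be sketched as follows; struct chs and lba_to_chs are illustrative names, and the geometry is passed in by the caller.

```c
#include <stdint.h>

struct chs {
    uint16_t cylinder;
    uint8_t head;
    uint8_t sector;   /* 1-based, as the BIOS expects */
};

/* Convert a linear block address to CHS for the given disk geometry */
struct chs lba_to_chs(uint32_t lba, uint32_t heads, uint32_t sectors_per_track)
{
    struct chs r;
    r.cylinder = (uint16_t)(lba / (heads * sectors_per_track));
    r.head = (uint8_t)((lba / sectors_per_track) % heads);
    r.sector = (uint8_t)((lba % sectors_per_track) + 1);
    return r;
}
```

For a 1.44 MB floppy (2 heads, 18 sectors per track), LBA 0 is cylinder 0, head 0, sector 1, and LBA 18 is the first sector on head 1.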

3. How does hardware context switching differ from software context switching?

Hardware context switching uses the Task State Segment (TSS) and `CALL`/`JMP` to a task gate—the CPU automatically saves the full task state (registers, selectors) when switching. This is slow and rarely used in modern OSes. Software context switching (what Linux and most OSes use) manually saves only the necessary registers, stack pointer, and program counter in a struct, then loads the next task's state. It's faster and more flexible. The `sysenter`/`syscall` and `iret` instructions are optimized for user/kernel transitions, not full task switches.

4. What is the purpose of the Global Descriptor Table (GDT) in x86 protected mode?

The GDT defines memory segments and their access privileges. Each descriptor specifies: base address (32-bit), limit (20-bit, describing segment size), access rights (present, privilege level, type, direction). The CPU uses selectors (offsets into the GDT) to reference segments. Unlike real mode's fixed segment registers, protected mode descriptors are software-defined and can enforce privilege levels (kernel ring 0 vs user ring 3). A minimal GDT needs: null descriptor, kernel code segment (0x08), kernel data segment (0x10).
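
A selector packs three fields into 16 bits: the descriptor index, a table indicator (0 for GDT, 1 for LDT), and the requested privilege level. The sketch below unpacks them; the struct and function names are chosen for this example only.

```c
#include <stdint.h>

struct selector_fields {
    unsigned index;   /* Descriptor index within the table */
    unsigned table;   /* 0 = GDT, 1 = LDT */
    unsigned rpl;     /* Requested privilege level, 0-3 */
};

/* Split a segment selector into its three fields */
struct selector_fields decode_selector(uint16_t selector)
{
    struct selector_fields f;
    f.index = selector >> 3;
    f.table = (selector >> 2) & 1u;
    f.rpl = selector & 3u;
    return f;
}
```

This is why the kernel code and data selectors are 0x08 and 0x10: they are GDT entries 1 and 2 with privilege level 0, and each descriptor is 8 bytes wide.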

5. What causes a "triple fault" and how do you debug it?

A triple fault occurs when the CPU encounters an exception while trying to call the exception handler for a double fault (which itself is triggered by another exception). This typically happens during boot when an interrupt or exception occurs before the IDT (Interrupt Descriptor Table) is properly set up, or when the GDT/IDT is misconfigured. To debug: (1) boot in an emulator like QEMU with GDB stub enabled, (2) set breakpoints before the problematic code, (3) use QEMU's `-d int` flag to log interrupts and exceptions, (4) ensure you have exception handlers in place before enabling interrupts.

6. What is the difference between real mode and protected mode in x86?

Real mode is the CPU's original state after reset—16-bit addressing with segment:offset pairs, allowing access to only 1MB of memory (20-bit addresses). There's no memory protection, no privilege levels, and BIOS calls are available for hardware access. Protected mode enables: (1) 32-bit addressing (up to 4GB), (2) memory protection via privilege levels (ring 0-3), (3) paging for virtual memory, (4) hardware task switching. Operating systems switch to protected mode early in boot to access more memory and enable protection.
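
Real-mode address formation is simple enough to verify by hand. The following sketch computes the linear address from a segment:offset pair; real_mode_linear is a hypothetical name used only here.

```c
#include <stdint.h>

/* Linear address produced by a real-mode segment:offset pair */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

Both 0x0000:0x7C00 and 0x07C0:0x0000 name the boot sector's load address 0x7C00, which is why boot code sets its segment registers explicitly instead of trusting the BIOS. The highest pair, 0xFFFF:0xFFFF, reaches 0x10FFEF, slightly above 1MB, and that region is only reachable when the A20 line is enabled.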

7. What is the role of the Interrupt Descriptor Table (IDT)?

The IDT tells the CPU where to go when an interrupt or exception occurs. Each entry (gate descriptor) contains: (1) the offset of the handler function, (2) the code segment selector to use, (3) flags indicating privilege level required and whether it's an interrupt gate (clears IF) or trap gate (doesn't). When an interrupt fires, the CPU validates the descriptor, switches to the specified privilege level if needed, and jumps to the handler. Setting up the IDT is one of the first things a kernel does after entering protected mode—unhandled interrupts cause triple faults.
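
A 32-bit IDT gate splits the handler offset across two 16-bit fields. The sketch below fills one gate; struct idt_gate and idt_make_gate are names assumed for this example, and the packed attribute is GCC/Clang syntax as used elsewhere in this guide.

```c
#include <stdint.h>

struct idt_gate {
    uint16_t offset_low;    /* Handler address bits 15:0 */
    uint16_t selector;      /* Code segment selector */
    uint8_t zero;           /* Reserved, must be 0 */
    uint8_t type_attr;      /* Present bit, DPL, gate type */
    uint16_t offset_high;   /* Handler address bits 31:16 */
} __attribute__((packed));

/* Build a gate for `handler`, reached through `selector`.
 * type_attr 0x8E = present, ring 0, 32-bit interrupt gate;
 * 0x8F = the same but a trap gate. */
struct idt_gate idt_make_gate(uint32_t handler, uint16_t selector, uint8_t type_attr)
{
    struct idt_gate g;
    g.offset_low = (uint16_t)(handler & 0xFFFF);
    g.selector = selector;
    g.zero = 0;
    g.type_attr = type_attr;
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
```

A kernel would fill a 256-entry array of these and load its address and size with lidt before enabling interrupts.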

8. What is paging and why do operating systems use it?

Paging divides memory into fixed-size pages (typically 4KB) and uses page tables to map virtual addresses to physical addresses. Benefits: (1) Isolation: each process has its own page tables, preventing unauthorized access to other processes' memory; (2) Virtual memory: pages can be swapped to disk when not in use, allowing more processes than physical RAM; (3) Protection: hardware enforces present, read/write, and user/supervisor bits per page, plus no-execute where the CPU supports NX; (4) Flexibility: code can assume contiguous virtual addresses while physical memory is fragmented. The TLB (Translation Lookaside Buffer) caches recent translations for speed.
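
With 32-bit two-level paging and 4KB pages, a virtual address splits into a 10-bit page directory index, a 10-bit page table index, and a 12-bit offset. The sketch below performs that split; struct va_parts and split_va are illustrative names.

```c
#include <stdint.h>

struct va_parts {
    uint32_t dir_index;    /* Bits 31:22, selects the page directory entry */
    uint32_t table_index;  /* Bits 21:12, selects the page table entry */
    uint32_t offset;       /* Bits 11:0, byte offset within the page */
};

/* Split a 32-bit virtual address for two-level 4 KB paging */
struct va_parts split_va(uint32_t va)
{
    struct va_parts p;
    p.dir_index = va >> 22;
    p.table_index = (va >> 12) & 0x3FFu;
    p.offset = va & 0xFFFu;
    return p;
}
```

For the kernel load address 0x00100000 the directory index is 0 and the table index is 0x100; a higher-half address such as 0xC0000000 lands in directory entry 0x300.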

9. How does a bootloader hand over control to the kernel?

The bootloader (GRUB or custom) loads the kernel image into memory, sets up basic state (registers, stack), and jumps to the kernel's entry point. With GRUB and multiboot: (1) GRUB loads the kernel to the address given by its ELF program headers or by the address fields of the multiboot header, (2) puts the CPU in 32-bit protected mode with flat segments, paging disabled, and interrupts disabled, (3) sets EAX to the magic value 0x2BADB002, (4) passes the physical address of the multiboot info structure (memory map, modules, etc.) in EBX. The kernel then loads its own GDT (the specification does not guarantee that the bootloader's GDT remains usable), sets up page tables, enables paging, and proceeds with subsystem initialization. This handoff follows a well-defined contract documented in the multiboot specification.

10. What is the purpose of a linker script in OS development?

A linker script specifies how sections from input object files are arranged in the output binary and where the kernel loads in memory. Without a custom linker script, the linker uses defaults that may place code and data at addresses incompatible with your OS's expectations. A typical linker script for an OS specifies: (1) output format (e.g., elf32-i386), (2) entry point address, (3) memory layout (which addresses are valid), (4) section placement (`.text` at 0x100000). This is critical because early kernel code is position-dependent: it must run at the exact addresses it was linked for before the MMU is fully set up.

11. How does segmentation differ from paging in x86 protected mode?

Segmentation: Each memory reference uses a segment selector (in GDT/LDT) + offset. The selector indexes into descriptor table to get base address, limit, and access rights. Segmentation provides flat or hierarchical memory models, privilege levels (ring 0-3), and access control. In real mode, segments overlap (segment << 4 + offset); in protected mode, descriptors define arbitrary base/limit.

Paging: All addresses go through page table translation. Linear address → page directory/table → physical address. Provides virtual memory, process isolation, page-level protection (r/w/x bits), and can simulate segmentation (by identity-mapping large pages). Most modern OSes use paging exclusively, with segments providing privilege levels only.

Linux uses segmentation minimally: flat segments (base=0, limit=4GB) for both kernel and user mode. This simplifies memory management while relying on paging for protection.

12. What is the APIC and how does it differ from the 8259 PIC for interrupt handling?

8259 PIC (Programmable Interrupt Controller): 8 IRQ lines per chip, with two chips cascaded as master/slave (15 usable IRQs, since one master input carries the cascade). Uses edge-triggered or level-triggered modes. Configured via I/O ports 0x20/0xA0. Limited to 15 devices, requires interrupt acknowledgement via EOI command.

APIC (Advanced PIC): Designed for SMP systems. Each CPU has a local APIC; I/O APIC routes interrupts to specific CPUs. Supports 24+ IRQ lines, interrupt affinity (route IRQ to specific cores), and more sophisticated priority handling. APIC is memory-mapped (typically at 0xFEE00000). Modern systems use APIC even for single-core for compatibility.

OS developers must handle both if supporting legacy hardware: switch from PIC to APIC by masking all PIC interrupts, programming APIC, then enabling local APIC. The transition is complex but necessary for multi-core support.

13. What is the SMP boot process and how do you bring up additional CPUs?

SMP boot sequence:

  1. BIOS/UEFI detects CPUs and configures MP tables or ACPI tables (MADT, DSDT)
  2. Bootstrap processor (BSP) starts first, others are Application Processors (APs)
  3. OS reads ACPI tables to find APIC entries for each CPU
  4. OS sends INIT IPI to all APs, then Startup IPI with trampoline code address
  5. APs jump to trampoline code in low memory (below 1MB), enter protected mode
  6. APs set up their own GDT, IDT, page tables, then jump to kernel's CPU initialization
  7. OS creates per-CPU data structures (task state segments, per-CPU variables via segment registers)

The trampoline code is critical — APs need real-mode entry point before paging is enabled on them. Once AP is in kernel, it can join the scheduler's run queue.
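
The Startup IPI carries an 8-bit vector that the AP turns into its real-mode start address (vector × 4KB), so the trampoline must be page-aligned and below 1MB. The sketch below derives the vector and the low word of the local APIC interrupt command register; the function names are assumptions for this example, and the write to the APIC itself is not shown.

```c
#include <stdint.h>

#define ICR_DELIVERY_STARTUP 0x00000600u  /* Delivery mode: Start-up */

/* Returns the SIPI vector for a trampoline at `phys`, or -1 if the
 * address is not 4 KB aligned or not below 1 MB. */
int sipi_vector(uint32_t phys)
{
    if (phys & 0xFFFu) return -1;
    if (phys >= 0x100000u) return -1;
    return (int)(phys >> 12);
}

/* Low 32 bits of the ICR for a Startup IPI, or 0 on an invalid address */
uint32_t sipi_icr_low(uint32_t phys)
{
    int vector = sipi_vector(phys);
    if (vector < 0) return 0;
    return ICR_DELIVERY_STARTUP | (uint32_t)vector;
}
```

A trampoline copied to physical 0x8000 therefore uses vector 0x08, and the AP begins executing at 0x0800:0x0000 in real mode.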

14. How does a system call transition from user mode to kernel mode on x86?

System call mechanism on x86:

  1. User application calls library wrapper (e.g., read(fd, buf, count) in libc)
  2. Library puts syscall number in eax, args in ebx, ecx, edx, then executes int 0x80 (legacy), sysenter (32-bit fast path), or syscall (x86-64)
  3. CPU checks CPL (Current Privilege Level) — if user mode (ring 3), allows transition to kernel (ring 0)
  4. CPU looks up gate descriptor in IDT (vector 0x80 maps to kernel entry point)
  5. CPU switches to kernel stack (SS0:ESP0 from the TSS) and pushes the user SS, ESP, EFLAGS, CS, and EIP; a software interrupt such as int 0x80 pushes no error code
  6. CPU loads kernel's code segment selector into CS, jumps to kernel syscall handler
  7. Kernel validates parameters, executes syscall, returns via iret or sysret

On x86-64, Linux enters the kernel with syscall and returns with sysret where possible, falling back to the slower iret path when it must.
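
Once the CPU has entered the kernel, the handler typically indexes a table of function pointers with the syscall number. The following is a minimal sketch of that dispatch step; every name in it (syscall_dispatch, sys_getpid, sys_add, ERR_NOSYS) is hypothetical, and the two handlers are stand-ins rather than real system calls.

```c
#include <stddef.h>
#include <stdint.h>

#define ERR_NOSYS (-38)  /* Hypothetical "no such syscall" error code */

typedef int32_t (*syscall_fn)(uint32_t a, uint32_t b, uint32_t c);

/* Stand-in handlers for illustration only */
static int32_t sys_getpid(uint32_t a, uint32_t b, uint32_t c)
{
    (void)a; (void)b; (void)c;
    return 1;
}

static int32_t sys_add(uint32_t a, uint32_t b, uint32_t c)
{
    (void)c;
    return (int32_t)(a + b);
}

static const syscall_fn syscall_table[] = { sys_getpid, sys_add };

#define SYSCALL_COUNT (sizeof(syscall_table) / sizeof(syscall_table[0]))

/* Validate the number taken from eax before using it as an index */
int32_t syscall_dispatch(uint32_t number, uint32_t a, uint32_t b, uint32_t c)
{
    if (number >= SYSCALL_COUNT) return ERR_NOSYS;
    return syscall_table[number](a, b, c);
}
```

The bounds check matters because the number comes from user space: an unchecked index would let a program jump through arbitrary kernel memory.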

15. What is the structure and purpose of the Task State Segment (TSS)?

TSS structure: Fixed-size (104 bytes on 32-bit x86) segment containing:

  • I/O map base address (where I/O permission bitmap starts)
  • SS0, ESP0: Stack selector and pointer for ring 0 (kernel stack)
  • SS1, ESP1, SS2, ESP2: Stacks for rings 1-2 (usually unused)
  • CR3: Page directory base, loaded by the CPU on a hardware task switch
  • EIP, EFLAGS, EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI: Saved registers for hardware task switching
  • ES, CS, SS, DS, FS, GS, LDTR: Segment selectors

Purpose: When an interrupt occurs in user mode, the CPU uses the TSS's ESP0 to switch to the kernel stack before calling the interrupt handler. The OS sets TSS.SS0 and TSS.ESP0 to point to the kernel stack for the current process, and updates TSS.ESP0 on every context switch so that the next ring 3 to ring 0 transition lands on the incoming task's kernel stack.

16. How does GRUB's stage 1, stage 1.5, and stage 2 boot process work?

GRUB boot stages:

  1. Stage 1: boot code that must fit in the first 446 bytes of the 512-byte MBR. It is very limited and only knows how to load stage 1.5 or stage 2 from a known disk location
  2. Stage 1.5: Lives in the sectors immediately after MBR (typically 30KB). Contains filesystem drivers (e2fs, fat, iso9660) so GRUB can understand boot filesystem structure. Loads stage 2 from /boot/grub/
  3. Stage 2: The actual GRUB menu and runtime — reads menu.lst/grub.cfg, displays boot menu, loads kernel. Can be anywhere on the filesystem

GRUB2 drops the numbered stages: boot.img takes the place of stage 1, and core.img takes the place of stage 1.5, embedded in the gap after the MBR or in a dedicated BIOS boot partition on GPT disks. The evolution reflects the need for better filesystem support and Secure Boot compatibility.

17. What is the difference between multiboot and multiboot2 specifications?

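Multiboot (original): Defined by the GRUB developers. The header must lie within the first 8192 bytes of the kernel image, aligned on a 4-byte boundary:

  • Magic: 0x1BADB002
  • Flags: bit 0 = align modules on page boundaries, bit 1 = provide memory information, bit 2 = provide video mode information, bit 16 = the address fields below are valid
  • Checksum: -(magic + flags)
  • header_addr, load_addr, load_end_addr, bss_end_addr, entry_addr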

Multiboot2: Extensible, supports more architectures (including x86_64):

  • Magic: 0xE85250D6 (different from v1)
  • Header format uses tags (type-length-value) instead of fixed offsets
  • Adds support for: framebuffer information, ELF section headers, network boot info, ACPI tables
  • GRUB2 supports both; simpler kernels often implement only v1

18. What is the VFS (Virtual File System) layer and how does it enable multiple filesystems?

VFS is an abstraction layer that provides a common interface to filesystem implementations. It defines:

  • struct super_block: Mounted filesystem metadata (block size, operations, root inode)
  • struct inode: Represents file/directory (mode, size, operations, block pointers)
  • struct dentry: Parsed path component (name, parent, inode reference)
  • struct file: Open file instance (position, operations)

Concrete filesystems (ext4, FAT, NTFS, network filesystems) implement these structures. When you access a path, VFS: (1) resolves path through dentry cache, (2) calls inode operations for read/write/create, (3) buffers results through page cache. Adding a new filesystem only requires implementing VFS operations — existing code works without modification.
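
The core of the VFS idea is a table of function pointers that each filesystem fills in. The sketch below reduces it to a single read operation backed by an in-memory buffer; the types and names (struct file_ops, ramfs_read, vfs_read) are simplified stand-ins invented for this example and are not the Linux definitions.

```c
#include <stddef.h>
#include <string.h>

struct file;

/* Operations a filesystem provides for its open files */
struct file_ops {
    size_t (*read)(struct file *f, void *buf, size_t len);
};

/* Open file instance: which ops to use, plus position and backing data */
struct file {
    const struct file_ops *ops;
    size_t pos;
    const void *data;
    size_t size;
};

/* A trivial in-memory "filesystem": copy from the backing buffer */
static size_t ramfs_read(struct file *f, void *buf, size_t len)
{
    size_t left = f->size - f->pos;
    if (len > left) len = left;
    memcpy(buf, (const char *)f->data + f->pos, len);
    f->pos += len;
    return len;
}

const struct file_ops ramfs_ops = { ramfs_read };

/* Generic entry point: callers never see which filesystem is behind it */
size_t vfs_read(struct file *f, void *buf, size_t len)
{
    if (f == NULL || f->ops == NULL || f->ops->read == NULL) return 0;
    return f->ops->read(f, buf, len);
}
```

Supporting FAT12 alongside this would mean adding a second file_ops instance whose read walks the cluster chain; vfs_read and its callers stay unchanged.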

19. How does identity paging work and why is it used during kernel initialization?

Identity paging: Virtual address equals physical address (1:1 mapping). Page 0x1000 maps to physical 0x1000.

Usage during kernel init:

  1. Bootloader loads kernel at physical address 0x100000 (1MB)
  2. Before enabling paging, kernel sets up page tables where entry 0x100000 maps to 0x100000 and entry 0x0 maps to 0x0 (identity)
  3. Kernel enables paging with CR0.PG bit
  4. First instruction after enabling paging must be a jump to flush instruction prefetch buffer
  5. Because of identity mapping, kernel can still access its own code/data at the same addresses
  6. After full initialization, kernel often switches to virtual addresses (higher half kernel)

Identity paging is a bootstrap technique — it allows the same code to work before and after paging is enabled without complex address translation during the critical boot transition.
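
Filling a page table for an identity map is a short loop: entry i points at physical frame i. The sketch below covers the first 4MB with one table; the function and macro names are assumptions for this example, and loading CR3 and setting CR0.PG are not shown.

```c
#include <stdint.h>

#define PAGE_PRESENT  0x1u
#define PAGE_WRITABLE 0x2u
#define ENTRIES_PER_TABLE 1024

/* Map virtual pages 0..1023 to the same physical frames (first 4 MB) */
void identity_map_first_4mb(uint32_t table[ENTRIES_PER_TABLE])
{
    for (uint32_t i = 0; i < ENTRIES_PER_TABLE; i++) {
        table[i] = (i * 4096u) | PAGE_PRESENT | PAGE_WRITABLE;
    }
}
```

In a real kernel the table must be 4KB-aligned, and page directory entry 0 must point at it with the same present and writable bits. Entry 0x100 then covers the kernel's load address 0x100000.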

20. What is the role of the PIC (Programmable Interrupt Controller) and how do you configure it in protected mode?

PIC configuration in protected mode:

  1. Remap IRQ vectors: the BIOS default maps IRQ0-7 to vectors 0x08-0x0F and IRQ8-15 to 0x70-0x77, but in protected mode vectors 0x00-0x1F are reserved for CPU exceptions. Reprogram the PIC to avoid the conflict
  2. Write ICW1 (0x11) to port 0x20/0xA0 to start initialization sequence
  3. Write ICW2 to set base vector (e.g., 0x20 for master, 0x28 for slave)
  4. Write ICW3 to configure cascading (master IRQ2 connected to slave)
  5. Write ICW4 for 8086 mode
  6. Write OCW1 to unmask only used IRQ lines
  7. Send EOI to PIC after handling each interrupt

The 8259 PIC is edge-triggered by default on PC hardware, limited to 15 IRQ lines, and doesn't support multi-core interrupt routing. Modern systems use APIC instead, but PIC compatibility mode must be disabled during APIC setup. If you skip PIC remapping, timer and keyboard interrupts (IRQ0, IRQ1) arrive on vectors 0x08 and 0x09, where the CPU also reports exceptions such as the double fault, so the kernel cannot tell them apart and usually crashes.
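
The initialization steps above amount to a fixed sequence of port writes. The sketch below generates that sequence as data so it can be inspected on the host; struct port_write and pic_remap_sequence are names invented for this example, and a kernel would replay each entry with its own outb routine, often with a short I/O delay between writes.

```c
#include <stdint.h>

#define PIC1_CMD  0x20
#define PIC1_DATA 0x21
#define PIC2_CMD  0xA0
#define PIC2_DATA 0xA1

struct port_write {
    uint16_t port;
    uint8_t value;
};

/* Fill `out` with the 8 writes that remap the master and slave PIC to
 * the given vector bases. Returns the number of entries written. */
int pic_remap_sequence(uint8_t master_base, uint8_t slave_base,
                       struct port_write out[8])
{
    int n = 0;
    out[n++] = (struct port_write){ PIC1_CMD, 0x11 };         /* ICW1: init, expect ICW4 */
    out[n++] = (struct port_write){ PIC2_CMD, 0x11 };
    out[n++] = (struct port_write){ PIC1_DATA, master_base }; /* ICW2: vector base */
    out[n++] = (struct port_write){ PIC2_DATA, slave_base };
    out[n++] = (struct port_write){ PIC1_DATA, 0x04 };        /* ICW3: slave on IRQ2 */
    out[n++] = (struct port_write){ PIC2_DATA, 0x02 };        /* ICW3: cascade identity 2 */
    out[n++] = (struct port_write){ PIC1_DATA, 0x01 };        /* ICW4: 8086 mode */
    out[n++] = (struct port_write){ PIC2_DATA, 0x01 };
    return n;
}
```

Calling it with bases 0x20 and 0x28 places IRQ0-15 on vectors 0x20-0x2F, clear of the CPU exception range. The interrupt masks (OCW1) are written to the data ports afterwards.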

Conclusion

Building your own OS provides the deepest understanding of how computers actually work—from the moment power is applied through boot loader execution, protected mode transition, and into kernel initialization. The boot process (BIOS/UEFI -> MBR -> bootloader -> kernel) requires understanding real-mode addressing, GDT setup, and protected mode entry. A minimal kernel needs memory management (physical allocation or paging), interrupt handling (IDT setup, PIC/APIC), and a scheduler with context switching capability.

QEMU with GDB debugging transforms OS development from blind guessing into systematic investigation. Serial console output provides reliable debugging before VGA/text initialization. Start minimal: get to “Hello World” in protected mode before adding features, and debug output early and often.

For continued learning, explore advanced boot concepts (UEFI, multiboot2), virtual memory with page tables and TLB management, user space process creation with fork/exec, and system call interface design. Consider implementing a simple shell, filesystem (FAT12 or a custom minimal FS), and basic device drivers as next steps beyond the minimal kernel.

