Memory Allocation: Kernel Allocators, Slab, Buddy System

How the Linux kernel allocates memory internally — from the slab allocator and buddy system to memory zones and the subtle differences between kmalloc and vmalloc.

Reading time: 33 min · Author: GeekWorkBench

User-space programs call malloc and free without ever thinking about who fulfills the request. The kernel faces the same problem: it allocates and frees memory constantly, for objects as small as a few bytes, for structures such as task_struct (a few KB), and for buffers of many MB. But the kernel cannot use the same allocator as user programs. It runs in a privileged context where an unexpected page fault can be fatal, where most of its memory is neither swappable nor movable (so fragmentation is hard to undo), and where allocation latency directly affects system performance.

Linux solves this with a layered allocator architecture: the buddy system at the low end handles physical page allocation, and the slab allocator sits above it to satisfy small, frequent kernel allocations efficiently.

Introduction

When to Use / When Not to Use

Kernel memory allocators are used exclusively by the kernel and kernel modules — user-space code does not call kmalloc or alloc_pages(). However, understanding these allocators helps in several scenarios:

When this knowledge is relevant:

  • Writing Linux kernel modules or device drivers
  • Debugging kernel OOM conditions or memory leaks
  • Reading kernel crash dumps (oops/panic output showing slab cache names)
  • Performance tuning with vmstat, /proc/slabinfo, slabtop
  • Operating system course projects

When to use user-space equivalents:

  • malloc/free for C applications in user space
  • mmap, brk for larger or memory-mapped allocations
  • High-level language allocators (Go’s make/new, Python’s garbage collector) for managed runtime languages

Architecture or Flow Diagram

The kernel’s memory allocation stack flows from high-level requests (kmalloc, vmalloc) down to the buddy system’s physical page allocator, with per-CPU caches and memory zones in between.

flowchart TD
    KMALLOC["kmalloc()<br/>Bytes up to KMALLOC_MAX_SIZE, physically contiguous"]
    VMALLOC["vmalloc()<br/>Arbitrary size, non-contiguous"]
    SLAB["Slab Allocator<br/>Object caches (task_struct, etc.)"]
    SLUB["SLUB Allocator<br/>Default Linux slab implementation"]
    SLOB["SLOB Allocator<br/>Embedded/small system allocator"]
    PAGE["alloc_pages()<br/>Buddy System<br/>Physical page allocation"]
    ZONES["Memory Zones<br/>DMA, Normal, HighMem"]
    PHYMEM["Physical Memory<br/>DRAM"]

    KMALLOC --> SLUB
    KMALLOC -->|"small allocations"| SLAB
    VMALLOC --> PAGE
    SLUB --> SLAB
    SLAB --> PAGE
    PAGE --> ZONES
    ZONES --> PHYMEM

    style KMALLOC stroke:#ff00ff,stroke-width:2px
    style VMALLOC stroke:#000,stroke-width:1px
    style PAGE stroke:#ff00ff,stroke-width:2px

The DMA zone (first 16 MB on x86) exists for devices that cannot address full physical memory. On 32-bit x86, the Normal zone (16 MB to 896 MB) is directly mapped, and the HighMem zone (above 896 MB) requires explicit mapping before use. 64-bit kernels have no HighMem; x86-64 instead adds a DMA32 zone for devices limited to 32-bit addresses.

Core Concepts

The Buddy System Allocator

The buddy system is the foundational physical page allocator in Linux. It maintains one free list per order, where a block of order k is 2^k contiguous pages: order 0 = 1 page (4 KB), order 1 = 2 pages (8 KB), order 2 = 4 pages (16 KB), up to order 10 by default (1024 pages, or 4 MB with 4 KB pages).

When a request for n pages comes in:

  1. Round n up to the next power of two; the exponent is the “order”
  2. Check the free list for that order
  3. If a block is available, remove it from the list and return it
  4. If not, take a block from the next higher order that has one and split it in half repeatedly, putting each unused half on the free list one order lower, until a block of the requested order remains
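The rounding in step 1 can be sketched in user-space C. This is an illustration only: `order_for_pages` and `MAX_ORDER_SKETCH` are hypothetical names, and the maximum order of 10 is an assumption matching the default described above.

```c
#include <assert.h>

#define MAX_ORDER_SKETCH 10   /* assumption: largest block is 2^10 pages */

/* Smallest order whose block (1 << order pages) covers the request.
 * Returns -1 when the request is larger than the largest block. */
static int order_for_pages(unsigned long pages)
{
    int order = 0;

    assert(pages > 0);   /* a zero-page request is a caller bug */
    while (order <= MAX_ORDER_SKETCH && (1UL << order) < pages)
        order++;
    return order <= MAX_ORDER_SKETCH ? order : -1;
}
```

A request for 3 pages maps to order 2 (4 pages), so one page of the block goes unused by the caller.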

The “buddy” is the adjacent half of the split block — when a block is freed, the allocator checks if its buddy is also free and, if so, coalesces them back into a larger block. This coalescing is what gives the buddy system its name and its resistance to external fragmentation.
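Because blocks are aligned to their own size, a block's buddy can be found by flipping a single bit of its page frame number (PFN). The sketch below shows that arithmetic in user-space C; the function names are hypothetical, and the code only computes addresses, it does not manage free lists.

```c
#include <assert.h>

/* PFN of the buddy of the block that starts at pfn and has the given order. */
static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
    /* A block of order k always starts at a multiple of 2^k pages. */
    assert((pfn & ((1UL << order) - 1)) == 0);
    return pfn ^ (1UL << order);
}

/* Start PFN of the order + 1 block formed when the two buddies merge. */
static unsigned long merged_pfn(unsigned long pfn, unsigned int order)
{
    return pfn & ~(1UL << order);
}
```

For an order-2 block at PFN 12, the buddy is at PFN 8, and the merged order-3 block starts at PFN 8.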

The buddy system operates at the page level. Requests for arbitrary byte counts (like kmalloc) cannot be served directly by the buddy system — they need an intermediate allocator that carves page-sized chunks into smaller objects.

Slab Allocator: Motivations

The kernel allocates and frees many objects of the same type repeatedly: task_struct when a process is created, struct file when a file is opened, struct dentry for each directory entry, struct buffer_head for block I/O buffers. Allocating and freeing these through the buddy system would be prohibitively expensive:

  • Fragmentation: The smallest unit is a page, so a 2 KB object placed in its own 4 KB page wastes half of it, and multi-page requests are rounded up to the next power of two
  • Cache effects: Page-granular allocation gives no control over how an object is placed relative to cache lines, and a freshly allocated page is unlikely to be cache-hot; a slab cache can align objects and hand back recently freed ones
  • Latency: A buddy allocation may have to take the zone lock, split larger blocks, or trigger reclaim

The slab allocator solves this by maintaining per-type caches (called slab caches). Each cache holds objects of one type. When a cache needs objects, it obtains pages from the buddy system and carves them into equal-sized objects. When an object is freed, it is returned to the cache (not the buddy system) — the next allocation of the same type reuses the cached object without touching the buddy system.
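The reuse behavior described above can be illustrated with a small user-space object cache. This is a sketch, not the kernel's implementation: `malloc` stands in for the buddy system, one 4 KB slab is assumed, and all names (`obj_cache`, `cache_alloc`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLAB_BYTES 4096u   /* assumption: one 4 KB page backs the cache */

struct free_obj {
    struct free_obj *next;
};

struct obj_cache {
    size_t obj_size;        /* object size, rounded up to pointer size */
    struct free_obj *free;  /* LIFO list of free objects */
    void *slab;             /* backing memory, stands in for a buddy page */
};

/* Carve one slab into equal-sized objects and put them on the free list. */
static int cache_init(struct obj_cache *c, size_t obj_size)
{
    size_t ptr = sizeof(struct free_obj);
    size_t i;

    if (obj_size == 0 || obj_size > SLAB_BYTES)
        return -1;
    /* Each free object must be able to hold the list pointer. */
    c->obj_size = (obj_size + ptr - 1) / ptr * ptr;
    c->slab = malloc(SLAB_BYTES);
    if (!c->slab)
        return -1;
    c->free = NULL;
    for (i = SLAB_BYTES / c->obj_size; i-- > 0; ) {
        struct free_obj *o =
            (struct free_obj *)((char *)c->slab + i * c->obj_size);
        o->next = c->free;
        c->free = o;
    }
    return 0;
}

/* Pop a free object; returns NULL when the slab is exhausted. */
static void *cache_alloc(struct obj_cache *c)
{
    struct free_obj *o = c->free;

    if (o)
        c->free = o->next;
    return o;
}

/* Push the object back on the free list; the backing page is not released. */
static void cache_free(struct obj_cache *c, void *p)
{
    struct free_obj *o = p;

    assert(p != NULL);
    o->next = c->free;
    c->free = o;
}

static void cache_destroy(struct obj_cache *c)
{
    free(c->slab);
    c->slab = NULL;
    c->free = NULL;
}
```

Freeing an object and allocating again returns the same address, which is the property that keeps the common allocation path away from the page allocator. A real slab allocator adds more slabs on demand, per-CPU lists, and locking, all of which this sketch leaves out.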

SLUB: The Default Slab Allocator

Linux has had three slab implementations:

  • SLAB (original): Per-CPU and per-node object queues and good debugging support, but heavy metadata overhead on large systems; removed in Linux 6.8
  • SLUB (Unqueued Slab Allocator, default since 2.6.23): Simplified, better performance, excellent for large NUMA systems
  • SLOB (Simple List of Blocks): Minimal allocator for embedded systems with very limited memory; removed in Linux 6.4

SLUB is the default on desktop and server kernels, and since Linux 6.8 it is the only implementation left in mainline. It drops SLAB's object queues and keeps slab metadata in the page struct, which reduces overhead. It is NUMA-aware and keeps per-node lists of partial slabs.

Key SLUB concepts:

  • Partial slabs: A page with some objects allocated and some free
  • Per-CPU freelists: Each CPU has a private list of free objects, eliminating locking for the common case
  • Object alignment: Caches created with SLAB_HWCACHE_ALIGN align objects to cache lines, which reduces false sharing at the cost of some padding
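When a cache asks for cache-line alignment, each object is padded up to a multiple of the line size, which changes how many objects fit in a slab. The arithmetic can be sketched as follows; the 64-byte line size and the helper names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64u   /* assumption: 64-byte cache lines */

/* Round size up to the next multiple of align (a power of two). */
static size_t align_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Number of cache-line-aligned objects that fit in a slab. */
static size_t objs_per_slab(size_t slab_bytes, size_t obj_size)
{
    assert(obj_size > 0);
    return slab_bytes / align_up(obj_size, CACHE_LINE);
}
```

A 100-byte object is padded to 128 bytes, so a 4 KB slab holds 32 of them instead of the 40 that would fit unaligned. That padding is the price of keeping two objects from sharing a cache line.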

kmalloc vs vmalloc

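kmalloc() allocates from the kernel’s direct-mapped linear address range (low memory on 32-bit x86). Addresses returned by kmalloc are contiguous in both physical and virtual memory. Its maximum size is KMALLOC_MAX_SIZE: older kernels and much existing documentation cite 128 KB (32 pages, an order-5 block with 4 KB pages), while current x86 kernels typically allow up to 4 MB, although large requests become increasingly likely to fail as memory fragments. It is fast: small requests are served from size-specific slab caches.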

vmalloc() allocates from the virtual address space reserved for vmalloc (VMALLOC_START to VMALLOC_END on x86-64). The allocated regions are contiguous in virtual address space but may be fragmented across multiple non-contiguous physical pages. This is necessary for allocating large buffers (multi-page) that do not need physical contiguity.

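| Property | kmalloc | vmalloc |
| --- | --- | --- |
| Physical memory | Contiguous | Non-contiguous (scattered pages) |
| Virtual address space | Contiguous | Contiguous |
| Maximum allocation | KMALLOC_MAX_SIZE (historically 128 KB; typically 4 MB on current x86) | Limited by the vmalloc area (128 MB by default on 32-bit x86; 32 TB on x86-64 with 4-level paging) |
| Latency | Low | Higher (page table updates needed) |
| Use case | Small, frequent kernel objects | Large buffers (module code, large I/O) |
| Atomic context | Usable with GFP_ATOMIC | Not usable (may sleep) |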

Memory Zones

Linux divides physical memory into zones, each serving different purposes:

ZONE_DMA (first 16 MB on x86): Required for ISA DMA, that is, legacy devices that can only address the first 16 MB of RAM. The zone is small, so it is a scarce resource and should be requested (with GFP_DMA) only when a device really needs it.

ZONE_NORMAL (16 MB to 896 MB on 32-bit x86): Directly mapped to the kernel’s linear address space. Allocations here are the fastest — no special mapping needed. Most kernel allocations come from this zone.

ZONE_HIGHMEM (above 896 MB on 32-bit x86): Not directly mapped. Pages must be explicitly mapped (using kmap/kunmap) before the kernel can access them. 64-bit systems typically have no HIGHMEM because they can map the entire physical address space.
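The 32-bit x86 boundaries above amount to a simple classification of physical addresses. The sketch below encodes them in user-space C; the enum and function names are hypothetical, and the 64 GB upper bound assumes PAE.

```c
#include <assert.h>
#include <stdint.h>

enum zone32 { ZONE32_DMA, ZONE32_NORMAL, ZONE32_HIGHMEM };

#define MB(x) ((uint64_t)(x) << 20)

/* Classify a physical address using the 32-bit x86 zone boundaries. */
static enum zone32 zone_for_phys(uint64_t phys)
{
    assert(phys < ((uint64_t)64 << 30));   /* assumption: PAE, at most 64 GB */
    if (phys < MB(16))
        return ZONE32_DMA;
    if (phys < MB(896))
        return ZONE32_NORMAL;
    return ZONE32_HIGHMEM;
}
```

An address at 15 MB falls in the DMA zone, one at 512 MB in Normal, and anything at or above 896 MB in HighMem.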

On NUMA systems, each node has its own set of zones. The kernel prefers allocating from the node local to the CPU making the request, falling back to remote nodes when local memory is exhausted.

Production Failure Scenarios

Kernel Memory Leak (kmalloc without free)

Failure: A kernel module allocates memory via kmalloc or kzalloc and never frees it before unloading. Over time, leaked memory accumulates, reducing the amount available for legitimate kernel allocations. Eventually the kernel’s memory allocator exhausts available pages, triggering the OOM killer — which may kill user-space processes unpredictably.

Mitigation: Always pair allocations with frees, and pair kmap() with kunmap(). Clean up in the module_exit hook and in the error paths of module_init. During development, enable CONFIG_DEBUG_KMEMLEAK and use kmemleak: echo scan > /sys/kernel/debug/kmemleak triggers a scan, and cat /sys/kernel/debug/kmemleak lists suspected leaks. Run slabtop to watch for slab caches that grow without bound.

Slab Fragmentation Under Heavy Module Loading

Failure: Loading and unloading many kernel modules of different sizes creates slab fragmentation — many partially-filled slab pages with objects of one type, while other object types have full slabs. Physical memory becomes inefficiently used despite reasonable overall free memory counts.

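Mitigation: Use the slabinfo tool from the kernel source tree to analyze slab utilization. Consider keeping modules loaded rather than repeatedly loading and unloading them. In /proc/slabinfo, compare active_objs with num_objs for each cache: a cache whose active count stays far below its total is holding many partially filled slabs. SLOB is not a remedy here; it traded speed and fragmentation for a small footprint and was removed in Linux 6.4.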

vmalloc Exhaustion

Failure: A driver or module requests a very large vmalloc allocation (e.g., for a frame buffer or scatter-gather buffer). The vmalloc area has a limited size: 128 MB by default on 32-bit x86, and tens of terabytes on x86-64, where physical memory usually runs out first. Exhausting it causes vmalloc to fail, returning NULL. The calling code may not check for NULL, leading to a NULL pointer dereference.

Mitigation: Check all vmalloc return values. On 32-bit x86, the vmalloc area can be enlarged with the vmalloc= kernel command-line parameter (for example vmalloc=256M), at the cost of directly mapped low memory. If the buffer must be physically contiguous, use alloc_pages() (the buddy system) rather than vmalloc, keeping in mind that high-order requests can fail on a fragmented system.

Trade-off Table

| Allocator Aspect | Buddy System | Slab/SLUB | vmalloc |
| --- | --- | --- | --- |
| Allocation unit | Power-of-2 pages | Individual objects (bytes to KB) | Virtual pages |
| Physical contiguity | Always guaranteed | Always (via kmalloc) | Not guaranteed |
| Internal fragmentation | Up to 50% per allocation | Low (per-object size matching) | Low |
| Allocation latency | Moderate | Very low (per-CPU caches) | Higher (page table setup) |
| Can allocate from interrupt context | Yes (with GFP_ATOMIC) | Yes (with GFP_ATOMIC) | No (may sleep) |
| Suitable for | Page-level requests | Frequent small kernel objects | Large buffers, module memory |
| NUMA awareness | Partial | Full (SLUB) | Yes |
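The "up to 50%" internal fragmentation figure for the buddy system follows from rounding every request up to a power-of-two number of pages. A small sketch of that arithmetic, assuming 4 KB pages and using hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096u   /* assumption: 4 KB pages */

/* Size of the power-of-two block a buddy-style allocator would hand out. */
static size_t buddy_block_bytes(size_t bytes)
{
    size_t block = PAGE_BYTES;

    /* Keep the sketch within a range where the shift cannot overflow. */
    assert(bytes > 0 && bytes <= ((size_t)1 << 30));
    while (block < bytes)
        block <<= 1;
    return block;
}

/* Bytes allocated but not requested. */
static size_t buddy_waste_bytes(size_t bytes)
{
    return buddy_block_bytes(bytes) - bytes;
}
```

A request one byte larger than a page (4097 bytes) receives 8192 bytes and wastes 4095 of them, just under half. A 5-page request (20480 bytes) receives 8 pages and wastes 12288 bytes.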

Implementation Snippets

Kernel Module with Proper Allocation/Free (C)

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/mm.h>        /* kvmalloc, kvfree (declared here on older kernels) */
#include <linux/slab.h>      /* kzalloc, kfree */

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Example");
MODULE_DESCRIPTION("Memory allocation example: paired allocation and free");

struct my_data {
    unsigned long id;
    char name[64];
    void *buffer;
};

/* Module parameter: buffer size in KB (read-only after load) */
static unsigned int buffer_size_kb = 64;
module_param(buffer_size_kb, uint, 0444);
MODULE_PARM_DESC(buffer_size_kb, "Buffer size in KB");

/* Kept at file scope so the exit handler can free what init allocated */
static struct my_data *data;

static int __init my_module_init(void)
{
    size_t size;

    /* Reject zero and anything above 1 GB (an arbitrary cap for this example) */
    if (buffer_size_kb == 0 || buffer_size_kb > 1024 * 1024)
        return -EINVAL;
    size = (size_t)buffer_size_kb * 1024;

    pr_info("my_module: loading, buffer_size=%u KB\n", buffer_size_kb);

    /* kzalloc = kmalloc + zeroing, so id, name and buffer all start as 0 */
    data = kzalloc(sizeof(*data), GFP_KERNEL);
    if (!data)
        return -ENOMEM;

    /* kvmalloc tries kmalloc first and falls back to vmalloc when a
     * physically contiguous block is not available. Use it only when the
     * buffer does not need physical contiguity (for example, no DMA). */
    data->buffer = kvmalloc(size, GFP_KERNEL);
    if (!data->buffer) {
        pr_err("my_module: buffer allocation failed\n");
        kfree(data);
        data = NULL;
        return -ENOMEM;
    }

    pr_info("my_module: allocated %u KB buffer\n", buffer_size_kb);
    return 0; /* success */
}

static void __exit my_module_exit(void)
{
    /* Free in reverse order of allocation. kvfree handles both
     * kmalloc-backed and vmalloc-backed pointers. */
    pr_info("my_module: unloading\n");
    if (data) {
        kvfree(data->buffer);
        kfree(data);
        data = NULL;
    }
}

module_init(my_module_init);
module_exit(my_module_exit);

/* GFP_KERNEL: may sleep; for process context
 * GFP_ATOMIC: never sleeps; for interrupt and other atomic contexts
 * GFP_USER: for allocations made on behalf of user space
 * __GFP_DMA: restrict the allocation to ZONE_DMA
 * __GFP_HIGHMEM: allow (not force) pages from HighMem
 */

Inspecting Kernel Slab Caches (bash)

#!/bin/bash
# Show the largest slab caches by approximate memory consumption.
# Reading /proc/slabinfo usually requires root privileges.

echo "=== Top 15 slab caches by approximate memory usage ==="
# Skip the two header lines; size is num_objs ($3) * objsize ($4).
awk 'NR > 2 { printf "%-28s %12.1f KB\n", $1, $3 * $4 / 1024 }' /proc/slabinfo 2>/dev/null |
    sort -k2,2 -n -r |
    head -15

echo ""
echo "=== Detailed view of specific caches ==="
echo "--- task_struct cache ---"
grep "^task_struct " /proc/slabinfo

echo "--- buffer_head cache ---"
grep "^buffer_head " /proc/slabinfo

echo ""
echo "=== Using slabtop (live view, needs root) ==="
echo "Run 'slabtop -o' to display once, or 'slabtop' for live updates"

Observability Checklist

  • Slab cache statistics: /proc/slabinfo or slabtop -o — shows active objects, num objects, object size per cache
  • Kernel memory overall: cat /proc/meminfo | grep -E "MemFree|MemAvailable|Slab|Cached|Active|Inactive"
  • kmalloc failures: These typically appear in dmesg as "page allocation failure: order:N" followed by a stack trace; for leak tracing, enable CONFIG_DEBUG_KMEMLEAK
  • OOM killer for kernel memory: dmesg | grep -i "out of memory\|oom" | tail
  • buddy info per zone: cat /proc/buddyinfo — shows free pages per zone per NUMA node for each order
  • vmalloc usage: cat /proc/vmallocinfo — shows all active vmalloc allocations (useful for tracking down a large vmalloc consumer)
  • Per-NUMA node memory: numactl --hardware or cat /sys/devices/system/node/node*/meminfo
  • High memory usage (32-bit only): cat /proc/meminfo | grep High
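Each line of /proc/buddyinfo lists, for one zone, the number of free blocks at order 0, 1, 2 and so on. Turning a line into a total free page count means weighting each column by 2^order. The following is a user-space sketch of that calculation; the function name is hypothetical and the parser assumes the usual "Node N, zone NAME counts..." layout.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Sum the free pages described by one /proc/buddyinfo line.
 * Returns -1 if the line does not look like a buddyinfo line. */
static long buddyinfo_free_pages(const char *line)
{
    int node, off = 0;
    char zone[32];
    unsigned int order = 0;
    long total = 0;
    const char *p;
    char *end;

    assert(line != NULL);
    if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &off) != 2)
        return -1;

    p = line + off;
    while (order < 20) {               /* real lines have about 11 columns */
        unsigned long count = strtoul(p, &end, 10);

        if (end == p)
            break;                     /* no more numbers on this line */
        total += (long)(count << order);   /* count blocks of 2^order pages */
        order++;
        p = end;
    }
    return order ? total : -1;
}
```

A zone that shows plenty of free pages in total but zeros in the high-order columns is fragmented: large physically contiguous requests can fail there even though memory is free.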

Common Pitfalls / Anti-Patterns

Use-After-Free (UAF) in Kernel Modules: UAF occurs when kernel code frees an object (returns it to the slab cache) but a dangling pointer in another part of the code still references it. A subsequent allocation of the same type can reuse that memory, and the original code can read/write the new object’s data, corrupting kernel state. UAF in the kernel is more severe than in user space — it can lead to privilege escalation. Mitigations include:

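  • UAF detectors: KASAN (Kernel Address Sanitizer) detects use-after-free at runtime by marking freed memory as inaccessible in shadow memory and checking accesses against it
  • CONFIG_SLAB_FREELIST_HARDENED: Obfuscates the freelist pointers stored in free objects, which makes freelist corruption harder to exploit; CONFIG_SLAB_FREELIST_RANDOM additionally randomizes the order of objects in a new slab
  • Lockdep: Does not detect UAF itself, but catches the locking mistakes behind many of the races that lead to it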

Unaccounted allocations on user-triggered paths: If user space can make the kernel allocate memory without bound (one object per open file, socket, or similar), it can drain kernel memory and cause denial of service. Such allocations should be charged to the caller's cgroup with __GFP_ACCOUNT (for example GFP_KERNEL_ACCOUNT) so that the memory controller's limits contain them.

Speculative execution leaks (Spectre/Meltdown) and kernel memory: The Meltdown exploit (CVE-2017-5754) allowed user-space code to read kernel memory by exploiting speculative execution. The fix was KPTI (Kernel Page Table Isolation), which keeps separate page tables for user and kernel mode so that kernel memory is not mapped while user code runs; the cost is extra overhead on every system call and interrupt. Spectre variant 2 is a separate issue, mitigated with retpolines and microcode features such as IBRS.

Pitfall: Using vmalloc in interrupt context. vmalloc may sleep: it allocates pages and page tables in a way that can block under memory pressure. Calling it from an interrupt handler, softirq, or any other context where sleeping is forbidden is a bug that can deadlock the system or trigger a "scheduling while atomic" report. Use kmalloc with GFP_ATOMIC for memory that must be allocated from atomic context.

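Pitfall: Confusing kmalloc’s GFP flags. GFP_KERNEL can sleep, so it is appropriate for process context allocations where waiting is acceptable. GFP_ATOMIC cannot sleep, so it is appropriate for interrupt context, softirq, or any atomic section. Using GFP_KERNEL from atomic context is a bug: it may appear to work until the allocation actually has to sleep, at which point the kernel can deadlock or report "BUG: sleeping function called from invalid context" (when CONFIG_DEBUG_ATOMIC_SLEEP is enabled). GFP_ATOMIC allocations fail rather than wait for page reclaim, so heavy use makes failures more likely when memory is tight; always check the result.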

Pitfall: Not accounting for slab cache overhead in cgroup memory limits. Accounted kernel memory (including slab allocations) counts against the cgroup’s memory limit; cgroup v1 reports it separately in memory.kmem.usage_in_bytes, and cgroup v2 reports it in the slab fields of memory.stat. A container with memory.limit_in_bytes=512m may still have its slab grow to 200 MB, leaving only 312 MB for actual process pages. There is no per-cache object limit to fall back on, so size the limit with kernel memory in mind and watch the slab counters.

Anti-pattern: Allocating with GFP_HIGHUSER without understanding HighMem. On 32-bit systems, GFP_HIGHUSER may allocate from HighMem (above 896 MB). Accessing HighMem pages requires kmap(), which has a performance cost. On 64-bit systems, HighMem does not exist, so __GFP_HIGHMEM has no effect and GFP_HIGHUSER behaves like GFP_USER. Use GFP_HIGHUSER only for pages the kernel itself rarely needs to touch, such as user data pages.

Quick Recap Checklist

  • The kernel memory allocator is layered: SLUB → page allocator (buddy system) → memory zones
  • The buddy system allocates physical pages in power-of-two sizes (orders); adjacent free blocks (buddies) can be coalesced when freed
  • Slab allocators (SLUB is the default) maintain per-type object caches above the buddy system, reducing fragmentation and allocation latency for frequent small allocations
  • kmalloc allocates from the direct-mapped kernel address range (physically and virtually contiguous), up to KMALLOC_MAX_SIZE (historically 128 KB, typically 4 MB on current x86 kernels)
  • vmalloc allocates from the vmalloc region (virtually contiguous but physically fragmented), suitable for large buffers
  • Memory zones (DMA, Normal, HighMem on 32-bit x86; DMA, DMA32, Normal on x86-64) organize physical memory by capability and mapping requirements
  • kmalloc with GFP_KERNEL can sleep (process context); GFP_ATOMIC cannot sleep (interrupt context)
  • Kernel memory leaks from modules accumulate in slab caches — use kmemleak to detect them
  • UAF vulnerabilities in kernel code are severe — use KASAN during development for detection
  • Tools: /proc/slabinfo (slab stats), /proc/buddyinfo (buddy free lists per zone), /proc/vmallocinfo (vmalloc usage)

Interview Questions

1. What is the difference between the buddy system and the slab allocator?

The buddy system operates at the page level. It manages physical memory in power-of-two page blocks (orders). When a request for 3 pages comes in, the buddy system rounds up to 4 pages (order 2) and allocates from the order-2 free list. If no order-2 block exists, it splits a larger block (order 3) in half, placing one half on the order-2 list and using the other half. The buddy system guarantees physically contiguous pages and coalesces adjacent free blocks on free.
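The splitting walk-through can be modeled by tracking only how many free blocks exist at each order. This user-space sketch uses hypothetical names and assumes orders 0 through 10; it models the bookkeeping, not real page addresses.

```c
#include <assert.h>

#define NR_ORDERS 11   /* assumption: orders 0..10 */

struct free_area_sketch {
    unsigned long nr_free[NR_ORDERS];   /* free blocks per order */
};

/* Take one block of the requested order, splitting a larger one if needed.
 * Returns the order the block was taken from, or -1 if none is large enough. */
static int buddy_alloc(struct free_area_sketch *fa, unsigned int order)
{
    unsigned int cur, o;

    assert(order < NR_ORDERS);
    for (cur = order; cur < NR_ORDERS; cur++) {
        if (fa->nr_free[cur] == 0)
            continue;
        fa->nr_free[cur]--;
        /* Each split leaves one unused half on the next lower list. */
        for (o = cur; o > order; o--)
            fa->nr_free[o - 1]++;
        return (int)cur;
    }
    return -1;
}
```

Starting from a single free order-3 block, an order-2 request consumes it and leaves one order-2 block free. A following order-0 request splits that block and leaves one order-1 and one order-0 block free.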

The slab allocator operates above the buddy system, at the object level. It carves pages obtained from the buddy system into equal-sized objects (e.g., task_struct objects of one fixed size). Each slab cache is dedicated to one object type. When the kernel allocates a task_struct, the slab allocator returns a cached object, in the common case without involving the buddy system at all. Freed objects are returned to the slab cache, not the buddy system, so the next allocation can reuse the freed object immediately.

2. When would you use vmalloc instead of kmalloc?

Use vmalloc when you need a large buffer (beyond what kmalloc can reliably provide; the classic figure is ~128 KB) that does not require physically contiguous memory. The kernel's module loader uses vmalloc-style allocations for module code and data.

Use kmalloc when you need physically contiguous memory (e.g., for DMA to devices with addressing limitations), when allocation latency matters (vmalloc involves setting up page tables), or when you are allocating from atomic context (vmalloc can sleep; kmalloc with GFP_ATOMIC does not).

A typical split: a driver uses kmalloc for small control structures and for buffers a device will access by DMA, and vmalloc for a large table that only the CPU reads, since that table need not be physically contiguous.

3. What are memory zones and why does ZONE_DMA still exist?

Memory zones are regions of physical memory with different properties, defined in the kernel's architecture-specific code. ZONE_DMA contains the first 16 MB of physical memory on x86 — the addressing range of the legacy ISA bus. Some ancient hardware (ISA devices, some RAID controllers) can only perform DMA within this range. The kernel places such buffers in ZONE_DMA.

ZONE_NORMAL (16 MB to 896 MB on 32-bit x86) is directly mapped to the kernel's linear address space. Allocations here are the fastest. ZONE_HIGHMEM (above 896 MB on 32-bit x86) is not directly mapped — the kernel must use kmap() to temporarily map these pages before accessing them. On 64-bit systems, the entire physical address space is within the direct-mapped range, so HIGHMEM is unnecessary.

4. What is the difference between SLAB, SLUB, and SLOB?

SLAB is the original Linux slab allocator, introduced in 2.2 and refined through 2.6. It maintains per-CPU and per-node queues of free objects and has extensive debugging features (red zoning, object poisoning, sanity checks). Its queues carry significant memory overhead and it does not scale as well on large NUMA systems. It was removed in Linux 6.8.

SLUB (Unqueued Slab Allocator, default since 2.6.23) removed the complex queue structures and keeps slab metadata in the page struct. It replaces the queues with a per-CPU active slab and simple per-node lists of partial slabs. It scales dramatically better on large systems, has lower metadata overhead per object, and is the default allocator on mainstream kernels.

SLOB (Simple List of Blocks) is a minimal allocator designed for embedded systems with very limited memory (sub-16 MB). It uses first-fit allocation rather than slab caches. It trades performance and fragmentation for minimal code size. It was selected with CONFIG_SLOB and was removed in Linux 6.4.

5. What causes kernel OOM (Out of Memory) and how does the OOM killer decide which process to terminate?

An out-of-memory condition occurs when the page allocator cannot satisfy a request even after page reclaim and slab cache shrinking, typically because physical memory and swap are exhausted. Kernel allocations are written to tolerate failure (kmalloc can return NULL), so the usual trigger is overall memory pressure rather than a single failing kernel request. The OOM killer then computes a "badness" score for each process, visible in /proc/PID/oom_score. The score is based mainly on the memory the process currently uses (resident pages, swap entries, and page tables), adjusted by its oom_score_adj. The process with the highest score is killed to reclaim its pages. You can influence the choice via /proc/PID/oom_score_adj (-1000 to +1000; negative values make a process less likely to be killed, positive values more likely, and -1000 exempts it entirely).
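A simplified sketch of how a badness score can combine a process's current memory use with oom_score_adj is shown below. It is loosely modeled on the kernel's heuristic and should be read as an illustration under that assumption; the function name, parameters, and the use of LONG_MIN as the "never kill" marker are choices made for this example.

```c
#include <assert.h>
#include <limits.h>

/* Simplified badness: pages currently in use, shifted by oom_score_adj.
 * Each adj point is worth 1/1000 of the total pages considered. */
static long oom_badness_sketch(long rss_pages, long swap_pages,
                               long pgtable_pages, int oom_score_adj,
                               long total_pages)
{
    long points;

    assert(oom_score_adj >= -1000 && oom_score_adj <= 1000);
    if (oom_score_adj == -1000)
        return LONG_MIN;   /* exempt: never selected */

    points = rss_pages + swap_pages + pgtable_pages;
    points += (long)oom_score_adj * (total_pages / 1000);
    return points;
}
```

With 100000 total pages, a process using 1000 pages scores 1000 at adj 0 and 51000 at adj 500, so a positive adjustment can outweigh actual memory use by a wide margin.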

7. What is the difference between ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM?

On 32-bit x86, physical memory is divided into zones based on hardware constraints. ZONE_DMA contains the first 16 MB of physical memory, required for legacy ISA devices that can only perform DMA within this range. ZONE_NORMAL (16 MB to 896 MB) is directly mapped to the kernel's linear address space, so allocations here are the fastest because no special mapping is needed. ZONE_HIGHMEM (above 896 MB) is not directly mapped: the kernel must use kmap() to temporarily map these pages before accessing them, which adds overhead. On 64-bit systems, the entire physical address space falls within the direct-mapped range, so HIGHMEM does not exist. x86-64 keeps ZONE_DMA, adds ZONE_DMA32 (memory below 4 GB, for devices limited to 32-bit addresses), and places everything else in ZONE_NORMAL.

8. How does the buddy system coalesce adjacent free blocks and why is this important?

The buddy system splits larger blocks in half when satisfying a request, keeping one half and placing the unused half on the free list for its order. When a block is freed, the allocator checks whether its "buddy" (the adjacent half of the split block) is also free and of the same order. If so, the two halves are coalesced into a larger block and placed on the next-higher-order free list. This process repeats up the chain until no further coalescing is possible. Coalescing is important because it counteracts fragmentation: as blocks of various sizes are allocated and freed, physical memory can become fragmented (many small holes). The buddy system merges adjacent free blocks, keeping larger contiguous regions available for future page-level allocations. This makes the buddy system resistant to external fragmentation at the page level.

9. What is the difference between vmalloc and kmalloc in terms of page table setup?

kmalloc returns addresses from the kernel's direct-mapped linear address range (low memory on 32-bit x86). The translation from virtual to physical is a simple fixed offset (PAGE_OFFSET). No page table entries need to be set up, because the kernel's page tables already map this entire range to physical memory. This is why kmalloc is fast and, with GFP_ATOMIC, can be called from atomic context.

vmalloc allocates from the vmalloc area (VMALLOC_START to VMALLOC_END), which is NOT in the kernel's direct-mapped range. Each page of a vmalloc allocation requires individual page table entries to be set up (pointing to the underlying physical pages). This page table setup involves the buddy system and page allocator, which may sleep — making vmalloc unsafe from atomic context. The overhead of vmalloc is primarily this page table setup, not the allocation itself.
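The fixed-offset translation used by the direct map can be sketched for 32-bit x86 as follows. The PAGE_OFFSET value of 0xC0000000 assumes the default 3 GB/1 GB split, the 896 MB bound is the low-memory limit discussed in this article, and the function names are hypothetical. vmalloc addresses cannot be translated this way; they require a page table walk.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET_32 0xC0000000u   /* assumption: default 3G/1G split */
#define LOWMEM_BYTES   (896u << 20)  /* directly mapped low memory */

/* Direct-map translation: virtual = physical + PAGE_OFFSET. */
static uint32_t phys_to_virt32(uint32_t phys)
{
    assert(phys < LOWMEM_BYTES);   /* only low memory is direct-mapped */
    return phys + PAGE_OFFSET_32;
}

/* Inverse translation, valid only for direct-mapped addresses. */
static uint32_t virt_to_phys32(uint32_t virt)
{
    assert(virt >= PAGE_OFFSET_32 && virt - PAGE_OFFSET_32 < LOWMEM_BYTES);
    return virt - PAGE_OFFSET_32;
}
```

Physical address 0x01000000 (16 MB) maps to virtual address 0xC1000000, with no page table changes involved.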

10. What is a slab cache and why does the kernel use per-type caches instead of a general-purpose allocator?

A slab cache is a per-type object pool maintained by the slab allocator. Each cache holds objects of one type (e.g., task_struct, buffer_head, inode). When the cache needs objects, it obtains whole pages from the buddy system and carves them into equal-sized objects. Freed objects are returned to the cache (not the buddy system), so the next allocation reuses the freed object immediately without any buddy system involvement. This approach keeps internal fragmentation low (objects are sized for the type), allows cache-line alignment where the cache requests it, and dramatically reduces allocation latency (the common case is taking an object from a per-CPU freelist). The alternative, using the buddy system for every kernel object, would be prohibitively slow and wasteful for the thousands of small allocations per second in a running kernel.

11. Why does kmalloc have a size limit and what do you do for larger allocations?

kmalloc returns physically contiguous memory, so its limit follows from the buddy system's largest block. The often-quoted 128 KB figure (an order-5 block: 2^5 × 4 KB pages) comes from older kernels; current x86 kernels define KMALLOC_MAX_SIZE as 4 MB, the size of an order-10 block. The practical limit is lower, because finding larger physically contiguous regions becomes increasingly unlikely as the size grows: the buddy system may have many free pages but not enough contiguous ones to satisfy a high-order allocation. For large allocations that still require physical contiguity (for DMA, for example), use alloc_pages() (the buddy system directly) with an appropriate order. For large allocations that do NOT require physical contiguity, use vmalloc(), which only guarantees virtual contiguity and draws on the much larger vmalloc area.

12. What is the difference between GFP_KERNEL and GFP_ATOMIC allocation contexts?

GFP_KERNEL allocations can sleep (block) while waiting for memory to become available. If the buddy system has no free pages, the allocator triggers page reclaim and waits for pages to be freed. This makes GFP_KERNEL appropriate for process context code where sleeping is acceptable. GFP_ATOMIC allocations cannot sleep: if no free pages are available, the allocation fails immediately rather than waiting. This is required for interrupt context, softirq, tasklet, and any code path where the scheduler cannot be invoked. The trade-off is that GFP_ATOMIC can fail, so callers must check the return value. To make failure less likely, GFP_ATOMIC requests may dip into memory reserves that ordinary allocations cannot touch. They do not invoke the OOM killer themselves, because that would require waiting.

13. What is memory zone balancing and how does it affect allocations on NUMA systems?

On NUMA systems, each node has its own set of memory zones. The kernel prefers to allocate memory from the node local to the CPU making the request, because local memory has lower latency and higher bandwidth than remote node memory. The allocation path calls alloc_pages_node(), which first tries the local node's appropriate zone. If that fails (e.g., the local node is out of memory), it falls back to remote nodes, where access is noticeably slower, often by tens of percent depending on the hardware. On a two-socket system, process A running on socket 0 and accessing memory on socket 1 pays the inter-socket latency (Infinity Fabric or QPI). The numactl tool and the mbind() and set_mempolicy() system calls allow fine-grained control over memory placement. Large database workloads often explicitly bind memory allocation to specific nodes to avoid cross-socket traffic.

14. What is the relationship between the page allocator and the slab allocator?

The page allocator (the buddy system) operates at the page level — it manages physical pages (4 KB chunks) and their allocation to any caller. The slab allocator sits above the page allocator and uses it to obtain pages for its caches. When a slab cache needs more objects, it calls alloc_pages() to obtain a batch of pages from the buddy system. These pages are then subdivided into equal-sized objects and placed in the cache. When objects are freed, they go back to the slab cache — not back to the buddy system — until the cache decides to release excess pages back to the buddy system. This layered approach means the buddy system only deals with page-sized allocations, while the slab layer handles the byte-to-KB allocations that the kernel needs.

15. What is a use-after-free (UAF) vulnerability in kernel code and how does KASAN detect it?

Use-after-free occurs when kernel code frees an object (returns it to the slab cache) but a dangling pointer in another part of the code still references it. A subsequent allocation of the same type reuses that memory, and the original code reads/writes the new object's data — corrupting kernel state or potentially escalating privileges.

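KASAN (Kernel Address Sanitizer) detects UAF at runtime using shadow memory: in the generic mode, each 8 bytes of kernel memory are tracked by one shadow byte that records whether those bytes may be accessed. The compiler instruments every load and store to check the shadow byte first. When an object is freed, KASAN marks its shadow bytes as freed and holds the object in a quarantine so the memory is not reused immediately; any access in the meantime hits the freed marker and produces a report with the allocation and free stack traces. (The 0x6B fill pattern is SLUB's own free poisoning, enabled through slub_debug, not part of KASAN.) Generic KASAN spends about 1/8 of the tracked memory on shadow, plus quarantine and redzone overhead, and slows the kernel noticeably, so it is a development and testing tool. It is enabled with CONFIG_KASAN=y in the kernel config and detects out-of-bounds accesses, use-after-free, and double-free bugs.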
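The shadow-memory idea behind KASAN can be sketched in user space. This is a simplified model, not KASAN's code: it tracks a 1 KB arena by offset, only handles ranges that are multiples of 8 bytes, and the marker values and function names are chosen for this example.

```c
#include <stddef.h>
#include <string.h>

#define GRANULE 8u            /* bytes of memory covered by one shadow byte */
#define ARENA_SIZE 1024u      /* size of the tracked region in this model */
#define SHADOW_OK 0x00        /* whole granule may be accessed */
#define SHADOW_FREED 0xFB     /* marker for freed memory (chosen for this model) */
#define SHADOW_NOACCESS 0xFF  /* never allocated, or a redzone */

static unsigned char shadow[ARENA_SIZE / GRANULE];

/* Start with the whole arena inaccessible. */
static void shadow_reset(void)
{
    memset(shadow, SHADOW_NOACCESS, sizeof shadow);
}

/* Set the state of [off, off + len); both must be multiples of GRANULE. */
static void shadow_mark(size_t off, size_t len, unsigned char state)
{
    memset(&shadow[off / GRANULE], state, len / GRANULE);
}

/*
 * The check that instrumentation would run before each load or store.
 * Returns SHADOW_OK if every byte is accessible, otherwise the shadow
 * value that explains why the access is invalid.
 */
static unsigned char check_access(size_t off, size_t len)
{
    if (len == 0)
        return SHADOW_OK;
    if (off >= ARENA_SIZE || len > ARENA_SIZE - off)
        return SHADOW_NOACCESS;
    for (size_t g = off / GRANULE; g <= (off + len - 1) / GRANULE; g++)
        if (shadow[g] != SHADOW_OK)
            return shadow[g];
    return SHADOW_OK;
}
```

Marking bytes 64 to 95 as `SHADOW_OK` models an allocation. An access that runs past byte 95 returns `SHADOW_NOACCESS` (out of bounds), and after the range is marked `SHADOW_FREED`, any access inside it returns `SHADOW_FREED` (use-after-free).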

16. How does the kernel's OOM killer interact with cgroup memory limits?

Linux cgroups (v1 and v2) allow per-cgroup memory limits enforced by the kernel's memory controller. When a cgroup's usage reaches its limit (memory.limit_in_bytes in v1, memory.max in v2) and reclaim within the cgroup cannot free enough memory, the kernel invokes the OOM killer within that cgroup and kills one of its processes. This is independent of the system-wide OOM killer: the victim is selected from among the processes in the cgroup, not system-wide. Critical system services that must never be killed should run with an oom_score_adj of -1000 (the minimum value, which exempts them from OOM selection) or outside the memory-constrained cgroup. Container runtimes (Docker, Kubernetes) use cgroups to enforce memory limits, so when a container hits its limit, the cgroup OOM killer terminates a process within that container, not elsewhere on the system.

17. What is the difference between vmalloc and vmap and when would you use each?

vmalloc() allocates a virtually contiguous region backed by non-contiguous physical pages and returns the start of that virtual address range. vmap() takes an existing array of page pointers and maps them into a contiguous virtual address range; the pages must already be allocated. Use vmalloc when you need memory and don't care about physical contiguity; use vmap when you already have pages (from alloc_pages or I/O) and need them to appear as one contiguous virtual region. The vmalloc result can be used directly; vmap is typically used during kernel initialization or for mapping I/O buffers. Both create page table entries dynamically, so neither can be called from atomic context.

18. What is the kernel's memory pool (mempool) and when is it used?

The kernel's memory pool (mempool) maintains a reserve of pre-allocated objects that can be used when normal allocation fails (e.g., during memory pressure). Mempools were originally designed for block I/O where an allocation failure during an I/O operation could cause deadlock — if the system is low on memory and needs to allocate a buffer to complete the I/O that would free more memory, a deadlock is possible. Mempools solve this by always keeping a minimum pool of allocated objects. They are used in the block layer (bio pools), SCSI mid-layer, and some filesystem code. For most kernel code, mempool is overkill — a failed allocation usually means genuine memory exhaustion and the OOM killer should handle it. Using mempool for regular allocations just delays the inevitable and reduces memory available for other uses.
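The reserve-and-fallback behavior can be sketched in user space. This is a hedged model rather than the kernel's mempool API: the `toy_mempool_*` names, the fixed reserve of four elements, and the `failing_alloc` helper (which simulates memory pressure) are all assumptions made for the example, and where the kernel version can sleep until an element is returned, this model just returns NULL.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_MIN 4  /* number of elements kept in reserve */

struct toy_mempool {
    void *reserve[POOL_MIN];    /* pre-allocated elements */
    int curr_nr;                /* elements currently in the reserve */
    size_t elem_size;
    void *(*alloc_fn)(size_t);  /* normal allocator, which may fail */
};

/* Hypothetical allocator that always fails, to simulate memory pressure. */
static void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

/* Fill the reserve up front, while memory is still plentiful. */
static int toy_mempool_init(struct toy_mempool *p, size_t elem_size,
                            void *(*alloc_fn)(size_t))
{
    p->elem_size = elem_size;
    p->alloc_fn = alloc_fn;
    p->curr_nr = 0;
    while (p->curr_nr < POOL_MIN) {
        void *e = malloc(elem_size);
        if (!e) {
            while (p->curr_nr > 0)
                free(p->reserve[--p->curr_nr]);
            return -1;
        }
        p->reserve[p->curr_nr++] = e;
    }
    return 0;
}

static void *toy_mempool_alloc(struct toy_mempool *p)
{
    void *e = p->alloc_fn(p->elem_size);  /* try the normal path first */

    if (e)
        return e;
    if (p->curr_nr > 0)
        return p->reserve[--p->curr_nr];  /* fall back to the reserve */
    return NULL;  /* reserve exhausted; the kernel version could wait here */
}

static void toy_mempool_free(struct toy_mempool *p, void *e)
{
    if (p->curr_nr < POOL_MIN)
        p->reserve[p->curr_nr++] = e;  /* refill the reserve first */
    else
        free(e);
}
```

With `failing_alloc` as the normal allocator, the first four requests are served from the reserve and the fifth fails. Freeing one element refills the reserve, so the next request succeeds again, which is the forward-progress guarantee the block layer relies on.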

19. What is the difference between kmalloc and the page allocator in terms of fragmentation?

The buddy system (page allocator) manages physical pages, so its fragmentation is at the page level and is managed through the order system. The slab allocator (used by kmalloc) manages sub-page objects carved from pages obtained from the buddy system. For kmalloc, fragmentation is mostly internal fragmentation within objects: a 64-byte request in a 64-byte slot wastes nothing, while a 65-byte request is rounded up to the next size class and wastes the difference. The SLUB allocator bounds this by maintaining size-specific caches (on a typical build: 8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, and 8192 bytes), so the waste per allocation is always less than the gap to the next size up. The buddy system can become externally fragmented over time, with many order-0 and order-1 free pages but no higher-order blocks available; compaction (run by the kcompactd kernel thread and on demand during allocation) migrates movable pages to rebuild larger free blocks.
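The internal waste per request is easy to compute once the size classes are known. The sketch below assumes the classes found on a typical x86-64 SLUB build (which include 96 and 192 bytes); the table and the function names are illustrative, and real kernels may differ by configuration.

```c
#include <stddef.h>

/* Assumed size classes; check /proc/slabinfo for the kmalloc-* caches on a real system. */
static const size_t kmalloc_classes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Returns the smallest class that fits the request, or 0 if none does. */
static size_t size_class_for(size_t request)
{
    size_t n = sizeof kmalloc_classes / sizeof kmalloc_classes[0];

    for (size_t i = 0; i < n; i++)
        if (request <= kmalloc_classes[i])
            return kmalloc_classes[i];
    return 0;  /* larger requests go to the page allocator instead */
}

/* Bytes lost to rounding up, or 0 when no class fits. */
static size_t internal_waste(size_t request)
{
    size_t cls = size_class_for(request);

    return cls ? cls - request : 0;
}
```

Under these assumptions a 64-byte request wastes nothing, a 65-byte request lands in the 96-byte class and wastes 31 bytes, and a 257-byte request lands in the 512-byte class and wastes 255 bytes, which is close to the worst case of just under half the slot.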

20. How does the kernel handle memory allocation failures gracefully?

Kernel code must handle allocation failures because, unlike user space (where malloc rarely returns NULL under the default overcommit policy and may hand out memory you never touch), kernel allocations can genuinely fail. The kernel provides several strategies: (1) check return values: every kmalloc, vmalloc, and alloc_pages call can return NULL, and well-written code checks and propagates or handles the failure. (2) OOM killer: when memory is genuinely exhausted, the OOM killer terminates a process to free memory so the system can continue. (3) mempools: pre-allocated reserves for critical paths where allocation failure would be catastrophic. (4) memory cgroups and limits: preventing any single service from consuming all memory. (5) boot-time reservations: reserving memory for specific uses so critical allocations succeed. Failing to check for allocation failures is one of the most common kernel bugs, and a NULL dereference from a failed kmalloc can crash the system.
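Strategy (1) usually takes the form of the goto-based unwind that is idiomatic in kernel C. The sketch below reproduces the pattern in user space so it can be compiled and run: `toy_alloc` stands in for kmalloc, `fail_after` is a hypothetical fault-injection knob, and `toy_device` is an invented structure.

```c
#include <errno.h>
#include <stdlib.h>

struct toy_device {
    char *rx_buf;
    char *tx_buf;
};

/* Hypothetical fault injection: fail the Nth call (0-based); -1 never fails. */
static int fail_after = -1;

static void *toy_alloc(size_t size)
{
    if (fail_after == 0) {
        fail_after = -1;
        return NULL;  /* simulated allocation failure */
    }
    if (fail_after > 0)
        fail_after--;
    return malloc(size);  /* stands in for kmalloc(size, GFP_KERNEL) */
}

/* Returns 0 on success or -ENOMEM, leaving no partial allocations behind. */
static int toy_device_init(struct toy_device *dev)
{
    dev->rx_buf = NULL;
    dev->tx_buf = NULL;

    dev->rx_buf = toy_alloc(4096);
    if (!dev->rx_buf)
        goto out;

    dev->tx_buf = toy_alloc(4096);
    if (!dev->tx_buf)
        goto free_rx;

    return 0;

free_rx:
    /* Undo earlier steps in reverse order of acquisition. */
    free(dev->rx_buf);
    dev->rx_buf = NULL;
out:
    return -ENOMEM;
}
```

Each allocation is checked immediately, and each failure jumps to a label that releases only what was acquired before it. Setting `fail_after = 1` makes the second allocation fail, and the function returns `-ENOMEM` with both pointers NULL rather than leaking the first buffer.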

Further Reading

Slab Cache Debugging and Analysis

Analyzing slab cache utilization:

# Top 20 slab caches by memory usage (num_objs * objsize, in KB)
# /proc/slabinfo has two header lines; columns are: name active_objs num_objs objsize ...
sudo awk 'NR>2 {printf "%d KB %s\n", $3*$4/1024, $1}' /proc/slabinfo | sort -rn | head -20

# Watch selected caches over time (reading /proc/slabinfo requires root)
sudo watch -n1 'grep -E "^(task_struct|buffer_head|ext4_inode_cache) " /proc/slabinfo'

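SLUB debug features (build with CONFIG_SLUB_DEBUG=y, then enable at boot with the slub_debug parameter):

  • slub_debug=<flags>,<cache>: enable debugging for specific caches only, for example slub_debug=FZ,dentry
  • slub_debug=FZPU: F=sanity checks, Z=red zoning, P=poisoning, U=user tracking (records the last allocation and free callers)
  • slub_debug=O: switch debugging off for caches where it would raise the minimum slab order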
  • Red zoning: fills unused space with a pattern to detect buffer overflows
  • Object poisoning: fills freed objects with a pattern to detect use-after-free

KASAN (Kernel Address Sanitizer):

  • Detects out-of-bounds and use-after-free at runtime
  • Generic mode uses about 1/8 of the tracked memory for shadow, plus quarantine and redzone overhead
  • Supported in modern kernels (4.0+)
  • Enable: CONFIG_KASAN=y in kernel config

Memory Zones and the DMA Zone on x86

On 32-bit x86, the DMA zone exists because of the ISA bus 16 MB limitation:

| Zone | Address Range (32-bit x86) | Size | Purpose |
|---|---|---|---|
| ZONE_DMA | 0 - 16 MB | 16 MB | ISA DMA devices |
| ZONE_NORMAL | 16 MB - 896 MB | 880 MB | Kernel direct-mapped |
| ZONE_HIGHMEM | 896 MB - end | Varies | Must use kmap() to access |

On 64-bit x86, the entire physical address space is within the direct-mapped range — ZONE_HIGHMEM is unnecessary and does not exist.
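The 32-bit boundaries above reduce to two comparisons on the physical address. The following is a minimal sketch of that classification; the `zone32` enum and the function name are made up for this example and are not kernel identifiers.

```c
#include <stdint.h>

/* Illustrative names; the kernel's own zone enum differs. */
enum zone32 { ZONE32_DMA, ZONE32_NORMAL, ZONE32_HIGHMEM };

#define MB(x) ((uint64_t)(x) << 20)

/* Classify a physical address using the classic 32-bit x86 boundaries. */
static enum zone32 zone_for_phys_addr(uint64_t phys)
{
    if (phys < MB(16))
        return ZONE32_DMA;      /* reachable by 24-bit ISA DMA */
    if (phys < MB(896))
        return ZONE32_NORMAL;   /* permanently mapped by the kernel */
    return ZONE32_HIGHMEM;      /* needs a temporary mapping to access */
}
```

The boundaries are half-open: the byte at exactly 16 MB is the first byte of ZONE_NORMAL, and the byte at exactly 896 MB is the first byte of ZONE_HIGHMEM.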

Key Takeaways

  • The kernel memory allocator is layered: SLUB (object cache) → buddy system (page allocator) → memory zones
  • The buddy system allocates physical pages in power-of-two sizes (orders) with coalescing on free
  • Slab allocators (SLUB is default) maintain per-type object caches above the buddy system
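  • kmalloc allocates from the direct-mapped zone (physically and virtually contiguous); keep requests small (up to roughly 128 KB is the traditional guideline), since the hard limit KMALLOC_MAX_SIZE is larger on modern kernels but big contiguous blocks are hard to find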
  • vmalloc allocates from the vmalloc region (virtually contiguous, physically fragmented)
  • Memory zones organize physical memory by capability and mapping requirements
  • kmalloc with GFP_KERNEL can sleep (process context); GFP_ATOMIC cannot sleep (interrupt context)

Conclusion

The kernel memory architecture is a layered system that transforms physical RAM into the virtual address spaces your applications use. At the foundation sits the buddy system, allocating physical pages in power-of-two blocks. Above it, slab allocators like SLUB carve those pages into cached objects, reducing allocation overhead for the kernel’s most frequently-created structures.

Understanding kmalloc versus vmalloc is essential for driver and module development. The former gives you physically contiguous memory from the direct-mapped zone; the latter provides virtually contiguous regions at the cost of potentially scattered physical pages. Memory zones reflect the hardware reality: the DMA zone serves legacy devices that cannot address full memory, while HighMem exists only on 32-bit systems where the kernel’s linear address range cannot cover all physical RAM.

For continued learning, explore how the kernel handles out-of-memory conditions through the OOM killer, and how cgroup-based memory limits contain kernel memory growth in containerized environments. The intersection of kernel memory allocation and security, particularly KASAN for use-after-free detection and KPTI for Meltdown mitigation, represents an advanced frontier in systems programming.
