Virtualization Basics
Explore hypervisors, virtual machines, containers, and OS-level virtualization — understanding the technologies powering cloud computing.
The cloud computing revolution runs on virtualization. Every AWS instance, every Docker container, every Kubernetes pod exists because of a fundamental idea: you can make one physical computer look like many computers, or conversely, make many physical computers look like one. Understanding virtualization isn’t optional anymore—it’s foundational infrastructure knowledge for anyone building, deploying, or operating modern software systems.
Whether you’re debugging a container networking issue, architecting a multi-tenant SaaS platform, or trying to understand why your Kubernetes pod behaves differently than expected, virtualization concepts are essential.
Introduction
Virtualization is the simulation of hardware or software resources. In computing, it typically means creating multiple isolated environments on a single physical machine, where each environment believes it has exclusive access to its own set of resources.
The key benefits that drove adoption:
- Server consolidation: Run multiple “servers” on one physical machine, raising hardware utilization from a commonly cited 15% or so to 70% or more
- Isolation: A bug or security issue in one VM or container is far less likely to affect the others
- Elasticity: Create and destroy environments on demand
- Portability: VMs and containers package an environment that runs consistently across hosts
Modern virtualization exists on a spectrum, from full hardware emulation (virtual machines) to lightweight process isolation (containers), each with different tradeoffs.
When to Use / When Not to Use
When Virtualization Is Essential
- Cloud computing — All major cloud providers (AWS, GCP, Azure) run workloads in virtualized environments
- Multi-tenant SaaS — Isolating customer data and compute resources
- Legacy application hosting — Running old OS versions on modern hardware
- Development and testing — Reproducing production environments locally
- Microservices architecture — Containers provide the process isolation and deployment model microservices need
When Virtualization May Be Overkill
- Simple scripts — Native execution is faster and simpler
- Single-tenant high-performance workloads — Bare metal may provide better performance
- Resource-constrained environments — Containers add overhead in memory and CPU
- Real-time systems with hard latency guarantees — Virtualization introduces unpredictable latency
Virtualization Architecture
Hypervisor Types
graph TB
subgraph "Type 1 Hypervisor (Bare Metal)"
A[Hardware] --> B[Xen / VMware ESXi / Hyper-V]
B --> C[VM 1]
B --> D[VM 2]
B --> E[VM N]
end
subgraph "Type 2 Hypervisor (Hosted)"
F[Hardware] --> G[Host OS]
G --> H[VMware Workstation / VirtualBox]
H --> I[VM 1]
H --> J[VM 2]
end
style A stroke:#00fff9,stroke-width:2px
style F stroke:#00fff9,stroke-width:2px
Container Architecture
graph TB
subgraph "Host Kernel"
A[Host OS Kernel]
A --> B[Namespaces]
A --> C[cgroups]
A --> D[Overlay Filesystem]
end
subgraph "Containers"
E[Container 1]
F[Container 2]
G[Container N]
end
subgraph "Container Runtime"
H[containerd / CRI-O]
H --> I[runc]
end
B --> E
B --> F
C --> E
C --> F
D --> E
D --> F
style A stroke:#ff00ff,stroke-width:2px
style H stroke:#00fff9,stroke-width:2px
Core Concepts
Type 1 Hypervisors (Bare Metal)
Type 1 hypervisors run directly on hardware without a host operating system. They are the foundation of enterprise virtualization and cloud computing.
VMware ESXi: Commercial hypervisor with vSphere management, known for reliability and enterprise features.
Microsoft Hyper-V: Windows-native hypervisor that also runs Linux VMs; integrated with Windows Server.
Xen: Open-source hypervisor that powered earlier generations of AWS EC2. Newer EC2 instance types run on AWS’s Nitro system, which pairs a lightweight KVM-based hypervisor with dedicated hardware that offloads virtualization tasks.
KVM: Kernel-based Virtual Machine. A Linux kernel module that turns Linux into a Type 1 hypervisor. Combined with QEMU for device emulation, KVM powers many cloud providers and is the foundation of Red Hat Virtualization.
Type 2 Hypervisors (Hosted)
Type 2 hypervisors run as an application within a host operating system. They’re primarily used for desktop virtualization.
VirtualBox — Oracle’s open-source hypervisor, popular for development and testing.
VMware Workstation/Fusion — Commercial products for Windows/Linux (Workstation) and macOS (Fusion).
QEMU — Open-source emulator and hypervisor. Can run as Type 2 or (with KVM) as Type 1.
Virtual Machines vs Containers
Virtual machines emulate entire hardware platforms, including CPU, memory, storage, and network. Each VM runs a complete operating system (the guest OS), making them fully isolated but resource-heavy.
Containers share the host kernel but provide process isolation via Linux namespaces and resource limits via cgroups. They package an application and its dependencies but not a full OS. This makes them:
- Lighter — No guest OS overhead (typically 10-100MB vs GB for VMs)
- Faster to start — Seconds vs minutes for VMs
- More efficient — Higher density per host
The tradeoff is that containers on the same host share the kernel, so a kernel vulnerability can let an attacker break out of container isolation in a way that a VM’s separate kernel would contain.
Linux Namespace Types
Namespaces partition kernel resources so that processes in different namespaces see different views:
| Namespace | Flag | Isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs |
| Network | CLONE_NEWNET | Network devices, ports, routes |
| Mount | CLONE_NEWNS | Mount points, filesystem views |
| UTS | CLONE_NEWUTS | Hostname, domain name |
| IPC | CLONE_NEWIPC | System V IPC, POSIX queues |
| User | CLONE_NEWUSER | User and group IDs |
| Cgroup | CLONE_NEWCGROUP | Cgroup root directory |
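On a Linux host, the kernel exposes each process's namespace membership as symbolic links under /proc/<pid>/ns, and two processes share a namespace when the link targets carry the same inode number. The following is a minimal Python sketch under that assumption; the helper names (parse_ns_link, read_namespaces, same_namespace) are illustrative and not part of any library.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (kind, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def read_namespaces(ns_dir="/proc/self/ns"):
    """Return {entry name: inode} for every namespace link in ns_dir."""
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def same_namespace(ns_a, ns_b, name):
    """Two processes share a namespace when the inode numbers match."""
    return name in ns_a and ns_a.get(name) == ns_b.get(name)


if __name__ == "__main__":
    # Only meaningful on Linux, where /proc/self/ns exists.
    if os.path.isdir("/proc/self/ns"):
        for name, inode in read_namespaces().items():
            print(f"{name:20s} {inode}")
```

Comparing the output for a shell on the host with the output for a process inside a container should show different inode numbers for the namespaces the container runtime unshared.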
Control Groups (cgroups)
cgroups limit and isolate resource usage (CPU, memory, I/O, network) for process groups. They prevent any single container from consuming all host resources and ensure fair sharing across containers.
Key controllers:
- cpu: CPU time allocation
- memory: Memory limits and swap
- io: Block device I/O throttling
- pids: Process count limits
- cpuset: CPU core pinning
Container Runtime Standards
OCI (Open Container Initiative) defines standards for container formats and runtimes:
- runc — Reference implementation of OCI runtime spec; creates and runs containers
- containerd — Industry-standard container runtime that manages container lifecycle (start, stop, pause)
- CRI-O — Kubernetes-specific container runtime implementing CRI
Docker Architecture
Docker popularized containerization with its integrated platform:
- Docker client: CLI for user commands
- Docker daemon (dockerd): Background service managing images, containers, networks, volumes
- containerd: High-level container runtime that dockerd delegates container lifecycle management to
- runc: Low-level runtime that containerd invokes to create and start each container
Modern Docker uses containerd as its runtime, with containerd-shim processes that decouple container lifecycle from the daemon.
Production Failure Scenarios
Scenario 1: Container Memory Exhaustion (OOM Kill)
What happens: A container exceeds its memory limit, triggering the kernel’s OOM killer to terminate processes. Application crashes, logs show “Killed” or OOM-related errors.
Detection:
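# Check container memory usage (one snapshot instead of a live stream)
docker stats --no-stream
# or, on nodes managed through CRI (the MEM column shows memory usage)
crictl stats
# Check kernel OOM logs
dmesg | grep -i "killed process"
journalctl -k -b | grep -i oom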
Mitigation:
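# docker-compose.yml
services:
  myapp:
    # Service-level keys
    mem_limit: 512m
    mem_reservation: 256m
    # Equivalent deploy-style keys; if both forms are set, keep the values consistent
    deploy:
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 256M
# Kubernetes pod resource limits (set under each container in the pod spec)
resources:
  limits:
    memory: "512Mi"
  requests:
    memory: "256Mi"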
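The snippets above write the same limit three ways (512m, 512M, and 512Mi), and the suffix rules differ between tools: Docker-style values are case-insensitive and binary, while Kubernetes distinguishes binary Mi from decimal M. The sketch below illustrates that difference. It is a simplification that only handles whole numbers with the suffixes listed, and the function names are illustrative.

```python
# Kubernetes-style suffixes: "Mi" is binary (2**20), "M" is decimal (10**6).
K8S_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}

# Docker-style suffixes are case-insensitive and always binary.
DOCKER_SUFFIXES = {"b": 1, "k": 2**10, "m": 2**20, "g": 2**30}


def k8s_memory_to_bytes(quantity):
    """Convert a Kubernetes memory quantity such as "512Mi" to bytes."""
    # Check two-character suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(K8S_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * K8S_SUFFIXES[suffix]
    return int(quantity)  # plain byte count


def docker_memory_to_bytes(value):
    """Convert a Docker-style value such as "512m" to bytes."""
    lowered = value.lower()
    if lowered[-1] in DOCKER_SUFFIXES:
        return int(lowered[:-1]) * DOCKER_SUFFIXES[lowered[-1]]
    return int(lowered)


if __name__ == "__main__":
    print(k8s_memory_to_bytes("512Mi"))    # 536870912
    print(docker_memory_to_bytes("512m"))  # 536870912
    print(k8s_memory_to_bytes("512M"))     # 512000000
```

The practical consequence is that writing "512M" in a Kubernetes manifest yields a limit about 5% lower than "512Mi", which can matter for a workload that runs close to its limit.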
Scenario 2: VM Live Migration Failure
What happens: During live migration of a VM between hosts (for maintenance or load balancing), the VM pauses, memory copies to the destination, but the VM fails to resume properly or network connectivity drops.
Mitigation:
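- Pre-copy migration copies memory pages while the VM keeps running and pauses it only to send the final dirty pages (safe to abort, but write-heavy workloads need a fast network to converge)
- Post-copy migration pauses the VM briefly, resumes it on the destination, and fetches the remaining pages on demand (shorter, more predictable migration, but a network or host failure mid-migration can lose the VM)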
- Use shared storage (SAN/NFS) to avoid disk migration
- Test migration during maintenance windows
- Monitor network latency between hosts
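Whether pre-copy migration converges depends on how quickly the guest dirties memory relative to the bandwidth between hosts. The toy model below makes that relationship concrete. It is a sketch under simplifying assumptions (constant dirty rate, constant bandwidth, an arbitrary 64 MB stop threshold), not the algorithm any particular hypervisor uses, and all names and parameters are illustrative.

```python
def simulate_precopy(memory_mb, bandwidth_mb_s, dirty_rate_mb_s,
                     stop_threshold_mb=64, max_rounds=30):
    """Simulate iterative pre-copy rounds.

    Returns (rounds, total_seconds, downtime_seconds, converged).
    """
    remaining = float(memory_mb)  # data still to send in this round
    total_time = 0.0
    for round_no in range(1, max_rounds + 1):
        round_time = remaining / bandwidth_mb_s
        total_time += round_time
        # Pages dirtied while this round was in flight must be re-sent,
        # capped at the VM's total memory.
        remaining = min(dirty_rate_mb_s * round_time, float(memory_mb))
        if remaining <= stop_threshold_mb:
            # Pause the VM and send the last dirty pages: this is the downtime.
            downtime = remaining / bandwidth_mb_s
            return round_no, total_time + downtime, downtime, True
    # Did not converge: the dirty rate is too close to (or above) the bandwidth.
    downtime = remaining / bandwidth_mb_s
    return max_rounds, total_time + downtime, downtime, False


if __name__ == "__main__":
    # 4 GB guest, 1000 MB/s link, guest dirties 100 MB/s: converges quickly.
    print(simulate_precopy(4096, 1000, 100))
    # Same guest dirtying 1200 MB/s: never converges, so the final pause is long.
    print(simulate_precopy(4096, 1000, 1200))
```

In this model, a dirty rate at or above the link bandwidth never shrinks the remaining set, which is the situation where a forced stop-and-copy or post-copy becomes the only way to finish.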
Scenario 3: Container Escape Vulnerability
What happens: A vulnerability in the container runtime or a misconfiguration allows an attacker to escape container isolation and access the host or other containers. Notable examples: containerd CVE-2020-15257, runc CVE-2019-5736, and runc CVE-2021-30465.
Mitigation:
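# Never run containers with --privileged
# Use read-only root filesystems where possible
docker run --read-only --tmpfs /tmp myapp
# Drop all capabilities, add only what's needed
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp
# Prevent privilege escalation
docker run --security-opt=no-new-privileges:true myapp
# Docker applies its default seccomp profile automatically; pass a stricter
# custom profile only if needed (the file path here is a placeholder)
docker run --security-opt seccomp=./seccomp-profile.json myapp
# Keep container runtimes updated (the package may be named containerd.io,
# depending on where it was installed from)
sudo apt-get update && sudo apt-get install --only-upgrade containerd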
Trade-off Table
| Aspect | VM | Container | Bare Metal |
|---|---|---|---|
| Isolation | Full (separate kernel) | Process (shared kernel) | Full |
| Boot time | Tens of seconds to minutes | Milliseconds to seconds | Minutes (firmware and OS boot) |
| Resource overhead | GB (guest OS) | MB (app + deps) | None |
| Max density | Low (10s/hypervisor) | High (100s/host) | N/A |
| Security boundary | Strong | Moderate | Strongest |
| Live migration | Yes | Limited | N/A |
| Snapshot/clone | Yes | Image layers | No |
| Persistence | Disk image | Image + volumes | Local disk |
Hypervisor comparison:
| Hypervisor | Type | Performance | Management | Use Case |
|---|---|---|---|---|
| KVM | 1 | Near-native | Open, complex | Cloud providers |
| ESXi | 1 | Near-native | vSphere | Enterprise |
| Hyper-V | 1 | Near-native | SCVMM | Windows shops |
| Xen | 1 | Near-native | Complex | AWS legacy |
| QEMU | 2 | Emulated | Manual | Emulation, testing |
Implementation Snippets
Creating and Running a Simple Container
# Dockerfile for a minimal container
FROM ubuntu:22.04
# Don't run as root
RUN useradd -m appuser
USER appuser
# Only copy what you need
COPY --chown=appuser:appuser app /home/appuser/app
WORKDIR /home/appuser/app
CMD ["./app"]
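#!/bin/bash
# Build the image and open an interactive shell in a throwaway container
docker build -t myapp:latest .
docker run --rm -it myapp:latest /bin/sh
# Inspect the image metadata
docker inspect myapp:latest
# Start a named container in the background, then run a command inside it
# (the name myapp-demo is only an example)
docker run -d --name myapp-demo myapp:latest
docker exec -it myapp-demo ls /
# Resource limits
docker run --memory=512m --cpus=0.5 myapp:latest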
Working with Linux Namespaces Directly
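#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
#define STACK_SIZE (1024 * 1024)
/* Entry point for the child; runs inside the new namespaces. */
static int child_main(void *arg) {
    (void)arg;
    /* The first process in a new PID namespace sees itself as PID 1. */
    printf("Child PID: %d (in new namespace)\n", getpid());
    /* The parent lives outside this PID namespace, so getppid() returns 0. */
    printf("Parent of child: %d\n", getppid());
    /* Sleep and exit */
    sleep(60);
    return 0;
}
int main(void) {
    printf("Parent PID: %d\n", getpid());
    /* clone() needs a caller-allocated stack for the child. */
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) {
        perror("malloc failed");
        return 1;
    }
    /* Flags: new PID, network, and mount namespaces (requires root or
       CAP_SYS_ADMIN). The stack grows downward on most architectures,
       so pass a pointer to its top. */
    pid_t pid = clone(child_main, stack + STACK_SIZE,
                      CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS | SIGCHLD,
                      NULL);
    if (pid == -1) {
        perror("clone failed");
        free(stack);
        return 1;
    }
    printf("Created child with PID: %d\n", pid);
    waitpid(pid, NULL, 0);
    printf("Child exited\n");
    free(stack);
    return 0;
}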
Inspecting cgroup Hierarchy
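#!/bin/bash
# Explore cgroup structure
echo "=== Cgroup Version ==="
if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
    echo "cgroup2 (unified hierarchy)"
    CGROUP_V2=1
else
    echo "cgroup1 (legacy hierarchy)"
    CGROUP_V2=0
fi
echo -e "\n=== Current Process Cgroups ==="
cat /proc/self/cgroup
if [ "$CGROUP_V2" -eq 1 ]; then
    # v2: limits live in this process's own cgroup directory
    # (the root cgroup has no limit files, hence 2>/dev/null)
    CG_DIR="/sys/fs/cgroup$(grep '^0::' /proc/self/cgroup | cut -d: -f3)"
    echo -e "\n=== Memory Cgroup Limits ==="
    cat "$CG_DIR/memory.max" "$CG_DIR/memory.high" 2>/dev/null
    echo -e "\n=== CPU Cgroup Limits ==="
    cat "$CG_DIR/cpu.max" 2>/dev/null
else
    echo -e "\n=== Memory Cgroup Limits ==="
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes
    cat /sys/fs/cgroup/memory/memory.soft_limit_in_bytes
    cat /sys/fs/cgroup/memory/memory.swappiness
    echo -e "\n=== CPU Cgroup Limits ==="
    cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
    cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
fi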
Kubernetes Pod Spec with Resource Limits
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: myapp
      image: myapp:1.0.0
      resources:
        limits:
          memory: "512Mi"
          cpu: "500m"
        requests:
          memory: "256Mi"
          cpu: "100m"
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL
      livenessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
Observability Checklist
VM Monitoring
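# Guest or host OS view: CPU, memory, and I/O, one sample per second
vmstat 1
# VMware ESXi: run inside the ESXi shell
esxtop
# KVM/QEMU (libvirt)
virsh list --all
virsh dominfo <vm-name>
virsh dommemstat <vm-name>
# Host-side view of QEMU processes and interrupt distribution
top -b -n 1 | grep qemu
cat /proc/interrupts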
Container Monitoring
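# Docker stats (real-time)
docker stats
# Kubernetes pod metrics (requires metrics-server in the cluster)
kubectl top pods
kubectl top nodes
# Docker daemon metrics in Prometheus format
# (requires "metrics-addr": "127.0.0.1:9323" in /etc/docker/daemon.json)
curl http://localhost:9323/metrics
# cAdvisor for container metrics (replace <version> with a released tag)
docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8080:8080 \
  --detach=true \
  gcr.io/cadvisor/cadvisor:<version>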
Network Namespace Debugging
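#!/bin/bash
# Inspect container networking (run as root)
# List named network namespaces (Docker's are not listed here by default)
ip netns list
# Create a network namespace (simulates container)
ip netns add testns
ip netns exec testns ip addr
# Connect namespaces with veth pair
ip link add veth0 type veth peer name veth1
ip link set veth1 netns testns
# Assign example addresses and bring both ends up
ip addr add 10.200.0.1/24 dev veth0
ip link set veth0 up
ip netns exec testns ip addr add 10.200.0.2/24 dev veth1
ip netns exec testns ip link set veth1 up
# Verify connectivity, then clean up (deleting the namespace removes the veth pair)
ip netns exec testns ping -c 1 10.200.0.1
ip netns del testns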
Security and Compliance
VM Security Considerations
- Hypervisor vulnerabilities — A flaw in the hypervisor can expose all VMs to attack; keep hypervisors updated
- Side-channel attacks — Spectre/Meltdown variants affect hypervisors; enable hypervisor-specific mitigations
- VM escape — Exploits that break out of VM isolation to access host; less common but severe
- Storage security — VM disks may contain sensitive data; encrypt at rest and in transit
Container Security Best Practices
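# Kubernetes security context examples
# Pod-level security context (spec.securityContext)
securityContext:
  runAsNonRoot: true
  runAsUser: 10000
  fsGroup: 10000
  seccompProfile:
    type: RuntimeDefault
# Container-level security context (capabilities can only be set per container)
securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE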
Defense in depth for containers:
- Run containers with minimal privileges (drop ALL capabilities)
- Use read-only root filesystems where possible
- Never expose the Docker socket to containers
- Scan images for vulnerabilities (Trivy, Grype)
- Implement network policies to restrict pod-to-pod communication
- Use admission controllers (like OPA/Gatekeeper) to enforce policies
Compliance Considerations
- PCI-DSS — Requires virtualization technology to provide isolation between cardholder data and other systems
- HIPAA — Virtualized environments must ensure PHI isolation
- SOC 2 — Requires evidence of proper isolation and access controls
Common Pitfalls / Anti-patterns
- Running containers with --privileged: Gives the container full access to host devices; attackers can escape to the host. Use specific capabilities instead
- Exposing the Docker socket: Mounting /var/run/docker.sock into a container gives that container root access to the host. Use Docker-in-Docker alternatives instead
- Not setting resource limits: Containers without limits can consume all host memory/CPU, affecting other workloads. Always set explicit limits
- Using the latest tag: Makes it impossible to roll back or audit which version ran. Always use specific tags
- Running as root inside containers: If compromised, the attacker has root on the host (with certain namespace configurations). Use runAsUser in the security context
- Ignoring VM sprawl: Unused VMs consume resources and become security liabilities. Implement VM lifecycle management
- Not testing live migration: Assuming migration works without testing can cause production outages. Test during maintenance windows
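Several of the container pitfalls above are visible directly in the arguments passed to docker run, so they can be caught by a simple pre-flight check. The sketch below is illustrative only: audit_docker_run and has_flag are hypothetical helpers, the checks are string matches on the argument list rather than a real parser, and an image given without any tag (which also resolves to latest) is not detected.

```python
def has_flag(args, *names):
    """True if any flag name appears as `--flag` or `--flag=value`."""
    return any(a == n or a.startswith(n + "=") for a in args for n in names)


def audit_docker_run(args):
    """Return a list of warnings for a `docker run` argument list."""
    warnings = []
    if has_flag(args, "--privileged"):
        warnings.append("--privileged grants full access to host devices")
    if any("/var/run/docker.sock" in a for a in args):
        warnings.append("Docker socket is mounted into the container")
    if not has_flag(args, "--memory", "-m"):
        warnings.append("no memory limit (--memory)")
    if not has_flag(args, "--cpus"):
        warnings.append("no CPU limit (--cpus)")
    if not has_flag(args, "--user", "-u"):
        warnings.append("no --user set; the image default may be root")
    if any(not a.startswith("-") and a.endswith(":latest") for a in args):
        warnings.append("image uses the :latest tag")
    return warnings


if __name__ == "__main__":
    risky = ["--privileged",
             "-v", "/var/run/docker.sock:/var/run/docker.sock",
             "myapp:latest"]
    for warning in audit_docker_run(risky):
        print("WARN:", warning)
```

In Kubernetes the same intent is usually expressed declaratively through admission policies rather than ad hoc scripts, but a check like this can be handy in CI for plain Docker deployments.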
Quick Recap Checklist
- Type 1 hypervisors run directly on hardware; Type 2 run within a host OS
- Virtual machines provide full hardware emulation with isolated guest OS
- Containers share the host kernel but provide process isolation via namespaces
- Linux namespaces partition kernel resources (PID, network, mount, UTS, IPC, user, cgroup)
- cgroups limit and meter resource usage (CPU, memory, I/O) for process groups
- Docker/containers package applications; VMs package entire operating systems
- Container security requires defense in depth: minimal privileges, read-only filesystems, capability dropping
- Kubernetes talks to container runtimes such as containerd and CRI-O through the Container Runtime Interface (CRI)
- VM live migration enables host maintenance with minimal downtime; test migration beforehand
- OOM kills in containers happen when memory limits are exceeded; always set limits
Interview Questions
What is the difference between a virtual machine and a container?
A virtual machine emulates an entire hardware platform: a complete CPU, memory, storage, and network subsystem. Each VM runs a full operating system (guest OS) from its own bootloader. VMs provide strong isolation because each has its own kernel. Starting a VM typically takes tens of seconds to minutes, and the guest OS alone often consumes hundreds of megabytes to gigabytes of RAM.
A container shares the host kernel but provides isolated views of process trees, network ports, mount points, and other kernel resources through Linux namespaces. Containers package an application and its dependencies but not a full OS. They start in seconds and consume megabytes because there's no duplicated OS.
The tradeoff is security isolation strength. A kernel vulnerability in a container can potentially affect the host and other containers on the same host—a VM's separate kernel prevents this. For high-security workloads, VMs provide stronger isolation at the cost of more resources.
How do Linux namespaces provide container isolation?
Linux namespaces partition kernel resources so that processes in different namespaces see different system-wide resources. When a process calls clone() with namespace flags, the child gets a new namespace view:
- PID namespace — Processes in the container see different PIDs; PID 1 inside is not PID 1 on the host
- Network namespace — Each container gets its own network stack with its own interfaces, routing tables, and port numbers
- Mount namespace — Each container can mount different filesystems; mounts don't propagate to host
- UTS namespace — Containers can have different hostnames
- IPC namespace — System V message queues and shared memory are isolated
Namespaces are the fundamental mechanism that makes containers possible—they're what Docker and Kubernetes build on top of.
What are cgroups, and why do containers need them?
Control groups (cgroups) are a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, I/O, network) of process groups. While namespaces provide isolation (different views of resources), cgroups provide control (limits on resources).
Without cgroups, a single container could consume all available memory, starving other containers and the host. With cgroups, you can say "this container gets maximum 512MB RAM and 0.5 CPUs." The kernel enforces these limits, killing processes or throttling as needed.
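The "512MB RAM and 0.5 CPUs" example can be made concrete by showing which cgroup files such a limit would end up in. The sketch below assumes the common 100 ms CFS period and uses the standard cgroup v1 and v2 file names. The function names are illustrative, and real runtimes write these values through their own code paths.

```python
def cpu_limit_to_cgroup(cpus, period_us=100_000):
    """Translate a CPU limit in cores into CFS quota/period settings."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    # The quota is the CPU time (in microseconds) allowed per period.
    quota_us = round(cpus * period_us)
    return {
        "cgroup_v1": {"cpu.cfs_quota_us": quota_us,
                      "cpu.cfs_period_us": period_us},
        "cgroup_v2": {"cpu.max": f"{quota_us} {period_us}"},
    }


def memory_limit_to_cgroup(limit_bytes):
    """Translate a memory limit in bytes into cgroup settings."""
    return {
        "cgroup_v1": {"memory.limit_in_bytes": limit_bytes},
        "cgroup_v2": {"memory.max": limit_bytes},
    }


if __name__ == "__main__":
    print(cpu_limit_to_cgroup(0.5))
    print(memory_limit_to_cgroup(512 * 2**20))
```

A quota of 50000 against a period of 100000 means the container's processes may run for at most 50 ms in every 100 ms window, summed across all cores, after which they are throttled until the next period begins.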
Docker uses cgroups to implement its --memory and --cpus flags. Kubernetes pod resource limits map directly to cgroup settings on the node. Without cgroups, containerization wouldn't be safe for multi-tenant workloads.
What is a hypervisor, and how do Type 1 and Type 2 hypervisors differ?
A hypervisor is the software that creates and runs virtual machines. It sits between the hardware and the VMs, presenting virtual hardware to each VM and managing the actual hardware allocation.
Type 1 (bare metal) hypervisors run directly on hardware without a host OS. Examples: VMware ESXi, Microsoft Hyper-V, KVM, Xen. These are used in data centers and cloud providers because they have minimal overhead and are purpose-built for virtualization.
Type 2 (hosted) hypervisors run as an application within a regular operating system. Examples: VirtualBox, VMware Workstation. These are used primarily for development and testing on desktops where full data center infrastructure isn't needed.
KVM (Kernel-based Virtual Machine) is interesting because it runs as a Linux kernel module but functions as a Type 1 hypervisor—when KVM is loaded, Linux itself becomes the hypervisor. This gives KVM the performance of Type 1 with the flexibility of Linux.
How do dockerd, containerd, and runc fit together in Docker's architecture?
Docker's architecture has evolved significantly. Originally, Docker was a single monolithic daemon that used LXC and later its own libcontainer library to run containers, but the modern architecture separates concerns:
The Docker daemon (dockerd) exposes the Docker API, manages images, networks, and volumes, and orchestrates the runtime. The docker CLI that users interact with is a separate client that talks to this API.
containerd is the industry-standard container runtime that handles the actual container lifecycle: creating, starting, stopping, pausing, and deleting containers. It was donated to the CNCF and is one of the most widely used runtimes for Kubernetes.
runc is the low-level container runtime that creates and runs containers according to OCI specifications. It's the actual process that spawns the container—containerd uses runc to do the heavy lifting.
The advantage of this separation is interoperability: Kubernetes doesn't need Docker; it can use containerd or CRI-O directly via the Container Runtime Interface (CRI). This modularity lets different projects use the same container runtime without depending on Docker.
Further Reading
- Boot Process — How the OS starts up and loads the kernel
- Kernel Architecture — How hypervisors relate to kernel design
- System Calls Interface — How containers make syscalls to the host kernel
- Docker Documentation: Container Runtime
- Kubernetes Documentation: Container Runtime Interface
How does VM live migration work?
Live migration moves a running virtual machine from one physical host to another without disconnecting clients. The process typically uses pre-copy migration: the source VM continues running while memory pages are copied iteratively to the destination, with pages modified during copy being re-sent each round. Once memory synchronization nears completion, the VM pauses briefly, the remaining dirty pages are transferred, and the VM resumes on the destination host.
Post-copy migration takes the opposite approach: the VM pauses immediately, minimal state is transferred, and the VM resumes on the destination while memory pages are faulted in on demand. Pre-copy keeps a complete copy of the VM on the source until switchover, so a failed migration can simply be abandoned, but write-heavy workloads lengthen it and can increase the final pause; post-copy bounds total migration time but slows the VM while pages are fetched and can lose the VM if the network or either host fails mid-migration. Shared storage (SAN/NFS) eliminates disk migration, significantly reducing migration complexity.
What is a container escape, and how do you defend against it?
Container escape occurs when an attacker breaks out of container isolation to access the host or other containers. Primary vectors include: vulnerabilities in the kernel or in container runtimes (like CVE-2020-15257 in containerd, and CVE-2019-5736 and CVE-2021-30465 in runc), misconfigured capabilities granting too many privileges, mounting the Docker socket into a container (which gives host root access), and vulnerable syscalls not blocked by seccomp profiles.
Namespace isolation means containers share the kernel, so kernel exploits that would be contained by VM isolation can escape containers. Defense requires: never running containers with --privileged, dropping all capabilities and adding only what's necessary, using read-only root filesystems, enabling seccomp profiles, keeping container runtimes updated, and using admission controllers in Kubernetes to enforce security policies.
How does cgroup v2 differ from cgroup v1?
Cgroup v2 (unified hierarchy) addresses fundamental limitations in cgroup v1. In cgroup v1, each controller (cpu, memory, io, pids) maintained its own separate hierarchy, leading to complexity when controllers depended on each other. Cgroup v2 unifies all controllers into a single hierarchy, simplifying resource management and eliminating cross-controller conflicts.
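The difference between several parallel hierarchies and one unified hierarchy shows up directly in /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:path and the cgroup v2 entry is the one with ID 0 and an empty controller list. The following sketch parses that format. The sample strings are made-up illustrations of the two layouts, and the function names are hypothetical.

```python
# Illustrative samples only; real paths and IDs vary by host.
SAMPLE_V1 = "4:memory:/docker/abc\n3:cpu,cpuacct:/docker/abc\n"
SAMPLE_V2 = "0::/system.slice/docker-abc.scope\n"


def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of dicts."""
    entries = []
    for line in text.strip().splitlines():
        # Split on the first two colons only; the path may contain colons.
        hierarchy_id, controllers, path = line.split(":", 2)
        entries.append({
            "hierarchy": int(hierarchy_id),
            "controllers": controllers.split(",") if controllers else [],
            "path": path,
        })
    return entries


def cgroup_mode(entries):
    """Classify the layout as "v1", "v2", or "hybrid"."""
    has_v2 = any(e["hierarchy"] == 0 and not e["controllers"] for e in entries)
    has_v1 = any(e["controllers"] for e in entries)
    if has_v2 and has_v1:
        return "hybrid"
    return "v2" if has_v2 else "v1"


if __name__ == "__main__":
    print(cgroup_mode(parse_proc_cgroup(SAMPLE_V1)))
    print(cgroup_mode(parse_proc_cgroup(SAMPLE_V2)))
```

On a v1 host a process can sit at a different path in each hierarchy, which is exactly the source of the cross-controller complexity that the single v2 path removes.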
Key differences: v2 uses a single unified tree rather than multiple parallel trees; the cpu controller no longer has a separate rt runtime interface; memory.low and memory.min provide more intuitive memory protection than the v1 hierarchical limits; pids controller is always hierarchical in v2; and v2 provides better delegation to containers through directory ownership. Most modern container runtimes support cgroup v2, which became the default in systemd and newer kernels.
How do Kubernetes pods use namespaces and cgroups?
Each Kubernetes pod runs as one or more containers that share the pod's network, IPC, and UTS namespaces, while each container keeps its own mount namespace and, unless shareProcessNamespace is enabled, its own PID namespace, so processes inside a container cannot see host processes. By default a pod gets its own network namespace with its own IP address; it joins the host's network namespace only when hostNetwork: true is set.
Resource limits defined in pod specs (memory, cpu, hugepages) translate directly to cgroup settings on the node. Kubelet configures cgroups for each pod and container through the cgroupfs or systemd cgroup driver. Pod Security Admission (the replacement for the removed PodSecurityPolicy) and security contexts govern seccomp profiles, capabilities, SELinux labels, and whether containers run as privileged. The node's kernel enforces these limits regardless of what processes inside the container attempt.
What is nested virtualization, and when is it useful?
Nested virtualization runs a hypervisor inside a VM that is itself running on a hypervisor. For example, running VirtualBox inside a KVM VM, or running KVM inside an ESXi VM. This requires hardware support (AMD-V and Intel VT-x have flags that can be passed through) and is disabled by default on most hypervisors.
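On x86 Linux, the hardware support mentioned above appears as the vmx (Intel VT-x) or svm (AMD-V) CPU flag in /proc/cpuinfo, and seeing that flag from inside a guest is a quick sign that the outer hypervisor exposes nested virtualization. The sketch below parses cpuinfo-style text; the function name is illustrative, and other architectures report their features differently.

```python
def virtualization_extension(cpuinfo_text):
    """Return "Intel VT-x", "AMD-V", or None based on x86 CPU flags."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            if "vmx" in flags:
                return "Intel VT-x"
            if "svm" in flags:
                return "AMD-V"
    return None


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            print(virtualization_extension(handle.read()))
    except FileNotFoundError:
        print("no /proc/cpuinfo on this system")
```

A result of None inside a VM usually means the outer hypervisor has not been configured to pass the extension through, so an inner hypervisor would have to fall back to slow software emulation.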
Nested virtualization is useful for development and testing where you need to run VM environments but lack physical hardware access, for running older hypervisor software that requires bare-metal installation for licensing, for CI/CD pipelines that need to test hypervisor-specific behavior, and for certain security research scenarios. It adds performance overhead since there are multiple translation layers (VM exits nested inside VM exits), making it unsuitable for production workloads.
How do overlay filesystems work for containers?
Overlay filesystems (OverlayFS in the kernel; overlay2 is Docker's storage driver built on it) layer multiple directories into a single merged view. For containers, three kinds of directories are involved: the lower layers (image layers, read-only), the upper layer (container-specific changes, writable), and the merged view (what the container sees). When a file exists in both layers, the upper layer version shadows the lower.
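The lookup rule can be modeled with plain dictionaries. The sketch below uses Python's collections.ChainMap to represent a merged view in which reads search the upper layer first and writes always land in the upper layer. It is only an analogy: layer contents are made-up strings, the helper name is illustrative, and deletions (which the real filesystem handles with whiteout entries) are not modeled.

```python
from collections import ChainMap

# Each layer maps a path to file contents. Image layers are treated as read-only.
base_layer = {"/etc/os-release": "ubuntu", "/app/config.yml": "debug: false"}
app_layer = {"/app/app": "<binary v1>"}


def new_container_view(*image_layers):
    """Return (merged_view, upper_layer) for a fresh container."""
    upper = {}
    # ChainMap searches mappings left to right, so the upper layer shadows
    # the image layers, and later image layers shadow earlier ones.
    merged = ChainMap(upper, *reversed(image_layers))
    return merged, upper


if __name__ == "__main__":
    merged, upper = new_container_view(base_layer, app_layer)
    print(merged["/app/config.yml"])      # read falls through to the image layer
    merged["/app/config.yml"] = "debug: true"
    print(merged["/app/config.yml"])      # the container now sees its own copy
    print(base_layer["/app/config.yml"])  # the image layer is unchanged
    print(upper)                          # only the modified file lives here
```

Because every container gets its own empty upper dictionary, any number of containers can share base_layer and app_layer without seeing each other's changes.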
Copy-on-write behavior means when a container modifies an image file, the entire file is copied to the upper layer before modification, preserving the original image layer unchanged. This allows many containers to share the same image layers while having independent writable layers. Overlay2 uses inodes efficiently and handles many layers better than the older overlay driver, making it the default for most container runtimes on modern kernels.
How do container image vulnerability scanners work?
Container security scanners like Trivy, Grype, and Clair inspect container images for known vulnerabilities. They extract the image's software package manifest (apt, rpm, pip, npm packages, binaries) by parsing the image layers and filesystem contents, then compare against vulnerability databases (NVD, distros' security advisories, GitHub Security Advisories). They report CVEs matching the packages found, with severity ratings and fix versions.
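The core matching step (installed package versions compared against advisories that record a fixed version) can be sketched in a few lines. Everything in the example is a placeholder: the package names and advisory IDs are invented, versions are assumed to be simple dotted integers, and real scanners such as those named above use distribution-specific version comparison and far richer data.

```python
def parse_version(version):
    """Turn "1.2.10" into (1, 2, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def scan(installed, advisories):
    """Return findings for installed packages older than the fixed version."""
    findings = []
    for adv in advisories:
        have = installed.get(adv["package"])
        if have is None:
            continue  # package not present in the image
        if parse_version(have) < parse_version(adv["fixed_in"]):
            findings.append((adv["id"], adv["package"], have,
                             adv["fixed_in"], adv["severity"]))
    return findings


if __name__ == "__main__":
    # Placeholder data for illustration only.
    installed = {"libexample": "1.2.9", "othertool": "3.0.0"}
    advisories = [
        {"id": "EXAMPLE-0001", "package": "libexample",
         "fixed_in": "1.2.10", "severity": "HIGH"},
        {"id": "EXAMPLE-0002", "package": "othertool",
         "fixed_in": "2.5.0", "severity": "LOW"},
    ]
    for finding in scan(installed, advisories):
        print(finding)
```

Comparing versions as tuples of integers matters here: a plain string comparison would rank "1.2.9" above "1.2.10" and miss the finding.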
Scanners operate in different modes: static analysis of image contents without running containers, live scanning of running containers for runtime vulnerabilities, and admission control in Kubernetes rejecting deployments with critical vulnerabilities. Some scanners also check for secrets, misconfigurations (Docker CIS benchmarks), and supply chain risks like malicious base images. Scanning should happen both at build time (CI pipeline) and continuously for deployed images as new CVEs are published.
How do user namespaces differ from the default host UID/GID mapping?
User namespaces map UIDs inside the container to different UIDs on the host, providing true isolation: container root (UID 0 inside) can map to an unprivileged UID on the host (like 100000). This means a container escape does not automatically give root access to host resources because the container's root is not actually root on the host.
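The kernel describes this mapping in /proc/<pid>/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below applies such a map to translate a container UID into a host UID. The function names are illustrative, and the sample map mirrors the 100000 example above.

```python
def parse_uid_map(text):
    """Parse uid_map contents into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.strip().splitlines():
        inside, outside, length = (int(field) for field in line.split())
        ranges.append((inside, outside, length))
    return ranges


def to_host_uid(container_uid, ranges):
    """Translate a UID inside the namespace to the UID seen on the host."""
    for inside, outside, length in ranges:
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    return None  # unmapped: the ID has no counterpart on the host


if __name__ == "__main__":
    rootless = parse_uid_map("0 100000 65536\n")
    print(to_host_uid(0, rootless))     # 100000: container root is unprivileged
    print(to_host_uid(1000, rootless))  # 101000

    # Without a user namespace the map is the identity mapping.
    identity = parse_uid_map("0 0 4294967295\n")
    print(to_host_uid(0, identity))     # 0: container root is host root
```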
Host UID/GID mapping (the default without user namespaces) maps all container UIDs directly to the same host UIDs, so container UID 0 is host UID 0. This creates security risks if UID collisions occur or if container processes can escape their namespace. User namespaces are the foundation of rootless containers, though they require kernel 3.8+ and have some limitations with certain capabilities and device access.
How do VMs and containers compare for CPU-intensive workloads?
For CPU-intensive workloads, VMs typically incur 1-5% overhead from virtualization (hypervisor scheduling, emulated or paravirtualized devices), while containers introduce near-zero CPU overhead since they are just processes with namespace isolation. The difference is most noticeable in workloads with high rates of context switching, system calls, or I/O operations, where the VM's additional hypervisor layer adds latency.
However, the performance story changes when considering CPU limits. A cgroup-limited container is throttled at the kernel level, which can cause latency spikes when the container exhausts its CPU quota. A VM with dedicated CPUs has no such throttling but shares physical cores according to the hypervisor scheduler. For latency-sensitive real-time workloads, bare metal or VMs with CPU pinning often outperform containers due to more predictable scheduling.
How does QEMU work with and without KVM?
QEMU (Quick Emulator) operates in two modes. As a pure emulator, it translates guest instructions to host instructions dynamically using binary translation, emulating CPU, memory, and devices entirely in software. This allows running ARM binaries on x86 hosts, for example, but is slow due to software emulation.
When paired with KVM (Kernel-based Virtual Machine), QEMU becomes a Type 1 hypervisor. KVM runs the guest CPU directly on hardware (VMX/SVM virtualization extensions), treating most guest instructions as native execution. QEMU handles device emulation and I/O for the guest, creating a division of labor: KVM handles CPU/memory virtualization, QEMU handles everything else. This combination delivers near-native CPU performance while still supporting a wide variety of emulated and paravirtualized devices.
What is the device mapper, and how did Docker use it?
The device mapper is a kernel framework that underpins LVM, dm-crypt, and the older devicemapper storage driver for Docker. It creates virtual block devices by mapping requests through a series of targets (linear, snapshot, mirror, crypt, raid). Each target transforms I/O in different ways: a linear target simply maps a region to another device, a snapshot target tracks changes against an origin, and a crypt target encrypts/decrypts transparently.
Docker's devicemapper driver (now deprecated in favor of overlay2) used thin provisioning with snapshot targets: each container's writable layer was a snapshot of a thin pool, and images were backing snapshots. This allowed fast container creation but had issues with write amplification and garbage collection. Understanding device mapper is still relevant for understanding how LVM thin pools, dm-verity, and encrypted containers work at a low level.
How does the OOM killer behave in containerized environments?
The Linux kernel's OOM killer activates when the system exhausts allocatable memory and cannot reclaim or swap out enough to continue. It selects and kills a process based on an oom_score derived mainly from the process's memory footprint (resident memory, swap, and page tables) and adjusted by oom_score_adj. In containerized environments, the OOM killer operates at the cgroup level: if a container's memory limit (cgroup memory limit) is exceeded, the kernel kills processes within that cgroup, not necessarily the highest-memory process on the system.
This distinction matters because a container hitting its memory limit may kill the wrong process (a small helper process rather than the main workload) or multiple processes within the container. Kubernetes pod resource requests and limits control cgroup settings, but the OOM killer still selects within the cgroup based on its internal scoring. Proper tuning involves setting appropriate memory limits, understanding which process is likely to be killed, and using memory reservation (soft limits) to guide the kernel.
What is a veth pair, and how does container networking use it?
A vETH (virtual Ethernet) pair is a virtual network cable connecting two network namespaces. Data sent on one end appears on the other, like a physical ethernet cable plugging into two different switches. Containers use vETH pairs: one end is placed inside the container's network namespace, the other end remains in the host's root namespace, typically attached to a bridge (docker0, cni0).
When a container sends a packet, it goes through the container's vETH to the host bridge, which forwards based on MAC addresses or routing tables. For external traffic, NAT (Network Address Translation) translates the container's internal IP to the host's external IP. This model allows containers to have their own network stacks (separate IP addresses, routing tables, firewall rules) while sharing the host's physical network interfaces.
What are Kata Containers, and how do they differ from traditional containers?
Kata Containers is a container runtime that runs each container inside a lightweight VM, combining the speed and density of containers with the isolation of VMs. It uses hardware virtualization (like KVM) to create a VM that boots a minimal kernel and runs the container workload inside. This provides a strong security boundary because the container's kernel is isolated from the host kernel, preventing kernel exploits from escaping to the host.
Unlike traditional containers that share the host kernel, Kata containers each have their own kernel. This trades some performance (VM boot time, memory overhead per container) and density (typically 10-40% less dense than namespace containers) for dramatically improved isolation. It is particularly valuable for multi-tenant environments where container isolation is insufficient but full VMs are too heavy. Kata integrates with containerd and Kubernetes through the shim-v2 architecture.
How does Kubernetes enforce resource limits and quotas?
Kubernetes enforces resource limits through cgroups on each node. When a pod is scheduled, kubelet configures cgroup parameters for the pod's QoS class (Guaranteed, Burstable, or BestEffort) based on the requests and limits specified. Memory limits map to memory.limit_in_bytes (cgroup v1) or memory.max (cgroup v2); CPU limits map to cpu.cfs_quota_us and cpu.cfs_period_us (cgroup v1) or cpu.max (cgroup v2).
ResourceQuota objects enforce aggregate limits on total CPU requests, memory requests, storage, and object counts within a namespace. Kubernetes admission controllers reject pod deployments that would exceed these quotas. LimitRange objects set default values for containers that don't specify resource requirements. Together, cgroups enforce limits at runtime, while admission controllers enforce quotas at deployment time.
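The QoS class mentioned above follows from a pod's CPU and memory requests and limits. The sketch below encodes the documented rules in simplified form: it compares quantities as literal strings (so "500m" and "0.5" would be treated as different), and the function name and input shape are illustrative rather than any Kubernetes API.

```python
def qos_class(containers):
    """Classify a pod from its containers' resources.

    containers: list of dicts like {"requests": {...}, "limits": {...}}.
    """
    resources = ("cpu", "memory")
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # Kubernetes defaults a missing request to the corresponding limit.
        requests = {**limits, **container.get("requests", {})}
        for resource in resources:
            if resource in requests or resource in limits:
                any_set = True
            # Guaranteed needs a limit for every resource, equal to the request.
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    # The pod spec shown earlier: requests are lower than limits.
    pod = [{"limits": {"cpu": "500m", "memory": "512Mi"},
            "requests": {"cpu": "100m", "memory": "256Mi"}}]
    print(qos_class(pod))
```

Under node memory pressure, BestEffort pods are the first candidates for eviction and Guaranteed pods the last, which is one reason to set requests equal to limits for critical workloads.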
Conclusion
Virtualization is the foundation of modern cloud computing, enabling the elastic, multi-tenant infrastructure that powers everything from startup MVPs to enterprise-scale Kubernetes clusters. Understanding hypervisors, containers, namespaces, and cgroups gives you the mental model needed to debug container issues, optimize resource utilization, and design secure multi-tenant systems.
The distinction between VMs (strong isolation, higher overhead) and containers (lightweight, shared kernel) informs architectural decisions about security boundaries and performance requirements. As you continue learning, explore Kubernetes internals, container networking models, and container security scanning to build comprehensive expertise in cloud-native infrastructure.