Namespaces & Cgroups
Linux container primitives: PID, network, mount, user namespaces for isolation
Containers dominate modern infrastructure. Whether you are deploying microservices, running serverless functions, or packaging applications for consistent development environments, containers provide the isolation and resource controls that make these patterns practical. But what actually makes a container work under the hood? Two Linux kernel features: namespaces and cgroups.
Namespaces partition kernel resources so that processes in one namespace cannot see or modify resources in other namespaces. Each namespace has its own view of global system resources. A process might believe it is PID 1 inside its namespace while being PID 1000 on the host. The network namespace gives each container its own IP address and routing table.
Cgroups (control groups) impose limits and accounting on resource usage. They ensure that one container cannot consume all available CPU, memory, or I/O bandwidth. Without cgroups, containers would compete destructively and the system would become unstable under load.
Together, these primitives provide the building blocks for container runtimes like Docker and container orchestration systems like Kubernetes. Understanding them helps you debug container issues and design more resilient deployments.
Overview
Linux namespaces and cgroups are kernel features that enable operating system-level virtualization. They allow multiple processes to share the same kernel while operating in isolated environments. Unlike virtual machines which emulate hardware and run complete operating systems, containers share the host kernel but maintain isolation through these mechanisms.
Namespaces wrap global system resources in an abstraction layer, making it appear to processes inside the namespace that they have their own isolated instance of that resource. The kernel has implemented various namespace types since the first one (mount namespaces) was introduced in kernel 2.4.19. Each namespace type isolates a specific category of resources.
Cgroups organize processes into hierarchical groups and apply resource limits to those groups. The hierarchy is tree-structured, with child groups inheriting constraints from parent groups. This allows fine-grained resource allocation where parent cgroups can reserve guarantees for child groups or impose absolute limits.
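To make the inheritance rule concrete, here is a minimal sketch in Python. It assumes a toy model in which each cgroup path maps to an optional memory limit in bytes; the `effective_limit` helper and the example paths are hypothetical illustrations, not a kernel API.

```python
from typing import Dict, Optional


def effective_limit(limits: Dict[str, Optional[int]], path: str) -> Optional[int]:
    """Return the tightest limit set on `path` or any of its ancestors.

    `limits` maps cgroup paths such as "/tenants/a" to a limit in bytes,
    or None when that level sets no limit.
    """
    tightest: Optional[int] = limits.get("/")
    current = ""
    # Walk from the root down to the target, keeping the smallest limit seen.
    for part in (p for p in path.split("/") if p):
        current += "/" + part
        value = limits.get(current)
        if value is not None and (tightest is None or value < tightest):
            tightest = value
    return tightest


limits = {
    "/": None,                   # root: unlimited
    "/tenants": 8 * 1024**3,     # 8 GiB shared by all tenants
    "/tenants/a": 4 * 1024**3,   # child restricts itself further
    "/tenants/b": 16 * 1024**3,  # child asks for more than the parent allows
}
print(effective_limit(limits, "/tenants/a"))
print(effective_limit(limits, "/tenants/b"))
```

In this model `/tenants/b` ends up bounded by its parent's 8 GiB even though its own setting is larger, which is the behavior the hierarchy enforces.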
When to Use / When Not to Use
Namespace and cgroup knowledge becomes essential when debugging container issues, designing multi-tenant systems, or building container runtimes. Understanding these primitives helps you diagnose why a container cannot see a host process, why it lacks network connectivity, or why it is being throttled.
These mechanisms are ideal for multi-tenant deployments where workload isolation is critical. Cloud providers use cgroups and namespaces to isolate tenant workloads on shared infrastructure. Development teams use containers to ensure consistent environments from development through production.
However, namespace isolation is not as strong as virtual machine isolation. Containers share the host kernel, so kernel exploits can potentially break out of container isolation. For truly untrusted workloads that require strong isolation, virtual machines may be more appropriate despite their higher overhead.
Architecture or Flow Diagram
graph TD
    subgraph Host["Host System"]
        H[Host Processes] --> HCG[Host Cgroup]
        HCG --> HCGOV[CPU, Memory, IO Controllers]
    end
    subgraph ContainerA["Container A"]
        CA[Container A Processes] --> ACG[Container A Cgroup]
        ACG --> ACGOV[CPU: 2 cores<br/>Memory: 4GB limit]
        ANP[Network NS] --> AIF[Isolated Net Interface]
        APID[PID NS] --> AP1[PID 1 inside container]
    end
    subgraph ContainerB["Container B"]
        CB[Container B Processes] --> BCG[Container B Cgroup]
        BCG --> BCGOV[CPU: 1 core<br/>Memory: 2GB limit]
        BNP[Network NS] --> BIF[Isolated Net Interface]
    end
    HCGOV --> ACGOV
    HCGOV --> BCGOV
    style ContainerA stroke:#00fff9
    style ContainerB stroke:#ff00ff
The diagram shows how cgroups impose resource limits independently per container. Each container belongs to its own cgroup subtree with specific CPU, memory, and I/O constraints. Namespaces provide isolation for network, PID, mount, user, and other resources. The host cgroup sits at the top of the hierarchy, with container cgroups as children that inherit but can restrict parent limits.
Core Concepts
Namespace Types
Linux implements several namespace types, each isolating different kernel resources.
PID Namespace: Isolates process ID numbers. Processes in different PID namespaces can have the same PID. The first process in a PID namespace becomes the “init” process for that namespace, responsible for reaping orphaned processes. PID namespaces can be nested, with processes in inner namespaces being visible in outer namespaces with different PID numbers.
Network Namespace: Provides isolated network stack including interfaces, routing tables, firewall rules, and port numbers. A container with its own network namespace can have its own IP address, routing rules, and even run services on the same port as services on the host or other containers.
Mount Namespace: Isolates the set of filesystem mounts visible to processes. Processes in different mount namespaces can see different directory trees. This is how containers get their own root filesystem (runtimes typically switch the root with pivot_root inside the new mount namespace; chroot is the older, weaker alternative) and why mounting inside a container does not affect the host.
User Namespace: Maps UIDs and GIDs between the namespace and the host. A process can be root (UID 0) inside a user namespace while being an unprivileged user on the host. User namespaces provide a way to grant elevated privileges without giving actual root access on the host.
UTS Namespace: Isolates hostname and domain name. Containers can have their own hostname, useful for applications that display or react to hostname.
IPC Namespace: Isolates System V IPC objects and POSIX message queues. Processes in different IPC namespaces cannot directly communicate via shared memory or message queues.
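The UID mapping described for user namespaces is exposed by the kernel in `/proc/PID/uid_map`, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below translates an in-namespace UID to its outside UID. The `map_to_host_uid` helper and the sample mapping are hypothetical; a real mapping depends on how the runtime is configured.

```python
from typing import Optional


def map_to_host_uid(uid_map: str, inside_uid: int) -> Optional[int]:
    """Translate a UID inside a user namespace to the UID outside it.

    `uid_map` is the text of /proc/PID/uid_map: one range per line as
    "<inside_start> <outside_start> <length>". Returns None if unmapped.
    """
    for line in uid_map.strip().splitlines():
        inside_start, outside_start, length = (int(x) for x in line.split())
        if inside_start <= inside_uid < inside_start + length:
            return outside_start + (inside_uid - inside_start)
    return None


# Assumed example mapping: container UIDs 0..65535 -> host UIDs 100000..165535
example_map = "0 100000 65536"
print(map_to_host_uid(example_map, 0))      # container root
print(map_to_host_uid(example_map, 1000))   # an ordinary container user
print(map_to_host_uid(example_map, 70000))  # outside the mapped range
```

With this mapping, root inside the namespace is an unprivileged UID on the host, which is what limits the impact of a breakout.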
Cgroup Controllers
Cgroups organize processes hierarchically and apply resource controls through controllers.
CPU Controller: Limits CPU usage using Completely Fair Scheduler (CFS) bandwidth control. In cgroup v1 you set cpu.cfs_period_us and cpu.cfs_quota_us to define the period and the per-period quota; cgroup v2 combines both in cpu.max. Pinning to specific cores is handled by the separate cpuset controller through cpuset.cpus.
Memory Controller: Limits memory usage and tracks consumption. memory.limit_in_bytes (v1) or memory.max (v2) imposes a hard limit. In v2, memory.high sets a softer throttling threshold and memory.events counts limit hits, which lets you react to pressure before an OOM kill.
I/O Controller: Controls disk I/O using relative weights or absolute bandwidth caps. In v2, io.weight sets relative weights and io.max sets absolute limits; the v1 equivalents are blkio.weight and blkio.throttle.read_bps_device.
PIDs Controller: Limits the number of processes in a cgroup, preventing fork bombs from consuming all available PIDs.
# View cgroup hierarchy (the paths below assume cgroup v1 with Docker's cgroupfs driver)
ls /sys/fs/cgroup/
# Check CPU limit for a container
cat /sys/fs/cgroup/cpu/docker/CONTAINER_ID/cpu.cfs_quota_us
# See memory limit
cat /sys/fs/cgroup/memory/docker/CONTAINER_ID/memory.limit_in_bytes
# View process in cgroup
cat /proc/PID/cgroup
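The relationship between quota, period, and usable cores from the CPU controller description can be sketched in Python. The helper names are hypothetical; the file formats are those of cgroup v1 (`cpu.cfs_quota_us`, `cpu.cfs_period_us`) and cgroup v2 (`cpu.max`).

```python
from typing import Optional


def cores_from_v1(quota_us: int, period_us: int) -> Optional[float]:
    """cgroup v1: a quota of -1 in cpu.cfs_quota_us means no limit."""
    if quota_us <= 0 or period_us <= 0:
        return None
    return quota_us / period_us


def cores_from_v2(cpu_max: str) -> Optional[float]:
    """cgroup v2: cpu.max holds "<quota> <period>" or "max <period>"."""
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


print(cores_from_v1(200000, 100000))    # 200 ms of CPU per 100 ms period
print(cores_from_v2("150000 100000"))   # 150 ms of CPU per 100 ms period
print(cores_from_v2("max 100000"))      # no limit set
```

A result of `None` means no quota is configured, so the group is bounded only by its ancestors and the number of CPUs on the host.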
Production Failure Scenarios + Mitigations
Scenario: Container Out of Memory Kill
Problem: Container exceeds memory limit and is killed by the OOM killer.
Symptoms: Container restarts unexpectedly, dmesg shows OOM kills, application logs show sudden disconnection without graceful shutdown.
Mitigation: Set appropriate memory limits that account for peak usage plus headroom. Monitor memory usage trends and alert before limits are reached. Consider increasing limits if legitimate growth occurs. Use memory requests and limits in Kubernetes to ensure guaranteed memory allocation.
# Check if container was OOM killed
docker inspect CONTAINER_ID | grep -i oom
# View OOM events in journal
journalctl -k | grep -iE "out of memory|oom"
# Set memory limit with buffer
docker run -m 512m --memory-reservation=384m myapp
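On hosts using cgroup v2, a container's `memory.events` file counts how often the group hit its limits, including OOM kills. The sketch below parses that file's "key value" format. The helper names and the sample text are hypothetical, and the path to the file depends on the runtime and cgroup driver, so it is left to the caller.

```python
from typing import Dict


def parse_flat_keyed(text: str) -> Dict[str, int]:
    """Parse cgroup files made of "key value" lines into a dict."""
    result: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            result[parts[0]] = int(parts[1])
    return result


def was_oom_killed(memory_events: str) -> bool:
    """True if the oom_kill counter in memory.events is non-zero."""
    return parse_flat_keyed(memory_events).get("oom_kill", 0) > 0


# Assumed sample content of a memory.events file
sample = "low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n"
print(parse_flat_keyed(sample))
print(was_oom_killed(sample))
```

Alerting when the `max` counter starts to climb gives a warning before `oom_kill` does, since `max` increments each time usage reaches the limit.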
Scenario: CPU Throttling Causes Latency Spikes
Problem: Container CPU usage exceeds quota, causing CFS throttling that introduces latency.
Symptoms: Periodic latency spikes correlate with batch processing, CPU usage appears below limit but throttling occurs, cpu.stat shows high throttling counts.
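Mitigation: Ensure CPU limits are not set too conservatively. Consider using CPU requests without limits for latency-sensitive workloads. If you keep a limit, raise cpu.cfs_quota_us. Shortening cpu.cfs_period_us while keeping the same quota-to-period ratio shortens each stall without changing the overall allowance; raising the period alone reduces the allowance.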
# Check CPU throttling stats
cat /sys/fs/cgroup/cpu/docker/CONTAINER_ID/cpu.stat
# Look for high throttling numbers (v1 fields: nr_periods, nr_throttled, throttled_time)
grep -E "nr_throttled|throttled_time" /sys/fs/cgroup/cpu/*/cpu.stat
# Avoid CPU limits for critical workloads
# Use CPU requests and Guaranteed QoS instead
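Raw throttling counters are easier to judge as a ratio of throttled periods to elapsed periods. The sketch below computes that ratio from the text of a `cpu.stat` file, using the `nr_periods` and `nr_throttled` fields present in both cgroup v1 and v2. The function names and sample values are hypothetical.

```python
from typing import Dict


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse the "key value" lines of a cpu.stat file."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(text: str) -> float:
    """Fraction of enforcement periods in which the group was throttled."""
    stats = parse_cpu_stat(text)
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Assumed sample: throttled in 250 of 1000 periods
sample = "nr_periods 1000\nnr_throttled 250\nthrottled_time 5000000000\n"
print(throttle_ratio(sample))  # 0.25
```

Because the counters are cumulative, a monitoring job would normally compare two readings and compute the ratio over the difference.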
Scenario: Namespace Isolation Break
Problem: Misconfigured namespace allows container to access host resources.
Symptoms: Container can see host processes, host filesystem accessible from container, network isolation not working.
Mitigation: Regularly audit container configurations. Use security tools like Docker Bench for Security to identify misconfigurations. Avoid running containers with --privileged. Use user namespaces to reduce the impact of potential privilege escalation.
# Check if container has excessive capabilities
docker run --rm -it myimage capsh --print
# Audit container security settings
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
aquasec/docker-bench:security
# Verify namespace isolation: without --pid=host, only the container's own processes should appear
docker run --rm -it myimage ps aux
# With --pid=host the container deliberately joins the host PID namespace and sees host processes
Trade-off Table
| Namespace Type | Isolation Provided | Performance Impact | Configuration Complexity |
|---|---|---|---|
| PID | Process ID numbers | Negligible | Low |
| Network | Interfaces, routing, ports | Low to medium | Medium |
| Mount | Filesystem mounts | Negligible | Medium |
| User | UID/GID mapping | Negligible | Medium |
| UTS | Hostname/domain | None | Low |
| IPC | Message queues, shared memory | Negligible | Low |
| Resource Control | Mechanism | Suitable For | Limitations |
|---|---|---|---|
| CPU limit (CFS) | Time quota per period | Batch workloads | Can cause throttling |
| CPU limit (cpuset) | CPU affinity | Latency critical | Reduces flexibility |
| Memory limit | Hard cap | Multi-tenant | Can trigger OOM |
| Memory swap | Swap usage | Memory pressure | Depends on host swap |
| I/O weight | Relative priority | Competing workloads | Shared disk required |
| I/O throttle | Absolute bandwidth | Guaranteed QoS | Complex tuning |
| Container Runtime | Namespace Support | Cgroup Support | Learning Curve |
|---|---|---|---|
| Docker | All types | v1 and v2 | Moderate |
| containerd | All types | v1 and v2 | Moderate |
| cri-o | All types | v1 and v2 | Moderate |
| podman | All types | v1 and v2 | Low (Docker-compatible) |
| runc | All types | v1 and v2 | Low |
Implementation Snippets
Creating Namespaces Manually with unshare
#!/bin/bash
# Create isolated PID and network namespaces (requires root)
# Start a shell in a new PID namespace; the shell becomes PID 1 there
unshare --pid --fork --mount-proc /bin/bash
# Run the next commands inside the new shell to verify isolation
echo "PID in namespace: $$"
echo "Processes visible from namespace:"
ps aux | head -5
# Back on the host: create a network namespace and move one end of a veth pair into it
ip netns add container_net
ip link add veth0 type veth peer name veth1
ip link set veth1 netns container_net
Inspecting Namespace IDs
#!/bin/bash
# View namespace IDs for a process
PID=$$
echo "Process $PID namespace IDs:"
# List namespace symlinks for process
ls -la /proc/$PID/ns/
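# View specific namespace identifiers (the entries are symlinks, so use readlink, not cat)
readlink /proc/$PID/ns/net
readlink /proc/$PID/ns/pid
# Compare with another process (reading PID 1's links usually requires root)
readlink /proc/1/ns/net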
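The same comparison can be done programmatically. Each entry under `/proc/PID/ns/` is a symlink whose target looks like `net:[4026531992]`; two processes share a namespace when the targets match. The sketch below parses that format and compares two processes. The function names are hypothetical, and `same_namespace` needs permission to read both processes' links.

```python
import os
import re
from typing import Optional, Tuple

# Symlink targets have the form "<type>:[<inode>]"
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> Optional[Tuple[str, int]]:
    """Split a namespace link target into (type, inode), or None if malformed."""
    match = _NS_LINK.match(target)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def same_namespace(pid_a: int, pid_b: int, ns_type: str, proc_root: str = "/proc") -> bool:
    """True if both processes are in the same namespace of the given type."""
    link_a = os.readlink(os.path.join(proc_root, str(pid_a), "ns", ns_type))
    link_b = os.readlink(os.path.join(proc_root, str(pid_b), "ns", ns_type))
    return link_a == link_b


print(parse_ns_link("net:[4026531992]"))
```

For example, `same_namespace(os.getpid(), 1, "net")` run as root on a host reports whether the current process shares the network namespace of PID 1.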
Setting Cgroup Limits with systemd
# /etc/systemd/system/myapp.slice
[Unit]
Description=My Application Slice
[Slice]
# 2 CPU cores and 4GB memory
CPUQuota=200%
MemoryMax=4G
# I/O throttling (100 MB/s read, 50 MB/s write)
IOReadBandwidthMax=/dev/sda 100M
IOWriteBandwidthMax=/dev/sda 50M
# CPU weight (higher = more priority)
CPUWeight=100
[Install]
WantedBy=multi-user.target
Python: Checking Container Resource Limits
#!/usr/bin/env python3
"""Check container resource limits from inside the container (cgroup v1 layout)."""
import os


def get_cgroup_path(controller='cpu'):
    """Find this process's cgroup path for a v1 controller."""
    with open('/proc/self/cgroup', 'r') as f:
        for line in f:
            # v1 lines look like "4:cpu,cpuacct:/docker/<id>"
            _, controllers, path = line.strip().split(':', 2)
            if controller in controllers.split(','):
                return path
    return None


def read_cgroup_file(controller, filename):
    """Read a cgroup v1 file, returning None if it cannot be read."""
    path = os.path.join('/sys/fs/cgroup', controller, filename)
    try:
        with open(path, 'r') as f:
            return f.read().strip()
    except OSError:
        return None


def main():
    print("Container Resource Limits")
    print("=" * 40)
    print(f"Cgroup path: {get_cgroup_path()}")

    # Memory limit (a very large value means no limit is set)
    mem_limit = read_cgroup_file('memory', 'memory.limit_in_bytes')
    if mem_limit is not None:
        print(f"Memory limit: {int(mem_limit) / (1024**3):.1f} GB")
    else:
        print("Memory limit: not available")

    # CPU quota (-1 means no quota is set)
    cpu_quota = read_cgroup_file('cpu', 'cpu.cfs_quota_us')
    cpu_period = read_cgroup_file('cpu', 'cpu.cfs_period_us')
    if cpu_quota is not None and cpu_period is not None:
        quota = int(cpu_quota)
        period = int(cpu_period)
        if quota > 0:
            print(f"CPU quota: {quota}/{period} = {quota/period:.2f} cores")
        else:
            print("CPU quota: unlimited")

    # PID limit ("max" means unlimited)
    pids_max = read_cgroup_file('pids', 'pids.max')
    print(f"Max processes: {pids_max if pids_max is not None else 'not available'}")


if __name__ == '__main__':
    main()
Observability Checklist
Cgroup Resource Monitoring:
- Monitor CPU throttling metrics in cpu.stat
- Track memory usage vs limits
- Watch for approaching PID limits
- Measure I/O bandwidth against throttling limits
Namespace State Verification:
- Verify namespace isolation is active
- Check for namespace escapes
- Audit capability sets
Container Metrics:
- docker stats provides real-time CPU, memory, network, and disk I/O
- docker inspect reveals container configuration
- Kubernetes metrics server provides cluster-wide visibility
Key Metrics to Track:
# CPU throttling
cat /sys/fs/cgroup/cpu/docker/*/cpu.stat | grep throttled
# Memory usage
cat /sys/fs/cgroup/memory/docker/*/memory.usage_in_bytes
# OOM events
grep -i oom /var/log/syslog 2>/dev/null | tail -20
Security/Compliance Notes
Capability Restrictions: Run containers with minimal capabilities. Docker drops most capabilities by default and adds only those requested. Review required capabilities for your workload and explicitly drop all others.
No New Privileges: Pass --security-opt=no-new-privileges:true to docker run to prevent processes from gaining additional privileges through setuid binaries or file capabilities.
User Namespace Remapping: Consider using user namespaces to map container root to unprivileged host users. This reduces the impact of container escape vulnerabilities.
Seccomp Profiles: Use seccomp profiles to block system calls that are not needed. Docker provides a default seccomp profile that blocks approximately 44 system calls known to be unnecessary for most containers.
# Kubernetes security context
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE
Common Pitfalls / Anti-patterns
Running Containers as Root: A process running as UID 0 inside a container is also UID 0 on the host unless user namespaces remap it, so a breakout carries significant privileges. Run as a non-root user where possible, use read-only root filesystems, and drop capabilities to limit damage from compromise.
Ignoring PID Limits: The PIDs cgroup limits the number of processes and threads that can run in a container. Setting this too low causes fork and thread-creation failures, typically reported as “Resource temporarily unavailable” (EAGAIN). Set it high enough for the expected process count plus headroom.
Host Network Mode: Using --network=host removes network namespace isolation entirely. Containers can now see and modify host network configuration. Avoid this mode in production.
Sharing Namespaces Between Containers: Sharing PID or network namespaces between containers creates tight coupling. If one container becomes compromised, it can affect the other. Use container composition through networks rather than namespace sharing.
Not Monitoring Cgroup Limits: Containers hitting resource limits behave badly (throttling, OOM kills). Monitor resource usage against limits and alert before limits are reached.
Quick Recap Checklist
- Namespaces isolate global kernel resources per process group
- Cgroups impose resource limits and accounting on process groups
- PID namespaces provide isolated process ID spaces
- Network namespaces give containers independent network stacks
- Mount namespaces isolate filesystem views
- User namespaces map UIDs/GIDs between container and host
- CPU cgroups use CFS quota or cpuset for limits
- Memory cgroups impose hard limits that trigger OOM
- Container runtimes create and manage namespaces automatically
- Security requires capability dropping, seccomp profiles, and resource limits
- Production monitoring must track cgroup metrics and namespace state
Interview Questions
Namespaces partition kernel resources so processes in one namespace cannot see or modify resources in other namespaces. A process might look like PID 1 inside its namespace while being PID 1000 on the host. Cgroups impose resource limits and accounting. Namespaces isolate, cgroups limit. Containers use both.
PID namespaces isolate process ID numbers so processes in different namespaces can have the same PID. The first process in a PID namespace becomes the "init" process for that namespace, handling orphaned process reaping. This isolation means containers cannot see or signal host processes, and a container restart does not affect PIDs outside it.
When a container exceeds its cgroup memory limit, the kernel triggers an OOM kill. The memory controller fires events before limits are hit, so you can react proactively. If the container does not handle the pressure, the OOM killer terminates the biggest memory consumer inside the container. The result is sudden restarts without graceful shutdown.
User namespaces map UIDs and GIDs between the container and host. A process can be root (UID 0) inside a user namespace while being unprivileged on the host. This limits the damage from container escapes because even if an attacker breaks out, they do not automatically get host root. User namespaces are one of the most important security enhancements for container isolation.
CPU throttling happens when a container exceeds its CPU quota set by the cgroup CFS bandwidth controller. The CFS halts the container's processes until the next period, which introduces latency spikes. This hits latency-sensitive workloads hard. The fix is setting CPU limits high enough for burst capacity, or using CPU requests instead of limits for critical services.
Cgroups organize processes in a tree-structured hierarchy where a child is always bounded by its ancestors' limits and can only restrict them further. The root cgroup (`/sys/fs/cgroup`) sits at the top and has no limits. A child cgroup can be configured with any value, but its effective limit is the tightest one along its path to the root. A container's cgroup under `/sys/fs/cgroup/cpu/docker/CONTAINER_ID` is subject to any limit set on its parents and can set its own more restrictive quota. This means children cannot exceed parent limits: if the parent allows 4 cores and has two child cgroups without quotas of their own, either child can use up to 4 cores while the other is idle, but together they never exceed 4. To cap each child at 2 cores you must set a quota on each child. The hierarchy also means that when a parent cgroup is throttled, every process in its subtree is throttled with it.
Cgroups v1 (legacy) mounted each controller (cpu, memory, blkio, pids) as a separate hierarchy, meaning a process could belong to different cgroups for different controllers. This complexity made resource management difficult. Cgroups v2 unifies all controllers into a single hierarchy, so every process belongs to exactly one cgroup for all controllers. V2 also provides better delegation (a parent can give a child control of its subtree), improved container integration, and reduced kernel complexity. The v2 interface was declared stable in kernel 4.5; Docker has supported it since version 20.10, and Kubernetes support became generally available in 1.25. You can check which version your system uses by looking at `/sys/fs/cgroup/`: if it contains a `cgroup.controllers` file rather than separate `cpu/` and `memory/` directories, you're on v2.
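The filesystem check for the cgroup version can be sketched as a small Python function. The `cgroup_version` name and its return values are hypothetical. The `root` parameter exists so the logic can be exercised against a test directory, and a hybrid setup (v1 controllers with a separate v2 mount) is reported as v1 by this simple check.

```python
import os


def cgroup_version(root: str = "/sys/fs/cgroup") -> str:
    """Guess the cgroup version from the layout of the cgroup mount point."""
    # v2 exposes a cgroup.controllers file at the root of the unified hierarchy.
    if os.path.isfile(os.path.join(root, "cgroup.controllers")):
        return "v2"
    # v1 mounts each controller as its own directory.
    if any(os.path.isdir(os.path.join(root, name)) for name in ("cpu", "memory")):
        return "v1"
    return "unknown"


print(cgroup_version())
```

Scripts that read limits can branch on this result to pick between the v1 file names (such as `memory.limit_in_bytes`) and the v2 ones (such as `memory.max`).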
Rootless mode allows non-root users to run containers without any root privileges on the host. It works by combining user namespaces (which map container root UID 0 to an unprivileged host UID) with additional restrictions. When a rootless container starts, the process inside thinks it is UID 0 (root) inside the container but is actually running as a normal user on the host. The kernel enforces this mapping so even if the container process tries to escalate privileges, it only gains the mapped user's privileges. Rootless Podman and Docker rootless mode use this. Limitation: some kernel features require real root privileges and cannot work in rootless mode—network spanning, certain storage drivers, and some device accesses.
Monitoring namespace isolation involves several checks. First, verify PID namespace isolation by running `ps aux` inside a container: host processes should not be visible, and the container's entrypoint should appear as PID 1. Second, check network namespace isolation by comparing the target of `/proc/self/ns/net` inside and outside the container; different inode values mean different namespaces. Third, inspect the container configuration for shared namespaces: `docker inspect --format '{{.HostConfig.PidMode}}' CONTAINER_ID` returns `host` when the container was started with `--pid=host`, in which case host processes are visible by design. Fourth, audit capabilities with `capsh --print` inside the container and compare against host capabilities. Fifth, check seccomp profiles and ensure the container is not running in privileged mode. Regular security benchmarks like the CIS Docker Benchmark provide systematic checklists for this.
Containers share the host kernel, unlike VMs, which run their own kernel on virtualized hardware. When a container escapes (breaks out of namespace isolation), the attacker lands directly in the host kernel context, the same kernel that controls everything on the host. A kernel exploit or container escape can give host root access. VMs are generally more isolated because compromising the guest kernel still leaves the attacker inside the VM; reaching the host requires a separate hypervisor escape. Such escapes do exist, since hypervisor bugs have historically let guests reach the host, but the attack surface is much smaller than the full system call interface. The practical danger of container escapes is highest in multi-tenant environments, where untrusted tenants share one kernel.
The PIDs cgroup controller limits the number of processes that can exist in a cgroup. When a container's process count hits the limit (`pids.max`), any further `fork()` or `clone()` calls fail with EAGAIN, preventing the container from creating unlimited processes. The default PIDs limit varies by runtime and configuration, and some runtimes apply no limit unless you set one. A fork bomb inside a limited container exhausts that container's PID limit but does not affect the host or other containers, because each container has its own cgroup. You can tune this with `docker run --pids-limit=512`, or in Kubernetes through the kubelet's `podPidsLimit` setting (the `--pod-max-pids` flag), which applies per pod. Without a limit, a container fork bomb could consume all available PIDs on the host, affecting every other process.
Swap limits in cgroups control how much swap space a container can use: `memory.memsw.limit_in_bytes` in v1 (memory plus swap combined) and `memory.swap.max` in v2 (swap only). With Docker, if you set `--memory` without `--memory-swap` and the host has swap enabled, the container can use as much swap as its memory limit. Swap can delay an OOM kill when a container hits its memory limit, because pages are swapped out instead. The danger is that swapping to disk is extremely slow compared to memory access, so a container that starts swapping will become extremely slow and may appear hung. For latency-sensitive workloads, setting `memory.swap.max` to 0 (or `--memory-swap` equal to `--memory` in Docker) prevents swapping, so the container is OOM killed promptly rather than slowly degrading, which makes the failure mode more predictable. In v1 you can also set `memory.swappiness` to control how aggressively the container's pages are swapped.
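To watch for a container approaching its PID limit, you can compare `pids.current` with `pids.max`. A minimal sketch follows, assuming the caller has already read both files; the `pids_utilization` helper is hypothetical. Note that `pids.max` contains the literal string `max` when no limit is set.

```python
from typing import Optional


def pids_utilization(current: str, maximum: str) -> Optional[float]:
    """Return pids.current / pids.max, or None when no limit is set."""
    if maximum.strip() == "max":
        return None
    limit = int(maximum)
    if limit <= 0:
        return None
    return int(current) / limit


print(pids_utilization("128\n", "512\n"))  # 0.25
print(pids_utilization("128\n", "max\n"))  # None
```

An alert threshold somewhere below 1.0 gives time to react before `fork()` starts failing with EAGAIN.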
`io.weight` uses a relative, weight-based approach: a proportional I/O scheduler such as BFQ (CFQ on older kernels) uses the weight to decide how much disk time a cgroup gets relative to other cgroups. Higher weight means proportionally more disk time under contention. If two containers have weights of 100 and 200 and both are saturating the disk, the second gets twice the disk time of the first. Absolute limits are set with `io.max` in v2 (per-device `rbps` and `wbps` keys) or with `blkio.throttle.read_bps_device` and `blkio.throttle.write_bps_device` in v1; the container cannot exceed the specified bytes per second regardless of what other cgroups are doing. Weight-based scheduling is better when you want proportional fairness; throttling is better when you need a hard cap. You can combine both: weights for proportional sharing beneath an absolute ceiling.
When a container exits, its namespaces are destroyed once the last process in them is gone and nothing else holds a reference: the network namespace is deleted, the PID namespace is torn down (all remaining processes are killed when its init exits), and the mount namespace is discarded. The container's filesystem layers persist in Docker storage but the runtime namespace state is gone. However, Docker's `--net=host` option reuses the host's network namespace instead of creating a new one, so the container's network configuration affects the host directly. In Kubernetes, containers in the same pod share the network and IPC namespaces by default, so they can reach each other over localhost; the PID namespace is shared only when `shareProcessNamespace: true` is set, and each container keeps its own mount namespace, sharing files through volumes. Shared namespaces simplify inter-container communication but increase coupling.
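The proportional arithmetic behind I/O weights can be written out directly. This is a simplified model that assumes every listed cgroup is competing for the same disk at the same time; the `disk_time_shares` helper and the container names are hypothetical.

```python
from typing import Dict


def disk_time_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Convert relative I/O weights into each group's fraction of disk time."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: weight / total for name, weight in weights.items()}


shares = disk_time_shares({"container_a": 100, "container_b": 200})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With weights of 100 and 200 the split is one third to two thirds, so the second container gets twice the disk time of the first. When one container is idle, the other can use the whole disk, which is the key difference from an absolute cap.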
UTS namespaces isolate hostname and domain name shown by the `hostname` and `domainname` commands. The `sethostname()` and `setdomainname()` system calls modify only the calling process's UTS namespace, not the host. Beyond containers, UTS namespaces are used in HPC environments where each job node needs its own hostname to prevent conflicts in cluster software. Linux's `unshare --uts` creates a new UTS namespace. In multi-tenant servers, different tenants can have their own hostname without affecting others. The practical application in containerization is that each container can have a meaningful hostname like `web-01` instead of inheriting the host's generic name.
The `cpu.rt_runtime_us` setting specifies the time a cgroup's RT processes can run within each CPU period for real-time scheduler tasks. Unlike regular CFS bandwidth control, this applies to tasks with real-time scheduling policies (SCHED_FIFO, SCHED_RR). If a cgroup has `cpu.rt_runtime_us=200000` (200ms) and `cpu.rt_period_us=1000000` (1s), RT tasks in that cgroup can run for 200ms every second. This prevents a misbehaving RT task from monopolizing a CPU. On hosts using cgroup v1 with real-time group scheduling enabled, Docker exposes this through `--cpu-rt-runtime` and `--cpu-rt-period`. Kubernetes has no pod-level field for it; the CPU Manager's `static` policy pins containers to dedicated cores but does not set an RT budget. Without such a budget, a container with real-time tasks could starve other tasks on the CPUs it runs on.
The freezer cgroup can suspend (freeze) or resume all processes in a cgroup. When a cgroup is frozen, every process in it is stopped at a safe checkpoint: no further CPU scheduling, no I/O initiated. This is useful for batch job management because you can queue jobs in a frozen state and unfreeze them when resources are available. In Docker, `docker pause` and `docker unpause` use the freezer. Freezing is cleaner than sending SIGSTOP: although SIGSTOP itself cannot be caught, the stop is visible to the parent process and any process with permission can undo it with SIGCONT, whereas frozen tasks stay frozen until the cgroup is thawed. It also makes checkpoint/restart easier since all processes are simultaneously stopped.
A privileged container has all Linux capabilities enabled, meaning the processes inside can do almost anything the host root can do: mount filesystems, modify iptables rules, access raw network sockets. An unprivileged container (the default in Docker) keeps only a small default set of capabilities such as `CAP_CHOWN` and `CAP_NET_BIND_SERVICE`. Container root is still host UID 0 unless user namespace remapping or rootless mode is enabled. You add capabilities with `--cap-add` and remove them with `--cap-drop`, for example `--cap-drop ALL` followed by the few you need. Running `--privileged` gives the container all capabilities and access to host devices, and it disables the default seccomp and AppArmor restrictions, essentially giving the container host root access. Production containers should never run privileged.
Serverless platforms like Knative use namespaces and cgroups to provide per-invocation isolation with minimal overhead. Each function invocation might get its own PID namespace, network namespace (or share a minimal namespace), and a dedicated cgroup with guaranteed CPU and memory allocation. The key difference from traditional containers is that serverless invocations are much shorter-lived and much more numerous—potentially thousands per second. This requires fast namespace creation (using `unshare` with pre-prepared namespaces) and aggressive cgroup cleanup to prevent resource exhaustion. Some platforms reuse cold-start environments rather than creating fresh namespaces for every invocation, trading isolation for performance.
The `devices` cgroup controller enforces an allowlist of device files a container can access. By default, Docker blocks all device access except a small allowlist: `/dev/null`, `/dev/zero`, `/dev/urandom`, etc. Container processes cannot read or write `/dev/sda` (the host disk) unless it is explicitly allowed with `--device` flags, even if they manage to create a device node for it. In cgroup v1, each rule consists of a type (`a` for all devices, `c` for character, `b` for block), the device major:minor numbers, and the permitted access (`r` read, `w` write, `m` mknod); cgroup v2 implements the same checks with an eBPF program attached to the cgroup. This prevents a compromised container from accessing the host's hardware directly: without access to `/dev/sda`, an attacker cannot read or write the host filesystem through the block device even with root inside the container. The devices cgroup is a critical layer in container security alongside capabilities and seccomp.
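The cgroup v1 device rule format (as written to `devices.allow` or `devices.deny`) is simple enough to parse by hand. The sketch below splits a rule into its parts and validates them; the `DeviceRule` type and `parse_device_rule` function are hypothetical names for illustration.

```python
from typing import NamedTuple


class DeviceRule(NamedTuple):
    dev_type: str  # "a" = all devices, "c" = character, "b" = block
    major: str     # device major number, or "*" for any
    minor: str     # device minor number, or "*" for any
    access: str    # any combination of r (read), w (write), m (mknod)


def parse_device_rule(rule: str) -> DeviceRule:
    """Parse a rule such as "c 1:3 rwm" into its components."""
    dev_type, numbers, access = rule.split()
    if dev_type not in ("a", "b", "c"):
        raise ValueError(f"unknown device type: {dev_type!r}")
    major, minor = numbers.split(":")
    if not access or not set(access) <= set("rwm"):
        raise ValueError(f"invalid access string: {access!r}")
    return DeviceRule(dev_type, major, minor, access)


# Character device 1:3 is /dev/null on Linux
print(parse_device_rule("c 1:3 rwm"))
print(parse_device_rule("b 8:* r"))
```

The second rule reads as "block devices with major number 8, any minor, read-only", which is how a rule can cover a whole class of devices.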
Conclusion
Namespaces and cgroups are the Linux kernel primitives that make containers possible. Namespaces partition kernel resources for isolation while cgroups impose resource limits and accounting. Together they enable multi-tenant deployments where workloads share infrastructure safely. Understanding these mechanisms helps you debug container issues, design more resilient architectures, and implement proper resource controls. For further study, explore the runc runtime source code, cgroup v2 hierarchy differences, and container security mechanisms like seccomp and capabilities that complement namespace isolation.