Namespaces & Cgroups

Linux container primitives: PID, network, mount, user namespaces for isolation

published: May 19, 2026 reading time: 25 min read author: GeekWorkBench

Containers dominate modern infrastructure. Whether you are deploying microservices, running serverless functions, or packaging applications for consistent development environments, containers provide the isolation and resource controls that make these patterns practical. But what actually makes a container work under the hood? Two Linux kernel features: namespaces and cgroups.

Namespaces partition kernel resources so that processes in one namespace cannot see or modify resources in other namespaces. Each namespace has its own view of global system resources. A process might believe it is PID 1 inside its namespace while being PID 1000 on the host. The network namespace gives each container its own IP address and routing table.

Cgroups (control groups) impose limits and accounting on resource usage. They ensure that one container cannot consume all available CPU, memory, or I/O bandwidth. Without cgroups, containers would compete destructively and the system would become unstable under load.

Together, these primitives provide the building blocks for container runtimes like Docker and container orchestration systems like Kubernetes. Understanding them helps you debug container issues and design more resilient deployments.

Introduction

Linux namespaces and cgroups are kernel features that enable operating system-level virtualization. They allow multiple processes to share the same kernel while operating in isolated environments. Unlike virtual machines, which emulate hardware and run complete operating systems, containers share the host kernel and rely on these mechanisms for isolation.

Namespaces wrap global system resources in an abstraction layer, making it appear to processes inside the namespace that they have their own isolated instance of that resource. The kernel has implemented various namespace types since the first one (mount namespaces) was introduced in kernel 2.4.19. Each namespace type isolates a specific category of resources.

Cgroups organize processes into hierarchical groups and apply resource limits to those groups. The hierarchy is tree-structured, with child groups inheriting constraints from parent groups. This allows fine-grained resource allocation where parent cgroups can reserve guarantees for child groups or impose absolute limits.
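The inheritance rule can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `effective_limit` and the three-level path are hypothetical, and each level's limit is given as a plain integer (or `None` when that level sets no limit).

```python
from typing import Optional, Sequence


def effective_limit(limits_root_to_leaf: Sequence[Optional[int]]) -> Optional[int]:
    """Return the tightest limit on the path from the root cgroup to a leaf.

    Each entry is that level's own limit, or None when the level sets none.
    A child can never use more than any of its ancestors allows.
    """
    configured = [limit for limit in limits_root_to_leaf if limit is not None]
    return min(configured) if configured else None


GIB = 1024 ** 3
# Hypothetical path: root (no limit) -> tenant group (8 GiB) -> container (16 GiB)
print(effective_limit([None, 8 * GIB, 16 * GIB]) // GIB)  # 8: the parent's limit wins
```

The container asks for 16 GiB, but the 8 GiB ceiling on its parent is what actually applies.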

When to Use / When Not to Use

Namespace and cgroup knowledge becomes essential when debugging container issues, designing multi-tenant systems, or building container runtimes. Understanding these primitives helps you diagnose why a container cannot see a host process, why it lacks network connectivity, or why it is being throttled.

These mechanisms are ideal for multi-tenant deployments where workload isolation is critical. Cloud providers use cgroups and namespaces to isolate tenant workloads on shared infrastructure. Development teams use containers to ensure consistent environments from development through production.

However, namespace isolation is not as strong as virtual machine isolation. Containers share the host kernel, so kernel exploits can potentially break out of container isolation. For truly untrusted workloads that require strong isolation, virtual machines may be more appropriate despite their higher overhead.

Architecture or Flow Diagram

graph TD
    subgraph HostSystem["Host System"]
        H[Host Processes] --> HCG[Host Cgroup]
        HCG --> HCGOV[CPU, Memory, IO Controllers]
    end

    subgraph ContainerA["Container A"]
        CA[Container A Processes] --> ACG[Container A Cgroup]
        ACG --> ACGOV[CPU: 2 cores<br/>Memory: 4GB limit]
        ANP[Network NS] --> AIF[Isolated Net Interface]
        APID[PID NS] --> AP1[PID 1 inside container]
    end

    subgraph ContainerB["Container B"]
        CB[Container B Processes] --> BCG[Container B Cgroup]
        BCG --> BCGOV[CPU: 1 core<br/>Memory: 2GB limit]
        BNP[Network NS] --> BIF[Isolated Net Interface]
    end

    HCGOV --> ACGOV
    HCGOV --> BCGOV

    style ContainerA stroke:#00fff9
    style ContainerB stroke:#ff00ff

The diagram shows how cgroups impose resource limits independently per container. Each container belongs to its own cgroup subtree with specific CPU, memory, and I/O constraints. Namespaces provide isolation for network, PID, mount, user, and other resources. The host cgroup sits at the top of the hierarchy; container cgroups are its children, bounded by the parent's limits and able only to tighten them.

Core Concepts

Namespace Types

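Linux implements several namespace types, each isolating different kernel resources. The six below are the ones container runtimes rely on most; the kernel also provides cgroup and time namespaces.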

PID Namespace: Isolates process ID numbers. Processes in different PID namespaces can have the same PID. The first process in a PID namespace becomes the “init” process for that namespace, responsible for reaping orphaned processes. PID namespaces can be nested, with processes in inner namespaces being visible in outer namespaces with different PID numbers.

Network Namespace: Provides isolated network stack including interfaces, routing tables, firewall rules, and port numbers. A container with its own network namespace can have its own IP address, routing rules, and even run services on the same port as services on the host or other containers.

Mount Namespace: Isolates the set of filesystem mounts visible to processes. Processes in a mount namespace see different directory trees. This is how containers get their own root filesystem (runtimes typically call pivot_root inside the new mount namespace rather than relying on chroot alone) and why mounting inside a container does not affect the host.

User Namespace: Maps UIDs and GIDs between the namespace and the host. A process can be root (UID 0) inside a user namespace while being an unprivileged user on the host. User namespaces provide a way to grant elevated privileges without giving actual root access on the host.

UTS Namespace: Isolates hostname and domain name. Containers can have their own hostname, useful for applications that display or react to hostname.

IPC Namespace: Isolates System V IPC objects and POSIX message queues. Processes in different IPC namespaces cannot directly communicate via shared memory or message queues.
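The UID mapping described under User Namespace is visible in `/proc/<pid>/uid_map`, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below parses that format and translates an ID; the helper names and the sample mapping (container root mapped to host UID 100000) are illustrative assumptions, not values from any particular runtime.

```python
from typing import List, Optional, Tuple

IdMap = List[Tuple[int, int, int]]  # (first ID inside, first ID outside, range length)


def parse_id_map(text: str) -> IdMap:
    """Parse the contents of a uid_map or gid_map file."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            inside, outside, length = (int(field) for field in fields)
            entries.append((inside, outside, length))
    return entries


def to_host_id(entries: IdMap, inside_id: int) -> Optional[int]:
    """Translate an ID seen inside the namespace to the ID on the host."""
    for inside, outside, length in entries:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None  # the ID is not mapped in this namespace


# Hypothetical mapping: 65536 IDs starting at host UID 100000
sample = "         0     100000      65536\n"
mapping = parse_id_map(sample)
print(to_host_id(mapping, 0))     # 100000: root in the container, unprivileged on the host
print(to_host_id(mapping, 1000))  # 101000
```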

Cgroup Controllers

Cgroups organize processes hierarchically and apply resource controls through controllers.

CPU Controller: Limits CPU usage using the Completely Fair Scheduler (CFS) bandwidth control. In cgroups v1, cpu.cfs_period_us and cpu.cfs_quota_us set the period length and the CPU time allowed per period; cgroups v2 combines both in cpu.max. Pinning to specific CPUs is handled by the separate cpuset controller through cpuset.cpus.

Memory Controller: Limits memory usage and tracks consumption. The hard limit is memory.limit_in_bytes in cgroups v1 and memory.max in v2. The controller also reports events as limits are approached or hit (for example memory.high and memory.events in v2), enabling proactive handling before OOM conditions.

I/O Controller: Controls disk I/O using relative weights or absolute bandwidth limits. In cgroups v2, io.weight sets relative weights while io.max imposes absolute limits; the v1 equivalents are blkio.weight and blkio.throttle.read_bps_device.

PIDs Controller: Limits the number of processes in a cgroup, preventing fork bombs from consuming all available PIDs.

# View cgroup hierarchy
ls /sys/fs/cgroup/

# The paths below assume cgroup v1 with Docker's cgroupfs driver;
# on cgroup v2 the equivalent files are cpu.max and memory.max.

# Check CPU limit for a container
cat /sys/fs/cgroup/cpu/docker/CONTAINER_ID/cpu.cfs_quota_us

# See memory limit
cat /sys/fs/cgroup/memory/docker/CONTAINER_ID/memory.limit_in_bytes

# View process in cgroup
cat /proc/PID/cgroup
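The raw quota values are easier to reason about as a number of cores. A minimal sketch of the conversion, assuming the v1 convention that a quota of -1 means unlimited and the v2 convention that `cpu.max` holds `<quota> <period>` with `max` meaning unlimited; the function names are illustrative.

```python
from typing import Optional


def cores_from_v1(quota_us: int, period_us: int) -> Optional[float]:
    """cgroup v1: cpu.cfs_quota_us / cpu.cfs_period_us; a negative quota means no limit."""
    if quota_us < 0:
        return None
    return quota_us / period_us


def cores_from_v2(cpu_max: str) -> Optional[float]:
    """cgroup v2: cpu.max holds '<quota> <period>'; quota 'max' means no limit."""
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


print(cores_from_v1(200000, 100000))  # 2.0
print(cores_from_v2("50000 100000"))  # 0.5
print(cores_from_v2("max 100000"))    # None (unlimited)
```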

Production Failure Scenarios + Mitigations

Scenario: Container Out of Memory Kill

Problem: Container exceeds memory limit and is killed by the OOM killer.

Symptoms: Container restarts unexpectedly, dmesg shows OOM kills, application logs show sudden disconnection without graceful shutdown.

Mitigation: Set memory limits that account for peak usage plus headroom. Monitor memory usage trends and alert before limits are reached. Raise limits when legitimate growth occurs. In Kubernetes, set memory requests equal to limits when the workload needs its memory reserved at scheduling time.

# Check if container was OOM killed
docker inspect --format '{{.State.OOMKilled}}' CONTAINER_ID

# View OOM events in the kernel log
journalctl -k | grep -i -E "out of memory|oom"

# Set memory limit with buffer
docker run -m 512m --memory-reservation=384m myapp
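The same check can be scripted. `docker inspect` prints a JSON array with one object per container, and the `State.OOMKilled` field records whether the kernel killed the container for exceeding its memory limit. The sketch below parses that output; the sample JSON is a heavily trimmed, hypothetical example rather than real output.

```python
import json


def was_oom_killed(inspect_output: str) -> bool:
    """Return True if the first container in `docker inspect` output was OOM killed."""
    containers = json.loads(inspect_output)
    state = containers[0].get("State", {})
    return bool(state.get("OOMKilled", False))


# Trimmed, hypothetical inspect output (exit code 137 = 128 + SIGKILL)
sample = '[{"Id": "abc123", "State": {"Status": "exited", "ExitCode": 137, "OOMKilled": true}}]'
print(was_oom_killed(sample))  # True
```

In practice you would feed it the output of `docker inspect CONTAINER_ID`, for example via `subprocess.run(..., capture_output=True)`.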

Scenario: CPU Throttling Causes Latency Spikes

Problem: Container CPU usage exceeds quota, causing CFS throttling that introduces latency.

Symptoms: Periodic latency spikes correlate with batch processing, CPU usage appears below limit but throttling occurs, cpu.stat shows high throttling counts.

Mitigation: Ensure CPU limits are not set too conservatively. Consider using CPU requests instead of limits for latency-sensitive workloads. If you keep a limit, raise cpu.cfs_quota_us; lengthening cpu.cfs_period_us without raising the quota lowers the CPU share, because the share is quota divided by period.

# Check CPU throttling stats
cat /sys/fs/cgroup/cpu/docker/CONTAINER_ID/cpu.stat

# Look for high throttling numbers
# (cgroup v1 fields; cgroup v2 reports throttled_usec instead of throttled_time)
grep -E "nr_throttled|throttled_time" /sys/fs/cgroup/cpu/*/cpu.stat

# Avoid CPU limits for critical workloads
# Use CPU requests and Guaranteed QoS instead
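A useful single number is the fraction of scheduling periods in which the cgroup was throttled. The sketch below computes it from the text of `cpu.stat`, relying only on the `nr_periods` and `nr_throttled` fields; the function names and the sample figures are illustrative.

```python
from typing import Dict


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats: Dict[str, int]) -> float:
    """Fraction of enforcement periods in which the cgroup hit its quota."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Hypothetical cpu.stat contents
sample = "nr_periods 1200\nnr_throttled 300\nthrottled_time 4500000000\n"
print(throttled_ratio(parse_cpu_stat(sample)))  # 0.25
```

A ratio that stays well above zero while average CPU usage looks low is the signature of bursty work hitting the quota inside individual periods.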

Scenario: Namespace Isolation Break

Problem: Misconfigured namespace allows container to access host resources.

Symptoms: Container can see host processes, host filesystem accessible from container, network isolation not working.

Mitigation: Regularly audit container configurations. Use security tools like docker bench security to identify misconfigurations. Avoid running containers with --privileged. Use user namespaces to reduce impact of potential privilege escalation.

# Check if container has excessive capabilities
docker run --rm -it myimage capsh --print

# Audit container security settings
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
    aquasec/docker-bench:security

# Verify namespace isolation: without --pid=host, only the container's own processes should appear
docker run --rm -it myimage ps aux
# With --pid=host the container joins the host PID namespace by design, so host processes are visible

Trade-off Table

Namespace Type	Isolation Provided	Performance Impact	Configuration Complexity
PID	Process ID numbers	Negligible	Low
Network	Interfaces, routing, ports	Low to medium	Medium
Mount	Filesystem mounts	Negligible	Medium
User	UID/GID mapping	Negligible	Medium
UTS	Hostname/domain	None	Low
IPC	Message queues, shared memory	Negligible	Low

Resource Control	Mechanism	Suitable For	Limitations
CPU limit (CFS)	Time quota per period	Batch workloads	Can cause throttling
CPU limit (cpuset)	CPU affinity	Latency critical	Reduces flexibility
Memory limit	Hard cap	Multi-tenant	Can trigger OOM
Memory swap	Swap usage	Memory pressure	Depends on host swap
I/O weight	Relative priority	Competing workloads	Shared disk required
I/O throttle	Absolute bandwidth	Guaranteed QoS	Complex tuning

Container Runtime	Namespace Support	Cgroup Support	Learning Curve
Docker	All types	v1 and v2	Moderate
containerd	All types	v1 and v2	Moderate
cri-o	All types	v1 and v2	Moderate
podman	All types	v1 and v2	Low (Docker-compatible)
runc	All types	v1 and v2	Low

Implementation Snippets

Creating Namespaces Manually with unshare

#!/bin/bash
# Create isolated PID and network namespaces (run as root)

# Create a new PID namespace; the command after unshare runs as PID 1 inside it.
# --mount-proc remounts /proc so ps only sees processes in the new namespace.
unshare --pid --fork --mount-proc bash -c '
    echo "PID in namespace: $$"
    echo "Processes visible from namespace:"
    ps aux | head -5
'

# Create network namespace and move one end of a veth pair into it
ip netns add container_net
ip link add veth0 type veth peer name veth1
ip link set veth1 netns container_net

Inspecting Namespace IDs

#!/bin/bash
# View namespace IDs for a process

PID=$$
echo "Process $PID namespace IDs:"

# List namespace symlinks for process
ls -la /proc/$PID/ns/

# View specific namespace inodes (the entries are symlinks, so use readlink, not cat)
readlink /proc/$PID/ns/net
readlink /proc/$PID/ns/pid

# Compare with another process (reading PID 1 requires root)
readlink /proc/1/ns/net
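The same comparison can be done programmatically. Each entry under `/proc/<pid>/ns/` is a symlink whose target looks like `net:[4026531840]`; two processes share a namespace exactly when the inode numbers match. The following is a rough sketch with illustrative helper names and made-up inode values, and `read_namespaces` returns an empty mapping where `/proc` is unavailable or unreadable.

```python
import os
import re
from typing import Dict, Optional, Set, Tuple

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> Optional[Tuple[str, int]]:
    """Split a link target such as 'net:[4026531840]' into (type, inode)."""
    match = NS_LINK.match(target)
    return (match.group(1), int(match.group(2))) if match else None


def read_namespaces(pid: str = "self") -> Dict[str, int]:
    """Read namespace inodes for a process; {} if /proc is missing or unreadable."""
    result = {}
    ns_dir = f"/proc/{pid}/ns"
    try:
        for name in os.listdir(ns_dir):
            parsed = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
            if parsed:
                result[name] = parsed[1]
    except OSError:
        pass
    return result


def shared_namespaces(a: Dict[str, int], b: Dict[str, int]) -> Set[str]:
    """Names of the namespaces that both processes are members of."""
    return {name for name, inode in a.items() if b.get(name) == inode}


# Made-up inode values: the container shares only the network namespace with the host
host = {"pid": 4026531836, "net": 4026531840, "mnt": 4026531841}
container = {"pid": 4026532201, "net": 4026531840, "mnt": 4026532199}
print(sorted(shared_namespaces(host, container)))  # ['net']
```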

Setting Cgroup Limits with systemd

# /etc/systemd/system/myapp.slice
[Unit]
Description=My Application Slice

[Slice]
# 2 CPU cores and 4GB memory
CPUQuota=200%
# MemoryMax= is the cgroup v2 setting; the older MemoryLimit= is deprecated
MemoryMax=4G

# I/O throttling (100 MB/s read, 50 MB/s write)
IOReadBandwidthMax=/dev/sda 100M
IOWriteBandwidthMax=/dev/sda 50M

# CPU weight (higher = more priority)
CPUWeight=100

[Install]
WantedBy=multi-user.target

Python: Checking Container Resource Limits

#!/usr/bin/env python3
"""Check container resource limits from inside the container."""

import os

CGROUP_ROOT = '/sys/fs/cgroup'


def read_cgroup_file(*candidates):
    """Return the contents of the first readable candidate, else None.

    Each candidate is a path relative to /sys/fs/cgroup. Passing a v1 path
    followed by a v2 path lets one call cover both layouts.
    """
    for relative in candidates:
        try:
            with open(os.path.join(CGROUP_ROOT, relative), 'r') as f:
                return f.read().strip()
        except OSError:
            continue
    return None


def main():
    print("Container Resource Limits")
    print("=" * 40)

    # Memory limit: memory.limit_in_bytes (v1) or memory.max (v2)
    mem_limit = read_cgroup_file('memory/memory.limit_in_bytes', 'memory.max')
    if mem_limit is None:
        print("Memory limit: not available")
    elif mem_limit == 'max':
        print("Memory limit: unlimited")
    else:
        print(f"Memory limit: {int(mem_limit) / 1024**3:.1f} GB")

    # CPU quota: cpu.max (v2) or cpu.cfs_quota_us + cpu.cfs_period_us (v1)
    cpu_max = read_cgroup_file('cpu.max')
    if cpu_max is not None:
        quota, period = cpu_max.split()
    else:
        quota = read_cgroup_file('cpu/cpu.cfs_quota_us')
        period = read_cgroup_file('cpu/cpu.cfs_period_us')
    if quota is None or period is None:
        print("CPU quota: not available")
    elif quota in ('max', '-1'):
        print("CPU quota: unlimited")
    else:
        print(f"CPU quota: {quota}/{period} = {int(quota) / int(period):.2f} cores")

    # PID limit: pids.max exists in both versions
    pids_max = read_cgroup_file('pids/pids.max', 'pids.max')
    print(f"Max processes: {pids_max or 'not available'}")


if __name__ == '__main__':
    main()

Observability Checklist

Cgroup Resource Monitoring:

Monitor CPU throttling metrics in cpu.stat
Track memory usage vs limits
Watch for approaching PID limits
Measure I/O bandwidth against throttling limits

Namespace State Verification:

Verify namespace isolation is active
Check for namespace escapes
Audit capability sets

Container Metrics:

docker stats provides real-time CPU, memory, network, and disk I/O
docker inspect reveals container configuration
Kubernetes metrics server provides cluster-wide visibility

Key Metrics to Track:

# CPU throttling
cat /sys/fs/cgroup/cpu/docker/*/cpu.stat | grep throttled

# Memory usage
cat /sys/fs/cgroup/memory/docker/*/memory.usage_in_bytes

# OOM events
grep -i -E "out of memory|oom" /var/log/syslog 2>/dev/null | tail -20
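Turning these raw values into an alert condition is mostly a matter of dividing usage by the limit and handling the unlimited case. A minimal sketch, assuming cgroup v2 style values where an unlimited limit reads as `max` (as in `memory.max` or `pids.max`); the 90% threshold and the function names are arbitrary choices for illustration.

```python
from typing import Optional


def utilization(current: str, limit: str) -> Optional[float]:
    """Return current/limit as a fraction, or None when the limit is 'max' (unlimited)."""
    if limit.strip() == "max":
        return None
    return int(current) / int(limit)


def needs_alert(current: str, limit: str, threshold: float = 0.9) -> bool:
    """True when usage has reached the given fraction of a finite limit."""
    ratio = utilization(current, limit)
    return ratio is not None and ratio >= threshold


# Values as they would be read from e.g. pids.current / pids.max
print(needs_alert("950", "1000"))  # True
print(needs_alert("120", "max"))   # False
```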

Security/Compliance Notes

Capability Restrictions: Run containers with minimal capabilities. Docker drops most capabilities by default and adds only those requested. Review required capabilities for your workload and explicitly drop all others.

No New Privileges: Set --security-opt=no-new-privileges:true to prevent processes from gaining additional privileges through setuid binaries or file capabilities.

User Namespace Remapping: Consider using user namespaces to map container root to unprivileged host users. This reduces the impact of container escape vulnerabilities.

Seccomp Profiles: Use seccomp profiles to block system calls that are not needed. Docker provides a default seccomp profile that blocks approximately 44 system calls known to be unnecessary for most containers.

# Kubernetes security context
securityContext:
  allowPrivilegeEscalation: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE
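To audit which capabilities a process actually holds, you can decode the hexadecimal `CapEff` mask from `/proc/<pid>/status`, where each set bit corresponds to one capability number. The sketch below is illustrative: the name table is deliberately partial (unknown bits are reported as `cap_<n>`), and the sample mask is constructed by hand to have only bit 10 set.

```python
from typing import List

# Partial table of capability numbers from linux/capability.h
CAP_NAMES = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    21: "CAP_SYS_ADMIN",
    27: "CAP_MKNOD",
}


def decode_capabilities(hex_mask: str) -> List[str]:
    """Decode a CapEff-style hex bitmask into capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if mask & (1 << bit):
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names


# Hand-built mask with only bit 10 set
print(decode_capabilities("0000000000000400"))  # ['CAP_NET_BIND_SERVICE']
```

A mask that decodes to include `CAP_SYS_ADMIN` is a strong hint that the container is running with far more privilege than it needs.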

Common Pitfalls / Anti-patterns

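Running Containers as Root: Without user namespace remapping, UID 0 inside a container is UID 0 on the host, restrained only by capabilities, seccomp, and namespaces. Run as a non-root user where possible, and use read-only root filesystems and dropped capabilities to limit damage from compromise.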

Ignoring PID Limits: The PIDs cgroup limits the number of processes and threads that can run in a container. Setting this too low causes fork and thread-creation failures, typically reported as “Resource temporarily unavailable” (EAGAIN). Set it high enough for the expected process and thread count plus headroom.

Host Network Mode: Using --network=host removes network namespace isolation entirely. Containers can now see and modify host network configuration. Avoid this mode in production.

Sharing Namespaces Between Containers: Sharing PID or network namespaces between containers creates tight coupling. If one container becomes compromised, it can affect the other. Use container composition through networks rather than namespace sharing.

Not Monitoring Cgroup Limits: Containers hitting resource limits behave badly (throttling, OOM kills). Monitor resource usage against limits and alert before limits are reached.

Quick Recap Checklist

Namespaces isolate global kernel resources per process group
Cgroups impose resource limits and accounting on process groups
PID namespaces provide isolated process ID spaces
Network namespaces give containers independent network stacks
Mount namespaces isolate filesystem views
User namespaces map UIDs/GIDs between container and host
CPU cgroups use CFS quota or cpuset for limits
Memory cgroups impose hard limits that trigger OOM
Container runtimes create and manage namespaces automatically
Security requires capability dropping, seccomp profiles, and resource limits
Production monitoring must track cgroup metrics and namespace state

Interview Questions

1. What is the difference between a namespace and a cgroup?

Namespaces partition kernel resources so processes in one namespace cannot see or modify resources in other namespaces. A process might look like PID 1 inside its namespace while being PID 1000 on the host. Cgroups impose resource limits and accounting. Namespaces isolate, cgroups limit. Containers use both.

2. How do PID namespaces work and why do they matter for containers?

PID namespaces isolate process ID numbers so processes in different namespaces can have the same PID. The first process in a PID namespace becomes the "init" process for that namespace, handling orphaned process reaping. This isolation means containers cannot see or signal host processes, and a container restart does not affect PIDs outside it.

3. What happens when a container exceeds its memory limit?

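When a container reaches its cgroup memory limit, the kernel first tries to reclaim memory from the cgroup (page cache, then swap if available). If reclaim cannot free enough, the OOM killer runs within the cgroup and kills a process, usually the one using the most memory. The memory controller exposes events and notifications (for example `memory.events` on cgroups v2) that let you react before the hard limit is hit. From the application's point of view, an OOM kill is a SIGKILL: the process gets no chance to shut down gracefully, so the result is sudden restarts.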

4. How do user namespaces improve container security?

User namespaces map UIDs and GIDs between the container and host. A process can be root (UID 0) inside a user namespace while being unprivileged on the host. This limits the damage from container escapes because even if an attacker breaks out, they do not automatically get host root. User namespaces are one of the most important security enhancements for container isolation.

5. What is CPU throttling and how does it affect containerized applications?

CPU throttling happens when a container exceeds its CPU quota set by the cgroup CFS bandwidth controller. The CFS halts the container's processes until the next period, which introduces latency spikes. This hits latency-sensitive workloads hard. The fix is setting CPU limits high enough for burst capacity, or using CPU requests instead of limits for critical services.

6. How does the cgroup hierarchy work and why does a child cgroup inherit limits from its parent?

Cgroups organize processes in a tree where every cgroup is bounded by the limits of its ancestors. The root cgroup (`/sys/fs/cgroup`) sits at the top and has no limits. A child can set its own, tighter limit, but a looser setting has no effect because the parent's limit still applies. For example, under cgroups v1 a container's cgroup at `/sys/fs/cgroup/cpu/docker/CONTAINER_ID` is constrained by any quota on the `docker` cgroup above it and can add a more restrictive quota of its own. If a parent allows 4 cores and has two children with no quotas of their own, either child can use up to 4 cores, but the two together cannot exceed 4; to split the budget evenly you set a 2-core quota on each child. When a parent cgroup is throttled, every cgroup beneath it is throttled with it.

7. What is the difference between cgroups v1 and cgroups v2, and why did Linux change?

Cgroups v1 (legacy) gives each controller (cpu, memory, blkio, pids) its own hierarchy, so a process can belong to different cgroups for different controllers, which makes consistent resource management difficult. Cgroups v2 unifies all controllers into a single hierarchy: every process belongs to exactly one cgroup for all controllers. V2 also provides safer delegation (a parent can hand control of a subtree to a less privileged manager) and a more consistent interface across controllers. The v2 interface was declared stable in kernel 4.5; Docker added support in 20.10 and Kubernetes made it generally available in 1.25. To check which version your system uses, look at `/sys/fs/cgroup/`: if it contains a `cgroup.controllers` file rather than separate `cpu/` and `memory/` directories, you are on v2.
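That check is easy to automate. The sketch below looks for the `cgroup.controllers` file at the cgroup root and falls back to looking for per-controller directories; the function name is illustrative, and a hybrid setup (v1 directories plus a separate unified mount) would be reported as v1 by this simple heuristic. The demonstration uses a temporary directory as a stand-in root so it runs anywhere.

```python
import os
import tempfile


def cgroup_version(root: str = "/sys/fs/cgroup") -> str:
    """Return 'v2', 'v1', or 'unknown' based on the layout under root."""
    if os.path.isfile(os.path.join(root, "cgroup.controllers")):
        return "v2"
    if any(os.path.isdir(os.path.join(root, name)) for name in ("cpu", "memory")):
        return "v1"
    return "unknown"


# Demonstration against a fake root directory rather than the real /sys/fs/cgroup
with tempfile.TemporaryDirectory() as fake_root:
    open(os.path.join(fake_root, "cgroup.controllers"), "w").close()
    print(cgroup_version(fake_root))  # v2
```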

8. What is the container rootless mode and how does it use user namespaces?

Rootless mode allows non-root users to run containers without any root privileges on the host. It works by combining user namespaces (which map container root UID 0 to an unprivileged host UID) with additional restrictions. When a rootless container starts, the process inside sees itself as UID 0 (root) but is actually running as a normal user on the host. The kernel enforces this mapping, so even if the container process tries to escalate privileges, it only gains the mapped user's privileges. Rootless Podman and Docker rootless mode use this. Limitation: some features still need real root privileges or extra setup in rootless mode, such as binding privileged ports, certain storage drivers, and some device access.

9. How do you monitor namespace isolation to ensure containers cannot escape their namespace boundaries?

Monitoring namespace isolation involves several checks. First, verify PID namespace isolation by running `ps aux` inside a container started without `--pid=host`: host processes should not be visible, and the container's entrypoint should appear as PID 1. Second, check network namespace isolation by comparing `readlink /proc/self/ns/net` inside and outside the container; the inode values should differ. Third, confirm that no container was started with host namespace flags such as `--pid=host` or `--network=host`, since those share the host namespace by design. Fourth, audit capabilities with `capsh --print` inside the container and compare against the host. Fifth, check seccomp profiles and ensure the container is not running in privileged mode. Security benchmarks such as the CIS Docker Benchmark provide systematic checklists for these checks.

10. What is the relationship between containers and the host kernel? Why can container escapes be more dangerous than VM escapes?

Containers share the host kernel, unlike VMs, which run their own kernel on virtualized hardware. A kernel exploit triggered from inside a container attacks the same kernel that controls everything on the host, so a successful escape can yield host root access. In a VM, a kernel exploit compromises only the guest kernel; reaching the host additionally requires a hypervisor vulnerability. Such VM escapes have happened, but the hypervisor generally exposes a smaller attack surface than the full Linux system call interface. The practical risk of container escapes is therefore highest in multi-tenant environments, where untrusted workloads share one kernel.

11. How does the PIDs cgroup controller prevent fork bomb attacks on a container?

The PIDs cgroup controller limits the number of tasks (processes and threads) that can exist in a cgroup. When the count reaches `pids.max`, further `fork()` or `clone()` calls fail with EAGAIN, so the container cannot create processes without bound. The default limit depends on the runtime and its configuration, and some setups apply no limit unless you configure one. A fork bomb inside a limited container exhausts only that container's budget and does not affect the host or other containers, because each container has its own pids cgroup. You can tune this with `docker run --pids-limit=512`; in Kubernetes the limit is set per node through the kubelet (`--pod-max-pids`, or `podPidsLimit` in the kubelet configuration), not in the pod spec. Without a limit, a fork bomb in one container could consume all available PIDs on the host, affecting every other process.

12. What are the cgroup swap settings and when is swap useful versus dangerous?

Swap use is controlled by `memory.memsw.limit_in_bytes` in cgroups v1 (a combined memory-plus-swap limit) and by `memory.swap.max` in cgroups v2 (a swap-only limit). With Docker, if you set `--memory` but not `--memory-swap`, the container may use as much swap as its memory limit, provided the host has swap configured. Swap lets a container that reaches its memory limit page out instead of being OOM killed immediately. The danger is that disk is far slower than RAM, so a container that starts swapping heavily can become very slow and may appear hung. For latency-sensitive workloads, disabling swap (`memory.swap.max=0` on v2, or `--memory-swap` equal to `--memory` in Docker) forces a prompt OOM kill rather than slow swapping, making the failure mode more predictable. On cgroups v1 you can also set `memory.swappiness` to control how aggressively the container's pages are swapped.

13. How does the io.weight cgroup setting work compared to absolute throttling with io.max?

`io.weight` (cgroups v2; `blkio.weight` in v1) is a relative weight: under contention, the kernel's proportional I/O policy (CFQ on older kernels; BFQ or the io.cost model on newer ones) gives each cgroup disk time in proportion to its weight. If two containers have weights of 100 and 200, the second gets roughly twice the disk time of the first when both are busy. Absolute throttling, set through `io.max` in v2 or `blkio.throttle.read_bps_device` and `blkio.throttle.write_bps_device` in v1, caps bandwidth: the container cannot exceed the specified bytes per second regardless of what other cgroups are doing. Weights are better when you want proportional fairness; throttling is better when you need a hard ceiling. You can combine both: a ceiling per container, with weights deciding how contending containers share the device.
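The proportional part is plain arithmetic: under full contention, each cgroup's expected share is its weight divided by the sum of the weights of its busy siblings. A small sketch under that idealized assumption (real schedulers only approximate it, and idle cgroups give up their share); the function and container names are illustrative.

```python
from typing import Dict


def io_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Idealized share of disk time per cgroup when all of them are busy."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: weight / total for name, weight in weights.items()}


shares = io_shares({"container_a": 100, "container_b": 200})
print(round(shares["container_a"], 3))  # 0.333
print(round(shares["container_b"], 3))  # 0.667
```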

14. What happens to a container's namespaces when the container exits? Why does Docker keep some namespaces for persistent containers?

When a container exits, its namespaces are released: the kernel destroys a namespace once no process remains in it and nothing else holds a reference to it (such as a bind mount or an open file descriptor). When PID 1 exits, every other process in the PID namespace is killed, the network namespace and its interfaces are torn down, and the mount namespace is discarded. The container's filesystem layers persist in Docker storage, but the runtime namespace state is gone. Docker's `--net=host` option is different: it reuses the host's network namespace instead of creating a new one, so the container's network configuration affects the host directly and nothing is torn down on exit. Kubernetes pods share some namespaces by design: containers in the same pod share the network and IPC namespaces, each keeps its own mount namespace, and the PID namespace is shared only when `shareProcessNamespace` is enabled. Shared namespaces make communication between containers simple (for example over localhost) but increase coupling.

15. How does Linux implement the UTS namespace and what are practical uses beyond container isolation?

UTS namespaces isolate hostname and domain name shown by the `hostname` and `domainname` commands. The `sethostname()` and `setdomainname()` system calls modify only the calling process's UTS namespace, not the host. Beyond containers, UTS namespaces are used in HPC environments where each job node needs its own hostname to prevent conflicts in cluster software. Linux's `unshare --uts` creates a new UTS namespace. In multi-tenant servers, different tenants can have their own hostname without affecting others. The practical application in containerization is that each container can have a meaningful hostname like `web-01` instead of inheriting the host's generic name.

16. What is the cgroup `cpu.rt_runtime_us` setting and when would you use it for real-time workloads?

The `cpu.rt_runtime_us` setting (cgroups v1) specifies how long a cgroup's real-time tasks may run within each period defined by `cpu.rt_period_us`. Unlike CFS bandwidth control, it applies to tasks with real-time scheduling policies (SCHED_FIFO, SCHED_RR). If a cgroup has `cpu.rt_runtime_us=200000` (200ms) and `cpu.rt_period_us=1000000` (1s), RT tasks in that cgroup can run for 200ms every second. This prevents a misbehaving RT task from monopolizing a CPU. Docker exposes the setting through `docker run --cpu-rt-runtime` on kernels built with `CONFIG_RT_GROUP_SCHED`; Kubernetes has no pod-level field for it, and the CPU manager's `static` policy only pins containers to exclusive cores. Without such a budget, a container with real-time tasks could starve other work on the host.

17. How does the freezer cgroup subsystem work and why is it useful for batch job management?

The freezer can suspend (freeze) and resume all processes in a cgroup. It is a separate controller in cgroups v1 and the `cgroup.freeze` file in v2. When a cgroup is frozen, its tasks stop being scheduled until the cgroup is thawed. This is useful for batch job management because you can park a job without killing it and thaw it when resources are available. In Docker, `docker pause` and `docker unpause` use the freezer. Freezing is cleaner than sending SIGSTOP: SIGSTOP cannot be caught, but the stop is visible to the parent process and any sufficiently privileged process can undo it with SIGCONT, whereas frozen tasks stay stopped until the cgroup is thawed. It also helps checkpoint/restore, since all processes in the group are stopped together.

18. What is the difference between a privileged and unprivileged container, and what capabilities does Docker drop by default?

A privileged container has all Linux capabilities enabled, meaning the processes inside can do almost anything the host root can do: mount filesystems, modify iptables rules, access host devices. An unprivileged container (the default in Docker) runs with a small default capability set (`CAP_CHOWN`, `CAP_NET_BIND_SERVICE`, `CAP_NET_RAW`, and a few others) and drops the rest. Docker does not enable user namespace remapping by default, so container root is still UID 0 on the host unless you configure `userns-remap` or use rootless mode. You add capabilities with `--cap-add` and remove them with `--cap-drop`; `--cap-drop ALL` followed by selective `--cap-add` gives the tightest set. Running `--privileged` grants all capabilities and also lifts the seccomp, AppArmor, and device restrictions, which is close to giving the container host root access. Production containers should not run privileged.

19. How does Knative or serverless platforms use namespaces and cgroups differently from traditional containers?

Serverless platforms build on the same namespaces and cgroups as traditional containers; Knative, for example, runs workloads as ordinary Kubernetes pods. The key difference is lifecycle: serverless instances are often short-lived, scale to zero, and can be created in large numbers, so sandbox startup time matters much more. Platforms therefore tend to keep sandbox creation cheap and to clean up cgroups aggressively to prevent resource exhaustion. Many also reuse a warm environment across invocations rather than creating fresh namespaces for each one, trading some isolation for lower latency.

20. What is the `devices` cgroup and how does it prevent container access to host devices?

The `devices` cgroup controller restricts which device nodes a container can access through an allow list. By default, Docker permits only a small set such as `/dev/null`, `/dev/zero`, and `/dev/urandom`. Container processes cannot open `/dev/sda` (the host disk) unless it is explicitly allowed with the `--device` flag. In cgroups v1, each rule names a device type (`a` for all, `c` for character, `b` for block), the major:minor numbers, and the permitted access (`r` read, `w` write, `m` mknod); cgroups v2 enforces the same policy with an eBPF program instead of control files. This prevents a compromised container from reaching host hardware directly: without access to `/dev/sda`, an attacker cannot read or write the host disk even as root inside the container. The devices cgroup is a critical layer in container security alongside capabilities and seccomp.
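To make the rule format concrete, here is a small sketch that parses a cgroup v1 style rule such as `c 1:3 rwm` (type, major:minor, access) and checks whether it permits a given access. The class and function names are illustrative, and the sketch models a single rule in isolation, not the kernel's full evaluation of an allow list.

```python
from typing import NamedTuple, Optional


class DeviceRule(NamedTuple):
    dev_type: str         # 'a' = all types, 'c' = character, 'b' = block
    major: Optional[int]  # None stands for the '*' wildcard
    minor: Optional[int]
    access: str           # any combination of 'r', 'w', 'm'


def _number(field: str) -> Optional[int]:
    return None if field == "*" else int(field)


def parse_device_rule(rule: str) -> DeviceRule:
    """Parse a rule of the form '<type> <major>:<minor> <access>'."""
    dev_type, numbers, access = rule.split()
    major, minor = numbers.split(":")
    return DeviceRule(dev_type, _number(major), _number(minor), access)


def allows(rule: DeviceRule, dev_type: str, major: int, minor: int, access: str) -> bool:
    """Check whether one rule permits a single access ('r', 'w', or 'm') to a device."""
    return (
        rule.dev_type in ("a", dev_type)
        and rule.major in (None, major)
        and rule.minor in (None, minor)
        and access in rule.access
    )


null_rule = parse_device_rule("c 1:3 rwm")  # /dev/null is character device 1:3
print(allows(null_rule, "c", 1, 3, "r"))    # True
print(allows(null_rule, "b", 8, 0, "r"))    # False: /dev/sda is block device 8:0
```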

Conclusion

Namespaces and cgroups are the Linux kernel primitives that make containers possible. Namespaces partition kernel resources for isolation while cgroups impose resource limits and accounting. Together they enable multi-tenant deployments where workloads share infrastructure safely. Understanding these mechanisms helps you debug container issues, design more resilient architectures, and implement proper resource controls. For further study, explore the runc runtime source code, cgroup v2 hierarchy differences, and container security mechanisms like seccomp and capabilities that complement namespace isolation.
