System Call Security

Syscall filtering, seccomp, capability-based security, and least privilege in Linux

Reading time: 25 min · Author: GeekWorkBench

Every operation your program performs that requires kernel involvement passes through system calls. Opening files, creating processes, allocating memory, and communicating over the network all traverse the boundary between your application and the operating system kernel. This boundary is where security happens. Understanding how to control what system calls a program can make is fundamental to building secure systems.

System call security provides mechanisms to restrict what privileged operations a process can perform. Rather than running with full kernel privileges, processes can be confined to only the specific operations they actually need. This principle of least privilege dramatically reduces the blast radius when something goes wrong. If a vulnerability exists in a web server that has been restricted from making any file operations except reading specific document files, the attacker cannot use it to spawn a shell or read sensitive configuration.

The Linux kernel provides several complementary mechanisms for syscall security: seccomp filtering, capabilities, and namespace isolation. Each addresses different threat models and operational requirements. Together they form the foundation of container security and secure computing baselines.

Overview

System calls are the fundamental interface between user space and kernel space. When a user space program needs to do anything privileged, it invokes a system call instruction that transfers control to a predefined kernel handler. The kernel validates the request, performs the operation, and returns results to user space.

Without restrictions, a compromised process can invoke any system call, potentially escalating privileges or accessing resources it should not touch. System call security mechanisms allow administrators and developers to restrict which system calls a process can make, effectively shrinking its privilege domain.

Seccomp (secure computing mode) was introduced in kernel 2.6.12 as a single strict mode: once enabled, a process could call only read, write, _exit, and sigreturn. The BPF extension added in Linux 3.5 turned seccomp into a general filtering mechanism. Modern seccomp-bpf filters can match on the system call number, the calling architecture, and the raw argument values (pointer arguments cannot be dereferenced), and can hand a call off to a tracer or a user space supervisor.

Capabilities split traditional root privileges into discrete units. Instead of having all privileges (or none), a process can have specific capabilities like CAP_NET_BIND_SERVICE to bind to privileged ports or CAP_SYS_ADMIN for system administration tasks. This fine-grained model enables precise privilege assignment.

When to Use / When Not to Use

System call security mechanisms are essential when running untrusted code, implementing security-critical services, or building container runtimes. Any application that processes external input from potentially malicious sources should consider syscall filtering.

Container orchestration platforms like Docker and Kubernetes rely heavily on these mechanisms. When you run a container, the runtime configures namespace isolation, capability sets, and usually a seccomp profile to confine what the containerized process can do. Understanding these mechanisms helps you debug container security issues and design more secure deployments.

However, these mechanisms add complexity. Development and debugging become harder when system calls are blocked. Some applications require capabilities that are difficult to grant without introducing risk. In development environments or systems without external exposure, the overhead may not justify the benefits.

Architecture or Flow Diagram

graph TD
    A[User Space Process] -->|syscall| B[System Call Entry]
    B --> C{Seccomp Filter Active?}
    C -->|No| D[Execute System Call]
    C -->|Yes| E[BPF Program Runs]
    E --> F{BPF Rules Allow?}
    F -->|No| G[Block call<br/>errno, signal, or kill]
    F -->|Yes| D
    D --> H[Return to User Space]

    subgraph Capabilities Check
    I{CAP_SYS_ADMIN?}
    I -->|Yes| J[Privileged Operation]
    I -->|No| K[Permission Denied]
    end

    style G fill:#ff6b6b
    style K fill:#ff6b6b

The system call filtering flow begins when a process makes a system call. The kernel intercepts this call and checks whether a seccomp filter is installed for the process. If a filter exists, the BPF program runs and evaluates the system call number and its arguments. Based on the filter rules, the BPF program either returns a verdict allowing the call, blocks it with an error, or triggers a signal to the process.

Capabilities are checked independently for specific privileged operations. The kernel maintains a set of permitted and effective capabilities for each process. When a privileged operation is attempted, the kernel verifies the process has the required capability before proceeding.

Core Concepts

Seccomp Modes

Linux provides two seccomp modes, plus a user-notification extension that builds on the second.

Seccomp Mode 1 (SECCOMP_MODE_STRICT) allows only read, write, _exit, and sigreturn system calls. Any other system call terminates the process with SIGKILL. This mode is rarely used in practice due to its extreme restrictions.

Seccomp Mode 2 (SECCOMP_MODE_FILTER) enables BPF filtering on the system call number and arguments. This mode is what people typically refer to when discussing seccomp. It allows fine-grained control but requires writing BPF programs.

Seccomp user notifications (added in kernel 5.0) are not a separate mode but a filter-mode action, SECCOMP_RET_USER_NOTIF, that defers the decision to a user space supervisor. Container managers such as LXD use this to handle selected privileged system calls on behalf of unprivileged containers.

BPF for Seccomp Filtering

Seccomp BPF programs follow the classic BPF execution model. The kernel runs the BPF program with the system call number, architecture, and arguments available as a read-only seccomp_data structure. The program returns a verdict: SECCOMP_RET_ALLOW permits the call, SECCOMP_RET_KILL terminates the calling thread (SECCOMP_RET_KILL_PROCESS, available since kernel 4.14, terminates the whole process), SECCOMP_RET_ERRNO returns a specified error code, and SECCOMP_RET_TRAP delivers SIGSYS.

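#include <stddef.h>         /* offsetof */
#include <stdio.h>          /* perror */
#include <linux/audit.h>    /* AUDIT_ARCH_X86_64 */
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>    /* __NR_* */

/* Installs an x86-64 allowlist filter. Returns 0 on success, -1 on failure. */
int install_seccomp_filter(void) {
    struct sock_filter filter[] = {
        // Reject calls made through any other ABI (syscall numbers differ per architecture)
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
        // Load system call number
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
        // Allow read
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        // Allow write
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        // Allow exit
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        // Allow exit_group (what glibc's exit() and a return from main() use)
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        // Kill for everything else
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    };

    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    // Unprivileged processes must set no_new_privs before installing a filter
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_NO_NEW_PRIVS)");
        return -1;
    }
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("prctl(PR_SET_SECCOMP)");
        return -1;
    }
    return 0;
}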
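
To see how the kernel walks a filter like the one above, it can help to trace one by hand. The following is a minimal sketch of a classic BPF interpreter in Python that handles only the three instruction types the example uses. The opcode and verdict constants come from the Linux headers; the x86-64 syscall numbers, the architecture check at the start of the filter, and the helper names (`pack_seccomp_data`, `run_filter`) are assumptions made for illustration, not kernel code.

```python
import struct

# Classic BPF opcodes used by the example filter (from <linux/bpf_common.h>).
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06     # BPF_RET | BPF_K

SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ALLOW = 0x7FFF0000
AUDIT_ARCH_X86_64 = 0xC000003E

# Assumption: x86-64 syscall numbers.
NR_READ, NR_WRITE, NR_EXECVE, NR_EXIT_GROUP = 0, 1, 59, 231


def pack_seccomp_data(nr, arch=AUDIT_ARCH_X86_64, ip=0, args=(0,) * 6):
    """Lay out struct seccomp_data: int nr, u32 arch, u64 ip, u64 args[6]."""
    return struct.pack("<iIQ6Q", nr, arch, ip, *args)


def run_filter(prog, data):
    """Run (code, jt, jf, k) instructions and return the 32-bit verdict."""
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = struct.unpack_from("<I", data, k)[0]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf  # offsets are unsigned: forward only
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")


FILTER = [
    (BPF_LD_W_ABS, 0, 0, 4),                      # load seccomp_data.arch
    (BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),         # skip the kill if arch matches
    (BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
    (BPF_LD_W_ABS, 0, 0, 0),                      # load seccomp_data.nr
    (BPF_JEQ_K, 0, 1, NR_READ),
    (BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),
    (BPF_JEQ_K, 0, 1, NR_WRITE),
    (BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),
    (BPF_JEQ_K, 0, 1, NR_EXIT_GROUP),
    (BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),
    (BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),  # default: kill
]

if __name__ == "__main__":
    for name, nr in [("read", NR_READ), ("execve", NR_EXECVE)]:
        print(name, hex(run_filter(FILTER, pack_seccomp_data(nr))))
```

Each compare either falls through to the ALLOW return or skips one instruction to the next compare, which is why the order of rules in a hand-written filter matters.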

Linux Capabilities

Traditional Unix distinguished between privileged processes (those running as root with UID 0) and unprivileged processes. Capabilities split these privileges into discrete units.

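Each process carries several capability sets. Permitted capabilities are those the process is allowed to use. Effective capabilities are the ones the kernel actually checks. Inheritable capabilities are preserved across execve, but they reach the new program's permitted set only if the executable's file capabilities also list them. Ambient capabilities (Linux 4.3 and later) pass through execve to programs that have no file capabilities. The bounding set limits which capabilities can ever be acquired.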

# View capabilities of a process
getpcaps $$

# Capabilities explanation
# CAP_NET_BIND_SERVICE - Bind to ports below 1024
# CAP_SYS_ADMIN - Most system administration operations
# CAP_SYS_CHROOT - Use chroot()
# CAP_NET_RAW - Use raw sockets
# CAP_DAC_OVERRIDE - Bypass file permission checks

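# Remove one capability from the bounding set and print the resulting state
# (changing the bounding set needs CAP_SETPCAP, so run as root)
sudo capsh --drop=cap_net_raw --print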
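
The capability sets are exposed as hexadecimal bitmasks in /proc/<pid>/status (the CapInh, CapPrm, CapEff, CapBnd, and CapAmb fields), where bit N corresponds to capability number N in <linux/capability.h>. As a rough sketch of what getpcaps does with those masks, the Python below decodes them into names. The function names are hypothetical, and the name table reflects capabilities up to CAP_CHECKPOINT_RESTORE; newer kernels may define more.

```python
# Capability names indexed by bit number, as numbered in <linux/capability.h>.
CAP_NAMES = """
chown dac_override dac_read_search fowner fsetid kill setgid setuid setpcap
linux_immutable net_bind_service net_broadcast net_admin net_raw ipc_lock
ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct sys_admin
sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore
""".split()
CAP_NAMES = ["cap_" + name for name in CAP_NAMES]

CAP_FIELDS = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def decode_caps(mask):
    """Return the names of all capability bits set in an integer mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def parse_cap_sets(status_text):
    """Extract and decode the Cap* fields from /proc/<pid>/status text."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in CAP_FIELDS:
            sets[key] = decode_caps(int(value.strip(), 16))
    return sets


if __name__ == "__main__":
    # On Linux: inspect the current process.
    with open("/proc/self/status") as fh:
        for field, caps in parse_cap_sets(fh.read()).items():
            print(field, ",".join(caps) or "(none)")
```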

Production Failure Scenarios + Mitigations

Scenario: Seccomp Blocks Required System Call

Problem: Application fails because seccomp filter blocks a system call the application legitimately needs.

Symptoms: Process receives SIGKILL or SIGSYS signals, application crashes with seccomp violations in logs.

Mitigation: Use auditd to identify required system calls before deploying seccomp filters. Generate allowlists by running applications under strace to capture all system calls made during normal operation. Test filters thoroughly in staging before production deployment.

# Record every system call made by the application and its children
strace -f -o app_trace.log ./your_application

# Extract the unique syscall names to seed an allowlist
sed -nE 's/^([0-9]+ +)?([a-z_0-9]+)\(.*/\2/p' app_trace.log | sort -u
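
The same extraction can be done in a script when the trace needs further processing. This is a small sketch that assumes the usual strace line layout (an optional PID prefix followed by `name(`); the function name and the sample trace are made up for illustration.

```python
import re

# Matches "1234 openat(", "[pid 1234] read(" or "write(" at the start of a line.
_SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(log_text):
    """Return the sorted set of syscall names that start a line in strace output."""
    names = set()
    for line in log_text.splitlines():
        match = _SYSCALL_RE.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


# Illustrative sample, not output from a real run.
SAMPLE = """\
4242 execve("./app", ["./app"], 0x7ffc0000 /* 20 vars */) = 0
4242 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
4242 read(3, "data", 832) = 4
4243 read(0,  <unfinished ...>
4243 <... read resumed>"x", 1) = 1
4242 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED} ---
4242 exit_group(0) = ?
4242 +++ exited with 0 +++
"""

if __name__ == "__main__":
    print(syscalls_from_strace(SAMPLE))
```

Signal lines, exit markers, and "resumed" continuations are skipped because they do not start with a syscall name, so each call is counted once from its entry line.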

Scenario: Capability Misconfiguration Locks Out Administrator

Problem: Administrative script loses required capabilities and cannot perform necessary operations.

Symptoms: Scripts fail with “Operation not permitted” errors, cron jobs do not run properly, backup scripts fail.

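Mitigation: Document required capabilities for each administrative function. Use capability-aware monitoring tools. When dropping capabilities, keep only the minimum set each script needs and verify the result before rollout. Do not leave CAP_SYS_ADMIN on service accounts as a safety net; rely on a separate, audited root login or console access for recovery.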

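# Test capability requirements before deployment.
# Run as root: capsh switches to the target user itself and passes the listed
# capabilities on through the ambient set.
sudo capsh --caps="cap_net_bind_service,cap_dac_read_search+eip cap_setpcap,cap_setuid,cap_setgid+ep" \
    --keep=1 --user=username \
    --addamb=cap_net_bind_service,cap_dac_read_search \
    -- -c "./your_script"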

# Recover from lockout - reboot to single user mode
# Add capability requirements to script comments

Scenario: Container Escape via Syscall

Problem: Vulnerability in container allows malicious container contents to escape namespace isolation.

Symptoms: Unauthorized access to host system files, processes outside container visible, network traffic bypasses container network.

Mitigation: Keep kernel updated to patch container escape vulnerabilities. Use seccomp profiles that block syscalls known to be exploited for container escapes (like unshare with certain flags). Follow container security best practices like not running containers as root.

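# Docker applies its default seccomp profile (blocks ~44 syscalls) automatically.
# Pass a path only to substitute a custom profile:
docker run --security-opt seccomp=/path/to/profile.json ...

# Block a dangerous syscall with a small denylist profile.
# Note: this replaces the default profile, so every other syscall is allowed.
cat > deny-unshare.json <<'EOF'
{"defaultAction":"SCMP_ACT_ALLOW",
 "syscalls":[{"names":["unshare"],"action":"SCMP_ACT_ERRNO"}]}
EOF
docker run --security-opt seccomp=deny-unshare.json ...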
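
A denylist like the one above leaves every other syscall open. The inverse, an allowlist with a denying default action, is usually what you want once the application's syscalls are known. The sketch below generates such a profile in the JSON layout Docker accepts (`defaultAction`, `architectures`, `syscalls` with `names` and `action`). The function name, the single-architecture list, and the sample syscall names are assumptions for illustration.

```python
import json


def build_allowlist_profile(allowed, default_action="SCMP_ACT_ERRNO"):
    """Build a seccomp profile dict that permits only the given syscall names."""
    return {
        "defaultAction": default_action,
        # Assumption: x86-64 only; list more architectures if images need them.
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"names": sorted(set(allowed)), "action": "SCMP_ACT_ALLOW"},
        ],
    }


if __name__ == "__main__":
    # Hypothetical allowlist; a real one comes from tracing the application.
    profile = build_allowlist_profile(["read", "write", "openat", "exit_group"])
    print(json.dumps(profile, indent=2))
```

Writing the result to a file and passing it with --security-opt seccomp=<file> applies it in place of the default profile, so a real allowlist also has to cover what the container runtime itself needs during startup.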

Trade-off Table

| Mechanism | Granularity | Performance Impact | Complexity |
|---|---|---|---|
| Seccomp BPF | System call number and arguments | Minimal (< 1% overhead) | High (requires BPF knowledge) |
| Capabilities | Individual privileges | None | Medium (must understand caps) |
| SELinux/AppArmor | File and network access rules | Minimal | Very High (policy language) |
| Namespaces | Resource isolation | Minimal | Medium |

| Capability Type | Purpose | Risk if Misused |
|---|---|---|
| CAP_SYS_ADMIN | System administration | Full system compromise |
| CAP_NET_ADMIN | Network configuration | Network sniffing, routing changes |
| CAP_NET_RAW | Raw packet creation | Spoofing, reconnaissance |
| CAP_SYS_CHROOT | Change root directory | Directory traversal |
| CAP_DAC_OVERRIDE | Bypass permission checks | Read/write anything |

| Seccomp Verdict | Behavior | Use Case |
|---|---|---|
| SECCOMP_RET_ALLOW | Permit system call | Whitelist allowed calls |
| SECCOMP_RET_KILL | Kill process immediately | Default deny |
| SECCOMP_RET_ERRNO | Return error to process | Silent blocking |
| SECCOMP_RET_TRAP | Deliver signal | Notification without kill |
| SECCOMP_RET_USER_NOTIF | Forward to userspace | Supervisor-based decisions |
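
A filter's 32-bit return value packs the verdict into the upper 16 bits and optional data (the errno for SECCOMP_RET_ERRNO, for example) into the lower 16 bits. As a quick illustration, this Python sketch splits a return value using the masks and action constants from <linux/seccomp.h>; the `decode_verdict` helper itself is hypothetical.

```python
import errno

SECCOMP_RET_ACTION_FULL = 0xFFFF0000  # upper 16 bits: action
SECCOMP_RET_DATA = 0x0000FFFF         # lower 16 bits: action-specific data

ACTIONS = {
    0x80000000: "KILL_PROCESS",
    0x00000000: "KILL_THREAD",  # the value of the older SECCOMP_RET_KILL
    0x00030000: "TRAP",
    0x00050000: "ERRNO",
    0x7FC00000: "USER_NOTIF",
    0x7FF00000: "TRACE",
    0x7FFC0000: "LOG",
    0x7FFF0000: "ALLOW",
}


def decode_verdict(ret):
    """Split a filter return value into (action name, data)."""
    action = ACTIONS.get(ret & SECCOMP_RET_ACTION_FULL, "UNKNOWN")
    return action, ret & SECCOMP_RET_DATA


if __name__ == "__main__":
    # SECCOMP_RET_ERRNO | EPERM: fail the call with "Operation not permitted".
    print(decode_verdict(0x00050000 | errno.EPERM))
    print(decode_verdict(0x7FFF0000))
```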

Implementation Snippets

Basic Seccomp with libseccomp

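#define _GNU_SOURCE
#include <seccomp.h>
#include <stdio.h>
#include <unistd.h>

// Build with: gcc demo.c -lseccomp   (file name is illustrative)
int main(void) {
    // Default action: kill on any syscall that is not explicitly allowed
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (!ctx) return 1;

    // Allow read, write, and process exit. glibc's exit() and a return
    // from main() use exit_group, so allow it alongside exit.
    int rc = 0;
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // Allow the legacy open syscall (but not openat). glibc's open()
    // normally issues openat, so calls made through it would still be killed.
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(open), 0);

    // Load filter
    if (rc != 0 || seccomp_load(ctx) < 0) {
        fprintf(stderr, "failed to build or load seccomp filter\n");
        seccomp_release(ctx);
        return 1;
    }

    seccomp_release(ctx);

    // Now run restricted code. Use write() directly: stdio may issue
    // syscalls (fstat, ioctl, brk) that this allowlist does not cover.
    static const char msg[] = "Running with seccomp filter active\n";
    if (write(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0) return 1;
    return 0;
}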

Checking and Modifying Capabilities

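#!/bin/bash
# Run the application as appuser with only the capability it needs.
# Must be started as root. The setpcap/setuid/setgid capabilities are used by
# capsh for the user switch and are not passed on to the application.
exec capsh --caps="cap_net_bind_service+eip cap_setpcap,cap_setuid,cap_setgid+ep" \
    --keep=1 --user=appuser \
    --addamb=cap_net_bind_service \
    -- -c "./my_application"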

Using strace with Seccomp

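# Trace system calls in a seccomp-filtered process.
# A call rejected with SECCOMP_RET_ERRNO shows up with its error code; a process
# killed by the filter ends the trace with "+++ killed by SIGSYS +++".

strace -k -f ./seccomp_restricted_program 2>&1 | less

# -k shows stack traces for each syscall (needs strace built with unwinding support)
# -f follows forked processes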

Observability Checklist

System Call Monitoring:

  • Enable auditd rules to log specific system calls
  • Monitor for unexpected system call patterns
  • Track seccomp filter violations

Capability Monitoring:

  • Regularly audit capability assignments with getpcaps
  • Alert on unexpected capability changes
  • Document capability requirements for each service

Kernel Parameters:

  • kernel.yama.ptrace_scope - Restrict which processes may attach with ptrace
  • kernel.dmesg_restrict - Restrict kernel message access
  • fs.suid_dumpable - Control core dump behavior for setuid programs

Configuration Validation:

  • Test seccomp filters in staging before production
  • Verify capability drops after privilege operations
  • Validate namespace configurations for containers

Security/Compliance Notes

Defense in Depth: Never rely on a single security mechanism. Layer seccomp filtering, capabilities, and mandatory access controls for comprehensive protection.

Minimal Privilege Principle: Grant only the capabilities absolutely required for a service to function. Additional privileges increase attack surface and potential damage from compromise.

Audit Trail Requirements: Many compliance frameworks require logging of privileged operations. Configure auditd to log security-relevant system calls and capability usage.

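# Audit rule to log capability changes made through capset()
echo '-a always,exit -F arch=b64 -S capset -k cap_change' >> /etc/audit/rules.d/capability.rules

# Audit rule to log installation of seccomp filters via seccomp().
# Filter violations need no rule: the kernel emits type=SECCOMP records,
# which you can search with: ausearch -m SECCOMP
echo '-a always,exit -F arch=b64 -S seccomp -k seccomp' >> /etc/audit/rules.d/seccomp.rules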
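
When a filter kills or logs a call, the audit subsystem records a type=SECCOMP event that names the process and the syscall number. For feeding these into alerting, a parser can be as small as the sketch below. The sample record is illustrative rather than copied from a real log, the exact field set can vary between kernel versions, and `parse_seccomp_record` is a hypothetical helper.

```python
import re

_FIELD_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_seccomp_record(line):
    """Return key fields of a type=SECCOMP audit record, or None for other records."""
    if "type=SECCOMP" not in line:
        return None
    fields = dict(_FIELD_RE.findall(line))
    return {
        "pid": int(fields["pid"]),
        "comm": fields["comm"].strip('"'),
        "syscall": int(fields["syscall"]),  # number, architecture-specific
        "signal": int(fields["sig"]),       # 31 is SIGSYS on x86-64
    }


# Illustrative record; values are made up.
SAMPLE = (
    'type=SECCOMP msg=audit(1700000000.123:456): auid=1000 uid=1000 gid=1000 '
    'ses=3 pid=4242 comm="app" exe="/usr/local/bin/app" sig=31 arch=c000003e '
    'syscall=59 compat=0 ip=0x7f0000001234 code=0x80000000'
)

if __name__ == "__main__":
    print(parse_seccomp_record(SAMPLE))
```

The syscall number has to be resolved against the table for the recorded architecture; on x86-64, 59 is execve.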

Container Security Context: When running containers in production, use security contexts to restrict capabilities and system calls.

apiVersion: v1
kind: Pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      securityContext:
        capabilities:
          drop:
            - ALL
          add:
            - NET_BIND_SERVICE
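
Settings like these are easy to omit on one container in a larger manifest, so it can be worth checking them mechanically. The following sketch audits an already-parsed Pod spec (plain dicts, for instance from a YAML loader) for the settings discussed in this section. The function name and the findings text are hypothetical, and the field names follow the Kubernetes securityContext schema.

```python
def audit_container(pod_spec, container):
    """Return a list of hardening gaps for one container in a parsed Pod spec."""
    findings = []
    ctx = container.get("securityContext", {})

    if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
        findings.append("capabilities.drop does not include ALL")
    # Kubernetes treats an unset allowPrivilegeEscalation as true.
    if ctx.get("allowPrivilegeEscalation", True):
        findings.append("allowPrivilegeEscalation is not false")
    if ctx.get("privileged", False):
        findings.append("privileged is true")

    # A container-level profile overrides the pod-level one.
    profile = ctx.get("seccompProfile") or pod_spec.get(
        "securityContext", {}
    ).get("seccompProfile")
    if not profile or profile.get("type") == "Unconfined":
        findings.append("no seccomp profile (or Unconfined)")
    return findings


# Mirrors the manifest above as a Python dict.
POD_SPEC = {
    "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [
        {
            "name": "app",
            "securityContext": {
                "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]}
            },
        }
    ],
}

if __name__ == "__main__":
    for c in POD_SPEC["containers"]:
        print(c["name"], audit_container(POD_SPEC, c))
```

Run against the manifest above, the only finding is the missing allowPrivilegeEscalation: false, which is the setting that turns on no_new_privs for the container.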

Common Pitfalls / Anti-patterns

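Allowing execve Under Seccomp: A seccomp filter sees only raw argument values, so for execve it sees a pointer to the pathname and cannot tell which program is about to run. If a filter allows execve, a compromised process can execute any program it can reach on the filesystem, although the new program still inherits the filter. Block execve where possible, or make sure the filter is tight enough for every program that could be started.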

Capability Leaks After Fork: When a privileged process forks, child processes inherit capabilities. If the child drops privileges incorrectly, it may retain capabilities it should not have. Always re-evaluate capabilities after fork() and execve().

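Ignoring Ambient Capabilities: Capabilities in the ambient set (Linux 4.3 and later) survive execve of programs that have no file capabilities and are not setuid or setgid. The securebits flags, such as SECBIT_KEEP_CAPS and SECBIT_NOROOT, further control how capabilities are kept or granted across UID changes and execve. Misunderstanding these rules leads to processes that keep privileges they were assumed to have dropped.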
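
The rules for what survives execve are compact enough to model directly. The sketch below implements the transformation documented in capabilities(7) for a non-root process, with capability sets as integer bitmasks. The class and function names are hypothetical, and the special handling of UID 0 and securebits is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcCaps:
    permitted: int = 0
    inheritable: int = 0
    effective: int = 0
    bounding: int = 0
    ambient: int = 0


@dataclass(frozen=True)
class FileCaps:
    permitted: int = 0
    inheritable: int = 0
    effective: bool = False  # the file "effective" flag is a single bit
    setid: bool = False      # set-user-ID or set-group-ID executable


def caps_after_execve(p, f):
    """Apply the capabilities(7) execve rules (non-root case, no securebits)."""
    privileged_file = bool(f.permitted or f.inheritable or f.setid)
    ambient = 0 if privileged_file else p.ambient
    permitted = (p.inheritable & f.inheritable) | (f.permitted & p.bounding) | ambient
    effective = permitted if f.effective else ambient
    return ProcCaps(permitted, p.inheritable, effective, p.bounding, ambient)


CAP_NET_BIND_SERVICE = 1 << 10
ALL_CAPS = (1 << 41) - 1

if __name__ == "__main__":
    with_ambient = ProcCaps(
        permitted=CAP_NET_BIND_SERVICE,
        inheritable=CAP_NET_BIND_SERVICE,
        effective=CAP_NET_BIND_SERVICE,
        bounding=ALL_CAPS,
        ambient=CAP_NET_BIND_SERVICE,
    )
    # Executing an ordinary binary (no file capabilities, not setuid):
    print(caps_after_execve(with_ambient, FileCaps()))
```

With the ambient bit set, the capability is still effective after executing an ordinary binary; without it, the same execve leaves the process with an empty permitted set, which is the behavior many people expect in both cases.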

Seccomp Filter Performance: While BPF filtering has minimal overhead, overly complex filters with many rules can add latency. Optimize filters by grouping similar rules and using jumps efficiently.

TOCTOU in Capability Checks: Time-of-check-time-of-use vulnerabilities can occur when capabilities are checked before an operation but the operation happens later with different privileges. Structure code to perform operations immediately after privilege acquisition.

Quick Recap Checklist

  • System calls are the boundary between user space and kernel space
  • Seccomp BPF filtering allows fine-grained control over allowed system calls
  • Capabilities split root privileges into discrete units
  • Least privilege principle means granting minimum necessary access
  • Container runtimes use these mechanisms for isolation
  • Production failures often involve blocked required system calls
  • Comprehensive testing in staging prevents deployment issues
  • Multiple security mechanisms should be layered together
  • Audit logging supports compliance and incident investigation
  • Understanding these mechanisms is essential for container security

Interview Questions

1. What is the difference between seccomp mode 1 and seccomp mode 2?

Seccomp mode 1 (SECCOMP_MODE_STRICT) allows only the read, write, _exit, and sigreturn system calls. Any other system call terminates the process immediately with SIGKILL. Seccomp mode 2 (SECCOMP_MODE_FILTER) uses BPF programs to filter system calls based on the call number and arguments, allowing much finer-grained control over which system calls are permitted.

2. Explain the principle of least privilege as it applies to system call security.

Least privilege means granting a process only the minimum privileges required to perform its function. Rather than running as root with access to everything, a process should run with a specific limited set of capabilities and seccomp filters that block all unnecessary system calls. If a vulnerability is discovered in a least-privileged process, the attacker cannot escalate privileges or access resources beyond what the legitimate process needed.

3. What is the purpose of capabilities in Linux?

Capabilities split the traditional root privilege (UID 0) into discrete units that can be granted independently. Instead of having all privileges or none, a process can have specific capabilities like CAP_NET_BIND_SERVICE to bind to ports below 1024, or CAP_SYS_ADMIN for system administration tasks. This fine-grained model enables precise privilege assignment and reduces the risk from privilege escalation vulnerabilities.

4. How would you debug a seccomp filter that is blocking a legitimate system call?

First, use strace or auditd to identify which system call is being blocked and why. The kernel logs seccomp violations that can be viewed with ausearch or by checking /proc/PID/status for Seccomp mode. Once identified, update the BPF filter to allow the necessary call, potentially with argument filtering if the call itself is legitimate but some uses are not.
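
As a starting point for that check, the relevant fields of /proc/PID/status can be read with a few lines of Python. This is a sketch: `seccomp_state` is a hypothetical helper, and the NoNewPrivs and Seccomp_filters fields exist only on newer kernels, so they default to off and zero when absent.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_state(status_text):
    """Summarize seccomp-related fields from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return {
        "mode": SECCOMP_MODES.get(int(fields.get("Seccomp", 0)), "unknown"),
        "no_new_privs": fields.get("NoNewPrivs") == "1",
        "filters": int(fields.get("Seccomp_filters", 0)),
    }


if __name__ == "__main__":
    # On Linux: inspect the current process; substitute a PID for "self".
    with open("/proc/self/status") as fh:
        print(seccomp_state(fh.read()))
```

A mode of "filter" confirms a filter is attached; it does not say which call was blocked, which is where the strace or audit output comes in.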

5. What are the security implications of running a container with --privileged?

A privileged container effectively runs with all capabilities and bypasses most security restrictions that normally protect the host. It can access all devices, modify network configuration, load kernel modules, and potentially escape container isolation. Privileged containers should be avoided in production. Instead, use specific capability grants and seccomp profiles that allow only necessary privileges.

6. How does the BPF verifier work in the context of seccomp filtering?

Seccomp filters are classic BPF (cBPF) programs, so they are not checked by the eBPF verifier with its register range tracking. Instead, the kernel runs a simpler static check when the filter is installed: the program must be at most 4096 instructions long, use only valid opcodes, jump only to targets inside the program, and end in a return instruction. Because cBPF jump offsets are unsigned, jumps can only go forward, which guarantees that every program terminates.

For seccomp specifically, the kernel additionally rejects instructions that make no sense for syscall filtering and restricts loads to aligned 32-bit reads inside the seccomp_data structure. A seccomp filter cannot read arbitrary kernel or user memory; it sees only the system call number, architecture, instruction pointer, and raw argument values. Return values are not validated at load time; at run time the kernel treats any unrecognized action as a kill. These limits are what make seccomp BPF safe despite running in kernel context.

7. What is the kernel-user boundary and why does crossing it matter for security?

The kernel-user boundary is the protection boundary between user space (on x86, ring 3: restricted instruction set, no direct hardware access) and kernel space (ring 0: full hardware access). Crossing it requires a privilege-mode switch: the CPU saves the user registers, switches to a kernel stack, and enters a fixed kernel entry point. System calls are the controlled crossing points: each syscall is a deliberate transition where the kernel validates the request before performing privileged operations.

Security implications: user space cannot directly manipulate hardware, access arbitrary memory, or execute privileged instructions. Any operation requiring these must go through the kernel via a syscall, giving the kernel a chokepoint to enforce security policy. Attacks that exploit the boundary include syscall smuggling (passing malformed arguments that the kernel misinterprets), return-oriented programming that chains syscall gadgets, and kernel exploit techniques that manipulate data visible across the boundary.

8. What is the difference between discretionary access control (DAC) and mandatory access control (MAC)?

Discretionary Access Control (DAC), the traditional Unix permission model, allows owners of objects (files, processes) to decide who can access them. The owner can modify permissions at their discretion (hence the name). Root can bypass DAC checks entirely. Examples: standard Unix rwx permissions and POSIX ACLs.

Mandatory Access Control (MAC) enforces system-wide security policies that object owners cannot override and that can confine processes running as root. The kernel mediates every access based on a security policy database, regardless of the owner's wishes. Examples: SELinux and AppArmor. Seccomp is often deployed alongside them, but it is a syscall filter that a process applies to itself and its descendants, not a MAC system. A MAC policy might label web servers as httpd_t and deny them access to user home directories, even if root runs the process and the files are world-readable under DAC.

9. How do you construct a minimal seccomp filter for a specific application?

A practical approach: run the application under strace during representative operation to capture all syscalls made, extract the unique syscall names, then construct a filter that allows only those calls. Example using libseccomp to allowlist read, write, exit, exit_group, and openat:

scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);  // used by glibc's exit()
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 0);
seccomp_load(ctx);

Test extensively in staging before production. During rollout, the SCMP_ACT_LOG action (SECCOMP_RET_LOG, kernel 4.14 and later) lets unexpected syscalls proceed while logging them, so you can refine the policy from real usage instead of killing the process. The resulting records appear in the audit log or the kernel log.

10. What are the security implications of running a container without seccomp and with --privileged?

Running a container without seccomp means the container can make any syscall the kernel supports. Without seccomp, an exploit that achieves code execution inside the container has access to the full syscall surface, including dangerous calls like ptrace, mount, unshare, and syslog that are rarely needed by applications but commonly used in container escapes.

Running with --privileged grants all capabilities and turns off most of the confinement Docker normally applies: the default seccomp profile and AppArmor or SELinux confinement are disabled, and the container can access all host devices (including disks under /dev), mount filesystems, modify network configuration, and load kernel modules. A privileged container running as UID 0 can usually escape to host root. Avoid these flags in production; use specific capability grants and seccomp profiles instead.

11. How does strace interact with seccomp and what can it tell you about syscall usage?

strace uses the ptrace syscall to attach to a target process and intercept syscalls by stopping the process at each entry and exit and reading the syscall number and arguments. Separately, a seccomp filter that returns SECCOMP_RET_USER_NOTIF (kernel 5.0+) forwards syscall information to a user space supervisor via a file descriptor, which enables strace-like observation without ptrace.

You can summarize the syscalls an application makes by running strace -f -c -o trace.log ./program, which reports per-syscall counts and times. To build a seccomp allowlist, extract the syscall names from a full trace; libseccomp resolves names to numbers for you (SCMP_SYS() in C, or the scmp_sys_resolver tool), and ausyscall --dump lists the table for the current architecture. Note that vDSO-backed calls such as gettimeofday and clock_gettime usually do not appear in a trace, yet can fall back to real syscalls, so keep them on the allowlist.

12. What is the relationship between namespaces and seccomp in container security?

Namespaces and seccomp address different attack surfaces. Namespaces isolate resource views (PID trees, network interfaces, mount points, UTS hostname, IPC mechanisms, user UIDs) so that a container cannot see or interfere with resources belonging to other containers or the host. Seccomp restricts the syscalls a process can make, limiting the kernel operations available even if the process escapes namespace isolation.

Together they form defense in depth: namespace isolation prevents the container from seeing sensitive resources, and seccomp prevents exploitation of kernel syscalls that could break namespace boundaries. A container in its own network namespace cannot reach host network devices, but without seccomp a process inside it could still call unshare with flags such as CLONE_NEWUSER or CLONE_NEWNET to create fresh namespaces, gain a full capability set inside them, and reach kernel code that is normally limited to privileged callers. A seccomp filter that rejects unshare with namespace-creating flags closes this gap. Docker's default seccomp profile blocks ~44 syscalls known to be problematic for container isolation.
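
Filtering on flags rather than on the whole syscall relies on a masked comparison, which libseccomp and the Docker profile format call SCMP_CMP_MASKED_EQ: the rule matches when (argument & mask) equals a given value. The sketch below models that test for unshare flags in Python. The CLONE_* values come from <linux/sched.h>; the helper names and the choice of which flags to block are assumptions for illustration.

```python
# Namespace flags from <linux/sched.h>.
CLONE_NEWNS = 0x00020000
CLONE_NEWUSER = 0x10000000
CLONE_NEWNET = 0x40000000


def masked_eq(arg, mask, datum):
    """SCMP_CMP_MASKED_EQ semantics: compare only the bits selected by mask."""
    return (arg & mask) == datum


def unshare_verdict(flags, blocked=CLONE_NEWUSER | CLONE_NEWNET):
    """Allow unshare() only when none of the blocked namespace flags is set."""
    # Allow rule: (flags & blocked) == 0. Anything else hits the deny default.
    return "ALLOW" if masked_eq(flags, blocked, 0) else "ERRNO"


if __name__ == "__main__":
    print(unshare_verdict(CLONE_NEWNS))
    print(unshare_verdict(CLONE_NEWNS | CLONE_NEWNET))
```

Expressing the rule as "allow when the masked bits are zero" keeps the deny as the default outcome, so a flag combination nobody anticipated is rejected rather than let through.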

13. What is user namespace security and how does it relate to container root privileges?

User namespaces map UIDs inside the container to different UIDs on the host. Inside a user namespace, UID 0 (root) can map to an unprivileged host UID such as 100000. This means container root is not root on the host: its capabilities apply only to resources owned by its own namespace, and it cannot modify host files owned by the real root user.

Rootless containers (Docker rootless mode, Podman) use user namespaces to run without any root privileges on the host. Even if an attacker escapes the container and gains container-root, they only have the mapped UID range, not actual root. User namespaces also enable safer nested containers. However, user namespaces expose kernel code to unprivileged callers, and bugs reachable that way (for example CVE-2022-0185 and CVE-2022-0492) have allowed privilege escalation, so combining user namespaces with seccomp and capabilities provides defense in depth.

14. How do seccomp user notifications enable sophisticated security policies?

Seccomp user notifications (kernel 5.0+) allow a user space supervisor to receive and respond to syscalls that the seccomp filter defers. The supervised process installs its filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, which returns a notification file descriptor that is handed to the supervisor. When the filter returns SECCOMP_RET_USER_NOTIF, the calling thread blocks; the supervisor reads the request with the SECCOMP_IOCTL_NOTIF_RECV ioctl, decides whether to allow or deny it, and responds with the SECCOMP_IOCTL_NOTIF_SEND ioctl.

This enables patterns like: a service that normally runs without a capability but temporarily needs it for a specific operation (like binding to a privileged port), where a supervisor intercepts the bind() call, performs the operation on behalf of the process, and returns a mocked success. It also enables strace functionality without ptrace, audit daemons that log all syscalls, and custom policy enforcement that goes beyond simple allow/deny.

15. What are the security differences between AppArmor and SELinux for system call filtering?

Neither system filters system calls directly; both are Linux Security Modules that mediate the operations syscalls perform on files, sockets, and other objects. AppArmor operates on file paths and capabilities, using pathname-based mandatory access control. A process is confined by a profile that specifies what files it can access (by path, not by inode), what capabilities it can use, and what network operations are permitted. AppArmor is considered easier to debug because file paths are more intuitive than security contexts.

SELinux (Security-Enhanced Linux) operates on security contexts (labels) assigned to all system subjects (processes) and objects (files, sockets, ports). The policy engine evaluates whether a subject with a given security context can perform an operation on an object with another security context. SELinux is more powerful and flexible but significantly more complex to configure: policies are written in a dedicated language, and changes must be compiled into policy modules and loaded with semodule.

16. What is the difference between CAP_SYS_ADMIN and CAP_SYS_MODULE?

CAP_SYS_ADMIN is the most powerful capability, encompassing a wide range of system administration operations: mounting filesystems, creating and entering namespaces, configuring swap, setting the hostname and domain name, and more. Its breadth makes it nearly equivalent to root, since holding CAP_SYS_ADMIN allows bypassing many permission checks. This is why CAP_SYS_ADMIN is the primary target for restriction in container environments.

CAP_SYS_MODULE allows loading and unloading kernel modules via init_module and delete_module syscalls. Loading a malicious kernel module is one of the most powerful attack vectors because the module runs in kernel context with full hardware access. Restricting CAP_SYS_MODULE prevents containers from loading kernel modules, which closes an escape vector where a container could load a module that manipulates the host kernel.

17. What is the seccomp filter that Docker's default profile installs?

Docker's default seccomp profile is built into the daemon and applied automatically unless you pass a custom profile or seccomp=unconfined with --security-opt. It is an allowlist: the default action returns an error and only listed syscalls are permitted, which in effect blocks around 44 syscalls known to be dangerous for container workloads. Among those blocked unless the container holds the matching capability are syslog (reading kernel logs), mount and umount (filesystem manipulation), unshare and setns (namespace manipulation), name_to_handle_at and open_by_handle_at (file-handle access that has been used in container escapes), and perf_event_open (profiling that can leak information about the host).

It explicitly allows ~300 syscalls needed by typical applications. Some entries carry argument conditions (for example, personality is allowed only with a small set of values) or apply only when the container holds a given capability. The default profile is a practical starting point but should be customized for specific workloads: running a web server has different syscall requirements than running a database.

18. What is kernel hardening and how do these syscall security mechanisms fit into a broader hardening strategy?

Kernel hardening encompasses compile-time and runtime protections that reduce kernel attack surface. Compiler options such as -fstack-protector-strong, together with hardening-related kernel configuration options, make exploits harder. Runtime hardening includes hiding kernel pointers and logs (kernel.kptr_restrict, kernel.dmesg_restrict), preventing unused filesystem modules from loading, and restricting /dev/kmem and /dev/mem access.

Seccomp, capabilities, and namespaces are user-space-facing hardening mechanisms that restrict what a compromised or misbehaving process can do. Together with kernel hardening, they create defense in depth: even if an attacker finds a kernel vulnerability, the exploit's capabilities are limited by seccomp filters, reduced capability sets, and namespace isolation. A comprehensive hardening strategy layers all of these: minimal base image, dropped capabilities, seccomp filters, read-only rootfs, no-new-privileges flag, and SELinux/AppArmor policies.

19. How does the no_new_privs flag work and why should containers set it?

The no_new_privs flag (prctl(PR_SET_NO_NEW_PRIVS, 1)) prevents a process and its children from gaining new privileges through setuid binaries or file capabilities. Any execve of a setuid binary or file with capabilities will run with the same privileges as the parent—the privilege escalation will not happen. Existing privileges (capabilities, UID mappings) are preserved, but no new ones can be acquired.

Containers should set no_new_privs because it shuts down privilege escalation through setuid and file-capability binaries. A compromised container process that tries to exec a setuid binary to gain root will not succeed. This pairs well with dropping CAP_SETUID and other privilege-granting capabilities. In Kubernetes, setting securityContext.allowPrivilegeEscalation: false on a container sets this flag for that container.

20. What is the attack surface of the execve syscall under seccomp?

The execve syscall presents a unique challenge for seccomp: once a new program starts, the process memory is completely replaced by the new binary's code and data. If the seccomp filter allows execve, a compromised process can load any program from the filesystem. The filter itself survives, because seccomp filters are inherited across fork and execve and cannot be removed. The risk is that the allowlist was designed for the original program, so an attacker can pick whichever available binary makes the most of the syscalls that remain allowed.

Mitigation: either block execve entirely in the seccomp filter, or use PR_SET_NO_NEW_PRIVS to prevent the new program from gaining privileges. Additionally, combine with a read-only root filesystem so the attacker cannot introduce their own binary. Seccomp BPF cannot restrict execve to specific binaries, because it sees only the pathname pointer and not the string it points to; that kind of policy needs an LSM such as AppArmor or SELinux, or a user-notification supervisor. In containers, Docker's default seccomp profile does not block execve, making it important to pair with no_new_privs.

Conclusion

System call security sits at the boundary between user space and kernel space. Seccomp BPF, capabilities, and namespace isolation together form the foundation of container and kernel security. The principle of least privilege should guide every decision you make here: grant only the capabilities actually needed, and allow only the syscalls actually used. These mechanisms only work as a layered defense because each has blind spots that others cover.

If you want to dig deeper, the BPF verifier source is surprisingly readable, seccomp user notifications enable some clever supervisor patterns, and studying how containerd implements syscall filtering will show you how the theory translates to production code.
