System Call Security

Syscall filtering, seccomp, capability-based security, and least privilege in Linux

published: May 19, 2026 reading time: 29 min read author: GeekWorkBench

Every operation your program performs that requires kernel involvement passes through system calls. Opening files, creating processes, allocating memory, network communication, all of these traverse the boundary between your application and the operating system kernel. This boundary is where security happens. Understanding how to control what system calls a program can make is fundamental to building secure systems.

System call security provides mechanisms to restrict what privileged operations a process can perform. Rather than running with full kernel privileges, processes can be confined to only the specific operations they actually need. This principle of least privilege dramatically reduces the blast radius when something goes wrong. If a vulnerability exists in a web server that has been restricted from making any file operations except reading specific document files, the attacker cannot use it to spawn a shell or read sensitive configuration.

The Linux kernel provides several complementary mechanisms for syscall security: seccomp filtering, capabilities, and namespace isolation. Each addresses different threat models and operational requirements. Together they form the foundation of container security and secure computing baselines.

Introduction

System calls are the fundamental interface between user space and kernel space. When a user space program needs to do anything privileged, it invokes a system call instruction that transfers control to a predefined kernel handler. The kernel validates the request, performs the operation, and returns results to user space.

Without restrictions, a compromised process can invoke any system call, potentially escalating privileges or accessing resources it should not touch. System call security mechanisms allow administrators and developers to restrict which system calls a process can make, effectively shrinking its privilege domain.

Seccomp (secure computing mode) was originally introduced in kernel 2.6.12 as a single strict mode: once enabled, a process could call only read, write, _exit, and sigreturn. The BPF extension added in kernel 3.5 transformed seccomp into a powerful filtering mechanism. Modern seccomp-bpf allows fine-grained decisions based on the system call number and its argument values, and later additions let a filter hand selected calls to a tracer or user space supervisor for debugging and interposition.

Capabilities split traditional root privileges into discrete units. Instead of having all privileges (or none), a process can have specific capabilities like CAP_NET_BIND_SERVICE to bind to privileged ports or CAP_SYS_ADMIN for system administration tasks. This fine-grained model enables precise privilege assignment.

When to Use / When Not to Use

System call security mechanisms are essential when running untrusted code, implementing security-critical services, or building container runtimes. Any application that processes external input from potentially malicious sources should consider syscall filtering.

Container orchestration platforms like Docker and Kubernetes rely heavily on these mechanisms. When you run a container, the runtime configures namespace isolation and capability sets to confine what the containerized process can do. Understanding these mechanisms helps you debug container security issues and design more secure deployments.

However, these mechanisms add complexity. Development and debugging become harder when system calls are blocked. Some applications require capabilities that are difficult to grant without introducing risk. In development environments or systems without external exposure, the overhead may not justify the benefits.

Architecture or Flow Diagram

graph TD
    A[User Space Process] -->|syscall| B[System Call Entry]
    B --> C{Seccomp Filter Active?}
    C -->|No| D[Execute System Call]
    C -->|Yes| E[BPF Program Runs]
    E --> F{BPF Rules Allow?}
    F -->|No| G[Deliver Signal<br/>EPERM]
    F -->|Yes| D
    D --> H[Return to User Space]

    subgraph Capabilities Check
    I{CAP_SYS_ADMIN?}
    I -->|Yes| J[Privileged Operation]
    I -->|No| K[Permission Denied]
    end

    style G fill:#ff6b6b
    style K fill:#ff6b6b

The system call filtering flow begins when a process makes a system call. The kernel intercepts this call and checks whether a seccomp filter is installed for the process. If a filter exists, the BPF program runs and evaluates the system call number and its arguments. Based on the filter rules, the BPF program either returns a verdict allowing the call, blocks it with an error, or triggers a signal to the process.

Capabilities are checked independently for specific privileged operations. The kernel maintains a set of permitted and effective capabilities for each process. When a privileged operation is attempted, the kernel verifies the process has the required capability before proceeding.

Core Concepts

Seccomp Modes

Linux provides two seccomp modes, plus a user-notification extension to filter mode that this article treats as a third.

Seccomp Mode 1 (SECCOMP_MODE_STRICT) allows only read, write, _exit, and sigreturn system calls. Any other system call results in process termination. This mode is rarely used in practice due to its extreme restrictions.

Seccomp Mode 2 (SECCOMP_MODE_FILTER) enables BPF filtering with system call arguments. This mode is what people typically refer to when discussing seccomp. It allows fine-grained control but requires writing BPF programs.

Seccomp User Notifications (added in kernel 5.0) let a filter hand a decision to a user space supervisor by returning SECCOMP_RET_USER_NOTIF. This is an action available to Mode 2 filters rather than a separate kernel mode, and container runtimes use it to mediate or emulate selected syscalls.

Mode 1 barely surfaces in modern systems. The allowed set (read, write, exit, sigreturn) is so restrictive that essentially no application can function under it. You might encounter it in kiosk software or deeply embedded devices where every binary is known in advance, but production servers never use it.

Mode 2 is where seccomp becomes practical. The BPF filter can inspect both the system call number and the raw argument values before the kernel executes the call. Arguments are visible only as integers: the filter cannot dereference pointers, so it can permit write only to certain file descriptors or restrict socket to specific address families, but it cannot check that the path passed to openat stays inside a specific directory. Path-based rules belong to an LSM such as AppArmor, SELinux, or Landlock. The filter runs in kernel context and must terminate quickly. The kernel validates each filter at load time, and because classic BPF only has forward jumps, a filter cannot loop or access memory out of bounds. Docker’s default seccomp profile is a Mode 2 filter that blocks about 44 syscalls known to enable container escapes while allowing the remaining ~300 syscalls typical applications need.

What this article calls Mode 3 arrived in kernel 5.0 and changed the model. Strictly speaking it is a filter action, SECCOMP_RET_USER_NOTIF, rather than a separate kernel mode. Instead of the kernel making the allow/deny decision internally, the filter forwards selected syscalls to a user space supervisor via a file descriptor. That supervisor reads the syscall number and arguments, decides whether to permit it, and sends back a verdict through an ioctl. Container runtimes use it to build policy engines that track state across a process lifetime, since the supervisor can maintain context that a static BPF program cannot. The cost is latency: every forwarded syscall requires a context switch from kernel to user space and back. For high-frequency syscalls this overhead matters, which is why most seccomp deployments use Mode 2 with BPF and reserve Mode 3 for cases that genuinely need userspace logic.

The three modes represent increasingly flexible tradeoffs between security and overhead. Mode 1 offers the strongest guarantees but zero flexibility. Mode 2 with BPF handles the majority of production use cases where you know your application’s syscall surface in advance, which makes it the common choice for production seccomp filters. Mode 3 is specialized infrastructure for tools that need dynamic policy decisions. The code below demonstrates how Mode 2 works.

BPF for Seccomp Filtering

Seccomp BPF programs follow the classic BPF execution model. The kernel runs the BPF program against a read-only seccomp_data structure holding the system call number, architecture, instruction pointer, and arguments. The program returns a verdict: SECCOMP_RET_ALLOW permits the call, SECCOMP_RET_KILL_PROCESS terminates the process (the older SECCOMP_RET_KILL kills only the calling thread), SECCOMP_RET_ERRNO returns a specified error code, and SECCOMP_RET_TRAP delivers SIGSYS.

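#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

// Returns 0 on success, -1 on failure (errno is set by prctl)
int install_seccomp_filter(void) {
    struct sock_filter filter[] = {
        // Reject unexpected syscall ABIs first (this sketch assumes x86_64)
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL_PROCESS),
        // Load system call number
        BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),
        // Allow read
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        // Allow write
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        // Allow exit
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        // Allow exit_group (glibc uses it to end the whole process)
        BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 0, 1),
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
        // Kill the process for everything else
        BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL_PROCESS),
    };

    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    // An unprivileged process must set no_new_privs before installing a filter
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}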
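
Argument inspection works the same way as matching the syscall number. The following is a minimal sketch, not a complete policy: it allows everything except write calls whose file descriptor is not 1 or 2, which fail with EBADF. The names write_fd_filter and install_write_fd_filter are made up for this example, only x86_64 and aarch64 are handled, and a production filter on x86_64 would also need to deal with the x32 ABI.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Assumption: the sketch only knows these two little-endian architectures. */
#if defined(__x86_64__)
#define FILTER_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define FILTER_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch covers x86_64 and aarch64 only"
#endif

/* Hypothetical example filter: write() is allowed only on fd 1 and fd 2. */
static struct sock_filter write_fd_filter[] = {
    /* 0: load the architecture and kill on a mismatch */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* 1 */ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, FILTER_ARCH, 1, 0),
    /* 2 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* 3: load the syscall number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* 4: anything other than write jumps straight to ALLOW (index 9) */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 0, 4),
    /* 5: load the low 32 bits of the first argument (the fd) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
    /* 6: fd == 1 -> ALLOW */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 1, 2, 0),
    /* 7: fd == 2 -> ALLOW, otherwise fall through to the error */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 2, 1, 0),
    /* 8: deny with EBADF instead of killing */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EBADF & SECCOMP_RET_DATA)),
    /* 9 */ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Installs the filter on the calling thread; returns 0 on success, -1 on error. */
int install_write_fd_filter(void) {
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(write_fd_filter) / sizeof(write_fd_filter[0])),
        .filter = write_fd_filter,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

After install_write_fd_filter() succeeds, a write to any other descriptor should fail with EBADF while writes to stdout and stderr continue to work. Because the filter compares integers only, the same pattern cannot be applied to pathnames or buffers.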

Linux Capabilities

Traditional Unix distinguished between privileged processes (those running as root with UID 0) and unprivileged processes. Capabilities split these privileges into discrete units.

Permitted capabilities are those the process is allowed to use. Effective capabilities are the ones the kernel actually checks. Inheritable capabilities are preserved across execve, but they reach the new program's permitted set only if the executable's file capabilities also mark them inheritable. The bounding set limits which capabilities can ever be acquired.

# View capabilities of a process
getpcaps $$

# Capabilities explanation
# CAP_NET_BIND_SERVICE - Bind to ports below 1024
# CAP_SYS_ADMIN - Most system administration operations
# CAP_SYS_CHROOT - Use chroot()
# CAP_NET_RAW - Use raw sockets
# CAP_DAC_OVERRIDE - Bypass file permission checks

# Remove one capability from the bounding set and print the resulting state (run as root)
capsh --drop=cap_net_raw --print

Production Failure Scenarios + Mitigations

Scenario: Seccomp Blocks Required System Call

Problem: Application fails because seccomp filter blocks a system call the application legitimately needs.

Symptoms: Process receives SIGKILL or SIGSYS signals, application crashes with seccomp violations in logs.

Mitigation: Use auditd to identify required system calls before deploying seccomp filters. Generate allowlists by running applications under strace to capture all system calls made during normal operation. Test filters thoroughly in staging before production deployment.

# Record every system call made by the application and its children
strace -f -o app_trace.log ./your_application

# Extract the unique syscall names to seed an allowlist
sed -nE 's/^([0-9]+ +)?([a-z_0-9]+)\(.*/\2/p' app_trace.log | sort -u

Scenario: Capability Misconfiguration Locks Out Administrator

Problem: Administrative script loses required capabilities and cannot perform necessary operations.

Symptoms: Scripts fail with “Operation not permitted” errors, cron jobs do not run properly, backup scripts fail.

Mitigation: Document required capabilities for each administrative function. Use capability-aware monitoring tools. When dropping capabilities, keep only the minimum set each script needs. Do not leave CAP_SYS_ADMIN on a service as a safety net; keep a separate, audited root path (console or single-user access) for recovery operations.

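# Test capability requirements before deployment: start as root, switch user,
# and pass the capabilities on as ambient so they survive the exec of the script
sudo capsh --caps="cap_setpcap,cap_setuid,cap_setgid+ep cap_net_bind_service,cap_dac_read_search+eip" \
    --keep=1 --user=username --addamb=cap_net_bind_service --addamb=cap_dac_read_search \
    -- -c "./your_script"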

# Recover from lockout - reboot to single user mode
# Add capability requirements to script comments

Scenario: Container Escape via Syscall

Problem: Vulnerability in container allows malicious container contents to escape namespace isolation.

Symptoms: Unauthorized access to host system files, processes outside container visible, network traffic bypasses container network.

Mitigation: Keep kernel updated to patch container escape vulnerabilities. Use seccomp profiles that block syscalls known to be exploited for container escapes (like unshare with certain flags). Follow container security best practices like not running containers as root.

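# Docker applies its default seccomp profile (blocks ~44 syscalls) automatically.
# Pass a profile explicitly only when substituting your own copy:
docker run --security-opt seccomp=default.json ...

# Block a dangerous syscall with a custom profile.
# Note: this replaces the default profile, and allow-by-default is weaker than it.
cat > deny-unshare.json <<'EOF'
{"defaultAction":"SCMP_ACT_ALLOW",
 "syscalls":[{"names":["unshare"],"action":"SCMP_ACT_ERRNO"}]}
EOF
docker run --security-opt seccomp=deny-unshare.json ...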

Trade-off Table

Mechanism	Granularity	Performance Impact	Complexity
Seccomp BPF	System call number and arguments	Minimal (< 1% overhead)	High (requires BPF knowledge)
Capabilities	Individual privileges	None	Medium (must understand caps)
SELinux/AppArmor	File and network access rules	Minimal	Very High (policy language)
Namespaces	Resource isolation	Minimal	Medium

Capability Type	Purpose	Risk if Misused
CAP_SYS_ADMIN	System administration	Full system compromise
CAP_NET_ADMIN	Network configuration	Network sniffing, routing changes
CAP_NET_RAW	Raw packet creation	Spoofing, reconnaissance
CAP_SYS_CHROOT	Change root directory	Directory traversal
CAP_DAC_OVERRIDE	Bypass permission checks	Read/write anything

Seccomp Verdict	Behavior	Use Case
SECCOMP_RET_ALLOW	Permit system call	Whitelist allowed calls
SECCOMP_RET_KILL_PROCESS	Kill the whole process (SECCOMP_RET_KILL kills only the calling thread)	Default deny
SECCOMP_RET_ERRNO	Return error to process	Silent blocking
SECCOMP_RET_TRAP	Deliver SIGSYS to the process	Notification without kill
SECCOMP_RET_USER_NOTIF	Forward to userspace	Supervisor-based decisions

Implementation Snippets

Basic Seccomp with libseccomp

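#define _GNU_SOURCE
#include <seccomp.h>
#include <unistd.h>

int main(void) {
    // Default action: kill on any syscall that is not explicitly allowed
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (!ctx) return 1;

    // Allow read, write, and both exit calls
    // (glibc ends the process with exit_group, so exit alone is not enough)
    int rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    if (rc == 0) rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    if (rc == 0) rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
    if (rc == 0) rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    // Allow the open syscall only. openat stays blocked, and modern glibc
    // implements open() and fopen() with openat, so those wrappers still fail.
    if (rc == 0) rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(open), 0);

    // Load filter (libseccomp returns a negative errno value on failure)
    if (rc == 0) rc = seccomp_load(ctx);
    seccomp_release(ctx);
    if (rc < 0) return 1;

    // Now run restricted code; write() avoids the extra syscalls stdio may make
    static const char msg[] = "Running with seccomp filter active\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    return 0;
}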

Checking and Modifying Capabilities

Before locking down a process with capabilities, figure out what it actually needs. Run getpcaps against the PID to see what the process currently holds. CAP_NET_BIND_SERVICE lets a process bind to ports below 1024. CAP_SYS_ADMIN is wide open: namespace creation, mount operations, swap and hostname configuration, it covers a lot of ground. CAP_NET_RAW opens up raw sockets for packet crafting. CAP_DAC_OVERRIDE lets a process bypass standard Unix permission checks entirely.

When dropping capabilities, you are working with four distinct sets. The permitted set is what the process is allowed to acquire; it cannot raise capabilities outside this set into its effective set. The effective set is what is currently active; the kernel checks only these during privileged operations. The inheritable set is preserved across execve calls, but a capability in it reaches the new program's permitted set only if the executable's file capabilities also list it as inheritable. The bounding set is a kernel-enforced ceiling: no process can ever acquire a capability outside its bounding set, even if the binary on disk has that capability set. Once you drop something from the bounding set, it is gone for the life of that process tree.
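
The sets described above are exposed as hexadecimal bitmasks in /proc/PID/status (the CapInh, CapPrm, CapEff, and CapBnd fields). The sketch below shows one way to check a single capability in such a mask; the helper names cap_mask_has and read_own_cap_field are invented for this example, and the capability numbers come from linux/capability.h.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <linux/capability.h>

/* Return 1 if capability number `cap` is set in a hex mask such as the
 * CapEff value from /proc/<pid>/status, otherwise 0. */
int cap_mask_has(const char *hex_mask, int cap) {
    unsigned long long mask = strtoull(hex_mask, NULL, 16);
    return (int)((mask >> cap) & 1ULL);
}

/* Read one capability field ("CapInh", "CapPrm", "CapEff", "CapBnd") of the
 * calling process. Returns 0 and stores the mask in *out, or -1 on failure. */
int read_own_cap_field(const char *field, unsigned long long *out) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f)
        return -1;

    char line[256];
    size_t field_len = strlen(field);
    int rc = -1;
    while (fgets(line, sizeof(line), f)) {
        /* Lines look like "CapEff:\t0000000000000400" */
        if (strncmp(line, field, field_len) == 0 && line[field_len] == ':') {
            *out = strtoull(line + field_len + 1, NULL, 16);
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}
```

For example, cap_mask_has("0000000000000400", CAP_NET_BIND_SERVICE) reports that the capability is present, because CAP_NET_BIND_SERVICE is bit 10. Comparing CapEff against CapBnd for a running service is a quick way to confirm that a drop actually took effect.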

capsh is the main tool for working with all of this. It prints current capability state, drops specific capabilities, sets the user and group the executable runs as, and runs test commands under a chosen capability environment. The +ep notation means “effective and permitted”: the capability is available and currently active. Adding i (as in +eip) also places it in the inheritable set. When writing startup scripts for capability-restricted services, check the resulting state with capsh --print before going live, and log the final configuration so operators can audit it later.

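#!/bin/bash
# Run the application as appuser with only CAP_NET_BIND_SERVICE

# Must start as root. cap_setpcap, cap_setuid and cap_setgid are needed only for
# the user switch; the ambient set carries cap_net_bind_service across the exec.
exec capsh --caps="cap_setpcap,cap_setuid,cap_setgid+ep cap_net_bind_service+eip" \
    --keep=1 --user=appuser --addamb=cap_net_bind_service \
    -- -c "./my_application"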

Using strace with Seccomp

Tracing syscalls in a seccomp-filtered process takes some care. strace attaches with ptrace and, on kernel 4.8 and later, sees each syscall at entry before the seccomp filter runs, so a denied call still appears in the trace. When the filter kills the process, the last syscall printed before the “+++ killed by SIGSYS +++” line is usually the culprit. SECCOMP_RET_USER_NOTIF is a separate mechanism: it passes syscall events to a user space supervisor instead of terminating, but it does not replace ptrace for strace.

If you can restart the workload, reproduce the failure in a test environment with the filter relaxed, for example docker run --security-opt seccomp=unconfined for a container. If you cannot touch the profile (production system, no restart), attach strace externally with strace -p PID. This needs permission to ptrace the target, which kernel.yama.ptrace_scope or a missing CAP_SYS_PTRACE can deny.

The real value is building an allowlist. Run strace -f -c -o trace.log ./your_application to get a table of every syscall, how often it fires, and how many calls returned errors. libseccomp and Docker profiles accept syscall names directly; if you write raw BPF, map names to numbers with ausyscall --dump or the __NR_ definitions in /usr/include/asm/unistd_64.h. The syscall counts show which calls the application actually depends on, and a high error count flags calls worth a closer look before you allow them.

# Trace system calls in a seccomp-filtered process
# A denied call shows up as the last syscall before "+++ killed by SIGSYS +++"

strace -k -f ./seccomp_restricted_program 2>&1 | less

# -k shows stack traces for each syscall (needs an strace build with unwinding support)
# -f follows forked processes

# Build a syscall allowlist from a representative run
strace -f -c -o app_trace.log ./your_application
# Then review: syscall counts, errors, and frequency
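
When a full trace (strace -f -o FILE, without -c) is post-processed by a program rather than by hand, the syscall name has to be pulled out of each line. The following is a small sketch of that step, assuming the usual "PID name(args) = result" line shape; strace_syscall_name is a hypothetical helper, not part of strace.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Extract the syscall name from one line of strace output, for example
 * "1234 openat(AT_FDCWD, ...) = 3". Returns 1 and fills `out` on success,
 * or 0 for lines that are not syscall entries (signals, exit notices,
 * "<... resumed>" continuations) or when `out` is too small. */
int strace_syscall_name(const char *line, char *out, size_t out_size) {
    const char *p = line;

    /* Skip the optional leading PID and the spaces after it */
    while (isdigit((unsigned char)*p))
        p++;
    while (*p == ' ')
        p++;

    /* The name is a run of letters, digits, and underscores before '(' */
    const char *start = p;
    while (isalnum((unsigned char)*p) || *p == '_')
        p++;

    size_t len = (size_t)(p - start);
    if (len == 0 || *p != '(' || len >= out_size)
        return 0;

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Feeding every line of the trace through this function and collecting the distinct names gives the raw material for an allowlist, which still needs manual review before it becomes policy.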

Observability Checklist

System Call Monitoring:

Enable auditd rules to log specific system calls
Monitor for unexpected system call patterns
Track seccomp filter violations

Capability Monitoring:

Regularly audit capability assignments with getpcaps
Alert on unexpected capability changes
Document capability requirements for each service

Kernel Parameters:

kernel.yama.ptrace_scope - Control ptrace visibility
kernel.dmesg_restrict - Restrict kernel message access
fs.suid_dumpable - Control core dump behavior for setuid programs

Configuration Validation:

Test seccomp filters in staging before production
Verify capability drops after privilege operations
Validate namespace configurations for containers

Security/Compliance Notes

Defense in Depth: Never rely on a single security mechanism. Layer seccomp filtering, capabilities, and mandatory access controls for comprehensive protection.

Minimal Privilege Principle: Grant only the capabilities absolutely required for a service to function. Additional privileges increase attack surface and potential damage from compromise.

Audit Trail Requirements: Many compliance frameworks require logging of privileged operations. Configure auditd to log security-relevant system calls and capability usage.

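# Audit rule to log capability changes made through capset()
echo '-a always,exit -F arch=b64 -S capset -k cap_change' >> /etc/audit/rules.d/capability.rules

# Audit rule to log installation of seccomp filters. Violations themselves
# appear as SECCOMP audit records, searchable with: ausearch -m SECCOMP
echo '-a always,exit -F arch=b64 -S seccomp -k seccomp' >> /etc/audit/rules.d/seccomp.rules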

Container Security Context: When running containers in production, use security contexts to restrict capabilities and system calls.

apiVersion: v1
kind: Pod
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      securityContext:
        capabilities:
          drop:
            - ALL
          add:
            - NET_BIND_SERVICE

Common Pitfalls / Anti-patterns

Allowing execve Under Seccomp: A BPF filter sees the pathname argument of execve only as a pointer it cannot dereference, so it cannot restrict which program gets executed. If a filter allows execve, a compromised process can execute any program it can reach, although the filter stays in force across the exec. Block execve or carefully audit what the new program could do under the same filter.

Capability Leaks After Fork: When a privileged process forks, child processes inherit capabilities. If the child drops privileges incorrectly, it may retain capabilities it should not have. Always re-evaluate capabilities after fork() and execve().

Ignoring Ambient Capabilities: Some capabilities persist across execve of programs that are not file-capability aware. The securebits flags control whether capabilities survive execve. Misunderstanding these flags leads to security bypasses.

Seccomp Filter Performance: While BPF filtering has minimal overhead, overly complex filters with many rules can add latency. Optimize filters by grouping similar rules and using jumps efficiently.

TOCTOU in Capability Checks: Time-of-check-time-of-use vulnerabilities can occur when capabilities are checked before an operation but the operation happens later with different privileges. Structure code to perform operations immediately after privilege acquisition.

Quick Recap Checklist

System calls are the boundary between user space and kernel space
Seccomp BPF filtering allows fine-grained control over allowed system calls
Capabilities split root privileges into discrete units
Least privilege principle means granting minimum necessary access
Container runtimes use these mechanisms for isolation
Production failures often involve blocked required system calls
Comprehensive testing in staging prevents deployment issues
Multiple security mechanisms should be layered together
Audit logging supports compliance and incident investigation
Understanding these mechanisms is essential for container security

Interview Questions

1. What is the difference between seccomp mode 1 and seccomp mode 2?

Seccomp mode 1 (SECCOMP_MODE_STRICT) allows only read, write, exit, and sigreturn system calls. Any other system call terminates the process immediately. Seccomp mode 2 (SECCOMP_MODE_FILTER) uses BPF programs to filter system calls based on the call number and arguments, allowing much finer-grained control over which system calls are permitted.

2. Explain the principle of least privilege as it applies to system call security.

Least privilege means granting a process only the minimum privileges required to perform its function. Rather than running as root with access to everything, a process should run with a specific limited set of capabilities and seccomp filters that block all unnecessary system calls. If a vulnerability is discovered in a least-privileged process, the attacker cannot escalate privileges or access resources beyond what the legitimate process needed.

3. What is the purpose of capabilities in Linux?

Capabilities split the traditional root privilege (UID 0) into discrete units that can be granted independently. Instead of having all privileges or none, a process can have specific capabilities like CAP_NET_BIND_SERVICE to bind to ports below 1024, or CAP_SYS_ADMIN for system administration tasks. This fine-grained model enables precise privilege assignment and reduces the risk from privilege escalation vulnerabilities.

4. How would you debug a seccomp filter that is blocking a legitimate system call?

First, use strace or auditd to identify which system call is being blocked and why. The kernel logs seccomp violations that can be viewed with ausearch or by checking /proc/PID/status for Seccomp mode. Once identified, update the BPF filter to allow the necessary call, potentially with argument filtering if the call itself is legitimate but some uses are not.
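
Checking the Seccomp field can be automated. The sketch below parses the text of a /proc/PID/status file; seccomp_mode_from_status is a name chosen for this example, and the caller is assumed to have read the file into a string already.

```c
#include <stdio.h>
#include <string.h>

/* Parse the "Seccomp:" line from the text of /proc/<pid>/status.
 * Returns 0 (disabled), 1 (strict), 2 (filter), or -1 if the field is
 * missing or malformed. */
int seccomp_mode_from_status(const char *status_text) {
    const char *p = status_text;

    while (p && *p) {
        /* Match only at the start of a line; "Seccomp_filters:" does not match */
        if (strncmp(p, "Seccomp:", 8) == 0) {
            int mode;
            if (sscanf(p + 8, "%d", &mode) == 1)
                return mode;
            return -1;
        }
        /* Advance to the character after the next newline, if any */
        p = strchr(p, '\n');
        if (p)
            p++;
    }
    return -1;
}
```

A result of 2 confirms that a filter is attached, which tells you the failure may be policy-related before you start reading audit records.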

5. What are the security implications of running a container with --privileged?

A privileged container effectively runs with all capabilities and bypasses most security restrictions that normally protect the host. It can access all devices, modify network configuration, load kernel modules, and potentially escape container isolation. Privileged containers should be avoided in production. Instead, use specific capability grants and seccomp profiles that allow only necessary privileges.

6. How does the BPF verifier work in the context of seccomp filtering?

Seccomp filters are classic BPF (cBPF) programs, and the kernel validates them before they are attached, ensuring they cannot crash the kernel or run indefinitely. The check is simpler than the eBPF verifier's register-range analysis: classic BPF only has forward jumps, so every program terminates, the instruction count is capped, and each instruction must come from a small permitted set.

For seccomp specifically, the checker only permits aligned 32-bit loads from within the seccomp_data structure. A seccomp filter cannot read arbitrary kernel memory or follow pointers; it can only read the system call number, architecture, instruction pointer, and raw arguments provided in the seccomp_data struct. Return values are not validated at load time; an unrecognized verdict is treated as a kill at run time. This is what makes seccomp BPF safe despite running in kernel context.

7. What is the kernel-user boundary and why does crossing it matter for security?

The kernel-user boundary is the protection domain boundary between user space (Ring 3, restricted instruction set, no direct hardware access) and kernel space (Ring 0, full hardware access). Crossing this boundary requires a context switch: from user registers and stack to kernel registers and stack, plus permission checks. System calls are the controlled crossing points—each syscall is a deliberate transition where the kernel validates the request before performing privileged operations.

Security implications: user space cannot directly manipulate hardware, access arbitrary memory, or execute privileged instructions. Any operation requiring these must go through the kernel via a syscall, giving the kernel a chokepoint to enforce security policy. Attacks that exploit the boundary include syscall smuggling (passing malformed arguments that the kernel misinterprets), return-oriented programming that chains syscall gadgets, and kernel exploit techniques that manipulate data visible across the boundary.

8. What is the difference between discretionary access control (DAC) and mandatory access control (MAC)?

Discretionary Access Control (DAC), the traditional Unix permission model, allows owners of objects (files, processes) to decide who can access them. The owner can modify permissions at their discretion (hence the name). Root can bypass DAC checks entirely. Examples: standard Unix rwx permissions, ACLs, and POSIX permissions.

Mandatory Access Control (MAC) enforces system-wide security policies that users cannot override, even with root privileges. The kernel mediates every access based on a security policy database, regardless of the owner's wishes. Examples: SELinux, AppArmor, and Smack. Seccomp is often layered alongside them, but it is a self-imposed syscall filter rather than a MAC system. A MAC policy might label web servers as httpd_t and deny them access to user home directories, even if root runs the process and the files are world-readable under DAC.

9. How do you construct a minimal seccomp filter for a specific application?

A practical approach: run the application under strace during representative operation to capture all syscalls made, extract the unique syscall names and arguments, then construct a filter that allows only those calls. Example using libseccomp to allow read, write, the exit calls, and openat:

scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 0);
seccomp_load(ctx);

Test extensively in staging before production. During rollout you can set the default action to SCMP_ACT_LOG (kernel 4.14+), which records unexpected syscalls rather than killing, then refine the policy based on real usage patterns. The resulting SECCOMP audit records help identify which syscalls are actually needed in production.

10. What are the security implications of running a container without seccomp and with --privileged?

Running a container without seccomp means the container can make any syscall the kernel supports. Without seccomp, an exploit that achieves code execution inside the container has access to the full syscall surface, including dangerous calls like ptrace, mount, unshare, and syslog that are rarely needed by applications but commonly used in container escapes.

Running with --privileged grants all capabilities and disables most of the confinement Docker normally applies: the default seccomp and AppArmor or SELinux profiles are turned off and host devices are exposed. The container can access all devices (including /dev/mem), mount filesystems, modify network configuration, load kernel modules (CAP_SYS_MODULE), and write to /proc/sysrq-trigger. A privileged container running as UID 0 can often escape to host root. These flags should never be used in production; use specific capability grants and seccomp profiles instead.

11. How does strace interact with seccomp and what can it tell you about syscall usage?

strace uses the ptrace syscall to attach to a target process and intercept syscalls by stopping the process at each entry/exit and reading the syscall number and arguments. It can also install a seccomp-bpf filter of its own (the --seccomp-bpf option) so that only the syscalls being traced stop the process, which reduces overhead. SECCOMP_RET_USER_NOTIF (kernel 5.0+) is a different mechanism: it forwards syscall information to a user space supervisor via a file descriptor so the supervisor can decide the outcome, and it is aimed at policy and emulation rather than tracing.

You can capture all syscalls an application makes by running strace -f -c -o trace.log ./program, which summarizes syscall counts and times. To build a seccomp allowlist, extract syscall names from the trace; libseccomp accepts names directly, and ausyscall --dump or the kernel's unistd headers give the numbers if you need them. Note that some syscalls (like nanosleep or gettimeofday) may appear disproportionately due to timing calls.

12. What is the relationship between namespaces and seccomp in container security?

Namespaces and seccomp address different attack surfaces. Namespaces isolate resource views (PID trees, network interfaces, mount points, UTS hostname, IPC mechanisms, user UIDs) so that a container cannot see or interfere with resources belonging to other containers or the host. Seccomp restricts the syscalls a process can make, limiting the kernel operations available even if the process escapes namespace isolation.

Together they form defense in depth: namespace isolation prevents the container from seeing sensitive resources, and seccomp prevents exploitation of kernel syscalls that could break namespace boundaries. A container in its own network namespace cannot see host network devices, but without seccomp it can still reach syscalls such as unshare, mount, or keyctl, whose kernel code paths have been used to break isolation. A seccomp filter that denies those calls closes this gap. Docker's default seccomp profile blocks ~44 syscalls known to be problematic for container isolation.

13. What is user namespace security and how does it relate to container root privileges?

User namespaces map UIDs inside the container to different UIDs on the host. Inside a user namespace, UID 0 (root) can map to UID 100000 on the host. This means container root is not actually root on the host—it cannot modify host files, cannot bind to privileged ports without mapping that capability, and cannot see host processes.

Rootless containers (Docker rootless mode, Podman) use user namespaces to run without any root privileges on the host. Even if an attacker escapes the container and gains container-root, they only have the mapped UID range, not actual root. User namespaces also enable safer nested containers. However, kernel bugs reachable through user namespaces (CVE-2022-0492 is one example) have allowed privilege escalation, so combining user namespaces with seccomp and capabilities provides defense in depth.

14. How do seccomp user notifications enable sophisticated security policies?

Seccomp user notifications (kernel 5.0+) allow a user space supervisor to receive and respond to syscalls that the seccomp filter defers. The filter is installed with the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, which returns a notification file descriptor that is handed to the supervisor. When the filter returns SECCOMP_RET_USER_NOTIF for a syscall, the calling thread blocks, and the supervisor reads the syscall information with the SECCOMP_IOCTL_NOTIF_RECV ioctl, decides how to handle it, and responds with the SECCOMP_IOCTL_NOTIF_SEND ioctl.

This enables patterns like: a service that normally runs without a capability but needs it for one specific operation (like binding to a privileged port), where a supervisor intercepts the bind() call, performs the operation on behalf of the process, and returns a mocked success. It also enables strace-like functionality without ptrace, audit daemons that log selected syscalls, and custom handling that goes beyond simple allow/deny. One caveat: the supervisor reads pointer arguments from the target's memory, which the target can change concurrently, so decisions based on those contents are racy. The mechanism is better suited to emulating syscalls than to enforcing policy on their arguments.

15. What are the security differences between AppArmor and SELinux for system call filtering?

Neither AppArmor nor SELinux filters syscalls by number the way seccomp does. Both are Linux Security Modules that mediate access to the objects a syscall touches, so they complement seccomp rather than replace it. AppArmor uses pathname-based mandatory access control. A process is confined by a profile that specifies what files it can access (by path, not by inode), what capabilities it can use, and what network operations are permitted. AppArmor is considered easier to debug because file paths are more intuitive than security contexts.

SELinux (Security-Enhanced Linux) operates on security contexts (labels) assigned to all system subjects (processes) and objects (files, sockets, ports). The policy engine evaluates whether a subject with a given security context can perform an operation on an object with another security context. SELinux is more powerful and flexible but significantly more complex to configure: policies are written in a dedicated policy language, compiled into modules, and loaded into the kernel, and every new file or port needs a correct label before the policy behaves as intended.

16. What is the difference between CAP_SYS_ADMIN and CAP_SYS_MODULE?

CAP_SYS_ADMIN is the broadest capability, covering a wide range of system administration operations: mounting filesystems, creating namespaces, configuring swap, setting hostname and domainname, managing quotas, and more. Its breadth makes it nearly equivalent to root, since holding CAP_SYS_ADMIN allows bypassing many permission checks. This is why CAP_SYS_ADMIN is the primary target for restriction in container environments.

CAP_SYS_MODULE allows loading and unloading kernel modules via the init_module, finit_module, and delete_module syscalls. Loading a malicious kernel module is one of the most powerful attack vectors because the module runs in kernel context with full hardware access. Restricting CAP_SYS_MODULE prevents containers from loading kernel modules, which closes an escape vector where a container could load a module that manipulates the host kernel.
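To check whether a process actually holds either capability, you can decode the CapEff bitmask from /proc/PID/status. The sketch below assumes the bit numbers from linux/capability.h (CAP_SETUID 7, CAP_NET_BIND_SERVICE 10, CAP_SYS_MODULE 16, CAP_SYS_ADMIN 21); the helper name and the sample masks are illustrative only.

```python
# Bit positions taken from <linux/capability.h>; only a few are listed here.
CAP_BITS = {
    "CAP_SETUID": 7,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_SYS_MODULE": 16,
    "CAP_SYS_ADMIN": 21,
}


def held_capabilities(status_text):
    """Return the names from CAP_BITS that are set in the CapEff line."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)  # the field is a hex bitmask
            return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}
    raise ValueError("no CapEff line found")


# Sample masks: a reduced container-style set and a full set.
REDUCED = "Name:\tdemo\nCapEff:\t00000000a80425fb\n"
FULL = "Name:\tdemo\nCapEff:\t000001ffffffffff\n"

print(sorted(held_capabilities(REDUCED)))
print(sorted(held_capabilities(FULL)))
```

On a live system the same function can be fed the contents of /proc/self/status, and capsh --decode offers a ready-made alternative for decoding a mask by hand.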

17. What is the seccomp filter that Docker's default profile installs?

Docker's default seccomp profile is built into the Docker daemon and applied to every container automatically, unless you override it with --security-opt seccomp=/path/to/profile.json or disable it with --security-opt seccomp=unconfined. It is an allowlist: the default action denies the call, and the roughly 44 syscalls it leaves out are ones considered dangerous for container workloads. Without extra capabilities it denies, among others: syslog (reading kernel logs), mount, umount, and umount2 (filesystem manipulation), unshare and setns (namespace manipulation), name_to_handle_at and open_by_handle_at (file-handle access that can bypass path-based confinement), and perf_event_open (profiling with hardware counters from containers).

It explicitly allows more than 300 syscalls needed by typical applications. The profile is a JSON document whose rules can carry argument conditions, so some syscalls are allowed only for specific argument values (personality, for example, is permitted only with a short list of persona values) and others only when the container holds a matching capability. The default profile is a practical starting point but should be customized for specific workloads, because running a web server has different syscall requirements than running a database.
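A custom profile follows the same JSON shape: a default action plus rules that name syscalls and the action to take. The sketch below generates a minimal allowlist profile; the function name and the syscall list are placeholders, and the list is far too short to run a real container.

```python
import json


def build_profile(allowed_syscalls):
    """Build a minimal allowlist profile in Docker's seccomp JSON format."""
    return {
        # Anything not listed below fails with an error instead of running.
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Placeholder list; a real workload needs many more entries.
profile = build_profile(["read", "write", "openat", "close", "exit_group", "read"])
print(json.dumps(profile, indent=2))
```

Saved to a file, a profile like this is passed with --security-opt seccomp=profile.json. In practice it is usually safer to start from Docker's default profile and remove entries than to build an allowlist from nothing.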

18. What is kernel hardening and how do these syscall security mechanisms fit into a broader hardening strategy?

Kernel hardening encompasses compile-time and runtime protections that reduce kernel attack surface. Compiler options such as -fstack-protector-strong make exploits harder, and warnings such as -Wformat-security catch risky code at build time. Runtime hardening includes hiding kernel pointers and logs from unprivileged users (kernel.kptr_restrict, kernel.dmesg_restrict), disabling unused filesystems by blacklisting their kernel modules, and restricting raw memory access through /dev/mem (for example with CONFIG_STRICT_DEVMEM).
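The runtime settings are easy to audit from a script. The sketch below compares two sysctl values against minimums chosen for this example; the thresholds and function names are assumptions for illustration, not a published baseline.

```python
from pathlib import Path

# Minimum values treated as "hardened" in this sketch (an assumption, not a standard).
EXPECTED_MINIMUM = {
    "kernel/kptr_restrict": 1,
    "kernel/dmesg_restrict": 1,
}


def read_sysctls(root="/proc/sys"):
    """Read the audited settings from procfs, skipping any that are unavailable."""
    values = {}
    for key in EXPECTED_MINIMUM:
        try:
            values[key] = int((Path(root) / key).read_text().strip())
        except (OSError, ValueError):
            pass  # missing or unreadable: audit() will report it as 0
    return values


def audit(values):
    """Return the settings whose value is below the expected minimum."""
    return sorted(
        key for key, minimum in EXPECTED_MINIMUM.items()
        if values.get(key, 0) < minimum
    )


# Example input instead of live values, so the result does not depend on the host.
print(audit({"kernel/kptr_restrict": 0, "kernel/dmesg_restrict": 1}))
```

Calling audit(read_sysctls()) on a Linux host reports the settings that fall short there; a setting that cannot be read is treated as 0 and therefore flagged.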

Seccomp, capabilities, and namespaces are user-space-facing hardening mechanisms that restrict what a compromised or misbehaving process can do. Together with kernel hardening, they create defense in depth: even if an attacker finds a kernel vulnerability, the exploit's capabilities are limited by seccomp filters, reduced capability sets, and namespace isolation. A comprehensive hardening strategy layers all of these: minimal base image, dropped capabilities, seccomp filters, read-only rootfs, no-new-privileges flag, and SELinux/AppArmor policies.

19. How does the no_new_privs flag work and why should containers set it?

The no_new_privs flag, set with prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0), prevents a process and its descendants from gaining new privileges through setuid binaries or file capabilities. Any execve of a setuid binary or a file with capabilities runs with the caller's existing privileges; the escalation simply does not happen. Existing privileges (capabilities, UID mappings) are preserved, but no new ones can be acquired. The flag is inherited across fork, clone, and execve and cannot be cleared once set.

Containers should set no_new_privs because it removes setuid binaries and file capabilities as an escalation path. A compromised container process that tries to exec a setuid binary to gain root will not succeed. It does not stop kernel exploits, so it pairs well with dropping CAP_SETUID and other privilege-granting capabilities. In Kubernetes, setting allowPrivilegeEscalation: false in a container's securityContext causes the runtime to set this flag for that container.
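A process can set the flag on itself and confirm it from /proc. The sketch below is Linux-only and assumes a kernel recent enough to expose the NoNewPrivs field in /proc/PID/status (4.10 or later); the function names are made up for this example, and PR_SET_NO_NEW_PRIVS is the constant 38 from linux/prctl.h.

```python
import ctypes

PR_SET_NO_NEW_PRIVS = 38  # from <linux/prctl.h>


def set_no_new_privs():
    """Set no_new_privs on the calling process (irreversible, Linux only)."""
    libc = ctypes.CDLL(None, use_errno=True)
    one, zero = ctypes.c_ulong(1), ctypes.c_ulong(0)
    if libc.prctl(PR_SET_NO_NEW_PRIVS, one, zero, zero, zero) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_NO_NEW_PRIVS) failed")


def no_new_privs_enabled(status_text):
    """Parse the NoNewPrivs field; None means the kernel does not report it."""
    for line in status_text.splitlines():
        if line.startswith("NoNewPrivs:"):
            return line.split()[1] == "1"
    return None


# Example status excerpts rather than live data.
print(no_new_privs_enabled("Name:\tdemo\nNoNewPrivs:\t1\n"))
print(no_new_privs_enabled("Name:\tdemo\nNoNewPrivs:\t0\n"))
```

To try it on a live process, call set_no_new_privs() and then pass the contents of /proc/self/status to no_new_privs_enabled. Because the flag cannot be cleared, do this in a throwaway process.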

20. What is the attack surface of the execve syscall under seccomp?

The execve syscall presents a unique challenge for seccomp: it replaces the process image with a new binary's code and data, so if the filter allows execve, a compromised process can run any executable it can reach on the filesystem. The seccomp filter itself survives execve (filters are inherited across fork and execve and cannot be removed), so the new program is still confined to the same syscalls. The risk is that a general-purpose binary such as a shell gives the attacker far more flexible use of those allowed syscalls than the original program's code did.

Mitigation: either block execve and execveat entirely in the seccomp filter, or set PR_SET_NO_NEW_PRIVS so the new program cannot gain privileges through setuid bits or file capabilities. Additionally, combine with a read-only root filesystem so the attacker cannot introduce their own binary. A seccomp BPF filter cannot restrict execve to pre-approved binaries on its own, because it sees only the pointer to the pathname and cannot dereference it; path-based restrictions need an LSM such as AppArmor or SELinux. In containers, Docker's default seccomp profile does not block execve, making it important to pair with no_new_privs.

Conclusion

System call security sits at the boundary between user space and kernel space. Seccomp BPF, capabilities, and namespace isolation together form the foundation of container and kernel security. The principle of least privilege should guide every decision you make here: grant only the capabilities actually needed, and allow only the syscalls actually used. These mechanisms work best as a layered defense because each has blind spots that the others cover.

If you want to dig deeper, the BPF verifier source is surprisingly readable, seccomp user notifications enable some clever supervisor patterns, and studying how containerd implements syscall filtering will show you how the theory translates to production code.
