Performance Profiling
Master Linux performance profiling with perf, ftrace, BCC tools, and flame graphs to identify and eliminate kernel bottlenecks.
Introduction
Performance profiling is the systematic analysis of where time is spent during program execution. In operating systems, profiling serves two distinct but related purposes: understanding how applications use system resources and diagnosing where the kernel itself spends its time. Without profiling data, optimization is guesswork—developers assume hot paths rather than measure them.
Linux provides a rich ecosystem of profiling tools, each suited to different problem domains. perf excels at CPU-bound workloads and hardware event counting. ftrace provides function-level tracing with minimal overhead. The BPF Compiler Collection (BCC) enables custom analysis scripts without kernel patches. Flame graphs convert complex profiling data into visual call stacks that immediately reveal bottlenecks.
When to Use / When Not to Use
Performance profiling is appropriate when:
- Application runs slower than expected — Quantify where time is actually spent
- CPU utilization is high but throughput is low — Identify lock contention or memory pressure
- Investigating latency spikes — Trace request paths to find blocking operations
- Kernel performance tuning — Understand scheduler behavior, I/O patterns, or interrupt handling
Profiling is NOT appropriate when:
- The problem is obvious from logs — Don’t profile what you already understand
- System is in crisis — Fix known issues first; profiling takes time to set up correctly
- Working on a feature that doesn’t exist yet — Profile production or realistic benchmarks, not speculation
Architecture or Flow Diagram
flowchart TB
subgraph "Data Sources"
PMC[Hardware PMCs<br/>Performance Monitoring Counters]
SWC[Software Counters<br/>tracepoints, kprobes, uprobes]
TIM[Timers<br/>interval sampling]
end
subgraph "Collection Layer"
PERF[perf stat / record]
FTRACE[ftrace<br/>function / wakeup / irq]
BCC[BCC Tools<br/>Python / Lua frontends]
end
subgraph "Analysis Layer"
FG[FlameGraph<br/>stack collapse + svg]
RAPP[rapp<br/>rapport text reports]
STD[perf report<br/>text browser]
end
PMC --> PERF
SWC --> FTRACE
SWC --> BCC
PERF --> STD
PERF --> FG
FTRACE --> FG
BCC --> FG
style FG stroke:#ff6b6b,stroke-width:3px
Core Concepts
perf: The Swiss Army Knife
perf is the primary userspace interface to Linux performance monitoring:
# Basic CPU profiling - record stack traces for 30 seconds
perf record -F 99 -g -a -- sleep 30
# Analyze specific process
perf record -F 100 -g -p $PID -- sleep 10
# Hardware event counting
perf stat -e cycles,instructions,cache-misses -a -- sleep 5
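# Capture user-space stacks via DWARF unwinding (for binaries built without frame pointers)
perf record -a --call-graph dwarf -- sleep 10
# View results, limited to symbols whose names contain my_function
perf report --stdio --symbol-filter=my_function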
ftrace: Function-Level Tracing
ftrace provides kernel-internal tracing with minimal overhead:
# Enable function tracer (requires debugfs)
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo function > current_tracer
echo my_function > set_ftrace_filter
echo 1 > tracing_on
# Measure function latency with function_graph
echo function_graph > current_tracer
echo 1 > max_graph_depth
# Trace specific events - scheduler latency
echo 1 > events/sched/sched_switch/enable
echo 1 > events/sched/sched_wakeup/enable
cat trace | head -100
BCC Tools for Custom Analysis
BCC provides Python/Lua frontends for eBPF programs:
#!/usr/bin/env python3
"""Example BCC tool: trace openat() calls with PID, command name, and path"""
from bcc import BPF

program = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct data_t {
    u32 pid;
    u32 uid;
    char comm[TASK_COMM_LEN];
    char filename[256];
};
BPF_PERF_OUTPUT(events);

TRACEPOINT_PROBE(syscalls, sys_enter_openat) {
    struct data_t data = {};
    data.pid = bpf_get_current_pid_tgid() >> 32;
    data.uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;
    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    // Copy the path from user memory; the helper truncates and NUL-terminates
    bpf_probe_read_user_str(&data.filename, sizeof(data.filename), (void *)args->filename);
    // Send the record to user space
    events.perf_submit(args, &data, sizeof(data));
    return 0;
}
"""
b = BPF(text=program)

# Print one line per event
def print_event(cpu, data, size):
    event = b["events"].event(data)
    print(f"{event.pid:6d} {event.comm.decode(errors='replace'):16s} "
          f"{event.filename.decode(errors='replace')}")

b["events"].open_perf_buffer(print_event)
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        break
Generating Flame Graphs
#!/bin/bash
# Generate flame graph from perf data
PERF_DATA=perf.data
FLAMEGRAPH_DIR=/opt/FlameGraph
# Record profile
perf record -F 99 -a -g -o "$PERF_DATA" -- sleep 30
# Convert to flame graph
perf script -i "$PERF_DATA" | "$FLAMEGRAPH_DIR"/stackcollapse-perf.pl | \
"$FLAMEGRAPH_DIR"/flamegraph.pl > profile.svg
# Filter to a specific process by command name; plain grep would keep only
# the sample header lines and discard their stack frames
perf script -i "$PERF_DATA" --comm myservice | \
"$FLAMEGRAPH_DIR"/stackcollapse-perf.pl | \
"$FLAMEGRAPH_DIR"/flamegraph.pl > myservice.svg
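To make the collapse step less opaque, here is a minimal Python sketch of the folding idea: identical stacks are merged into one line of the form root;...;leaf count, which is the input flamegraph.pl expects. The function name collapse_stacks and the sample data are hypothetical, and the real stackcollapse-perf.pl does more (for example, it parses perf script text and prefixes the command name).

```python
from collections import Counter

def collapse_stacks(samples):
    """Fold stack samples into 'root;...;leaf count' lines.

    Each sample is a list of frames ordered leaf-first, which is the
    order perf script prints them in.
    """
    counts = Counter()
    for frames in samples:
        # Reverse so the root comes first, then join frames with ';'
        counts[";".join(reversed(frames))] += 1
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Hypothetical samples, not real perf output
samples = [
    ["memcpy", "index_scan", "main"],
    ["memcpy", "index_scan", "main"],
    ["parse_query", "main"],
]
folded = collapse_stacks(samples)
for line in folded:
    print(line)
```

Because identical stacks collapse into a single counted line, the folded file stays small even for long recordings.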
Production Failure Scenarios
Scenario 1: Profiling Overhead Swamps Actual Performance
Problem: Setting sampling frequency too high (>10kHz) or enabling too many tracepoints causes the system to spend more time recording events than doing useful work.
Mitigation:
- Start with low-overhead configurations: perf stat for counters, -F 99 for sampling
- Use perf bench to measure baseline before profiling
- Enable events selectively; disable unused tracepoint groups
- For production, use BPF programs that aggregate in-kernel
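A back-of-the-envelope model shows why the sampling frequency matters. The sketch below assumes a fixed cost per sample (COST_US is an illustrative guess, not a measured value) and estimates what fraction of each CPU goes to taking samples; the function name sampling_overhead is hypothetical.

```python
def sampling_overhead(freq_hz, num_cpus, cost_per_sample_us):
    """Return (samples per second system-wide, fraction of each CPU spent sampling)."""
    samples_per_sec = freq_hz * num_cpus
    # Each CPU handles freq_hz samples per second at the assumed cost
    busy_fraction = freq_hz * cost_per_sample_us / 1_000_000
    return samples_per_sec, busy_fraction

COST_US = 5.0  # assumed cost per sample in microseconds; measure on your own system
for freq in (99, 999, 10_000, 50_000):
    rate, frac = sampling_overhead(freq, num_cpus=16, cost_per_sample_us=COST_US)
    print(f"-F {freq:>6}: {rate:>7} samples/s, ~{frac:.2%} of each CPU")
```

The model ignores stack unwinding and the cost of writing perf.data, both of which grow with stack depth, so treat the result as a lower bound.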
Scenario 2: Stack Trace Corruption
Problem: perf report shows [unknown] frames or corrupted call graphs.
Mitigation:
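- Ensure perf and debug symbols are installed (on Debian/Ubuntu, linux-tools-common and linux-tools-$(uname -r) provide perf; symbols come from separate dbgsym or debuginfo packages)
- Use --call-graph dwarf instead of frame pointers when binaries were built without them
- For kernel stacks, check that the kernel was built with CONFIG_FRAME_POINTER or the ORC unwinder
- Verify the binary isn’t stripped, or ship matching debug packages alongside it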
Scenario 3: perf Fails with “Permission Denied”
Problem: /proc/sys/kernel/perf_event_paranoid is set to restrictive values.
Mitigation:
# Check current setting
cat /proc/sys/kernel/perf_event_paranoid
# Temporarily change (requires root)
echo 1 > /proc/sys/kernel/perf_event_paranoid
# Permanent change in /etc/sysctl.conf
echo "kernel.perf_event_paranoid = 1" >> /etc/sysctl.conf
Trade-off Table
| Tool | Overhead | Granularity | Customization | Best For |
|---|---|---|---|---|
| perf stat | Very low | Counter totals | Low (preset events) | Baseline measurements |
| perf record | Low-Medium | Call stacks | Medium (custom events) | CPU hotspot analysis |
| ftrace | Low | Function-level | Medium (filters, dynamic events) | Kernel internals |
| BCC/eBPF | Very low | Custom | Very high | Production tracing |
| FlameGraph | N/A (post-process) | Visual | N/A | Communication |
Implementation Snippet: Custom perf Event
For specialized measurements, create custom tracepoints:
/* In tracepoint-enabled kernel code: fire an existing tracepoint */
#include <linux/sched.h>
#include <trace/events/sched.h>

void my_sched_function(struct task_struct *p)
{
    /* Recent kernels pass only the task; older ones also took a success flag */
    trace_sched_wakeup(p);
}

/* Or define a custom tracepoint (requires kernel modification) */
/* In include/trace/events/mydevice.h: */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM mydevice

#if !defined(_TRACE_MYDEVICE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MYDEVICE_H

#include <linux/tracepoint.h>

TRACE_EVENT(mydevice_operation,
    TP_PROTO(int device_id, size_t bytes, int error),
    TP_ARGS(device_id, bytes, error),
    TP_STRUCT__entry(
        __field(int, device_id)
        __field(size_t, bytes)
        __field(int, error)
    ),
    TP_fast_assign(
        __entry->device_id = device_id;
        __entry->bytes = bytes;
        __entry->error = error;
    ),
    TP_printk("device=%d bytes=%zu error=%d",
        __entry->device_id, __entry->bytes, __entry->error)
);

#endif /* _TRACE_MYDEVICE_H */

/* This include must stay outside the guard */
#include <trace/define_trace.h>

/* User space can then count the event from a shell: */
perf stat -e mydevice:mydevice_operation -a -- sleep 5
Observability Checklist
When profiling a production system, collect these metrics in parallel:
- CPU utilization: mpstat -P ALL 1 for per-core usage
- Context switch rate: vmstat 1, look at the cs column
- Memory pressure: free -m, cat /proc/meminfo
- I/O throughput: iostat -xz 1
- Network: netstat -s or ss -s
- Scheduler latency: perf sched latency (after capturing with perf sched record)
- Interrupt distribution: cat /proc/interrupts
Security and Access Notes
- perf_event_paranoid: Restricts who can use perf; containers with restrictive settings may be blocked
- Kernel address exposure: perf can reveal kernel addresses; treat dumps as sensitive
- eBPF restrictions: Modern kernels restrict eBPF program capabilities for unprivileged users
- DTrace vs perf: perf's dynamic probing is less expressive than Solaris DTrace's; consider SystemTap for complex probes in enterprise environments
Common Pitfalls / Anti-patterns
- Profiling wrong thing — Profiling build time when you should profile runtime, or vice versa
- Ignoring cache effects — First run always looks different; measure warmed cache performance
- Microbenchmarking — Measuring small functions in isolation ignores cache, branch prediction, and pipeline effects
- Attributing cost incorrectly — “system” time is not always kernel time; check user/kernel ratio carefully
- Over-interpreting samples — 99 samples of 10,000 may not be statistically significant for your conclusion
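To put a number on the sampling-noise pitfall, a rough sketch: treat each sample as an independent draw and compute a normal-approximation 95% interval for a function's share of samples. The function name sample_share_interval is hypothetical, and real samples are correlated in time, so the true uncertainty is usually larger than this estimate.

```python
import math

def sample_share_interval(hits, total, z=1.96):
    """Normal-approximation 95% interval for the share of samples in a function."""
    p = hits / total
    stderr = math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - z * stderr), min(1.0, p + z * stderr)

p, low, high = sample_share_interval(99, 10_000)
print(f"share={p:.4f}, 95% interval=[{low:.4f}, {high:.4f}]")
```

For 99 hits in 10,000 samples the interval spans roughly 0.8% to 1.2%, so a before/after difference smaller than that range should not be read as a real change.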
Quick Recap Checklist
- Use perf stat for quick counter summaries before detailed profiling
- Use perf record -g to capture call stacks for hotspot analysis
- Flame graphs visualize call patterns and make bottlenecks obvious
- ftrace is best for kernel-internal function tracing with low overhead
- BCC/eBPF enables production-safe custom instrumentation without kernel changes
- Always measure baseline before and after changes to verify improvement
- Consider system-wide effects when profiling single applications
Real-World Case Study: Database Query Optimization
A production PostgreSQL instance showed high CPU but low throughput. Using perf:
- perf stat -e cycles,instructions,cache-misses -a -- sleep 5 showed CPI of 3.2 (poor)
- perf record -F 99 -a -g -- sleep 30 captured call stacks
- Flame graph revealed 40% of cycles in memcpy within index scan operations
- The culprit: updating full rows when only specific columns changed (write amplification)
The fix involved batched column-specific updates, reducing CPU usage by 65%. This demonstrates why profiling before optimization is critical—the assumed bottleneck (lock contention) was actually a memory bandwidth issue.
Advanced Topic: Speculative Execution and Profiling Errors
Modern CPUs execute instructions speculatively before determining branch outcomes. When profiling with statistical sampling, samples taken during speculative execution can show functions that were never actually on the critical path—the branch was mispredicted and the pipeline was flushed. This leads to false hotspots in your profile.
Mitigations include:
- Use last branch record (LBR) to see actual taken branches, not speculative ones
- Profile with realistic data that exercises real branch patterns
- Use event-based sampling with branch filtering: perf record -e branch-misses ...
- Consider that functions appearing in fewer than 1% of samples may be noise from speculation
Interview Questions
Tracepoints are static, predefined hooks in kernel code that are compiled in but cost almost nothing while disabled. kprobes are dynamic instrumentation that attaches to almost any kernel instruction, most commonly function entry. uprobes are the user-space equivalent, attaching to functions or offsets in user-space binaries. Tracepoints have the lowest overhead and the most stable interfaces; kprobes/uprobes are more flexible but depend on symbol names and offsets that can change between kernel or binary versions.
Modern CPUs execute instructions speculatively past predicted branches, and some of those predictions are wrong. When a profiler takes a sample during speculative execution, you may see functions in stack traces that weren't actually on the critical path. Use perf record -b to sample taken branches through the last branch record (LBR) when the hardware supports it. Always profile with realistic data that exercises real branch behavior, not synthetic test cases.
Why does perf stat show more cycles than time?
Modern CPUs can execute multiple instructions per cycle via pipelining and superscalar execution. "Cycles" counts actual clock cycles elapsed, while "instructions" counts retired instructions. The ratio (cycles per instruction, CPI) reveals pipeline efficiency. If the cycle count exceeds what the elapsed time and clock frequency allow for one core, the workload is using multiple cores in parallel, each core contributing its own cycles.
eBPF (extended Berkeley Packet Filter) allows sandboxed programs to run in the kernel without modifying kernel source or loading modules. Programs are verified for safety before execution and can be attached to kprobes, tracepoints, or network interfaces. This enables custom metrics collection, networking, and security enforcement, all without kernel patches. BCC provides Python and Lua front ends; the BPF program itself is written in restricted C and compiled to eBPF bytecode with LLVM.
Statistical sampling at 99Hz or 999Hz cannot reliably capture microsecond-scale functions—they may not appear in any sample. Instead, use: (1) perf stat -e with cache miss or branch misprediction events that correlate with the function; (2) instruction counting via perf stat -e instructions; (3) trace the specific function with ftrace and its graph mode to measure individual calls; (4) consider if the function needs algorithm optimization rather than just measurement.
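The claim that short functions can vanish from a sampled profile can be checked with simple arithmetic. The sketch below estimates the expected number of samples that land in a function and uses a Poisson approximation for the chance of seeing none at all. The function names and the workload numbers are hypothetical.

```python
import math

def expected_hits(sample_hz, duration_s, cpu_fraction):
    """Expected number of samples landing in a function that uses
    cpu_fraction of one CPU's time while sampled at sample_hz."""
    return sample_hz * duration_s * cpu_fraction

def prob_never_sampled(sample_hz, duration_s, cpu_fraction):
    """Poisson approximation of the chance that zero samples hit the function."""
    return math.exp(-expected_hits(sample_hz, duration_s, cpu_fraction))

# Hypothetical function: 2 microseconds per call, 100 calls per second
fraction = 2e-6 * 100  # share of one CPU's time
hits = expected_hits(99, 30, fraction)
miss = prob_never_sampled(99, 30, fraction)
print(f"expected samples: {hits:.2f}, chance of zero samples: {miss:.0%}")
```

With these assumed numbers a 30-second profile at 99 Hz expects less than one sample in the function, which is why tracing or counting is the better tool for such cases.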
Hardware PMCs (Performance Monitoring Counters) are built into the CPU and count events like cycles, instructions, cache misses, and branch mispredictions—they require minimal overhead to read. Software counters are implemented by the kernel via tracepoints and kprobes; they count events like context switches, page faults, and scheduler decisions. Hardware counters provide lower overhead and higher precision for CPU-bound analysis; software counters provide visibility into kernel activities that have no hardware equivalent.
When a PMC overflows (reaches its maximum count), the kernel receives an interrupt and records a sample with the current instruction pointer. With perf record -c the counter is reloaded with a fixed period after each overflow; with perf record -F (frequency-based sampling) the kernel adjusts the period dynamically so that overflows arrive at roughly the requested rate. The watermark and wakeup_watermark fields of perf_event_attr control how much data accumulates in the ring buffer before user space is woken, not when the overflow interrupt fires. In multi-threaded programs, the inherit setting decides whether child tasks inherit the parent's events.
perf uses per-CPU ring buffers to transfer samples from kernel to user space. When perf records, it mmaps each buffer and the kernel writes samples directly into it, so no syscall is needed per sample; user space only advances a tail pointer in the shared mapping after consuming data. When the buffer is full, the kernel drops new samples and reports them as lost (default mode) or overwrites the oldest samples (overwrite mode). Passing a larger size to perf record -m, for example -m 16M, reduces sample loss in high-frequency profiling scenarios.
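A toy model can make the full-buffer behavior concrete. In the kernel's perf ring buffer, a full buffer either rejects new records and counts them as lost (the default) or lets new records replace the oldest ones (overwrite mode). The class below, ToyRingBuffer, is a hypothetical illustration of those two policies and not the kernel's data structure.

```python
from collections import deque

class ToyRingBuffer:
    """Fixed-capacity buffer with two full-buffer policies."""

    def __init__(self, capacity, overwrite=False):
        self.capacity = capacity
        self.overwrite = overwrite
        self.records = deque()
        self.lost = 0  # records dropped or overwritten

    def write(self, record):
        if len(self.records) == self.capacity:
            self.lost += 1
            if not self.overwrite:
                return  # default mode: drop the new record
            self.records.popleft()  # overwrite mode: discard the oldest
        self.records.append(record)

    def drain(self):
        """Consumer side: take everything currently buffered."""
        out = list(self.records)
        self.records.clear()
        return out

normal = ToyRingBuffer(3)
flight = ToyRingBuffer(3, overwrite=True)
for sample in range(5):
    normal.write(sample)
    flight.write(sample)
print(normal.drain(), normal.lost)
print(flight.drain(), flight.lost)
```

The default policy keeps the earliest data, while overwrite mode keeps the most recent data, which suits "flight recorder" setups where only the moments before an incident matter.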
ftrace excels at function-level tracing with minimal overhead—its function tracer can be enabled with just echo function > current_tracer. perf is better for hardware event counting and call graph analysis. Choose ftrace when you need: (1) function-level visibility with low overhead, (2) kernel function latency via function_graph, (3) tracepoint filtering with event parameters, (4) dynamic instrumentation with kprobe events. Choose perf when you need hardware metrics, call stacks, or userspace tracing.
Flame graphs show relative time spent in each function stack: wider boxes mean more samples. The vertical axis shows stack depth, with callers below and callees above (icicle graphs invert this). The horizontal axis is not time: frames are sorted alphabetically so that identical stacks merge, and only width carries meaning. To interpret: (1) find the widest blocks, which are your hottest paths; (2) read from root to leaf to understand call sequences; (3) compare flame graphs before/after changes to verify improvement; (4) look for "plateaus", wide frames with little or nothing above them, because those functions were themselves running on-CPU when sampled. The colors are typically cosmetic unless you enable hue-by-attribute.
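The width of a frame corresponds to its inclusive sample count, and the exposed top edge of a frame corresponds to its self time. A small sketch of that bookkeeping over folded stacks (lines of the form root;...;leaf count) follows; the function name frame_totals and the data are hypothetical.

```python
from collections import Counter

def frame_totals(folded_lines):
    """Return (inclusive, self_time) sample counts per function from folded stacks."""
    inclusive, self_time = Counter(), Counter()
    for line in folded_lines:
        stack, count = line.rsplit(" ", 1)
        frames = stack.split(";")
        n = int(count)
        for name in set(frames):  # set() avoids double counting recursion
            inclusive[name] += n
        self_time[frames[-1]] += n  # the leaf frame was the one on-CPU
    return inclusive, self_time

# Hypothetical folded stacks
folded_stacks = [
    "main;index_scan;memcpy 40",
    "main;index_scan 10",
    "main;parse_query 50",
]
inclusive, self_time = frame_totals(folded_stacks)
total = sum(self_time.values())
for name, n in inclusive.most_common():
    print(f"{name:12s} inclusive={n / total:5.0%} self={self_time[name] / total:5.0%}")
```

A function with a large inclusive share but almost no self time is wide only because of its callees, so the optimization target sits higher up in the graph.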
On-CPU profiling samples where the CPU is actively executing code, which is useful for CPU-bound workloads where the bottleneck is computation. Off-CPU profiling records where threads are NOT running, which is useful for I/O-bound workloads, lock contention, or scheduling issues where threads block. Use perf record -a -g -- sleep 10 for on-CPU; use perf record -a -g --call-graph=fp -e sched:sched_switch -c 1 -- sleep 10 for off-CPU analysis, keeping in mind that recording every context switch can be expensive on busy systems. Off-CPU flame graphs show blocked threads with their wait paths (like futex, read, poll) as the top frames, revealing I/O or lock bottlenecks.
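Off-CPU time is derived by pairing the moment a thread is switched out with the moment it is switched back in. The sketch below does that pairing over simplified context-switch events; the tuple layout (timestamp, previous tid, next tid) and the function name off_cpu_time are assumptions for illustration, not the format of perf script output.

```python
from collections import defaultdict

def off_cpu_time(events):
    """Sum off-CPU time per thread from (timestamp_s, prev_tid, next_tid) events.

    Each event models one context switch: prev_tid leaves the CPU and
    next_tid starts running.
    """
    switched_out_at = {}
    off_cpu = defaultdict(float)
    for ts, prev_tid, next_tid in sorted(events):
        switched_out_at[prev_tid] = ts
        if next_tid in switched_out_at:
            off_cpu[next_tid] += ts - switched_out_at.pop(next_tid)
    return dict(off_cpu)

# Hypothetical events; tid 0 stands for the idle task
events = [
    (1.000, 101, 0),    # 101 blocks
    (1.250, 0, 101),    # 101 resumes after 0.25 s
    (1.300, 101, 102),  # 101 is switched out, 102 runs
    (1.400, 102, 101),  # 101 resumes after 0.10 s
]
result = off_cpu_time(events)
print({tid: round(t, 3) for tid, t in result.items()})
```

A real tool would also attach the stack captured at switch-out to each interval, which is what turns these durations into an off-CPU flame graph.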
Last Branch Record (LBR) is a hardware feature that stores the last N (typically 16-32) branch records including source and destination addresses. Unlike standard sampling that interrupts periodically and captures only the current instruction pointer, LBR captures actual taken branches with their destinations. This is crucial because: (1) it reduces speculation noise, since branches that were mispredicted and flushed from the pipeline don't appear as real work; (2) it provides call chains without frame pointers (perf record --call-graph lbr); (3) it lets perf record -j any,u capture user-space branch records efficiently. Enable with perf record -j any,u -a and view with perf report --branch-history.
Cache misses don't directly measure memory latency; they measure miss events. When a cache miss occurs, the CPU waits (stalls) for data from memory, but a single cache-misses counter event doesn't tell you how long each stall lasted. To correlate misses with latency: (1) use perf stat -e cycles,cache-misses,cache-references and calculate miss rate; (2) combine with stalled-cycles-frontend and stalled-cycles-backend, where the CPU exposes them, to see how much of the pipeline is waiting; (3) use perf record -W (--weight) with an event that reports a sample weight, on hardware that supports it; (4) on CPUs with load-latency sampling, use perf mem record and perf mem report to see the latency of individual memory accesses.
CPI (Cycles Per Instruction) = cycles / instructions. A CPI of 1.0 means the CPU completed one instruction per cycle on average; superscalar cores can retire several instructions per cycle, so well-optimized code reaches a CPI below 1.0. CPI > 1 indicates stalls. As rough rules of thumb that vary by microarchitecture: (1) CPI 1.0-2.0: normal for memory-bound workloads with some cache misses; (2) CPI 2.0-4.0: significant cache miss penalties or memory latency; (3) CPI > 4.0: severe memory bottleneck, possible cache thrashing or non-optimized memory access patterns. On multi-core systems, system-wide totals sum cycles across cores, so check per-core or per-process counts because each core may show a different CPI. Idle CPUs typically accumulate few cycles, so low CPU utilization together with low throughput points to I/O waiting rather than computation stalls.
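As a worked example of the CPI formula, the sketch below parses counter totals in the comma-separated layout that perf stat -x, emits (value, unit, event name, then run-time fields) and computes CPI and IPC. The helper names and the sample numbers are made up, and the exact column layout should be checked against the perf-stat documentation for your perf version.

```python
def parse_perf_stat_csv(text):
    """Map event name -> count from `perf stat -x,` style output."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip blank lines and "<not counted>" rows
        counts[fields[2]] = int(fields[0])
    return counts

def cpi(counts):
    """Cycles per instruction from parsed counter totals."""
    return counts["cycles"] / counts["instructions"]

# Made-up numbers in the assumed layout: value,unit,event,run-time,percent-running
sample = """\
32000000000,,cycles,5001234567,100.00,,
10000000000,,instructions,5001234567,100.00,,
<not counted>,,cache-misses,0,0.00,,
"""
counts = parse_perf_stat_csv(sample)
print(f"CPI={cpi(counts):.2f} IPC={1 / cpi(counts):.2f}")
```

Event names may carry modifiers such as :u or :k in real output, so a production parser would need to normalize them before looking up cycles and instructions.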
Dynamically loaded modules complicate profiling because their symbols exist only while the module is loaded and their load addresses change between loads. Solutions: (1) Build modules with debug symbols and keep the unstripped .ko files under /lib/modules/$(uname -r)/ so perf can resolve addresses; (2) Capture /proc/kallsyms and /proc/modules while the module is loaded so addresses can be mapped to names after the fact; (3) Enable ftrace before running the workload; set_ftrace_filter accepts a :mod:<name> suffix to select a module's functions; (4) Attach BPF kprobes to module functions after the module has loaded, since a probe cannot attach to a symbol that does not exist yet; (5) For post-mortem analysis, use perf record -k mono -g so samples carry monotonic-clock timestamps that can be ordered against other logs.
AMD's IBS (Instruction-Based Sampling) provides more detailed profiling than standard PMCs. IBS periodically tags one of two things: Fetches (instruction fetch cycles and op information) or Ops (micro-ops retired and their characteristics). Unlike standard PMCs, which count events (cache misses, branches), IBS provides the exact instruction addresses that caused events. Benefits: (1) more precise hotspot identification without frame pointer dependency; (2) micro-op level visibility for AMD Zen architectures; (3) better handling of instruction cache misses. Access via perf record -e ibs_fetch/.../ -e ibs_op/.../. The closest Intel equivalent is PEBS (Precise Event-Based Sampling); LBR (Last Branch Record) is a separate branch-recording feature.
False sharing occurs when threads on different cores modify variables that share the same cache line, causing unnecessary cache invalidations. Profiling indicators: (1) high cache-misses with low memory traffic, meaning many invalidations but not much data transferred; (2) throughput that gets worse as threads are added, even though the threads do not share data logically. Confirmation: use perf c2c record followed by perf c2c report (cache-to-cache analysis), which ranks cache lines by HITM events (loads that hit a line modified in another core's cache) and shows the offsets and code locations that touch each contended line. Fix: pad or align data structures so that each thread's data sits on its own cache line.
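Whether two fields can falsely share depends only on their byte offsets and the cache line size. The sketch below checks a struct layout for fields that land on the same line; the 64-byte line size is typical for x86-64 but is an assumption elsewhere, and the function name shared_lines and the layouts are hypothetical.

```python
CACHE_LINE = 64  # bytes; typical on x86-64, an assumption for other CPUs

def shared_lines(fields, line_size=CACHE_LINE):
    """Return {line_index: [field names]} for lines touched by more than one field.

    fields is a list of (name, offset, size) tuples describing a struct layout.
    """
    lines = {}
    for name, offset, size in fields:
        first = offset // line_size
        last = (offset + size - 1) // line_size
        for idx in range(first, last + 1):
            lines.setdefault(idx, []).append(name)
    return {idx: names for idx, names in lines.items() if len(names) > 1}

# Hypothetical per-thread counters, 8 bytes each
packed = [("counter_t0", 0, 8), ("counter_t1", 8, 8)]
padded = [("counter_t0", 0, 8), ("counter_t1", 64, 8)]
print(shared_lines(packed))
print(shared_lines(padded))
```

The packed layout puts both counters on line 0, so writes from two threads would bounce that line between cores, while the padded layout gives each counter its own line.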
--call-graph fp (the default for perf record -g) uses frame pointer chaining: each stack frame contains a pointer to the previous frame. It is fast but only works if code was compiled with -fno-omit-frame-pointer. GCC defaults to -fomit-frame-pointer for optimized builds on x86-64, making fp mode unreliable for many distribution binaries. --call-graph dwarf instead copies part of the user stack with each sample and unwinds it afterwards using DWARF call frame information. It is more accurate for such binaries but needs unwind or debug information, has higher overhead, and produces larger perf.data files. For containerized environments where debug symbols may not be available locally, use --call-graph dwarf with separate debuginfo packages.
The perf_event_paranoid sysctl controls access to performance events for unprivileged users (those without CAP_PERFMON or CAP_SYS_ADMIN). Each level adds to the restrictions of the levels below it:
- -1: No restrictions
- 0: Disallow raw tracepoint access and the ftrace function tracepoint
- 1: Also disallow CPU-wide event access
- 2: Also disallow kernel profiling; only user-space measurements of the user's own processes remain (the mainline default)
In containers, access is often more restricted still: some distributions add higher levels that block perf_event_open() for unprivileged users entirely, and container runtimes may filter the syscall. Solutions: (1) run the container with --cap-add=SYS_ADMIN (not recommended for security); (2) restrict measurement to user space with the :u event modifier, or profile from the host or a VM; (3) grant CAP_PERFMON (kernel 5.8+), which has finer-grained control.
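A small helper can report the current setting before a profiling session starts. The sketch below follows the level semantics in the kernel's sysctl documentation (each level adds restrictions for users without CAP_PERFMON); the function names are hypothetical, and values above 2 are distribution-specific, so the last branch is only a guess at their meaning.

```python
from pathlib import Path

PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")

def describe_paranoid(level):
    """Summarize what an unprivileged user may do at a given level."""
    if level <= -1:
        return "no restrictions"
    if level == 0:
        return "CPU-wide and kernel profiling allowed; raw tracepoints restricted"
    if level == 1:
        return "per-process user and kernel profiling allowed; CPU-wide events restricted"
    if level == 2:
        return "per-process user-space profiling only"
    return "distribution-specific; perf_event_open is typically blocked entirely"

def current_level(path=PARANOID_PATH):
    """Return the current setting, or None when the file is unavailable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

level = current_level()
if level is None:
    print("perf_event_paranoid not readable on this system")
else:
    print(f"perf_event_paranoid={level}: {describe_paranoid(level)}")
```

Running this inside a container and on the host side by side is a quick way to tell a sysctl restriction apart from a blocked syscall, since the latter fails even when the level looks permissive.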
BPF maps can exhibit staleness when the kernel is in the middle of updating them while userspace is reading. Key issues: (1) Lookup vs Update race: bpf_map_lookup_elem() may return a value that's being concurrently modified; (2) Incremental updates: when aggregating in-kernel, userspace might see partial results. Mitigations: (1) use per-CPU maps such as BPF_MAP_TYPE_PERCPU_HASH so each CPU updates its own slot without cross-CPU synchronization, and sum the slots in userspace; (2) implement double buffering: read from one map while the kernel writes to another, then swap; (3) use __sync_fetch_and_add for atomic increments of shared counters. Note that BPF_F_NO_PREALLOC only changes when element memory is allocated and does not make updates atomic. For production tracing, consider the BPF ring buffer (BPF_MAP_TYPE_RINGBUF, available in Linux 5.8+), which supports efficient data transfer to userspace.
Further Reading
- Brendan Gregg’s perf Tools - Comprehensive perf examples and flame graph creation
- BPF Performance Tools - eBPF tools for Linux observability
- Linux perf Examples - One-minute tutorials with real examples
- LWN.net: The perf tools - Deep dive into perf internals
- Kernel Profiling with perf - Official kernel profiling guide
Conclusion
Performance profiling transforms guesswork into data-driven optimization by systematically measuring where time is spent during execution. Linux provides a layered ecosystem: perf for hardware event counting and call stack sampling, ftrace for function-level kernel tracing with minimal overhead, and BCC/eBPF for custom production-safe instrumentation without kernel changes.
Always start with baseline measurements using perf stat before detailed profiling. Use flame graphs to visualize call patterns and make bottlenecks immediately obvious. For production systems, aggregate data in-kernel with BPF to minimize overhead. Remember that profiling is not optimization—it tells you where problems exist, not how to fix them. Combine profiling data with algorithm understanding to make meaningful improvements.
For continued learning, explore kernel profiling with perf inside containers, eBPF programs for custom latency histograms, and hardware performance monitoring with last branch records (LBR) for speculative execution analysis.