Performance Profiling

Master Linux performance profiling with perf, ftrace, BCC tools, and flame graphs to identify and eliminate kernel bottlenecks.

published: May 19, 2026 reading time: 19 min read author: GeekWorkBench

Introduction

Performance profiling is the systematic analysis of where time is spent during program execution. In operating systems, profiling serves two distinct but related purposes: understanding how applications use system resources and diagnosing where the kernel itself spends its time. Without profiling data, optimization is guesswork—developers assume hot paths rather than measure them.

Linux provides a rich ecosystem of profiling tools, each suited to different problem domains. perf excels at CPU-bound workloads and hardware event counting. ftrace provides function-level tracing with minimal overhead. The BPF Compiler Collection (BCC) enables custom analysis scripts without kernel patches. Flame graphs convert complex profiling data into visual call stacks that immediately reveal bottlenecks.

When to Use / When Not to Use

Performance profiling is appropriate when:

Application runs slower than expected — Quantify where time is actually spent
CPU utilization is high but throughput is low — Identify lock contention or memory pressure
Investigating latency spikes — Trace request paths to find blocking operations
Kernel performance tuning — Understand scheduler behavior, I/O patterns, or interrupt handling

Profiling is NOT appropriate when:

The problem is obvious from logs — Don’t profile what you already understand
System is in crisis — Fix known issues first; profiling takes time to set up correctly
Working on a feature that doesn’t exist yet — Profile production or realistic benchmarks, not speculation

Architecture or Flow Diagram

flowchart TB
    subgraph "Data Sources"
        PMC[Hardware PMCs<br/>Performance Monitoring Counters]
        SWC[Software Counters<br/>tracepoints, kprobes, uprobes]
        TIM[Timers<br/>interval sampling]
    end

    subgraph "Collection Layer"
        PERF[perf stat / record]
        FTRACE[ftrace<br/>function / wakeup / irq]
        BCC[BCC Tools<br/>Python / Lua frontends]
    end

    subgraph "Analysis Layer"
        FG[FlameGraph<br/>stack collapse + svg]
        RAPP[rapp<br/>rapport text reports]
        STD[perf report<br/>text browser]
    end

    PMC --> PERF
    SWC --> FTRACE
    SWC --> BCC
    PERF --> STD
    PERF --> FG
    FTRACE --> FG
    BCC --> FG

    style FG stroke:#ff6b6b,stroke-width:3px

Core Concepts

perf: The Swiss Army Knife

perf is the primary userspace interface to Linux performance monitoring:

# Basic CPU profiling: sample stacks on all CPUs at 99 Hz for 30 seconds
perf record -F 99 -g -a -- sleep 30

# Profile a specific process for 10 seconds
perf record -F 100 -g -p "$PID" -- sleep 10

# Hardware event counting
perf stat -e cycles,instructions,cache-misses -a -- sleep 5

# Unwind user-space stacks with DWARF when binaries lack frame pointers
perf record -a --call-graph dwarf -- sleep 10

# View results, keeping only symbols whose name contains "my_function"
perf report --stdio --symbol-filter=my_function

ftrace: Function-Level Tracing

ftrace provides kernel-internal tracing with minimal overhead:

# Enable function tracer (run as root; newer kernels also mount tracefs
# at /sys/kernel/tracing, older ones expose it only through debugfs)
cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo function > current_tracer
echo my_function > set_ftrace_filter
echo 1 > tracing_on

# Measure function latency with function_graph
echo function_graph > current_tracer
echo 1 > max_graph_depth

# Trace specific events - scheduler latency
echo 1 > events/sched/sched_switch/enable
echo 1 > events/sched/sched_wakeup/enable
cat trace | head -100

BCC Tools for Custom Analysis

BCC provides Python/Lua frontends for eBPF programs:

#!/usr/bin/env python3
"""Example BCC tool: trace openat() calls with PID, process name, and path"""

from bcc import BPF

program = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct data_t {
    u32 pid;
    u32 uid;
    char comm[TASK_COMM_LEN];
    char filename[256];
};

BPF_PERF_OUTPUT(events);

TRACEPOINT_PROBE(syscalls, sys_enter_openat) {
    struct data_t data = {};
    data.pid = bpf_get_current_pid_tgid() >> 32;
    data.uid = bpf_get_current_uid_gid() & 0xFFFFFFFF;
    bpf_get_current_comm(&data.comm, sizeof(data.comm));

    // Copy the user-space path argument into the event
    bpf_probe_read_user_str(&data.filename, sizeof(data.filename),
                            (void *)args->filename);

    // Without this call nothing ever reaches user space
    events.perf_submit(args, &data, sizeof(data));
    return 0;
}
"""

b = BPF(text=program)

# Print one line per event
def print_event(cpu, data, size):
    event = b["events"].event(data)
    comm = event.comm.decode("utf-8", "replace")
    filename = event.filename.decode("utf-8", "replace")
    print(f"{event.pid:6d} {comm:16s} {filename}")

b["events"].open_perf_buffer(print_event)
while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        break

Generating Flame Graphs

#!/bin/bash
# Generate flame graphs from perf data

PERF_DATA=perf.data
FLAMEGRAPH_DIR=/opt/FlameGraph

# Record profile
perf record -F 99 -a -g -o "$PERF_DATA" -- sleep 30

# Convert to flame graph
perf script -i "$PERF_DATA" | "$FLAMEGRAPH_DIR/stackcollapse-perf.pl" | \
    "$FLAMEGRAPH_DIR/flamegraph.pl" > profile.svg

# Filter to a specific process. Select by command name rather than grepping
# raw perf script output, which would break up the multi-line stack records.
perf script -i "$PERF_DATA" --comms=myservice | \
    "$FLAMEGRAPH_DIR/stackcollapse-perf.pl" | \
    "$FLAMEGRAPH_DIR/flamegraph.pl" > myservice.svg
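
To make the stack-collapse step less opaque, here is a minimal Python sketch of the folding idea: every sample's stack becomes one semicolon-joined line, and identical lines are counted. It assumes a simplified perf script layout (one header line per sample, indented frame lines, a blank line between samples). The function name collapse and the sample text are illustrative only, and the real stackcollapse-perf.pl handles many more cases.

```python
from collections import Counter

def collapse(perf_script_text):
    """Fold perf-script-style samples into 'comm;root;...;leaf count' lines."""
    counts = Counter()
    comm, frames = None, []
    for line in perf_script_text.splitlines() + [""]:
        if not line.strip():                  # blank line ends a sample
            if comm is not None and frames:
                # perf prints the leaf first; the folded format wants root first
                counts[";".join([comm] + frames[::-1])] += 1
            comm, frames = None, []
        elif not line[0].isspace():           # header line: "comm pid time: ..."
            comm = line.split()[0]
        else:                                 # frame line: "addr symbol+off (dso)"
            parts = line.split()
            symbol = parts[1] if len(parts) > 1 else parts[0]
            frames.append(symbol.split("+")[0])   # drop the +0x.. offset
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Hypothetical, hand-written input in the simplified layout described above
SAMPLE = """\
myservice 1234 100.0: 1 cycles:
\tffff01 memcpy+0x10 (libc.so)
\tffff02 index_scan (myservice)
\tffff03 main (myservice)

myservice 1234 100.1: 1 cycles:
\tffff01 memcpy+0x20 (libc.so)
\tffff02 index_scan (myservice)
\tffff03 main (myservice)

other 99 100.2: 1 cycles:
\tffff09 idle (kernel)
"""

for folded in collapse(SAMPLE):
    print(folded)
```

Because the two myservice samples differ only in the instruction offset, they fold into a single line with a count of 2, which is what gives a flame graph box its width.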

Production Failure Scenarios

Scenario 1: Profiling Overhead Swamps Actual Performance

Problem: Setting sampling frequency too high (>10kHz) or enabling too many tracepoints causes the system to spend more time recording events than doing useful work.

Mitigation:

Start with low-overhead configurations: perf stat for counters, -F 99 for sampling
Measure a baseline (throughput, latency, or perf stat counters) before enabling heavier profiling, so the profiler's own cost is visible
Enable events selectively; disable unused tracepoint groups
For production, use BPF programs that aggregate in-kernel
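
As a back-of-the-envelope check on sampling frequency, the fraction of each CPU spent taking samples is roughly the sampling rate multiplied by the cost of one sample. The sketch below assumes a placeholder cost of 5 microseconds per sample; that figure is an assumption for illustration, not a measurement, and the real cost depends on stack depth, unwinding mode, and hardware.

```python
def sampling_overhead(freq_hz, cost_per_sample_us):
    """Approximate fraction of each CPU's time spent recording samples."""
    return freq_hz * cost_per_sample_us / 1_000_000

# 5.0 us per sample is a hypothetical placeholder, not a measured value
for freq in (99, 999, 10_000, 50_000):
    share = sampling_overhead(freq, 5.0)
    print(f"{freq:>6} Hz -> {share:.2%} of each CPU")
```

Under that assumption, 99 Hz stays far below one percent while 10 kHz already costs several percent, which is why low rates are the safer starting point.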

Scenario 2: Stack Trace Corruption

Problem: perf report shows [unknown] frames or corrupted call graphs.

Mitigation:

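Install debug symbol packages for the kernel and the profiled binaries (for example -dbgsym or -debuginfo packages, depending on the distribution)
Use --call-graph dwarf instead of frame pointers when binaries were built without frame pointers
For kernel stacks, check that the kernel was built with a stack unwinder (CONFIG_FRAME_POINTER or CONFIG_UNWINDER_ORC)
Verify that the binary isn’t stripped, or that matching debug packages are shipped alongside it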

Scenario 3: perf Fails with “Permission Denied”

Problem: /proc/sys/kernel/perf_event_paranoid is set to restrictive values.

Mitigation:

# Check current setting
cat /proc/sys/kernel/perf_event_paranoid

# Temporarily change (requires root)
echo 1 > /proc/sys/kernel/perf_event_paranoid

# Permanent change in /etc/sysctl.conf
echo "kernel.perf_event_paranoid = 1" >> /etc/sysctl.conf

Trade-off Table

Tool	Overhead	Granularity	Customization	Best For
perf stat	Very low	Counter totals	Low—preset events	Baseline measurements
perf record	Low-Medium	Call stacks	Medium—custom events	CPU hotspot analysis
ftrace	Low	Function-level	High (filters, triggers, kprobe events)	Kernel internals
BCC/eBPF	Very low	Custom	Very high	Production tracing
FlameGraph	N/A (post-process)	Visual	N/A	Communication

Implementation Snippet: Custom perf Event

For specialized measurements, create custom tracepoints:

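/* Option 1: fire an existing tracepoint from built-in kernel code */
#include <linux/sched.h>
#include <trace/events/sched.h>

void my_sched_function(struct task_struct *p)
{
    /* Current kernels pass only the task; older ones also took a success flag */
    trace_sched_wakeup(p);
}

/* Option 2: define a custom tracepoint (requires kernel or module source changes) */
/* In include/trace/events/mydevice.h: */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM mydevice

#if !defined(_TRACE_MYDEVICE_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_MYDEVICE_H

#include <linux/tracepoint.h>

TRACE_EVENT(mydevice_operation,
    TP_PROTO(int device_id, size_t bytes, int error),
    TP_ARGS(device_id, bytes, error),
    TP_STRUCT__entry(
        __field(int, device_id)
        __field(size_t, bytes)
        __field(int, error)
    ),
    TP_fast_assign(
        __entry->device_id = device_id;
        __entry->bytes = bytes;
        __entry->error = error;
    ),
    TP_printk("device=%d bytes=%zu error=%d",
              __entry->device_id, __entry->bytes, __entry->error)
);

#endif /* _TRACE_MYDEVICE_H */

/* This include must stay outside the include guard */
#include <trace/define_trace.h>

/* In exactly one .c file, before calling trace_mydevice_operation(): */
#define CREATE_TRACE_POINTS
#include <trace/events/mydevice.h>

/* User space can then listen (shell command, not C): */
perf stat -e mydevice:mydevice_operation -a -- sleep 5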

Observability Checklist

When profiling a production system, collect these metrics in parallel:

CPU utilization — mpstat 1 for per-core usage
Context switch rate — vmstat 1, look at cs column
Memory pressure — free -m, cat /proc/meminfo
I/O throughput — iostat -xz 1
Network — netstat -s or ss -s
Scheduler latency — perf sched latency
Interrupt distribution — cat /proc/interrupts
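
The context-switch rate can also be collected without vmstat, because the kernel exposes a cumulative context-switch counter as the ctxt line in /proc/stat. A minimal sketch, assuming two snapshots of that file taken a known interval apart (the function names and the snapshot contents are illustrative):

```python
def read_ctxt(stat_text):
    """Return the cumulative context-switch count from /proc/stat content."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no ctxt line found")

def ctxt_rate(before_text, after_text, interval_s):
    """Context switches per second between two /proc/stat snapshots."""
    return (read_ctxt(after_text) - read_ctxt(before_text)) / interval_s

# Two hypothetical snapshots taken one second apart
before = "cpu  100 0 50 1000\nctxt 500000\nbtime 1700000000\n"
after = "cpu  110 0 55 1090\nctxt 512500\nbtime 1700000000\n"
print(ctxt_rate(before, after, 1.0))
```

On a live system the two strings would come from reading /proc/stat twice; logging the rate alongside the profile makes it easier to line up a spike in scheduling activity with what the profiler recorded.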

Security and Compatibility Notes

perf_event_paranoid — Restricts who can use perf; containers and hardened hosts often block unprivileged access
Kernel address exposure — perf can reveal kernel addresses; treat dumps as sensitive
eBPF restrictions — Modern kernels restrict eBPF program capabilities for unprivileged users
DTrace vs perf — perf on its own is less expressive than Solaris DTrace for ad hoc scripting; consider bpftrace or SystemTap for complex probes in enterprise environments

Common Pitfalls / Anti-patterns

Profiling wrong thing — Profiling build time when you should profile runtime, or vice versa
Ignoring cache effects — First run always looks different; measure warmed cache performance
Microbenchmarking — Measuring small functions in isolation ignores cache, branch prediction, and pipeline effects
Attributing cost incorrectly — High “system” time is often kernel work done on the application's behalf (syscalls, page faults); check the user/kernel ratio carefully
Over-interpreting samples — 99 samples of 10,000 may not be statistically significant for your conclusion
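
To put a number on the sampling uncertainty, a function's share of samples can be treated as a binomial proportion and given a confidence interval. The sketch below uses the normal approximation with z = 1.96 (about 95% confidence); the function name is illustrative, and the approximation becomes poor when the hit count is very small.

```python
import math

def sample_share_interval(hits, total, z=1.96):
    """Return (share, low, high) for a function seen in `hits` of `total` samples."""
    share = hits / total
    stderr = math.sqrt(share * (1 - share) / total)
    return share, max(0.0, share - z * stderr), min(1.0, share + z * stderr)

share, low, high = sample_share_interval(99, 10_000)
print(f"share={share:.2%} interval=[{low:.2%}, {high:.2%}]")
```

For 99 hits in 10,000 samples the interval spans roughly 0.8% to 1.2%, so a claimed improvement from 1.0% to 0.9% between two runs would sit well inside the noise.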

Quick Recap Checklist

Use perf stat for quick counter summaries before detailed profiling
Use perf record -g to capture call stacks for hotspot analysis
Flame graphs visualize call patterns and make bottlenecks obvious
ftrace is best for kernel-internal function tracing with low overhead
BCC/eBPF enables production-safe custom instrumentation without kernel changes
Always measure baseline before and after changes to verify improvement
Consider system-wide effects when profiling single applications

Real-World Case Study: Database Query Optimization

A production PostgreSQL instance showed high CPU but low throughput. Using perf:

perf stat -e cycles,instructions,cache-misses -a -- sleep 5 showed CPI of 3.2 (poor)
perf record -F 99 -a -g -- sleep 30 captured call stacks
Flame graph revealed 40% of cycles in memcpy within index scan operations
The culprit: updating full rows when only specific columns changed (write amplification)

The fix involved batched column-specific updates, reducing CPU usage by 65%. This demonstrates why profiling before optimization is critical—the assumed bottleneck (lock contention) was actually a memory bandwidth issue.

Advanced Topic: Speculative Execution and Profiling Errors

Modern CPUs execute instructions speculatively before determining branch outcomes. When profiling with statistical sampling, samples taken during speculative execution can show functions that were never actually on the critical path—the branch was mispredicted and the pipeline was flushed. This leads to false hotspots in your profile.

Mitigations include:

Use last branch record (LBR) to see actual taken branches, not speculative ones
Profile with realistic data that exercises real branch patterns
Use event-based sampling with branch filtering: perf record -e branch-misses ...
Consider that functions appearing in fewer than 1% of samples may be noise from speculation

Interview Questions

1. What is the difference between tracepoints, kprobes, and uprobes in Linux?

Tracepoints are static, predefined hooks compiled into the kernel; they cost almost nothing while disabled and can be enabled at runtime. kprobes are dynamic instrumentation that can attach to almost any kernel function entry or instruction (a small blacklist of functions is excluded). uprobes are the user-space equivalent, attaching to functions or offsets in user-space binaries. Tracepoints have the lowest overhead and the most stable interface; kprobes and uprobes are more flexible but depend on symbol names and offsets that can change between kernel or binary versions.

2. How does hardware branch prediction interact with profiling?

Modern CPUs predict branch outcomes and execute instructions speculatively; when a prediction turns out to be wrong, the speculative work is discarded. When a profiler takes a sample during speculative execution, you may see functions in stack traces that weren't actually on the critical path. Use perf record -b to sample taken branches, or rely on last branch record (LBR) when available. Always profile with realistic data that exercises real branch behavior, not synthetic test cases.

3. What does "CPU cycle accounting" mean and why can perf stat report more cycles than the elapsed time seems to allow?

Modern CPUs can execute multiple instructions per cycle via pipelining and superscalar execution. "Cycles" counts the clock cycles during which the measured CPUs were running, while "instructions" counts retired instructions. The ratio (cycles per instruction, CPI) reveals pipeline efficiency. If the cycle count is significantly larger than the elapsed time multiplied by the clock frequency, the workload ran on multiple cores in parallel, with each core contributing its own cycles.
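
The multi-core effect can be checked with simple arithmetic: dividing the cycle count by the number of cycles one core could deliver in the measured interval gives the average number of busy cores. This sketch assumes a constant clock frequency, which frequency scaling violates in practice, so treat the result as an estimate; the figures are hypothetical.

```python
def average_busy_cores(cycles, elapsed_s, clock_hz):
    """Estimate how many cores were busy on average during the measurement."""
    return cycles / (elapsed_s * clock_hz)

# Hypothetical perf stat result: 60 billion cycles in 5 seconds on a 3 GHz part
print(average_busy_cores(60_000_000_000, 5.0, 3_000_000_000))
```

A result near the number of online CPUs means the whole machine is saturated, while a result near 1.0 on a many-core host points at a single-threaded bottleneck.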

4. What is eBPF and why is it significant for Linux observability?

eBPF (extended Berkeley Packet Filter) allows sandboxed programs to run in the kernel without modifying kernel source or loading modules. Programs are verified for safety before execution and can be attached to kprobes, tracepoints, or network interfaces. This enables custom metrics collection, networking, and security enforcement, all without kernel patches. BCC provides Python and Lua front ends; the BPF program itself is written in restricted C, which BCC compiles to eBPF bytecode with LLVM at load time.

5. How do you profile a function that runs in microseconds?

Statistical sampling at 99Hz or 999Hz cannot reliably capture microsecond-scale functions—they may not appear in any sample. Instead, use: (1) perf stat -e with cache miss or branch misprediction events that correlate with the function; (2) instruction counting via perf stat -e instructions; (3) trace the specific function with ftrace and its graph mode to measure individual calls; (4) consider if the function needs algorithm optimization rather than just measurement.
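
How likely a short function is to show up at all can be estimated from the fraction of time it occupies. The sketch below assumes a single profiled thread and models sample hits with a Poisson approximation; the function names and the workload numbers are hypothetical.

```python
import math

def expected_samples(freq_hz, duration_s, func_time_us, calls_per_s):
    """Expected number of samples that land inside the function."""
    busy_fraction = func_time_us * calls_per_s / 1_000_000
    return freq_hz * duration_s * busy_fraction

def prob_at_least_one(expected):
    """Chance of seeing the function in at least one sample (Poisson model)."""
    return 1 - math.exp(-expected)

# A 5 us function called 100 times per second, profiled at 99 Hz for 30 s
n = expected_samples(99, 30, 5, 100)
print(f"expected samples: {n:.3f}, P(at least one): {prob_at_least_one(n):.2f}")
```

With only about one and a half expected samples in 30 seconds, the function may be missing from the profile entirely, which is the reason to switch to counting or tracing for code at this scale.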

6. What is the difference between hardware and software performance counters?

Hardware PMCs (Performance Monitoring Counters) are built into the CPU and count events like cycles, instructions, cache misses, and branch mispredictions—they require minimal overhead to read. Software counters are implemented by the kernel via tracepoints and kprobes; they count events like context switches, page faults, and scheduler decisions. Hardware counters provide lower overhead and higher precision for CPU-bound analysis; software counters provide visibility into kernel activities that have no hardware equivalent.

7. How does the kernel's perf event subsystem handle overflow?

For sampling, the kernel programs the PMC so that it overflows after a chosen number of events (the sample period). The overflow raises an interrupt, and the handler records a sample with the current instruction pointer and re-arms the counter. perf record -c sets a fixed period; perf record -F requests a target sampling frequency instead, and the kernel adjusts the period dynamically to approach that rate. The watermark and wakeup_watermark fields of perf_event_attr control how much data must accumulate in the ring buffer before user space is woken up, not when the overflow interrupt fires. In multi-threaded programs, the inherit setting decides whether child tasks inherit the parent's events.

8. What is the relationship between perf and the kernel's ring buffer?

perf uses its own per-CPU ring buffers, separate from the ftrace ring buffer, to transfer samples from kernel to user space. When perf records, it mmaps each buffer and the kernel writes samples directly into it, so no syscall per sample is needed; user space only advances the read position in the mapped region. When a buffer fills up in the default mode, new samples are dropped and reported as lost events; in overwrite mode, the oldest data is overwritten instead. Passing a larger value to perf record -m (for example -m 8M) increases the buffer size to reduce sample loss in high-frequency profiling scenarios.
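
The difference between dropping new records and overwriting old ones is easier to see in a toy model. The class below is not the kernel's data structure; it is a hypothetical fixed-capacity buffer that only illustrates the two policies a full sample buffer can follow.

```python
from collections import deque

class SampleBuffer:
    """Toy fixed-size buffer with two policies for handling a full buffer."""

    def __init__(self, capacity, overwrite=False):
        self.capacity = capacity
        self.overwrite = overwrite
        self.records = deque()
        self.lost = 0

    def write(self, record):
        if len(self.records) < self.capacity:
            self.records.append(record)
        elif self.overwrite:
            self.records.popleft()      # the oldest record is sacrificed
            self.records.append(record)
        else:
            self.lost += 1              # the new record is dropped and counted

keep_oldest = SampleBuffer(3)
keep_newest = SampleBuffer(3, overwrite=True)
for i in range(5):                      # the reader never drains in this demo
    keep_oldest.write(i)
    keep_newest.write(i)

print(list(keep_oldest.records), keep_oldest.lost)
print(list(keep_newest.records), keep_newest.lost)
```

The first policy keeps the start of a burst and reports how much was missed; the second keeps the most recent history, which suits a flight-recorder style of use where only the moments before an incident matter.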

9. Why would you choose ftrace over perf for kernel tracing?

ftrace excels at function-level tracing with minimal overhead—its function tracer can be enabled with just echo function > current_tracer. perf is better for hardware event counting and call graph analysis. Choose ftrace when you need: (1) function-level visibility with low overhead, (2) kernel function latency via function_graph, (3) tracepoint filtering with event parameters, (4) dynamic instrumentation with kprobe events. Choose perf when you need hardware metrics, call stacks, or userspace tracing.

10. How do you interpret a flame graph correctly?

Flame graphs show the relative number of samples in each call stack: wider boxes mean more samples. The vertical axis shows stack depth, with the root at the bottom and each callee stacked on top of its caller; the horizontal axis is not time, because frames are sorted alphabetically so that identical stacks merge. To interpret: (1) find the widest boxes, which are your hottest paths; (2) read from root to leaf to understand call sequences; (3) compare flame graphs before and after changes to verify improvement; (4) look for wide "plateaus" along the top edge, which are functions that were directly on-CPU for many samples. The colors are typically cosmetic unless you choose a palette that encodes an attribute.
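
Box widths can also be computed directly from folded stacks, which is handy when comparing two profiles numerically. This sketch assumes folded lines of the form "root;...;leaf count", the format stackcollapse-perf.pl emits; the function name and the sample data are illustrative.

```python
from collections import Counter

def frame_widths(folded_lines):
    """Return (total, inclusive, self_time) sample counts per function."""
    inclusive, self_time = Counter(), Counter()
    total = 0
    for line in folded_lines:
        stack, count = line.rsplit(" ", 1)
        n = int(count)
        frames = stack.split(";")
        total += n
        for name in set(frames):        # count a recursive frame once per stack
            inclusive[name] += n
        self_time[frames[-1]] += n      # the leaf was the code actually on-CPU
    return total, inclusive, self_time

# Hypothetical folded stacks
folded = ["main;parse;memcpy 40", "main;parse 10", "main;render 50"]
total, inclusive, self_time = frame_widths(folded)
for name in sorted(inclusive):
    print(f"{name:8s} inclusive={inclusive[name] / total:.0%} "
          f"self={self_time[name] / total:.0%}")
```

Inclusive counts correspond to the width of a box, while self counts correspond to the exposed top edge of that box, so a function with a large inclusive share and a small self share is only a path to the real hotspot.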

11. What is the difference between on-CPU and off-CPU profiling, and when would you use each?

On-CPU profiling samples where the CPU is actively executing code, which is useful for CPU-bound workloads where the bottleneck is computation. Off-CPU profiling records where and for how long threads are NOT running, which is useful for I/O-bound workloads, lock contention, or scheduling issues where threads block. Use perf record -a -g -- sleep 10 for on-CPU; use perf record -a -g --call-graph=fp -e sched:sched_switch -c 1 -- sleep 10 for off-CPU analysis, keeping in mind that tracing every context switch can be expensive on busy systems. Off-CPU flame graphs show the stack at the moment a thread blocked, so wait paths (such as futex, read, poll) appear as the top frames, revealing I/O or lock bottlenecks.
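
The core of off-CPU accounting is pairing the moment a thread is switched out with the moment it is switched back in. A minimal sketch, assuming the sched_switch events have already been parsed into (timestamp, prev_pid, next_pid) tuples sorted by time; the tuple layout and function name are assumptions for illustration.

```python
from collections import defaultdict

def off_cpu_time(switches):
    """Sum, per PID, the time between being switched out and switched back in."""
    off_since = {}
    total = defaultdict(float)
    for ts, prev_pid, next_pid in switches:
        if next_pid in off_since:               # thread is coming back on-CPU
            total[next_pid] += ts - off_since.pop(next_pid)
        off_since[prev_pid] = ts                # thread is leaving the CPU
    return dict(total)

# Hypothetical events: (timestamp in seconds, prev_pid, next_pid)
events = [(0.0, 10, 20), (0.25, 20, 10), (0.75, 10, 20)]
print(off_cpu_time(events))
```

A real tool would also record the stack captured at each switch-out and attribute the blocked time to it, which is what turns these totals into an off-CPU flame graph.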

12. How does Last Branch Record (LBR) improve profiling accuracy over standard sampling?

LBR is a hardware feature that stores the last N (typically 16-32) taken branches, including source and destination addresses. Unlike standard sampling, which interrupts periodically and captures only the current instruction pointer, LBR captures actual taken branches with their destinations. This is crucial because: (1) it reduces speculation noise, since branches that were mispredicted and flushed from the pipeline don't appear as real work; (2) it can provide call chains without frame pointers (perf record --call-graph lbr); (3) it lets perf record -j any,u attach user-space branch records to each sample. Enable with perf record -j any,u -a and view with perf report --branch-history.

13. What is the relationship between cache misses and memory latency in profiling data?

Cache misses don't directly measure memory latency; they measure miss events. When a cache miss occurs, the CPU waits (stalls) for data from memory, but a single cache-misses counter event doesn't tell you how long each stall lasted. To correlate misses with latency: (1) use perf stat -e cycles,cache-misses,cache-references and calculate the miss rate; (2) combine with stalled-cycles-frontend and stalled-cycles-backend, where the CPU exposes them, to see how much of the pipeline is waiting; (3) on CPUs that support it, use perf mem record and perf mem report to sample loads together with their latency; (4) treat latency events as hardware-specific and check which ones your CPU offers with perf list.

14. What does "CPI > 1" mean in perf stat output and what does it indicate?

CPI (Cycles Per Instruction) = cycles / instructions. A CPI of 1.0 means the CPU completed one instruction per cycle on average; superscalar cores can retire several instructions per cycle, so a well-optimized workload can have a CPI below 1.0. CPI > 1 indicates stalls. As a rough guide: (1) CPI 1.0-2.0: normal for memory-bound workloads with some cache misses; (2) CPI 2.0-4.0: significant cache miss penalties or memory latency; (3) CPI > 4.0: severe memory bottleneck, possible cache thrashing or non-optimized memory access patterns. Instructions per cycle (IPC) is simply the inverse of CPI, and perf stat prints it as "insn per cycle". On multi-core systems the system-wide total hides per-core differences, so use perf stat -A to see each CPU separately. CPI only covers cycles in which the CPU was running; time spent waiting for I/O shows up as low CPU utilization, not as a higher CPI.
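
The ratio is simple enough to compute by hand from the two counters that perf stat reports. A small sketch with hypothetical counter values (the function names are illustrative):

```python
def cpi(cycles, instructions):
    """Cycles per instruction; lower is better."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions

def ipc(cycles, instructions):
    """Instructions per cycle, the inverse of CPI."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles

# Hypothetical counts from a 5 second perf stat run
cycles, instructions = 16_000_000_000, 5_000_000_000
print(f"CPI={cpi(cycles, instructions):.2f} IPC={ipc(cycles, instructions):.2f}")
```

Tracking this number before and after a change is often more informative than raw CPU utilization, because utilization stays at 100% whether the core is retiring instructions or stalled on memory.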

15. How do you profile kernel module code that's loaded and unloaded dynamically?

Dynamic module loading creates profiling challenges because a module's symbols and load address exist only while it is loaded, so samples can be hard to resolve afterwards. Solutions: (1) build the module with debug info and keep the unstripped .ko file available so perf can match samples to symbols; (2) save /proc/kallsyms and /proc/modules while the module is loaded, so addresses can be mapped to names after the fact; (3) enable ftrace before running the workload, using the :mod:<module> syntax in set_ftrace_filter to restrict tracing to the module's functions; (4) use BPF kprobes on module functions, remembering that probes can only be attached while the module is loaded and must be re-attached after a reload; (5) for post-mortem analysis, use perf record -k mono -g so samples carry monotonic-clock timestamps that can be ordered against other logs.

16. What is the difference between instruction-based sampling (IBS) and standard PMC profiling?

AMD's IBS (Instruction-Based Sampling) provides more detailed profiling than standard PMCs. IBS periodically tags and samples two kinds of activity: Fetches (instruction fetch cycles and op information) and Ops (micro-ops retired and their characteristics). Unlike standard PMCs, which count events (cache misses, branches), IBS provides exact instruction addresses that caused events. Benefits: (1) more precise hotspot identification with little or no skid; (2) micro-op level visibility for AMD Zen architectures; (3) better handling of instruction cache misses. Access via perf record -e ibs_fetch/.../ -e ibs_op/.../. The closest Intel equivalent is PEBS (Precise Event-Based Sampling).

17. How do you identify false sharing in multi-threaded applications using profiling?

False sharing occurs when threads on different cores modify variables that share the same cache line, causing unnecessary cache invalidations. Profiling indicators: (1) a high rate of cache misses on small, frequently written data, with many invalidations but little data transferred; (2) throughput that gets worse, not better, as threads are added; (3) hot loads and stores on fields that sit next to each other in memory. Confirmation: use perf c2c record followed by perf c2c report (cache-to-cache analysis), which lists contended cache lines and the loads that hit a line modified by another core (HITM), the signature of cache line bouncing. Fix: pad or align data structures so that each thread's frequently written data sits on its own cache line.
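
Whether two fields can collide is a matter of layout arithmetic: they share a line when their offsets fall into the same line-sized slot. The sketch below uses ctypes to inspect field offsets, assuming a 64-byte cache line (common on x86-64, but an assumption here) and a structure that starts on a line boundary; the structure and function names are hypothetical.

```python
import ctypes

CACHE_LINE = 64  # bytes; assumed line size, check your CPU

class PackedCounters(ctypes.Structure):
    """Two per-thread counters placed side by side."""
    _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]

class PaddedCounters(ctypes.Structure):
    """Same counters, with padding that pushes b onto the next line."""
    _fields_ = [("a", ctypes.c_uint64),
                ("_pad", ctypes.c_ubyte * (CACHE_LINE - 8)),
                ("b", ctypes.c_uint64)]

def share_cache_line(struct, first, second, line=CACHE_LINE):
    """True if both fields fall into the same cache-line-sized slot."""
    off1 = getattr(struct, first).offset
    off2 = getattr(struct, second).offset
    return off1 // line == off2 // line

print(share_cache_line(PackedCounters, "a", "b"))
print(share_cache_line(PaddedCounters, "a", "b"))
```

The padded layout costs 56 extra bytes per structure, a trade that is usually worth it only for data that several threads write at high frequency.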

18. What is the difference between --call-graph dwarf and --call-graph fp in perf record?

--call-graph dwarf copies a chunk of the user stack with every sample and unwinds it afterwards using DWARF call frame information (.eh_frame or .debug_frame). It is more accurate for binaries built without frame pointers, but has higher overhead and produces much larger perf.data files. --call-graph fp (the default for perf record -g) uses frame pointer chaining: each stack frame contains a pointer to the previous frame. It is faster, but only works if code was compiled with -fno-omit-frame-pointer. GCC omits frame pointers in optimized builds by default, making fp mode unreliable for many distribution binaries. For containerized environments where debug symbols may not be available locally, use --call-graph dwarf with separate debuginfo packages.

19. How does the kernel's perf_event_paranoid setting affect profiling capabilities?

The perf_event_paranoid sysctl controls access to performance events for unprivileged users (those without CAP_PERFMON or CAP_SYS_ADMIN):

-1: No restrictions
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU-wide (system-wide) event access
>= 2: Disallow kernel profiling; only user-space measurements of your own processes are allowed

In containers and on some distributions, this is often set to restrictive values. With perf_event_paranoid >= 2, unprivileged users can still profile their own processes in user space, but not the kernel or other users' processes. Solutions: (1) run the container with --cap-add=SYS_ADMIN (not recommended for security); (2) restrict events to user space with the :u event modifier, or profile in a VM; (3) grant CAP_PERFMON (kernel 5.8+), which has finer-grained control.
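
A profiling script can check the setting up front and explain what will fail, instead of surfacing a bare permission error later. The sketch below encodes the cumulative levels as described in the kernel's sysctl documentation; the function names and message wording are illustrative, and some distributions add stricter levels that this mapping does not cover.

```python
def describe_paranoid(level):
    """List what an unprivileged user cannot do at a perf_event_paranoid level."""
    restrictions = []
    if level >= 0:
        restrictions.append("raw and ftrace function tracepoints need privilege")
    if level >= 1:
        restrictions.append("CPU-wide (system-wide) events need privilege")
    if level >= 2:
        restrictions.append("kernel profiling needs privilege")
    return restrictions or ["no restrictions for unprivileged users"]

def read_paranoid(path="/proc/sys/kernel/perf_event_paranoid"):
    """Read the current level; only meaningful on Linux."""
    with open(path) as f:
        return int(f.read().strip())

for level in (-1, 0, 1, 2):
    print(level, describe_paranoid(level))
```

Calling read_paranoid() at startup and printing the matching restrictions gives users an actionable message before any events are opened.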

20. What is the significance of "staleness" in BPF maps during profiling sessions?

BPF maps can exhibit staleness when the kernel is updating them while userspace is reading. Key issues: (1) Lookup vs update race: a value read from userspace may be modified concurrently by a running BPF program; (2) Incremental updates: when aggregating in-kernel, userspace might see partial results while it iterates over a map. Mitigations: (1) use per-CPU map types (BPF_MAP_TYPE_PERCPU_HASH, or BPF_PERCPU_HASH in BCC) so that each CPU updates its own slot and userspace sums the slots; (2) implement double buffering: read from one map while the kernel writes to another, then swap; (3) use __sync_fetch_and_add for atomic increments on shared counters. For production tracing, consider using the BPF ring buffer (BPF_MAP_TYPE_RINGBUF, available in Linux 5.8+), which supports efficient data transfer through a single buffer shared across CPUs.

Further Reading

Brendan Gregg’s perf Tools - Comprehensive perf examples and flame graph creation

BPF Performance Tools - eBPF tools for Linux observability

Linux perf Examples - One-minute tutorials with real examples

LWN.net: The perf tools - Deep dive into perf internals

Kernel Profiling with perf - Official kernel profiling guide

Conclusion

Performance profiling transforms guesswork into data-driven optimization by systematically measuring where time is spent during execution. Linux provides a layered ecosystem: perf for hardware event counting and call stack sampling, ftrace for function-level kernel tracing with minimal overhead, and BCC/eBPF for custom production-safe instrumentation without kernel changes.

Always start with baseline measurements using perf stat before detailed profiling. Use flame graphs to visualize call patterns and make bottlenecks immediately obvious. For production systems, aggregate data in-kernel with BPF to minimize overhead. Remember that profiling is not optimization—it tells you where problems exist, not how to fix them. Combine profiling data with algorithm understanding to make meaningful improvements.

For continued learning, explore kernel profiling with perf inside containers, eBPF programs for custom latency histograms, and hardware performance monitoring with last branch records (LBR) for speculative execution analysis.
