Async-Profiler: Low-Overhead CPU and Memory Profiling

Learn async-profiler for low-overhead CPU and memory profiling in production. Generate flame graphs, analyze allocations, and diagnose JVM bottlenecks.

published: May 26, 2026 reading time: 21 min read author: GeekWorkBench

Async-profiler has become the go-to tool for JVM profiling in production. Traditional sampling profilers collect stacks through JVMTI calls that only observe threads at safepoints, which adds overhead and biases the results. Async-profiler instead combines operating system signals and perf events with the JVM's AsyncGetCallTrace API to capture stack traces at arbitrary points, without modifying running code. The result is profiling data that accurately represents what the JVM is doing, with overhead typically under 5% even for detailed CPU profiles.

This tool is not new, but it is still underutilized. Many teams struggle with production performance issues because their profilers introduce too much overhead to run on live systems. Async-profiler removes that barrier.

Introduction

Async-profiler is a production-grade profiling tool that captures CPU and memory allocation data without the overhead that makes traditional JVMTI-based profilers unusable in live systems. Where conventional profilers halt application threads to inspect their state, async-profiler relies on operating system signals and JVM internal APIs to collect stack traces with typically under 5% overhead. This makes it practical for investigating performance issues in production environments where introducing significant latency would mask the problems you are trying to diagnose.

The tool has become the de facto standard for JVM profiling because it produces flame graphs that make hot code paths immediately visible, works without code changes or restarts, and handles both CPU profiling and allocation analysis from the same agent. It addresses a persistent problem in Java production environments: the performance bottleneck you need to find only manifests under real load, yet adding profiling instrumentation changes the behavior you are trying to observe. By staying under 5% overhead even for detailed CPU captures, async-profiler lets you profile production traffic without the measurement itself distorting the result.

This post covers when async-profiler is the right tool and when alternatives like JProfiler or YourKit are better suited, explains how the agent uses OS signals and hardware counters to capture stack traces without modifying running code, and provides practical commands for CPU profiling, allocation analysis, and memory leak detection. You will also see real failure scenarios where async-profiler identified JIT compilation hotspots, lock contention, and allocation patterns causing GC pressure.

When to Use Async-Profiler / When Not to Use Async-Profiler

Async-profiler shines when you need accurate CPU flame graphs without halting application threads. It works well for identifying which code paths consume the most CPU, finding lock contention bottlenecks, and analyzing allocation patterns in production. The ability to profile without restart makes it invaluable for investigating issues in live systems.

Do not use async-profiler when you need exact per-method allocation byte counts or object retention graphs; YourKit handles those better. For very short-lived processes, the attach and startup cost may dominate. And if you want profiling integrated into an IDE workflow, JProfiler or the IntelliJ IDEA profiler remain more convenient despite their overhead.

How Async-Profiler Works

graph TD
    subgraph "Operating System"
        PMU[Performance Monitoring Unit<br/>Hardware Counter Interrupts]
        Signals[OS Signals<br/>SIGPROF, SIGVTALRM]
    end

    subgraph "JVM"
        JVMTI[JVMTI Agent<br/>Async-profiler Agent]
        CodeCache[Code Cache<br/>JIT Compiled Code]
        Threads[Java Threads<br/>Running State]
    end

    subgraph "Async-Profiler"
        Coll[Collector Thread<br/>Stack Walking]
        Buffer[Circular Event Buffer]
        Converter[Flame Graph Converter]
    end

    PMU -->|"Hardware events"| Signals
    Signals -->|"signal handler"| JVMTI
    JVMTI -->|"walk stacks"| Threads
    JVMTI --> Coll
    Coll --> Buffer
    Buffer --> Converter

Async-profiler registers signal handlers with the OS. When a PMU counter overflows or the configured sampling interval elapses, the OS delivers a signal (SIGPROF in CPU mode) to the running thread. The signal handler captures that thread's stack trace by combining the JVM's AsyncGetCallTrace API for Java frames with its own walking of native frames. Because this does not wait for the thread to reach a safepoint, overhead stays low and samples are not biased toward safepoint locations. The profiler then aggregates identical traces and converts them to flame graph format for analysis.
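The aggregation step can be pictured with a small sketch. This is not async-profiler's internal code; it only assumes the widely used collapsed-stack convention (frames joined by semicolons, followed by a sample count), and the class and frame names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: folds raw stack samples into collapsed-stack lines. */
final class StackFolder {

    /** Each sample lists frames from the thread entry point (root) to the running method (leaf). */
    static Map<String, Integer> fold(List<List<String>> samples) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted, so output order is stable
        for (List<String> stack : samples) {
            counts.merge(String.join(";", stack), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> samples = List.of(
                List.of("Thread.run", "Handler.handle", "Parser.parse"),
                List.of("Thread.run", "Handler.handle", "Parser.parse"),
                List.of("Thread.run", "Handler.handle", "Db.query"));
        // One "frame;frame;frame count" line per distinct stack
        fold(samples).forEach((stack, n) -> System.out.println(stack + " " + n));
    }
}
```

Identical stacks collapse into a single line with a count, which is why a profile stays compact even when it covers many thousands of samples.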

Installation and Setup

```bash
# Download an async-profiler release
wget https://github.com/async-profiler/async-profiler/releases/download/v2.9/async-profiler-2.9-linux-x64.tar.gz
tar xzf async-profiler-2.9-linux-x64.tar.gz

# Verify installation (the archive unpacks into a directory named after the release)
cd async-profiler-2.9-linux-x64
./profiler.sh --version

# Attach to a running JVM (requires pid) for 30 seconds and write an HTML flame graph
./profiler.sh -d 30 -f flamegraph.html <pid>
```

Async-profiler runs on both x86-64 and ARM64 and works with current JDKs, including Java 21. Platforms without a prebuilt release may require building from source.

CPU Profiling

CPU profiling captures stack traces at regular intervals to build a picture of where CPU time goes:

# Basic CPU profiling for 30 seconds
./profiler.sh -d 30 -e cpu <pid>

# CPU profiling with a longer sampling interval (20 ms, given in nanoseconds) to reduce overhead further
./profiler.sh -d 30 -e cpu -i 20000000 <pid>

# CPU profile with stack filtering (keep only stack traces that contain a matching frame)
./profiler.sh -d 30 -e cpu -I 'com/mycompany/*' <pid>

# Output as flame graph
./profiler.sh -d 30 -e cpu -f flamegraph.html <pid>

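Flame graph output (-o flamegraph, or -f with a file name ending in .html) produces an interactive flame graph that you can open in a browser; the 2.x releases write HTML, while 1.x wrote SVG. The width of each frame represents the proportion of samples, and therefore CPU time, in which that method was on the stack. The vertical axis shows the call stack depth.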

Allocation Profiling

Async-profiler can instrument object allocations without the massive overhead of JVMTI allocation profiling:

# Profile allocations over 30 seconds
./profiler.sh -d 30 -e alloc <pid>

# Allocation profiling restricted to objects that are still live
./profiler.sh -d 30 -e alloc --live <pid>

# Sample less often: record roughly one allocation per 512 KB allocated
./profiler.sh -d 30 -e alloc --alloc 524288 <pid>

Allocation profiling works by intercepting the slow path in object allocation—the code path taken when TLAB (Thread Local Allocation Buffer) is full or allocation requires JVM intervention. This is only a fraction of total allocations but provides statistically representative sampling with minimal overhead.
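To see why slow-path sampling touches only a fraction of allocations, here is a toy model. It assumes a fixed TLAB size and a simple refill rule, which is a deliberate simplification (HotSpot resizes TLABs adaptively), and the class is hypothetical rather than part of any profiler API.

```java
/** Toy model of TLAB-driven allocation sampling; not HotSpot's real sizing policy. */
final class TlabSamplingModel {
    private final int tlabSize; // bytes per TLAB (fixed here for simplicity)
    private int used;           // bytes consumed in the current TLAB
    private int fastPath;       // allocations served by bumping the TLAB pointer
    private int sampled;        // allocations that took the slow path and would be recorded

    TlabSamplingModel(int tlabSize) {
        this.tlabSize = tlabSize;
        this.used = tlabSize;   // start "full" so the first allocation requests a TLAB
    }

    void allocate(int bytes) {
        if (bytes > tlabSize) {
            sampled++;                    // too large for a TLAB: allocated outside, slow path
        } else if (used + bytes > tlabSize) {
            used = bytes;                 // current TLAB exhausted: refill, slow path
            sampled++;
        } else {
            used += bytes;                // common case: pointer bump, invisible to the profiler
            fastPath++;
        }
    }

    int fastPath() { return fastPath; }
    int sampled()  { return sampled; }

    public static void main(String[] args) {
        TlabSamplingModel model = new TlabSamplingModel(1024);
        for (int i = 0; i < 1000; i++) {
            model.allocate(64);           // 16 objects of 64 bytes fit in one 1 KB TLAB
        }
        // prints "fast path: 937, sampled: 63"
        System.out.println("fast path: " + model.fastPath() + ", sampled: " + model.sampled());
    }
}
```

In this model roughly one allocation in sixteen is observed, yet code that allocates more bytes exhausts TLABs more often and is therefore sampled proportionally more, which is what makes the sample representative.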

Memory Leak Detection with Heap Live Data

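The --live option, combined with -e alloc, limits an allocation profile to objects that survive garbage collection:

# Capture allocation sites of objects that are still alive
./profiler.sh -d 30 -e alloc --live -f live-1.html <pid>

# Take a second capture later and compare it with the first
./profiler.sh -d 30 -e alloc --live -f live-2.html <pid>

In this mode the profiler remembers sampled allocations and drops those whose objects have since been collected, so the remaining samples point at the allocation sites of surviving objects. Comparing two captures reveals which allocation sites keep accumulating live objects over time, which is strong evidence of a memory leak.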
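Comparing two captures comes down to a per-site difference. The sketch below assumes you have already reduced each capture to a map from allocation site to live bytes; that reduction step, the class name, and the site names are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative comparison of two live-object captures keyed by allocation site. */
final class LiveSetDiff {

    /** Returns the sites whose live bytes grew, mapped to the increase in bytes. */
    static Map<String, Long> growth(Map<String, Long> earlier, Map<String, Long> later) {
        Map<String, Long> grown = new LinkedHashMap<>();
        later.forEach((site, bytes) -> {
            long delta = bytes - earlier.getOrDefault(site, 0L); // new sites start from zero
            if (delta > 0) {
                grown.put(site, delta);
            }
        });
        return grown;
    }

    public static void main(String[] args) {
        Map<String, Long> first = Map.of("SessionCache.put", 1_000_000L, "Parser.parse", 500_000L);
        Map<String, Long> second = Map.of("SessionCache.put", 4_000_000L, "Parser.parse", 480_000L);
        // Only SessionCache.put grew between the two captures
        System.out.println(growth(first, second));
    }
}
```

A site that grows in every successive capture under steady load is a better leak candidate than one that is merely large.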

Reading Flame Graphs

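A flame graph merges all sampled call stacks into one picture. Each block represents a method, and its width is proportional to the number of samples in which that method was on the stack. Vertical stacking shows the call hierarchy: a block sits on top of its caller. Reading from bottom to top traces a specific call path. Blocks that sit side by side on the same block share the same caller.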

# Sample flame graph output format (text version)
- profile
  - Runtime.optimize
    - Compiler.runCompilation
      - MyService.process [5000 samples]
        - MyServiceHelper.calculate [3000 samples]
        - MyServiceHelper.validate [2000 samples]
    - MyService.process [2000 samples]

The numbers in brackets show sample counts. In this example, MyService.process appears 7000 times total, but only 2000 of those samples show it directly under profile—the other 5000 samples show it called through Runtime.optimize and Compiler.runCompilation, indicating JIT compilation is happening during processing.
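The 7000 total can be reproduced mechanically. The sketch below restates the example tree as collapsed-stack lines (an assumption about how you would export it) and sums, for each frame, the samples of every stack that contains it; the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Illustrative: total (inclusive) samples per frame from collapsed-stack lines. */
final class FrameTotals {

    static Map<String, Long> inclusive(List<String> collapsedLines) {
        Map<String, Long> totals = new HashMap<>();
        for (String line : collapsedLines) {
            int split = line.lastIndexOf(' ');
            long samples = Long.parseLong(line.substring(split + 1));
            String[] frames = line.substring(0, split).split(";");
            // A set avoids double counting when recursion repeats a frame within one stack
            for (String frame : new HashSet<>(Arrays.asList(frames))) {
                totals.merge(frame, samples, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> example = List.of(
                "profile;Runtime.optimize;Compiler.runCompilation;MyService.process;MyServiceHelper.calculate 3000",
                "profile;Runtime.optimize;Compiler.runCompilation;MyService.process;MyServiceHelper.validate 2000",
                "profile;MyService.process 2000");
        // prints 7000: 3000 + 2000 via the compilation path, plus 2000 directly under profile
        System.out.println(inclusive(example).get("MyService.process"));
    }
}
```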

Production Failure Scenarios

Scenario 1: Identifying JIT Compilation Hotspots

A service exhibits periodic latency spikes that correlate with certain code paths being executed for the first time or after a period of inactivity. The flame graph shows the JIT compiler threads, with frames such as CompileBroker::compiler_thread_loop and C2Compiler::compile_method, consuming significant time.

This happens because JIT compilation is expensive. Methods that have not been compiled yet run through the interpreter. Methods that were compiled but not used recently may have their compiled code invalidated. The first request after a deployment always pays compilation cost.

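Solution: Warm up the application with representative requests before sending production traffic. JVM flags can also help: -XX:CompileCommand controls how individual methods are compiled (for example, excluding or inlining them), and the tiered compilation thresholds determine how quickly frequently used methods reach optimized code.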
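A warm-up phase can be as simple as driving the critical path in a loop before the service reports itself ready. The following is a minimal sketch; the handler lambda stands in for real request handling, and the iteration count is an assumption, since the number of invocations HotSpot needs depends on flags and JVM version.

```java
import java.util.function.IntUnaryOperator;

/** Illustrative warm-up helper; run it before the service starts accepting traffic. */
final class WarmUp {

    /** Runs the operation repeatedly so the JIT sees it as hot before real traffic arrives. */
    static long run(IntUnaryOperator criticalPath, int iterations) {
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            checksum += criticalPath.applyAsInt(i); // use the result so the call is not dead code
        }
        return checksum;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a request handler
        IntUnaryOperator handler = x -> Integer.bitCount(x) + x % 7;
        // 20_000 iterations is an assumed figure, not a HotSpot guarantee
        System.out.println("warm-up checksum: " + run(handler, 20_000));
    }
}
```

Warm-up only helps if it exercises the same code paths and data shapes as production requests; a CPU profile taken during warm-up and again afterwards shows whether compiler-thread time has dropped.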

Scenario 2: Finding Lock Contention in Concurrent Code

An application shows low throughput even though its CPUs are not saturated, suggesting threads are waiting rather than working. The -e lock profiler mode captures lock contention:

./profiler.sh -d 30 -e lock <pid>

The resulting flame graph shows which code paths contend for locks most frequently. Often this reveals design issues where too many threads compete for the same lock, or where lock-free algorithms could replace locking implementations.
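As one example of replacing a contended lock, a hot counter guarded by synchronized can often be swapped for java.util.concurrent.atomic.LongAdder, which spreads updates across internal cells. The sketch below is illustrative; the class name and thread counts are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative counter that avoids a single shared lock on the hot path. */
final class ContentionFreeCounter {
    private final LongAdder hits = new LongAdder(); // striped cells instead of one monitor

    void record() { hits.increment(); }
    long total()  { return hits.sum(); }

    /** Increments the counter from several threads and returns the final total. */
    static long hammer(int threads, int incrementsPerThread) {
        ContentionFreeCounter counter = new ContentionFreeCounter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    counter.record();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.total();
    }

    public static void main(String[] args) {
        System.out.println(hammer(8, 100_000)); // prints 800000
    }
}
```

Whether this helps depends on the lock the profile actually points at; LongAdder suits write-heavy counters, while other hot spots may call for a concurrent map or a shorter critical section instead.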

Scenario 3: Allocation Rate Problems

A service experiencing GC pressure shows allocation hot spots in flame graphs. The alloc mode reveals which code creates the most objects:

./profiler.sh -d 60 -e alloc -o flamegraph <pid>

The flame graph shows which call paths allocate memory most frequently. Focus optimization efforts on those paths—reducing unnecessary object creation directly reduces GC pressure.
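A common source of avoidable allocation is boxing in collections. The sketch below contrasts a boxed list with a primitive array that computes the same result; both methods are hypothetical examples rather than code from a profiled service.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative: the same sum computed with and without per-element allocation. */
final class AllocationFriendlySum {

    /** Allocation-heavy: values outside the small Integer cache are boxed into new objects. */
    static long boxedSum(int n) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            values.add(i);
        }
        long sum = 0;
        for (Integer v : values) {
            sum += v;
        }
        return sum;
    }

    /** Same result with a single int[] allocation and no boxing. */
    static long primitiveSum(int n) {
        int[] values = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = i;
        }
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(boxedSum(1000) + " " + primitiveSum(1000)); // prints "499500 499500"
    }
}
```

In an allocation flame graph the boxed variant would typically show Integer.valueOf and ArrayList growth under the calling method, while the primitive variant shows a single array allocation.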

Trade-off Table

Mode	Overhead	What It Shows	Limitations
CPU	2-5%	Hot code paths	Requires sampling interval tuning
Alloc	3-8%	Allocation hot spots	Only samples TLAB refill path
Lock	1-3%	Lock contention	Only shows contended locks
Live Objects	10-20%	Allocation sites of surviving objects	Tracks sampled allocations only, not the full object graph

Implementation Snippets

Programmatically Starting Async-Profiler

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class AsyncProfilerControl {

    public static void main(String[] args) throws IOException, InterruptedException {
        // PID of the current JVM (Java 9+)
        long pid = ProcessHandle.current().pid();
        System.out.println("Process ID: " + pid);

        // Location of the launcher script; adjust to your installation
        Path script = Path.of("/opt/async-profiler/profiler.sh");
        if (!Files.isExecutable(script)) {
            System.out.println("Profiling requires external profiler.sh or API integration");
            return;
        }

        // The profiler runs as a native agent loaded into the target JVM.
        // For a JVM that is already running, jcmd can load it as well:
        //   jcmd <pid> JVMTI.agent_load /path/to/libasyncProfiler.so start,event=cpu
        // Here we launch profiler.sh against this JVM for 30 seconds.
        Process profiler = new ProcessBuilder(script.toString(),
                "-d", "30", "-e", "cpu", "-f", "/tmp/flamegraph.html", Long.toString(pid))
                .inheritIO()
                .start();
        System.out.println("profiler.sh exit code: " + profiler.waitFor());
    }
}

Container Environment Profiling

#!/bin/bash
# Profile a Java process running in a container
CONTAINER=your-container

# Look up the PID inside the container's PID namespace (requires pgrep in the image)
PID=$(docker exec "$CONTAINER" pgrep -f "your-application.jar" | head -1)

if [ -z "$PID" ]; then
    echo "Java process not found"
    exit 1
fi

# Run async-profiler inside the container (requires a container with async-profiler installed)
docker exec "$CONTAINER" \
    /opt/async-profiler/profiler.sh \
    -d 60 \
    -e cpu \
    -f /tmp/flamegraph.html \
    "$PID"

# The output file is written inside the container; copy it to the host
docker cp "$CONTAINER":/tmp/flamegraph.html /tmp/flamegraph.html

echo "Flame graph written to /tmp/flamegraph.html"

Observability Checklist

Before concluding profiling is complete, verify these points:

The profiling duration captured enough time to see stable patterns (30 seconds minimum for frequently executed code)
The sampling interval was appropriate (too frequent creates overhead, too sparse misses important events)
The JVM was warmed up (cold code paths during profiling skew results)
CPU and memory metrics correlate with the profiling findings, which confirms the findings are real

Also check that the JVM did not undergo GC during CPU profiling that could skew results, and that the workload during profiling matched the production behavior you are investigating.

Security and Compliance Notes

Async-profiler has significant capabilities that should be restricted. The ability to capture stack traces reveals application internals, including class names, method names, and code paths. This information aids attackers in understanding your application’s architecture.

On production systems, restrict profiler access to authorized operations personnel. Use Linux capabilities to limit which users can attach profilers. Consider profiling-only environments that do not run production traffic.

A live-object allocation profile (-e alloc --live) reveals class names and allocation sites that may expose sensitive details of your data model. Handle these profiles with the same care as heap dumps.

Common Pitfalls / Anti-Patterns

The most common mistake is profiling for too short a duration. A 5-second profile captures only what happened in those 5 seconds. If traffic is bursty, you may miss important code paths. Always profile for at least 30 seconds, preferably several minutes or across multiple load cycles.
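A quick back-of-the-envelope calculation shows why short profiles mislead. The sketch below assumes a 10 ms sampling interval (async-profiler's default for CPU mode) and a single busy thread; the class and method names are hypothetical.

```java
/** Illustrative arithmetic: how many samples a piece of code can expect to receive. */
final class SampleBudget {

    /** Expected samples for code using cpuShare (0..1) of one continuously busy thread. */
    static double expectedSamples(double durationSeconds, double intervalMillis, double cpuShare) {
        double samplesPerSecond = 1000.0 / intervalMillis;
        return durationSeconds * samplesPerSecond * cpuShare;
    }

    public static void main(String[] args) {
        // A method responsible for 1% of CPU time, sampled every 10 ms
        System.out.printf("5 s profile:   ~%.0f samples%n", expectedSamples(5, 10, 0.01));
        System.out.printf("120 s profile: ~%.0f samples%n", expectedSamples(120, 10, 0.01));
    }
}
```

With about 5 samples, a method using 1% of CPU is indistinguishable from noise; with about 120 it forms a visible, stable frame.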

Another issue is incorrect interpretation of wide frames. A wide frame in a CPU flame graph shows the most samples, but that does not always mean the biggest problem. Look for unexpected patterns, like library code consuming significant time, which might indicate inefficient usage rather than a problem in your code.

Forgetting to exclude idle or GC threads skews results. JVM threads like the GC task threads and the VM Thread should typically be excluded from analysis unless you are investigating GC issues specifically.

Finally, make sure the workload during profiling matches what you are investigating. Profiling a development workload against production architecture gives misleading results. Ensure the system state during profiling represents what you are trying to diagnose.

Quick Recap Checklist

Install async-profiler on your production hosts before you need it
Use -e cpu for CPU hot spots, -e alloc for allocation issues
Profile for at least 30 seconds, preferably during actual production load
Read flame graphs from bottom to top to trace call paths
Use -e alloc --live to find memory leaks with allocation site information
Exclude GC threads when investigating CPU issues
Protect profiler access—it exposes application internals

Interview Questions

1. How does async-profiler achieve low-overhead CPU profiling compared to traditional JVMTI-based profilers?

Traditional JVMTI profilers work by pausing application threads and querying their state through the JVM tool interface. This requires safepoint operations, which cause all threads to stop at synchronization points. The overhead compounds because threads must coordinate even when they have no work to do.

Async-profiler works differently. It uses operating system signals—typically SIGPROF—delivered directly to Java threads. The signal handler captures the stack trace using JVM internal APIs for walking Java stacks and native frames. No safepoint coordination is needed. Application threads continue running between signal deliveries, and only briefly pause to capture their state. The result is accurate profiling with overhead typically under 5%.

2. What is the difference between CPU profiling and allocation profiling in async-profiler?

CPU profiling with -e cpu captures stack traces at regular hardware timer intervals. This shows where the CPU is spending time executing code. It works for any executing code, including native methods and kernel time.

Allocation profiling with -e alloc hooks into the object allocation path, specifically the TLAB refill slow path. Not every object allocation is sampled—only allocations that trigger the slow path. This is a small fraction of total allocations but statistically representative. The advantage is dramatically lower overhead than full allocation tracking, typically 3-8% versus 50%+ for JVMTI allocation profiling.

3. How do you interpret a flame graph correctly?

Read the flame graph from the bottom up. The widest block at the bottom represents the entry point, usually a thread or main method. Stacks above it show parent-child relationships. A block's width is proportional to its sample count.

To trace a specific call path, start at the bottom and follow the widest path upward. Look for unexpected library code consuming significant time, which is often a sign of inefficient usage rather than library problems. JIT compilation shows up as separate stacks on the compiler threads, with compilation-related frame names.

The horizontal axis is not time. Sibling frames are sorted alphabetically and identical stacks are merged, so the left-to-right position of a frame says nothing about when it ran.

4. What causes lock contention to show up in async-profiler?

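Lock contention shows up when threads spend time waiting to acquire locks rather than executing code. CPU profiling samples only threads that are running on a CPU, so blocked threads are largely absent from a CPU flame graph; wall-clock mode (-e wall) samples threads regardless of state and makes the waiting visible. In -e lock mode, async-profiler records contended lock acquisitions specifically.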

The flame graph for lock contention shows which code paths are contending for locks. Often this reveals a design issue where too many threads compete for the same lock, a lock being held too long, or unnecessary synchronization. The solution is often redesigning the locking strategy or using concurrent data structures instead.

5. When should you use the --live option in async-profiler?

Use --live (together with -e alloc) when you suspect a memory leak and need to find where the leaking objects are created. Unlike a heap dump, which captures every object, a live allocation profile keeps only sampled allocations whose objects have not been garbage-collected, along with their allocation sites.

Compare two captures from different times. Allocation sites that grow in sample count or size between captures indicate potential leaks. The allocation sites tell you exactly where the leaking objects are created, which is half the battle in fixing leaks.

Tracking object liveness adds overhead on top of plain allocation profiling, so run it when the system is stable or during low-traffic periods. Because the profile is built from samples, it shows where surviving objects were allocated, not the full object graph; use a heap dump when you need to know what is holding them.

6. What is the framebuf option in async-profiler and when should you tune it?

The framebuf option (-b on the command line) belongs to the 1.x releases, where it set the size of the buffer that holds stack frames for collected samples. Larger buffers prevented overflow under high sample rates at the cost of more memory.

In 1.x, increase it when the profiler reports a frame buffer overflow. From version 2.0 on, the buffer grows on demand and the option is no longer needed.

7. How does async-profiler's -e lock mode differ from CPU profiling for contention analysis?

CPU profiling gives at most an indirect view of contention, because threads blocked on a lock are not running and therefore receive few samples. The -e lock mode records contended lock acquisitions directly, showing which locks are contested, how long threads waited for them, and which call paths lead to the contended lock.

The -e lock mode is more accurate for contention analysis because it instruments the exact point of contention rather than inferring it from thread state snapshots. Use it when lock contention is the suspected bottleneck.

8. What is the difference between async-profiler and perf for Java profiling?

Linux perf can profile Java processes but struggles to interpret Java stack traces—it sees native frames and JIT-compiled code but cannot map them back to Java method names without special configuration. Async-profiler has built-in knowledge of the JVM internal structures and can produce readable Java method names.

Async-profiler also provides Java-specific profiling modes like allocation profiling and provides allocation sites (where objects were created). Perf is useful for mixed native/Java profiles or when you need hardware PMU counters, but for pure Java profiling, async-profiler is more convenient.

9. How do you profile a JVM running inside a Docker container with async-profiler?

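The challenge is that CPU mode relies on the perf_event_open system call, which Docker's default seccomp profile blocks. Either relax the restriction (for example with --security-opt seccomp=unconfined or --cap-add SYS_ADMIN) or fall back to -e itimer, which does not need perf events. The profiler also needs access to the target's /proc entries and must run as the same user as the JVM.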

Bind-mount the async-profiler directory into the container and run it from inside the container rather than trying to profile across the container boundary. Alternatively, use docker exec to run the profiler inside the container namespace.

10. Can async-profiler profile both JVM and native code in the same session?

Yes. In CPU mode, async-profiler captures both Java methods and native code in the same flame graph. Native frames appear with their C/C++ function names. This is useful for diagnosing issues where Java code calls into native libraries.

Kernel time is also captured during CPU profiling—the flame graph shows kernel frames at the top for system calls. Use --exclude patterns to filter out noise from kernel frames if needed.

11. How do you interpret allocation flame graphs for optimization decisions?

Allocation flame graphs show where objects are created, not where memory is consumed. Wide frames indicate high allocation rates at those call paths—the target for optimization. Focus on paths allocating large numbers of short-lived objects.

Reduce allocation rate by object pooling (for expensive objects), lazy initialization (defer creation until needed), or replacing collections with primitive alternatives. The goal is to reduce allocations in hot paths, not eliminate all allocations.

12. What is the --alloc option and when would you use it?

--alloc sets the allocation sampling interval in bytes: the profiler records a sample roughly once per that many bytes allocated. Raising it reduces overhead at the cost of fewer samples. For example, --alloc 524288 samples about once per 512 KB allocated.

Use a larger interval on services with very high allocation rates, where the default produces more samples than you need. For general allocation profiling, leave the interval at its default.

13. How does async-profiler handle Java 21 virtual threads?

Virtual threads run on platform (carrier) threads, so a CPU sample taken while a virtual thread is mounted is delivered to the carrier's OS thread, and the captured stack contains the virtual thread's Java frames. Support for virtual threads has evolved across async-profiler releases, so check the release notes for the version you run before relying on it.

Because many virtual threads share a small number of carriers, per-thread views typically group samples by carrier thread rather than by individual virtual thread. Keep this in mind when attributing CPU time.

14. How do you compare flame graphs from different time periods?

Generate flame graphs with the same filter and duration settings from different time periods. Save them as separate SVG files. Open both in a browser and compare the top frames visually, or use diff tools that compare SVG structures.

For programmatic comparison, use the collapsed output format (-o collapsed), which writes one line per unique stack with its sample count, and write a script to diff sample counts per method. This helps identify whether optimization efforts are reducing CPU time in targeted code paths.
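Such a diff script can be quite small. The sketch below assumes both profiles were exported as collapsed-stack lines ("frame;frame;frame count") and compares self time, meaning samples attributed to the leaf frame of each stack; the class and frame names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative diff of two profiles given as collapsed-stack lines. */
final class CollapsedDiff {

    /** Samples attributed to the leaf (currently executing) frame of each stack. */
    static Map<String, Long> selfSamples(List<String> collapsedLines) {
        Map<String, Long> self = new TreeMap<>();
        for (String line : collapsedLines) {
            int split = line.lastIndexOf(' ');
            String stack = line.substring(0, split);
            String leaf = stack.substring(stack.lastIndexOf(';') + 1);
            self.merge(leaf, Long.parseLong(line.substring(split + 1)), Long::sum);
        }
        return self;
    }

    /** After minus before, per method; methods missing on one side count as zero. */
    static Map<String, Long> delta(List<String> before, List<String> after) {
        Map<String, Long> result = new TreeMap<>(selfSamples(after));
        selfSamples(before).forEach((method, n) -> result.merge(method, -n, Long::sum));
        return result;
    }

    public static void main(String[] args) {
        List<String> before = List.of("Thread.run;Handler.handle;Json.encode 900",
                                      "Thread.run;Handler.handle;Db.query 300");
        List<String> after = List.of("Thread.run;Handler.handle;Json.encode 200",
                                     "Thread.run;Handler.handle;Db.query 310");
        // Negative numbers mean fewer samples after the change
        System.out.println(delta(before, after)); // prints {Db.query=10, Json.encode=-700}
    }
}
```

Raw sample counts are only comparable when both profiles use the same duration, interval, and load; otherwise normalize each count by the profile's total before diffing.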

15. Why might your flame graph show "process reaping" or "[21] idle" frames?

"[21] idle" indicates CPU time spent in the idle threads when the processor would otherwise be idle. This is normal and often appears when the system has more CPU cores than the application uses. "process reaping" appears during process exit and is not actionable.

Use --exclude=idle to filter out idle time. Use --exclude='process reaping' to exclude process exit frames. These are housekeeping frames that do not represent actual application behavior.

16. How does async-profiler handle kernel-mode and user-mode time in CPU profiling?

In CPU profiling mode, async-profiler captures both kernel-mode and user-mode stack frames. Flame graphs show kernel frames at the top of the stack for system calls. This distinction matters because high kernel time indicates excessive system calls or I/O operations, while high user time indicates CPU-bound Java code.

Pass --all-user to record user-mode events only if you want to analyze application code CPU usage without kernel frames.

17. What is the --reverse flag in async-profiler and when is it useful?

The --reverse flag merges stacks starting from the leaf frame instead of the root. The methods that were actually executing when samples were taken form the base of the graph, with their callers stacked on top. This makes it easy to identify which methods are the ultimate consumers of CPU across all call paths, rather than which are parents in call chains. Some engineers find this format more intuitive for identifying optimization targets.

18. How do you profile only specific threads with async-profiler?

Use the -t option to profile threads separately. Each stack trace is then rooted at a frame carrying the thread name, so in the flame graph you can search for or zoom into a specific pool, for example every thread whose name starts with "pool-".

This is useful when you suspect specific thread pools are causing issues and want to isolate the analysis.
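Isolating a pool by thread name only works if the names are meaningful; the JDK default of "pool-N-thread-M" says little about what a pool does. A minimal sketch of a naming thread factory follows; the class name and the "checkout-worker" prefix are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative thread factory that gives pool threads a searchable name. */
final class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger next = new AtomicInteger(1);

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        // Produces names such as "checkout-worker-1", easy to find in per-thread profiler output
        return new Thread(task, prefix + "-" + next.getAndIncrement());
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2, new NamedThreadFactory("checkout-worker"));
        pool.execute(() -> System.out.println("running on " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```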

19. What does the -f option control in async-profiler output formats?

The -f option sets the output file name. The format is inferred from the extension (for example, .html produces a flame graph and .jfr a JFR recording) or chosen explicitly with -o: flat, traces, collapsed (the stackcollapse format used by flame graph tools), flamegraph, tree, or jfr. The collapsed format is useful for piping to other tools or generating custom visualizations.

20. How do you use async-profiler to identify JIT compilation bottlenecks?

During application warmup, JIT compilation consumes significant CPU. In CPU flame graphs, look for native frames such as CompileBroker::compiler_thread_loop or C2Compiler::compile_method, which run on threads with names like "C1 CompilerThread0" and "C2 CompilerThread0". These indicate compilation overhead. Use -t to split the profile by thread and isolate the compiler threads.

If compilation bottlenecks are severe, consider using tiered compilation with higher thresholds, or pre-warming the application with representative load before production traffic.

Conclusion

Async-profiler has become the standard for low-overhead production profiling because it uses OS-level signals rather than JVMTI instrumentation. For CPU profiling, use -e cpu with flame graph output. For allocation issues, use -e alloc. For lock contention, use -e lock. Adding --live to an allocation profile helps with memory leak detection by keeping only the allocation sites of surviving objects.

The key to effective async-profiler usage is profiling during representative production load for at least 30 seconds. Read flame graphs from bottom to top to trace call paths, and exclude GC threads when investigating CPU issues. Async-profiler works without code changes or restarts, making it invaluable for investigating issues in live systems.
