Async-Profiler: Low-Overhead CPU and Memory Profiling
Learn async-profiler for low-overhead CPU and memory profiling in production. Generate flame graphs, analyze allocations, and diagnose JVM bottlenecks.
Async-profiler has become the go-to tool for JVM profiling in production. Unlike traditional profilers that rely on JVMTI instrumentation with its associated overhead, async-profiler uses operating system-level instrumentation to capture stack traces without modifying running code. The result is profiling data that accurately represents what the JVM is doing, with overhead typically under 5% even for detailed CPU profiles.
This tool is not new, but it is still underutilized. Many teams struggle with production performance issues because their profilers introduce too much overhead to run on live systems. Async-profiler removes that barrier.
## Introduction
Async-profiler is a production-grade profiling tool that captures CPU and memory allocation data without the overhead that makes traditional JVMTI-based profilers unusable in live systems. Where conventional profilers halt application threads to inspect their state, async-profiler relies on operating system signals and JVM internal APIs to collect stack traces with typically under 5% overhead. This makes it practical for investigating performance issues in production environments where introducing significant latency would mask the problems you are trying to diagnose.
The tool has become the de facto standard for JVM profiling because it produces flame graphs that make hot code paths immediately visible, works without code changes or restarts, and handles both CPU profiling and allocation analysis from the same agent. It addresses a persistent problem in Java production environments: the performance bottleneck you need to find only manifests under real load, yet adding profiling instrumentation changes the behavior you are trying to observe. By staying under 5% overhead even for detailed CPU captures, async-profiler lets you profile production traffic without the measurement itself distorting what you measure.
This post covers when async-profiler is the right tool and when alternatives like JProfiler or YourKit are better suited, explains how the agent uses OS signals and hardware counters to capture stack traces without modifying running code, and provides practical commands for CPU profiling, allocation analysis, and memory leak detection. You will also see real failure scenarios where async-profiler identified JIT compilation hotspots, lock contention, and allocation patterns causing GC pressure.
## When to Use Async-Profiler / When Not to Use Async-Profiler
Async-profiler shines when you need accurate CPU flame graphs without halting application threads. It works well for identifying which code paths consume the most CPU, finding lock contention bottlenecks, and analyzing allocation patterns in production. The ability to profile without restart makes it invaluable for investigating issues in live systems.
Do not use async-profiler when you need method-level allocation byte counts or object retention graphs—YourKit handles those better. For very short-lived processes, the startup overhead may dominate. And if you need IDE integration for step-by-step debugging, JProfiler or IntelliJ profiler remain more convenient despite their overhead.
## How Async-Profiler Works
```mermaid
graph TD
    subgraph "Operating System"
        PMU[Performance Monitoring Unit<br/>Hardware Counter Interrupts]
        Signals[OS Signals<br/>SIGPROF, SIGURG]
    end
    subgraph "JVM"
        JVMTI[JVMTI Agent<br/>Async-profiler Agent]
        CodeCache[Code Cache<br/>JIT Compiled Code]
        Threads[Java Threads<br/>Running State]
    end
    subgraph "Async-Profiler"
        Coll[Collector Thread<br/>Stack Walking]
        Buffer[Circular Event Buffer]
        Converter[Flame Graph Converter]
    end
    PMU -->|"Hardware events"| Signals
    Signals -->|"signal handler"| JVMTI
    JVMTI -->|"walk stacks"| Threads
    JVMTI --> Coll
    Coll --> Buffer
    Buffer --> Converter
```
Async-profiler registers signal handlers with the OS. When the PMU triggers an interrupt or the configured interval elapses, the OS delivers a signal to a Java thread. The signal handler captures the thread's stack trace by walking the stack using JVM internal APIs and native frame walking. This happens without cooperation from Java code itself, which is why overhead stays low. The collector thread aggregates these traces and converts them to flame graph format for analysis.
## Installation and Setup
```bash
# Download async-profiler release
wget https://github.com/async-profiler/async-profiler/releases/download/v2.9/async-profiler-2.9-linux-x64.tar.gz
tar xzf async-profiler-2.9-linux-x64.tar.gz
# Verify installation (the archive unpacks into a directory named after the tarball)
cd async-profiler-2.9-linux-x64
./profiler.sh --version
# Attach to a running JVM (requires its pid) and write a 30-second flame graph
./profiler.sh -d 30 -f flamegraph.html <pid>
```
For Java 21 and later, async-profiler works with both x86 and ARM architectures. Older versions may require building from source for certain platforms.
## CPU Profiling
CPU profiling captures stack traces at regular intervals to build a picture of where CPU time goes:
```bash
# Basic CPU profiling for 30 seconds
./profiler.sh -d 30 -e cpu <pid>
# CPU profiling with a longer sampling interval (20 ms, given in nanoseconds) to reduce overhead further
./profiler.sh -d 30 -e cpu -i 20000000 <pid>
# CPU profile with stack filtering (only include specific packages)
./profiler.sh -d 30 -e cpu -I 'com/mycompany/*' -X 'java/*' <pid>
# Output as flame graph
./profiler.sh -d 30 -e cpu -o flamegraph -f flamegraph.html <pid>
```
The -o flamegraph option produces an interactive flame graph that you can open in a browser (HTML in async-profiler 2.x; the 1.x releases wrote SVG). The width of each frame represents the proportion of CPU time spent in that method. The vertical axis shows the call stack depth.
## Allocation Profiling
Async-profiler can instrument object allocations without the massive overhead of JVMTI allocation profiling:
```bash
# Profile allocations over 30 seconds
./profiler.sh -d 30 -e alloc <pid>
# Keep only samples whose objects are still live at the end of the session
./profiler.sh -d 30 -e alloc --live <pid>
# Take roughly one sample per 512 KB allocated (coarser sampling, lower overhead)
./profiler.sh -d 30 -e alloc --alloc 512k <pid>
```
Allocation profiling works by intercepting the slow path in object allocation—the code path taken when TLAB (Thread Local Allocation Buffer) is full or allocation requires JVM intervention. This is only a fraction of total allocations but provides statistically representative sampling with minimal overhead.
## Memory Leak Detection with Heap Live Data
The --live option, used together with -e alloc, keeps only the allocation samples whose objects are still alive at the end of the profiling session:
```bash
# Capture allocation sites of objects that are still live when the session ends
./profiler.sh -d 30 -e alloc --live -o collapsed -f live-1.txt <pid>
# Repeat later, then compare the two captures to find growing allocation sites
./profiler.sh -d 30 -e alloc --live -o collapsed -f live-2.txt <pid>
```
Each capture lists the allocation sites of sampled objects that survived garbage collection. Comparing two captures reveals which allocation sites keep growing over time, which is strong evidence of a memory leak.
## Reading Flame Graphs
A flame graph shows call stacks horizontally. Each wide block represents a method. The vertical stacking shows parent calls. Reading from bottom to top traces a specific call path. Reading horizontally shows which methods share the same parent.
```text
# Sample flame graph output format (text version)
- profile
  - Runtime.optimize
    - Compiler.runCompilation
      - MyService.process [5000 samples]
        - MyServiceHelper.calculate [3000 samples]
        - MyServiceHelper.validate [2000 samples]
  - MyService.process [2000 samples]
```
The numbers in brackets show sample counts. In this example, MyService.process appears 7000 times total, but only 2000 of those samples show it directly under profile—the other 5000 samples show it called through Runtime.optimize and Compiler.runCompilation, indicating JIT compilation is happening during processing.
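The sample arithmetic above can be checked mechanically. The following is a minimal sketch, assuming the profile was exported in collapsed format (one `frame;frame;... count` line per stack). The class name `CollapsedStacks` and the three input lines are hypothetical and simply mirror the example tree; they are not real profiler output.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CollapsedStacks {
    /** Sums samples per frame; a frame is counted once per stack even if it recurses. */
    static Map<String, Long> inclusiveSamples(List<String> collapsedLines) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String line : collapsedLines) {
            int split = line.lastIndexOf(' ');
            if (split < 0) {
                continue; // skip lines without a trailing sample count
            }
            long count = Long.parseLong(line.substring(split + 1).trim());
            Set<String> seen = new HashSet<>();
            for (String frame : line.substring(0, split).split(";")) {
                if (seen.add(frame)) {
                    totals.merge(frame, count, Long::sum);
                }
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical collapsed stacks matching the example tree
        List<String> lines = List.of(
            "profile;Runtime.optimize;Compiler.runCompilation;MyService.process;MyServiceHelper.calculate 3000",
            "profile;Runtime.optimize;Compiler.runCompilation;MyService.process;MyServiceHelper.validate 2000",
            "profile;MyService.process 2000");
        Map<String, Long> totals = inclusiveSamples(lines);
        long all = totals.get("profile");
        // A frame's width in the flame graph is its share of all samples
        totals.forEach((frame, samples) ->
            System.out.printf("%-28s %5d samples (%.0f%% of width)%n",
                frame, samples, 100.0 * samples / all));
    }
}
```

Running it reports 7000 samples for MyService.process, which is the sum of the 5000 and 2000 shown on its two call paths.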
## Production Failure Scenarios
### Scenario 1: Identifying JIT Compilation Hotspots
A service exhibits periodic latency spikes that correlate with certain code paths being executed for the first time or after a long idle period. The flame graph shows JIT compiler frames such as CompileBroker::compiler_thread_loop or C2Compiler::compile_method consuming significant time.
This happens because JIT compilation is expensive. Methods that have not been compiled yet run through the interpreter. Methods that were compiled but not used recently may have their compiled code invalidated. The first request after a deployment always pays compilation cost.
Solution: Warm up the application before sending traffic. Use -XX:CompileCommand=option to force compilation of critical methods. Consider tiered compilation settings that prioritize frequently-used methods.
### Scenario 2: Finding Lock Contention in Concurrent Code
An application shows low throughput even though CPU utilization stays well below capacity, suggesting threads are waiting rather than working. The -e lock profiler mode captures lock contention:
```bash
./profiler.sh -d 30 -e lock <pid>
```
The resulting flame graph shows which code paths contend for locks most frequently. Often this reveals design issues where too many threads compete for the same lock, or where lock-free algorithms could replace locking implementations.
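To make the pattern concrete, here is a small sketch of the kind of code that a lock profile tends to flag, next to a variant that avoids the shared lock. The class `ContentionDemo` and its workload are hypothetical; the point is only that every increment in the first method competes for one monitor, while `LongAdder` spreads updates across internal cells.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

final class ContentionDemo {
    /** Every increment takes the same monitor, so threads queue up behind each other. */
    static long countWithSharedLock(int threads, int perThread) {
        Object lock = new Object();
        long[] counter = {0};
        runOnPool(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                synchronized (lock) {
                    counter[0]++;
                }
            }
        });
        synchronized (lock) {
            return counter[0];
        }
    }

    /** Same result without a shared monitor. */
    static long countWithLongAdder(int threads, int perThread) {
        LongAdder adder = new LongAdder();
        runOnPool(threads, () -> {
            for (int i = 0; i < perThread; i++) {
                adder.increment();
            }
        });
        return adder.sum();
    }

    /** Runs the task once on each of the given number of threads and waits for completion. */
    private static void runOnPool(int threads, Runnable task) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(task);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("shared lock: " + countWithSharedLock(8, 1_000_000));
        System.out.println("LongAdder:   " + countWithLongAdder(8, 1_000_000));
    }
}
```

Profiling a longer-running version of the first method with -e lock would be expected to attribute most contention to the synchronized block; whether the `LongAdder` variant is an appropriate replacement depends on whether the real code needs anything beyond a counter.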
### Scenario 3: Allocation Rate Problems
A service experiencing GC pressure shows allocation hot spots in flame graphs. The alloc mode reveals which code creates the most objects:
```bash
./profiler.sh -d 60 -e alloc -o flamegraph <pid>
```
The flame graph shows which call paths allocate memory most frequently. Focus optimization efforts on those paths—reducing unnecessary object creation directly reduces GC pressure.
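As an illustration of what "reducing unnecessary object creation" can look like, the sketch below computes the same sum twice: once through a boxed `List<Integer>` and once through an `int[]`. The class `AllocationDemo` is hypothetical, and how much the boxed version actually allocates depends on the JVM (small values come from the `Integer` cache, and escape analysis may remove some allocations).

```java
import java.util.ArrayList;
import java.util.List;

final class AllocationDemo {
    /** Boxes each value: values outside the Integer cache need a new object per element. */
    static long sumBoxed(int n) {
        List<Integer> values = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            values.add(i); // autoboxing happens here
        }
        long sum = 0;
        for (Integer v : values) {
            sum += v;
        }
        return sum;
    }

    /** Same computation with a single primitive array allocation. */
    static long sumPrimitive(int n) {
        int[] values = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = i;
        }
        long sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumBoxed(1_000_000));
        System.out.println(sumPrimitive(1_000_000));
    }
}
```

In an allocation flame graph of a hot path like `sumBoxed`, the boxing and list-growth frames would be the ones to look for; the primitive variant should leave only the array allocation.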
## Trade-off Table
| Mode | Overhead | What It Shows | Limitations |
|---|---|---|---|
| CPU | 2-5% | Hot code paths | Requires sampling interval tuning |
| Alloc | 3-8% | Allocation hot spots | Only samples TLAB refill path |
| Lock | 1-3% | Lock contention | Only shows contended locks |
| Live Objects | 10-20% | Live object graph | Requires Full GC |
## Implementation Snippets
### Programmatically Starting Async-Profiler
```java
import java.lang.management.ManagementFactory;

public class AsyncProfilerControl {
    public static void main(String[] args) {
        // On HotSpot the runtime name is conventionally "<pid>@<hostname>"
        String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
        System.out.println("Process ID: " + pid);

        // The profiler runs as a native agent and is controlled from outside the
        // application, either with the profiler.sh script:
        //   ./profiler.sh -d 30 -f /tmp/profile.html <pid>
        // or by loading the agent library through jcmd:
        //   jcmd <pid> JVMTI.agent_load /path/to/libasyncProfiler.so start,event=cpu
        // For programmatic control from another JVM, use the Attach API.
        System.out.println("Attach with: ./profiler.sh -d 30 -f /tmp/profile.html " + pid);
    }
}
```
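The Attach API route mentioned in the comments can be sketched as follows. This is an illustration under assumptions, not the project's documented client: the class `ProfilerAttach` is hypothetical, the agent argument strings (`start,event=...,file=...` and `stop,file=...`) follow the comma-separated form the agent is understood to accept, and the library path must point at wherever `libasyncProfiler.so` lives on your host. Recent JDKs may print a warning about dynamically loaded agents.

```java
import com.sun.tools.attach.VirtualMachine;

final class ProfilerAttach {
    /** Builds the agent argument string, e.g. "start,event=cpu,file=/tmp/profile.html". */
    static String startArgs(String event, String outputFile) {
        return "start,event=" + event + ",file=" + outputFile;
    }

    /** Loads the native agent library into the target JVM with the given arguments. */
    static void sendCommand(String pid, String agentLibPath, String agentArgs) throws Exception {
        VirtualMachine vm = VirtualMachine.attach(pid);
        try {
            vm.loadAgentPath(agentLibPath, agentArgs);
        } finally {
            vm.detach();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ProfilerAttach <pid> <path-to-libasyncProfiler.so>");
            return;
        }
        String output = "/tmp/profile.html"; // assumed output location
        sendCommand(args[0], args[1], startArgs("cpu", output));
        Thread.sleep(30_000); // profile for 30 seconds
        sendCommand(args[0], args[1], "stop,file=" + output);
    }
}
```

Keeping the argument construction in its own method makes it easy to log or unit-test the exact command before it is sent to a production JVM.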
### Container Environment Profiling
```bash
#!/bin/bash
# Profile a Java process running in a Docker container
CONTAINER=your-container
# Look up the PID inside the container: host and container PID namespaces differ
# (requires pgrep in the image)
PID=$(docker exec "$CONTAINER" pgrep -f "your-application.jar" | head -1)
if [ -z "$PID" ]; then
  echo "Java process not found"
  exit 1
fi
# Run async-profiler inside the container (requires async-profiler installed in the image)
docker exec "$CONTAINER" \
  /opt/async-profiler/profiler.sh \
  -d 60 \
  -e cpu \
  -o flamegraph \
  -f /tmp/flamegraph.html \
  "$PID"
# Copy the result out of the container
docker cp "$CONTAINER":/tmp/flamegraph.html /tmp/flamegraph.html
echo "Flame graph written to /tmp/flamegraph.html"
```
## Observability Checklist
Before concluding profiling is complete, verify these points:
- The profiling duration was long enough to show stable patterns: 30 seconds minimum for frequently executed code.
- The sampling interval was appropriate: too frequent adds overhead, too sparse misses important events.
- The JVM was warmed up; cold code paths during profiling skew results.
- CPU and memory metrics correlate with the profiling findings, which confirms the findings are real.
Also check that the JVM did not undergo GC during CPU profiling that could skew results, and that the workload during profiling matched the production behavior you are investigating.
## Security and Compliance Notes
Async-profiler has significant capabilities that should be restricted. The ability to capture stack traces reveals application internals, including class names, method names, and code paths. This information aids attackers in understanding your application’s architecture.
On production systems, restrict profiler access to authorized operations personnel. Use Linux capabilities to limit which users can attach profilers. Consider profiling-only environments that do not run production traffic.
Live-object profiles (-e alloc --live) reveal which classes stay on the heap and where they are allocated, which may expose sensitive data structures. Handle these profiles with the same care as heap dumps.
## Common Pitfalls / Anti-Patterns
The most common mistake is profiling for too short a duration. A 5-second profile captures only what happened in those 5 seconds. If traffic is bursty, you may miss important code paths. Always profile for at least 30 seconds, preferably several minutes or across multiple load cycles.
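A rough back-of-the-envelope calculation shows why short profiles are thin. The sketch below assumes the default CPU sampling interval of 10 ms and treats sampling as perfectly regular; the class `SampleBudget` and the 1% figure are hypothetical.

```java
final class SampleBudget {
    /**
     * Expected number of samples landing in a code path:
     * (duration / interval) samples per busy thread, times the path's share of CPU time.
     */
    static double expectedSamples(double durationSeconds, double intervalMillis,
                                  double busyThreads, double cpuShare) {
        return durationSeconds * 1000.0 / intervalMillis * busyThreads * cpuShare;
    }

    public static void main(String[] args) {
        // A code path that uses 1% of one busy thread, sampled every 10 ms
        System.out.println("5 s profile:  " + expectedSamples(5, 10, 1, 0.01) + " samples");
        System.out.println("30 s profile: " + expectedSamples(30, 10, 1, 0.01) + " samples");
        System.out.println("5 min profile: " + expectedSamples(300, 10, 1, 0.01) + " samples");
    }
}
```

With only a handful of expected samples, a path can be missed entirely or look several times larger than it is, which is the statistical reason behind the 30-second minimum.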
Another issue is incorrect interpretation of wide frames. A wide frame in a CPU flame graph shows the most samples, but that does not always mean the biggest problem. Look for unexpected patterns, like library code consuming significant time, which might indicate inefficient usage rather than a problem in your code.
Forgetting to exclude idle or GC threads skews results. JVM threads such as the GC task threads and the VM Thread should typically be excluded from analysis unless you are investigating GC issues specifically.
Finally, make sure the workload during profiling matches what you are investigating. Profiling a development workload against production architecture gives misleading results. Ensure the system state during profiling represents what you are trying to diagnose.
## Quick Recap Checklist
- Install async-profiler on your production hosts before you need it
- Use -e cpu for CPU hot spots, -e alloc for allocation issues
- Profile for at least 30 seconds, preferably during actual production load
- Read flame graphs from bottom to top to trace call paths
- Use -e alloc --live to find memory leaks with allocation site information
- Exclude GC threads when investigating CPU issues
- Protect profiler access—it exposes application internals
## Interview Questions
Traditional JVMTI profilers work by pausing application threads and querying their state through the JVM tool interface. This requires safepoint operations, which cause all threads to stop at synchronization points. The overhead compounds because threads must coordinate even when they have no work to do.
Async-profiler works differently. It uses operating system signals—typically SIGPROF—delivered directly to Java threads. The signal handler captures the stack trace using JVM internal APIs for walking Java stacks and native frames. No safepoint coordination is needed. Application threads continue running between signal deliveries, and only briefly pause to capture their state. The result is accurate profiling with overhead typically under 5%.
CPU profiling with -e cpu captures stack traces at regular hardware timer intervals. This shows where the CPU is spending time executing code. It works for any executing code, including native methods and kernel time.
Allocation profiling with -e alloc hooks into the object allocation path, specifically the TLAB refill slow path. Not every object allocation is sampled—only allocations that trigger the slow path. This is a small fraction of total allocations but statistically representative. The advantage is dramatically lower overhead than full allocation tracking, typically 3-8% versus 50%+ for JVMTI allocation profiling.
Read the flame graph from the bottom up. The widest block at the bottom represents the entry point, usually a thread or main method. Stacks above it show parent-child relationships. A block's width is proportional to its sample count.
To trace a specific call path, start at the bottom and follow the widest path upward. Look for unexpected library code consuming significant time, which is often a sign of inefficient usage rather than library problems. Hot code or JIT compilation shows up as narrow stacks with some frames labeled with compilation-related names.
The horizontal axis is not time—it is alphabetical ordering of similar stacks. Two identical stacks appearing far apart does not mean they occurred at different times.
Lock contention shows up when threads spend time waiting to acquire locks rather than executing code. CPU profiling samples only threads that are running on a CPU, so blocked threads are largely absent from a CPU profile. In -e lock mode, async-profiler instruments lock acquisition specifically.
The flame graph for lock contention shows which code paths are contending for locks. Often this reveals a design issue where too many threads compete for the same lock, a lock being held too long, or unnecessary synchronization. The solution is often redesigning the locking strategy or using concurrent data structures instead.
Use -e alloc --live when you suspect a memory leak and need to find the leaking objects. Unlike a heap dump, which captures every object, live mode keeps only the sampled allocations whose objects have not been collected by the end of the session, and it records their allocation sites.
Compare two captures from different times. Objects that grow in count or size between captures indicate potential leaks. The allocation sites tell you exactly where the leaking objects are created, which is half the battle in fixing leaks.
Tracking whether sampled objects are still alive adds overhead on top of plain allocation profiling, so run it when the system is stable or during low-traffic periods. The output is keyed by allocation site, which you can search for suspicious growth patterns.
The framebuf agent option (-b in profiler.sh) comes from async-profiler 1.x, where it set the size of the buffer that holds stack trace samples. Larger buffers prevented overflow under high sample rates but consumed more memory. The 2.0 release made the frame buffer unlimited and removed the option.
On 1.x, tune it when profiling under high CPU load causes buffer overflow warnings: if you see "buffer overflow" in profiler output, increase the frame buffer size. On 2.x there is nothing to tune.
CPU profiling samples only running threads, so it gives at best an indirect view of contention: time lost to blocking simply does not appear. The -e lock mode specifically instruments contended lock acquisition, showing which locks are contested, how long acquisition waits, and which call paths lead to the contended lock.
The -e lock mode is more accurate for contention analysis because it instruments the exact point of contention rather than inferring it from thread state snapshots. Use it when lock contention is the suspected bottleneck.
Linux perf can profile Java processes but struggles to interpret Java stack traces—it sees native frames and JIT-compiled code but cannot map them back to Java method names without special configuration. Async-profiler has built-in knowledge of the JVM internal structures and can produce readable Java method names.
Async-profiler also provides Java-specific profiling modes like allocation profiling and provides allocation sites (where objects were created). Perf is useful for mixed native/Java profiles or when you need hardware PMU counters, but for pure Java profiling, async-profiler is more convenient.
The challenge is that async-profiler uses OS signals and needs access to /proc filesystem. In Docker, ensure the container runs with --privileged or at minimum --cap-add=SYS_PTRACE.
Bind-mount the async-profiler directory into the container and run it from inside the container rather than trying to profile across the container boundary. Alternatively, use docker exec to run the profiler inside the container namespace.
Yes. In CPU mode, async-profiler captures both Java methods and native code in the same flame graph. Native frames appear with their C/C++ function names. This is useful for diagnosing issues where Java code calls into native libraries.
Kernel time is also captured during CPU profiling—the flame graph shows kernel frames at the top for system calls. Use --exclude patterns to filter out noise from kernel frames if needed.
Allocation flame graphs show where objects are created, not where memory is consumed. Wide frames indicate high allocation rates at those call paths—the target for optimization. Focus on paths allocating large numbers of short-lived objects.
Reduce allocation rate by object pooling (for expensive objects), lazy initialization (defer creation until needed), or replacing collections with primitive alternatives. The goal is to reduce allocations in hot paths, not eliminate all allocations.
--alloc sets the allocation sampling interval in bytes: on average, one sample is taken per that many bytes allocated. A larger interval means fewer samples and lower overhead. For example, --alloc 512k takes roughly one sample per 512 KB allocated.
Raise the interval when allocation profiling itself adds too much overhead on a service with a very high allocation rate. For general allocation profiling, leave the interval at its default.
Async-profiler supports profiling virtual threads starting from version 2.21. In CPU profiling mode, it captures virtual thread stacks alongside platform threads. The flame graph shows virtual thread names versus platform thread names.
Note that virtual thread scheduling (virtual threads are scheduled on platform threads) means that CPU time attribution can be split between the virtual thread and the carrier thread. Version 2.21+ handles this with proper separation.
Generate flame graphs with the same filter and duration settings from different time periods. Save them as separate SVG files. Open both in a browser and compare the top frames visually, or use diff tools that compare SVG structures.
For programmatic comparison, use the text output format (-o text) and write a script to diff sample counts per method. This helps identify whether optimization efforts are reducing CPU time in targeted code paths.
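A comparison script along these lines might look like the sketch below. It assumes both profiles were saved in collapsed format (`-o collapsed`) rather than plain text, credits each stack's samples to its leaf frame, and reports the per-method change. The class `ProfileDiff` and the inline sample data are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ProfileDiff {
    /** Self samples per method: each stack's count is credited to its last (leaf) frame. */
    static Map<String, Long> selfSamples(List<String> collapsedLines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : collapsedLines) {
            int split = line.lastIndexOf(' ');
            if (split < 0) {
                continue; // skip lines without a trailing sample count
            }
            String stack = line.substring(0, split);
            long count = Long.parseLong(line.substring(split + 1).trim());
            String leaf = stack.substring(stack.lastIndexOf(';') + 1);
            totals.merge(leaf, count, Long::sum);
        }
        return totals;
    }

    /** Returns after minus before per method; a method missing on one side counts as zero. */
    static Map<String, Long> delta(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> diff = new TreeMap<>();
        before.forEach((method, count) -> diff.merge(method, -count, Long::sum));
        after.forEach((method, count) -> diff.merge(method, count, Long::sum));
        return diff;
    }

    public static void main(String[] args) {
        // Hypothetical profiles taken before and after an optimization
        List<String> before = List.of("main;Service.handle;Json.parse 900", "main;Service.handle;Db.query 400");
        List<String> after = List.of("main;Service.handle;Json.parse 300", "main;Service.handle;Db.query 410");
        delta(selfSamples(before), selfSamples(after))
            .forEach((method, change) -> System.out.printf("%-12s %+d%n", method, change));
    }
}
```

Raw sample counts are only comparable when both profiles use the same duration and sampling interval; otherwise normalize each count by the profile's total first.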
"[21] idle" indicates CPU time spent in the idle threads when the processor would otherwise be idle. This is normal and often appears when the system has more CPU cores than the application uses. "process reaping" appears during process exit and is not actionable.
Use --exclude=idle to filter out idle time. Use --exclude='process reaping' to exclude process exit frames. These are housekeeping frames that do not represent actual application behavior.
In CPU profiling mode, async-profiler captures both kernel-mode and user-mode time separately. Flame graphs show kernel frames at the top of the stack for system calls. This distinction matters because high kernel time indicates excessive system calls or I/O operations, while high user time indicates CPU-bound Java code.
To analyze only application code, use --all-user, which restricts CPU profiling to user-mode events.
The --reverse flag reverses the flame graph so the widest frames appear at the top instead of the bottom. This makes it easy to identify which methods are the ultimate consumers of CPU, rather than which are parents in call chains. Some engineers find this format more intuitive for identifying optimization targets.
Use the -t option to profile threads separately. Each thread then gets its own subtree in the flame graph, labeled with the thread name, so you can search for a name prefix such as "pool-" to isolate a thread pool.
This is useful when you suspect specific thread pools are causing issues and want to isolate the analysis.
The -o option specifies the output format: flat, traces, collapsed (stackcollapse format for use with flame graph tools), flamegraph, tree, or jfr. The -f option sets the output file name, and the format is inferred from the file extension when -o is omitted. The collapsed format is useful for piping to other tools or generating custom visualizations.
During application warmup, JIT compilation consumes significant CPU. In CPU flame graphs, look for compiler frames such as CompileBroker::compiler_thread_loop or C2Compiler::compile_method, which run on the C1 and C2 compiler threads. These indicate compilation overhead. Use -I with a pattern such as '*Compile*' to keep only the stacks that contain those frames.
If compilation bottlenecks are severe, consider using tiered compilation with higher thresholds, or pre-warming the application with representative load before production traffic.
## Further Reading
- async-profiler GitHub Repository: Official project page with releases and documentation
- async-profiler Wiki: Comprehensive wiki with usage examples and troubleshooting
- FlameGraph Project: Brendan Gregg’s flame graph tools that async-profiler integrates with
- [Profiling Java in Production](https://brincomp.github.io/posts/redefining-profiling/): Advanced techniques for production JVM profiling
- Java Performance Mastery: Book covering async-profiler and other JVM profiling tools
## Conclusion
Async-profiler has become the standard for low-overhead production profiling because it uses OS-level signals rather than JVMTI instrumentation. For CPU profiling, use -e cpu with flame graph output. For allocation issues, use -e alloc. For lock contention, use -e lock. Adding --live to allocation profiling supports memory leak detection with allocation sites.
The key to effective async-profiler usage is profiling during representative production load for at least 30 seconds. Read flame graphs from bottom to top to trace call paths, and exclude GC threads when investigating CPU issues. Async-profiler works without code changes or restarts, making it invaluable for investigating issues in live systems.