DMA (Direct Memory Access)
Learn how DMA bypasses the CPU for device-memory transfers, DMA buffers, scatter-gather I/O, and IOMMU security.
Introduction
Every byte that travels between RAM and a disk, network card, or sound card must cross the memory bus. In a naive model (programmed I/O), the CPU is the courier: it reads from memory, writes to device registers, reads from the device, and writes back to memory. For a bulk transfer such as writing a 4GB file to disk, this would occupy the CPU for the entire duration of the transfer (seconds, not milliseconds), leaving it unavailable for other work.
Direct Memory Access (DMA) is the solution: a DMA controller (either a discrete chip or integrated into the chipset/SoC) takes control of the memory bus and transfers data directly between devices and RAM. The CPU configures the transfer, kicks off the DMA engine, then continues with useful work while the DMA controller handles the heavy lifting.
Understanding DMA is essential for driver developers, system performance engineers, and anyone debugging storage or networking performance issues. DMA misconfigurations cause some of the most frustrating and subtle system failures.
When to Use / When Not to Use
When DMA Is Essential
- Storage devices: HDDs, SSDs, RAID controllers—all use DMA for efficient data transfer
- Network cards: High-speed networking (10GbE+) requires DMA to handle packet buffers without CPU overhead
- Sound cards: Audio buffers must transfer continuously without causing CPU spikes
- Video capture devices: Frame buffers are too large for CPU-mediated copies
- Any high-bandwidth peripheral: If a device transfers more than ~1MB/s, DMA is virtually mandatory
When DMA Is Not Worth It
- Slow devices with small transfers: Keyboard input (bytes per second) costs more DMA setup than it saves
- Memory-constrained systems: DMA buffers consume non-swappable kernel memory
- Systems without IOMMU: Without an IOMMU, DMA access to wrong addresses causes memory corruption
- Simple microcontrollers: Some MCUs lack DMA controllers entirely; CPU polling is the only option
- Debugging/tracing scenarios: DMA transfers are opaque to standard debuggers; CPU copies are easier to trace
Bus Types and DMA Compatibility
| Bus Type | DMA Support | Notes |
|---|---|---|
| PCIe | Native DMA, MSI-X | Every PCIe device can act as a bus master; BARs expose device registers (MMIO), while DMA targets host memory addresses |
| PCI | Native DMA, IRQ sharing | Legacy PCI has address space limitations |
| USB | Host-controller DMA | USB devices don’t DMA directly (host controller proxies) |
| NVMe | PCIe DMA with queues | Supports up to 64K I/O queues, each with up to 64K command entries |
| SATA | AHCI DMA | Single queue, 32 command slots |
| FireWire | Bus mastering DMA | Devices can read/write host memory |
Architecture or Flow Diagram
The following diagram shows the DMA transfer process with and without an IOMMU:
flowchart TB
subgraph "Without IOMMU"
CPU1["CPU"] --> MEM1["Memory"]
CPU1 -->|"1. Program DMA (physical address)"| DEV1["Device\n(DMA engine)"]
DEV1 -->|"2. Direct memory read/write"| MEM1
end
subgraph "With IOMMU (VT-d / AMD-Vi)"
CPU2["CPU"] --> MEM2["Memory"]
CPU2 -->|"1. Program DMA with IOVA"| DEV2["Device\n(DMA engine)"]
DEV2 -->|"2. DMA request using IOVA"| IOMMU["IOMMU"]
IOMMU -->|"3. Translate IOVA to physical, then access"| MEM2
end
style IOMMU stroke:#ffffff
style DEV1 stroke:#00fff9
style DEV2 stroke:#00fff9
The IOMMU adds a translation layer: instead of using physical addresses (which would give the device access to all of memory and prevent isolation), the device uses I/O Virtual Addresses (IOVA). The IOMMU translates these to physical addresses, enforces access boundaries, and lets the kernel present physically scattered pages to the device as one contiguous IOVA range.
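The translation step can be illustrated with a small Python model. This is only a sketch under simplifying assumptions: a single-level page table held in a dictionary, 4 KiB pages, and hypothetical names (`ToyIOMMU`, `IOMMUFault`) that do not correspond to any kernel API.

```python
PAGE_SIZE = 4096


class IOMMUFault(Exception):
    """Raised when a device touches an unmapped IOVA or violates permissions."""


class ToyIOMMU:
    """Toy single-level IOMMU: IOVA page number -> (physical page number, writable)."""

    def __init__(self):
        self.page_table = {}

    def map(self, iova, phys, length, writable=True):
        """Map [iova, iova+length) onto [phys, phys+length); both page-aligned."""
        if iova % PAGE_SIZE or phys % PAGE_SIZE:
            raise ValueError("iova and phys must be page-aligned")
        pages = (length + PAGE_SIZE - 1) // PAGE_SIZE
        for i in range(pages):
            self.page_table[iova // PAGE_SIZE + i] = (phys // PAGE_SIZE + i, writable)

    def unmap(self, iova, length):
        """Remove the mapping; later device accesses to this range fault."""
        pages = (length + PAGE_SIZE - 1) // PAGE_SIZE
        for i in range(pages):
            self.page_table.pop(iova // PAGE_SIZE + i, None)

    def translate(self, iova, write=False):
        """Translate a device-issued IOVA to a physical address, or fault."""
        entry = self.page_table.get(iova // PAGE_SIZE)
        if entry is None:
            raise IOMMUFault(f"unmapped IOVA {iova:#x}")
        phys_page, writable = entry
        if write and not writable:
            raise IOMMUFault(f"write to read-only IOVA {iova:#x}")
        return phys_page * PAGE_SIZE + iova % PAGE_SIZE


iommu = ToyIOMMU()
iommu.map(iova=0x10000, phys=0x7F3000, length=2 * PAGE_SIZE)
print(hex(iommu.translate(0x10010)))  # 0x7f3010
```

Any access outside the two mapped pages raises `IOMMUFault` in this model, which mirrors how a real IOMMU blocks and logs stray DMA instead of letting it reach memory.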
Core Concepts
DMA Transfer Types
Bus mastering (first-party DMA): The device itself contains a DMA engine and controls the bus directly. Modern PCIe devices (NVMe, 10GbE NICs) all use bus mastering DMA.
Third-party DMA: A discrete DMA controller (like the legacy Intel 8237 on ISA systems) manages transfers between devices and memory, with the CPU acting as coordinator. This is rare in modern systems.
Transfer modes:
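// DMA transfer directions (enum dma_transfer_direction in <linux/dmaengine.h>)
enum dma_transfer_direction {
    DMA_MEM_TO_MEM, // Memory to memory (copy)
    DMA_MEM_TO_DEV, // Read from memory, write to device
    DMA_DEV_TO_MEM, // Read from device, write to memory
    DMA_DEV_TO_DEV, // Device to device
    DMA_TRANS_NONE, // No direction set
};
// XOR and memset are operation types (DMA_XOR, DMA_MEMSET in
// enum dma_transaction_type), not transfer directions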
DMA Mapping and Coherency
The fundamental problem: device and CPU may have different views of memory. CPU caches may hold stale data; the device may see different physical memory than the CPU.
Streaming DMA (most common): The driver owns the buffer, maps it for DMA only for the duration of the transfer, then unmaps it.
// Typical streaming DMA sequence
struct scatterlist sg;
struct page *page = my_buffer_page;
unsigned int offset = my_buffer_offset;
size_t len = transfer_size;
// Build scatterlist
sg_init_table(&sg, 1);
sg_set_page(&sg, page, len, offset);
// Map for DMA - device can now access this memory
// dma_map_sg() returns the number of mapped entries, 0 on failure
if (dma_map_sg(dev, &sg, 1, DMA_FROM_DEVICE) == 0)
    return -EIO;
// Program the device with sg_dma_address(&sg) and sg_dma_len(&sg)
// ... device does its DMA transfer ...
// Unmap after the transfer completes; this hands the buffer back to the CPU
// and performs any cache maintenance the platform needs
dma_unmap_sg(dev, &sg, 1, DMA_FROM_DEVICE);
// Now CPU can safely access the buffer
process_received_data(page_address(page) + offset);
Consistent DMA (also called coherent): Both device and CPU see the same memory with automatic coherency. On platforms without hardware cache coherency, this is implemented with uncached or write-combining memory; on coherent platforms such as x86, it is ordinary cached memory.
// Allocate coherent (cache-coherent) DMA buffer
dma_addr_t dma_handle;
void *coherent_buf = dma_alloc_coherent(
    dev, // Device requesting allocation
    size, // Size in bytes
    &dma_handle, // Returned DMA (bus) address for the device
    GFP_KERNEL // Allocation flags
);
if (!coherent_buf)
    return -ENOMEM;
// The CPU uses coherent_buf; the device is programmed with dma_handle
// (write_reg() and DATA_PTR_REG are placeholders for device-specific register access)
write_reg(dev->regs + DATA_PTR_REG, dma_handle);
// When done, free the buffer
dma_free_coherent(dev, size, coherent_buf, dma_handle);
Scatter-Gather I/O
Scatter-gather allows a single DMA transfer to span multiple non-contiguous memory regions. Instead of copying scattered data into one buffer, the DMA engine reads from multiple buffer descriptors in one operation.
// Scatter-gather DMA with a network card
// (the tx_desc layout, DMA_DESC_SGL, and TX_PUSH_REG are device-specific placeholders)
struct sk_buff *skb = alloc_skb(MTU, GFP_KERNEL);
struct scatterlist sg[MAX_SKB_FRAGS + 1]; // Headers + fragments
struct scatterlist *s;
int nents, mapped, i;
// skb contains headers in linear part, payload in fragments
// Build scatterlist from all parts
sg_init_table(sg, ARRAY_SIZE(sg));
nents = skb_to_sgvec(skb, sg, 0, skb->len);
if (nents < 0)
    return nents;
// Map for DMA; an IOMMU may merge entries, so use the returned count
mapped = dma_map_sg(dev, sg, nents, DMA_TO_DEVICE);
if (!mapped)
    return -EIO;
// Fill one descriptor per mapped segment, using DMA addresses (not sg_phys())
tx_desc = &ring[ring_head];
for_each_sg(sg, s, mapped, i) {
    tx_desc[i].addr = sg_dma_address(s);
    tx_desc[i].len = sg_dma_len(s);
}
if (mapped > 1)
    tx_desc[0].flags = DMA_DESC_SGL; // Set SGL bit for chained descriptors
// One doorbell write triggers transfer of all fragments
writel(ring_head, dev->regs + TX_PUSH_REG);
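To make the descriptor-list idea concrete outside the kernel, here is a minimal Python sketch of how a buffer that is contiguous in virtual memory turns into a list of (physical address, length) segments. The function name `build_sg_list` and the `page_map` dictionary are assumptions made for illustration; they loosely mimic what a scatterlist records, not how the kernel builds one.

```python
PAGE_SIZE = 4096


def build_sg_list(virt_addr, length, page_map):
    """Split [virt_addr, virt_addr+length) into (phys_addr, length) segments.

    page_map is a hypothetical lookup: virtual page number -> physical page number.
    Physically adjacent pieces are merged into one segment.
    """
    segments = []
    while length > 0:
        offset = virt_addr % PAGE_SIZE
        chunk = min(length, PAGE_SIZE - offset)  # never cross a page boundary
        phys = page_map[virt_addr // PAGE_SIZE] * PAGE_SIZE + offset
        if segments and segments[-1][0] + segments[-1][1] == phys:
            # Physically contiguous with the previous segment: extend it
            prev_addr, prev_len = segments[-1]
            segments[-1] = (prev_addr, prev_len + chunk)
        else:
            segments.append((phys, chunk))
        virt_addr += chunk
        length -= chunk
    return segments


# Three virtual pages; the first two happen to be physically adjacent
page_map = {0x10: 0x80, 0x11: 0x81, 0x12: 0x40}
sg = build_sg_list(virt_addr=0x10F00, length=6000, page_map=page_map)
for addr, seg_len in sg:
    print(f"segment: addr={addr:#x} len={seg_len}")
```

With these inputs the 6000-byte buffer becomes two segments: one covering the tail of the first page plus all of the second, and one for the physically distant third page. A device with scatter-gather support would consume one descriptor per segment.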
Bus Address vs Physical Address
Critical concept: a device cannot use CPU physical addresses directly in most systems. The CPU reaches memory through the MMU, while the device reaches it through the IOMMU (or a fixed bus offset), so the same buffer can have a different address in each view.
// CPU virtual address
void *kernel_buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
uintptr_t cpu_va = (uintptr_t)kernel_buf;
// CPU physical address (what the CPU accesses after MMU translation)
phys_addr_t cpu_pa = virt_to_phys(kernel_buf);
// Bus address (what device uses to access memory)
// On systems without IOMMU: bus_addr usually equals cpu_pa (or cpu_pa plus a fixed bus offset)
// On systems with IOMMU: bus_addr is the IOVA
dma_addr_t bus_addr = dma_map_single(dev, kernel_buf, PAGE_SIZE, DMA_TO_DEVICE);
if (dma_mapping_error(dev, bus_addr))
    return -ENOMEM;
// Use bus_addr (never cpu_pa) to program DMA hardware
// (write_reg() and DMA_ADDR_REG are placeholders for device-specific register access)
write_reg(dev->regs + DMA_ADDR_REG, bus_addr);
// After DMA completes, unmap so IOVA space is released
dma_unmap_single(dev, bus_addr, PAGE_SIZE, DMA_TO_DEVICE);
Production Failure Scenarios
Scenario 1: DMA to Wrong Address Causing Memory Corruption
What happened: A driver bug programmed a DMA descriptor with a stale pointer after the buffer had been freed by another thread. The device wrote incoming network packets to arbitrary physical memory—overwriting kernel data structures, corrupting process memory, and causing sporadic crashes that appeared random.
Detection: Kernel panic with corrupted slab magic numbers, or silent data corruption manifesting as strange process behavior. The DMA API debug facility (CONFIG_DMA_API_DEBUG kernel config) and an enabled IOMMU, which turns stray DMA into logged faults, can catch invalid DMA.
Mitigation:
// Use the DMA API correctly: never cache pointers across synchronization points
// (struct my_rx_buf, the RX_* registers, and write_reg() are illustrative placeholders)
struct my_rx_buf {
    struct sk_buff *skb;
    dma_addr_t mapping;
    unsigned int len; // Length that was mapped; needed again for unmap
};
static int net_rx_refill(struct device *dma_dev, struct napi_struct *napi,
                         struct my_rx_buf *rxb, unsigned int len)
{
    rxb->skb = napi_alloc_skb(napi, len);
    if (!rxb->skb)
        return -ENOMEM;
    // Map for DMA BEFORE giving the address to hardware
    rxb->mapping = dma_map_single(dma_dev, rxb->skb->data, len, DMA_FROM_DEVICE);
    // Check mapping succeeded
    if (dma_mapping_error(dma_dev, rxb->mapping)) {
        dev_kfree_skb(rxb->skb);
        rxb->skb = NULL;
        return -ENOMEM;
    }
    rxb->len = len;
    // Program DMA only after the mapping is valid
    write_reg(RX_DESC_PTR, rxb->mapping);
    write_reg(RX_LEN_REG, len);
    return 0;
}
// Cleanup path: make sure the device no longer uses the buffer, then unmap before freeing
static void free_rx_buf(struct device *dma_dev, struct my_rx_buf *rxb)
{
    if (!rxb->skb)
        return;
    // Unmap with the same length and direction that were used for mapping
    dma_unmap_single(dma_dev, rxb->mapping, rxb->len, DMA_FROM_DEVICE);
    dev_kfree_skb(rxb->skb);
    rxb->skb = NULL;
}
Scenario 2: Cache Incoherency Leading to Stale Data
What happened: A driver processed DMA data immediately after the DMA transfer completed, but the CPU cache still held a stale copy of the memory. The device had written new data to RAM, but the CPU was reading the old cached data, causing protocol parsing failures and corrupted file system metadata.
Detection: Data corruption patterns that disappeared when DMA was artificially slowed (by adding udelay after DMA completion).
Mitigation:
// Sync ownership of a streaming buffer between CPU and device
// The DMA API handles cache coherency, but you must use it correctly
size_t len = transfer_len;
if (direction == DMA_FROM_DEVICE) {
    // Device wrote to memory: invalidate CPU cache so CPU reads new data
    dma_sync_single_for_cpu(dev, bus_addr, len, DMA_FROM_DEVICE);
    // Now safe for CPU to access
    process_data(buf, len);
} else {
    // CPU wrote to memory: flush cache so device sees new data
    // Call this BEFORE starting the DMA; the CPU must not touch buf until it completes
    dma_sync_single_for_device(dev, bus_addr, len, DMA_TO_DEVICE);
}
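The stale-read failure in this scenario can be reproduced with a toy model. The sketch below assumes a write-back cache that tracks individual bytes rather than cache lines, and a device that reads and writes RAM directly; `ToyCachedMemory` and its method names are hypothetical and only echo the roles of `dma_sync_single_for_cpu()` and `dma_sync_single_for_device()`.

```python
class ToyCachedMemory:
    """RAM plus a per-byte write-back CPU cache; the device bypasses the cache."""

    def __init__(self, size):
        self.ram = bytearray(size)
        self.cache = {}  # addr -> [value, dirty]

    def cpu_read(self, addr):
        if addr not in self.cache:  # miss: fill from RAM
            self.cache[addr] = [self.ram[addr], False]
        return self.cache[addr][0]

    def cpu_write(self, addr, value):
        self.cache[addr] = [value, True]  # stays in cache until written back

    def device_write(self, addr, value):
        self.ram[addr] = value  # DMA goes straight to RAM

    def device_read(self, addr):
        return self.ram[addr]

    def sync_for_cpu(self, addr, length):
        """Invalidate: drop cached copies so the next CPU read refills from RAM."""
        for a in range(addr, addr + length):
            self.cache.pop(a, None)

    def sync_for_device(self, addr, length):
        """Write back: push dirty cached bytes to RAM so the device sees them."""
        for a in range(addr, addr + length):
            entry = self.cache.get(a)
            if entry and entry[1]:
                self.ram[a] = entry[0]
                entry[1] = False


mem = ToyCachedMemory(16)

# Device-to-memory: CPU has an old copy cached, then the device writes new data
mem.cpu_read(0)
mem.device_write(0, 0xAB)
stale = mem.cpu_read(0)          # served from cache: still the old value
mem.sync_for_cpu(0, 1)
fresh = mem.cpu_read(0)          # refilled from RAM

# Memory-to-device: CPU write sits in cache until it is written back
mem.cpu_write(1, 0x55)
before = mem.device_read(1)
mem.sync_for_device(1, 1)
after = mem.device_read(1)

print(f"stale={stale:#x} fresh={fresh:#x} before={before:#x} after={after:#x}")
```

Real caches operate on whole lines and the kernel chooses the maintenance operation from the mapping direction, so this shows only why the sync calls exist, not how they are implemented.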
Scenario 3: IOMMU Fault from Unaligned DMA Addresses
What happened: A driver passed a buffer address that was only 4-byte aligned to a PCIe device that required 64-byte alignment. The transfer was rejected and logged as a fault, but the driver did not check for errors, so the data was silently lost.
Detection: dmesg showed “DMA-API: device driver mapping error” or “AMD-Vi: Event logged [invalid device request]” messages.
Mitigation:
// Always align DMA buffers to the device's required alignment
// Common requirements: 4, 8, 16, 64, 256 bytes
#define DMA_ALIGNMENT 64
void *alloc_aligned_dma_buffer(struct device *dev, size_t size, dma_addr_t *addr)
{
    void *buf;
    // dma_alloc_coherent() returns memory aligned to the smallest page order
    // that fits size, so a 64-byte requirement is already met. Do not realign
    // manually: returning an offset pointer would break dma_free_coherent()
    buf = dma_alloc_coherent(dev, size, addr, GFP_KERNEL);
    if (!buf)
        return NULL;
    WARN_ON(*addr & (DMA_ALIGNMENT - 1)); // Sanity check
    return buf;
}
// For many small buffers, a DMA pool enforces alignment on every allocation
struct dma_pool *pool = dma_pool_create("rx_desc", dev, desc_size, DMA_ALIGNMENT, 0);
// Or use streaming DMA with proper alignment
struct page *page = alloc_pages(GFP_KERNEL, order);
dma_addr_t addr = dma_map_page(dev, page, 0, PAGE_SIZE << order,
                               DMA_BIDIRECTIONAL);
if (dma_mapping_error(dev, addr))
    return -ENOMEM;
WARN_ON(addr & (DMA_ALIGNMENT - 1)); // Verify alignment
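The mask arithmetic behind these alignment checks is easy to get wrong, so here it is as a short Python sketch. The helper names `is_aligned` and `align_up` are illustrative, and the technique assumes the alignment is a power of two.

```python
def is_aligned(addr, alignment):
    """True if addr is a multiple of alignment (alignment must be a power of two)."""
    return addr & (alignment - 1) == 0


def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return (addr + alignment - 1) & ~(alignment - 1)


addr = 0x1004                      # 4-byte aligned, but not 64-byte aligned
aligned = align_up(addr, 64)
padding = aligned - addr           # bytes wasted in front of the aligned buffer

print(is_aligned(addr, 4), is_aligned(addr, 64))  # True False
print(hex(aligned), padding)                      # 0x1040 60
```

An address that is already aligned is returned unchanged, which is why the expression adds `alignment - 1` rather than `alignment` before masking.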
Trade-off Table
| DMA Approach | Cache Coherency | Allocation Overhead | Flexible Layout | Use Case |
|---|---|---|---|---|
| dma_alloc_coherent | Automatic (hardware) | High (non-cached memory) | Low (contiguous only) | Frequent small transfers |
| dma_map_single | Manual sync needed | Low | High (any buffer) | One-time large transfers |
| dma_map_sg + SG | Manual sync | Low | Highest (fragmented) | Network packets |
| dma_pool | Automatic | Medium | Medium | Small frequent buffers |
| swiotlb (bounce buffer) | Handled by the copy | Very High (extra copy per transfer) | N/A | Devices that cannot address the buffer and have no IOMMU |
Implementation Snippets
DMA Engine Framework (Linux Kernel)
#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>
struct dma_chan *dma_chan;
struct dma_async_tx_descriptor *tx;
dma_cookie_t cookie;
enum dma_ctrl_flags flags = DMA_PREP_INTERRUPT | DMA_CTRL_ACK;
// Get available DMA channel (returns an ERR_PTR on failure, never NULL)
dma_chan = dma_request_chan(dev, "rx_memcpy");
if (IS_ERR(dma_chan))
    return PTR_ERR(dma_chan);
// Configure transfer
tx = dmaengine_prep_dma_memcpy(
    dma_chan, // Channel
    dest_dma, // Destination DMA (bus) address
    src_dma, // Source DMA (bus) address
    len, // Transfer length
    flags // Flags (e.g., interrupt on completion)
);
if (!tx) {
    dma_release_channel(dma_chan);
    return -EIO;
}
// Set callback for completion notification
// (my_dev is the driver's private struct my_device, a placeholder type)
tx->callback = dma_done_callback;
tx->callback_param = my_dev;
// Submit to DMA engine
cookie = dmaengine_submit(tx);
if (dma_submit_error(cookie)) {
    dma_release_channel(dma_chan);
    return cookie;
}
// Start the transfer
dma_async_issue_pending(dma_chan);
// In callback:
static void dma_done_callback(void *param)
{
    struct my_device *my_dev = param;
    // Signal waiting code that data is ready
    complete(&my_dev->dma_done);
}
Userspace DMA (with UIO)
#!/usr/bin/env python3
"""
Userspace access to DMA buffers via UIO framework.
UIO exposes DMA regions to userspace for simple devices.
"""
import mmap
import os
import select
import struct


class DMARegion:
    """Represents a memory-mapped DMA accessible region."""

    def __init__(self, uio_device: str = "/dev/uio0"):
        self.uio_device = uio_device
        self.fd = os.open(uio_device, os.O_RDWR)
        self.maps = []

    def get_mapping(self, map_num: int = 0) -> tuple:
        """Get (address, size) of UIO mapping from sysfs."""
        base = f"/sys/class/uio/{self._uio_name()}/maps/map{map_num}"
        with open(f"{base}/addr") as f:
            addr = int(f.read().strip(), 16)
        with open(f"{base}/size") as f:
            size = int(f.read().strip(), 16)
        return addr, size

    def mmap_mapping(self, map_num: int = 0) -> memoryview:
        """Memory-map the DMA region for userspace access."""
        _, size = self.get_mapping(map_num)
        # UIO selects mapping N with offset = N * page size,
        # not with the physical address reported in sysfs
        self.maps.append(mmap.mmap(
            self.fd,
            size,
            access=mmap.ACCESS_WRITE,
            offset=map_num * mmap.PAGESIZE,
        ))
        return memoryview(self.maps[-1])

    def enable_interrupt(self):
        """Re-enable the device interrupt (needs irqcontrol support in the UIO driver)."""
        # UIO expects a 32-bit integer: 1 enables, 0 disables
        os.write(self.fd, struct.pack("I", 1))

    def wait_for_interrupt(self, timeout_ms: int = 1000) -> bool:
        """Wait for DMA completion interrupt."""
        r, _, _ = select.select([self.fd], [], [], timeout_ms / 1000)
        if r:
            os.read(self.fd, 4)  # Consume the 32-bit interrupt counter
        return bool(r)

    def _uio_name(self) -> str:
        """Get UIO device name from the device path."""
        return os.path.basename(self.uio_device)

    def close(self):
        for m in self.maps:
            m.close()
        os.close(self.fd)


if __name__ == "__main__":
    # Example: accessing a DMA-accessible buffer
    import numpy as np
    # Assume UIO device with DMA-accessible memory region
    region = DMARegion("/dev/uio0")
    addr, size = region.get_mapping(0)
    print(f"DMA region: address={hex(addr)}, size={size}")
    # Map and view as a numpy array without copying
    view = region.mmap_mapping(0)
    buf = np.frombuffer(view, dtype=np.uint32)
    print(f"First 8 words: {buf[:8]}")
    del buf, view  # Release buffer exports before closing the mmap
    region.close()
DMA Debugging Tools
#!/bin/bash
# DMA debugging on Linux
echo "=== DMA engine status ==="
ls -la /sys/class/dma/
echo ""
echo "=== DMA channel usage ==="
cat /proc/dma
echo ""
echo "=== IOMMU status ==="
if [ -d /sys/kernel/iommu_groups ]; then
for group in /sys/kernel/iommu_groups/*; do
echo "IOMMU Group $(basename $group):"
ls -la "$group/devices/" 2>/dev/null || true
done
else
echo "No IOMMU groups found (or IOMMU disabled)"
fi
echo ""
echo "=== DMA debug info (requires CONFIG_DMA_API_DEBUG) ==="
cat /sys/kernel/debug/dma-api/error_count 2>/dev/null || echo "DMA debugfs not available"
echo ""
echo "=== Check for DMA mapping errors in kernel log ==="
dmesg | grep -i "dma.*error" | tail -20
dmesg | grep -i "iommu.*fault" | tail -20
Observability Checklist
DMA Metrics in Linux
# View DMA channel assignments
cat /proc/dma
# Check IOMMU groups and which devices belong to each
ls -la /sys/kernel/iommu_groups/
# Read IOMMU event logs (AMD)
cat /sys/kernel/debug/iommu/amd/events 2>/dev/null
# Intel IOMMU debugfs entries (list the directory; cat does not work on a directory)
ls /sys/kernel/debug/iommu/intel/ 2>/dev/null
# DMA API debugging (requires CONFIG_DMA_API_DEBUG)
# Mapping errors are reported in dmesg; the running total is in debugfs
cat /sys/kernel/debug/dma-api/error_count 2>/dev/null
Key Metrics to Monitor
| Metric | Source | Alert Threshold | Indicates |
|---|---|---|---|
| IOMMU faults | dmesg | Any fault is concerning | DMA to unmapped address |
| DMA API errors | dmesg | Any error | Programming mistake |
| PCI errors | lspci -vvv (AER status) | Any correctable errors > 0 | Potential data corruption |
| DMA transfer time | perf counter | Throughput well below expected bandwidth | Device or bus issue |
| Swiotlb usage | dmesg (boot), /sys/kernel/debug/swiotlb/io_tlb_used | > 0 if not expected | Bounce buffering in use (no IOMMU or limited DMA mask) |
Common Pitfalls / Anti-Patterns
DMA Security Vulnerabilities
- DMA Attacks: Malicious devices can read/write arbitrary system memory via DMA. The notorious Thunderbolt DMA attack allowed attackers with physical access to read memory in milliseconds. Mitigation: Enable `iommu=force` in kernel, use DMA request filtering, and disable Thunderbolt in BIOS.
- DMA to Released Memory: When user-space processes are terminated, their DMA-accessible memory may be reused before the device finishes reading it. Use `dma_unmap_*` and ensure proper synchronization with device before memory release.
- DMA Ring Buffer Overflows: A device may write beyond allocated buffers if ring head/tail pointers get out of sync. Use IOMMU to enforce boundary checks and monitor DMA errors.
PCI Express Security Features
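# Enable IOMMU (kernel boot parameters)
# Intel: intel_iommu=on
# AMD: enabled by default when firmware exposes it (amd_iommu=off disables it)
# Note: iommu=pt puts host-owned devices in passthrough (identity) mode,
# trading DMA isolation for performance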
# Verify IOMMU is active
dmesg | grep -e "DMAR:" -e "AMD-Vi:"
# Check that PCIe ATS is enabled
lspci -vvv -s 00:02.0 | grep -i "ATS\|Process"
Common Pitfalls / Anti-patterns
| Anti-pattern | Why It’s Bad | Correct Approach |
|---|---|---|
| Using physical addresses directly | Breaks with IOMMU; exposes kernel layout | Always use dma_map_* APIs |
| Forgetting to unmap DMA buffers | IOVA leak; on some systems, bounce buffers accumulate | Pair every dma_map_* with dma_unmap_*; for coherent buffers, use managed dmam_alloc_coherent |
| Not syncing before CPU access | CPU reads stale cached data | Call dma_sync_* appropriately |
| Assuming contiguous physical memory | Large allocations may not be physically contiguous | Use scatter-gather, not contiguous assumption |
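| DMA to stack-allocated buffers | With CONFIG_VMAP_STACK the stack lives in vmalloc space, which dma_map_single cannot map; the memory is also reused once the function returns | Use dma_alloc_coherent or kmalloc |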
| Ignoring DMA alignment requirements | IOMMU or device may reject misaligned addresses | Check device datasheet, align allocations |
| Not checking dma_mapping_error() | Unchecked mapping failures cause silent corruption | Always validate return values |
Quick Recap Checklist
- DMA allows devices to transfer data without CPU intervention, critical for high-bandwidth I/O
- Bus mastering DMA: device controls the bus; third-party DMA: discrete controller
- DMA addresses (IOVA/bus addresses) differ from CPU virtual/physical addresses
- Streaming DMA maps buffers per-transfer; coherent DMA uses persistent mappings that stay coherent without explicit sync calls
- Scatter-gather enables DMA across non-contiguous buffers using descriptor lists
- IOMMU provides security and virtualization benefits by translating device addresses
- Always sync DMA buffers before CPU access and after CPU modifications
- Use `dma_map_*`/`dma_unmap_*` in matched pairs, or managed `dmam_*` variants for coherent allocations, to avoid leaks
- On systems without IOMMU, swiotlb provides bounce buffering but with performance cost
- DMA debugging requires checking `dmesg`, `/proc/dma`, and IOMMU status
Interview Questions
Streaming DMA (also called streaming mapping) is for temporary DMA mappings—driver maps a buffer, device uses it, then driver unmaps. The buffer can be any kernel memory; coherency is manual via dma_sync_*. Coherent DMA uses persistent mappings where both device and CPU see the same data with automatic hardware coherency (typically uncached or write-combining memory). Use streaming for bulk transfers like network packets. Use coherent for small, frequent, cache-coherent buffers like ring descriptors. Streaming has lower overhead per transfer; coherent has simpler code but higher allocation overhead.
An IOMMU (Input-Output Memory Management Unit) maps device-visible addresses to physical memory addresses, analogous to how the CPU MMU maps virtual addresses. For security, it prevents DMA attacks: without an IOMMU, a malicious or buggy device can read or write any physical memory address, potentially leaking encryption keys or other sensitive data, or corrupting kernel structures. The IOMMU ensures that a device can only access memory regions explicitly mapped for it, and it blocks and logs accesses outside those regions. It also enables safe device passthrough to virtual machines and lets devices with limited addressing reach memory they could not otherwise address.
Why must you call `dma_sync_*` before and after CPU access to streaming DMA buffers?
Because CPU caches are not automatically coherent with DMA on every platform. When a device writes to memory via DMA, the data goes to RAM, but the CPU data cache may still hold an older copy. If the CPU reads without syncing, it gets the stale cache line rather than the new data in RAM. Conversely, before a device reads via DMA, data the CPU has written may still sit in cache, so the device would read old data from RAM. `dma_sync_single_for_device()` writes modified cache lines back to RAM; `dma_sync_single_for_cpu()` invalidates stale cache lines so the CPU reads fresh data from RAM. The DMA API handles this, but only if you call the sync functions correctly.
Scatter-gather DMA allows a single DMA operation to read from or write to multiple non-contiguous memory buffers, using a list of (address, length) descriptors. Instead of copying scattered buffers into one contiguous area before DMA, the device transfers each buffer directly. This is essential for network packets (headers in one region, payload fragments in others), RAID operations (XOR across multiple disks), and any protocol where data is naturally fragmented. Without scatter-gather, drivers would need to copy data into contiguous buffers before transfer, defeating much of DMA's performance benefit. Linux uses struct scatterlist and dma_map_sg()/dma_unmap_sg() for this.
A bounce buffer is a temporary buffer placed in memory the device can address, standing in for a buffer the device cannot reach directly. When the device needs to DMA to or from an address it cannot reach, the kernel copies the data to or from the bounce buffer, and the DMA targets the bounce buffer instead. This commonly happens when a device has a limited DMA mask (for example, a 32-bit device that can only address the first 4GB) and no IOMMU translation is available. In Linux, swiotlb (the software I/O TLB) implements bounce buffering. Bounce buffers add a memcpy per transfer, which hurts performance and motivates using 64-bit-capable devices or an IOMMU for high-throughput storage.
Cache-coherent DMA hardware automatically maintains coherency between CPU caches and device DMA: the device's memory accesses are visible to the CPU without explicit cache management. x86 platforms and ARM systems with a coherent interconnect (such as CCI) provide this in hardware. Non-cache-coherent DMA requires software-managed coherency: after the device writes to memory, CPU caches must be invalidated before the CPU reads, so that it sees the new data; before the device reads from memory, dirty cache lines must be written back so that the device sees the latest data. The `dma_sync_*` functions perform these operations. Some ARM SoCs and older architectures lack hardware coherency and require careful software management. Using the streaming DMA APIs (`dma_map_single`, `dma_sync_*`) handles coherency correctly on both coherent and non-coherent platforms, because the API abstracts the difference.
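The copy-in and copy-out steps can be sketched with a toy model. The Python below assumes a flat `bytearray` as RAM, a reserved pool at low addresses, and a bump allocator that never frees; `ToyBounceBuffer` and its methods are hypothetical names that only imitate the shape of the map and unmap calls, not swiotlb itself.

```python
class ToyBounceBuffer:
    """Toy bounce-buffer layer for a device that can only address 0..dma_mask."""

    def __init__(self, ram, pool_start, pool_size, dma_mask):
        self.ram = ram
        self.pool_end = pool_start + pool_size
        self.next_free = pool_start   # simple bump allocator (never freed here)
        self.dma_mask = dma_mask
        self.bounced = {}             # dma_addr -> (original_addr, length)

    def map_single(self, addr, length, to_device):
        """Return the address the device should use for this buffer."""
        if addr + length - 1 <= self.dma_mask:
            return addr               # device can reach it: no bounce needed
        if self.next_free + length > self.pool_end:
            raise MemoryError("bounce pool exhausted")
        slot = self.next_free
        self.next_free += length
        if to_device:                 # copy-in so the device reads current data
            self.ram[slot:slot + length] = self.ram[addr:addr + length]
        self.bounced[slot] = (addr, length)
        return slot

    def unmap_single(self, dma_addr, from_device):
        """Finish the transfer; copy device-written data back if it was bounced."""
        entry = self.bounced.pop(dma_addr, None)
        if entry is None:
            return
        addr, length = entry
        if from_device:
            self.ram[addr:addr + length] = self.ram[dma_addr:dma_addr + length]


ram = bytearray(64)
swiotlb = ToyBounceBuffer(ram, pool_start=0, pool_size=16, dma_mask=31)

# Receive buffer at address 40 is beyond the device's reach (mask = 31)
dma_addr = swiotlb.map_single(40, 4, to_device=False)
ram[dma_addr:dma_addr + 4] = b"DATA"          # the device writes to the bounce slot
swiotlb.unmap_single(dma_addr, from_device=True)

print(dma_addr, bytes(ram[40:44]))
```

The extra copy in `unmap_single` (or in `map_single` for transmit buffers) is the per-transfer cost that makes bounce buffering slower than direct or IOMMU-translated DMA.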
A DMA ring buffer (also called a descriptor ring) is a circular queue of DMA descriptors in memory that the CPU fills and the hardware reads. The CPU populates a descriptor with a memory address and length, then increments the ring tail pointer—the hardware reads descriptors, DMA transfers data to/from the address, and writes completion status back to the descriptor. The ring structure allows the CPU to batch submissions (preparing many descriptors at once) while the hardware works through them in parallel. For 10GbE+ NICs, the ring size (number of descriptors) directly limits throughput—too small a ring causes starvation when the hardware outpaces CPU descriptor preparation. Modern NICs support thousands of concurrent in-flight DMA operations via large rings. The alternative (single-descriptor blocking DMA) would limit throughput to one DMA operation at a time.
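A minimal Python model of such a ring is shown below as an illustration. It assumes the common convention of leaving one slot unused to distinguish full from empty, and the names (`DescriptorRing`, `post`, `hw_process`) are invented for this sketch; real hardware exposes head and tail through registers and writes completion status via DMA.

```python
class DescriptorRing:
    """Toy descriptor ring: the CPU fills at tail, the 'hardware' consumes at head."""

    def __init__(self, size):
        self.size = size
        self.desc = [None] * size
        self.head = 0  # next descriptor the hardware will process
        self.tail = 0  # next slot the CPU will fill

    def free_slots(self):
        # One slot stays empty so that head == tail always means "ring empty"
        return self.size - 1 - (self.tail - self.head) % self.size

    def post(self, addr, length):
        """CPU side: queue one buffer; returns False if the ring is full."""
        if self.free_slots() == 0:
            return False
        self.desc[self.tail] = {"addr": addr, "len": length, "done": False}
        self.tail = (self.tail + 1) % self.size
        return True

    def hw_process(self, budget):
        """Hardware side: complete up to budget descriptors, return the count."""
        completed = 0
        while self.head != self.tail and completed < budget:
            self.desc[self.head]["done"] = True  # completion status write-back
            self.head = (self.head + 1) % self.size
            completed += 1
        return completed


ring = DescriptorRing(size=4)
posted = [ring.post(0x1000 * i, 1500) for i in range(4)]
print(posted)                 # [True, True, True, False]
print(ring.hw_process(2))     # 2
print(ring.free_slots())      # 2
```

The fourth `post` fails because the ring is full, which is the starvation case described above when a ring is too small for the traffic rate.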
In virtualization with device passthrough (VT-d, AMD-Vi), a physical device is assigned directly to a VM. Without an IOMMU, the VM's DMA requests could reach any physical memory—accessing other VMs' memory or host kernel memory. The IOMMU restricts DMA to only the memory regions the VM has been allocated, via IOVA (I/O virtual address) translation. The hypervisor programs the IOMMU with a page table mapping the VM's assigned memory, and all DMA from the passthrough device goes through this translation. This isolation is what makes PCI passthrough safe—a compromised VM driver cannot DMA into host memory. IOMMU also supports interrupt remapping, preventing malicious VMs from injecting interrupts to other VMs. However, IOMMU adds latency (additional translation step) and requires the IOMMU to be programmed for each memory region the device accesses.
A DMA address mask (from `dma_get_mask()`) tells you the addressable range of a device—the highest address the device can reach. A 32-bit mask (`0xFFFFFFFF`) means the device can only address the first 4GB. A 64-bit mask means it can reach any address. A DMA address boundary (from `dma_get_required_mask()`) tells you the minimum mask the platform needs given available memory—above this boundary, you need bounce buffering or IOMMU translation. When allocating buffers for DMA, you must ensure the allocated memory falls within the device's addressable range. On 32-bit systems without IOMMU, if you allocate a buffer above the 4GB boundary and the device has a 32-bit mask, DMA to that buffer fails silently. Use `dma_alloc_coherent()` which returns an address the device can reach (or use IOMMU for address translation).
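The reachability test implied by a DMA mask is a one-line comparison, sketched here in Python. `dma_bit_mask` mirrors the arithmetic of the kernel's `DMA_BIT_MASK(n)` macro, while `device_can_reach` is a hypothetical helper written for this illustration.

```python
def dma_bit_mask(bits):
    """Mask with the low `bits` bits set, like the kernel's DMA_BIT_MASK(n)."""
    return (1 << bits) - 1


def device_can_reach(addr, length, mask):
    """True if every byte of [addr, addr+length) lies within the device's mask."""
    return addr + length - 1 <= mask


mask32 = dma_bit_mask(32)
mask64 = dma_bit_mask(64)

print(hex(mask32))                                        # 0xffffffff
print(device_can_reach(0xFFFFF000, 0x1000, mask32))       # True: ends at the 4GB limit
print(device_can_reach(0xFFFFF000, 0x1001, mask32))       # False: last byte crosses 4GB
print(device_can_reach(0x1_0000_0000, 0x1000, mask64))    # True
```

Note that the check uses the address of the last byte, not only the start address: a buffer that begins below the limit can still extend past it.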
When a DMA buffer shares a cache line with other data, CPU and device accesses to that line interfere. If your buffer occupies the first 32 bytes of a 64-byte cache line and another frequently written variable sits at offset 32, they share the same line: on coherent systems, DMA writes to the buffer invalidate the CPU's copy of the line and slow subsequent CPU accesses to the neighboring variable; on non-coherent systems, cache maintenance on the shared line can corrupt either the DMA data or the neighbor. The fix is to align the DMA buffer to a cache line boundary (typically 64 bytes) and size it as a multiple of the cache line size. In the kernel, structure members can be annotated with `____cacheline_aligned`; in userspace, `posix_memalign()` with an alignment of 64 handles this on x86. On ARM, cache line size varies, so check `getconf LEVEL1_DCACHE_LINESIZE`. The cost of a misaligned buffer grows with access frequency, because every DMA completion evicts a line that also holds hot data.
`dma_mapping_error()` checks whether a DMA mapping operation succeeded. You cannot tell from the returned address alone: the error value is platform-defined, and zero can be a valid DMA address on some systems. Mappings can fail when IOVA space or bounce buffers are exhausted, or when the device's DMA mask cannot be satisfied. Always check: `dma_addr = dma_map_single(...); if (dma_mapping_error(dev, dma_addr)) { /* handle error */ }`. The same check applies to `dma_map_page()`. `dma_map_sg()` is different: it reports failure by returning 0 mapped entries, so check its return value instead. Programming a device with an unchecked, failed mapping can send DMA to the wrong address and cause silent data corruption.
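Whether two objects land on the same cache line can be checked with integer division, as in this small Python sketch. It assumes a 64-byte line by default, and the helper names are made up for the example.

```python
CACHE_LINE = 64  # bytes; typical on x86, varies on other architectures


def cache_lines(addr, length, line=CACHE_LINE):
    """Range of cache line indices touched by [addr, addr+length)."""
    return range(addr // line, (addr + length - 1) // line + 1)


def shares_cache_line(buf_addr, buf_len, other_addr, other_len, line=CACHE_LINE):
    """True if the DMA buffer and the other object touch a common cache line."""
    a = cache_lines(buf_addr, buf_len, line)
    b = cache_lines(other_addr, other_len, line)
    return a.start < b.stop and b.start < a.stop


# 32-byte buffer at the start of a line, hot counter 32 bytes later: same line
print(shares_cache_line(0x1000, 32, 0x1020, 8))   # True
# Buffer padded to a full line: the counter now sits on the next line
print(shares_cache_line(0x1000, 64, 0x1040, 8))   # False
```

Padding the buffer to a whole number of lines and starting it on a line boundary guarantees the second outcome regardless of what the allocator places next to it.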
`dma_alloc_coherent()` allocates memory that is simultaneously accessible by the CPU and the device with cache coherency. It returns both the kernel virtual address and the bus (DMA) address. It uses the device's DMA mask to determine whether the memory must come from a specific region. `dma_alloc_attrs()` is the more general version—it takes an `attrs` argument (bit flags) that control allocation behavior. For example, you can pass `DMA_ATTR_WRITE_COMBINE` to get write-combining memory (faster for device writes but not cache-coherent), or `DMA_ATTR_NO_WARN` to suppress allocation failure warnings. Use `dma_alloc_coherent()` for simple cases where you need cache-coherent memory. Use `dma_alloc_attrs()` when you need specialized allocation attributes for performance or when working with older platform-specific quirks.
Streaming DMA (via `dma_map_single`, `dma_map_page`, `dma_map_sg`) maps a buffer for the duration of one DMA transfer, then unmaps it. Cache coherency is managed through ownership handoffs: mapping (or `dma_sync_single_for_device()`) writes CPU caches back before the device accesses the buffer, and unmapping (or `dma_sync_single_for_cpu()`) invalidates caches before the CPU reads what the device wrote. This is efficient for one-shot transfers where the buffer address changes frequently. Coherent DMA (via `dma_alloc_coherent`) allocates memory that stays mapped and is always coherent, so neither the CPU nor the device needs to sync. Coherent memory is more expensive to allocate (on non-coherent platforms it is usually uncached), cannot be swapped, and is best kept to smaller allocations. Streaming DMA is more flexible (any buffer can be used) and has lower overhead per transfer, but requires explicit sync calls if the CPU touches the buffer while it is mapped. Use streaming for bulk transfers; use coherent for small, persistent, frequently accessed structures like ring descriptors.
On systems with both IOMMU and a limited DMA mask (e.g., a 32-bit device on a 64-bit system), the DMA API uses IOMMU translation to allow the device to access memory beyond its natural addressable range. The device uses an IOVA (I/O virtual address) that the IOMMU translates to a physical address. Even if the device's DMA mask is 32-bit, IOMMU translation allows it to access any physical memory, because the device only sees the IOVA (which is within its mask) and the IOMMU does the translation to the actual physical address. The kernel's DMA API hides this complexity—when you call `dma_map_single()`, it returns an address within the device's mask, but the IOMMU handles the translation to the actual memory. This is how 32-bit PCIe devices can access memory above 4GB on systems with an IOMMU—critical for high-memory systems with legacy devices.
A DMA controller (third-party DMA, like the legacy Intel 8237 on ISA systems) is a separate chip that performs memory transfers on behalf of the CPU. The CPU programs the DMA controller with source, destination, and count, then starts the transfer. The DMA controller arbitrates for the bus and performs the transfer, signaling an interrupt on completion. Bus mastering DMA (first-party DMA) means the device itself contains the DMA engine and arbitrates for the bus directly, so the CPU programs the device rather than a separate controller. Modern PCIe devices (NVMe, network cards, SATA controllers) all use bus mastering DMA. Third-party DMA controllers are largely historical, found in older ISA systems or specialized embedded contexts. The programming model difference: third-party DMA has a centralized controller shared across devices; bus mastering DMA distributes the DMA engine to each device for higher bandwidth and lower latency.
Stack-allocated memory (local variables) is problematic for DMA because: (1) On kernels built with `CONFIG_VMAP_STACK`, the stack lives in vmalloc space, which is not guaranteed to be physically contiguous and cannot be passed to `dma_map_single()`. (2) Stack memory is reused by other functions as soon as the current function returns, possibly while the device is still performing DMA on it. (3) Stack variables share cache lines with neighboring locals, so cache maintenance for the DMA can corrupt unrelated data on non-coherent systems. (4) In embedded systems, stack memory may be in a physical region unsuitable for DMA. Use `dma_alloc_coherent()` or `kmalloc()` for DMA buffers, since these return memory suitable for the DMA API. Never pass a stack address to `dma_map_single()`.
Every DMA engine has a maximum transfer size (the largest segment it can handle in a single descriptor). For PCIe NVMe devices, limits on the order of 128KB or 256KB are common, but the value varies by controller. Larger transfers must be split into multiple DMA operations, either by chaining several descriptors in a scatter-gather list or by issuing several DMA calls. Software handles this by splitting large buffers into chunks of PAGE_SIZE (typically 4KB) or larger, never exceeding the device limit. For block devices, the kernel's block layer performs this splitting automatically. For character devices or network buffers, the driver must implement the splitting logic itself. If a buffer is larger than the device's maximum segment size, the DMA engine may silently truncate the transfer or fail, so always check the device datasheet for the limit and implement proper chunking.
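The chunking logic described above can be sketched as a small userspace function. The `toy_segment` struct and `toy_split_transfer()` are hypothetical names for illustration; a real driver would fill hardware descriptors or a scatter-gather list instead of a plain array, and would usually also respect alignment and boundary constraints that this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One descriptor-sized piece of a larger transfer (hypothetical layout). */
struct toy_segment {
    uint64_t addr;
    size_t   len;
};

/* Split [addr, addr + len) into segments no longer than max_seg bytes.
 * Returns the number of segments required; only the first 'cap' segments
 * are written to 'out', so the caller can detect an undersized array by
 * comparing the return value with 'cap'. */
size_t toy_split_transfer(uint64_t addr, size_t len, size_t max_seg,
                          struct toy_segment *out, size_t cap)
{
    size_t n = 0;

    assert(max_seg > 0);
    while (len > 0) {
        size_t chunk = len < max_seg ? len : max_seg;

        if (n < cap) {
            out[n].addr = addr;
            out[n].len = chunk;
        }
        n++;
        addr += chunk;
        len -= chunk;
    }
    return n;
}
```

For a 300KB buffer and a 128KB segment limit, the function reports three segments of 128KB, 128KB, and 44KB. Returning the required count rather than failing lets a caller size its descriptor ring first by passing `cap` as 0.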
These direction flags tell the kernel which way data moves so it can manage cache coherency correctly. `DMA_FROM_DEVICE` means the device writes to memory (device → CPU memory), so stale CPU cache lines must be invalidated before the CPU reads the data. `DMA_TO_DEVICE` means the CPU writes to memory and the device reads it (CPU → device), so dirty CPU cache lines must be written back before the device reads. `DMA_BIDIRECTIONAL` means the data may move in either direction; the kernel then performs the maintenance needed for both cases, which always works but may cost performance, so use it only when the direction is genuinely unknown. The map, unmap, and `dma_sync_*` calls perform this cache maintenance based on the flag you pass. Using the wrong direction causes coherency bugs: if you declare `DMA_TO_DEVICE` but the device actually wrote to memory, the CPU may read stale data from its cache. Some platforms also enforce the direction at the IOMMU level, so a misdeclared direction can cause the IOMMU to block the transfer. Always use the direction that matches your actual data flow.
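The effect of each direction can be summarized in a simplified model. The sketch below is an assumption-laden illustration for a non-coherent platform: the `toy_` names are hypothetical stand-ins for the kernel's flags, and real architectures differ in exactly which cache operations they issue and when.

```c
#include <assert.h>

/* Toy stand-ins for the kernel's direction flags (hypothetical names). */
enum toy_dma_dir {
    TOY_DMA_BIDIRECTIONAL,
    TOY_DMA_TO_DEVICE,
    TOY_DMA_FROM_DEVICE
};

#define TOY_CACHE_CLEAN      0x1u /* write dirty CPU cache lines back to RAM */
#define TOY_CACHE_INVALIDATE 0x2u /* discard CPU cache lines */
#define TOY_PERM_DEV_READ    0x1u /* device may read the mapping */
#define TOY_PERM_DEV_WRITE   0x2u /* device may write the mapping */

/* Cache maintenance before handing the buffer to the device. */
unsigned toy_cache_ops_before_dma(enum toy_dma_dir dir)
{
    switch (dir) {
    case TOY_DMA_TO_DEVICE:     return TOY_CACHE_CLEAN;
    case TOY_DMA_FROM_DEVICE:   return TOY_CACHE_INVALIDATE;
    case TOY_DMA_BIDIRECTIONAL: return TOY_CACHE_CLEAN | TOY_CACHE_INVALIDATE;
    }
    assert(0 && "unknown direction");
    return 0;
}

/* Cache maintenance before the CPU touches the buffer again. */
unsigned toy_cache_ops_after_dma(enum toy_dma_dir dir)
{
    /* Only data the device may have written needs to be re-fetched. */
    return dir == TOY_DMA_TO_DEVICE ? 0u : TOY_CACHE_INVALIDATE;
}

/* Access rights an IOMMU mapping would grant the device. */
unsigned toy_iommu_perms(enum toy_dma_dir dir)
{
    switch (dir) {
    case TOY_DMA_TO_DEVICE:     return TOY_PERM_DEV_READ;
    case TOY_DMA_FROM_DEVICE:   return TOY_PERM_DEV_WRITE;
    case TOY_DMA_BIDIRECTIONAL: return TOY_PERM_DEV_READ | TOY_PERM_DEV_WRITE;
    }
    assert(0 && "unknown direction");
    return 0;
}
```

In this model, a buffer mapped as to-device grants the device read access only, so a device write to it is exactly the kind of access an IOMMU can reject. The bidirectional case pays for both cache operations, which is why it is the slowest choice.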
`dma_unmap_*` (including `dma_unmap_single`, `dma_unmap_page`, `dma_unmap_sg`) tears down a previously created streaming DMA mapping, releasing the IOVA space and performing the cache maintenance (or bounce-buffer copy) needed before the CPU touches the data. If you unmap while the device is still using the buffer (DMA in flight), the device may read or write memory that is no longer properly mapped, causing corruption, IOMMU faults, or system crashes. The correct sequence: (1) ensure the device has finished the DMA (completion interrupt, polled completion register, or a fence), (2) call the `dma_unmap_*` function that matches the original map call, (3) only then access the buffer from the CPU. For streaming DMA, unmap as soon as the device signals completion. For coherent DMA, there is no unmap step; the memory stays mapped until it is released with `dma_free_coherent()`. Not unmapping leaks IOVA space on IOMMU systems and bounce-buffer slots on systems that fall back to swiotlb, so mappings eventually start to fail.
`dma_buf` is a kernel mechanism for sharing DMA-capable memory buffers between different drivers or subsystems (for example, between a camera driver and a video encoder). The problem it solves: Driver A allocates a buffer suitable for DMA, but Driver B also needs access to the same buffer for a different operation. `dma_buf` wraps the buffer in an object that is passed around as a file descriptor, with three core operations for importers: `attach` (registers another device as a user of the buffer), `map_attachment` (creates a DMA mapping for the attaching device), and `unmap_attachment` (releases that mapping). Because each device creates its own DMA mapping, the buffer can be shared across devices with different DMA constraints (different masks, different IOMMU requirements). The `dma_buf` abstraction underpins buffer sharing in Linux media pipelines (camera → ISP → encoder) and GPU render buffer sharing in the graphics stack.
Further Reading
- Linux Kernel Documentation: DMA API — Official guide to DMA mapping, coherent buffers, and streaming DMA
- Linux Kernel Documentation: DMA Engine — DMA engine framework and client APIs
- Intel VT-d Architecture — Intel’s IOMMU specification and programming guide
- AMD-Vi Architecture — AMD’s IOMMU documentation for virtualization
- Thunderbolt Security Research — Open-source Thunderstrike DMA attack research and defense mechanisms
- PCI-SIG DMA Attestation — PCI Express security specifications including ATS and PRI
- CXL 3.0 Specification — CXL interconnect with cache coherency for DMA-like access with memory semantics
Conclusion
DMA transforms device I/O from CPU-bound copying into direct device-memory transactions, enabling storage, network, and audio systems to operate at full bandwidth without CPU involvement. Bus mastering DMA dominates modern systems, with devices containing their own DMA engines that control the memory bus directly. The complexity lies in managing the translation between CPU addresses (virtual and physical) and device-visible bus addresses, especially with IOMMU-mediated translation providing security boundaries.
Scatter-gather extends DMA across non-contiguous buffers, essential for network packets and fragmented data. Cache coherency management (via `dma_sync_*` calls) ensures the CPU and device see consistent data. The swiotlb bounce buffer mechanism provides compatibility for systems without IOMMU, at the cost of extra copies.
Looking forward, CXL (Compute Express Link) promises DMA-like access with cache coherency built into the interconnect, simplifying driver code while enabling heterogeneous computing. DMA security concerns grow as Thunderbolt and external device DMA remain attack vectors, making IOMMU enablement increasingly critical.