Threads & Lightweight Processes
Understanding user threads vs kernel threads, thread pools, and thread models for concurrent programming in modern operating systems.
Threads & Lightweight Processes
Picture a factory floor. Each process is like a separate building with its own entrance, utilities, and security clearance. Inside each building, workers (threads) share the same electricity, water, and access cards — they can communicate instantly without going outside.
In computing terms: threads are the unit of execution within a process. They share the process’s address space, open files, and global variables, but each thread has its own stack, program counter, and registers. This shared everything makes threads lightweight compared to processes, and it’s why modern servers can handle millions of concurrent connections with a handful of processes.
Introduction
When you run a web server, each incoming connection doesn’t need a separate process. A single process can handle thousands of connections using threads — each thread processes a different client, but all threads share the same memory and file descriptors. This dramatically reduces overhead compared to forking a new process per connection.
Threads enable concurrency (one CPU doing multiple things by interleaving) and parallelism (multiple CPUs doing multiple things simultaneously). Understanding when to use each — and which threading model to use — is crucial for building scalable systems.
User Threads vs Kernel Threads
The fundamental distinction in threading is where threads are managed:
| Aspect | User Threads | Kernel Threads |
|---|---|---|
| Managed by | User-level library (pthread, greenlet) | Operating system kernel |
| Scheduling | User-space library decides | Kernel scheduler decides |
| Context switch | Fast (no kernel involvement) | Slower (kernel mode required) |
| Blocking I/O | Can block entire process | Can block single thread |
| Multi-core | Cannot run in parallel | Can run in parallel |
| Example | GNU Pth, Green threads | POSIX pthread, Windows threads |
User Threads
User threads are managed entirely in user space without kernel involvement. The thread library (e.g., GNU Pth, goroutine runtime) handles thread creation, scheduling, and synchronization. The kernel sees only the single process — it knows nothing about the user-level threads within it.
Advantages:
- Fast creation and context switching (no system calls)
- Custom scheduling policies
- Portable across operating systems
Disadvantages:
- If any user thread blocks on I/O, the entire process blocks (kernel sees only one runnable task)
- Cannot take advantage of multi-core parallelism
- If the user thread library has bugs, the whole process crashes
// User-level threading example (GNU Pth)
#include <pth.h>
void *worker(void *arg) {
pth_sleep(1); // Cooperative yield to other user threads
return NULL;
}
int main() {
pth_init();
pth_t t1 = pth_spawn(PTH_ATTR_DEFAULT, worker, NULL);
pth_join(t1, NULL);
pth_exit(0);
}
Kernel Threads
Kernel threads are managed by the OS scheduler — each thread appears as a schedulable entity. The kernel handles context switching between threads, and threads from the same process can run on different CPU cores.
Advantages:
- True parallelism on multi-core systems
- Individual threads can block on I/O without affecting others
- Kernel handles all scheduling, no custom library needed
Disadvantages:
- Slower context switches (requires kernel mode transition)
- Higher memory overhead (kernel data structures per thread)
- More expensive creation
// Kernel threading example (POSIX pthread)
#include <pthread.h>
void *worker(void *arg) {
sleep(1); // Kernel preempts, other threads run
return NULL;
}
int main() {
pthread_t t1, t2;
pthread_create(&t1, NULL, worker, NULL);
pthread_create(&t2, NULL, worker, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
return 0;
}
When to Use
- I/O-bound concurrent workloads: Web servers, API handlers, database connection pools
- CPU-bound parallel computation: Scientific computing, image processing, ML inference
- Real-time systems: Hard deadlines require kernel-level thread control
- High-throughput network services: Millions of connections need efficient threading
When Not to Use
- Simple scripts: Using threads adds complexity without benefit for sequential tasks
- CPU-bound single-task: A single-threaded process often outperforms threaded code for compute-heavy single tasks due to avoiding synchronization overhead
- Debugging complexity: Threading bugs (deadlocks, race conditions) are notoriously hard to reproduce — only use when necessary
Thread Models
Modern systems implement threading in several models, each with tradeoffs:
graph BT
subgraph "Threading Models"
M1[1:1 Model<br/>One kernel thread<br/>per user thread]
M2[N:1 Model<br/>Many user threads<br/>to one kernel thread]
M3[N:M Model<br/>Many user threads<br/>to many kernel threads]
end
K[Kernel Space]
U[User Space]
U --> M1
U --> M2
U --> M3
M1 --> K
M3 --> K
style M1 stroke:#1a3a1a
style M2 stroke:#3a1a1a
style M3 stroke:#1a1a3a
1:1 Model (One-to-One)
Each user-level thread maps to one kernel thread.
- Used by: Linux (NPTL), Windows (Win32 Threads), Solaris
- Pros: True parallelism, individual thread blocking works
- Cons: Higher overhead (kernel thread creation), limited scalability (thread count limited)
N:1 Model (Many-to-One)
Many user threads map to a single kernel thread.
- Used by: GNU Pth, early green thread implementations
- Pros: Fast, scalable to millions of fibers
- Cons: Cannot use multi-core, one blocked user thread blocks all
N:M Model (Many-to-Many)
Many user threads are distributed across a pool of kernel threads.
- Used by: Solaris (LWPs), some Go runtimes (goroutines on GOMAXPROCS > 1)
- Pros: Best of both worlds — parallelism without unlimited kernel threads
- Cons: Complex implementation, requires coordination between user and kernel scheduling
Architecture Diagram
graph TB
subgraph Process_Address_Space
subgraph Code_Region[Code (.text)]
C1[Function A]
C2[Function B]
end
subgraph Shared_Data[Shared Data]
G1[Global Variables]
G2[Heap]
end
subgraph Thread_Stacks[Per-Thread Stacks]
S1[Thread 1 Stack<br/>SP, BP, Return Addr]
S2[Thread 2 Stack<br/>SP, BP, Return Addr]
S3[Thread 3 Stack<br/>SP, BP, Return Addr]
end
subgraph Per_Thread_Registers[Per-Thread Registers]
R1[Thread 1: PC=0x401234]
R2[Thread 2: PC=0x402345]
R3[Thread 3: PC=0x403456]
end
end
S1 --> G1
S2 --> G1
S3 --> G1
S1 --> C1
S2 --> C2
S3 --> C1
style Shared_Data stroke:#1a3a1a
style Thread_Stacks stroke:#3a1a1a
style Per_Thread_Registers stroke:#1a1a3a
Each thread has its own stack and registers, but all threads share the code and global data regions. This shared access is what makes threads efficient for communication but also what creates synchronization challenges.
Core Concepts
Thread-Specific Data
Each thread needs its own stack for local variables, but often needs “global” data that’s actually per-thread (thread-local storage):
// Thread-local storage example
#include <pthread.h>
__thread int my_thread_id = 0; // GCC extension
void *worker(void *arg) {
my_thread_id = *(int *)arg;
printf("Thread %d sees my_thread_id = %d\n", pthread_self(), my_thread_id);
return NULL;
}
Thread Local Storage (TLS)
TLS provides thread-safe access to global-like data without explicit parameter passing. The OS allocates a separate slot for each thread’s copy:
# Python threading with local context
import threading
local_data = threading.local()
def worker(n):
local_data.value = n
import time
time.sleep(0.1)
print(f"Thread {n} sees value = {local_data.value}")
threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()
Thread States
Threads have states similar to processes:
- New: Thread being initialized
- Ready: Thread waiting to run
- Running: Thread executing on a CPU
- Blocked: Thread waiting for I/O or synchronization
- Terminated: Thread finished, waiting to be joined
Thread Cancellation
Threads can be cancelled, but careful handling is required:
#include <pthread.h>
void *worker(void *arg) {
while (1) {
// Check for cancellation point
pthread_testcancel();
// Do work
printf("Working...\n");
sleep(1);
}
return NULL;
}
int main() {
pthread_t t;
pthread_create(&t, NULL, worker, NULL);
sleep(5);
pthread_cancel(t); // Async cancellation
pthread_join(t, NULL);
return 0;
}
Thread Pools
Creating a new thread per task is expensive. Thread pools solve this by pre-creating threads and reusing them:
#!/usr/bin/env python3
"""Thread pool implementation in Python."""
from queue import Queue, Empty
from threading import Thread, Condition
import time
class ThreadPool:
def __init__(self, num_threads):
self.tasks = Queue()
self.workers = []
self.shutdown = False
self.lock = Condition()
for _ in range(num_threads):
t = Thread(target=self._worker_loop)
t.daemon = True
t.start()
self.workers.append(t)
def _worker_loop(self):
while True:
try:
task = self.tasks.get(timeout=0.1)
if task is None: # Poison pill
break
func, args, kwargs = task
try:
func(*args, **kwargs)
except Exception as e:
print(f"Task failed: {e}")
finally:
self.tasks.task_done()
except Empty:
if self.shutdown:
break
def submit(self, func, *args, **kwargs):
if self.shutdown:
raise RuntimeError("Pool is shutting down")
self.tasks.put((func, args, kwargs))
def wait_completion(self):
self.tasks.join()
def shutdown_pool(self):
self.shutdown = True
for _ in self.workers:
self.tasks.put(None) # Poison pills
for t in self.workers:
t.join()
# Usage
def cpu_task(n):
result = sum(i*i for i in range(1000000))
print(f"Task {n} completed, result starts with {result}")
pool = ThreadPool(4)
for i in range(8):
pool.submit(cpu_task, i)
pool.wait_completion()
pool.shutdown_pool()
Production Failure Scenarios
Deadlocks
Problem: Two threads each hold a lock the other needs, and neither will release until it acquires the other.
Symptoms: Application hangs with 100% CPU utilization (spinning on locks) or no CPU utilization (deadlocked with sleeping threads).
Mitigation:
- Always acquire locks in a consistent global order
- Use lock timeout with
pthread_mutex_timedlock() - Use lock-free data structures when possible
- Detect potential deadlocks with thread sanitizers
// Lock ordering to prevent deadlock
pthread_mutex_lock(&lock_a);
pthread_mutex_lock(&lock_b);
// Do work
pthread_mutex_unlock(&lock_b);
pthread_mutex_unlock(&lock_a);
// NEVER do this in another thread:
// pthread_mutex_lock(&lock_b);
// pthread_mutex_lock(&lock_a);
Race Conditions
Problem: Multiple threads access shared data without synchronization, leading to unpredictable results.
Symptoms: Intermittent crashes, data corruption, bizarre behavior only in production.
Mitigation:
- Use proper synchronization (mutexes, atomic operations)
- Prefer lock-free data structures
- Use thread-safe primitives from day one
- Run with
TSAN_OPTIONS=halt_on_error=1(ThreadSanitizer)
# Compile with ThreadSanitizer
gcc -fsanitize=thread -g program.c -o program
# Run to detect race conditions
./program
Thread Starvation
Problem: Some threads never get CPU time because other threads monopolize resources.
Symptoms: Certain requests timeout while others complete normally.
Mitigation:
- Set thread priorities appropriately
- Use fair scheduling policies (SCHED_FIFO, SCHED_OTHER)
- Implement work-stealing queues
Premature Thread Termination
Problem: Process exits before spawned threads complete.
Mitigation: Always call pthread_join() or use pthread_detach().
Trade-off Table
| Thread Model | Parallelism | Scalability | Complexity |
|---|---|---|---|
| 1:1 (Kernel) | Full | Limited by thread count | Simple |
| N:1 (User) | None | High | Simple |
| N:M (Hybrid) | Full | High | Complex |
| Thread Creation | Time (µs) | Memory |
|---|---|---|
| pthread_create | ~10-50 | ~8MB stack |
| goroutine | ~0.2 | ~2KB stack |
| fiber (fthread) | ~0.1 | ~4KB stack |
| Use Case | Best Model | Reason |
|---|---|---|
| Web server (blocking I/O) | 1:1 | True parallelism needed |
| Event-driven server | N:1 (green threads) | Millions of connections |
| CPU-bound parallel | N:M | Balance parallelism and overhead |
Implementation Snippets
Basic Pthread Creation and Join
#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#define NUM_THREADS 4
void *compute(void *arg) {
int n = *(int *)arg;
long result = 0;
for (int i = 0; i < 1000000; i++) {
result += i * i;
}
printf("Thread %d computed %ld\n", n, result);
return NULL; // Return value passed to pthread_join()
}
int main() {
pthread_t threads[NUM_THREADS];
int args[NUM_THREADS];
// Create threads
for (int i = 0; i < NUM_THREADS; i++) {
args[i] = i;
if (pthread_create(&threads[i], NULL, compute, &args[i]) != 0) {
perror("pthread_create failed");
return 1;
}
}
// Wait for all threads
for (int i = 0; i < NUM_THREADS; i++) {
pthread_join(threads[i], NULL); // NULL = don't capture return
}
printf("All threads completed\n");
return 0;
}
Mutex for Shared Data
#include <pthread.h>
#include <stdio.h>
typedef struct {
pthread_mutex_t mutex;
int counter;
} shared_data_t;
void *increment(void *arg) {
shared_data_t *data = (shared_data_t *)arg;
for (int i = 0; i < 1000; i++) {
pthread_mutex_lock(&data->mutex);
data->counter++;
pthread_mutex_unlock(&data->mutex);
}
return NULL;
}
int main() {
shared_data_t data = {.counter = 0};
pthread_mutex_init(&data.mutex, NULL);
pthread_t t1, t2;
pthread_create(&t1, NULL, increment, &data);
pthread_create(&t2, NULL, increment, &data);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
printf("Final counter: %d (expected: 2000)\n", data.counter);
pthread_mutex_destroy(&data.mutex);
return 0;
}
Condition Variables for Thread Coordination
#include <pthread.h>
#include <stdio.h>
typedef struct {
pthread_mutex_t mutex;
pthread_cond_t cond;
int ready;
} sync_data_t;
void *producer(void *arg) {
sync_data_t *data = (sync_data_t *)arg;
printf("Producer: Doing work...\n");
sleep(1);
pthread_mutex_lock(&data->mutex);
data->ready = 1;
pthread_cond_signal(&data->cond); // Wake up consumer
pthread_mutex_unlock(&data->mutex);
return NULL;
}
void *consumer(void *arg) {
sync_data_t *data = (sync_data_t *)arg;
pthread_mutex_lock(&data->mutex);
while (!data->ready) {
pthread_cond_wait(&data->cond, &data->mutex); // Atomically unlock and wait
}
pthread_mutex_unlock(&data->mutex);
printf("Consumer: Data is ready!\n");
return NULL;
}
int main() {
sync_data_t data = {.ready = 0};
pthread_mutex_init(&data.mutex, NULL);
pthread_cond_init(&data.cond, NULL);
pthread_t prod, cons;
pthread_create(&cons, NULL, consumer, &data);
pthread_create(&prod, NULL, producer, &data);
pthread_join(prod, NULL);
pthread_join(cons, NULL);
pthread_mutex_destroy(&data.mutex);
pthread_cond_destroy(&data->cond);
return 0;
}
Observability Checklist
Key Metrics
# Show all threads (LWPs) for a process
ps -eLf -p <pid>
# Show thread-specific CPU usage
top -H -p <pid>
# Count threads
cat /proc/<pid>/status | grep Threads
# Check thread IDs
cat /proc/<pid>/task/*/tid
Trace Commands
# Trace thread creation
sudo perf trace -e clone
# Trace mutex operations
perf record -e 'mutex:*' -a -- sleep 10
perf report
# Show lock contention
perf sched record -- sleep 5
perf sched latency
Debugging Commands
# Attach to process and view threads
gdb -p <pid>
(gdb) info threads
# View thread stack traces
pstack <pid>
# Show where threads are blocked
cat /proc/<pid>/stack
Common Pitfalls / Anti-Patterns
Thread Safety in Security-Critical Code
Security-sensitive data (keys, tokens, credentials) must be protected with proper synchronization. Use thread-local storage for sensitive per-thread data to avoid accidentally leaking between threads.
Resource Limits
Thread counts are limited by system resources:
# Check thread limits
cat /proc/sys/kernel/threads-max
cat /proc/sys/vm/max_map_count
# Set soft limit
ulimit -u
Race Conditions as Security Vulnerabilities
Race conditions in privilege operations can lead to TOCTOU (Time-of-check to time-of-use) attacks. Use atomic operations for security-critical code paths.
Common Pitfalls / Anti-patterns
Creating too many threads
Each thread has a stack (default 8MB on Linux). Creating 1000 threads consumes 8GB just for stacks. Use thread pools instead.
Not handling thread cancellation safely
If you call pthread_cancel(), ensure cancellation points exist. Async cancellation can leave data structures corrupted.
Ignoring false sharing
Threads modifying data in the same cache line (even with different variables) cause cache line invalidation storms. Pad data structures to cache line boundaries (64 bytes).
// False sharing example
struct {
pthread_mutex_t mutex;
long counter; // These two might share a cache line
} data[4]; // Each thread modifies data[i].counter
// Better: pad to prevent cache line sharing
struct {
char pad[64]; // Cache line padding
long counter;
};
Mixing locked and lock-free access
Don’t use mutexes for some accesses to shared data and atomics for others — this creates undefined behavior.
Quick Recap Checklist
- Threads share process address space but have own stack and registers
- User threads: fast but can’t use multi-core; kernel threads: slower but truly parallel
- 1:1 model (Linux, Windows) is most common; N:M (Solaris, some Go) balances tradeoffs
- Thread pools avoid creation overhead and limit thread count
- Deadlocks occur when locks are acquired in inconsistent orders
- Race conditions cause intermittent failures; use ThreadSanitizer to detect
- Each thread needs ~8MB stack on Linux; too many threads = memory exhaustion
- Always join or detach threads to prevent resource leaks
- Use thread-local storage for per-thread globals without explicit parameter passing
Interview Questions
A process is an independent execution context with its own address space, while a thread is the unit of execution within a process.
Key differences:
- Memory: Processes have separate address spaces; threads share the same address space
- Communication: Processes require IPC (pipes, sockets, shared memory); threads can communicate directly via shared memory
- Creation overhead: Process creation duplicates memory tables; thread creation only allocates a stack and registers
- Fault isolation: A crash in one process doesn't affect others; a crash in one thread can crash the entire process
Think of processes as separate houses (each with own utilities) and threads as roommates in the same house (sharing electricity, water, internet).
1:1 Model (One-to-One): Each user thread maps to one kernel thread. Used by Linux (NPTL), Windows. Allows true parallelism and individual thread blocking, but kernel thread count is limited.
N:1 Model (Many-to-One): Many user threads map to a single kernel thread. Used by GNU Pth and green thread implementations. Very scalable (millions of fibers) but cannot use multi-core, and one blocking user thread blocks the entire process.
N:M Model (Many-to-Many): User threads are multiplexed over a pool of kernel threads. Used by Solaris LWPs. Best of both worlds — parallelism without unbounded kernel thread count. Most complex to implement.
Modern server operating systems predominantly use the 1:1 model due to its simplicity and the fact that kernel thread limits (thousands) are sufficient for most workloads.
A deadlock occurs when two or more threads are permanently blocked, each holding a lock the others need. Classic example: Thread A holds Lock 1 and waits for Lock 2, while Thread B holds Lock 2 and waits for Lock 1.
Prevention strategies:
- Lock ordering: Always acquire multiple locks in a consistent global order. If all threads follow the same order, circular wait is impossible.
- Lock timeout: Use
pthread_mutex_timedlock()to detect potential deadlocks before they occur. - Fine-grained locking: Reduce the number of locks held simultaneously.
- Lock-free data structures: Use atomic operations and lock-free algorithms when possible.
- Avoid nested locks: If possible, design code to need only one lock at a time.
Detection tools like ThreadSanitizer can identify potential deadlock patterns before they occur in production.
A thread pool is a pattern where a fixed number of threads are pre-created and reused to execute multiple tasks. Instead of creating a new thread per task (expensive), tasks are placed in a queue and picked up by idle threads.
Benefits:
- Avoids thread creation overhead (10-50 µs per pthread_create)
- Bounds memory usage (fixed thread count, each ~8MB stack)
- Provides backpressure (queue full = reject or block new tasks)
- Simplifies concurrency management
Use a thread pool when:
- Handling many short-lived tasks
- Processing items in parallel without unbounded thread growth
- Web servers handling many concurrent requests
- Database connection pools
Typical pool size: 1-2x CPU core count for CPU-bound work, higher for I/O-bound work. Many languages have built-in thread pools: Java's ExecutorService, Python's concurrent.futures.ThreadPoolExecutor, Go's goroutine pools.
False sharing occurs when multiple threads modify data that happens to reside in the same CPU cache line, even though the threads are accessing different variables. This causes cache line invalidation on every write, effectively serializing access.
Example: Thread 0 modifies counter[0], Thread 1 modifies counter[1]. If both are in the same 64-byte cache line, every write by one thread invalidates the cache line for the other, forcing memory round-trips.
Solutions:
- Pad data structures to cache line boundaries (64 bytes on x86)
- Use per-thread local data and aggregate later
- Use atomic operations with cache-aligned structures
False sharing can cause 10-100x performance degradation in high-frequency code paths. It's often seen in statistics counters, ring buffers, and any frequently-updated shared state.
A mutex (mutual exclusion lock) is a synchronization primitive that puts a thread to sleep when the lock is not available. When the lock is released, the kernel wakes up one of the waiting threads.
A spinlock busy-waits (spins) on a atomic variable until the lock becomes available. It does not sleep. Spinlocks are only efficient when the critical section is very short (less than the cost of a context switch) and lock contention is low.
Using a spinlock while holding a lock that might sleep (or in interrupt context) causes deadlock. In Linux kernel code, spinlocks are used in interrupt handlers because they cannot sleep, while mutexes are used in process context.
Thread-local storage (TLS) provides each thread with its own independent copy of a global variable. Instead of sharing a single variable across all threads, each thread sees its own instance.
Use cases:
- errno — each thread needs its own error code
- Performance counters per thread
- Caching per-thread state without locking
Implementations: GCC's __thread keyword, pthread_key_create()/pthread_getspecific(), thread_local (C++11). The compiler/linker allocates space in the thread's stack segment for each TLS variable.
A condition variable (condvar) allows a thread to wait for some arbitrary condition to become true, atomically releasing the associated mutex and adding itself to the wait queue.
Pattern:
- Thread acquires mutex
- Checks condition — if false, calls pthread_cond_wait()
- pthread_cond_wait() atomically releases the mutex and adds the thread to the wait queue
- When another thread signals (pthread_cond_signal) or broadcasts (pthread_cond_broadcast), the waiting thread wakes up and re-acquires the mutex
- Thread re-checks the condition and proceeds if true
Always re-check the condition in a loop — spurious wakeups can occur and another thread may have changed the condition between the wakeup and mutex acquisition.
The producer-consumer problem: one or more threads produce items and place them in a bounded buffer, while one or more threads consume items from the buffer. The buffer must not overflow ( producers must wait if full) and must not underflow ( consumers must wait if empty).
A semaphore is a non-negative integer with atomic wait() and signal() operations. A semaphore initialized to N allows up to N concurrent wait() calls before blocking.
Solution: Use two semaphores — one counting the number of items in the buffer (for consumers to wait on), and one counting the number of empty slots (for producers to wait on). A mutex provides mutual exclusion for the buffer itself.
This avoids the need for explicit while loops checking buffer state — the semaphore count tracks availability.
The readers-writers problem: multiple readers can access data concurrently, but writers need exclusive access. This is common for cached data that is read frequently but updated less often.
Solutions:
- Reader-preferring: Allow readers to proceed as long as no writer holds the lock. Can starve writers indefinitely.
- Writer-preferring: Once a writer is waiting, new readers are blocked. Can starve readers.
- Fair/ordered: Both readers and writers are queued FIFO. First to request the lock gets it next.
pthread provides pthread_rwlock_init() for reader-writer locks with optional writer-preference scheduling via pthread_rwlockattr_setkind_np().
pthread_join() blocks the caller until the target thread terminates. It also retrieves the thread's return value (passed to pthread_exit()). After join(), the thread's resources are freed.
pthread_detach() marks the thread as detached — its resources are freed immediately when it terminates, without needing to be joined. You cannot join a detached thread.
If you neither join nor detach a thread, its resources leak (remain allocated) after termination until the process exits. Always either join or detach every created thread.
Common pattern: main thread creates worker threads, workers detach themselves with pthread_detach() if they don't return a value that main needs.
A barrier is a synchronization primitive that blocks threads until a fixed number of threads have reached the barrier. Then all threads are released simultaneously.
Use case: parallel algorithms where work is divided into phases — all threads must complete phase 1 before any can start phase 2.
Example: parallel sort where threads sort their chunks, then all threads must merge — no thread can merge until all chunks are sorted.
pthread_barrier_init(N) creates a barrier for N threads. Each thread calls pthread_barrier_wait(). When the Nth thread calls wait(), all N threads are released.
Many C standard library functions use static internal state and are not thread-safe. Examples: strtok() (uses internal pointer), rand() (uses internal state), getenv() (implementation-dependent).
The errno variable is thread-local — each thread gets its own errno, so it is thread-safe even though it looks like a global.
Linux provides reentrant versions: strtok_r(), rand_r(), and async-signal-safe alternatives for signal handlers. man 7 pthreads documents which functions are thread-safe.
In C++11 and later, most standard library functions are specified as thread-safe. std::locale, std::cout (with some synchronization), and most containers require explicit synchronization for concurrent access.
The ABA problem occurs in lock-free data structures when using compare-and-swap (CAS):
- Thread A reads location X, sees value A
- Thread B changes X from A to B and back to A
- Thread A reads X again, sees A (same as before), and assumes no change occurred
If A's CAS checks only the value (A), it will succeed even though X changed twice. This can cause data corruption in queues, stacks, and linked lists.
Solutions: Double-width CAS (storing a counter with the pointer), tagged pointers (using extra bits as a tag that increments on each change), or using hazard pointers to track nodes currently being accessed.
RCU is a synchronization mechanism used in the Linux kernel for read-heavy workloads. Readers access data without locking — they just read. Writers make a private copy, modify it, then atomically replace the pointer to the old data with the new copy.
The key insight: old readers are guaranteed to finish their read before the old data is freed. This is called "grace period" — the writer waits for all pre-existing readers to complete.
RCU is used for routing tables, file system dentries, and other read-heavy structures in the Linux kernel. It provides extremely low overhead for readers (no atomic operations, no memory barriers on most architectures) at the cost of more complex writer logic.
Deadlock detection: use tools like helgrind (part of Valgrind), ThreadSanitizer, or the libasan thread checker. For production, use the deadlock detection options in your mutex implementation or instrument lock acquisition with stack traces.
Race conditions: ThreadSanitizer (tsan) detects races by tracking memory access and synchronization events. Compile with -fsanitize=thread. Helgrind detects lock order inversions and races in pthread programs.
For production debugging: use SIGABRT handlers that print thread backtraces (especially useful for deadlocks — pressing Ctrl+\\ sends SIGQUIT to dump all thread stacks). Use GDB's thread apply all bt command.
From the kernel scheduler's perspective, only kernel threads (lightweight processes, LWPs) are schedulable entities. User threads exist only in user space — the kernel knows nothing about them.
In the 1:1 model, each user thread maps to a kernel thread, so each is schedulable. In the N:1 model, all user threads map to one kernel thread — only one can run at a time. In the N:M model, a pool of kernel threads serves many user threads.
The scheduler makes decisions about which kernel thread to run on which CPU core. The threading library's user-space scheduler decides which of its user threads runs on the next available kernel thread.
A futex (Fast Userspace Mutex) is a Linux-specific syscall that provides efficient synchronization without kernel involvement in the common case.
The problem with traditional mutexes: every lock acquisition and release required a syscall (to enter the kernel), even when there was no contention. On an uncontended mutex, this was pure overhead.
Futex works by keeping the lock state in userspace memory (an atomic integer). Contention is detected via atomic operations. Only when contention is detected does the thread enter the kernel to wait on a kernel queue. When the lock is released, the releaser notifies the kernel, which wakes one waiter.
Modern mutexes (pthread_mutex, std::mutex) on Linux are all implemented on top of futexes.
A lock-free data structure uses atomic operations (CAS, fetch_add, etc.) instead of locks to ensure that at least one thread can make progress in any operation, regardless of other threads' state.
Lock-free is stronger than wait-free (every operation completes in bounded steps) and lock-free (some thread always makes progress).
Use lock-free structures when:
- Multiple threads update a shared structure with minimal contention
- You need better latency predictability than locks provide (no priority inversion)
- Real-time systems where priority inheritance overhead is unacceptable
Trade-offs: lock-free algorithms are significantly more complex to write correctly. Memory management is particularly hard (hazard pointers, RCU, or epoch-based reclamation are common approaches).
Go uses an M:N threading model — M goroutines multiplexed onto N OS threads (typically GOMAXPROCS threads). The Go runtime scheduler manages goroutine scheduling in user space, without kernel involvement for most scheduling decisions.
Unlike pthreads where each thread is a kernel thread, Go's scheduler runs goroutines on a fixed number of OS threads (configurable, defaults to number of CPU cores). When a goroutine blocks on a syscall, Go may spin up additional OS threads.
Key advantages: creating thousands of goroutines is cheap (kilobytes of stack vs megabytes for threads). The scheduler can make better decisions than the OS because it understands Go's channel synchronization.
However, CPU-bound goroutines can starve each other without yielding, unlike preemptive OS thread scheduling. Go 1.14 introduced asynchronous preemption to handle busy-waiting loops.
Further Reading
- Process Concept — Understand the PCB, process states, and fork-exec pattern
- Process Scheduling — How schedulers allocate CPU time to processes
- Process Scheduling Algorithms — Deep dive into scheduling algorithm implementations
Conclusion
Threads are the unit of execution within a process — they share the process’s address space, open files, and global variables while each maintaining its own stack, program counter, and registers. This shared-everything model makes threads lightweight compared to processes and enables efficient communication between concurrent tasks.
User threads (managed in library space) are fast but cannot achieve true multi-core parallelism because the kernel sees only one schedulable entity. Kernel threads are managed by the OS scheduler and can run on different CPU cores simultaneously. The 1:1 model (used by Linux and Windows) is the most common implementation — each user thread maps to one kernel thread. The N:M model (used by some Go runtimes) multiplexes many user threads over a pool of kernel threads for best-of-both-worlds performance.
Thread pools avoid the overhead of creating and destroying threads for short-lived tasks, while bounding memory usage by limiting the number of concurrent threads. Synchronization primitives (mutexes, condition variables, semaphores) protect shared data, but incorrect usage leads to deadlocks, race conditions, and performance pathologies like false sharing.
For your next step, explore process scheduling algorithms to understand how the OS decides which thread runs on each CPU core, or synchronization primitives to master the tools that make multi-threaded programming safe and correct.
Category
Related Posts
ASLR & Stack Protection
Address Space Layout Randomization, stack canaries, and exploit mitigation techniques
Assembly Language Basics: Writing Code the CPU Understands
Learn to read and write simple programs in x86 and ARM assembly, understanding registers, instructions, and the art of thinking in low-level operations.
Boolean Logic & Gates
Understanding AND, OR, NOT gates and how they combine into arithmetic logic units — the building blocks of every processor.