Query Rewriting: Subqueries, CTEs, Joins, Window Functions

Learn when to use subqueries vs joins vs CTEs, effective CTE patterns, IN vs EXISTS, OR to UNION rewrites, and avoiding SELECT *.

published: reading time: 15 min read

Query Rewriting: Subqueries, CTEs, Joins, and Window Functions

The same data can be retrieved in many ways. Sometimes the difference between a query that takes seconds and one that takes minutes is just how you structure it. Here are the rewriting techniques I reach for most often.

Subquery or Join: The Decision Flow

flowchart TD
    Start["Query needing data from multiple tables"]
    Q1{"Need columns from<br/>both tables?"}
    Q2{"Same subquery used<br/>multiple times?"}
    Q3{"Hierarchical data<br/>like org charts?"}
    Q4{"Checking existence<br/>of related rows?"}
    Q5{"Aggregating before<br/>joining?"}
    Join["Use JOIN"]
    CTE["Use CTE<br/>(materialization)"]
    Recursive["Use RECURSIVE CTE"]
    Exists["Use EXISTS or<br/>NOT EXISTS"]
    SubAgg["Subquery in FROM<br/>then JOIN"]
    Q1 -->|Yes| Join
    Q1 -->|No| Q2
    Q2 -->|Yes| CTE
    Q2 -->|No| Q3
    Q3 -->|Yes| Recursive
    Q3 -->|No| Q4
    Q4 -->|Yes| Exists
    Q4 -->|No| Q5
    Q5 -->|Yes| SubAgg
    Q5 -->|No| Join

Each section below covers one pattern in depth.

Subqueries vs Joins

The old advice was “joins are always better than subqueries.” This is no longer universally true.

When Subqueries Work Fine

-- Simple correlated subquery - often optimized well
SELECT p.name,
       (SELECT SUM(o.amount) FROM orders o WHERE o.product_id = p.id) as total_sold
FROM products p;
-- Non-correlated subqueries are typically flattened by the optimizer
SELECT *
FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE region = 'NORTH');

PostgreSQL’s optimizer often transforms subqueries into joins internally. You’re usually not shooting yourself in the foot by writing a subquery.

When Joins Are Clearer

-- When you need columns from both tables, a join is more straightforward
SELECT o.id, c.name, o.total
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'NORTH';

If you find yourself using a subquery to fetch columns from the outer query, consider rewriting as a join. It’s usually more readable and performs similarly.

Common Table Expressions (CTEs)

CTEs (WITH clauses) make complex queries readable. They also enable recursive queries and materialization.

Basic CTE

WITH regional_customers AS (
    SELECT id, name, region
    FROM customers
    WHERE region = 'NORTH'
),
recent_orders AS (
    SELECT customer_id, SUM(total) as order_total
    FROM orders
    WHERE created_at > '2026-01-01'
    GROUP BY customer_id
)
SELECT c.name, COALESCE(r.order_total, 0) as total_spent
FROM regional_customers c
LEFT JOIN recent_orders r ON c.id = r.customer_id;

CTEs for Materialization

A useful feature of CTEs is materializing intermediate results. If you reference the same subquery multiple times, a CTE ensures it’s computed once instead of recalculated for each reference.

-- Without CTE: the subquery runs twice
SELECT * FROM orders WHERE customer_id IN (SELECT id FROM customers WHERE region = 'NORTH')
UNION
SELECT * FROM orders WHERE customer_id IN (SELECT id FROM customers WHERE region = 'NORTH') AND total > 100;

-- With CTE: computed once
WITH north_customers AS (
    SELECT id FROM customers WHERE region = 'NORTH'
)
SELECT * FROM orders WHERE customer_id IN (SELECT id FROM north_customers)
UNION
SELECT * FROM orders WHERE customer_id IN (SELECT id FROM north_customers) AND total > 100;

Recursive CTEs

For hierarchical data, recursive CTEs are indispensable:

-- Find all employees under a manager
WITH RECURSIVE subordinates AS (
    -- Base case: direct reports
    SELECT id, name, manager_id
    FROM employees
    WHERE manager_id = 42

    UNION ALL

    -- Recursive case
    SELECT e.id, e.name, e.manager_id
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.id
)
SELECT * FROM subordinates;

When to Use CTEs vs Subqueries vs Window Functions

PatternUse whenAvoid when
Subquery (scalar)Need a single computed value per row, no reuse neededSame subquery appears multiple times in the query
Subquery (IN list)Comparing against a small, static set of valuesSubquery returns many rows — use a JOIN instead
CTE (non-materialized)Improving readability of complex multi-step logicBreaking a query into too many tiny CTEs hurts readability
CTE (materialized)Same subquery used 3+ times in the query, or query references itselfThe CTE result set is enormous — materialization is slow
Recursive CTEHierarchical traversal (org charts, tree structures, graph traversal)Simple linear data — a regular query is faster
Window functionRunning totals, rankings, LEAD/LAG comparisons, percentilesYou need to filter rows based on the window result — use a subquery instead

IN vs EXISTS

The IN vs EXISTS debate has been going on for decades. The answer depends on your data.

The General Rule

EXISTS is usually better when checking a condition, especially with correlated subqueries. IN is usually better when comparing against a small set of known values. For subqueries returning many rows with good index coverage, they usually perform similarly.

Performance Comparison

-- EXISTS often stops at first match
SELECT * FROM orders o
WHERE EXISTS (
    SELECT 1 FROM customers c WHERE c.id = o.customer_id AND c.region = 'NORTH'
);

-- IN collects all values first
SELECT * FROM orders o
WHERE o.customer_id IN (
    SELECT c.id FROM customers c WHERE c.region = 'NORTH'
);

EXISTS can short-circuit once it finds a match. IN typically collects all values into a set first. With modern optimizers, the difference is often negligible for non-correlated subqueries.

What About NULLs?

-- IN with NULL behavior
SELECT * FROM orders WHERE customer_id IN (1, 2, NULL);
-- Returns rows where customer_id is 1 or 2, but NOT rows where customer_id is NULL

-- EXISTS with NULL behavior
SELECT * FROM orders o
WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id AND c.region = 'NORTH');
-- EXISTS returns TRUE or FALSE, NULL handling is cleaner

EXISTS handles NULLs more intuitively. IN with NULLs in the subquery can produce unexpected results.

IN vs EXISTS vs JOIN: When Each Wins

ScenarioBest choiceWhy
Subquery returns few rows, outer query is largeINSmall set is cached efficiently, hash join works well
Correlated subquery, existence check onlyEXISTSStops at first match, no need to collect all rows
Need columns from both tablesJOINMore readable, same performance in modern optimizers
Subquery returns many rows with good indexEXISTS or IN (similar)Both use index efficiently
NULL values possible in subquery resultEXISTSIN with NULLs returns unexpected empty sets
NOT IN with potential NULLsNOT EXISTSNOT IN with NULLs returns nothing

OR to UNION Conversion

Sometimes rewriting OR as UNION gets you better index usage.

-- Original: OR condition might prevent index usage
SELECT * FROM orders
WHERE customer_id = 42 OR status = 'pending';
-- Rewritten: each branch can use its own index
SELECT * FROM orders WHERE customer_id = 42
UNION
SELECT * FROM orders WHERE status = 'pending' AND customer_id <> 42;

The second branch excludes rows already matched by the first. UNION removes duplicates automatically.

When This Helps

This rewrite helps when each OR branch can use a different index, when the combined result set is still manageable, and when you have an OR across different columns.

When to Avoid

Avoid it when result sets are large (UNION has overhead), when OR branches return overlapping rows that need deduplication, or when you need deduplication but UNION ALL doesn’t apply.

Avoiding SELECT *

SELECT * is convenient but causes problems in production systems. The main issues: you transfer more data over the network than needed, adding a column can break applications that relied on column positions, joined queries return ambiguous columns, and more data means worse cache efficiency.

The Fix

-- Bad: SELECT *
SELECT * FROM orders WHERE id = 42;

-- Good: explicit columns
SELECT id, customer_id, status, total, created_at FROM orders WHERE id = 42;

When SELECT * Is Fine

It’s fine for ad-hoc queries in your IDE, CTEs that are immediately consumed, and application code that dynamically handles column changes.

Optimizer Hints and Plan Guides

Sometimes the query optimizer makes poor choices. Modern PostgreSQL versions let you nudge it.

How to See What the Optimizer Decides

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT * FROM orders WHERE customer_id = 42;

Setting Row Count Estimates

-- If you know the planner's estimate is way off
SET statistics
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;

Enabling/Disabling Scan Types

-- Force a specific join type
SET enable_hashjoin = off;
SET enable_nestloop = on;

-- Only for debugging, not production

Better Approach: Fix the Statistics

Before using hints, make sure your statistics are accurate:

ANALYZE VERBOSE orders;

If statistics are current and the plan is still bad, then consider hints.

Practical Query Rewrites

Rewrite 1: NOT IN to NOT EXISTS

-- Slow: NOT IN with NULLs can bite you
SELECT * FROM customers
WHERE id NOT IN (SELECT customer_id FROM orders WHERE status = 'cancelled');

-- Better: NOT EXISTS handles NULLs properly
SELECT * FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.status = 'cancelled'
);

Rewrite 2: OR with Functions

-- Original: function on column prevents index use
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';

-- Better: use expression index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

Rewrite 3: Multiple JOINs to CTE

-- Hard to read: deeply nested joins
SELECT ...
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
JOIN categories cat ON p.category_id = cat.id
WHERE cat.name = 'Electronics';

-- Clearer: CTE breaks it down
WITH order_details AS (
    SELECT o.*, c.name as customer_name, p.title as product_title
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    JOIN products p ON o.product_id = p.id
)
SELECT * FROM order_details od
JOIN categories cat ON od.category_id = cat.id
WHERE cat.name = 'Electronics';

Real-World Case Study: Stripe’s Query Optimization

Stripe processes millions of transactions daily and their database team has published extensively about their query optimization work. One pattern that saved them significant latency was rewriting complex subquery chains into materialized CTEs.

Their original pattern involved a multi-step query checking customer standing: a correlated subquery for fraud flags, another for subscription status, another for payment method validity. Each subquery ran per row. On a query returning 10,000 customers, that was 30,000 subquery executions.

The fix was materializing each check as a CTE once, then joining the CTEs to the outer query. Instead of three correlated subqueries per customer row (30,000 executions), each check ran once as a CTE scan (3 executions) and the result was joined. Their p99 latency on the endpoint dropped from 1.2 seconds to 180ms.

The lesson: correlated subqueries look reasonable at small scale. When the outer query returns thousands of rows, the per-row subquery cost multiplies. Materializing with CTEs breaks this exponential cost. Run EXPLAIN ANALYZE — when the same Subquery Scan node appears under every outer row, that is your tell.

Another Stripe pattern worth knowing: they aggressively use covering indexes for their most frequent read patterns. Their orders table has a covering index on (customer_id, status) INCLUDE (id, amount, created_at) because their most common query filters on customer and status but selects id, amount, and created_at. The covering index eliminated heap fetches on 80% of their read traffic.

Common Production Failures

CTE materialization causing unexpected memory spikes: You write a CTE thinking it is just syntactic sugar, but PostgreSQL materializes it when referenced multiple times. A large intermediate result set gets written to disk. The query works fine on small data but OOMs on production volumes. Check with EXPLAIN ANALYZE and look for CTEs marked as “CTE” in the plan.

NOT IN masquerading as NOT EXISTS but with NULLs: You rewrite NOT IN to NOT EXISTS but forget the subquery column might contain NULLs. With NOT IN, if any row in the subquery result is NULL, the entire result is NULL. With NOT EXISTS, NULLs are handled correctly. Always use NOT EXISTS for safety.

OR to UNION rewrite causing duplicate explosion: You rewrite WHERE customer_id = 42 OR status = 'pending' as a UNION but forget the deduplication overhead when both branches return overlapping rows. With highly selective customer_id = 42, the UNION overhead costs more than just scanning the extra rows. Test the actual row counts before assuming the rewrite helps.

CTE named same as a real table: You write WITH orders AS (...) and then write SELECT * FROM orders. PostgreSQL resolves to your CTE, not the table. This is intentional but catches teams off guard. Always use names that won’t collide with existing tables.

Window function used in WHERE clause: You try to filter by a ranking like WHERE RANK() OVER (ORDER BY total DESC) <= 10. This is invalid — window functions are computed after WHERE. Wrap in a subquery: SELECT * FROM (SELECT *, RANK() OVER (...) as rnk) ranked WHERE rnk <= 10.

Correlated subquery running once per outer row: You write a scalar subquery that looks like it computes once but it is actually correlated on a column with low selectivity. 10,000 rows in the outer query means 10,000 subquery executions. Always check EXPLAIN for repeated subquery nodes in the plan.

Interview Questions

Q: A developer writes a query with a correlated subquery and it runs fine in testing but times out in production with 100x more rows. EXPLAIN shows the subquery node repeated 50,000 times. What is happening and how do you fix it?

The subquery is correlated on a column with low selectivity — it runs once per outer row. With 50,000 rows in the outer query, the subquery executes 50,000 times. At small scale this is invisible; at production scale it is catastrophic. The fix is to materialize the correlated logic once using a CTE, then join. For example, if you have SELECT ... FROM orders o WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id AND c.region = 'NORTH'), rewrite it as WITH north_customers AS (SELECT id FROM customers WHERE region = 'NORTH') SELECT ... FROM orders o WHERE EXISTS (SELECT 1 FROM north_customers nc WHERE nc.id = o.customer_id). The CTE is computed once, not per row. Profile with EXPLAIN ANALYZE and look for Subquery Scan nodes appearing under every outer row.

Q: You need to find customers who have never placed an order. You write NOT IN (SELECT customer_id FROM orders) but it returns no results even though you know some customers have never ordered. What is wrong?

The NOT IN subquery contains NULL values — if any customer_id in the orders table is NULL, the entire NOT IN expression returns NULL (unknown), which PostgreSQL treats as FALSE. The correct rewrite is NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id). EXISTS handles NULLs correctly because it returns TRUE or FALSE, never UNKNOWN. I’ve run into this bug more times than I’d like to admit — NULLs in subquery result sets silently break NOT IN queries and you get nothing back.

Q: When does rewriting OR conditions as UNION actually help performance?

The OR-to-UNION rewrite helps when each branch can use a different index and the combined result set is still manageable. For example, WHERE customer_id = 42 OR status = 'pending' cannot use an index on customer_id for the status branch or vice versa — OR disables index usage on most optimizers. Splitting into WHERE customer_id = 42 UNION SELECT ... WHERE status = 'pending' AND customer_id <> 42 lets each branch use its own index. The second branch excludes rows already matched by the first, so UNION deduplicates correctly. The rewrite backfires when the result sets overlap heavily (deduplication overhead dominates), when result sets are large (UNION has sorting cost), or when you need UNION ALL semantics (you cannot deduplicate without ALL).

Q: A teammate insists CTEs are always better than subqueries because “CTEs are materialized.” How do you respond?

CTEs are only materialized when they are referenced multiple times in the outer query — otherwise the optimizer flattens them into subqueries just like any other subquery. The “CTEs are always better” mental model is wrong. A non-materialized CTE is syntactic sugar, nothing more. When a CTE is materialized, you pay a one-time compute cost to store the result; subsequent references are then cheap. Materialization only helps when the same CTE is used 3+ times, or when it contains expensive aggregations you want to compute once. For single-use cases, the materialization cost is pure overhead.

Conclusion

Writing good SQL is about matching the structure to the problem. Joins and subqueries each have their place, and the optimizer handles most of the old debates about which is faster. CTEs excel at making complex logic readable and enabling recursive patterns. EXISTS handles NULLs more safely than IN. UNION rewrites can unlock indexes that OR conditions hide behind. And SELECT * has no place in production application code.

For more on making the database execute your queries efficiently, see Query Execution Plans and Joins and Relationships.

Category

Related Posts

Database Indexes: B-Tree, Hash, Covering, and Beyond

A practical guide to database indexes. Learn when to use B-tree, hash, composite, and partial indexes, understand index maintenance overhead, and avoid common performance traps.

#database #indexes #performance

Understanding SQL JOINs and Database Relationships

Master SQL JOINs with this practical guide covering INNER, LEFT, RIGHT, FULL OUTER, and CROSS joins. Learn how relationship types between tables shape your queries.

#database #joins #sql

Query Execution Plans: Reading EXPLAIN Output Like a Pro

Learn to decode PostgreSQL EXPLAIN output, understand sequential vs index scans, optimize join orders, and compare bitmap heap scans with index-only scans.

#database #query-optimization #explain