Joins and Aggregations: SQL Patterns for Data Analysis

Master SQL joins and aggregation techniques for building efficient analytical queries in data warehouses and analytical databases.

published: reading time: 13 min read

Joins and Aggregations: SQL Patterns for Data Analysis

SQL is the lingua franca of data analysis. Whether you are querying a data warehouse, building a report, or exploring datasets, SQL gets the work done. But SQL queries can get complicated fast, especially when you start combining multiple tables and rolling up data.

This post covers the essential patterns for joins and aggregations that data engineers and analysts use daily. Understanding these patterns deeply will make you faster at writing queries and better at diagnosing issues when they arise.

Understanding Joins

A join combines rows from two or more tables based on a related column. The key insight is that joins are about relationships, and understanding the relationship between tables determines which join type to use.

flowchart LR
    subgraph A["Table A (customers)"]
        A1["id=1 Alice"]
        A2["id=2 Bob"]
        A3["id=3 Carol"]
    end
    subgraph B["Table B (orders)"]
        B1["id=1 Alice → order 101"]
        B2["id=1 Alice → order 102"]
        B3["id=4 Dave (no match)"]
    end
    subgraph IJ["INNER JOIN"]
        IJ1["Alice order 101"]
        IJ2["Alice order 102"]
    end
    subgraph LJ["LEFT JOIN"]
        LJ1["Alice order 101"]
        LJ2["Alice order 102"]
        LJ3["Bob → NULL"]
        LJ4["Carol → NULL"]
    end
    subgraph OJ["FULL OUTER JOIN"]
        OJ1["Alice order 101"]
        OJ2["Alice order 102"]
        OJ3["Bob → NULL"]
        OJ4["Carol → NULL"]
        OJ5["NULL → Dave"]
    end
    A1 --> IJ1
    B1 --> IJ1
    A1 --> LJ1
    B1 --> LJ1
    A2 --> LJ3
    A3 --> LJ4
    A1 --> OJ1
    B1 --> OJ1
    A2 --> OJ3
    A3 --> OJ4
    B3 --> OJ5

Inner Join

An inner join returns only rows that have matches in both tables. If a customer has no orders, they do not appear in the result.

SELECT
    c.customer_id,
    c.customer_name,
    o.order_id,
    o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;

This query returns customers who have placed orders. The customer_id exists in both tables.

Left Join

A left join returns all rows from the left table, with NULL values for the right table when there is no match.

SELECT
    c.customer_id,
    c.customer_name,
    o.order_id,
    o.total_amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;

Now every customer appears, even those who have never placed an order. Their order_id and total_amount are NULL.

This is useful for finding customers who have not taken a specific action. The NULL values flag the non-matching rows.

Right Join

A right join returns all rows from the right table. This is less commonly used and often indicates a schema design worth reconsidering.

-- Every order appears, even if the customer is somehow missing
SELECT
    c.customer_id,
    c.customer_name,
    o.order_id,
    o.total_amount
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id;

Full Outer Join

A full outer join returns all rows from both tables, with NULLs where there is no match.

SELECT
    c.customer_id,
    c.customer_name,
    o.order_id,
    o.total_amount
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;

This shows all customers and all orders, with NULLs where there is no relationship. Useful for data quality checks to find orphaned records.

Cross Join

A cross join produces a Cartesian product: every row from the first table joined with every row from the second table.

SELECT
    c.customer_name,
    p.product_name
FROM customers c
CROSS JOIN products p;

If you have 100 customers and 50 products, you get 5,000 rows. Use this carefully. Cross joins are sometimes useful for generating combinations, like all date/product pairs for a reporting skeleton.

Multi-Table Joins

Real queries often involve more than two tables. The join order matters for both correctness and performance.

The Logical Flow

When you join three tables, you are joining two tables first, then joining the result to the third. SQL semantics do not guarantee which pair joins first, but the result is the same regardless.

SELECT
    o.order_id,
    c.customer_name,
    p.product_name,
    oi.quantity,
    oi.unit_price
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= '2026-01-01';

This query joins orders to customers, then to order items, then to products. Each join narrows the result set.

Aliases for Readability

Long table names get tedious. Use aliases.

SELECT
    o.order_id,
    c.customer_name,
    p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id

Aliases like o, c, and p are short but meaningful once you are familiar with the schema.

Understanding Aggregations

Aggregations summarize data across multiple rows. They transform rows into summary values.

Basic Aggregations

SELECT
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_revenue,
    AVG(total_amount) AS average_order_value,
    MIN(order_date) AS first_order_date,
    MAX(order_date) AS last_order_date
FROM orders;

These scalar aggregations return a single row with summary statistics.

GROUP BY

GROUP BY creates groups and returns one row per group.

SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(total_amount) AS lifetime_value
FROM orders
GROUP BY customer_id;

One row per customer, with aggregated order statistics.

Multiple Columns in GROUP BY

Group by as many columns as you need.

SELECT
    customer_id,
    DATE_TRUNC('month', order_date) AS order_month,
    COUNT(*) AS order_count
FROM orders
GROUP BY customer_id, DATE_TRUNC('month', order_date)
ORDER BY customer_id, order_month;

This gives you monthly order counts per customer.

Filtering Aggregates with HAVING

WHERE filters rows before aggregation. HAVING filters groups after aggregation.

SELECT
    customer_id,
    SUM(total_amount) AS lifetime_value
FROM orders
GROUP BY customer_id
HAVING SUM(total_amount) > 10000;

This finds high-value customers. Only customers with more than $10,000 in total orders appear.

Combining Joins and Aggregations

The real power comes from joining tables, then aggregating the result.

SELECT
    c.customer_name,
    c.customer_region,
    SUM(o.total_amount) AS total_revenue,
    COUNT(o.order_id) AS order_count,
    AVG(o.total_amount) AS avg_order_value
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2026-01-01'
GROUP BY c.customer_name, c.customer_region
ORDER BY total_revenue DESC
LIMIT 20;

This returns the top 20 customers by revenue, with their region and order statistics.

Window Functions

Window functions perform calculations across a set of rows related to the current row. Unlike aggregate functions, they do not collapse rows.

Running Totals

SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM daily_sales;

The running total sums all revenue from the beginning through the current row.

Rank and Row Number

SELECT
    customer_name,
    lifetime_value,
    ROW_NUMBER() OVER (ORDER BY lifetime_value DESC) AS row_num,
    RANK() OVER (ORDER BY lifetime_value DESC) AS rank,
    DENSE_RANK() OVER (ORDER BY lifetime_value DESC) AS dense_rank
FROM customer_lifetime_values;
  • ROW_NUMBER: sequential numbering with no ties
  • RANK: same rank for ties, with gaps
  • DENSE_RANK: same rank for ties, without gaps

Partition By

PARTITION BY restarts the window for each group.

SELECT
    customer_name,
    order_month,
    monthly_revenue,
    SUM(monthly_revenue) OVER (
        PARTITION BY customer_name
        ORDER BY order_month
    ) AS cumulative_revenue_per_customer
FROM customer_monthly_revenue;

The running total restarts for each customer.

Common Analytical Patterns

Year-over-Year Comparison

SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS current_month_revenue,
    LAG(SUM(total_amount)) OVER (ORDER BY DATE_TRUNC('month', order_date)) AS previous_month_revenue,
    SUM(total_amount) - LAG(SUM(total_amount)) OVER (ORDER BY DATE_TRUNC('month', order_date)) AS month_over_month_change
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

Percentage of Total

SELECT
    product_category,
    SUM(revenue) AS category_revenue,
    SUM(SUM(revenue)) OVER () AS total_revenue,
    SUM(revenue) * 100.0 / SUM(SUM(revenue)) OVER () AS percent_of_total
FROM product_sales
GROUP BY product_category
ORDER BY percent_of_total DESC;

The nested SUM aggregate computes the total, and the window function makes it available to every row.

Moving Averages

SELECT
    order_date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS seven_day_moving_average
FROM daily_sales;

This computes a 7-day moving average, including the current day and the 6 preceding days.

Join Performance Considerations

Joins can be expensive. A few principles help keep queries fast.

Join on Indexed Columns

Ensure join keys have indexes in both tables. This affects join performance dramatically.

-- Index on foreign key columns
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_order_items_order_id ON order_items(order_id);

Filter Early

Apply WHERE clauses before joins when possible. This reduces the number of rows that need to be joined.

-- Better: filter before join
SELECT c.customer_name, SUM(o.total_amount)
FROM customers c
JOIN (SELECT * FROM orders WHERE order_date >= '2026-01-01') o
    ON c.customer_id = o.customer_id
GROUP BY c.customer_name;

-- Filters all orders first, then joins

Beware of Cross Joins

A cross join with large tables produces enormous result sets. Always verify your join conditions before running.

-- This is almost certainly wrong if both tables have millions of rows
SELECT * FROM customers CROSS JOIN orders;

Common Pitfalls

Ambiguous Column Names

When two tables have columns with the same name, qualify them with table aliases.

-- WRONG if both tables have customer_id
SELECT customer_id FROM customers c JOIN orders o ON c.customer_id = o.customer_id;

-- RIGHT: specify which table
SELECT c.customer_id FROM customers c JOIN orders o ON c.customer_id = o.customer_id;

NULL in Join Conditions

NULL does not equal NULL in SQL. A join on NULL values returns no matches.

-- This may not match rows where region is NULL
SELECT * FROM customers c JOIN regions r ON c.region = r.region;

-- Handle NULLs explicitly if needed
SELECT * FROM customers c
JOIN regions r ON COALESCE(c.region, 'Unknown') = COALESCE(r.region, 'Unknown');

Aggregation Without GROUP BY

-- WRONG: will fail or return unexpected results
SELECT customer_name, COUNT(*) FROM orders;

-- RIGHT: include all non-aggregated columns in GROUP BY
SELECT customer_name, COUNT(*) FROM orders GROUP BY customer_name;

Modern SQL modes like ONLY_FULL_GROUP_BY reject queries where non-aggregated columns are not in GROUP BY.

When to Use Each Join Type

Join TypeWhen to UseKey Risk
INNER JOINYou only want matching rows from both tables; drop rows with no matchSilent data loss if one side has unexpected NULLs or missing keys
LEFT JOINPreserve all left table rows; right table is supplementary contextRight side NULLs can confuse analysts; use COALESCE for display
RIGHT JOINAlmost never—rewrite as LEFT JOIN with table order reversedHarder to read and maintain
FULL OUTER JOINUnion of both tables when rows may exist in only one sideResult can be large; NULLs complicate downstream aggregation
CROSS JOINCartesian product (rarely intentional—test data, grid combinations)Always verify intent; check row count before running on real tables
SELF JOINHierarchies (employee-manager), same-table comparisonsComplex aliasing; easy to accidentally cross-join

SQL Observability Hooks

Track these metrics to diagnose query performance issues:

-- Identify large scans driving slow joins
SELECT
    schemaname,
    tablename,
    seq_scan,
    seq_tup_read,
    idx_scan,
    idx_tup_fetch,
    n_tup_ins,
    n_tup_upd,
    n_tup_del
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;

-- Find missing indexes on foreign keys (tables with high seq_scan but no index on FK column)
SELECT
    t.tablename,
    t.seq_scan,
    i.indexname,
    i.indexdef
FROM pg_stat_user_tables t
LEFT JOIN pg_indexes i ON t.schemaname = i.schemaname
    AND t.tablename = i.tablename
WHERE t.seq_scan > 1000
ORDER BY t.seq_scan DESC;

Log every join query: tables involved, row counts before and after join, execution time. Alert on: queries returning >10x expected rows (possible Cartesian product), missing index warnings, NULL join key counts > 0.

SQL Production Failure Scenarios

Cartesian product explosion

A query joins customers (10,000 rows) to orders (100,000 rows) without a join condition. The result is 1 billion rows. The query runs for 30 minutes consuming memory until the database kills it or returns a disk-full error. The analyst intended WHERE customers.id = orders.customer_id but accidentally used a comma-separated FROM clause.

Mitigation: Always use explicit JOIN syntax with ON conditions. Enable sql_require_full_join_condition in strict SQL modes. Before running ad-hoc joins, estimate result size: COUNT(*) without SELECT columns first.

NULL join key causing silent row drops

Two tables both have region as a VARCHAR column used in the join. Rows where region IS NULL in either table never match because NULL = NULL returns NULL in SQL. A report of orders by region silently drops all orders from regions not yet assigned.

Mitigation: Use COALESCE(column, 'UNKNOWN') or NULLIF(column, '') on join keys to normalize empty strings and NULLs. Add a check query SELECT COUNT(*) FROM t1 WHERE join_key IS NULL before joining and alert if counts are non-zero.

Missing index causing sequential scan on large table

A fact table with 500 million rows has no index on customer_id. A LEFT JOIN to dim_customer (10,000 rows) triggers a sequential scan of the entire fact table. The query takes 45 minutes instead of 5 seconds.

Mitigation: Create indexes on all foreign key columns. Profile query plans with EXPLAIN ANALYZE before deploying. Use pg_stat_user_indexes to find unused indexes and pg_stat_user_tables to find high sequential scan tables.

FULL OUTER JOIN producing unexpected NULLs downstream

A FULL OUTER JOIN between fact_sales and fact_returns produces rows with NULLs on both sides. An analyst runs SUM(amount) without handling NULLs—the NULL rows are silently dropped, producing lower totals than expected.

Mitigation: Use COALESCE(column, 0) around all columns from FULL OUTER JOIN results. Add a validation query comparing row counts and sums to expected ranges before surfacing results to end users.

SQL Joins and Aggregations Quick Recap

  • INNER JOIN: only matching rows from both tables; watch for silent data loss.
  • LEFT JOIN: preserve all left table rows; use COALESCE to handle NULLs from non-matching right side.
  • RIGHT JOIN: rewrite as LEFT JOIN with table order flipped—never use RIGHT JOIN.
  • CROSS JOIN: accidental Cartesian products destroy query performance; always verify join conditions.
  • FILTER before JOIN when possible to reduce rows that need joining.
  • Window functions (ROW_NUMBER, SUM OVER) let you aggregate without collapsing rows.
  • Index foreign key columns; profile query plans with EXPLAIN ANALYZE before production.
  • NULL in join keys never matches—normalize with COALESCE before joining.

For deeper dives into the data modeling that sits behind these queries, see Kimball Dimensional Modeling for star schema design, or Slowly Changing Dimensions for handling attribute changes over time.

Joins and aggregations are the foundation of analytical SQL. The key concepts:

  • INNER JOIN returns matching rows only
  • LEFT JOIN preserves left table rows with NULLs for non-matches
  • FULL OUTER JOIN shows all rows from both tables
  • GROUP BY creates groups for aggregate calculations
  • WHERE filters before aggregation; HAVING filters after
  • Window functions perform calculations across related rows without collapsing them

These patterns combine to answer complex analytical questions. A typical business intelligence query joins dimension tables to fact tables, filters by date range, groups by business entities, and computes summary statistics.

For deeper dives into the data modeling that sits behind these queries, see Kimball Dimensional Modeling for star schema design, or Slowly Changing Dimensions for handling attribute changes over time.

Category

Related Posts

DuckDB: The SQLite for Analytical Workloads

Explore DuckDB, the in-process analytical database that runs anywhere, handles columnar storage efficiently, and brings analytics to where your data lives.

#data-engineering #duckdb #olap

Data Warehousing

OLAP vs OLTP comparison. Star and snowflake schemas, fact and dimension tables, slowly changing dimensions, and columnar storage in data warehouses.

#database #data-warehouse #olap

Data Vault: Scalable Enterprise Data Modeling

Learn Data Vault modeling methodology for building auditable, scalable enterprise data warehouses with hash keys and satellite tables.

#data-engineering #data-modeling #data-vault