Index Design Clinic: Composite Indexes, Covering Indexes, and Partial Indexes

Master composite index column ordering, covering indexes for query optimization, partial indexes for partitioned data, expression indexes, and selectivity.

published: reading time: 25 min read author: GeekWorkBench

Index Design Clinic: Composite Indexes, Covering Indexes, and Partial Indexes

Indexes are a double-edged sword. The right index makes queries fly. The wrong one wastes storage, slows down writes, and gives you nothing.

Introduction

Most queries need indexes to perform well, but adding indexes without understanding column order or selectivity is like navigating with a broken map. Composite indexes are the most misunderstood part of database performance — get the column order wrong and your index becomes useless; get it right and one index serves multiple query patterns.

This clinic is about composite index design, covering indexes for index-only scans, partial indexes for selective conditions, and expression indexes for filtered columns. The goal: build an indexing strategy that matches your actual query patterns rather than guessing blindly.

How Composite Index Ordering Affects Queries

flowchart LR
    subgraph Index["Index: (customer_id, status, created_at)"]
    C1["customer_id=1"]
    C2["customer_id=2"]
    C3["customer_id=3"]
    end
    C1 -->|= 1| S1["status=shipped"]
    C1 -->|= 1| S2["status=pending"]
    C2 -->|= 2| S3["status=shipped"]
    C2 -->|= 2| S4["status=cancelled"]
    C3 -->|= 3| S5["status=pending"]
    S1 -->|Range >| R1["created_at > Jan"]
    S3 -->|Range >| R2["created_at > Jan"]

This shows how a composite index on (customer_id, status, created_at) narrows down. Equality on customer_id first, then equality on status, then range on created_at. Skip the leftmost column and the index cannot help at all.

Composite Index Column Ordering

A composite index covers multiple columns. The order matters more than most people realize.

Consider a query:

SELECT * FROM orders
WHERE customer_id = 42
  AND status = 'shipped'
ORDER BY created_at;

The intuitive approach is to index status first if it seems more selective. Do not do this.

The Leftmost Prefix Rule

PostgreSQL can only use a composite index for a range scan or equality check if you’re querying from the leftmost column. Given this index:

CREATE INDEX idx_orders_composite ON orders(customer_id, status, created_at);

The index supports these query patterns:

  • WHERE customer_id = 42
  • WHERE customer_id = 42 AND status = 'shipped'
  • WHERE customer_id = 42 AND status = 'shipped' AND created_at > '2026-01-01'

But it does NOT support:

  • WHERE status = 'shipped' (missing leftmost column)
  • WHERE status = 'shipped' AND created_at > '2026-01-01' (missing leftmost column)

Equality First, Range Last

When you have both equality and range conditions, put equality columns first:

-- Good: equality columns (customer_id, status) before range (created_at)
CREATE INDEX idx_orders_good ON orders(customer_id, status, created_at);

-- Bad: range column in the middle breaks index usage for later columns
CREATE INDEX idx_orders_bad ON orders(customer_id, created_at, status);

Which Column Goes First?

When multiple equality columns exist, put the more selective column first. You want the index to narrow down results as quickly as possible at the first level.

-- If customers have many orders but status filters significantly
-- More selective: customer_id (1 customer vs many statuses)
CREATE INDEX idx_orders ON orders(customer_id, status, created_at);

-- If statuses are roughly equally distributed
-- Consider which column has better statistics in your version of PostgreSQL

Covering Indexes for Query Optimization

A covering index includes all the columns a query needs, so PostgreSQL never has to touch the heap.

-- Without covering index, query requires heap fetch
CREATE INDEX idx_orders_customer ON orders(customer_id);

SELECT id, created_at FROM orders WHERE customer_id = 42;
-- Index scan + heap fetch for id, created_at

-- With covering index, query is index-only
CREATE INDEX idx_orders_customer_covering ON orders(customer_id) INCLUDE (id, created_at);

SELECT id, created_at FROM orders WHERE customer_id = 42;
-- Index Only Scan - no heap access needed

When Covering Indexes Help

Covering indexes shine when your queries return a small number of columns compared to the table width, and when those queries run frequently (the index cost spreads across many executions). They also help when heap access is expensive due to large rows or slow storage.

When Covering Indexes Hurt

They can backfire on wide tables where including columns duplicates significant data. Heavy UPDATE workloads suffer because every index must be updated. And if your queries are already selective enough to use the index efficiently, the covering index adds maintenance cost for little gain.

The INCLUDE clause is PostgreSQL’s way to add columns without making them part of the index key. This avoids the sorting and B-tree maintenance overhead of full index columns.

Partial Indexes for Partitioned Data

A partial index only covers rows that match a condition. This makes them smaller and faster to maintain.

Common Use Cases

Recent Data

-- Only index recent orders, assuming old ones are rarely queried
CREATE INDEX idx_orders_recent ON orders(created_at, customer_id)
WHERE created_at > '2025-01-01';

Active Status

-- Only index active users
CREATE INDEX idx_users_active ON users(email) WHERE status = 'active';

Specific Values

-- Index only failed transactions for debugging
CREATE INDEX idx_transactions_failed
ON transactions(transaction_id, created_at)
WHERE status = 'failed';

Partial Index Trade-offs

The downside: the optimizer might not choose a partial index when the WHERE clause doesn’t match your filter. PostgreSQL has to prove the partial index is relevant.

-- This query uses idx_orders_recent (good)
SELECT * FROM orders WHERE created_at > '2026-02-01';

-- This query probably won't use idx_orders_recent
SELECT * FROM orders WHERE created_at > '2025-06-01';  -- too old

Expression Indexes

Sometimes you need to index a computed value. Expression indexes let you index functions and expressions.

Common Examples

-- Without expression index, LIKE lower() can't use a standard index
SELECT * FROM users WHERE LOWER(email) = LOWER('test@example.com');

-- With expression index
CREATE INDEX idx_users_email_lower ON users(LOWER(email));

SELECT * FROM users WHERE LOWER(email) = LOWER('test@example.com');
-- Now uses the index

Date Truncation

-- If you frequently query by day
CREATE INDEX idx_orders_day ON orders(DATE_TRUNC('day', created_at));

SELECT * FROM orders
WHERE DATE_TRUNC('day', created_at) = '2026-03-26';

JSON Extraction

-- Index a specific JSON field
CREATE INDEX idx_orders_metadata ON orders((metadata->>'customer_type'));

SELECT * FROM orders WHERE metadata->>'customer_type' = 'premium';

Expression Index Considerations

  • The expression is evaluated for every row on INSERT and UPDATE. Complex expressions add overhead.
  • The optimizer has to match the exact expression. Small differences prevent usage.
  • Expression indexes cannot be used for ORDER BY unless the expression is used in the ORDER BY clause.

Index Selectivity

Selectivity is the ratio of distinct index values to total rows. High selectivity (close to 1.0) means each value is nearly unique. Low selectivity means many rows share values.

Low Selectivity Example

-- If 95% of orders are 'pending', indexing status has poor selectivity
CREATE INDEX idx_orders_status ON orders(status);
-- Most queries still have to scan most of the index

High Selectivity Example

-- customer_id might be unique or near-unique
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
-- Each lookup returns very few rows

When Low Selectivity Can Still Help

Low-selectivity indexes sometimes pay off in unexpected ways. If your pending orders are always recent and you mostly query recent pending orders, the index gives you locality even though many rows share the status value. Composite indexes can also rescue poor selectivity: (status, created_at) might be highly selective even if status alone is not. And for sorting, an index on status pre-sorts your results regardless of how many rows share each status value.

Index Selectivity vs. Query Selectivity

The key distinction: index selectivity is about the index itself, but query selectivity is about your specific query. A low-selectivity index on status is useless for SELECT * FROM orders, but if your app always queries WHERE status = 'pending' plus a date range, the index becomes more useful because the date range reduces the result set.

Index Design: When to Use Each Type

Index TypeUse whenAvoid when
B-tree composite (equality first)Queries filter on leading column(s) then range, plus ORDER BYQueries skip the leading column
Covering index with INCLUDEHigh-frequency queries selecting few columns, table is wideWrites dominate reads, or query selects many columns
Partial indexColumn has skewed distribution, only a subset is queriedMost queries hit the excluded rows, or condition changes frequently
Expression indexQuery uses function on column, cannot use regular indexExpression is expensive to compute, changes frequently
Partial coveringHigh-frequency query on subset of rows with small SELECTSubset is not stable, or query Selects many columns

Common Indexing Mistakes

-- Don't do this
CREATE INDEX idx_orders_everything ON orders(id, customer_id, status, created_at, updated_at, total, ...);

This bloats your index size and slows down inserts. Only index columns you actually filter or sort by.

Ignoring the query pattern

-- Query: WHERE status = 'shipped' AND customer_id = 42
CREATE INDEX idx_wrong ON orders(status, customer_id);

This order is wrong even though it technically works due to leftmost prefix. Put the more selective column first.

Not considering write overhead

Every index adds write overhead. A table with 5 indexes takes longer to UPDATE than a table with 2. Profile your workload before adding indexes.

Assuming the planner knows best

PostgreSQL’s planner doesn’t always pick the optimal index. Check with EXPLAIN and test queries directly.

Common Production Failures

Composite index with wrong column order: You create (status, customer_id) when queries filter by customer_id first. The index is useless for those queries. Always index based on your actual query filter order, not some theoretical selectivity ranking.

Expression index desynchronized from queries: You create an expression index on LOWER(email) but queries sometimes use lower(email) with different capitalization, or the application sends email = 'User@Example.com'. Expression indexes must match the query expression exactly. Audit all code paths that query the column.

Partial index not matching the query filter: You create a partial index for WHERE status = 'active' but queries use WHERE status = 'ACTIVE' (uppercase) or WHERE status != 'inactive'. The partial index is ignored. Be consistent with the values your application actually uses.

Covering index missing frequently selected columns: You include id and created_at in the covering index but the query also selects status. The heap is still touched. Verify the covering index covers the entire SELECT list before assuming it eliminates heap access.

Expression index bloating on heavy UPDATE: The expression LOWER(email) must be recomputed on every INSERT and UPDATE to every row. On a table with 100M rows and heavy write volume, the maintenance cost outweighs the query benefit. Measure write overhead before deploying expression indexes at scale.

Capacity Estimation: Composite Index Sizing

Composite indexes grow faster than single-column indexes. Here is how to estimate the cost.

A composite index on (customer_id, status, created_at) where customer_id is INT (4 bytes), status is VARCHAR(20) in the index (up to 20 bytes), and created_at is TIMESTAMP (8 bytes) totals roughly 32 bytes per index entry. For 100 million rows, that is 3.2 GB for the leaf level alone. Add the internal B-tree node overhead of roughly 5–10% and you are near 3.5 GB. A single-column index on just customer_id would be roughly 1.2 GB for the same rows.

Every INSERT to the orders table must now update all three columns in the composite index. If you also have individual indexes on each column, a single INSERT could trigger four separate index updates. On a write-heavy workload inserting 50,000 rows per minute, the index maintenance becomes measurable CPU overhead. Use pg_stat_user_indexes to verify the index is actually used before accepting that write cost.

PostgreSQL estimates plan costs using statistics on the leading column. If the leading column has high cardinality (many distinct values), the planner estimates narrow result sets and prefers index scans. If statistics are stale on the leading column, the planner may assume a composite index is more selective than it actually is and choose it when a simpler index would be faster.

If a composite index is not reducing the result set significantly over the leading column alone, the composite is not earning its storage and write cost. Use EXPLAIN to compare plans with and without the composite index.

Quick Recap Checklist

  • Put equality columns before range columns in composite indexes
  • Leftmost prefix rule: skipping the leading column breaks index usage
  • Covering indexes with INCLUDE eliminate heap fetches but add write overhead
  • Partial indexes only help when the WHERE clause matches the index predicate
  • Expression indexes must match the query expression exactly
  • Low selectivity indexes can still help for sorting and locality
  • Run ANALYZE after bulk loads to keep statistics current
  • VACUUM keeps visibility maps fresh for index-only scan efficiency
  • Compare estimated rows vs actual rows to detect stale statistics
  • Index-only scan requires all columns in SELECT plus current visibility map

Interview Questions

1. You are designing an index for a query that filters by status, customer_id, and created_at. The application frequently changes — sometimes it queries by status alone, sometimes by customer_id alone, sometimes by all three. How do you approach index design?

The hard constraint is that PostgreSQL needs the leading column to be present for the index to help. If the application queries all three columns together, a composite index on (status, customer_id, created_at) works for those cases. But if queries also filter by customer_id alone, you need a separate index on (customer_id) because a composite index with status as the leading column cannot help when status is omitted. The practical answer is to profile the actual query mix, identify the most frequent combination, and start with a composite index on that combination. Add secondary single-column indexes only when profiling shows the planner is making bad choices without them.

2. A partial covering index sounds ideal for a query on recent pending orders. When would it backfire?

Partial covering indexes backfire when the predicate is not stable. If status = 'pending' represents 5% of rows today but grows to 40% after a product launch, the partial index no longer reflects actual query behavior and queries slow down unexpectedly. The partial index also requires the planner to prove the query matches the index predicate. If your ORM generates queries with different literal casing or slightly different date ranges, the partial index silently gets ignored. And on tables with high INSERT rates, the partial index offers no write savings — new rows are simply excluded from the index, which is correct behavior but does not reduce write amplification unless the WHERE clause matches on a stable, infrequent value.

3. How do expression indexes behave differently from regular indexes in terms of maintenance and planner usage?

Expression indexes must be updated on every INSERT and UPDATE even when the expression value does not change. If you index LOWER(email) and a user updates their address, the expression gets recomputed even though email changed in a case-insensitive way. On large tables with frequent writes, this adds up. For the planner, the expression must match exactly — WHERE LOWER(email) = 'test' uses the index but WHERE email ILIKE 'test' does not. This means expression indexes are fragile when application code changes how it formats queries. Regular indexes on columns are immune to this because the planner can use them for any operator that works on the column type.

4. You are reviewing a schema with 12 indexes on a table that handles 30,000 INSERTs per minute. The application team says reads are slow. What do you check?

Start with pg_stat_user_indexes to see which indexes actually get scanned and which get updated. Indexes with zero scans but constant idx_scan counts are pure write overhead. 12 indexes on 30,000 INSERTs per minute means roughly 360,000 index updates per minute — at that write volume, index overhead is measurable. Check whether some indexes are redundant: a composite index on (a, b) makes a separate index on (a) redundant for most queries. Check whether covering indexes are in use — if queries are doing index scans plus heap fetches, the covering index might eliminate the heap access without needing a separate query optimization.

5. You add a composite index on (customer_id, status, created_at) but EXPLAIN shows the planner is doing a sequential scan anyway. The query filters on customer_id=42 and status='shipped'. What could be causing this?

Several things: first, check if statistics are stale — the planner might estimate that the filtered result set is large (say, 80% of the table) and decide a sequential scan is faster. Run ANALYZE on the table. Second, check work_mem — if the intermediate result set is large and work_mem is small, the planner might avoid the index because it would require too many index-only scans. Third, check whether the index is actually smaller than the table — on small tables, sequential scans are always faster. Fourth, verify the index exists and is not disabled via enable_seqscan=off or similar. Finally, compare estimated rows vs actual rows — if they differ by orders of magnitude, the statistics are stale.

6. You create a covering index with INCLUDE (id, created_at) but the query still does an index scan, not an index-only scan. The SELECT is SELECT id, created_at FROM orders WHERE customer_id = 42. Why isn't index-only scan being chosen?

Index-only scans require a current visibility map. If the visibility map is stale, PostgreSQL cannot guarantee the rows are visible without checking the heap. Run VACUUM on the table to refresh the visibility map. On tables with heavy UPDATE or DELETE traffic, the visibility map can become stale quickly — the planner falls back to a regular index scan rather than risk returning invisible rows. If VACUUM does not help, the table might have too many dead tuples preventing a fresh visibility map. The other possibility is that the visibility map coverage is poor — if most pages have visible rows, PostgreSQL still has to check the heap for pages where visibility is unknown.

7. A query has WHERE LOWER(email) = $1 and you create an expression index on LOWER(email). The ORM generates queries with ILIKE instead of =. Does the expression index help?

No. Expression indexes match the exact expression — WHERE LOWER(email) = 'test' uses the index, but WHERE email ILIKE 'test' does not. ILIKE with a leading wildcard (%test%) cannot use any index anyway. Even ILIKE 'test%' (without leading wildcard) would not use a B-tree index on LOWER(email) because ILIKE is case-insensitive but the index stores lowercase values — PostgreSQL cannot convert ILIKE to a simple index lookup. The fix is to audit the application for consistent query patterns and either fix the ORM-generated code or document the exact query shape the expression index supports.

8. You have a high-selectivity composite index that seems correct for the query, but the planner ignores it in favor of a sequential scan. You run ANALYZE and the statistics are current. What configuration setting might be causing this?

Check enable_indexscan, enable_bitmapscan, and enable_seqscan — if index scanning is disabled, the planner falls back to sequential scan. Check random_page_cost — if it is set too high relative to seq_page_cost, the planner assumes random I/O is expensive and prefers sequential scans even for selective queries. On SSDs, set random_page_cost = 1.1 (or lower) instead of the default 4.0. Check effective_cache_size — if it is set too low, the planner assumes the working set does not fit in memory and overestimates the cost of index operations. Finally, check constraint_exclusion if you are using partitioning — misconfiguration there can cause the planner to ignore indexes on child tables.

9. You have a composite index on (a, b, c) and query WHERE a = AND c = . The query does not use the index efficiently for the c filter. Why?

The leftmost prefix rule — the index can use columns from left to right without gaps, but skipping b breaks the index for c. With WHERE a = AND c = , PostgreSQL uses the index for the equality on a, but c is evaluated as a filter after the index returns rows matching a. The index cannot help c directly because b is the middle column and was not in the query. The fix is either to add b to the query (if possible), or create a separate index on (a, c) if c filtering is important enough.

10. A partial index is not being used even though the WHERE clause matches. What are the most likely causes?

Three common causes: first, the partial index predicate does not match the query — if you created WHERE status = 'active' but the query uses WHERE status = 'ACTIVE' (different case), the index is ignored. Second, the query planner may estimate that the partial index is not selective enough to be useful — if the predicate matches 40% of rows, a sequential scan may be faster. Third, statistics may not be gathered on the partial index since partial indexes are not auto-analyzed. Run ANALYZE on the table and check the pg_indexes view to confirm the index exists and the predicate matches exactly.

11. You create an expression index on LOWER(email) but the query uses LOWER(email) = 'test' but the index is not used. What do you check?

The expression must match exactly, character for character. Check for trailing spaces, different function variants (e.g., lower() vs LOWER() — PostgreSQL normalizes function names), or invisible characters in the query. Also verify the index exists and is not disabled. Run EXPLAIN to see whether the index is being considered. If the ORM generates different SQL (e.g., using ILIKE instead of =), the index will not be used because ILIKE on an expression index is not supported. Audit the actual query text being generated versus what you expect.

12. When should you use a BRIN index instead of a B-tree index?

BRIN (Block Range Index) indexes are designed for naturally ordered data where the physical order of rows correlates with the indexed value — typically timestamp-based data where recent rows are appended at the end. BRIN stores the min/max values for a range of pages rather than individual rows, making it extremely small. Use BRIN for time-series tables where you query recent data often (e.g., WHERE created_at > '2026-01-01'), or for append-only tables with monotonic insertion. Do not use BRIN for high-cardinality columns, data that is frequently updated in-place, or queries that need random access patterns.

13. What is index bloat and how does it affect query performance? How do you detect and fix it?

Index bloat accumulates when pages in the index become dead due to UPDATEs and DELETEs — the index entries pointing to dead row versions are not immediately removed. Bloat increases index size beyond what is necessary and causes extra I/O during index scans (more pages to read). Detect bloat with pgstattuple extension or by comparing pg_relation_size vs the expected size based on row count. Fix bloat with REINDEX (blocks concurrent access) or REINDEX CONCURRENTLY (PostgreSQL 13+, non-blocking but slower). Regular autovacuum helps prevent bloat from accumulating.

14. A query filtering on a range of dates runs faster after you add an index on the date column. But a query filtering on EXTRACT(YEAR FROM created_at) = 2026 does not use the index. Why?

The range query WHERE created_at > '2026-01-01' uses the B-tree index directly because the comparison operators match the index. EXTRACT(YEAR FROM created_at) is a function applied to the column — a standard B-tree index on created_at cannot be used because the indexed values are not being compared directly. The fix is to create an expression index: CREATE INDEX idx_year ON orders(EXTRACT(YEAR FROM created_at)). Note that expression indexes only help when the expression matches exactly, including the EXTRACT syntax.

15. What is the difference between a covering index with INCLUDE and a composite index for making an index-only scan possible?

A composite index (a, b, c) includes all columns as index keys — the B-tree stores all three values in every index entry, sorted by a then b then c. A covering index with INCLUDE (a) INCLUDE (b, c) makes only the first column part of the index key while b and c are appended without being sorted. The INCLUDE approach is better when you want to avoid sorting overhead for non-key columns or when you want a smaller index (included columns are not used for sorting or equality lookups). For index-only scans, INCLUDE lets you add frequently-selected columns without bloating the index key structure.

16. You have a table with 50 indexes on it. What are the concrete write performance costs?

Every INSERT must update all 50 indexes — that is 50 index entries written per row. Every UPDATE must update all indexes where any indexed column changed. On a table with heavy write volume, index overhead is measurable: if each index entry takes 50 microseconds to write, 50 indexes means 2.5ms overhead per INSERT just for indexing. For 10,000 INSERTs per minute, that is 25 seconds of index overhead per minute. Use pg_stat_user_indexes to identify unused indexes and drop them. Also consider whether composite indexes can replace multiple single-column indexes.

17. How do you decide the order of columns in a composite index when both columns have equality conditions?

When both columns have equality conditions, put the more selective (higher cardinality) column first. The index narrows down faster at the first level, reducing the number of rows the second level must process. However, also consider query patterns — if you frequently query by column A alone but rarely query by column B alone, put A first regardless of selectivity. If you query both equally but also filter by A alone frequently, A-first is still correct. Test with EXPLAIN comparing both orderings and check estimated rows at each index level.

18. What happens to your partial covering index if the predicate column is updated to no longer match the partial index condition?

The index entry is removed when the row no longer matches the partial index predicate — the row is no longer part of the indexed subset. For example, with WHERE status = 'active', if you update status to 'inactive', that row's entry is removed from the index. This is correct behavior and means the partial index stays accurate. The row still exists in the table but is not in the index. Queries that rely on the partial index for data retrieval will simply not find those updated rows, which is the intended behavior.

19. Can PostgreSQL use multiple indexes simultaneously for a single query? Under what conditions?

PostgreSQL can combine multiple indexes using bitmap index scans. It builds a bitmap of matching row pointers from each index, then intersects (AND) or unions (OR) the bitmaps based on the WHERE clause. This works when you have WHERE a = 1 AND b = 2 with separate indexes on a and b. Bitmap scan is chosen when neither index is selective enough alone but together they narrow down results efficiently. It is efficient for moderately selective queries but can be slower than a single index scan for highly selective queries.

20. An index exists on (created_at DESC) but a query with ORDER BY created_at DESC is doing a sort instead of using the index. Why?

B-tree indexes store values in sorted order and can be scanned forward or backward — the index on (created_at DESC) can satisfy both ORDER BY created_at DESC and ORDER BY created_at ASC. If the query is doing a sort instead, it is likely because the index is not selective enough to be useful (the planner estimates a sequential scan with sort is faster), or the LIMIT is small relative to the overall table making index scan less beneficial. Check enable_indexscan and verify the planner is considering the index. Also confirm the statistics are current with ANALYZE.

Further Reading

Conclusion

Good index design comes down to understanding your query patterns. Composite index order follows the leftmost prefix rule: equality columns first, range columns last. Covering indexes eliminate heap fetches for frequently-run queries. Partial indexes keep index size manageable when you only need a subset of data. Expression indexes handle computed values but come with write overhead.

Category

Related Posts

Database Indexes: B-Tree, Hash, Covering, and Beyond

A practical guide to database indexes. Learn when to use B-tree, hash, composite, and partial indexes, understand index maintenance overhead, and avoid common performance traps.

#database #indexes #performance

Denormalization

When to intentionally duplicate data for read performance. Tradeoffs with normalization, update anomalies, and application-level denormalization strategies.

#database #denormalization #performance

Query Execution Plans: Reading EXPLAIN Output Like a Pro

Learn to decode PostgreSQL EXPLAIN output, understand sequential vs index scans, optimize join orders, and compare bitmap heap scans with index-only scans.

#database #query-optimization #explain