Rewrite Playbook

Optimization is usually about reducing rows, columns, shuffles, sorts, and ambiguity as early as semantics allow.

1. Pre-Aggregate Before Joining

Bad when order_items has many rows per order and you only need product totals:

SELECT
	p.category,
	SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi
	ON oi.order_id = o.order_id
JOIN products p
	ON p.product_id = oi.product_id
WHERE o.status = 'paid'
GROUP BY p.category;

This is not automatically wrong, but if orders has many columns or filters are complex, isolate the small paid order keyset first.

WITH paid_orders AS (
	SELECT order_id
	FROM orders
	WHERE status = 'paid'
),
product_revenue AS (
	SELECT
		oi.product_id,
		SUM(oi.quantity * oi.unit_price) AS revenue
	FROM order_items oi
	JOIN paid_orders po
		ON po.order_id = oi.order_id
	GROUP BY oi.product_id
)
SELECT p.category, SUM(pr.revenue) AS revenue
FROM product_revenue pr
JOIN products p
	ON p.product_id = pr.product_id
GROUP BY p.category;

2. Use EXISTS For Existence

Bad if it multiplies customers by paid orders:

SELECT DISTINCT c.customer_id
FROM customers c
JOIN orders o
	ON o.customer_id = c.customer_id
WHERE o.status = 'paid';

Better:

SELECT c.customer_id
FROM customers c
WHERE EXISTS (
	SELECT 1
	FROM orders o
	WHERE o.customer_id = c.customer_id
		AND o.status = 'paid'
);

3. Replace OR With UNION ALL When Selectivity Differs

Sometimes:

WHERE country = 'US' OR acquisition_channel = 'paid_search'

prevents clean access paths. Consider:

SELECT customer_id
FROM customers
WHERE country = 'US'

UNION

SELECT customer_id
FROM customers
WHERE acquisition_channel = 'paid_search';

Use UNION to deduplicate. Use UNION ALL only when overlap is impossible or duplicates are intended.

4. Move Filters Into The Earliest Safe CTE

WITH recent_paid_orders AS (
	SELECT order_id, customer_id, total_amount
	FROM orders
	WHERE status = 'paid'
		AND order_ts >= TIMESTAMP '2026-01-01 00:00:00'
)
SELECT customer_id, SUM(total_amount) AS revenue
FROM recent_paid_orders
GROUP BY customer_id;

The key word is safe. Do not move a filter across an outer join if it changes preserved rows.

5. Avoid Repeated Full Scans

Bad:

SELECT
	(SELECT COUNT(*) FROM orders WHERE status = 'paid') AS paid_orders,
	(SELECT COUNT(*) FROM orders WHERE status = 'cancelled') AS cancelled_orders;

Better:

SELECT
	SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END) AS paid_orders,
	SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled_orders
FROM orders;

6. Use Window Filtering Instead Of Self-Joins

Find latest order per customer.

WITH ranked AS (
	SELECT
		o.*,
		ROW_NUMBER() OVER (
			PARTITION BY customer_id
			ORDER BY order_ts DESC, order_id DESC
		) AS rn
	FROM orders o
)
SELECT *
FROM ranked
WHERE rn = 1;

This is usually clearer than a self-join on max timestamp, and it handles ties explicitly.

7. Materialize Expensive Reused Intermediates

If many downstream queries recompute daily user activity, make it a modeled table:

daily_user_activity(user_id, activity_date, event_count, first_event_ts, last_event_ts)

This shifts cost from repeated ad hoc queries to a tested pipeline with freshness and quality checks.

8. Reduce Columns Early

In column stores and distributed systems, selecting fewer columns can reduce I/O and network.

WITH required_events AS (
	SELECT user_id, event_ts, event_name
	FROM events
	WHERE event_date >= DATE '2026-01-01'
)
SELECT ...

9. Handle Skew Explicitly

If one customer_id or user_id dominates the data, normal hash partitioning can create a hot task.

Possible mitigations:

pre-aggregate by finer keys
isolate hot keys
salt skewed joins
broadcast small side
repartition on a better key
change the data model

10. Know When SQL Is Not Enough

If correctness depends on fuzzy matching, recursive graph traversal, or streaming state with exactly-once semantics, SQL may still be part of the solution, but not the entire system. Say that clearly.

Practice

1. Rewrite a JOIN + DISTINCT that only checks whether a customer has paid.

SELECT c.customer_id
FROM customers c
WHERE EXISTS (
	SELECT 1
	FROM orders o
	WHERE o.customer_id = c.customer_id
		AND o.status = 'paid'
);

This expresses a semijoin and avoids row multiplication.

2. A dashboard has five metrics from the same events table and currently scans it five times. What rewrite do you propose?

Use one filtered scan and conditional aggregation.

WITH base AS (
	SELECT event_name, user_id
	FROM events
	WHERE event_date = DATE '2026-06-01'
)
SELECT
	COUNT(*) AS total_events,
	COUNT(DISTINCT user_id) AS active_users,
	SUM(CASE WHEN event_name = 'signup' THEN 1 ELSE 0 END) AS signups,
	SUM(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS purchases,
	SUM(CASE WHEN event_name = 'refund' THEN 1 ELSE 0 END) AS refunds
FROM base;

3. When is pre-aggregation before join invalid?

When downstream logic needs row-level detail that would be lost, or when the aggregation grain does not preserve the join semantics. Example: pre-aggregating order items by product before joining to order-level customer attributes loses customer allocation.