Rewrite Playbook
Optimization is usually about reducing rows, columns, shuffles, sorts, and ambiguity as early as semantics allow.
1. Pre-Aggregate Before Joining
Section titled “1. Pre-Aggregate Before Joining”Bad when order_items has many rows per order and you only need product totals:
SELECT
p.category,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM orders o
JOIN order_items oi
ON oi.order_id = o.order_id
JOIN products p
ON p.product_id = oi.product_id
WHERE o.status = 'paid'
GROUP BY p.category;
This is not automatically wrong, but if orders has many columns or filters are complex, isolate the small paid order keyset first.
WITH paid_orders AS (
SELECT order_id
FROM orders
WHERE status = 'paid'
),
product_revenue AS (
SELECT
oi.product_id,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN paid_orders po
ON po.order_id = oi.order_id
GROUP BY oi.product_id
)
SELECT p.category, SUM(pr.revenue) AS revenue
FROM product_revenue pr
JOIN products p
ON p.product_id = pr.product_id
GROUP BY p.category;
2. Use EXISTS For Existence
Section titled “2. Use EXISTS For Existence”Bad if it multiplies customers by paid orders:
SELECT DISTINCT c.customer_id
FROM customers c
JOIN orders o
ON o.customer_id = c.customer_id
WHERE o.status = 'paid';
Better:
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.status = 'paid'
);
3. Replace OR With UNION ALL When Selectivity Differs
Section titled “3. Replace OR With UNION ALL When Selectivity Differs”Sometimes:
WHERE country = 'US' OR acquisition_channel = 'paid_search'
prevents clean access paths. Consider:
SELECT customer_id
FROM customers
WHERE country = 'US'
UNION
SELECT customer_id
FROM customers
WHERE acquisition_channel = 'paid_search';
Use UNION to deduplicate. Use UNION ALL only when overlap is impossible or duplicates are intended.
4. Move Filters Into The Earliest Safe CTE
Section titled “4. Move Filters Into The Earliest Safe CTE”WITH recent_paid_orders AS (
SELECT order_id, customer_id, total_amount
FROM orders
WHERE status = 'paid'
AND order_ts >= TIMESTAMP '2026-01-01 00:00:00'
)
SELECT customer_id, SUM(total_amount) AS revenue
FROM recent_paid_orders
GROUP BY customer_id;
The key word is safe. Do not move a filter across an outer join if it changes preserved rows.
5. Avoid Repeated Full Scans
Section titled “5. Avoid Repeated Full Scans”Bad:
SELECT
(SELECT COUNT(*) FROM orders WHERE status = 'paid') AS paid_orders,
(SELECT COUNT(*) FROM orders WHERE status = 'cancelled') AS cancelled_orders;
Better:
SELECT
SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END) AS paid_orders,
SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled_orders
FROM orders;
6. Use Window Filtering Instead Of Self-Joins
Section titled “6. Use Window Filtering Instead Of Self-Joins”Find latest order per customer.
WITH ranked AS (
SELECT
o.*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_ts DESC, order_id DESC
) AS rn
FROM orders o
)
SELECT *
FROM ranked
WHERE rn = 1;
This is usually clearer than a self-join on max timestamp, and it handles ties explicitly.
7. Materialize Expensive Reused Intermediates
Section titled “7. Materialize Expensive Reused Intermediates”If many downstream queries recompute daily user activity, make it a modeled table:
daily_user_activity(user_id, activity_date, event_count, first_event_ts, last_event_ts)
This shifts cost from repeated ad hoc queries to a tested pipeline with freshness and quality checks.
8. Reduce Columns Early
Section titled “8. Reduce Columns Early”In column stores and distributed systems, selecting fewer columns can reduce I/O and network.
WITH required_events AS (
SELECT user_id, event_ts, event_name
FROM events
WHERE event_date >= DATE '2026-01-01'
)
SELECT ...
9. Handle Skew Explicitly
Section titled “9. Handle Skew Explicitly”If one customer_id or user_id dominates the data, normal hash partitioning can create a hot task.
Possible mitigations:
- pre-aggregate by finer keys
- isolate hot keys
- salt skewed joins
- broadcast small side
- repartition on a better key
- change the data model
10. Know When SQL Is Not Enough
Section titled “10. Know When SQL Is Not Enough”If correctness depends on fuzzy matching, recursive graph traversal, or streaming state with exactly-once semantics, SQL may still be part of the solution, but not the entire system. Say that clearly.
Practice
Section titled “Practice”1. Rewrite a JOIN + DISTINCT that only checks whether a customer has paid.
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.status = 'paid'
);
This expresses a semijoin and avoids row multiplication.
2. A dashboard has five metrics from the same events table and currently scans it five times. What rewrite do you propose?
Use one filtered scan and conditional aggregation.
WITH base AS (
SELECT event_name, user_id
FROM events
WHERE event_date = DATE '2026-06-01'
)
SELECT
COUNT(*) AS total_events,
COUNT(DISTINCT user_id) AS active_users,
SUM(CASE WHEN event_name = 'signup' THEN 1 ELSE 0 END) AS signups,
SUM(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS purchases,
SUM(CASE WHEN event_name = 'refund' THEN 1 ELSE 0 END) AS refunds
FROM base;
3. When is pre-aggregation before join invalid?
When downstream logic needs row-level detail that would be lost, or when the aggregation grain does not preserve the join semantics. Example: pre-aggregating order items by product before joining to order-level customer attributes loses customer allocation.