Optimizer and Plan Drills
For each drill, name the likely plan smell first.
1. A date filter wraps the timestamp column in a function.
Rewrite to a half-open range.
WHERE order_ts >= TIMESTAMP '2026-06-01 00:00:00'
AND order_ts < TIMESTAMP '2026-06-02 00:00:00'
This is more likely to enable index range scans, partition pruning, or file skipping.
2. A query selects every column from a wide event table.
Project only required columns.
SELECT user_id, event_ts, event_name
FROM events
WHERE event_date = DATE '2026-06-01';
Column pruning reduces I/O, memory, network, and spill risk.
3. A query uses JOIN only to test existence.
Use EXISTS.
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
This avoids row multiplication and expresses a semijoin.
4. A query does NOT IN against a nullable subquery column.
Use NOT EXISTS.
SELECT c.customer_id
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM banned_customers b
WHERE b.customer_id = c.customer_id
);
NOT IN can return no rows if the subquery contains NULL.
5. A plan estimates 100 rows but returns 100 million rows.
Suspect stale stats, missing constraints, correlated predicates, data skew, many-to-many joins, or non-sargable filters. Validate cardinality at each join and update stats before forcing hints.
6. A nested-loop join appears between two large tables.
Check for missing join predicates, bad row estimates, missing index on the inner side, or disabled hash/merge joins. A large-large nested loop is usually a plan bug unless filters make one side tiny.
7. A hash join spills to disk.
Reduce build-side size, project fewer columns, filter earlier, pre-aggregate, improve join order, increase memory if appropriate, or use a different join strategy. In distributed engines, also check partition size and skew.
8. A query sorts a billion rows for a dashboard.
Ask whether the dashboard needs global order. If it only needs top N, use an access path or partition-pruned recent subset when semantically valid.
SELECT order_id, order_ts, total_amount
FROM orders
WHERE order_ts >= TIMESTAMP '2026-06-01 00:00:00'
ORDER BY order_ts DESC
FETCH FIRST 100 ROWS ONLY;
9. A low-cardinality boolean column has its own index.
It may not help because the predicate is not selective. A full scan may be cheaper. A composite index or filtered/partial index can be better, but syntax and support are engine-dependent.
10. A composite index is on customer and order time. Which predicate benefits most?
For an index conceptually ordered as (customer_id, order_ts), this shape is strong:
WHERE customer_id = 123
AND order_ts >= TIMESTAMP '2026-06-01 00:00:00'
Filtering only on order_ts may not use the leading customer_id ordering effectively in many row stores.
11. A dashboard query repeats the same expensive subquery three times.
Check whether the engine materializes or inlines the CTE. If repeated work appears in the plan, materialize the intermediate or rewrite to compute it once.
12. A plan scans every partition despite a date predicate.
Check whether the predicate is on the partition column, whether types match, whether the column is wrapped in a function, and whether the partition filter is hidden behind a non-deterministic expression.
13. A query filters with a leading wildcard string match.
WHERE properties LIKE '%purchase%'
This usually prevents normal index seeks and forces scans. Extract important properties into typed columns or use a search/indexing system if substring search is the requirement.
14. A query has an OR across unrelated columns.
Consider splitting into separate selective branches.
SELECT customer_id
FROM customers
WHERE country = 'US'
UNION
SELECT customer_id
FROM customers
WHERE acquisition_channel = 'paid_search';
Use UNION ALL only if duplicates are impossible or acceptable.
15. A query counts distinct users over raw events every minute.
Use a pre-aggregated user-period table or approximate distinct sketches if exactness is not required. Exact distinct over raw high-volume events is often the real bottleneck.
16. The optimizer scans the fact table before filtering the dimension.
Check stats and join selectivity. A filtered small dimension can sometimes be joined first to reduce fact rows, but the optimizer needs accurate stats and visible predicates.
17. A query uses DISTINCT at the end to fix duplicates.
Find the fanout source. Check uniqueness of each join key and pre-aggregate or rank to one row per key before joining. DISTINCT may hide wrong metrics.
18. A query aggregates after joining a huge fact to a dimension with duplicate keys.
First validate the dimension key. If duplicates are valid history rows, add a temporal predicate. If duplicates are invalid, fix data quality before tuning the query.
19. A plan shows a large exchange or repartition before filtering.
Push filters and projections below the exchange if semantics allow. In distributed systems, bytes before shuffle often dominate runtime.
20. A materialized view would speed a dashboard. What do you ask first?
Ask about freshness SLO, refresh cost, backfill behavior, invalidation, ownership, metric definition stability, and whether users need drill-down beyond the aggregate grain.
21. A query joins on expressions instead of raw keys.
Example:
ON LOWER(a.email) = LOWER(b.email)
This can block normal index usage. Normalize keys upstream or use generated/functional indexes if the engine supports them.
22. A query applies a filter on the right side of a left join in WHERE.
Move the filter into ON if left rows must be preserved.
LEFT JOIN orders o
ON o.customer_id = c.customer_id
AND o.status = 'paid'
23. A plan chooses broadcast join but the job fails with memory pressure.
The broadcast side may be larger than estimated after projection/filtering, stats may be stale, or executor memory is insufficient. Reduce columns, filter earlier, update stats, or force a shuffle join only after understanding the data size.
24. A query is fast alone but slow under concurrency.
Look for shared resource contention: slots, memory, temp disk, metadata service, source-system limits, lock contention, cache eviction, or workload management queues.
25. Give a staff-level answer to "how would you optimize this query?"
State the semantic grain first, inspect EXPLAIN, compare estimated vs actual rows, identify the dominant cost, rewrite to reduce rows/columns/shuffle/sort, adjust layout or indexes if the workload is recurring, and add tests so correctness is not traded for speed.