Final 24 Hours
Use this as a compact rehearsal sheet.
Default Opening For Any SQL Problem
Section titled “Default Opening For Any SQL Problem”Say this before writing:
I’ll first define the output grain and time boundary, then build the query in layers. I’ll watch for join fanout, NULL behavior, and whether filters can be pushed before expensive joins or shuffles.
Query Correctness Checklist
Section titled “Query Correctness Checklist”- What is one output row?
- Which rows must be preserved?
- Is this event time or processing time?
- Does any join fan out?
- Are NULLs included or excluded?
- Are duplicates meaningful?
- Are ties deterministic?
- Does an outer join accidentally become inner?
- Does the query need a date spine?
- Is current-state dimension data leaking into history?
Performance Checklist
Section titled “Performance Checklist”- Are filters sargable?
- Are date filters half-open ranges?
- Are only needed columns selected?
- Are partitions/files pruned?
- Is a join used only for existence?
- Can a large input be pre-aggregated?
- Are stats stale?
- Is there a global sort/window/distinct?
- Is a distributed query shuffling huge data?
- Is there key skew?
- Is
COUNT(DISTINCT)the real bottleneck?
Staff-Level Close
Section titled “Staff-Level Close”End performance answers with:
I would verify the rewrite with
EXPLAIN, compare estimated and actual row counts at each stage, and add a test or modeled table if this query expresses a recurring business contract.
End ML dataset answers with:
I would enforce point-in-time joins, define label windows explicitly, test for feature availability after prediction time, and use a time-based validation split to reduce leakage.
End distributed answers with:
I would identify whether the dominant cost is scan, shuffle, skew, spill, or write, because each has a different fix.
High-Yield Patterns To Rehearse
Section titled “High-Yield Patterns To Rehearse”- Latest row per key with
ROW_NUMBER. - Existence with
EXISTS. - Anti join with
NOT EXISTS. - Conditional aggregation.
- Top N per group.
- Gaps and islands.
- Sessionization.
- Point-in-time feature join.
- SCD type 2 interval join.
- Revenue by dimension without double counting.
- Funnel with ordered steps.
- Retention cohort.
- Confusion matrix and calibration buckets.
- Query rewrite from non-sargable date filter.
- Distributed skew diagnosis.
Five Questions You Should Ask Interviewers
Section titled “Five Questions You Should Ask Interviewers”- What is the expected output grain?
- Should rows with no activity appear as zero?
- Which timestamp defines the business event?
- Are duplicate events possible?
- Is exactness required, or are approximate/sketched metrics acceptable?
Red-Flag Smells
Section titled “Red-Flag Smells”| Smell | Likely issue |
|---|---|
SELECT DISTINCT after many joins | hidden fanout |
NOT IN (subquery) | NULL trap |
WHERE right_table.col = ... after left join | outer join turned inner |
DATE(timestamp_col) in predicate | non-sargable filter |
| current dimension join for historical data | temporal leakage |
| random train/test split for time data | future leakage |
unbounded COUNT(DISTINCT) on raw events | expensive global aggregation |
| one distributed task runs forever | skew |
dashboard over raw SELECT * events | missing curated mart |
Mini Mock
Section titled “Mini Mock”Prompt: "Write SQL for weekly active users and explain how you would make it production-grade."
Start with correctness:
WITH user_days AS (
SELECT DISTINCT user_id, event_date
FROM events
),
weekly AS (
SELECT
d.week_start_date,
u.user_id
FROM week_spine d
JOIN user_days u
ON u.event_date >= d.week_start_date
AND u.event_date < d.week_start_date + INTERVAL '7' DAY
)
SELECT week_start_date, COUNT(DISTINCT user_id) AS wau
FROM weekly
GROUP BY week_start_date;
Production-grade answer:
- define user identity rules
- exclude bots/test traffic if needed
- use event time
- handle late data with a watermark/backfill
- pre-aggregate user-day activity
- partition by event date
- test row counts and duplicate event rates
- publish a metric contract