Command Center

SQL Interview Prep

Target: staff data engineer / staff ML engineer interviews where SQL is a major signal.

This is not a syntax crib sheet. The goal is to move from “I can write queries” to “I can reason about correctness, performance, distributed execution, feature pipelines, and data product tradeoffs under interview pressure.”

Interview Thesis

Your default message:

I write SQL as a specification of data intent, then validate whether the physical plan, data model, and operational constraints can actually support that intent at production scale.

At staff level, interviewers are not only testing whether the query returns the right rows. They are testing whether you catch grain mismatches, duplicate amplification, NULL traps, temporal leakage, skew, shuffle cost, stale stats, bad partitioning, and the difference between a query that works once and a data contract that can run every day.

Fast Path

First 90 minutes

Read Query Order/NULLs/Joins, Grouping/Windowing, and the Beginner plus Intermediate problem banks. Speak the answers before opening each reveal.

Optimization loop

Read Optimizer/Indexes/EXPLAIN and Rewrite Playbook, then practice the Optimization Problems page until you can explain plan shape without running it.

Volume reps

Use the topic drill banks for 260+ revealable SQL prompts across joins, windows, temporal analytics, ML datasets, distributed execution, data quality, and staff scenarios.

Staff loop

Use Distributed SQL, Feature/Label SQL, and Staff Scenarios to practice architecture-grade answers with SQL examples.

Final polish

Use Final 24 Hours and Glossary to tighten vocabulary, answer shape, and failure-mode checklists.

Universal Practice Schema

Most examples use this conceptual schema. Types are intentionally generic.

customers(customer_id, signup_ts, country, acquisition_channel)
orders(order_id, customer_id, order_ts, status, currency, total_amount)
order_items(order_id, product_id, quantity, unit_price)
products(product_id, category, created_ts, is_active)
events(event_id, user_id, event_ts, event_name, session_id, properties)
sessions(session_id, user_id, started_at, ended_at, device_type)
payments(payment_id, order_id, payment_ts, amount, status)
experiments(user_id, experiment_name, variant, assigned_ts)
predictions(entity_id, prediction_ts, score, model_version)
labels(entity_id, label_ts, label_value)
feature_snapshots(entity_id, snapshot_ts, feature_name, feature_value)

Staff-Level SQL Answer Shape

State the grain: one row per what?
State the time boundary: event time, processing time, snapshot time, or validity interval?
State the join cardinality: one-to-one, many-to-one, one-to-many, or many-to-many?
Write the query in layers: filter, normalize grain, join, aggregate, window, final projection.
Check correctness: duplicates, NULLs, missing rows, late data, time zones, and leakage.
Check performance: partition pruning, column pruning, join strategy, shuffle, skew, stats, and indexes.
Close with productionization: tests, invariants, backfill plan, observability, ownership.

Core Invariants

Invariant	Why it matters
Grain before join	Most wrong SQL comes from multiplying rows before aggregation.
Time before label	ML features must be computed from information available before prediction time.
Filter before shuffle	In distributed systems, reducing bytes early usually dominates clever syntax.
Semantics before tuning	A fast wrong query is worse than a slow visible one.
Plan before guess	Use `EXPLAIN` and row counts to verify optimizer assumptions.

Senior candidates optimize syntax. Staff candidates optimize the contract between SQL semantics, physical execution, data model, and downstream consumers.