Skip to content

Command Center

Target: staff data engineer / staff ML engineer interviews where SQL is a major signal.

This is not a syntax crib sheet. The goal is to move from “I can write queries” to “I can reason about correctness, performance, distributed execution, feature pipelines, and data product tradeoffs under interview pressure.”

Your default message:

I write SQL as a specification of data intent, then validate whether the physical plan, data model, and operational constraints can actually support that intent at production scale.

At staff level, interviewers are not only testing whether the query returns the right rows. They are testing whether you catch grain mismatches, duplicate amplification, NULL traps, temporal leakage, skew, shuffle cost, stale stats, bad partitioning, and the difference between a query that works once and a data contract that can run every day.

First 90 minutes

Read Query Order/NULLs/Joins, Grouping/Windowing, and the Beginner plus Intermediate problem banks. Speak the answers before opening each reveal.

Optimization loop

Read Optimizer/Indexes/EXPLAIN and Rewrite Playbook, then practice the Optimization Problems page until you can explain plan shape without running it.

Volume reps

Use the topic drill banks for 260+ revealable SQL prompts across joins, windows, temporal analytics, ML datasets, distributed execution, data quality, and staff scenarios.

Staff loop

Use Distributed SQL, Feature/Label SQL, and Staff Scenarios to practice architecture-grade answers with SQL examples.

Final polish

Use Final 24 Hours and Glossary to tighten vocabulary, answer shape, and failure-mode checklists.

Most examples use this conceptual schema. Types are intentionally generic.

customers(customer_id, signup_ts, country, acquisition_channel)
orders(order_id, customer_id, order_ts, status, currency, total_amount)
order_items(order_id, product_id, quantity, unit_price)
products(product_id, category, created_ts, is_active)
events(event_id, user_id, event_ts, event_name, session_id, properties)
sessions(session_id, user_id, started_at, ended_at, device_type)
payments(payment_id, order_id, payment_ts, amount, status)
experiments(user_id, experiment_name, variant, assigned_ts)
predictions(entity_id, prediction_ts, score, model_version)
labels(entity_id, label_ts, label_value)
feature_snapshots(entity_id, snapshot_ts, feature_name, feature_value)
  1. State the grain: one row per what?
  2. State the time boundary: event time, processing time, snapshot time, or validity interval?
  3. State the join cardinality: one-to-one, many-to-one, one-to-many, or many-to-many?
  4. Write the query in layers: filter, normalize grain, join, aggregate, window, final projection.
  5. Check correctness: duplicates, NULLs, missing rows, late data, time zones, and leakage.
  6. Check performance: partition pruning, column pruning, join strategy, shuffle, skew, stats, and indexes.
  7. Close with productionization: tests, invariants, backfill plan, observability, ownership.
InvariantWhy it matters
Grain before joinMost wrong SQL comes from multiplying rows before aggregation.
Time before labelML features must be computed from information available before prediction time.
Filter before shuffleIn distributed systems, reducing bytes early usually dominates clever syntax.
Semantics before tuningA fast wrong query is worse than a slow visible one.
Plan before guessUse EXPLAIN and row counts to verify optimizer assumptions.

Senior candidates optimize syntax. Staff candidates optimize the contract between SQL semantics, physical execution, data model, and downstream consumers.