Staff SQL Scenario Drills

These are meant to be answered out loud in five minutes.

1. Finance and product dashboards disagree on revenue.

Align definitions: booked vs paid vs recognized, refunds, tax, currency, event time, and grain. Reconcile from a shared base fact by day and dimension. Then publish a governed metric contract with ownership and tests.

2. A CEO asks for "active users" and teams define it differently.

Clarify the business action that counts as active, identity resolution, bot/test filtering, time zone, date grain, and late-data handling. Create one metric definition and expose variants only when intentionally named.

3. A daily feature pipeline jumps from 1 hour to 8 hours.

Compare plan and data volume before/after. Identify scan, shuffle, skew, spill, or write bottleneck. Look for new distincts, windows, many-to-many joins, partition misses, tiny files, or backfill range expansion.

4. Offline model metrics are excellent but online metrics are poor.

Investigate point-in-time leakage, training-serving skew, random temporal splits, label leakage, population shift, duplicate entities across splits, delayed labels, and incorrect joins between predictions and outcomes.

5. A dashboard query takes down shared warehouse capacity.

Add workload isolation, query cost limits, concurrency controls, curated marts, materialized aggregates, partition pruning guidance, and observability for top queries by bytes scanned, spill, and runtime.

6. A team wants raw event access for every metric.

Offer a curated event mart with typed columns, documented grains, privacy controls, and common metric tables. Keep raw access for exploration with quotas, but do not make raw events the production dashboard contract.

7. Late-arriving events keep changing historical metrics.

Define watermark and finalization policy. Publish provisional vs finalized metrics if needed. Backfill impacted partitions, track late-arrival volume, and make downstream consumers aware of mutability.

8. A dimension attribute changed and historical reporting shifted.

The model likely used current-state dimension joins. Use SCD type 2 or snapshot dimensions and join facts to the version valid at event time. Add tests for interval overlap and current-row uniqueness.

9. A query is correct but too expensive for daily production.

Move recurring logic into modeled intermediates. Pre-aggregate at stable grains, partition and cluster for access patterns, materialize dashboards, and add freshness/quality tests.

10. A team wants exact real-time global distinct users.

Clarify exactness, latency, and cost. Exact global distinct in real time is expensive. Consider approximate sketches, bounded windows, pre-aggregated user activity, or delayed exact reporting.

11. A migration changes user IDs and breaks historical metrics.

Introduce an identity mapping table with validity intervals and source-system lineage. Backfill curated marts, reconcile old vs new counts, and define how merged/split identities are handled.

12. A new metric requires joining three many-to-many tables.

Stop and define the business grain. Resolve each many-to-many relationship with bridge rules, attribution, deduplication, or pre-aggregation before computing measures.

13. Analysts copy-paste slightly different revenue SQL everywhere.

Centralize the metric in a semantic layer or curated mart. Document definition, owner, grain, filters, exclusions, and tests. Deprecate ad hoc versions with migration support.

14. A backfill must recompute two years of event data.

Plan partitioned backfill, resource isolation, validation samples, reconciliation totals, idempotent writes, checkpointing, rollback, and communication to downstream consumers.

15. A table is cheap for one team but expensive for everyone else.

Separate workload-specific layout from canonical data. Maintain a canonical fact and derived serving marts with different partitioning, clustering, or aggregation for major access patterns.

16. An experiment result flips after removing duplicate users.

Investigate assignment uniqueness, unit of randomization, exposure vs assignment, bot/test traffic, repeated conversions, and whether metrics were computed at event grain instead of user grain.

17. A feature has high null rate in production but not training.

Check training-serving skew, source freshness, online availability, join keys, default policies, feature backfill logic, and population shift. Add null-rate monitors by feature and segment.

18. A product manager asks for a metric that cannot be answered from current data.

Explain the missing event/grain/timestamp, propose instrumentation or modeling changes, define interim proxy metrics if useful, and state limitations clearly.

19. A warehouse migration changes query results slightly.

Check NULL semantics, timestamp/timezone behavior, decimal precision, integer division, approximate functions, window frame defaults, and collation. Run reconciliation by metric and partition.

20. A streaming table and batch table disagree.

Compare event-time windows, watermark, deduplication, late-data handling, source offsets, retries, and exactly-once assumptions. Reconcile by event ID and partition.

21. A single tenant dominates query cost.

Measure cost by tenant, identify access pattern, add quotas or workload isolation, create tenant-specific aggregates if justified, and address skew in partitioning or joins.

22. ML asks for fresher features than batch pipelines provide.

Separate features by freshness need. Keep stable historical batch definitions, add online/streaming paths only for high-value low-latency features, and monitor training-serving skew.

23. A critical metric has no owner.

Assign ownership, document definition, add tests, define freshness/error SLOs, publish lineage, and create an escalation path. A metric without ownership is operational debt.

24. A query optimization changes results.

Treat it as a correctness regression. Compare row counts and checksums at each CTE, inspect changed join cardinality and NULL behavior, and add tests before accepting any performance improvement.

25. What makes an SQL answer staff-level?

It covers semantics, grain, temporal boundaries, performance, data model, operationalization, tests, and downstream impact. The answer should reduce future ambiguity, not just solve the immediate query.