Skip to content

Staff Scenarios

These are not “write one query” prompts. They test whether you can lead from ambiguous data questions to robust systems.

Prompt:

Revenue in the finance dashboard and product analytics dashboard differ by 8%. Both teams say their SQL is correct.

Answer frame:

  1. Align the business definition: booked, paid, recognized, refunded, tax included, currency conversion.
  2. Align grain: order, payment, invoice line, subscription period.
  3. Align time: order time, payment time, settlement time, recognition period.
  4. Reconcile from a shared base fact.
  5. Build diff queries by dimension and time.
  6. Establish a governed metric contract.

SQL diff pattern:

WITH finance AS (
	SELECT order_date, SUM(revenue) AS finance_revenue
	FROM finance_daily_revenue
	GROUP BY order_date
),
product AS (
	SELECT order_date, SUM(revenue) AS product_revenue
	FROM product_daily_revenue
	GROUP BY order_date
)
SELECT
	COALESCE(f.order_date, p.order_date) AS order_date,
	f.finance_revenue,
	p.product_revenue,
	f.finance_revenue - p.product_revenue AS delta
FROM finance f
FULL OUTER JOIN product p
	ON p.order_date = f.order_date
ORDER BY ABS(f.finance_revenue - p.product_revenue) DESC;

If FULL OUTER JOIN is unavailable, emulate with UNION ALL of left/right anti joins.

Prompt:

A daily feature job grew from 45 minutes to 6 hours. It is mostly SQL over event data.

Investigation:

  • Did input volume grow or did plan shape change?
  • Are partitions pruned?
  • Did a new feature add a global distinct, window, or many-to-many join?
  • Is shuffle skewed?
  • Are small files causing scheduler overhead?
  • Are stats stale?
  • Did late-data backfill expand the processing range?

Staff answer:

I would separate semantic growth from physical regression. First compare row counts per stage and plan changes. Then identify whether the dominant cost is scan, shuffle, skew, sort, spill, or write. The fix could be a query rewrite, pre-aggregated feature table, better partitioning, compaction, or a change to the feature contract.

Scenario 3: Feature Store Point-In-Time Bug

Section titled “Scenario 3: Feature Store Point-In-Time Bug”

Prompt:

A fraud model looks excellent offline. A reviewer suspects features leak future information.

Approach:

  • Identify prediction points.
  • For every feature, require feature_available_ts <= prediction_ts.
  • Check whether dimensions are current-state joins.
  • Check label windows and observation windows.
  • Rebuild a small sample with strict as-of joins.
  • Add leakage tests to the dataset build.

Test query:

SELECT feature_name, COUNT(*) AS leaked_rows
FROM feature_values
WHERE feature_available_ts > prediction_ts
GROUP BY feature_name
HAVING COUNT(*) > 0;

Scenario 4: Multi-Tenant Data Platform Query Meltdown

Section titled “Scenario 4: Multi-Tenant Data Platform Query Meltdown”

Prompt:

One tenant runs a dashboard query that slows the shared warehouse for everyone.

Answer:

  • classify workloads and isolate critical jobs
  • set quotas/concurrency limits
  • use materialized aggregates for expensive dashboards
  • add query review for unbounded scans
  • expose cost/bytes scanned to users
  • create tenant-aware partitions/clustering where justified
  • monitor top queries by spend, spill, and scanned bytes

SQL smell:

SELECT *
FROM events
WHERE properties LIKE '%purchase%';

Better product/system fix:

  • parse important properties into typed columns
  • partition by event date
  • cluster by tenant and event name where useful
  • publish curated marts with narrow schemas

Prompt:

ML wants fresher features; data platform says current pipeline cannot support it.

Tradeoff answer:

  • define freshness SLO and model value from freshness
  • identify features needing low latency vs batch freshness
  • separate online-critical features from batch aggregates
  • design backfill-compatible feature definitions
  • measure training/serving skew
  • phase rollout with shadow evaluation

SQL-ish contract:

feature_definitions(
	feature_name,
	entity_type,
	feature_grain,
	max_lookback,
	availability_lag,
	owner,
	quality_slo
)
1. How do you answer "this query is slow" at staff level?

Ask what changed and what “slow” means: runtime, cost, p99, deadline miss, or cluster impact. Then inspect plan shape, input bytes, partition pruning, join order, shuffle/sort/spill, skew, stats, and output cardinality. Separate semantic fixes from physical tuning, and close with tests or data-model changes that prevent the pattern from recurring.

2. How do you push back when someone asks for SELECT * access to raw events for all dashboards?

Explain blast radius: cost, privacy, schema instability, duplicate definitions, and bad performance. Offer curated marts with explicit grains, typed columns, documented metrics, row-level access controls if needed, and an escape hatch for exploration.

3. How do you design SQL quality tests for a critical fact table?

Test uniqueness at declared grain, non-null keys, valid foreign-key coverage, accepted status values, event-time bounds, duplicate source events, reconciliation against upstream totals, late-data volume, and metric deltas versus historical baselines.

4. What staff signal should you show in SQL interviews?

You should move fluidly between query semantics, data modeling, execution plans, distributed cost, and production ownership. The interviewer should hear that you can prevent classes of bugs, not just solve a one-off puzzle.