Modeling and Temporal SQL
Staff-level SQL interviews often hide a modeling problem inside a query problem.
Facts and Dimensions
Section titled “Facts and Dimensions”| Type | Grain | Examples |
|---|---|---|
| transaction fact | one row per business event | order, payment, shipment |
| periodic snapshot fact | one row per entity-period | account balance by day |
| accumulating snapshot | one row per process instance | order lifecycle milestones |
| dimension | one row per entity or entity-version | customer, product, store |
The right model makes the query obvious. The wrong model forces every query to re-derive business meaning.
Slowly Changing Dimensions
Section titled “Slowly Changing Dimensions”Type 2 dimension pattern:
customer_dim(
customer_sk,
customer_id,
country,
plan,
valid_from,
valid_to,
is_current
)
Join facts to the dimension version valid at event time.
SELECT
o.order_id,
o.order_ts,
d.plan,
o.total_amount
FROM orders o
JOIN customer_dim d
ON d.customer_id = o.customer_id
AND o.order_ts >= d.valid_from
AND o.order_ts < d.valid_to;
Use half-open intervals: [valid_from, valid_to). They avoid boundary overlap.
Late-Arriving Data
Section titled “Late-Arriving Data”Late events force a choice:
- recompute impacted partitions
- maintain correction tables
- use watermarking and accept bounded lateness
- publish both provisional and finalized metrics
Interview phrase:
I would separate event-time correctness from processing-time availability, then define a watermark and a backfill policy.
Date Spine
Section titled “Date Spine”Use a date spine when you need zero rows to appear as zero metrics.
SELECT
d.calendar_date,
COUNT(o.order_id) AS paid_orders
FROM date_spine d
LEFT JOIN orders o
ON o.order_date = d.calendar_date
AND o.status = 'paid'
GROUP BY d.calendar_date;
Snapshot vs Event Log
Section titled “Snapshot vs Event Log”Event log:
- append-only
- good for audits and replay
- harder for current state
Snapshot:
- easier current-state querying
- can hide history
- needs freshness and reconciliation checks
Staff answer: keep immutable events as source of truth and derived snapshots for serving, with tests that reconcile the two.
Practice
Section titled “Practice”1. A product's category can change over time. How do you compute revenue by category as of the order date?
Use a type 2 product dimension and join by validity interval.
SELECT
p.category,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN orders o
ON o.order_id = oi.order_id
JOIN product_dim p
ON p.product_id = oi.product_id
AND o.order_ts >= p.valid_from
AND o.order_ts < p.valid_to
WHERE o.status = 'paid'
GROUP BY p.category;
Do not join only to the current product category unless the question explicitly wants current taxonomy.
2. How do you detect overlapping SCD intervals?
WITH ordered AS (
SELECT
customer_id,
valid_from,
valid_to,
LAG(valid_to) OVER (
PARTITION BY customer_id
ORDER BY valid_from
) AS prev_valid_to
FROM customer_dim
)
SELECT *
FROM ordered
WHERE prev_valid_to > valid_from;
This finds intervals where the previous row ends after the current row starts.
3. Why is using processing date for business metrics risky?
Processing date measures when the pipeline saw the data, not when the business event happened. Late data, retries, backfills, and outages can distort event-time metrics. Use processing date for pipeline operations; use event date for business reporting unless explicitly stated otherwise.