Feature and Label SQL
SQL-heavy ML engineering interviews often test whether you can build datasets without leakage and explain how they run at scale.
Prediction-Time Grain
Section titled “Prediction-Time Grain”Start with the prediction set.
prediction_points(entity_id, prediction_ts)
Every feature must be computed using information available at or before prediction_ts. Every label must be computed from an outcome window after prediction_ts.
Point-In-Time Feature Join
Section titled “Point-In-Time Feature Join”WITH candidate_features AS (
SELECT
p.entity_id,
p.prediction_ts,
f.feature_name,
f.feature_value,
ROW_NUMBER() OVER (
PARTITION BY p.entity_id, p.prediction_ts, f.feature_name
ORDER BY f.snapshot_ts DESC
) AS rn
FROM prediction_points p
JOIN feature_snapshots f
ON f.entity_id = p.entity_id
AND f.snapshot_ts <= p.prediction_ts
)
SELECT entity_id, prediction_ts, feature_name, feature_value
FROM candidate_features
WHERE rn = 1;
Rolling Features
Section titled “Rolling Features”Example: orders in previous 30 days. Interval syntax varies by engine.
SELECT
p.entity_id,
p.prediction_ts,
COUNT(o.order_id) AS orders_30d,
SUM(o.total_amount) AS revenue_30d
FROM prediction_points p
LEFT JOIN orders o
ON o.customer_id = p.entity_id
AND o.order_ts < p.prediction_ts
AND o.order_ts >= p.prediction_ts - INTERVAL '30' DAY
AND o.status = 'paid'
GROUP BY p.entity_id, p.prediction_ts;
Use < prediction_ts, not <=, unless events at the exact prediction timestamp are truly known before prediction.
Label Generation
Section titled “Label Generation”Example: churn label if no activity in the next 30 days after a 7-day observation gap.
SELECT
p.entity_id,
p.prediction_ts,
CASE
WHEN EXISTS (
SELECT 1
FROM events e
WHERE e.user_id = p.entity_id
AND e.event_ts >= p.prediction_ts + INTERVAL '7' DAY
AND e.event_ts < p.prediction_ts + INTERVAL '37' DAY
)
THEN 0
ELSE 1
END AS churn_label
FROM prediction_points p;
The gap avoids labeling behavior that may not be observable yet.
Leakage Checklist
Section titled “Leakage Checklist”- Feature computed after prediction time.
- Dimension joined to current state instead of historical state.
- Label window overlaps feature window.
- Backfilled data includes corrections unavailable at prediction time.
- Aggregates use full-history table without time predicate.
- Train/test split randomly separates rows from the same entity across time.
- Target-derived fields sneak into features.
Evaluation SQL
Section titled “Evaluation SQL”Confusion matrix at threshold:
SELECT
SUM(CASE WHEN score >= 0.8 AND label_value = 1 THEN 1 ELSE 0 END) AS tp,
SUM(CASE WHEN score >= 0.8 AND label_value = 0 THEN 1 ELSE 0 END) AS fp,
SUM(CASE WHEN score < 0.8 AND label_value = 1 THEN 1 ELSE 0 END) AS fn,
SUM(CASE WHEN score < 0.8 AND label_value = 0 THEN 1 ELSE 0 END) AS tn
FROM predictions p
JOIN labels l
ON l.entity_id = p.entity_id
AND l.label_ts = p.prediction_ts;
Calibration buckets:
WITH scored AS (
SELECT
score,
label_value,
CASE
WHEN score < 0.1 THEN '00-10'
WHEN score < 0.2 THEN '10-20'
WHEN score < 0.3 THEN '20-30'
WHEN score < 0.4 THEN '30-40'
WHEN score < 0.5 THEN '40-50'
WHEN score < 0.6 THEN '50-60'
WHEN score < 0.7 THEN '60-70'
WHEN score < 0.8 THEN '70-80'
WHEN score < 0.9 THEN '80-90'
ELSE '90-100'
END AS bucket
FROM predictions p
JOIN labels l
ON l.entity_id = p.entity_id
)
SELECT
bucket,
COUNT(*) AS examples,
AVG(score) AS avg_score,
AVG(label_value) AS observed_rate
FROM scored
GROUP BY bucket
ORDER BY bucket;
Production Dataset Contract
Section titled “Production Dataset Contract”For every training dataset, define:
- entity grain
- prediction timestamp
- feature availability timestamp
- label window
- exclusion criteria
- backfill range
- leakage tests
- row count expectations
- null/default policy
- train/validation/test split strategy
Practice
Section titled “Practice”1. Build a feature: number of purchases in the 7 days before prediction time.
SELECT
p.entity_id,
p.prediction_ts,
COUNT(o.order_id) AS purchases_7d
FROM prediction_points p
LEFT JOIN orders o
ON o.customer_id = p.entity_id
AND o.status = 'paid'
AND o.order_ts < p.prediction_ts
AND o.order_ts >= p.prediction_ts - INTERVAL '7' DAY
GROUP BY p.entity_id, p.prediction_ts;
The left join preserves prediction points with zero purchases.
2. What is wrong with joining customer_dim WHERE is_current = true when generating historical training data?
It leaks future/customer-current state into past examples. Use the dimension version valid at prediction_ts or at the event timestamp relevant to the feature.
3. How do you create a time-based train/test split?
SELECT
*,
CASE
WHEN prediction_ts < TIMESTAMP '2026-04-01 00:00:00' THEN 'train'
WHEN prediction_ts < TIMESTAMP '2026-05-01 00:00:00' THEN 'validation'
ELSE 'test'
END AS split
FROM training_examples;
For entity leakage risk, also ensure the same entity’s future does not leak into training through aggregate features or split design.
4. A model's offline AUC is great but online performance is poor. What SQL/data causes do you investigate?
Point-in-time leakage, label leakage, random split over temporal data, training/serving feature skew, missing late data handling, changed population, duplicate entities across splits, bad negative sampling, incorrectly joined labels, and features available offline but not online.