Feature and Label SQL

SQL-heavy ML engineering interviews often test whether you can build datasets without leakage and explain how they run at scale.

Prediction-Time Grain

Start with the prediction set.

prediction_points(entity_id, prediction_ts)

Every feature must be computed using information available at or before prediction_ts. Every label must be computed from an outcome window after prediction_ts.

Point-In-Time Feature Join

WITH candidate_features AS (
	SELECT
		p.entity_id,
		p.prediction_ts,
		f.feature_name,
		f.feature_value,
		ROW_NUMBER() OVER (
			PARTITION BY p.entity_id, p.prediction_ts, f.feature_name
			ORDER BY f.snapshot_ts DESC
		) AS rn
	FROM prediction_points p
	JOIN feature_snapshots f
		ON f.entity_id = p.entity_id
		AND f.snapshot_ts <= p.prediction_ts
)
SELECT entity_id, prediction_ts, feature_name, feature_value
FROM candidate_features
WHERE rn = 1;

Rolling Features

Example: orders in previous 30 days. Interval syntax varies by engine.

SELECT
	p.entity_id,
	p.prediction_ts,
	COUNT(o.order_id) AS orders_30d,
	SUM(o.total_amount) AS revenue_30d
FROM prediction_points p
LEFT JOIN orders o
	ON o.customer_id = p.entity_id
	AND o.order_ts < p.prediction_ts
	AND o.order_ts >= p.prediction_ts - INTERVAL '30' DAY
	AND o.status = 'paid'
GROUP BY p.entity_id, p.prediction_ts;

Use < prediction_ts, not <=, unless events at the exact prediction timestamp are truly known before prediction.

Label Generation

Example: churn label if no activity in the next 30 days after a 7-day observation gap.

SELECT
	p.entity_id,
	p.prediction_ts,
	CASE
		WHEN EXISTS (
			SELECT 1
			FROM events e
			WHERE e.user_id = p.entity_id
				AND e.event_ts >= p.prediction_ts + INTERVAL '7' DAY
				AND e.event_ts < p.prediction_ts + INTERVAL '37' DAY
		)
		THEN 0
		ELSE 1
	END AS churn_label
FROM prediction_points p;

The gap avoids labeling behavior that may not be observable yet.

Leakage Checklist

Feature computed after prediction time.
Dimension joined to current state instead of historical state.
Label window overlaps feature window.
Backfilled data includes corrections unavailable at prediction time.
Aggregates use full-history table without time predicate.
Train/test split randomly separates rows from the same entity across time.
Target-derived fields sneak into features.

Evaluation SQL

Confusion matrix at threshold:

SELECT
	SUM(CASE WHEN score >= 0.8 AND label_value = 1 THEN 1 ELSE 0 END) AS tp,
	SUM(CASE WHEN score >= 0.8 AND label_value = 0 THEN 1 ELSE 0 END) AS fp,
	SUM(CASE WHEN score < 0.8 AND label_value = 1 THEN 1 ELSE 0 END) AS fn,
	SUM(CASE WHEN score < 0.8 AND label_value = 0 THEN 1 ELSE 0 END) AS tn
FROM predictions p
JOIN labels l
	ON l.entity_id = p.entity_id
	AND l.label_ts = p.prediction_ts;

Calibration buckets:

WITH scored AS (
	SELECT
		score,
		label_value,
		CASE
			WHEN score < 0.1 THEN '00-10'
			WHEN score < 0.2 THEN '10-20'
			WHEN score < 0.3 THEN '20-30'
			WHEN score < 0.4 THEN '30-40'
			WHEN score < 0.5 THEN '40-50'
			WHEN score < 0.6 THEN '50-60'
			WHEN score < 0.7 THEN '60-70'
			WHEN score < 0.8 THEN '70-80'
			WHEN score < 0.9 THEN '80-90'
			ELSE '90-100'
		END AS bucket
	FROM predictions p
	JOIN labels l
		ON l.entity_id = p.entity_id
)
SELECT
	bucket,
	COUNT(*) AS examples,
	AVG(score) AS avg_score,
	AVG(label_value) AS observed_rate
FROM scored
GROUP BY bucket
ORDER BY bucket;

Production Dataset Contract

For every training dataset, define:

entity grain
prediction timestamp
feature availability timestamp
label window
exclusion criteria
backfill range
leakage tests
row count expectations
null/default policy
train/validation/test split strategy

Practice

1. Build a feature: number of purchases in the 7 days before prediction time.

SELECT
	p.entity_id,
	p.prediction_ts,
	COUNT(o.order_id) AS purchases_7d
FROM prediction_points p
LEFT JOIN orders o
	ON o.customer_id = p.entity_id
	AND o.status = 'paid'
	AND o.order_ts < p.prediction_ts
	AND o.order_ts >= p.prediction_ts - INTERVAL '7' DAY
GROUP BY p.entity_id, p.prediction_ts;

The left join preserves prediction points with zero purchases.

2. What is wrong with joining customer_dim WHERE is_current = true when generating historical training data?

It leaks future/customer-current state into past examples. Use the dimension version valid at prediction_ts or at the event timestamp relevant to the feature.

3. How do you create a time-based train/test split?

SELECT
	*,
	CASE
		WHEN prediction_ts < TIMESTAMP '2026-04-01 00:00:00' THEN 'train'
		WHEN prediction_ts < TIMESTAMP '2026-05-01 00:00:00' THEN 'validation'
		ELSE 'test'
	END AS split
FROM training_examples;

For entity leakage risk, also ensure the same entity’s future does not leak into training through aggregate features or split design.

4. A model's offline AUC is great but online performance is poor. What SQL/data causes do you investigate?

Point-in-time leakage, label leakage, random split over temporal data, training/serving feature skew, missing late data handling, changed population, duplicate entities across splits, bad negative sampling, incorrectly joined labels, and features available offline but not online.