Skip to content

Feature and Label SQL

SQL-heavy ML engineering interviews often test whether you can build datasets without leakage and explain how they run at scale.

Start with the prediction set.

prediction_points(entity_id, prediction_ts)

Every feature must be computed using information available at or before prediction_ts. Every label must be computed from an outcome window after prediction_ts.

WITH candidate_features AS (
	SELECT
		p.entity_id,
		p.prediction_ts,
		f.feature_name,
		f.feature_value,
		ROW_NUMBER() OVER (
			PARTITION BY p.entity_id, p.prediction_ts, f.feature_name
			ORDER BY f.snapshot_ts DESC
		) AS rn
	FROM prediction_points p
	JOIN feature_snapshots f
		ON f.entity_id = p.entity_id
		AND f.snapshot_ts <= p.prediction_ts
)
SELECT entity_id, prediction_ts, feature_name, feature_value
FROM candidate_features
WHERE rn = 1;

Example: orders in previous 30 days. Interval syntax varies by engine.

SELECT
	p.entity_id,
	p.prediction_ts,
	COUNT(o.order_id) AS orders_30d,
	SUM(o.total_amount) AS revenue_30d
FROM prediction_points p
LEFT JOIN orders o
	ON o.customer_id = p.entity_id
	AND o.order_ts < p.prediction_ts
	AND o.order_ts >= p.prediction_ts - INTERVAL '30' DAY
	AND o.status = 'paid'
GROUP BY p.entity_id, p.prediction_ts;

Use < prediction_ts, not <=, unless events at the exact prediction timestamp are truly known before prediction.

Example: churn label if no activity in the next 30 days after a 7-day observation gap.

SELECT
	p.entity_id,
	p.prediction_ts,
	CASE
		WHEN EXISTS (
			SELECT 1
			FROM events e
			WHERE e.user_id = p.entity_id
				AND e.event_ts >= p.prediction_ts + INTERVAL '7' DAY
				AND e.event_ts < p.prediction_ts + INTERVAL '37' DAY
		)
		THEN 0
		ELSE 1
	END AS churn_label
FROM prediction_points p;

The gap avoids labeling behavior that may not be observable yet.

  • Feature computed after prediction time.
  • Dimension joined to current state instead of historical state.
  • Label window overlaps feature window.
  • Backfilled data includes corrections unavailable at prediction time.
  • Aggregates use full-history table without time predicate.
  • Train/test split randomly separates rows from the same entity across time.
  • Target-derived fields sneak into features.

Confusion matrix at threshold:

SELECT
	SUM(CASE WHEN score >= 0.8 AND label_value = 1 THEN 1 ELSE 0 END) AS tp,
	SUM(CASE WHEN score >= 0.8 AND label_value = 0 THEN 1 ELSE 0 END) AS fp,
	SUM(CASE WHEN score < 0.8 AND label_value = 1 THEN 1 ELSE 0 END) AS fn,
	SUM(CASE WHEN score < 0.8 AND label_value = 0 THEN 1 ELSE 0 END) AS tn
FROM predictions p
JOIN labels l
	ON l.entity_id = p.entity_id
	AND l.label_ts = p.prediction_ts;

Calibration buckets:

WITH scored AS (
	SELECT
		score,
		label_value,
		CASE
			WHEN score < 0.1 THEN '00-10'
			WHEN score < 0.2 THEN '10-20'
			WHEN score < 0.3 THEN '20-30'
			WHEN score < 0.4 THEN '30-40'
			WHEN score < 0.5 THEN '40-50'
			WHEN score < 0.6 THEN '50-60'
			WHEN score < 0.7 THEN '60-70'
			WHEN score < 0.8 THEN '70-80'
			WHEN score < 0.9 THEN '80-90'
			ELSE '90-100'
		END AS bucket
	FROM predictions p
	JOIN labels l
		ON l.entity_id = p.entity_id
)
SELECT
	bucket,
	COUNT(*) AS examples,
	AVG(score) AS avg_score,
	AVG(label_value) AS observed_rate
FROM scored
GROUP BY bucket
ORDER BY bucket;

For every training dataset, define:

  • entity grain
  • prediction timestamp
  • feature availability timestamp
  • label window
  • exclusion criteria
  • backfill range
  • leakage tests
  • row count expectations
  • null/default policy
  • train/validation/test split strategy
1. Build a feature: number of purchases in the 7 days before prediction time.
SELECT
	p.entity_id,
	p.prediction_ts,
	COUNT(o.order_id) AS purchases_7d
FROM prediction_points p
LEFT JOIN orders o
	ON o.customer_id = p.entity_id
	AND o.status = 'paid'
	AND o.order_ts < p.prediction_ts
	AND o.order_ts >= p.prediction_ts - INTERVAL '7' DAY
GROUP BY p.entity_id, p.prediction_ts;

The left join preserves prediction points with zero purchases.

2. What is wrong with joining customer_dim WHERE is_current = true when generating historical training data?

It leaks future/customer-current state into past examples. Use the dimension version valid at prediction_ts or at the event timestamp relevant to the feature.

3. How do you create a time-based train/test split?
SELECT
	*,
	CASE
		WHEN prediction_ts < TIMESTAMP '2026-04-01 00:00:00' THEN 'train'
		WHEN prediction_ts < TIMESTAMP '2026-05-01 00:00:00' THEN 'validation'
		ELSE 'test'
	END AS split
FROM training_examples;

For entity leakage risk, also ensure the same entity’s future does not leak into training through aggregate features or split design.

4. A model's offline AUC is great but online performance is poor. What SQL/data causes do you investigate?

Point-in-time leakage, label leakage, random split over temporal data, training/serving feature skew, missing late data handling, changed population, duplicate entities across splits, bad negative sampling, incorrectly joined labels, and features available offline but not online.