Common Query Patterns

These are the patterns that show up repeatedly in data and ML engineering loops.

Deduplication

Keep the latest record per natural key.

WITH ranked AS (
	SELECT
		e.*,
		ROW_NUMBER() OVER (
			PARTITION BY event_id
			ORDER BY ingest_ts DESC
		) AS rn
	FROM raw_events e
)
SELECT *
FROM ranked
WHERE rn = 1;

Staff note: define what “latest” means. Source event time, processing time, version, and ingest time are not interchangeable.

Funnels

WITH user_steps AS (
	SELECT
		user_id,
		MIN(CASE WHEN event_name = 'view_product' THEN event_ts END) AS viewed_at,
		MIN(CASE WHEN event_name = 'add_to_cart' THEN event_ts END) AS carted_at,
		MIN(CASE WHEN event_name = 'purchase' THEN event_ts END) AS purchased_at
	FROM events
	GROUP BY user_id
)
SELECT
	COUNT(*) AS users,
	SUM(CASE WHEN viewed_at IS NOT NULL THEN 1 ELSE 0 END) AS viewed,
	SUM(CASE WHEN carted_at > viewed_at THEN 1 ELSE 0 END) AS carted_after_view,
	SUM(CASE WHEN purchased_at > carted_at THEN 1 ELSE 0 END) AS purchased_after_cart
FROM user_steps;

This simple version assumes the first timestamp per event type is enough. Harder versions need ordered paths and step windows.

Retention Cohorts

WITH first_seen AS (
	SELECT user_id, MIN(event_date) AS cohort_date
	FROM events
	GROUP BY user_id
),
activity AS (
	SELECT DISTINCT user_id, event_date
	FROM events
)
SELECT
	f.cohort_date,
	a.event_date,
	COUNT(*) AS active_users
FROM first_seen f
JOIN activity a
	ON a.user_id = f.user_id
	AND a.event_date >= f.cohort_date
GROUP BY f.cohort_date, a.event_date;

Date-diff syntax varies by engine. In interviews, say you would convert event_date - cohort_date to a cohort age using the engine’s date arithmetic.

Gaps and Islands

Find consecutive active days per user. Date arithmetic syntax varies; day_number can be a generated integer representing days since epoch.

WITH active_days AS (
	SELECT DISTINCT user_id, day_number
	FROM daily_user_activity
),
numbered AS (
	SELECT
		user_id,
		day_number,
		day_number - ROW_NUMBER() OVER (
			PARTITION BY user_id
			ORDER BY day_number
		) AS island_key
	FROM active_days
)
SELECT
	user_id,
	MIN(day_number) AS start_day,
	MAX(day_number) AS end_day,
	COUNT(*) AS streak_days
FROM numbered
GROUP BY user_id, island_key;

Sessionization

WITH ordered AS (
	SELECT
		user_id,
		event_ts,
		event_name,
		LAG(event_ts) OVER (
			PARTITION BY user_id
			ORDER BY event_ts
		) AS prev_event_ts
	FROM events
),
marked AS (
	SELECT
		*,
		CASE
			WHEN prev_event_ts IS NULL THEN 1
			WHEN event_ts > prev_event_ts + INTERVAL '30' MINUTE THEN 1
			ELSE 0
		END AS is_new_session
	FROM ordered
),
sessionized AS (
	SELECT
		*,
		SUM(is_new_session) OVER (
			PARTITION BY user_id
			ORDER BY event_ts
			ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
		) AS session_number
	FROM marked
)
SELECT user_id, session_number, MIN(event_ts) AS started_at, MAX(event_ts) AS ended_at
FROM sessionized
GROUP BY user_id, session_number;

Interval syntax varies. In Spark SQL, you may write interval expressions differently depending on version.

As-Of Join

Attach the latest feature snapshot before prediction time.

WITH candidate_features AS (
	SELECT
		p.entity_id,
		p.prediction_ts,
		f.feature_name,
		f.feature_value,
		ROW_NUMBER() OVER (
			PARTITION BY p.entity_id, p.prediction_ts, f.feature_name
			ORDER BY f.snapshot_ts DESC
		) AS rn
	FROM predictions p
	JOIN feature_snapshots f
		ON f.entity_id = p.entity_id
		AND f.snapshot_ts <= p.prediction_ts
)
SELECT entity_id, prediction_ts, feature_name, feature_value
FROM candidate_features
WHERE rn = 1;

This is the core anti-leakage pattern for point-in-time ML features.

Practice

1. Find users who did event A and then event B within 24 hours.

WITH a_events AS (
	SELECT user_id, event_ts AS a_ts
	FROM events
	WHERE event_name = 'A'
),
b_events AS (
	SELECT user_id, event_ts AS b_ts
	FROM events
	WHERE event_name = 'B'
)
SELECT DISTINCT a.user_id
FROM a_events a
JOIN b_events b
	ON b.user_id = a.user_id
	AND b.b_ts > a.a_ts
	AND b.b_ts <= a.a_ts + INTERVAL '24' HOUR;

If many events exist, you may need prefiltered partitions, clustering by user/time, or a windowed path algorithm.

2. Find each user's longest active-day streak.

Use gaps-and-islands, then rank streaks.

WITH active_days AS (
	SELECT DISTINCT user_id, day_number
	FROM daily_user_activity
),
numbered AS (
	SELECT
		user_id,
		day_number,
		day_number - ROW_NUMBER() OVER (
			PARTITION BY user_id
			ORDER BY day_number
		) AS island_key
	FROM active_days
),
streaks AS (
	SELECT user_id, COUNT(*) AS streak_days
	FROM numbered
	GROUP BY user_id, island_key
)
SELECT user_id, MAX(streak_days) AS longest_streak_days
FROM streaks
GROUP BY user_id;

3. Why might a funnel query overcount conversions?

Common causes: not enforcing step order, using all event rows instead of user-grain collapse, joining step tables many-to-many, not bounding the conversion window, duplicate events, and mixing event time with ingest time.