Common Query Patterns
These are the patterns that show up repeatedly in data and ML engineering loops.
Deduplication
Section titled “Deduplication”Keep the latest record per natural key.
WITH ranked AS (
SELECT
e.*,
ROW_NUMBER() OVER (
PARTITION BY event_id
ORDER BY ingest_ts DESC
) AS rn
FROM raw_events e
)
SELECT *
FROM ranked
WHERE rn = 1;
Staff note: define what “latest” means. Source event time, processing time, version, and ingest time are not interchangeable.
Funnels
Section titled “Funnels”WITH user_steps AS (
SELECT
user_id,
MIN(CASE WHEN event_name = 'view_product' THEN event_ts END) AS viewed_at,
MIN(CASE WHEN event_name = 'add_to_cart' THEN event_ts END) AS carted_at,
MIN(CASE WHEN event_name = 'purchase' THEN event_ts END) AS purchased_at
FROM events
GROUP BY user_id
)
SELECT
COUNT(*) AS users,
SUM(CASE WHEN viewed_at IS NOT NULL THEN 1 ELSE 0 END) AS viewed,
SUM(CASE WHEN carted_at > viewed_at THEN 1 ELSE 0 END) AS carted_after_view,
SUM(CASE WHEN purchased_at > carted_at THEN 1 ELSE 0 END) AS purchased_after_cart
FROM user_steps;
This simple version assumes the first timestamp per event type is enough. Harder versions need ordered paths and step windows.
Retention Cohorts
Section titled “Retention Cohorts”WITH first_seen AS (
SELECT user_id, MIN(event_date) AS cohort_date
FROM events
GROUP BY user_id
),
activity AS (
SELECT DISTINCT user_id, event_date
FROM events
)
SELECT
f.cohort_date,
a.event_date,
COUNT(*) AS active_users
FROM first_seen f
JOIN activity a
ON a.user_id = f.user_id
AND a.event_date >= f.cohort_date
GROUP BY f.cohort_date, a.event_date;
Date-diff syntax varies by engine. In interviews, say you would convert event_date - cohort_date to a cohort age using the engine’s date arithmetic.
Gaps and Islands
Section titled “Gaps and Islands”Find consecutive active days per user. Date arithmetic syntax varies; day_number can be a generated integer representing days since epoch.
WITH active_days AS (
SELECT DISTINCT user_id, day_number
FROM daily_user_activity
),
numbered AS (
SELECT
user_id,
day_number,
day_number - ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY day_number
) AS island_key
FROM active_days
)
SELECT
user_id,
MIN(day_number) AS start_day,
MAX(day_number) AS end_day,
COUNT(*) AS streak_days
FROM numbered
GROUP BY user_id, island_key;
Sessionization
Section titled “Sessionization”WITH ordered AS (
SELECT
user_id,
event_ts,
event_name,
LAG(event_ts) OVER (
PARTITION BY user_id
ORDER BY event_ts
) AS prev_event_ts
FROM events
),
marked AS (
SELECT
*,
CASE
WHEN prev_event_ts IS NULL THEN 1
WHEN event_ts > prev_event_ts + INTERVAL '30' MINUTE THEN 1
ELSE 0
END AS is_new_session
FROM ordered
),
sessionized AS (
SELECT
*,
SUM(is_new_session) OVER (
PARTITION BY user_id
ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_number
FROM marked
)
SELECT user_id, session_number, MIN(event_ts) AS started_at, MAX(event_ts) AS ended_at
FROM sessionized
GROUP BY user_id, session_number;
Interval syntax varies. In Spark SQL, you may write interval expressions differently depending on version.
As-Of Join
Section titled “As-Of Join”Attach the latest feature snapshot before prediction time.
WITH candidate_features AS (
SELECT
p.entity_id,
p.prediction_ts,
f.feature_name,
f.feature_value,
ROW_NUMBER() OVER (
PARTITION BY p.entity_id, p.prediction_ts, f.feature_name
ORDER BY f.snapshot_ts DESC
) AS rn
FROM predictions p
JOIN feature_snapshots f
ON f.entity_id = p.entity_id
AND f.snapshot_ts <= p.prediction_ts
)
SELECT entity_id, prediction_ts, feature_name, feature_value
FROM candidate_features
WHERE rn = 1;
This is the core anti-leakage pattern for point-in-time ML features.
Practice
Section titled “Practice”1. Find users who did event A and then event B within 24 hours.
WITH a_events AS (
SELECT user_id, event_ts AS a_ts
FROM events
WHERE event_name = 'A'
),
b_events AS (
SELECT user_id, event_ts AS b_ts
FROM events
WHERE event_name = 'B'
)
SELECT DISTINCT a.user_id
FROM a_events a
JOIN b_events b
ON b.user_id = a.user_id
AND b.b_ts > a.a_ts
AND b.b_ts <= a.a_ts + INTERVAL '24' HOUR;
If many events exist, you may need prefiltered partitions, clustering by user/time, or a windowed path algorithm.
2. Find each user's longest active-day streak.
Use gaps-and-islands, then rank streaks.
WITH active_days AS (
SELECT DISTINCT user_id, day_number
FROM daily_user_activity
),
numbered AS (
SELECT
user_id,
day_number,
day_number - ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY day_number
) AS island_key
FROM active_days
),
streaks AS (
SELECT user_id, COUNT(*) AS streak_days
FROM numbered
GROUP BY user_id, island_key
)
SELECT user_id, MAX(streak_days) AS longest_streak_days
FROM streaks
GROUP BY user_id;
3. Why might a funnel query overcount conversions?
Common causes: not enforcing step order, using all event rows instead of user-grain collapse, joining step tables many-to-many, not bounding the conversion window, duplicate events, and mixing event time with ingest time.