Skip to content

Joins, NULLs, Cardinality Drills

Say the preserved side and expected join cardinality before opening each answer.

1. Return all customers and their latest paid order, preserving customers with none.
WITH ranked AS (
	SELECT
		o.*,
		ROW_NUMBER() OVER (
			PARTITION BY customer_id
			ORDER BY order_ts DESC, order_id DESC
		) AS rn
	FROM orders o
	WHERE status = 'paid'
)
SELECT c.customer_id, r.order_id, r.order_ts, r.total_amount
FROM customers c
LEFT JOIN ranked r
	ON r.customer_id = c.customer_id
	AND r.rn = 1;

The filter for rn = 1 stays in the join condition to preserve customers without paid orders.

2. Find orders whose customer row is missing.
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c
	ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;

This is an orphan fact check.

3. Find customers with a paid order but no successful payment.
SELECT DISTINCT o.customer_id
FROM orders o
WHERE o.status = 'paid'
	AND NOT EXISTS (
		SELECT 1
		FROM payments p
		WHERE p.order_id = o.order_id
			AND p.status = 'succeeded'
	);

Use NOT EXISTS to avoid NULL traps and duplicate amplification.

4. Check whether joining products to order_items is many-to-one.
SELECT product_id, COUNT(*) AS rows
FROM products
GROUP BY product_id
HAVING COUNT(*) > 1;

If this returns rows, products is not unique by product_id; joining may fan out.

5. Count customers with no orders without using LEFT JOIN.
SELECT COUNT(*) AS customers_without_orders
FROM customers c
WHERE NOT EXISTS (
	SELECT 1
	FROM orders o
	WHERE o.customer_id = c.customer_id
);

This is an anti semijoin.

6. Count paid orders per customer and include customers with zero.
SELECT
	c.customer_id,
	COUNT(o.order_id) AS paid_orders
FROM customers c
LEFT JOIN orders o
	ON o.customer_id = c.customer_id
	AND o.status = 'paid'
GROUP BY c.customer_id;

Count the right-side key, not COUNT(*), because preserved rows with no order still produce one joined row.

7. Return customers whose country differs from US, including missing country.
SELECT customer_id
FROM customers
WHERE country <> 'US' OR country IS NULL;

country <> 'US' alone excludes NULLs.

8. Find duplicated experiment assignments for the same user and experiment.
SELECT user_id, experiment_name, COUNT(*) AS assignments
FROM experiments
GROUP BY user_id, experiment_name
HAVING COUNT(*) > 1;

This matters because assignment duplication can corrupt conversion rates.

9. Join events to experiment assignment only after assignment time.
SELECT e.user_id, x.experiment_name, x.variant, e.event_name, e.event_ts
FROM events e
JOIN experiments x
	ON x.user_id = e.user_id
	AND e.event_ts >= x.assigned_ts;

This avoids attributing pre-assignment behavior to the experiment.

10. Find users assigned to both variants of the same experiment.
SELECT user_id, experiment_name
FROM experiments
GROUP BY user_id, experiment_name
HAVING COUNT(DISTINCT variant) > 1;

This is a contamination check.

11. Reconcile order counts between two marts by day.
WITH a AS (
	SELECT order_date, COUNT(*) AS orders_a
	FROM mart_a_orders
	GROUP BY order_date
),
b AS (
	SELECT order_date, COUNT(*) AS orders_b
	FROM mart_b_orders
	GROUP BY order_date
)
SELECT
	COALESCE(a.order_date, b.order_date) AS order_date,
	a.orders_a,
	b.orders_b,
	COALESCE(a.orders_a, 0) - COALESCE(b.orders_b, 0) AS delta
FROM a
FULL OUTER JOIN b
	ON b.order_date = a.order_date;

If full outer join is unavailable, emulate with left join plus right anti join.

12. Find order IDs that appear multiple times in orders.
SELECT order_id, COUNT(*) AS rows
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

This tests primary-key uniqueness at order grain.

13. Return products with no paid order items.
SELECT p.product_id
FROM products p
WHERE NOT EXISTS (
	SELECT 1
	FROM order_items oi
	JOIN orders o
		ON o.order_id = oi.order_id
	WHERE oi.product_id = p.product_id
		AND o.status = 'paid'
);

This checks product existence in paid sales only.

14. Find customers who bought products in more than three categories.
SELECT o.customer_id
FROM orders o
JOIN order_items oi
	ON oi.order_id = o.order_id
JOIN products p
	ON p.product_id = oi.product_id
WHERE o.status = 'paid'
GROUP BY o.customer_id
HAVING COUNT(DISTINCT p.category) > 3;

Use DISTINCT because multiple products in the same category should count once.

15. Find many-to-many fanout when joining orders to promotions.
SELECT
	o.order_id,
	COUNT(*) AS promotion_matches
FROM orders o
JOIN promotions p
	ON p.customer_id = o.customer_id
	AND o.order_ts >= p.valid_from
	AND o.order_ts < p.valid_to
GROUP BY o.order_id
HAVING COUNT(*) > 1;

If this returns rows, define attribution rules before summing revenue.

16. Get all customers who have any order in June without multiplying rows.
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
	SELECT 1
	FROM orders o
	WHERE o.customer_id = c.customer_id
		AND o.order_ts >= TIMESTAMP '2026-06-01 00:00:00'
		AND o.order_ts < TIMESTAMP '2026-07-01 00:00:00'
);

This is cleaner than JOIN plus DISTINCT.

17. Find customers with orders in both January and February.
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
	SELECT 1
	FROM orders o
	WHERE o.customer_id = c.customer_id
		AND o.order_ts >= TIMESTAMP '2026-01-01 00:00:00'
		AND o.order_ts < TIMESTAMP '2026-02-01 00:00:00'
)
AND EXISTS (
	SELECT 1
	FROM orders o
	WHERE o.customer_id = c.customer_id
		AND o.order_ts >= TIMESTAMP '2026-02-01 00:00:00'
		AND o.order_ts < TIMESTAMP '2026-03-01 00:00:00'
);

Two semijoins avoid row multiplication across months.

18. Find customers with January orders but no February orders.
SELECT c.customer_id
FROM customers c
WHERE EXISTS (
	SELECT 1
	FROM orders o
	WHERE o.customer_id = c.customer_id
		AND o.order_ts >= TIMESTAMP '2026-01-01 00:00:00'
		AND o.order_ts < TIMESTAMP '2026-02-01 00:00:00'
)
AND NOT EXISTS (
	SELECT 1
	FROM orders o
	WHERE o.customer_id = c.customer_id
		AND o.order_ts >= TIMESTAMP '2026-02-01 00:00:00'
		AND o.order_ts < TIMESTAMP '2026-03-01 00:00:00'
);
19. Detect nullable join keys before joining two event tables.
SELECT
	SUM(CASE WHEN session_id IS NULL THEN 1 ELSE 0 END) AS null_session_rows,
	COUNT(*) AS total_rows
FROM events;

NULL join keys do not match using normal equality. Decide whether to filter, impute, or keep unmatched rows.

20. Keep one customer row per email before joining to orders.
WITH one_customer_per_email AS (
	SELECT *
	FROM (
		SELECT
			c.*,
			ROW_NUMBER() OVER (
				PARTITION BY email
				ORDER BY signup_ts DESC, customer_id DESC
			) AS rn
		FROM customers c
		WHERE email IS NOT NULL
	) ranked
	WHERE rn = 1
)
SELECT *
FROM one_customer_per_email;

Define the survivor rule explicitly. Do not deduplicate accidentally.

21. Find orders with multiple successful payments.
SELECT order_id, COUNT(*) AS succeeded_payments
FROM payments
WHERE status = 'succeeded'
GROUP BY order_id
HAVING COUNT(*) > 1;

This can indicate retries, partial captures, or data duplication. Business rules decide whether it is valid.

22. Join orders to exactly one successful payment, choosing latest payment.
WITH ranked_payments AS (
	SELECT
		p.*,
		ROW_NUMBER() OVER (
			PARTITION BY order_id
			ORDER BY payment_ts DESC, payment_id DESC
		) AS rn
	FROM payments p
	WHERE status = 'succeeded'
)
SELECT o.order_id, p.payment_id, p.payment_ts
FROM orders o
LEFT JOIN ranked_payments p
	ON p.order_id = o.order_id
	AND p.rn = 1;

This prevents payment fanout.

23. Return customers who have only cancelled orders.
SELECT customer_id
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 0
	AND SUM(CASE WHEN status <> 'cancelled' OR status IS NULL THEN 1 ELSE 0 END) = 0;

Be explicit about how NULL status should behave. This answer treats NULL as not cancelled.

24. Find event rows whose user is not a known customer, preserving NULL user IDs separately.
SELECT e.event_id, e.user_id
FROM events e
LEFT JOIN customers c
	ON c.customer_id = e.user_id
WHERE e.user_id IS NULL
	OR c.customer_id IS NULL;

This shows both missing user IDs and user IDs with no customer dimension row.

25. Explain why joining two aggregated tables can still double count.

If the two tables are aggregated at different grains, joining them can create fanout. Example: one table is customer_id, month; another is customer_id, product_category, month. Joining only on customer_id, month repeats the customer-month metrics once per category. Align grains first or aggregate both to the same join keys.