Data Quality and Modeling Drills
Treat every query as a production invariant.
1. Test uniqueness of the orders fact table.
SELECT order_id, COUNT(*) AS rows
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
Expected result is zero rows.
2. Test non-null primary keys.
SELECT COUNT(*) AS null_order_ids
FROM orders
WHERE order_id IS NULL;
Primary keys should not be NULL.
3. Test foreign-key coverage from orders to customers.
SELECT COUNT(*) AS orphan_orders
FROM orders o
LEFT JOIN customers c
ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
4. Test accepted values for order status.
SELECT status, COUNT(*) AS rows
FROM orders
WHERE status NOT IN ('created', 'paid', 'cancelled', 'refunded')
OR status IS NULL
GROUP BY status;
Adjust valid values to the business contract.
5. Test order totals are non-negative.
SELECT order_id, total_amount
FROM orders
WHERE total_amount < 0;
Refunds should usually be separate facts or explicit negative events, not accidental negative orders.
6. Reconcile order totals to item totals.
WITH item_totals AS (
SELECT order_id, SUM(quantity * unit_price) AS item_total
FROM order_items
GROUP BY order_id
)
SELECT o.order_id, o.total_amount, i.item_total
FROM orders o
JOIN item_totals i
ON i.order_id = o.order_id
WHERE o.total_amount <> i.item_total;
Use tolerance for currency rounding.
7. Detect duplicate raw events by event ID.
SELECT event_id, COUNT(*) AS rows
FROM raw_events
GROUP BY event_id
HAVING COUNT(*) > 1;
Decide whether to deduplicate by latest ingest, source sequence, or exact payload hash.
8. Detect event timestamps in the future.
SELECT event_id, event_ts
FROM events
WHERE event_ts > CURRENT_TIMESTAMP;
Function names for current timestamp vary slightly by engine.
9. Detect impossible order lifecycle ordering.
SELECT order_id, created_ts, paid_ts, shipped_ts
FROM order_lifecycle
WHERE paid_ts < created_ts
OR shipped_ts < paid_ts;
Handle NULLs depending on whether each milestone is required.
10. Test SCD type 2 intervals for overlap.
WITH ordered AS (
SELECT
customer_id,
valid_from,
valid_to,
LAG(valid_to) OVER (
PARTITION BY customer_id
ORDER BY valid_from
) AS prev_valid_to
FROM customer_dim
)
SELECT *
FROM ordered
WHERE prev_valid_to > valid_from;
11. Test SCD type 2 current row count.
SELECT customer_id, COUNT(*) AS current_rows
FROM customer_dim
WHERE is_current = 1
GROUP BY customer_id
HAVING COUNT(*) <> 1;
This assumes every customer should have exactly one current row.
12. Find gaps in SCD intervals.
WITH ordered AS (
SELECT
customer_id,
valid_from,
valid_to,
LEAD(valid_from) OVER (
PARTITION BY customer_id
ORDER BY valid_from
) AS next_valid_from
FROM customer_dim
)
SELECT *
FROM ordered
WHERE next_valid_from IS NOT NULL
AND valid_to <> next_valid_from;
Whether gaps are invalid depends on the business.
13. Identify late-arriving partitions needing backfill.
SELECT event_date, COUNT(*) AS late_events
FROM raw_events
WHERE ingest_date > event_date
GROUP BY event_date;
Use the engine’s date diff when you need a lateness threshold.
14. Validate no duplicate rows at feature table grain.
SELECT entity_id, prediction_ts, feature_name, COUNT(*) AS rows
FROM feature_values
GROUP BY entity_id, prediction_ts, feature_name
HAVING COUNT(*) > 1;
15. Validate a daily mart has one row per day and country.
SELECT metric_date, country, COUNT(*) AS rows
FROM daily_country_metrics
GROUP BY metric_date, country
HAVING COUNT(*) > 1;
16. Detect missing dates in a daily metric table.
SELECT d.calendar_date
FROM date_spine d
LEFT JOIN daily_metrics m
ON m.metric_date = d.calendar_date
WHERE m.metric_date IS NULL;
Filter the date spine to the expected active range.
17. Compare row counts between staging and production by partition.
WITH s AS (
SELECT event_date, COUNT(*) AS staging_rows
FROM staging_events
GROUP BY event_date
),
p AS (
SELECT event_date, COUNT(*) AS prod_rows
FROM prod_events
GROUP BY event_date
)
SELECT
COALESCE(s.event_date, p.event_date) AS event_date,
s.staging_rows,
p.prod_rows,
COALESCE(s.staging_rows, 0) - COALESCE(p.prod_rows, 0) AS delta
FROM s
FULL OUTER JOIN p
ON p.event_date = s.event_date;
18. Detect schema drift represented as unexpected event names.
SELECT event_name, COUNT(*) AS rows
FROM events
WHERE event_name NOT IN ('signup', 'view_product', 'add_to_cart', 'purchase', 'refund')
GROUP BY event_name;
For flexible event systems, maintain an event taxonomy table instead of hardcoding.
19. Check that each payment maps to one order.
SELECT payment_id, COUNT(DISTINCT order_id) AS orders
FROM payments
GROUP BY payment_id
HAVING COUNT(DISTINCT order_id) <> 1;
20. Check that order item quantities are positive integers.
SELECT order_id, product_id, quantity
FROM order_items
WHERE quantity <= 0
OR quantity IS NULL;
Integer-type validation is usually handled by schema, but semantic checks still help.
21. Detect customers whose signup timestamp changes between snapshots.
SELECT customer_id, COUNT(DISTINCT signup_ts) AS signup_ts_values
FROM customer_snapshots
GROUP BY customer_id
HAVING COUNT(DISTINCT signup_ts) > 1;
Some attributes should be immutable.
22. Find metric deltas greater than expected threshold.
WITH daily AS (
SELECT metric_date, active_users
FROM daily_metrics
),
with_prev AS (
SELECT
metric_date,
active_users,
LAG(active_users) OVER (ORDER BY metric_date) AS prev_active_users
FROM daily
)
SELECT *
FROM with_prev
WHERE ABS(active_users - prev_active_users) > 100000;
Use relative thresholds for scale-sensitive metrics.
23. Validate idempotent loads do not create duplicate partitions.
SELECT load_date, event_date, COUNT(*) AS rows
FROM partition_load_audit
GROUP BY load_date, event_date
HAVING COUNT(*) > 1;
The exact audit schema varies, but the invariant is one committed output per logical partition version.
24. Model order refunds as facts or dimensions?
Refunds are facts: they are business events with their own timestamp, amount, status, and lifecycle. Modeling them as mutable attributes on orders can hide history and make reconciliation harder.
25. State the minimum contract for a curated fact table.
Declare grain, primary key, required fields, event-time semantics, accepted statuses, late-data policy, deduplication rule, source lineage, freshness SLO, and reconciliation checks.