Skip to content

Glossary

SQL Semantics

Term	Meaning
grain	What one row represents.
bag semantics	SQL usually preserves duplicates unless removed.
cardinality	Number of rows, or relationship shape between tables.
functional dependency	A key determines another attribute, such as `order_id -> customer_id`.
semijoin	Return left rows where a match exists, often expressed with `EXISTS`.
anti join	Return left rows where no match exists, often expressed with `NOT EXISTS`.
three-valued logic	SQL predicates can be true, false, or unknown due to NULL.
sargable	Predicate can use an efficient search/access path.
window frame	The set of rows visible to a window aggregate for each current row.
date spine	Calendar table used to preserve dates with zero activity.

Modeling

Term	Meaning
fact table	Events or measurements at a declared grain.
dimension table	Descriptive attributes for entities.
SCD type 2	Dimension pattern storing historical versions with validity intervals.
snapshot fact	Periodic state captured at a point in time.
late-arriving data	Events that arrive after their event-time partition should have been processed.
watermark	Bound on how late data is expected before results are considered final.
idempotent pipeline	Re-running produces the same correct output for the same inputs.

Optimization

Term	Meaning
cost-based optimizer	Chooses plans using estimated costs from stats and metadata.
predicate pushdown	Applying filters as close to the scan as possible.
projection pushdown	Reading only needed columns.
partition pruning	Skipping partitions based on predicates.
index seek	Accessing a narrow key/range through an index.
hash join	Builds a hash table on one input and probes with the other.
merge join	Joins sorted inputs.
nested loop join	For each outer row, probes inner rows.
spill	Intermediate data exceeds memory and writes to disk.
row estimate	Optimizer’s estimated rows at a plan node.

Distributed Execution

Term	Meaning
partition	Unit of distributed data/task parallelism.
shuffle	Network redistribution of data by key.
broadcast join	Send a small table to all workers to avoid shuffling a large table.
skew	Uneven key distribution causing hot partitions or straggler tasks.
coalesce	Reduce number of partitions, often without full shuffle.
repartition	Redistribute data, usually causing shuffle.
tiny files	Many small files causing metadata and scheduling overhead.
data skipping	Avoiding files/blocks based on min/max or similar metadata.
adaptive execution	Runtime plan changes based on observed statistics.

ML Data

Term	Meaning
prediction point	Entity and timestamp for which features and labels are generated.
point-in-time join	Join using only data available at prediction time.
feature leakage	Feature uses future or target-derived information.
label window	Future interval used to compute the target.
observation window	Past interval used to compute features.
training/serving skew	Offline feature values differ from online serving values.
calibration	Agreement between predicted probabilities and observed frequencies.
confusion matrix	TP/FP/FN/TN counts at a threshold.

Interview Phrases

Phrase	Use it when
”The output grain is…”	Starting any query.
”This join can fan out…”	Warning about duplicate amplification.
”I would use a semijoin…”	Existence check.
”This needs a half-open interval…”	Time filters or validity intervals.
”That is engine-dependent syntax…”	Date math, percentiles, approximate sketches, indexes.
”I would verify with `EXPLAIN`…”	Performance claims.
”I would separate event-time correctness from processing-time availability…”	Late data.
”The dominant distributed cost is…”	Spark/Dask/lakehouse optimization.