
Glossary

| Term | Meaning |
| --- | --- |
| grain | What one row represents. |
| bag semantics | SQL preserves duplicate rows unless they are explicitly removed (for example with DISTINCT). |
| cardinality | Number of rows, or the relationship shape between tables (one-to-one, one-to-many, many-to-many). |
| functional dependency | One attribute (or set of attributes) determines another, such as order_id -> customer_id. |
| semijoin | Return left rows where a match exists, often expressed with EXISTS. |
| anti join | Return left rows where no match exists, often expressed with NOT EXISTS. |
| three-valued logic | SQL predicates can be true, false, or unknown due to NULL. |
| sargable | The predicate is written so the engine can use an index or other efficient search/access path. |
| window frame | The set of rows visible to a window aggregate for each current row. |
| date spine | Calendar table used to preserve dates with zero activity. |
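
Several of these terms are easiest to see side by side. The following is a minimal sketch using Python's built-in `sqlite3` module; the `customers` and `orders` tables and their rows are invented for illustration. It contrasts a plain join (which changes the output grain), a semijoin, an anti join, and the way three-valued logic affects `NOT IN` when the subquery returns a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
    -- Ada has two orders; order 13 has no customer (NULL).
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2), (13, NULL);
""")

def names(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# Inner join: the grain becomes one row per order, so Ada appears twice.
joined = names("""
    SELECT c.name FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id
""")

# Semijoin: the grain stays one row per customer.
semi = names("""
    SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
    ORDER BY c.customer_id
""")

# Anti join: customers with no matching order.
anti = names("""
    SELECT c.name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
    ORDER BY c.customer_id
""")

# NOT IN over a set containing NULL: the predicate is never true
# (it is false or unknown), so no rows come back.
not_in = names("""
    SELECT c.name FROM customers c
    WHERE c.customer_id NOT IN (SELECT customer_id FROM orders)
""")

print(joined, semi, anti, not_in)
```

In this sketch the join returns Ada twice, the semijoin returns each matching customer once, `NOT EXISTS` finds the customer without orders, and `NOT IN` returns nothing because of the NULL in `orders.customer_id`.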

| Term | Meaning |
| --- | --- |
| fact table | Events or measurements at a declared grain. |
| dimension table | Descriptive attributes for entities. |
| SCD type 2 | Dimension pattern storing historical versions with validity intervals. |
| snapshot fact | Periodic state captured at a point in time. |
| late-arriving data | Events that arrive after their event-time partition was due to be processed. |
| watermark | Bound on how late data is expected before results are considered final. |
| idempotent pipeline | Re-running produces the same correct output for the same inputs. |
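
As an illustration of the SCD type 2 entry, here is a minimal sketch of an as-of lookup against versioned dimension rows. The row layout, the `tier` attribute, and the use of `date.max` as the open-ended end date are assumptions made for this example, not a prescribed schema. Validity intervals are treated as half-open (`valid_from <= d < valid_to`) so that adjacent versions never overlap.

```python
from datetime import date

# Hypothetical SCD type 2 rows: (customer_id, tier, valid_from, valid_to).
# Intervals are half-open; date.max marks the current, open-ended version.
customer_dim = [
    (1, "basic", date(2023, 1, 1), date(2023, 6, 1)),
    (1, "premium", date(2023, 6, 1), date.max),
]

def tier_as_of(customer_id, as_of):
    """Return the tier valid on `as_of`, or None if no version covers it."""
    matches = [
        tier
        for cid, tier, valid_from, valid_to in customer_dim
        if cid == customer_id and valid_from <= as_of < valid_to
    ]
    if len(matches) > 1:
        # Overlapping versions would make the lookup ambiguous.
        raise ValueError("overlapping validity intervals")
    return matches[0] if matches else None

print(tier_as_of(1, date(2023, 5, 31)))  # last day of the first version
print(tier_as_of(1, date(2023, 6, 1)))   # first day of the second version
```

Because the boundary date belongs to exactly one version, a fact row dated 2023-06-01 joins to a single dimension row rather than two.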

| Term | Meaning |
| --- | --- |
| cost-based optimizer | Chooses plans using estimated costs from stats and metadata. |
| predicate pushdown | Applying filters as close to the scan as possible. |
| projection pushdown | Reading only needed columns. |
| partition pruning | Skipping partitions based on predicates. |
| index seek | Accessing a narrow key/range through an index. |
| hash join | Builds a hash table on one input and probes with the other. |
| merge join | Joins sorted inputs. |
| nested loop join | For each outer row, probes inner rows. |
| spill | Intermediate data exceeds memory and is written to disk. |
| row estimate | Optimizer’s estimated rows at a plan node. |
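
The build-and-probe structure of a hash join can be sketched in a few lines. This is a simplified in-memory illustration, not how any particular engine implements it; the function name `hash_join` and the sample rows are invented for the example.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner equi-join: build a hash table on one input, probe with the other."""
    # Build phase: bucket the (ideally smaller) input by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: look up each probe row's key and emit merged matches.
    for probe in probe_rows:
        for match in table.get(probe[probe_key], []):
            yield {**match, **probe}

customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},  # no matching customer
]

result = list(hash_join(customers, orders, "customer_id", "customer_id"))
print(result)
```

The build side has to fit in memory for this to work as written; when it does not, a real engine partitions the inputs and writes them to disk, which is the spill case in the table above.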

| Term | Meaning |
| --- | --- |
| partition | Unit of distributed data/task parallelism. |
| shuffle | Network redistribution of data by key. |
| broadcast join | Send a small table to all workers to avoid shuffling a large table. |
| skew | Uneven key distribution causing hot partitions or straggler tasks. |
| coalesce | Reduce number of partitions, often without full shuffle. |
| repartition | Redistribute data, usually causing shuffle. |
| tiny files | Many small files causing metadata and scheduling overhead. |
| data skipping | Avoiding files/blocks based on min/max or similar metadata. |
| adaptive execution | Runtime plan changes based on observed statistics. |
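
To make skew concrete, the sketch below simulates assigning rows to partitions by key, roughly as a shuffle would. It runs on a single machine and uses `key % num_partitions` as a stand-in for a real partitioner; the key distributions are made up for illustration.

```python
def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition under modulo partitioning."""
    sizes = [0] * num_partitions
    for key in keys:
        sizes[key % num_partitions] += 1
    return sizes

def skew_ratio(sizes):
    """Largest partition relative to the mean; 1.0 means perfectly even."""
    return max(sizes) / (sum(sizes) / len(sizes))

uniform_keys = list(range(1000))             # every key appears once
skewed_keys = [7] * 900 + list(range(100))   # one hot key dominates

uniform = partition_sizes(uniform_keys, 4)
skewed = partition_sizes(skewed_keys, 4)

print(uniform, skew_ratio(uniform))
print(skewed, skew_ratio(skewed))
```

With the hot key, one partition holds most of the rows, so the task that processes it becomes the straggler while the other tasks finish early.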

| Term | Meaning |
| --- | --- |
| prediction point | Entity and timestamp for which features and labels are generated. |
| point-in-time join | Join using only data available at prediction time. |
| feature leakage | Feature uses future or target-derived information. |
| label window | Future interval used to compute the target. |
| observation window | Past interval used to compute features. |
| training/serving skew | Offline feature values differ from online serving values. |
| calibration | Agreement between predicted probabilities and observed frequencies. |
| confusion matrix | TP/FP/FN/TN counts at a threshold. |
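
A point-in-time join can be sketched with a sorted feature history and a binary search. The data, the integer timestamps, and the helper name `feature_as_of` are hypothetical; the sketch also assumes that a value stamped exactly at the prediction time is already available, which may not hold in a real pipeline.

```python
from bisect import bisect_right

# Hypothetical feature history for one entity: (timestamp, value), sorted by timestamp.
feature_history = [(1, 10.0), (5, 12.0), (9, 30.0)]
timestamps = [t for t, _ in feature_history]

def feature_as_of(prediction_time):
    """Latest value with timestamp <= prediction_time, or None if none exists yet."""
    idx = bisect_right(timestamps, prediction_time)
    return feature_history[idx - 1][1] if idx > 0 else None

# Point-in-time correct: only data known at t=6 is used.
correct = feature_as_of(6)

# Leaky alternative: taking the latest value overall pulls in data from t=9.
leaky = feature_history[-1][1]

print(correct, leaky)
```

Training on the leaky value would give the model information from after the prediction point, which is the feature leakage case defined above.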

| Phrase | Use it when |
| --- | --- |
| “The output grain is…” | Starting any query. |
| “This join can fan out…” | Warning about duplicate amplification. |
| “I would use a semijoin…” | Existence check. |
| “This needs a half-open interval…” | Time filters or validity intervals. |
| “That is engine-dependent syntax…” | Date math, percentiles, approximate sketches, indexes. |
| “I would verify with EXPLAIN…” | Performance claims. |
| “I would separate event-time correctness from processing-time availability…” | Late data. |
| “The dominant distributed cost is…” | Spark/Dask/lakehouse optimization. |
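
As one way to back up the “half-open interval” and “verify with EXPLAIN” phrases, the sketch below compares two ways of filtering a month of data in SQLite. The `orders` table and index name are invented for the example, and the plan text is specific to SQLite and varies by version; other engines expose plans through their own EXPLAIN syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, created_at TEXT, amount REAL);
    CREATE INDEX idx_orders_created_at ON orders (created_at);
""")

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Half-open range on the bare column: sargable, so the index can be searched.
half_open = plan("""
    SELECT amount FROM orders
    WHERE created_at >= '2024-01-01' AND created_at < '2024-02-01'
""")

# Function wrapped around the column: the index on created_at cannot be searched.
wrapped = plan("""
    SELECT amount FROM orders
    WHERE substr(created_at, 1, 7) = '2024-01'
""")

print(half_open)
print(wrapped)
```

In SQLite the first plan should report a SEARCH using the index and the second a full SCAN, which is the kind of evidence worth citing before making a performance claim.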