Glossary
SQL Semantics
Section titled “SQL Semantics”| Term | Meaning |
|---|---|
| grain | What one row represents. |
| bag semantics | SQL usually preserves duplicates unless removed. |
| cardinality | Number of rows, or relationship shape between tables. |
| functional dependency | A key determines another attribute, such as order_id -> customer_id. |
| semijoin | Return left rows where a match exists, often expressed with EXISTS. |
| anti join | Return left rows where no match exists, often expressed with NOT EXISTS. |
| three-valued logic | SQL predicates can be true, false, or unknown due to NULL. |
| sargable | Predicate can use an efficient search/access path. |
| window frame | The set of rows visible to a window aggregate for each current row. |
| date spine | Calendar table used to preserve dates with zero activity. |
Modeling
Section titled “Modeling”| Term | Meaning |
|---|---|
| fact table | Events or measurements at a declared grain. |
| dimension table | Descriptive attributes for entities. |
| SCD type 2 | Dimension pattern storing historical versions with validity intervals. |
| snapshot fact | Periodic state captured at a point in time. |
| late-arriving data | Events that arrive after their event-time partition should have been processed. |
| watermark | Bound on how late data is expected before results are considered final. |
| idempotent pipeline | Re-running produces the same correct output for the same inputs. |
Optimization
Section titled “Optimization”| Term | Meaning |
|---|---|
| cost-based optimizer | Chooses plans using estimated costs from stats and metadata. |
| predicate pushdown | Applying filters as close to the scan as possible. |
| projection pushdown | Reading only needed columns. |
| partition pruning | Skipping partitions based on predicates. |
| index seek | Accessing a narrow key/range through an index. |
| hash join | Builds a hash table on one input and probes with the other. |
| merge join | Joins sorted inputs. |
| nested loop join | For each outer row, probes inner rows. |
| spill | Intermediate data exceeds memory and writes to disk. |
| row estimate | Optimizer’s estimated rows at a plan node. |
Distributed Execution
Section titled “Distributed Execution”| Term | Meaning |
|---|---|
| partition | Unit of distributed data/task parallelism. |
| shuffle | Network redistribution of data by key. |
| broadcast join | Send a small table to all workers to avoid shuffling a large table. |
| skew | Uneven key distribution causing hot partitions or straggler tasks. |
| coalesce | Reduce number of partitions, often without full shuffle. |
| repartition | Redistribute data, usually causing shuffle. |
| tiny files | Many small files causing metadata and scheduling overhead. |
| data skipping | Avoiding files/blocks based on min/max or similar metadata. |
| adaptive execution | Runtime plan changes based on observed statistics. |
ML Data
Section titled “ML Data”| Term | Meaning |
|---|---|
| prediction point | Entity and timestamp for which features and labels are generated. |
| point-in-time join | Join using only data available at prediction time. |
| feature leakage | Feature uses future or target-derived information. |
| label window | Future interval used to compute the target. |
| observation window | Past interval used to compute features. |
| training/serving skew | Offline feature values differ from online serving values. |
| calibration | Agreement between predicted probabilities and observed frequencies. |
| confusion matrix | TP/FP/FN/TN counts at a threshold. |
Interview Phrases
Section titled “Interview Phrases”| Phrase | Use it when |
|---|---|
| ”The output grain is…” | Starting any query. |
| ”This join can fan out…” | Warning about duplicate amplification. |
| ”I would use a semijoin…” | Existence check. |
| ”This needs a half-open interval…” | Time filters or validity intervals. |
| ”That is engine-dependent syntax…” | Date math, percentiles, approximate sketches, indexes. |
”I would verify with EXPLAIN…” | Performance claims. |
| ”I would separate event-time correctness from processing-time availability…” | Late data. |
| ”The dominant distributed cost is…” | Spark/Dask/lakehouse optimization. |