Orientation
What You'll Master Here
A query that runs is not the same as a query that is right. This chapter is about the second thing: proving data is trustworthy, finding where it is not, and leaving evidence that you checked.
Data quality work is a small set of repeatable checks — duplicates, missing values, broken foreign keys, totals that do not reconcile, anomalous rows — plus the discipline to express each one as a query whose result you can read at a glance.
Everything runs on the same fixtures from earlier chapters: the marketplace customers, orders, and order_items, plus the order_snapshots and events tables. They already contain realistic flaws — a missing country, repeated snapshot keys, orders with no line items, totals that disagree — which is exactly what makes them good practice.
By the end you should be able to take a vague worry like “I don’t trust this table” and turn it into a handful of concrete checks, each returning a count you can defend in a review.
Why data engineers care
Bad data is expensive precisely because it is quiet. A duplicate doubles a metric, a NULL shrinks a denominator, an orphan key drops a join — and nothing errors. The engineers who are trusted are the ones who check before they ship.
Completeness
Are required values present?
null / missing checks
Uniqueness
Is each entity stored once?
duplicate-key checks
Validity
Are values in range and well-formed?
range / format checks
Consistency
Do related fields agree?
cross-field rules
Referential
Do foreign keys point at real rows?
orphan / anti-join checks
Freshness
Is the data recent enough?
latency / lateness checks
Common mistake
Treating “the query ran” as “the data is correct”.
Running is a syntax property; correctness is a data property. They are unrelated, and only the second one matters.
Better habit
- State each rule as one sentence before writing the check.
- Make a passing check return 0, not an empty result.
- Run the checks before you trust any number built on the table.
When asked “how would you validate this table?”, list the dimensions out loud — completeness, uniqueness, validity, consistency, referential integrity, freshness — then turn two or three into queries. That structure reads as senior.
Use the topic menu as a checklist of check types. Each one is a tool you should be able to reach for by name when a table looks suspicious.
