QUERY SYSTEMS

Data Quality, Debugging & Auditing

Chapter 08AdvancedAuditing

Orientation

What You'll Master Here

A query that runs is not the same as a query that is right. This chapter is about the second thing: proving data is trustworthy, finding where it is not, and leaving evidence that you checked.

Data quality work is a small set of repeatable checks — duplicates, missing values, broken foreign keys, totals that do not reconcile, anomalous rows — plus the discipline to express each one as a query whose result you can read at a glance.

Everything runs on the same fixtures from earlier chapters: the marketplace customers, orders, and order_items, plus the order_snapshots and events tables. They already contain realistic flaws — a missing country, repeated snapshot keys, orders with no line items, totals that disagree — which is exactly what makes them good practice.

By the end you should be able to take a vague worry like “I don’t trust this table” and turn it into a handful of concrete checks, each returning a count you can defend in a review.

Why data engineers care

Bad data is expensive precisely because it is quiet. A duplicate doubles a metric, a NULL shrinks a denominator, an orphan key drops a join — and nothing errors. The engineers who are trusted are the ones who check before they ship.

Core mental model

Every data-quality check answers one yes/no question about a table, and returns the rows (or the count of rows) that fail it. A clean table makes every check return zero.

A check is just a query that counts failures
Input data
customers6 rows
customer_idcountry
1US
2NULL
3GB
4IN
5US
6GB

Customer 2 has a NULL country; everyone else is populated.

-- "every customer should have a country" — count the ones that don't
select
  count(*) as missing_country
from customers
where country is null;
Query result1 row
missing_country
1

One failing row. The whole chapter is variations on “count what is wrong”.

The simplest data-quality check: one rule, one count. A healthy table returns 0; this one returns 1.

Completeness

Are required values present?

null / missing checks

Uniqueness

Is each entity stored once?

duplicate-key checks

Validity

Are values in range and well-formed?

range / format checks

Consistency

Do related fields agree?

cross-field rules

Referential

Do foreign keys point at real rows?

orphan / anti-join checks

Freshness

Is the data recent enough?

latency / lateness checks

Key terms
data-quality check
A query that returns the rows failing one rule, or the count of them.
audit
A collection of checks run together, with their results recorded as evidence.
reconciliation
Computing a number two independent ways and comparing them.
anomaly
A row whose value is unexpected relative to a baseline (a range, an average, a rule).

Common mistake

Treating “the query ran” as “the data is correct”.

Running is a syntax property; correctness is a data property. They are unrelated, and only the second one matters.

Better habit

  • State each rule as one sentence before writing the check.
  • Make a passing check return 0, not an empty result.
  • Run the checks before you trust any number built on the table.
Interview note

When asked “how would you validate this table?”, list the dimensions out loud — completeness, uniqueness, validity, consistency, referential integrity, freshness — then turn two or three into queries. That structure reads as senior.

Study tip

Use the topic menu as a checklist of check types. Each one is a tool you should be able to reach for by name when a table looks suspicious.

Remember this

Data quality is a set of small, countable checks. A trustworthy table is one where every check you can think of returns zero.