D8LooP - Practise Data Engineering

Orientation

What You'll Master Here

A query that runs is not the same as a query that is right. This chapter is about the second thing: proving data is trustworthy, finding where it is not, and leaving evidence that you checked.

Data quality work is a small set of repeatable checks — duplicates, missing values, broken foreign keys, totals that do not reconcile, anomalous rows — plus the discipline to express each one as a query whose result you can read at a glance.

Everything runs on the same fixtures from earlier chapters: the marketplace customers, orders, and order_items, plus the order_snapshots and events tables. They already contain realistic flaws — a missing country, repeated snapshot keys, orders with no line items, totals that disagree — which is exactly what makes them good practice.

By the end you should be able to take a vague worry like “I don’t trust this table” and turn it into a handful of concrete checks, each returning a count you can defend in a review.

Why data engineers care

Bad data is expensive precisely because it is quiet. A duplicate doubles a metric, a NULL shrinks a denominator, an orphan key drops a join — and nothing errors. The engineers who are trusted are the ones who check before they ship.

Core mental model

Every data-quality check answers one yes/no question about a table, and returns the rows (or the count of rows) that fail it. A clean table makes every check return zero.

A check is just a query that counts failures

Input data

customers6 rows

customer_id	country
1	US
2	NULL
3	GB
4	IN
5	US
6	GB

Customer 2 has a NULL country; everyone else is populated.

-- "every customer should have a country" — count the ones that don't
select
  count(*) as missing_country
from customers
where country is null;

Query result1 row

missing_country
1

One failing row. The whole chapter is variations on “count what is wrong”.

The simplest data-quality check: one rule, one count. A healthy table returns 0; this one returns 1.

Completeness

Are required values present?

null / missing checks

Uniqueness

Is each entity stored once?

duplicate-key checks

Validity

Are values in range and well-formed?

range / format checks

Consistency

Do related fields agree?

cross-field rules

Referential

Do foreign keys point at real rows?

orphan / anti-join checks

Freshness

Is the data recent enough?

latency / lateness checks

Key terms

data-quality check: A query that returns the rows failing one rule, or the count of them.
audit: A collection of checks run together, with their results recorded as evidence.
reconciliation: Computing a number two independent ways and comparing them.
anomaly: A row whose value is unexpected relative to a baseline (a range, an average, a rule).

Common mistake

Treating “the query ran” as “the data is correct”.

Running is a syntax property; correctness is a data property. They are unrelated, and only the second one matters.

Better habit

State each rule as one sentence before writing the check.
Make a passing check return 0, not an empty result.
Run the checks before you trust any number built on the table.

Interview note

When asked “how would you validate this table?”, list the dimensions out loud — completeness, uniqueness, validity, consistency, referential integrity, freshness — then turn two or three into queries. That structure reads as senior.

Study tip

Use the topic menu as a checklist of check types. Each one is a tool you should be able to reach for by name when a table looks suspicious.

Remember this

Data quality is a set of small, countable checks. A trustworthy table is one where every check you can think of returns zero.

Data Quality, Debugging & Auditing

What You'll Master Here