Orientation
What You'll Master Here
Aggregation is how raw rows become business numbers. Instead of returning one row per order, you collapse many rows into a single summary per group: revenue per status, orders per buyer, active customers per country.
This chapter teaches aggregation as metric reasoning, not just syntax: choose the right aggregate, group at the right grain, filter groups with HAVING, build several metrics in one pass with conditional aggregation, and define a metric you can defend.
Everything runs on the same marketplace dataset from earlier chapters — customers, orders, and order_items — so the focus stays on how a number is computed and what it actually means.
By the end you should be able to turn a vague request like “how is revenue doing?” into a precise, grouped query, and explain its numerator, denominator, and grain in plain language.
Why data engineers care
Almost every dashboard, KPI, and interview question is an aggregate. The dangerous mistakes are quiet: the wrong grain doubles a total, a NULL group hides rows, or an integer division silently returns zero.
orders (many rows)
1001 ·
1002 ·
1003 ·
1004 ·
1006 ·
… 12 rows total
GROUP BY collapses every row that shares a status into a single summary row. Each output row is one group, and the aggregate (count, sum, avg) describes that whole group.
Common mistake
Reporting a total without saying what one row of the result represents.
A number with no stated grain is easy to misread — “revenue” per status, per buyer, and per day are very different things.
Better habit
- Name the group grain before writing the query.
- Check that group totals reconcile to the overall total.
- Say the metric out loud: “revenue per status”.
Strong aggregation answers start with the grain: “I want one row per status, with the count and the summed amount.” That sentence prevents most metric bugs.
Use the topic menu as a checklist. Each topic is a metric habit you should be able to demonstrate, not just recognise.
