QUERY SYSTEMS

Aggregation & Business Metrics

Chapter 03EasyFoundations

Orientation

What You'll Master Here

Aggregation is how raw rows become business numbers. Instead of returning one row per order, you collapse many rows into a single summary per group: revenue per status, orders per buyer, active customers per country.

This chapter teaches aggregation as metric reasoning, not just syntax: choose the right aggregate, group at the right grain, filter groups with HAVING, build several metrics in one pass with conditional aggregation, and define a metric you can defend.

Everything runs on the same marketplace dataset from earlier chapters — customers, orders, and order_items — so the focus stays on how a number is computed and what it actually means.

By the end you should be able to turn a vague request like “how is revenue doing?” into a precise, grouped query, and explain its numerator, denominator, and grain in plain language.

Why data engineers care

Almost every dashboard, KPI, and interview question is an aggregate. The dangerous mistakes are quiet: the wrong grain doubles a total, a NULL group hides rows, or an integer division silently returns zero.

Core mental model

A GROUP BY query answers: for each group, what is the one number that summarizes it — and is that number measuring what I think it is?

One row per status: count and revenue
Input data
orders6 rows
order_idstatustotal_amount
1001paid80
1002paid120
1003pending40
1004shipped220
1005paid540
1006refunded60

6 of the 12 orders shown. paid: 1001, 1002, 1005, 1007, 1010, 1011.

select
  status,
  count(*)          as orders,
  sum(total_amount) as revenue
from orders
group by status
order by revenue desc;
Query result4 rows
statusordersrevenue
paid61585
shipped3445
pending270
refunded160

The four group revenues add back to 2160, the SUM(total_amount) over all 12 orders.

Twelve order rows collapse into one summary row per status. Each output row describes a whole group.

orders (many rows)

1001 · paid

1002 · paid

1003 · pending

1004 · shipped

1006 · refunded

… 12 rows total

GROUP BY status

one row per status

paidcount = 6

shippedcount = 3

pendingcount = 2

refundedcount = 1

GROUP BY collapses every row that shares a status into a single summary row. Each output row is one group, and the aggregate (count, sum, avg) describes that whole group.

Key terms
aggregate function
A function that turns many rows into one value: COUNT, SUM, AVG, MIN, MAX.
GROUP BY
Splits rows into groups; the query returns one summary row per group.
grain (of a metric)
What one output row represents: one status, one buyer, one (buyer, day) pair.
metric definition
The exact rule for a number: its numerator, denominator, filters, and grain.

Common mistake

Reporting a total without saying what one row of the result represents.

A number with no stated grain is easy to misread — “revenue” per status, per buyer, and per day are very different things.

Better habit

  • Name the group grain before writing the query.
  • Check that group totals reconcile to the overall total.
  • Say the metric out loud: “revenue per status”.
Interview note

Strong aggregation answers start with the grain: “I want one row per status, with the count and the summed amount.” That sentence prevents most metric bugs.

Study tip

Use the topic menu as a checklist. Each topic is a metric habit you should be able to demonstrate, not just recognise.

Remember this

Aggregation collapses rows into one number per group. Before syntax, know the group grain and what the number means.