Orientation
What You'll Master Here
Real data almost never lives in one table. A customer, their orders, and the items inside each order are stored separately and connected by keys. A join is how you put them back together to answer a question.
This chapter teaches joins as relationship reasoning, not syntax memorisation: pick the right join type, predict how many rows come back, recognise when a join quietly multiplies or drops rows, and prove your result with counts.
We stay on the same marketplace dataset from Chapter 1 — customers, orders, and order_items — so your attention is on how the tables relate, not on a new business domain.
By the end you should be able to look at two tables, name the relationship between them, choose INNER vs LEFT deliberately, and explain in plain language why the row count is what it is.
Why data engineers care
Nearly every metric a data engineer ships requires a join. The expensive mistakes are silent: an INNER JOIN that drops valid rows, or a one-to-many join that doubles revenue while still looking plausible.
One customer has many orders; one order has many item lines. The foreign key always lives on the “many” side: orders.buyer_id points back to customers, order_items.order_id points back to orders.
Common mistake
Thinking a join only adds columns.
A join can also add rows (one-to-many) or remove rows (INNER with no match). It changes the row set, not just the width.
Better habit
- Name the relationship before writing the join.
- Predict the row count before you run it.
- Decide what should happen to rows that have no match.
Strong join answers start with the relationship: “Each order belongs to one customer, so I join orders to customers on buyer_id, and the order grain is preserved.”
Use the topic menu on the left as a checklist. Each topic is a join habit you should be able to demonstrate, not just recognise.
