D8LooP - Practise Data Engineering

Orientation

What You'll Master Here

Real data almost never lives in one table. A customer, their orders, and the items inside each order are stored separately and connected by keys. A join is how you put them back together to answer a question.

This chapter teaches joins as relationship reasoning, not syntax memorisation: pick the right join type, predict how many rows come back, recognise when a join quietly multiplies or drops rows, and prove your result with counts.

We stay on the same marketplace dataset from Chapter 1 — customers, orders, and order_items — so your attention is on how the tables relate, not on a new business domain.

By the end you should be able to look at two tables, name the relationship between them, choose INNER vs LEFT deliberately, and explain in plain language why the row count is what it is.

Why data engineers care

Nearly every metric a data engineer ships requires a join. The expensive mistakes are silent: an INNER JOIN that drops valid rows, or a one-to-many join that doubles revenue while still looking plausible.

Core mental model

A join answers one question: for each row on one side, which rows on the other side relate to it — and what should happen when none do?

A join adds related columns from another table

Input data

orders4 rows

order_id	buyer_id	total_amount
1001	1	80.00
1002	3	120.00
1003	4	40.00
1004	1	220.00

buyer_id is the foreign key pointing at customers.customer_id.

customers3 rows

customer_id	country
1	US
3	GB
4	IN

select
  o.order_id,
  o.total_amount,
  c.country as buyer_country
from orders as o
join customers as c
  on c.customer_id = o.buyer_id
order by o.order_id
limit 4;

Query result4 rows

order_id	total_amount	buyer_country
1001	80.00	US
1002	120.00	GB
1003	40.00	IN
1004	220.00	US

orders.buyer_id points at customers.customer_id; the join follows that pointer to attach country.

Each order keeps its own columns and gains the buyer’s country from the related customers row.

customers

one row per customer

key: customer_id

orders

one row per order

key: order_id (buyer_id → customers)

order_items

one row per item line

key: order_id + sku (order_id → orders)

One customer has many orders; one order has many item lines. The foreign key always lives on the “many” side: orders.buyer_id points back to customers, order_items.order_id points back to orders.

Key terms

join: Combining rows from two tables based on a matching condition (usually a key).
key: The column(s) you match on, e.g. customer_id linking orders back to customers.
cardinality: How many rows on one side relate to a row on the other: one-to-one, one-to-many, or many-to-many.
match: A pair of rows whose join condition is true. Rows with no match are the heart of LEFT/anti joins.

Common mistake

Thinking a join only adds columns.

A join can also add rows (one-to-many) or remove rows (INNER with no match). It changes the row set, not just the width.

Better habit

Name the relationship before writing the join.
Predict the row count before you run it.
Decide what should happen to rows that have no match.

Interview note

Strong join answers start with the relationship: “Each order belongs to one customer, so I join orders to customers on buyer_id, and the order grain is preserved.”

Study tip

Use the topic menu on the left as a checklist. Each topic is a join habit you should be able to demonstrate, not just recognise.

Remember this

A join is relationship reasoning. Before syntax, know the relationship, the join key, and what happens when a row has no match.

Joins & Relationship Reasoning

What You'll Master Here