D8LooP - Practise Data Engineering

Orientation

What You'll Master Here

Chapter 4 turns SQL from a pile of clauses into a readable plan. You will use CTEs, subqueries, derived tables, and disciplined final projection to make complex work debuggable.

The key shift is to think in named row sets. Each CTE or subquery answers one small question: which rows, which grain, which metric, which exception.

By the end you should be able to refactor a hard query into layers, choose between a CTE and a subquery deliberately, and narrate the answer like an engineer instead of a syntax collector.

Why data engineers care

Interview SQL and production SQL both reward clear reasoning. A correct answer that cannot be debugged is fragile, and a query nobody can explain is a liability.

Core mental model

A complex query is a small pipeline written inside one statement: source rows, cleaned rows, shaped rows, final answer.

Name the working set in a CTE

Input data

orders5 rows

order_id	buyer_id	status	total_amount
1001	1	paid	80
1002	3	paid	120
1003	4	pending	40
1004	1	shipped	220
1005	2	paid	540

The pending order 1003 is excluded; 9 of the 12 orders are paid or shipped.

with paid_orders as (
  select order_id, buyer_id, total_amount
  from orders
  where status in ('paid', 'shipped')
)
select *
from paid_orders
order by order_id
limit 4;

Query result4 rows

order_id	buyer_id	total_amount
1001	1	80
1002	3	120
1004	1	220
1005	2	540

Showing the first 4 of the 9 paid/shipped rows the CTE defines.

The CTE names the rows you care about; the final select just reads from it. One idea, one name.

1. raw_orders

filter and standardize source rows

2. paid_orders

keep the business population

3. country_revenue

aggregate at the reporting grain

4. final select

name columns for the reader

A CTE stack should read like a debug transcript: each layer has one job and can be selected independently while you inspect the query.

Key terms

CTE: A named temporary result set introduced with WITH and referenced by the main query.
subquery: A SELECT nested inside another query, used for filtering, thresholds, or derived row sets.
derived table: A subquery in FROM that behaves like a table for the outer query.
correlation: A subquery that references columns from the current outer-query row.

Common mistake

Cramming every transformation into one large SELECT.

The query may run, but it becomes hard to review, test, and fix when a row count changes.

Better habit

Name intermediate row sets after the job they perform.
Run each layer before adding the next.
Keep the final SELECT boring and obvious.

Interview note

A strong layered answer sounds like a calm walkthrough: “first I isolate the rows, then I aggregate, then I expose the result.”

Study tip

Use the topic menu as a checklist. Each topic is a structuring habit you should be able to demonstrate, not just recognise.

Remember this

Think in named row sets. The best complex SQL reads like a short, debuggable pipeline.

CTEs, Subqueries & Query Structure

What You'll Master Here