QUERY SYSTEMS

CTEs, Subqueries & Query Structure

Chapter 04MediumStructure

Orientation

What You'll Master Here

Chapter 4 turns SQL from a pile of clauses into a readable plan. You will use CTEs, subqueries, derived tables, and disciplined final projection to make complex work debuggable.

The key shift is to think in named row sets. Each CTE or subquery answers one small question: which rows, which grain, which metric, which exception.

By the end you should be able to refactor a hard query into layers, choose between a CTE and a subquery deliberately, and narrate the answer like an engineer instead of a syntax collector.

Why data engineers care

Interview SQL and production SQL both reward clear reasoning. A correct answer that cannot be debugged is fragile, and a query nobody can explain is a liability.

Core mental model

A complex query is a small pipeline written inside one statement: source rows, cleaned rows, shaped rows, final answer.

Name the working set in a CTE
Input data
orders5 rows
order_idbuyer_idstatustotal_amount
10011paid80
10023paid120
10034pending40
10041shipped220
10052paid540

The pending order 1003 is excluded; 9 of the 12 orders are paid or shipped.

with paid_orders as (
  select order_id, buyer_id, total_amount
  from orders
  where status in ('paid', 'shipped')
)
select *
from paid_orders
order by order_id
limit 4;
Query result4 rows
order_idbuyer_idtotal_amount
1001180
10023120
10041220
10052540

Showing the first 4 of the 9 paid/shipped rows the CTE defines.

The CTE names the rows you care about; the final select just reads from it. One idea, one name.

1. raw_orders

filter and standardize source rows

2. paid_orders

keep the business population

3. country_revenue

aggregate at the reporting grain

4. final select

name columns for the reader

A CTE stack should read like a debug transcript: each layer has one job and can be selected independently while you inspect the query.

Key terms
CTE
A named temporary result set introduced with WITH and referenced by the main query.
subquery
A SELECT nested inside another query, used for filtering, thresholds, or derived row sets.
derived table
A subquery in FROM that behaves like a table for the outer query.
correlation
A subquery that references columns from the current outer-query row.

Common mistake

Cramming every transformation into one large SELECT.

The query may run, but it becomes hard to review, test, and fix when a row count changes.

Better habit

  • Name intermediate row sets after the job they perform.
  • Run each layer before adding the next.
  • Keep the final SELECT boring and obvious.
Interview note

A strong layered answer sounds like a calm walkthrough: “first I isolate the rows, then I aggregate, then I expose the result.”

Study tip

Use the topic menu as a checklist. Each topic is a structuring habit you should be able to demonstrate, not just recognise.

Remember this

Think in named row sets. The best complex SQL reads like a short, debuggable pipeline.