D8LooP - Practise Data Engineering

Orientation

What You'll Master Here

A correct query can still be a bad query. On a cloud warehouse, the same answer can cost a few cents or many dollars depending on how much data it scans — and the difference is almost entirely under your control.

This chapter is about that control: how columnar storage, partitioning, and clustering decide what gets read, how to read a query plan, and how to reason about scan cost the way a warehouse bills it.

The mental shift is from “does it run?” to “how much does it read?”. On engines like BigQuery, Snowflake, and Redshift, bytes scanned — not rows returned — is the unit of both time and money.

To keep everything runnable, the Practice Lab works on two small metadata tables you would really query to tune a warehouse: table_partitions (per-partition row and byte counts) and query_runs (a query-history log with bytes scanned and partitions touched).

Why data engineers care

Performance is a feature and cost is a budget. The engineer who can cut a daily report from 200 GB to 5 GB scanned saves real money every day and makes every dashboard faster — without changing a single number.

Core mental model

A warehouse query costs what it scans. Optimization is the art of scanning fewer columns and fewer partitions while returning the same answer.

Two queries, same partitions, very different cost

Input data

query_runs2 rows

query_id	query_label	partitions_scanned	gb_scanned
1	daily_active_users	8	12
2	orders_export	8	240

Both scan all 8 partitions; query 2 (an export with SELECT *) reads 240 GB vs 12 GB.

-- both touch all 8 partitions, but one reads far more data
select
  query_label,
  partitions_scanned,
  gb_scanned
from query_runs
where query_id in (1, 2)
order by query_id;

Query result2 rows

query_label	partitions_scanned	gb_scanned
daily_active_users	8	12
orders_export	8	240

Same partitions, 20× the bytes. Partitions are not the whole story — columns scanned matter just as much.

query_runs is a query-history log. These two queries scan the same number of partitions, yet one reads 20× the bytes — the difference is how many columns it touched.

on-demand cost is driven by bytes scanned, not rows returned

SELECT two columns, one day1.5 GB

SELECT *, one day12 GB

SELECT *, all partitions30 GB

The same logical answer can cost 1.5 GB or 30 GB depending on how many columns and partitions it scans. Pruning columns and partitions is the single biggest lever on the bill.

Key terms

columnar storage: Data stored column-by-column, so a query only reads the columns it names.
partition: A physical slice of a table (often one per day) that can be skipped if a filter excludes it.
pruning: Skipping partitions or columns the query cannot possibly need, so they are never read.
bytes scanned: The amount of data a query reads — the basis of on-demand cost and most of its runtime.

Common mistake

Judging a query by how many rows it returns.

A query returning 10 rows can scan 200 GB to get them. Cost tracks bytes read, not the size of the result.

Better habit

Ask “how much will this scan?” before running a query on a large table.
Treat columns and partitions as the two cost levers.
Measure in bytes scanned, not rows returned.

Interview note

When asked to optimize a query, start with “what is it scanning?” — name the partitions and columns it reads. That framing beats jumping straight to indexes or rewrites.

Study tip

Use the topic menu as a cost checklist: columns, partitions, clustering, the plan, and the bill. Each topic is a lever you can pull to scan less.

Remember this

A warehouse query costs what it scans. Optimization means reading fewer columns and partitions for the same answer — measured in bytes, not rows.

Warehouse Performance & Cost

What You'll Master Here