Orientation
What You'll Master Here
A correct query can still be a bad query. On a cloud warehouse, the same answer can cost a few cents or many dollars depending on how much data it scans — and the difference is almost entirely under your control.
This chapter is about that control: how columnar storage, partitioning, and clustering decide what gets read, how to read a query plan, and how to reason about scan cost the way a warehouse bills it.
The mental shift is from “does it run?” to “how much does it read?”. On engines like BigQuery, Snowflake, and Redshift, bytes scanned — not rows returned — is the unit of both time and money.
To keep everything runnable, the Practice Lab works on two small metadata tables you would really query to tune a warehouse: table_partitions (per-partition row and byte counts) and query_runs (a query-history log with bytes scanned and partitions touched).
Why data engineers care
Performance is a feature and cost is a budget. The engineer who can cut a daily report from 200 GB to 5 GB scanned saves real money every day and makes every dashboard faster — without changing a single number.
on-demand cost is driven by bytes scanned, not rows returned
The same logical answer can cost 1.5 GB or 30 GB depending on how many columns and partitions it scans. Pruning columns and partitions is the single biggest lever on the bill.
Common mistake
Judging a query by how many rows it returns.
A query returning 10 rows can scan 200 GB to get them. Cost tracks bytes read, not the size of the result.
Better habit
- Ask “how much will this scan?” before running a query on a large table.
- Treat columns and partitions as the two cost levers.
- Measure in bytes scanned, not rows returned.
When asked to optimize a query, start with “what is it scanning?” — name the partitions and columns it reads. That framing beats jumping straight to indexes or rewrites.
Use the topic menu as a cost checklist: columns, partitions, clustering, the plan, and the bill. Each topic is a lever you can pull to scan less.
