Orientation
What You'll Master Here
Pure Python batch transforms are the middle ground between tiny row helpers and full analytical engines. They fit when the input is small enough to hold in memory, is bounded by chunking, or has already been sampled for a control-plane task.
This chapter teaches the patterns behind real batch jobs: map, filter, group, join, dedupe, aggregate, sorted scan, and window-like state.
The goal is not to avoid pandas or SQL. The goal is to know the core data movement so you can choose the right engine and still reason about correctness.
Why data engineers care
Many pipeline bugs are pattern bugs: fanout joins, wrong dedupe winners, unreported rejects, and metrics computed at the wrong grain.
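As one illustration of a pattern bug, the sketch below shows a fanout join: a duplicated key on the lookup side silently multiplies rows and inflates a revenue total. The sample rows and the helper names `duplicate_keys` and `naive_join` are hypothetical, chosen only for this example.

```python
from collections import Counter

# Hypothetical sample data: customer "c1" appears twice on the lookup side.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 10},
    {"order_id": 2, "customer_id": "c2", "amount": 5},
]
customers = [
    {"customer_id": "c1", "segment": "retail"},
    {"customer_id": "c1", "segment": "wholesale"},  # duplicate business key
    {"customer_id": "c2", "segment": "retail"},
]


def duplicate_keys(rows, key):
    """Return the keys that appear more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def naive_join(left, right, key):
    """Emit one output row per matching pair, so duplicate keys fan out."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


joined = naive_join(orders, customers, "customer_id")
fanout_keys = duplicate_keys(customers, "customer_id")

true_revenue = sum(o["amount"] for o in orders)      # computed at order grain
joined_revenue = sum(j["amount"] for j in joined)    # order 1 is counted twice
```

Checking the lookup side for duplicate keys before joining, and comparing row counts before and after, is usually enough to catch this class of bug early.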
Core mental model
A batch transform turns one bounded input set into one or more output sets plus evidence.
- raw batch: bounded input records
- normalize: typed fields
- classify: accepted / rejected
- aggregate: metrics or outputs
- report: counts reconcile
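The stages above can be sketched as a single function that returns outputs plus evidence. This is a minimal sketch, assuming raw rows are dictionaries of strings with hypothetical `customer_id` and `amount` fields; the function names `normalize` and `run_batch` are illustrative, not part of any library.

```python
def normalize(raw):
    """Turn raw string fields into typed fields; raise on bad input."""
    customer_id = raw["customer_id"].strip()
    if not customer_id:
        raise ValueError("missing customer_id")
    return {"customer_id": customer_id, "amount": float(raw["amount"])}


def run_batch(raw_rows):
    """Normalize, classify, aggregate, and report on one bounded batch."""
    accepted, rejected = [], []
    for raw in raw_rows:
        try:
            accepted.append(normalize(raw))
        except (KeyError, ValueError, TypeError, AttributeError) as exc:
            # Keep the rejected row and the reason instead of dropping it.
            rejected.append({"row": raw, "reason": str(exc)})

    # Aggregate at the grain "one output row per customer".
    totals = {}
    for row in accepted:
        key = row["customer_id"]
        totals[key] = totals.get(key, 0.0) + row["amount"]

    report = {
        "input": len(raw_rows),
        "accepted": len(accepted),
        "rejected": len(rejected),
    }
    # Evidence: every input row is either accepted or rejected.
    assert report["input"] == report["accepted"] + report["rejected"]
    return totals, rejected, report


# Hypothetical sample batch with two bad rows.
raw_rows = [
    {"customer_id": "c1", "amount": "10.5"},
    {"customer_id": " c1 ", "amount": "2"},
    {"customer_id": "", "amount": "3"},
    {"customer_id": "c2", "amount": "oops"},
]
totals, rejected, report = run_batch(raw_rows)
```

Returning the rejects and the report together with the totals lets the caller verify the batch without re-reading the input.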
- bounded batch: A dataset small enough to materialize safely or a deliberately limited chunk.
- business key: The field or tuple that identifies the entity you are deduping or joining.
- grain: What one output row represents, such as one customer, one day, or one customer-day.
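To make grain concrete, the sketch below aggregates the same input at two different grains by changing only the grouping key. The sample rows, field names, and the helper `revenue_by` are hypothetical.

```python
from collections import defaultdict

# Hypothetical order rows; the input grain is one row per order.
orders = [
    {"customer_id": "c1", "day": "2024-01-01", "amount": 10.0},
    {"customer_id": "c1", "day": "2024-01-01", "amount": 5.0},
    {"customer_id": "c1", "day": "2024-01-02", "amount": 7.0},
    {"customer_id": "c2", "day": "2024-01-01", "amount": 3.0},
]


def revenue_by(rows, grain):
    """Sum amount per key, where grain is a tuple of field names."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[field] for field in grain)
        totals[key] += row["amount"]
    return dict(totals)


# One output row per customer-day.
per_customer_day = revenue_by(orders, ("customer_id", "day"))
# One output row per customer.
per_customer = revenue_by(orders, ("customer_id",))
```

The grouping tuple is the grain: naming it explicitly makes it harder to compute a metric at the wrong level by accident.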
Common mistake
Using pure Python for an unbounded warehouse-scale join.
The code may work on samples and then fail when the in-memory state outgrows what the process can hold.
Better habit
- State the input size assumption before materializing.
- Name the output grain before aggregating.
- Return rejected rows and summary counts alongside outputs.
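One way to state the input size assumption in code is to fail loudly when a batch exceeds it. The sketch below is an assumption-laden example: the limit `MAX_ROWS`, the exception `BatchTooLarge`, and the function `materialize_bounded` are hypothetical names, and the right limit depends on row size and available memory.

```python
from itertools import islice

MAX_ROWS = 100_000  # hypothetical bound; tune to row size and memory


class BatchTooLarge(Exception):
    """Raised when the input exceeds the assumed batch size."""


def materialize_bounded(rows, max_rows=MAX_ROWS):
    """Read an iterable into a list, refusing inputs above max_rows."""
    # Read at most one row past the limit so an oversized input is
    # detected without consuming the whole stream.
    batch = list(islice(rows, max_rows + 1))
    if len(batch) > max_rows:
        raise BatchTooLarge(f"input exceeds the assumed bound of {max_rows} rows")
    return batch


small = materialize_bounded(iter(range(3)), max_rows=3)
```

An explicit failure at the boundary is easier to diagnose than a job that slows down or runs out of memory partway through.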
I would use pure Python when the batch is bounded, then choose dictionaries for lookup joins, defaultdict for grouping, Counter for tallies, and sorted scans for latest-row logic.
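The tool choices named above might look like the following on a small bounded batch. This is a sketch under stated assumptions: the rows and field names are hypothetical, and `updated_at` is assumed to be an ISO-8601 date string so that string order matches time order.

```python
from collections import Counter, defaultdict

# Hypothetical lookup side, keyed by the business key.
segments = {"c1": "retail", "c2": "wholesale"}

# Hypothetical event rows; "c9" has no match on the lookup side.
events = [
    {"customer_id": "c1", "updated_at": "2024-01-05", "status": "paused"},
    {"customer_id": "c1", "updated_at": "2024-01-02", "status": "active"},
    {"customer_id": "c2", "updated_at": "2024-01-03", "status": "active"},
    {"customer_id": "c9", "updated_at": "2024-01-04", "status": "active"},
]

# Sorted scan for latest-row logic: after sorting by (key, timestamp),
# the last row seen for each key overwrites earlier ones.
latest = {}
for row in sorted(events, key=lambda r: (r["customer_id"], r["updated_at"])):
    latest[row["customer_id"]] = row

# Dictionary lookup join: one probe per row; unmatched rows are kept.
joined, unmatched = [], []
for key, row in latest.items():
    segment = segments.get(key)
    if segment is None:
        unmatched.append(row)
    else:
        joined.append({**row, "segment": segment})

# defaultdict for grouping joined rows by segment.
by_segment = defaultdict(list)
for row in joined:
    by_segment[row["segment"]].append(row["customer_id"])

# Counter for tallies.
status_counts = Counter(row["status"] for row in joined)
```

Because the lookup side is a dictionary with one value per key, this join cannot fan out, and the unmatched list keeps the row counts reconcilable.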
Practice prompts
- Name the grain of a customer revenue report before writing code.
- Decide whether a proposed transform should be pure Python, SQL, or Spark.
Remember this
Pure Python batch code is about explicit data movement, not clever loops.
