DATA TRANSFORMS

Pure Python Batch Transform Patterns

Chapter 08 · Intermediate · Batch

Orientation

What You'll Master Here

Pure Python batch transforms are the middle ground between tiny row helpers and full analytical engines. They fit when the input is small enough to materialize safely, is bounded by chunking, or has already been sampled for a control-plane task.

This chapter teaches the patterns behind real batch jobs: map, filter, group, join, dedupe, aggregate, sorted scan, and window-like state.

The goal is not to avoid pandas or SQL. The goal is to know the core data movement so you can choose the right engine and still reason about correctness.

Why data engineers care

Many pipeline bugs are pattern bugs: fanout joins, wrong dedupe winners, unreported rejects, and metrics computed at the wrong grain.
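A fanout join is the easiest of these to show in code. The sketch below is illustrative only: `lookup_join` and the field names (`customer_id`, `region`, `order_id`) are hypothetical, and it assumes both inputs are bounded lists of dicts. It refuses duplicate keys on the lookup side, so one order can never become two output rows, and it returns unmatched orders instead of dropping them.

```python
from collections import Counter


def lookup_join(orders, customers, key="customer_id"):
    """Join each order to at most one customer row (hypothetical helper)."""
    # A key that appears twice on the lookup side would fan out the join.
    key_counts = Counter(c[key] for c in customers)
    duplicates = sorted(k for k, n in key_counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate {key} on lookup side: {duplicates}")

    by_key = {c[key]: c for c in customers}
    joined, unmatched = [], []
    for order in orders:
        customer = by_key.get(order[key])
        if customer is None:
            unmatched.append(order)  # reported, not silently dropped
        else:
            joined.append({**order, "region": customer["region"]})
    return joined, unmatched


# Illustrative data, not taken from a real pipeline.
customers = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "US"},
]
orders = [
    {"order_id": "a", "customer_id": 1},
    {"order_id": "b", "customer_id": 2},
    {"order_id": "c", "customer_id": 9},
]
joined, unmatched = lookup_join(orders, customers)

# Every input order lands in exactly one of the two outputs.
assert len(joined) + len(unmatched) == len(orders)
```

Raising on duplicate keys is one possible policy. Picking a deterministic winner first is another, as long as the rule is written down.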

Core mental model

A batch transform turns one bounded input set into one or more output sets plus evidence.

  • raw batch: bounded input records
  • normalize: typed fields
  • classify: accepted / rejected
  • aggregate: metrics or outputs
  • report: counts reconcile
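The five stages can be sketched as one small function. This is a minimal illustration under stated assumptions: `run_batch`, the field names, and the business rule (non-empty customer, non-negative amount) are all hypothetical, and the input is assumed to be a list that already fits in memory.

```python
def run_batch(raw_rows):
    """Normalize, classify, aggregate, and report on one bounded batch."""
    accepted, rejected = [], []
    for row in raw_rows:
        try:
            # normalize: raw strings become typed fields
            typed = {
                "customer_id": row["customer_id"].strip(),
                "amount_cents": int(row["amount_cents"]),
            }
        except (KeyError, ValueError) as exc:
            rejected.append({"row": row, "reason": type(exc).__name__})
            continue
        # classify: a hypothetical business rule splits accepted from rejected
        if not typed["customer_id"] or typed["amount_cents"] < 0:
            rejected.append({"row": row, "reason": "business_rule"})
        else:
            accepted.append(typed)

    # aggregate: one metric computed from accepted rows only
    total_cents = sum(r["amount_cents"] for r in accepted)

    # report: evidence that every input row is accounted for
    report = {
        "input": len(raw_rows),
        "accepted": len(accepted),
        "rejected": len(rejected),
        "total_cents": total_cents,
    }
    assert report["input"] == report["accepted"] + report["rejected"]
    return accepted, rejected, report


# Illustrative rows: one clean, one unparseable, one negative, one missing a key.
raw = [
    {"customer_id": " c1 ", "amount_cents": "1250"},
    {"customer_id": "c2", "amount_cents": "oops"},
    {"customer_id": "c3", "amount_cents": "-5"},
    {"amount_cents": "10"},
]
accepted, rejected, report = run_batch(raw)
```

The reconciliation check lives inside the function so that a lost row fails the job rather than silently shrinking the output.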

Key terms

bounded batch
A dataset small enough to materialize safely or a deliberately limited chunk.
business key
The field or tuple that identifies the entity you are deduping or joining.
grain
What one output row represents, such as one customer, one day, or one customer-day.
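
Grain and business key show up in code as the tuple you group by. The sketch below assumes a hypothetical order record with `customer_id`, `order_date`, and `amount_cents` fields, and aggregates to a customer-day grain with `defaultdict`.

```python
from collections import defaultdict


def revenue_by_customer_day(orders):
    """Grain: one output row per (customer_id, order_date)."""
    totals = defaultdict(int)
    for order in orders:
        # The grouping key is the grain, written out explicitly.
        grain_key = (order["customer_id"], order["order_date"])
        totals[grain_key] += order["amount_cents"]
    return [
        {"customer_id": cust, "order_date": day, "revenue_cents": cents}
        for (cust, day), cents in sorted(totals.items())
    ]


# Illustrative data: c1 has two orders on the same day.
orders = [
    {"customer_id": "c1", "order_date": "2024-01-01", "amount_cents": 500},
    {"customer_id": "c1", "order_date": "2024-01-01", "amount_cents": 250},
    {"customer_id": "c1", "order_date": "2024-01-02", "amount_cents": 100},
    {"customer_id": "c2", "order_date": "2024-01-01", "amount_cents": 900},
]
rows = revenue_by_customer_day(orders)

# At the stated grain, no (customer, day) pair appears twice.
keys = [(r["customer_id"], r["order_date"]) for r in rows]
assert len(keys) == len(set(keys))
```

Changing the grain to one row per customer means changing only `grain_key`, which is why it helps to name it before writing the loop.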

Common mistake

Using pure Python for an unbounded warehouse-scale join.

The code may work on samples and then fail on the full input, because the in-memory state a join needs grows with the data.

Better habit

  • State the input size assumption before materializing.
  • Name the output grain before aggregating.
  • Return rejected rows and summary counts alongside outputs.
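
The first habit can be enforced in code rather than left in a comment. A minimal sketch, assuming a hypothetical helper named `materialize_bounded` and a row limit you choose yourself: it reads at most one row past the limit, so an oversized or endless input fails fast instead of filling memory.

```python
from itertools import count, islice


def materialize_bounded(rows, max_rows):
    """Return rows as a list, or raise if the input exceeds max_rows."""
    # Pull at most max_rows + 1 items; the extra one only proves overflow.
    batch = list(islice(rows, max_rows + 1))
    if len(batch) > max_rows:
        raise ValueError(f"input exceeds the stated bound of {max_rows} rows")
    return batch


small = materialize_bounded(iter(range(3)), max_rows=5)

# An unbounded source is rejected after reading only max_rows + 1 items.
try:
    materialize_bounded(count(), max_rows=5)
    overflowed = False
except ValueError:
    overflowed = True
```

The limit here is an assumption about your environment, not a recommended value. Pick it from the memory you can actually spare.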

What to say

I would use pure Python when the batch is bounded, then choose dictionaries for lookup joins, defaultdict for grouping, Counter for tallies, and sorted scans for latest-row logic.
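
Two of those choices, the sorted scan for latest-row logic and `Counter` for tallies, can be sketched together. The helper name `latest_per_key` and the fields `customer_id`, `updated_at`, and `status` are hypothetical. The sketch assumes `updated_at` values sort correctly as stored (ISO date strings here) and that the batch fits in memory.

```python
from collections import Counter
from operator import itemgetter


def latest_per_key(rows, key="customer_id", order_by="updated_at"):
    """Keep the row with the greatest order_by value for each key."""
    latest = {}
    # Scan in ascending order so a later row overwrites an earlier one.
    # sorted() is stable, so exact ties keep input order and the last wins.
    for row in sorted(rows, key=itemgetter(key, order_by)):
        latest[row[key]] = row
    return list(latest.values())


# Illustrative data: c1 appears twice, and the newer row should win.
rows = [
    {"customer_id": "c1", "updated_at": "2024-01-01", "status": "trial"},
    {"customer_id": "c2", "updated_at": "2024-01-03", "status": "active"},
    {"customer_id": "c1", "updated_at": "2024-01-05", "status": "active"},
]
winners = latest_per_key(rows)

# Tally the surviving rows and report how many duplicates were dropped.
status_counts = Counter(row["status"] for row in winners)
dropped = len(rows) - len(winners)
```

Stating the tie-break rule matters as much as the sort itself, since an unstated tie-break is how the wrong dedupe winner gets picked.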

Practice prompts

  • Name the grain of a customer revenue report before writing code.
  • Decide whether a proposed transform should be pure Python, SQL, or Spark.

Remember this

Pure Python batch code is about explicit data movement, not clever loops.