DATA TRANSFORMS

Iterators, Generators & Streaming Data

Chapter 07 · Intermediate · Streaming

Orientation

What You'll Master Here

Many beginner Python examples load everything into a list. Data engineering work often cannot do that: files are large, events arrive continuously, and downstream writers prefer fixed-size batches.

Iterators and generators let Python process one record at a time. They are the foundation for streaming file ingestion, memory-stable transforms, chunked writes, and pipeline code that does not collapse when data grows.

This chapter teaches the mental model behind lazy processing, when it helps, when it hurts, and how to keep evidence such as line numbers and rejected rows while streaming.

Why data engineers care

A job that passes on 100 rows but has to load all 10 million production rows into memory is not production-ready. Streaming patterns keep memory bounded by the batch size rather than by the input size.

Core mental model

A stream is a promise to produce the next item on demand, not a container holding every item now.
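
To make the "promise" idea concrete, here is a minimal sketch. The function name read_events and the produced list are hypothetical, used only to record when work actually happens. Creating the generator runs none of its body; each call to next() produces exactly one item.

```python
produced = []  # hypothetical log of which items have actually been produced


def read_events():
    """Yield three event ids, recording each one at the moment it is produced."""
    for event_id in range(1, 4):
        produced.append(event_id)
        yield event_id


stream = read_events()       # nothing in the function body has run yet
before = list(produced)      # snapshot: still empty

first = next(stream)         # ask for exactly one item
after_one = list(produced)   # snapshot: only the first item was produced

print(before, first, after_one)
```

The remaining two events do not exist anywhere in memory until something asks for them.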

file handle → one line ready
parser → one dict
validator → accepted or rejected
normalizer → one typed row
writer → flush batch
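
The stages above can be sketched as chained generators. This is an illustration under stated assumptions, not a reference implementation: the field names user_id and amount, the validation rule, and the in-memory sink list are all hypothetical stand-ins for a real schema and a real writer.

```python
import io
import json


def parse(lines):
    """Parser: turn each non-empty NDJSON line into one dict."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def validate(records, rejected):
    """Validator: pass records that have a user_id, set the rest aside."""
    for record in records:
        if "user_id" in record:
            yield record
        else:
            # A real job would likely stream rejects to a file instead.
            rejected.append(record)


def normalize(records):
    """Normalizer: produce one typed row per accepted record."""
    for record in records:
        yield {
            "user_id": int(record["user_id"]),
            "amount": float(record.get("amount", 0)),
        }


def write_batches(rows, batch_size, sink):
    """Writer: flush a batch whenever it reaches batch_size."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            sink.append(batch)
            batch = []
    if batch:  # flush the final partial batch
        sink.append(batch)


# io.StringIO stands in for a real file handle.
source = io.StringIO(
    '{"user_id": "1", "amount": "9.5"}\n'
    '{"amount": "3"}\n'
    '{"user_id": "2", "amount": "4"}\n'
    '{"user_id": "3"}\n'
)
rejected = []
written = []
write_batches(normalize(validate(parse(source), rejected)), 2, written)
print(written)
print(rejected)
```

Only the writer drives the pipeline: each loop iteration pulls one line through every stage, so at most one batch of rows is held at a time.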

Key terms

iterable: An object that can produce an iterator, such as a list, file handle, dict, or generator.
iterator: An object that returns the next value when next() is called and eventually raises StopIteration.
generator: A function or expression that yields values lazily while preserving local state.
batch: A bounded group of records processed or written together.
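
The first three terms can be seen side by side in a short sketch. The names numbers and countdown are illustrative only.

```python
numbers = [10, 20, 30]        # an iterable: it can produce an iterator
it = iter(numbers)            # an iterator over that list

first = next(it)
second = next(it)
third = next(it)

try:
    next(it)                  # nothing left to return
    exhausted = False
except StopIteration:
    exhausted = True


def countdown(start):
    """Generator function: local state (current) survives between yields."""
    current = start
    while current > 0:
        yield current
        current -= 1


gen = countdown(3)
gen_values = list(gen)        # consumes the generator
second_pass = list(gen)       # a generator is single-use, so this is empty

squares = (n * n for n in numbers)  # generator expression, also lazy
```

Note that a generator is itself an iterator: once consumed, it stays empty, whereas a list can hand out a fresh iterator every time.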

Common mistake

Converting every reader to list(reader) before processing.

Memory usage then grows with input size, and the job can run out of memory once files get large.
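
As a small illustration, the sketch below totals a column from a csv.DictReader without ever building a list. The column name amount and the helper total_amount are assumptions made for the example, and io.StringIO stands in for a real file.

```python
import csv
import io


def total_amount(reader):
    """Sum the amount column while holding only one row at a time."""
    total = 0.0
    for row in reader:
        total += float(row["amount"])
    return total


source = io.StringIO("id,amount\n1,2.5\n2,4.0\n3,1.5\n")

# Risky for unknown-size input: rows = list(csv.DictReader(source))
# Streaming alternative: hand the reader itself to the consumer.
total = total_amount(csv.DictReader(source))
print(total)
```

The result is the same either way; the difference is that the streaming version never holds more than one row, no matter how many rows the file contains.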

Better habit

  • Stream records until you have a reason to materialize them.
  • Keep source line numbers attached before validation.
  • Batch writes deliberately instead of buffering the whole dataset.
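
One possible way to combine the last two habits is sketched below, assuming NDJSON input. The helper names numbered_lines, parse_or_reject, and batched are hypothetical, as is the shape of the reject record. Line numbers are attached before parsing so that a rejected row can still be traced back to its source line.

```python
import io
import json
from itertools import islice


def numbered_lines(lines):
    """Attach the 1-based source line number before any parsing happens."""
    for line_no, line in enumerate(lines, start=1):
        yield line_no, line.rstrip("\n")


def parse_or_reject(numbered, rejects):
    """Yield (line_no, record); keep evidence for lines that fail to parse."""
    for line_no, text in numbered:
        try:
            yield line_no, json.loads(text)
        except json.JSONDecodeError as exc:
            rejects.append({"line": line_no, "raw": text, "error": exc.msg})


def batched(iterable, size):
    """Yield lists of at most size items without reading ahead."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


source = io.StringIO('{"id": 1}\nnot json\n{"id": 2}\n{"id": 3}\n')
rejects = []

# Materialized here only so the result can be inspected; a real job would
# write each batch and then let it go.
batches = list(batched(parse_or_reject(numbered_lines(source), rejects), 2))
print(batches)
print(rejects)
```

On Python 3.12 and later, itertools.batched offers similar chunking (it yields tuples rather than lists), so the hand-written helper may not be needed there.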

What to say

I would design the job as a lazy pipeline: read one record, parse it, validate it, normalize it, then write in bounded batches so memory stays predictable.

Practice prompts

  • Explain why list(reader) is risky for an unknown-size file.
  • Draw a lazy pipeline for NDJSON events from file to accepted output.

Remember this

Streaming Python is not magic. It is disciplined one-record-at-a-time processing with explicit batching.