Orientation
What You'll Master Here
Many beginner Python examples load everything into a list. Data engineering work often cannot do that: files are large, events arrive continuously, and downstream writers prefer fixed-size batches.
Iterators and generators let Python process one record at a time. They are the foundation for streaming file ingestion, memory-stable transforms, chunked writes, and pipeline code that does not collapse when data grows.
This chapter teaches the mental model behind lazy processing, when it helps, when it hurts, and how to keep evidence such as line numbers and rejected rows while streaming.
Why data engineers care
A job that passes on 100 rows by loading them all into memory will try to do the same with 10 million rows, so it is not production-ready. Streaming patterns keep peak memory tied to the batch size rather than the input size.
Core mental model
A stream is a promise to produce the next item on demand, not a container holding every item now.
- file handle: one line ready
- parser: one dict
- validator: accepted or rejected
- normalizer: one typed row
- writer: flush batch
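The stages above can be sketched as chained generator functions. This is a minimal illustration, not a reference implementation: the stage names mirror the stages listed, while the `trace` list, the `user_id` and `amount` fields, and the in-memory `lines` list (standing in for a file handle) are assumptions made for the example. The trace records the order in which stages actually run, which shows that one record travels through every stage before the next line is read.

```python
import json
from typing import Iterable, Iterator

trace = []  # hypothetical helper: records the order in which stages run


def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Parser stage: one text line in, one dict out."""
    for line in lines:
        trace.append("parse")
        yield json.loads(line)


def validate(records: Iterable[dict]) -> Iterator[dict]:
    """Validator stage: pass accepted records on, drop rejected ones."""
    for record in records:
        trace.append("validate")
        if "user_id" in record and "amount" in record:
            yield record


def normalize(records: Iterable[dict]) -> Iterator[dict]:
    """Normalizer stage: one typed row per accepted record."""
    for record in records:
        trace.append("normalize")
        yield {"user_id": str(record["user_id"]), "amount": float(record["amount"])}


def write(rows: Iterable[dict], batch_size: int = 2) -> list:
    """Writer stage: buffer rows and flush whenever the batch is full."""
    flushed, batch = [], []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            trace.append("flush")
            flushed.append(batch)
            batch = []
    if batch:  # flush the final partial batch
        trace.append("flush")
        flushed.append(batch)
    return flushed


# Stand-in for a file handle; the second line is missing "amount".
lines = [
    '{"user_id": 1, "amount": "9.50"}',
    '{"user_id": 2}',
    '{"user_id": 3, "amount": 4}',
    '{"user_id": 4, "amount": "1.25"}',
]

pipeline = normalize(validate(parse(lines)))
assert trace == []  # building the pipeline runs nothing yet

batches = write(pipeline)  # the writer pulls records through one at a time
```

Only the writer's loop drives the work. At no point does any stage hold more than the record it is currently handling, and the writer holds at most `batch_size` rows.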
- iterable: An object that can produce an iterator, such as a list, file handle, dict, or generator.
- iterator: An object that returns the next value when next() is called and eventually raises StopIteration.
- generator: A function or expression that yields values lazily while preserving local state.
- batch: A bounded group of records processed or written together.
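A short sketch of the first three terms in action, using only built-ins. The `running_total` function and the sample `numbers` list are made up for illustration.

```python
from typing import Iterable, Iterator

numbers = [10, 20, 30]   # an iterable: it can hand out iterators
it = iter(numbers)       # an iterator over that list

first = next(it)         # each next() call produces exactly one value
second = next(it)
third = next(it)

try:
    next(it)             # nothing left to produce
    exhausted = False
except StopIteration:
    exhausted = True


def running_total(values: Iterable[int]) -> Iterator[int]:
    """Generator function: the local variable `total` survives between yields."""
    total = 0
    for value in values:
        total += value
        yield total


totals = running_total(numbers)
is_own_iterator = iter(totals) is totals   # a generator is its own iterator
collected = list(totals)                   # materialized here only to inspect it

squares = (n * n for n in numbers)         # generator expression, also lazy
first_square = next(squares)
```

A `for` loop performs the same `iter()` and `next()` calls behind the scenes and stops quietly when `StopIteration` is raised.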
Common mistake
Converting every reader to list(reader) before processing.
Memory usage grows with input size and the job fails when files get large.
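The contrast can be sketched with the standard `csv` module. The three-row `source` string is a stand-in for a large file, and the `order_id` and `amount` columns are assumptions for the example; a real job would open a path instead of an `io.StringIO`.

```python
import csv
import io

# Stand-in for a large file; a real job would use open(path, newline="").
source = "order_id,amount\n1,9.50\n2,4.00\n3,1.25\n"

# Materializing: list(reader) holds every row in memory at the same time.
with io.StringIO(source) as handle:
    rows = list(csv.DictReader(handle))
    total_materialized = sum(float(row["amount"]) for row in rows)

# Streaming: only the current row is alive, so memory does not grow with row count.
with io.StringIO(source) as handle:
    total_streamed = 0.0
    row_count = 0
    for row in csv.DictReader(handle):
        total_streamed += float(row["amount"])
        row_count += 1
```

Both versions compute the same total. The difference only shows up when the input is too large for `rows` to fit in memory.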
Better habit
- Stream records until you have a reason to materialize them.
- Keep source line numbers attached before validation.
- Batch writes deliberately instead of buffering the whole dataset.
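One way these habits might look in code, as a sketch under stated assumptions: `numbered`, `accepted_rows`, and `batched` are hypothetical helper names, the required `event` field is invented for the example, and rejected rows are collected in a list only to keep the sketch short (a real job would more likely stream them to a reject file).

```python
import json
from itertools import islice
from typing import Iterable, Iterator, Tuple


def numbered(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Attach 1-based source line numbers before any validation runs."""
    return enumerate(lines, start=1)


def accepted_rows(numbered_lines: Iterable[Tuple[int, str]], rejected: list) -> Iterator[dict]:
    """Yield valid records; record line number, reason, and raw text for rejects."""
    for line_no, line in numbered_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            rejected.append({"line": line_no, "reason": f"bad JSON: {exc.msg}", "raw": line})
            continue
        if not isinstance(record, dict) or "event" not in record:
            rejected.append({"line": line_no, "reason": "missing 'event'", "raw": line})
            continue
        yield {"line": line_no, **record}


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Group a stream into lists of at most `size` items without materializing it."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Stand-in for a file handle; lines 2 and 3 should be rejected.
lines = [
    '{"event": "click"}',
    "not json",
    '{"user": 7}',
    '{"event": "view"}',
    '{"event": "buy"}',
]

rejected = []
batches = list(batched(accepted_rows(numbered(lines), rejected), size=2))
```

Because the line number is attached before parsing, a reject can always be traced back to its position in the source, even though no stage ever sees the whole file. On Python 3.12 and later, `itertools.batched` offers similar grouping and yields tuples instead of lists.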
I would design the job as a lazy pipeline: read one record, parse it, validate it, normalize it, then write in bounded batches so memory stays predictable.
Practice prompts
- Explain why list(reader) is risky for an unknown-size file.
- Draw a lazy pipeline for NDJSON events from file to accepted output.
Remember this
Streaming Python is not magic. It is disciplined one-record-at-a-time processing with explicit batching.
