Orientation
What You'll Master Here
By now you can ingest files and turn raw strings into typed values. The next skill is structure: how to arrange that logic so a teammate can review it, test it, rerun it, and change it safely.
Data engineering Python usually fails when everything lives in one script: file reads, environment variables, parsing, validation, metrics, and writes all tangled together. This chapter separates those responsibilities.
You will learn to treat functions as pipeline boundaries, modules as ownership boundaries, and configuration as an explicit contract instead of a scattering of constants.
Why data engineers care
Production jobs are changed more often than they are written. Clear functions and modules reduce review risk, isolate bugs, and make reruns explainable.
Core mental model
Keep business logic pure, keep I/O at the edge, pass configuration explicitly, and make every boundary testable.
raw row
dict[str, object]
normalize_order()
typed transform
accepted
NormalizedOrder
rejected
RejectedRow
- pure function
- A function whose output depends only on its input and has no hidden side effects.
- I/O wrapper
- Thin code that reads, writes, logs, or calls external systems around pure logic.
- configuration object
- A single typed object that carries runtime settings through the job.
- entrypoint
- The file or function operators run to start the job, often guarded by __main__.
Common mistake
Putting parsing, environment reads, file writes, and transforms in one long function.
A small rule change requires re-reading the whole job and makes unit tests awkward.
Better habit
- Write small functions around one decision.
- Pass config as an argument instead of reading globals everywhere.
- Keep run scripts thin and transform modules boring.
I would keep row transforms pure, isolate file/API/database I/O at the boundary, and pass a JobConfig object so the behavior is reviewable and testable.
Practice prompts
- Take a 100-line ingestion script and draw which lines are transform logic versus I/O.
- Name three settings that should live in JobConfig instead of global constants.
Remember this
Intermediate Python for data engineering is mostly boundary design: what belongs in the function, module, config, and entrypoint.
