Orientation
What You'll Master Here
Advanced Python pipelines are not just transforms. They are rerunnable systems that know what they processed, what they wrote, and how to recover safely.
This chapter teaches full refreshes, incremental loads, idempotency, cursors, watermarks, checkpoints, replay windows, partition-aware writes, manifests, backfills, task boundaries, orchestration, retries, and how to speak precisely about exactly-once guarantees.
The goal is clean operational behavior: the same input produces the same final output, retries do not duplicate data, and every run leaves evidence.
Why data engineers care
Many production pipeline incidents trace back to reruns, late-arriving data, partial writes, or unclear ownership between tasks.
Core mental model
A pipeline is a sequence of committed boundaries, not one long script.
| task | input | output |
|---|---|---|
| extract | cursor window | raw landing file |
| validate | raw file | accepted/rejected + manifest |
| transform | accepted rows | curated partition |
| reconcile | curated partition | audit receipt |
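The table above can be sketched as four small functions, each with one explicit input and one explicit output. This is a minimal illustration, not a reference implementation: the function names, the JSON file layout, the `amount` validation rule, and the placeholder rows are all assumptions made for the example.

```python
import json
from pathlib import Path


def extract(window: tuple[str, str], root: Path) -> Path:
    """Land raw rows for one cursor window at a path derived from that window."""
    start, end = window
    raw_path = root / "raw" / f"{start}_{end}.json"
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical placeholder rows standing in for a real source query.
    rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}]
    raw_path.write_text(json.dumps(rows))
    return raw_path


def validate(raw_path: Path) -> dict:
    """Split raw rows into accepted and rejected, and record counts in a manifest."""
    rows = json.loads(raw_path.read_text())
    accepted = [r for r in rows if r["amount"] is not None]
    rejected = [r for r in rows if r["amount"] is None]
    manifest = {
        "source": raw_path.name,
        "accepted": len(accepted),
        "rejected": len(rejected),
    }
    return {"accepted": accepted, "rejected": rejected, "manifest": manifest}


def transform(accepted: list[dict], partition: str, root: Path) -> Path:
    """Write accepted rows to a curated partition, overwriting any earlier attempt."""
    out = root / "curated" / f"dt={partition}" / "part-0000.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(sorted(accepted, key=lambda r: r["id"])))
    return out


def reconcile(curated_path: Path, manifest: dict) -> dict:
    """Compare what was written against what validation accepted."""
    written = len(json.loads(curated_path.read_text()))
    return {
        "partition": curated_path.parent.name,
        "rows_written": written,
        "matches_manifest": written == manifest["accepted"],
    }


def run_pipeline(window: tuple[str, str], partition: str, root: Path) -> dict:
    """Chain the four boundaries; each step consumes only the previous step's output."""
    raw = extract(window, root)
    checked = validate(raw)
    curated = transform(checked["accepted"], partition, root)
    return reconcile(curated, checked["manifest"])
```

Because every step reads and writes a named artifact, an orchestrator could retry any single step without rerunning the ones before it.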
- idempotent: Safe to run again with the same input without duplicating or corrupting output.
- watermark: A committed high-water mark that defines what the pipeline has safely processed.
- replay window: A lookback range reprocessed to catch late or corrected records.
- task boundary: A unit of work with explicit inputs, outputs, retries, and evidence.
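To make these terms concrete, here is a small sketch of how a watermark and a replay window might combine into an extraction range, and why a keyed upsert keeps the replay idempotent. The helper names `extraction_window` and `upsert`, and the assumption that each row carries an `id` primary key, are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def extraction_window(
    watermark: datetime, replay: timedelta, now: datetime
) -> tuple[datetime, datetime]:
    """Return the [start, end) range for the next run.

    The start reaches back `replay` before the committed watermark so that
    late or corrected records inside that range are read again.
    """
    return watermark - replay, now


def upsert(target: dict, rows: list[dict]) -> dict:
    """Merge rows by primary key so replaying the same rows changes nothing."""
    merged = dict(target)
    for row in rows:
        merged[row["id"]] = row  # last write wins for a given key
    return merged


# Usage: a two-day replay window behind a watermark of 2024-01-10.
watermark = datetime(2024, 1, 10, tzinfo=timezone.utc)
now = datetime(2024, 1, 11, tzinfo=timezone.utc)
start, end = extraction_window(watermark, timedelta(days=2), now)

rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
once = upsert({}, rows)
twice = upsert(once, rows)  # replaying the same rows leaves the target unchanged
```

The replay window deliberately rereads rows that were already processed, which is only safe because the write is keyed rather than appended.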
Common mistake
Treating reruns as an afterthought.
Without a design for reruns, even a routine retry can create duplicate files, duplicate rows, or inconsistent partitions.
Better habit
- Write deterministic output paths.
- Commit watermarks only after successful writes.
- Store manifests for every run.
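The three habits above could look roughly like this for a daily partition writer. It is a sketch under stated assumptions: the `dt=` directory layout, the `_manifest.json` and `_watermark.json` file names, and the helper names are invented for the example, and ISO-formatted date strings are assumed so that string comparison orders them correctly.

```python
import hashlib
import json
import os
from pathlib import Path


def partition_path(root: Path, dataset: str, run_date: str) -> Path:
    """Derive the path from the logical date only, never from wall-clock time."""
    return root / dataset / f"dt={run_date}" / "part-0000.json"


def write_atomically(path: Path, payload: bytes) -> None:
    """Write to a temporary file, then rename, so readers never see a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(payload)
    os.replace(tmp, path)  # atomic when both paths are on the same filesystem


def run_partition(root: Path, dataset: str, run_date: str, rows: list[dict]) -> dict:
    """Write one partition, store its manifest, then advance the watermark."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    out = partition_path(root, dataset, run_date)
    write_atomically(out, payload)  # a rerun overwrites the same file

    manifest = {
        "dataset": dataset,
        "partition": run_date,
        "row_count": len(rows),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    manifest_bytes = json.dumps(manifest, sort_keys=True).encode("utf-8")
    write_atomically(out.with_name("_manifest.json"), manifest_bytes)

    # Verify the bytes on disk before touching the watermark.
    if hashlib.sha256(out.read_bytes()).hexdigest() != manifest["sha256"]:
        raise RuntimeError(f"checksum mismatch for {out}")

    # Advance the watermark only forward, so a backfill cannot move it back.
    wm_path = root / dataset / "_watermark.json"
    current = json.loads(wm_path.read_text())["watermark"] if wm_path.exists() else ""
    if run_date > current:
        write_atomically(wm_path, json.dumps({"watermark": run_date}).encode("utf-8"))
    return manifest
```

Running `run_partition` twice with the same arguments leaves one data file, one manifest, and the same watermark, which is the behavior a retry needs.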
Say: "I would design this to be idempotent, commit watermarks after output validation, and replay a bounded window for late data."
Practice prompts
- Define idempotency for a daily partition writer.
- Name the evidence needed before advancing a watermark.
Remember this
Reliable pipelines are designed around reruns.
