D8LooP - Practise Data Engineering

Orientation

What You'll Master Here

Advanced Python pipelines are not just transforms. They are rerunnable systems that know what they processed, what they wrote, and how to recover safely.

This chapter teaches full refreshes, incremental loads, idempotency, cursors, watermarks, checkpoints, replay windows, partition-aware writes, manifests, backfills, task boundaries, orchestration, retries, and how to talk precisely about exactly-once guarantees.

The goal is clean operational behavior: the same input produces the same final output, retries do not duplicate data, and every run leaves evidence.

Why data engineers care

Many production pipeline incidents trace back to reruns, late data, partial writes, or unclear ownership between tasks.

Core mental model

A pipeline is a sequence of committed boundaries, not one long script.

Task boundary map

task	input	output
extract	cursor window	raw landing file
validate	raw file	accepted/rejected + manifest
transform	accepted rows	curated partition
reconcile	curated partition	audit receipt
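
The boundary map above can be sketched as four small functions, each with one explicit input and one explicit output. This is a minimal in-memory illustration, not a production design: the function bodies, the hard-coded source rows, and the manifest fields are all assumptions made for the example.

```python
from datetime import date


def extract(window_start: date, window_end: date) -> list[dict]:
    """Cursor window -> raw landing rows (hard-coded stand-in for a real source)."""
    source = [
        {"id": 1, "day": date(2024, 1, 1), "amount": "10.5"},
        {"id": 2, "day": date(2024, 1, 1), "amount": "oops"},
        {"id": 3, "day": date(2024, 1, 2), "amount": "7"},
    ]
    # Half-open window: start is included, end is excluded.
    return [r for r in source if window_start <= r["day"] < window_end]


def validate(raw: list[dict]) -> tuple[list[dict], list[dict], dict]:
    """Raw rows -> accepted rows, rejected rows, and a manifest of counts."""
    accepted, rejected = [], []
    for row in raw:
        try:
            accepted.append({**row, "amount": float(row["amount"])})
        except ValueError:
            rejected.append(row)
    manifest = {"raw": len(raw), "accepted": len(accepted), "rejected": len(rejected)}
    return accepted, rejected, manifest


def transform(accepted: list[dict]) -> dict[str, float]:
    """Accepted rows -> curated partitions keyed by day (daily totals)."""
    curated: dict[str, float] = {}
    for row in accepted:
        key = row["day"].isoformat()
        curated[key] = curated.get(key, 0.0) + row["amount"]
    return curated


def reconcile(manifest: dict, curated: dict[str, float]) -> dict:
    """Curated partitions -> audit receipt that checks the counts add up."""
    return {
        "counts_balance": manifest["raw"] == manifest["accepted"] + manifest["rejected"],
        "partitions": sorted(curated),
    }


raw = extract(date(2024, 1, 1), date(2024, 1, 3))
accepted, rejected, manifest = validate(raw)
curated = transform(accepted)
receipt = reconcile(manifest, curated)
```

Because each function hands over a concrete artifact, a failed step can be retried from the previous boundary instead of restarting the whole run.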

Key terms

idempotent: Safe to run again with the same input without duplicating or corrupting output.
watermark: A committed high-water mark that defines what the pipeline has safely processed.
replay window: A lookback range reprocessed to catch late or corrected records.
task boundary: A unit of work with explicit inputs, outputs, retries, and evidence.
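
To make the watermark and replay window terms concrete, here is a small sketch of how the next extraction window might be derived from them. The function name `next_window` and the two-hour default lookback are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def next_window(
    watermark: datetime,
    now: datetime,
    replay: timedelta = timedelta(hours=2),
) -> tuple[datetime, datetime]:
    """Return the half-open [start, end) window for the next run.

    The window starts before the committed watermark by the replay lookback,
    so late or corrected records inside that range are processed again.
    """
    return watermark - replay, now


watermark = datetime(2024, 1, 2, 0, 0)
start, end = next_window(watermark, now=datetime(2024, 1, 2, 6, 0))
```

Reprocessing the overlap only stays safe if the downstream write is idempotent; otherwise the replayed rows would be counted twice.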

Common mistake

Treating reruns as an afterthought.

A normal retry can create duplicate files, duplicate rows, or inconsistent partitions.
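
The duplicate-row case can be reproduced in a few lines. The sketch below compares an append-style write with a write keyed on a record id, using plain Python containers as stand-ins for a real table; the helper names are hypothetical.

```python
def append_write(store: list[dict], rows: list[dict]) -> None:
    """Blind append: every call adds the rows again."""
    store.extend(rows)


def keyed_write(store: dict[int, dict], rows: list[dict]) -> None:
    """Upsert by id: writing the same rows again replaces them in place."""
    for row in rows:
        store[row["id"]] = row


rows = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
appended: list[dict] = []
keyed: dict[int, dict] = {}

# Simulate the first attempt plus one retry of the same batch.
for _attempt in range(2):
    append_write(appended, rows)
    keyed_write(keyed, rows)

appended_count = len(appended)  # rows duplicated by the retry
keyed_count = len(keyed)        # unchanged by the retry
```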

Better habit

Write deterministic output paths.
Commit watermarks only after successful writes.
Store manifests for every run.
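
The three habits above can be combined in one small writer. This is a sketch under stated assumptions: the `day=...` directory layout, the file names, the manifest fields, and the `run_partition` helper are all hypothetical, and a local temporary directory stands in for real storage.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def run_partition(base: Path, day: str, rows: list[dict]) -> dict:
    """Write one daily partition, then its manifest, then the watermark."""
    part_dir = base / f"day={day}"
    part_dir.mkdir(parents=True, exist_ok=True)

    # Deterministic path: the same day always maps to the same file,
    # so a rerun overwrites instead of adding a second file.
    data_path = part_dir / "part-0000.jsonl"
    payload = "".join(json.dumps(r, sort_keys=True) + "\n" for r in rows)
    tmp_path = part_dir / "part-0000.jsonl.tmp"
    tmp_path.write_text(payload, encoding="utf-8")
    os.replace(tmp_path, data_path)  # swap the finished file into place

    # Manifest: evidence of what this run wrote.
    manifest = {
        "day": day,
        "rows": len(rows),
        "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "file": data_path.name,
    }
    (part_dir / "manifest.json").write_text(
        json.dumps(manifest, sort_keys=True), encoding="utf-8"
    )

    # Watermark is committed last, only after data and manifest exist.
    wm_tmp = base / "watermark.json.tmp"
    wm_tmp.write_text(json.dumps({"watermark": day}), encoding="utf-8")
    os.replace(wm_tmp, base / "watermark.json")
    return manifest


base = Path(tempfile.mkdtemp())
rows = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 7.0}]
first = run_partition(base, "2024-01-01", rows)
second = run_partition(base, "2024-01-01", rows)  # rerun with the same input
files = sorted(p.name for p in (base / "day=2024-01-01").iterdir())
```

If the run fails before the last step, the watermark still points at the previous partition, so the next run redoes the same day and overwrites the same path. A real pipeline would likely also guard against moving the watermark backwards during a backfill, which this sketch does not do.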

Senior signal

Say: I would design this to be idempotent, commit watermarks after output validation, and replay a bounded window for late data.

Practice prompts

Define idempotency for a daily partition writer.
Name the evidence needed before advancing a watermark.

Remember this

Reliable pipelines are designed around reruns.

Incremental, Idempotent & Orchestrated Pipelines

What You'll Master Here