DATA TRANSFORMS

Testing, Debugging, Logging & Operability

Chapter 13 | Advanced | Operability

Orientation

What You'll Master Here

Advanced Python data engineering is not just writing transforms. It is proving they work, explaining failures, and giving operators enough evidence to trust a run.

This chapter connects pytest-style tests, fixtures, golden outputs, failure-path assertions, dependency mocking, row-count debugging, structured logging, metrics, and run manifests.

The mindset is simple: every important promise in the pipeline should have a small test, a clear log, a health signal, or a receipt.

Why data engineers care

The difference between a script and a production data job is whether someone can debug and rerun it at 2 AM without guessing.

Core mental model

Testing proves expected behavior before the run; observability proves what happened during the run.

Data job test pyramid

layer        example                                 confidence
unit         normalize_order fixture rows            fast rule proof
contract     golden accepted/rejected outputs        schema and evidence proof
integration  temp files and mocked API/db boundary   boundary proof
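The unit and contract layers of the pyramid can be sketched with nothing but plain assert-based test functions, which pytest would collect as written. This is a minimal sketch under assumptions: the chapter names normalize_order but does not define it, so the function body, the field names, and the fixture rows below are all hypothetical.

```python
from decimal import Decimal


def normalize_order(row: dict) -> dict:
    """Hypothetical transform: clean the id, parse the amount, upper-case the currency."""
    return {
        "order_id": str(row["order_id"]).strip(),
        "amount": Decimal(str(row["amount"])),
        "currency": row["currency"].strip().upper(),
    }


def order_rows() -> list:
    """Tiny fixture rows (assumed shape); under pytest this could be a @pytest.fixture."""
    return [
        {"order_id": " 1001 ", "amount": "19.90", "currency": "usd"},
        {"order_id": 1002, "amount": 5, "currency": " EUR "},
    ]


# Golden output: the known expected result for the fixture rows above.
GOLDEN = [
    {"order_id": "1001", "amount": Decimal("19.90"), "currency": "USD"},
    {"order_id": "1002", "amount": Decimal("5"), "currency": "EUR"},
]


def test_normalize_order_matches_golden():
    # Contract check: any drift in the transform shows up as a diff against GOLDEN.
    assert [normalize_order(row) for row in order_rows()] == GOLDEN
```

In a real project the golden rows would more likely live in a small checked-in file, so that a reviewer sees any contract change as a diff.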

Key terms
fixture
A reusable test input or setup object that makes tests small and readable.
golden output
A known expected output used to prove a transform contract has not drifted.
structured log
A log event with named fields such as run_id, partition, rows_seen, and status.
operability
The practical ability to run, monitor, debug, and rerun a job safely.
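A structured log event with the fields named above can be produced with the standard library alone. The sketch below is one possible approach, not a prescribed one: the JsonFormatter class, the "orders_job" logger name, and the convention of passing fields under extra={"fields": ...} are assumptions made for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with named fields."""

    def format(self, record):
        event = {"level": record.levelname, "message": record.getMessage()}
        # Fields passed via extra={"fields": {...}} become attributes on the record.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


def build_logger(stream):
    """Return a logger that writes JSON events to the given stream."""
    logger = logging.getLogger("orders_job")  # hypothetical job name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated setup
    logger.propagate = False  # keep events out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info(
    "partition_done",
    extra={"fields": {"run_id": "run-001", "partition": "2024-01-01",
                      "rows_seen": 120, "status": "ok"}},
)
event = json.loads(stream.getvalue())  # each line parses back into named fields
```

Because every event is a JSON object, an operator can filter by run_id or partition instead of searching free text.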

Common mistake

Testing only the happy path.

The code paths that handle rejected rows, warnings, and broken dependencies are usually the first to fail in production, so they need tests too.
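Failure paths can be tested as directly as the happy path. The sketch below assumes a hypothetical enrich_order transform that calls an external rate service; the RowRejected error, the get_rate method, and the reason codes are invented for illustration. The dependency is replaced with unittest.mock.Mock so the test never touches a real service.

```python
from unittest.mock import Mock


class RowRejected(Exception):
    """Hypothetical error that carries evidence about why a row was rejected."""

    def __init__(self, order_id, reason):
        super().__init__(f"order {order_id} rejected: {reason}")
        self.order_id = order_id
        self.reason = reason


def enrich_order(row, fx_client):
    """Hypothetical transform that depends on an external rate lookup."""
    if row.get("amount") is None:
        raise RowRejected(row.get("order_id"), "missing_amount")
    try:
        rate = fx_client.get_rate(row["currency"])
    except ConnectionError as exc:
        raise RowRejected(row["order_id"], "fx_unavailable") from exc
    return {**row, "amount_usd": row["amount"] * rate}


def test_missing_amount_is_rejected_with_reason():
    fx = Mock()
    try:
        enrich_order({"order_id": "1001", "amount": None, "currency": "EUR"}, fx)
    except RowRejected as err:
        # Assert the evidence, not just that something was raised.
        assert err.order_id == "1001"
        assert err.reason == "missing_amount"
    else:
        raise AssertionError("expected RowRejected")
    fx.get_rate.assert_not_called()  # bad rows should never reach the dependency


def test_broken_dependency_is_reported():
    fx = Mock()
    fx.get_rate.side_effect = ConnectionError("timeout")  # simulate an outage
    try:
        enrich_order({"order_id": "1002", "amount": 10, "currency": "EUR"}, fx)
    except RowRejected as err:
        assert err.reason == "fx_unavailable"
        assert isinstance(err.__cause__, ConnectionError)
    else:
        raise AssertionError("expected RowRejected")
```

Under pytest the try/except/else blocks would typically be written with pytest.raises; plain assertions are used here only to keep the sketch dependency-free.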

Better habit

  • Test pure transforms with tiny input/output examples.
  • Assert failure evidence, not just that an exception happened.
  • Log run context and emit manifests that reconcile counts.
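A manifest that reconciles counts might look like the following sketch. The RunManifest fields and the build_manifest helper are assumptions chosen for illustration; the check itself is simply that rows seen equals rows accepted plus rows rejected.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical run receipt; field names are illustrative."""
    run_id: str
    partition: str
    rows_seen: int
    rows_accepted: int
    rows_rejected: int

    def reconciles(self) -> bool:
        # Every input row must be accounted for as accepted or rejected.
        return self.rows_seen == self.rows_accepted + self.rows_rejected

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def build_manifest(run_id, partition, rows_seen, accepted, rejected):
    """Build the receipt and refuse to emit it if rows went missing."""
    manifest = RunManifest(run_id, partition, rows_seen, len(accepted), len(rejected))
    if not manifest.reconciles():
        raise ValueError(f"row counts do not reconcile: {manifest.to_json()}")
    return manifest


manifest = build_manifest(
    "run-001",
    "2024-01-01",
    rows_seen=3,
    accepted=[{"order_id": "1001"}, {"order_id": "1002"}],
    rejected=[{"order_id": "1003", "reason": "missing_amount"}],
)
print(manifest.to_json())
```

Raising on a mismatch is one design choice among several; a job could instead write the manifest with a failed status so the discrepancy itself is recorded.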

Senior signal

Say: "I would test accepted and rejected outputs, mock unstable boundaries, log structured run context, and emit a manifest with counts that reconcile."

Practice prompts

  • List the first three tests you would write for an orders normalization job.
  • Name the run fields you would want in every log event.

Remember this

Reliability is a product feature of the pipeline.