Orientation
What You'll Master Here
Validation is how a Python data job proves that output rows are trustworthy before they reach a file, API, database, or warehouse.
This chapter turns earlier habits into a full contract system: required fields, field rules, record rules, batch rules, rejected rows, warnings, fatal errors, exception context, manifests, and tests.
The goal is not to make code noisy. The goal is to make failure paths explicit enough that bad data is stopped, explained, and measured.
Why data engineers care
Silent validation failures become bad metrics, unsafe upserts, broken partitions, and expensive incident investigations.
Core mental model
Validate shape, validate fields, validate records, validate the batch, then write only safe output plus evidence.
- schema check: columns exist
- field rules: types and ranges
- record rules: row is coherent
- batch rules: duplicates and totals
- output: safe rows + evidence
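The stages above can be sketched as one function. This is a minimal illustration, not a reference implementation: the column names, the `validate_batch` and `reject` helpers, and the specific rules (non-negative amount, shipped rows need `shipped_at`, unique `order_id`) are all assumptions chosen for a hypothetical orders feed.

```python
import math
from collections import Counter

# Hypothetical contract for an orders feed; the column names are assumptions.
REQUIRED_COLUMNS = {"order_id", "amount", "status", "shipped_at"}


class FatalBatchError(Exception):
    """Batch-level problem that should stop the run."""


def reject(line, field, value, reason):
    """Build a rejected-row record that keeps the source context."""
    return {"line": line, "field": field, "value": value, "reason": reason}


def validate_batch(columns, rows):
    """Return (accepted, rejected) or raise FatalBatchError."""
    # 1. Schema check: required columns exist.
    missing = REQUIRED_COLUMNS - set(columns)
    if missing:
        raise FatalBatchError(f"missing required columns: {sorted(missing)}")

    accepted, rejected = [], []
    for line, row in enumerate(rows, start=1):
        # 2. Field rules: amount must parse as a finite, non-negative number.
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            rejected.append(reject(line, "amount", row["amount"], "not a number"))
            continue
        if not math.isfinite(amount) or amount < 0:
            rejected.append(reject(line, "amount", row["amount"], "out of range"))
            continue
        # 3. Record rule: a shipped order must carry a ship date.
        if row["status"] == "shipped" and not row["shipped_at"]:
            rejected.append(
                reject(line, "shipped_at", row["shipped_at"], "shipped without shipped_at")
            )
            continue
        accepted.append(row)

    # 4. Batch rule: duplicate keys make an upsert unsafe, so stop the run.
    counts = Counter(row["order_id"] for row in accepted)
    dupes = sorted(key for key, n in counts.items() if n > 1)
    if dupes:
        raise FatalBatchError(f"duplicate order_id values: {dupes}")

    # 5. Output: safe rows plus the evidence of what was excluded.
    return accepted, rejected
```

The ordering is the point of the sketch: a missing column stops the run before any row is inspected, and batch rules only run over rows that already passed the row-level checks.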
- rejected row: A row excluded from accepted output with a structured reason and source context.
- warning: A non-blocking quality signal that is reported while the row may continue.
- fatal error: A batch-level problem that should stop the run, such as missing required columns.
- contract evidence: Counts and reports proving what validation accepted, rejected, warned, or stopped.
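One way to keep these four terms distinct in code is to give every finding an explicit severity and route it through a single function. The `Severity`, `Finding`, `FatalBatchError`, and `route` names below are hypothetical, and the report dictionary is just one possible shape for contract evidence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    WARNING = "warning"  # reported, row may continue
    REJECT = "reject"    # row is excluded from accepted output
    FATAL = "fatal"      # the run should stop


@dataclass(frozen=True)
class Finding:
    severity: Severity
    reason: str
    file: str
    line: Optional[int] = None   # None for batch-level findings
    field: Optional[str] = None
    value: object = None


class FatalBatchError(Exception):
    """Raised when a finding should stop the whole run."""

    def __init__(self, finding):
        super().__init__(finding.reason)
        self.finding = finding


def route(findings, report):
    """Record every finding as evidence; return True if the row may continue."""
    keep = True
    for finding in findings:
        report[finding.severity.value].append(finding)
        if finding.severity is Severity.FATAL:
            raise FatalBatchError(finding)
        if finding.severity is Severity.REJECT:
            keep = False
    return keep
```

Because every finding is appended to the report before any decision is taken, the counts per severity remain available as evidence even when the run stops.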
Common mistake
Catching every error and continuing without evidence.
The job appears resilient while it silently loses correctness: rows disappear and nothing records why.
Better habit
- Separate warning, rejection, and fatal paths.
- Attach file, line, field, value, and reason to failures.
- Test each validation rule with tiny examples.
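The last two habits fit together naturally: a rule that returns its failure as data is easy to test with tiny inputs. The sketch below assumes a hypothetical `check_quantity` rule and an `orders.csv` source name; the tests are plain functions with `assert`, written so a runner such as pytest could collect them.

```python
def check_quantity(row, *, file, line):
    """Return None if the row passes, else a failure record with source context."""
    raw = row.get("quantity")
    try:
        ok = int(raw) > 0
    except (TypeError, ValueError):
        # Missing values and non-integer text both fail the rule.
        ok = False
    if ok:
        return None
    return {
        "file": file,
        "line": line,
        "field": "quantity",
        "value": raw,
        "reason": "quantity must be a positive integer",
    }


def test_accepts_positive_quantity():
    assert check_quantity({"quantity": "3"}, file="orders.csv", line=2) is None


def test_rejects_zero_with_full_context():
    failure = check_quantity({"quantity": "0"}, file="orders.csv", line=7)
    assert failure == {
        "file": "orders.csv",
        "line": 7,
        "field": "quantity",
        "value": "0",
        "reason": "quantity must be a positive integer",
    }


def test_rejects_missing_and_non_numeric():
    assert check_quantity({}, file="orders.csv", line=3)["value"] is None
    assert check_quantity({"quantity": "many"}, file="orders.csv", line=4) is not None
```

Each test uses a single one-field row, so a failing test points directly at the rule that broke.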
I would validate schema first, then field and record rules, emit rejected rows with source context, stop on fatal batch errors, and write a manifest that reconciles counts.
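The final step of that answer, a manifest that reconciles counts, might look like the following. The `build_manifest` function and its field names are assumptions for illustration; the one invariant it enforces is that every row read is either accepted or rejected.

```python
from collections import Counter


def build_manifest(source, rows_read, accepted, rejected, warnings):
    """Summarize a run and fail loudly if any row is unaccounted for."""
    if rows_read != len(accepted) + len(rejected):
        raise ValueError(
            f"{source}: read {rows_read} rows but accounted for "
            f"{len(accepted) + len(rejected)}"
        )
    return {
        "source": source,
        "rows_read": rows_read,
        "rows_accepted": len(accepted),
        "rows_rejected": len(rejected),
        "warnings": len(warnings),
        # Per-reason counts make a spike in one failure type easy to spot.
        "rejected_by_reason": dict(Counter(r["reason"] for r in rejected)),
    }
```

A manifest like this can be serialized with `json.dumps` and written next to the output file, so a later investigation can compare what was read with what was loaded.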
Practice prompts
- List which failures in an orders file should reject a row versus stop the run.
- Design a rejected-row schema.
Remember this
Validation is not just defensive code. It is the data contract made executable.
