Orientation
What You'll Master Here
Ingestion is the boundary where the outside world enters your pipeline. A file is not trusted data yet. It is bytes, text, rows, payloads, timestamps, and failure modes that need to be turned into records with evidence.
This chapter teaches how data engineers use Python to read files and payloads safely. It covers paths, directory contracts, CSV, JSON, NDJSON, row numbers, rejected rows, and manifests.
The goal is not to memorize every option in pathlib, csv, or json. The goal is to build a repeatable ingestion habit: discover inputs, parse them deliberately, preserve enough context to debug, and emit a receipt for the run.
Why data engineers care
Many downstream data quality problems start at ingestion. If you lose file names, row numbers, encodings, or rejected-row reasons, you lose the evidence needed to fix a bad batch.
Core mental model
Treat ingestion as a contract boundary: raw file in, accepted records plus rejected evidence plus manifest out.
Figure: Ingestion boundary. Raw file (bytes and strings) -> Parser (CSV / JSON / NDJSON) -> Validator (contract checks) -> Accepted rows (safe records) and Rejected rows (evidence).
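The boundary can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `ingest_csv`, the column names `order_id` and `amount`, and the manifest fields are all assumptions made up for the example.

```python
import csv
import io


def ingest_csv(text, source_file, required=("order_id", "amount")):
    """Split CSV text into accepted rows, rejected rows, and a manifest."""
    accepted, rejected = [], []
    reader = csv.DictReader(io.StringIO(text, newline=""))
    for row in reader:
        # reader.line_num counts lines read from the source, header included,
        # so it points back at the physical line that produced this row.
        context = {"source_file": source_file, "line": reader.line_num}
        missing = [name for name in required if not row.get(name)]
        if missing:
            rejected.append({**context, "reason": f"missing {missing}", "raw": row})
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            rejected.append({**context, "reason": "amount is not a number", "raw": row})
            continue
        accepted.append({**context, "order_id": row["order_id"], "amount": amount})

    # The manifest is the receipt for the run (hypothetical field names).
    manifest = {
        "source_file": source_file,
        "rows_seen": len(accepted) + len(rejected),
        "accepted": len(accepted),
        "rejected": len(rejected),
        "status": "ok" if not rejected else "partial",
    }
    return accepted, rejected, manifest
```

Calling `ingest_csv(text, "orders.csv")` on a file with one bad amount would return the good rows, one rejected entry carrying its line number and reason, and a manifest whose counts add up.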
- path: A structured reference to a file or directory. Use Path objects so joins, suffixes, parents, and globbing stay explicit.
- payload: The raw content you received: CSV text, a JSON document, one NDJSON line, or an API response body.
- manifest: A run receipt that records what files were seen, row counts, accepted counts, rejected counts, and final status.
- rejected row: A row that failed a row-level rule but did not necessarily make the whole batch unreadable.
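As a small illustration of the path entry above, the sketch below builds a throwaway inbox and discovers its inputs with `pathlib`. The directory layout (`inbox/2024-01-01`) and the helper name `discover_inputs` are assumptions for the example, not a prescribed contract.

```python
from pathlib import Path
import tempfile


def discover_inputs(inbox, pattern="*.csv"):
    """Return matching files under inbox in a stable, sorted order."""
    inbox = Path(inbox)
    # glob() does not promise an order, so sort to keep runs repeatable.
    return sorted(p for p in inbox.glob(pattern) if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical directory contract: one folder per batch date.
    inbox = Path(tmp) / "inbox" / "2024-01-01"
    inbox.mkdir(parents=True)
    (inbox / "orders.csv").write_text("order_id,amount\nA1,10.5\n", encoding="utf-8")
    (inbox / "notes.txt").write_text("not an input\n", encoding="utf-8")

    found = discover_inputs(inbox)
    names = [p.name for p in found]                # file names for the manifest
    suffixes = [p.suffix for p in found]           # explicit suffix check
    batch_dates = [p.parent.name for p in found]   # context taken from the parent folder
```

Joining with `/`, reading `.suffix`, and reading `.parent.name` keep each step visible, which is harder to guarantee when paths are assembled by string concatenation.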
Common mistake
- Opening files as if every payload is already clean data. Consequence: malformed rows either crash the whole batch or disappear without an audit trail.
- Returning only clean records and throwing away file and row context. Consequence: no one can trace a bad warehouse row back to the input that produced it.
Better habit
- Always preserve source file and row number when parsing row-based files.
- Return accepted records and rejected records separately.
- Emit a manifest even for empty, failed, or quarantined inputs.
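The three habits above can be sketched for NDJSON input. This is an illustrative sketch under assumptions: the function name `ingest_ndjson`, the status values (`empty`, `partial`, `ok`), and the choice to skip blank lines are made up for the example and would be set by your own contract.

```python
import json


def ingest_ndjson(lines, source_file):
    """Parse NDJSON lines into accepted and rejected lists plus a manifest."""
    accepted, rejected = [], []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped, not rejected
        context = {"source_file": source_file, "line": line_number}
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            rejected.append({**context, "reason": f"invalid JSON: {exc.msg}",
                             "raw": line.rstrip("\n")})
            continue
        if not isinstance(record, dict):
            rejected.append({**context, "reason": "record is not a JSON object",
                             "raw": line.rstrip("\n")})
            continue
        accepted.append({**context, "record": record})

    rows_seen = len(accepted) + len(rejected)
    if rows_seen == 0:
        status = "empty"      # an empty input still gets a receipt
    elif rejected:
        status = "partial"
    else:
        status = "ok"
    manifest = {"source_file": source_file, "rows_seen": rows_seen,
                "accepted": len(accepted), "rejected": len(rejected),
                "status": status}
    return accepted, rejected, manifest
```

Because the manifest is built after the loop regardless of what the loop found, an empty file produces a manifest with zero counts and an `empty` status instead of producing nothing.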
A strong ingestion answer says how you will handle partial failure. Interviewers notice when you preserve row numbers and rejected-row reasons instead of only showing the happy path.
Ingestion failures are often caused by upstream systems changing delimiters, adding columns, sending empty files, or producing one malformed record in an otherwise usable batch.
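Some of these upstream changes can be caught before any row is parsed. The sketch below checks a CSV payload for file-level problems; the expected header, the helper name `file_level_problem`, and the policy that an added column quarantines the file are all assumptions for illustration, and a real contract might tolerate extra columns.

```python
import csv
import io

EXPECTED_HEADER = ["order_id", "amount"]  # hypothetical contract


def file_level_problem(text, expected_header=EXPECTED_HEADER):
    """Return a quarantine reason for the whole file, or None if the header is as expected."""
    if not text.strip():
        return "empty file"
    header = next(csv.reader(io.StringIO(text, newline="")))
    if header == expected_header:
        return None
    # A single-column header containing another separator suggests that
    # the upstream system changed its delimiter.
    if len(header) == 1 and any(sep in header[0] for sep in ";|\t"):
        return f"unexpected delimiter in header: {header[0]!r}"
    extra = [name for name in header if name not in expected_header]
    missing = [name for name in expected_header if name not in header]
    return f"header mismatch: missing={missing} extra={extra}"
```

A file that passes this check can still contain individual malformed records; those are handled row by row, so one bad record does not discard an otherwise usable batch.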
Practice prompts
- For a CSV file you know, list the accepted output fields and the rejected-row fields you would preserve.
- Name one condition that should quarantine the entire file and one condition that should reject only a row.
Remember this
A file is not trusted data until you parse it, validate it, keep rejected evidence, and write a manifest.
