Orientation
What You'll Master Here
Serialization is the moment Python values become bytes or text that another system must read. That boundary deserves the same care as validation or schema design.
This chapter covers CSV, JSON, NDJSON, compression, row-oriented formats, columnar formats, Parquet, Avro, ORC, partitioned layouts, and file manifests.
The standard-library examples are executable Python. The Parquet, Avro, and ORC sections explain the data engineering tradeoffs and where dedicated libraries or engines usually enter.
Why data engineers care
A perfectly normalized record can still fail downstream if Decimal or datetime values are serialized carelessly, or if headers, compression, schema, or partition layout are chosen without regard for the consumer.
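As a minimal sketch of that failure mode, the standard-library json module rejects Decimal and datetime unless it is given a policy. The record contents and the encode_value helper below are hypothetical; the policy shown (Decimal as a string, timezone-aware datetime as ISO 8601) is one reasonable choice, not the only one.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal

# Hypothetical record used only for illustration.
record = {
    "order_id": 42,
    "amount": Decimal("19.90"),
    "created_at": datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
}

# The default encoder raises TypeError for types it does not know.
try:
    json.dumps(record)
    default_failed = False
except TypeError:
    default_failed = True


def encode_value(value):
    """Assumed policy: Decimal as string, aware datetime as ISO 8601."""
    if isinstance(value, Decimal):
        return str(value)  # keeps the exact digits, including trailing zeros
    if isinstance(value, datetime):
        if value.tzinfo is None:
            raise ValueError("naive datetime; attach a timezone before serializing")
        return value.isoformat()
    raise TypeError(f"no serialization policy for {type(value).__name__}")


# json.dumps calls encode_value only for values it cannot encode itself.
line = json.dumps(record, default=encode_value, sort_keys=True)
print(default_failed)
print(line)
```

Writing Decimal as a string avoids the rounding a float conversion can introduce, at the cost of requiring the consumer to parse it back.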
Core mental model
Serialization is an output contract: values, field names, ordering, encoding, compression, schema, and file evidence.
Flow: domain values (Decimal / datetime) → serializer (policy) → bytes or text (file payload) → manifest (evidence)
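The flow above can be sketched with the standard-library csv module. The field list, the format_cell policy, and the evidence dictionary are assumptions made for illustration; the point is that column order, number formatting, line endings, and encoding are all stated explicitly instead of being left to defaults.

```python
import csv
import io
from datetime import datetime, timezone
from decimal import Decimal

FIELDS = ["order_id", "amount", "created_at"]  # explicit, stable column order


def format_cell(value):
    """Assumed policy for turning domain values into CSV text."""
    if isinstance(value, Decimal):
        return format(value, "f")  # plain notation, never an exponent
    if isinstance(value, datetime):
        return value.isoformat()
    return value


def to_csv_bytes(rows):
    """Serialize rows to UTF-8 CSV bytes with a header and \\n line endings."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({name: format_cell(row[name]) for name in FIELDS})
    return buffer.getvalue().encode("utf-8")


# Domain values -> serializer policy -> bytes -> evidence about the payload.
rows = [
    {
        "order_id": 1,
        "amount": Decimal("19.90"),
        "created_at": datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    }
]
payload = to_csv_bytes(rows)
evidence = {"format": "csv", "row_count": len(rows), "byte_count": len(payload)}
print(payload.decode("utf-8"))
print(evidence)
```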
- serialization: Converting in-memory values into a file, stream, or wire representation.
- row-oriented: Records stored one row at a time, common in CSV, JSON, and NDJSON.
- columnar: Values stored by column, useful for analytical scans and compression.
- manifest: File-level evidence such as path, format, row count, schema version, and status.
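The row-oriented and columnar definitions can be illustrated in plain Python. This is a toy in-memory sketch with made-up data, not how Parquet or ORC are implemented: it only shows why a column layout lets a scan touch one field, and why repeated values sitting next to each other are easy to compress (here with a simple run-length encoding).

```python
from itertools import groupby

# Row-oriented: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "country": "DE", "amount": 10},
    {"order_id": 2, "country": "DE", "amount": 25},
    {"order_id": 3, "country": "FR", "amount": 40},
]

# Columnar: one list of values per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one field reads every record in the row layout,
# but only a single list in the column layout.
total_from_rows = sum(row["amount"] for row in rows)
total_from_columns = sum(columns["amount"])

# Adjacent equal values in a column collapse into (value, run length) pairs.
country_runs = [(value, len(list(group))) for value, group in groupby(columns["country"])]

print(columns)
print(total_from_rows, total_from_columns)
print(country_runs)
```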
Common mistake
Treating file writing as an afterthought.
When that happens, downstream jobs discover type and schema issues only after the data has landed.
Better habit
- Choose format by consumer and access pattern.
- Serialize Decimal and datetime deliberately.
- Write a manifest for every produced file.
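The manifest habit can be sketched with the standard library alone. The function name write_ndjson_gz, the ".manifest.json" suffix, and the exact manifest fields are assumptions for illustration; a real pipeline would likely add fields such as partition keys or a producing job identifier.

```python
import gzip
import hashlib
import json
import tempfile
from pathlib import Path


def write_ndjson_gz(records, path, schema_version):
    """Write gzip-compressed NDJSON plus a sidecar manifest; return the manifest."""
    path = Path(path)
    row_count = 0
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")
            row_count += 1

    manifest = {
        "path": path.name,
        "format": "ndjson",
        "compression": "gzip",
        "row_count": row_count,
        "schema_version": schema_version,
        # Checksum of the bytes on disk, so consumers can verify the file.
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "status": "complete",
    }
    manifest_path = path.with_name(path.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


# Usage with a temporary directory and two hypothetical events.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "events.ndjson.gz"
    manifest = write_ndjson_gz([{"id": 1}, {"id": 2}], target, schema_version="v1")
    manifest_exists = (Path(tmp) / "events.ndjson.gz.manifest.json").exists()
    with gzip.open(target, "rt", encoding="utf-8") as handle:
        read_back = [json.loads(line) for line in handle]

print(manifest["row_count"], manifest["status"])
```

Counting rows while writing, instead of trusting the input length, means the manifest describes what actually reached the file.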
I would pick the format based on consumer, schema needs, row vs column access, compression, and partition layout, then emit a manifest with row counts and schema version.
Practice prompts
- Choose a format for raw API events and explain why.
- List manifest fields for a partitioned export.
Remember this
File format choice is data architecture, not a save-as detail.
