DATA TRANSFORMS

Serialization & File Formats

Chapter 09 · Intermediate · Storage

Orientation

What You'll Master Here

Serialization is the moment Python values become bytes or text that another system must read. That boundary deserves the same care as validation or schema design.

This chapter covers CSV, JSON, NDJSON, compression, row-oriented formats, columnar formats, Parquet, Avro, ORC, partitioned layouts, and file manifests.

The standard-library examples are executable Python. The Parquet, Avro, and ORC sections explain the data engineering tradeoffs and where dedicated libraries or engines usually enter.

Why data engineers care

A perfectly normalized record can still fail downstream if Decimal or datetime values, headers, compression, schema, or partition layout are handled carelessly at write time.
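As a small illustration of the Decimal and datetime risk, here is a minimal sketch using only the standard-library json module. The helper name encode_value, the sample record, and the policy itself (Decimal as a string, datetime as UTC ISO 8601, naive datetimes rejected) are assumptions made for this example, not a rule this chapter prescribes.

```python
import json
from datetime import datetime, timezone
from decimal import Decimal


def encode_value(value):
    """Fallback for json.dumps: called only for types json cannot encode itself."""
    if isinstance(value, Decimal):
        # Emit a string so the exact digits survive; a float could round them.
        return str(value)
    if isinstance(value, datetime):
        if value.tzinfo is None:
            # Assumed policy: refuse to guess the timezone of a naive datetime.
            raise ValueError("naive datetime: attach a timezone before serializing")
        # Normalize to UTC and emit ISO 8601 text.
        return value.astimezone(timezone.utc).isoformat()
    raise TypeError(f"cannot serialize {type(value).__name__}")


# Hypothetical record used only for this example.
record = {
    "order_id": 1001,
    "amount": Decimal("19.90"),
    "created_at": datetime(2024, 3, 1, 12, 30, tzinfo=timezone.utc),
}

line = json.dumps(record, default=encode_value, sort_keys=True)
print(line)
# {"amount": "19.90", "created_at": "2024-03-01T12:30:00+00:00", "order_id": 1001}
```

Without the default hook, json.dumps raises TypeError on the Decimal. The point of the sketch is that the failure mode and the output representation are chosen on purpose rather than discovered downstream.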

Core mental model

Serialization is an output contract: values, field names, ordering, encoding, compression, schema, and file evidence.

domain values (Decimal / datetime) → serializer (policy) → bytes or text (file payload) → manifest (evidence)
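The first three stages of this flow, from domain values through a serializer policy to bytes, might look like the sketch below. It targets gzip-compressed NDJSON with the standard library. The function name to_ndjson_gz, the field_order parameter, and the sample events are hypothetical, and the specific policy choices (fixed field order, compact separators, UTF-8, mtime=0) are assumptions for illustration.

```python
import gzip
import json


def to_ndjson_gz(records, field_order):
    """Serialize dict records to gzip-compressed NDJSON bytes."""
    lines = []
    for record in records:
        # Policy: fixed field order; a missing field fails loudly with KeyError.
        ordered = {name: record[name] for name in field_order}
        # Policy: compact separators, non-ASCII characters kept as-is.
        lines.append(json.dumps(ordered, ensure_ascii=False, separators=(",", ":")))
    # Policy: one JSON object per line, each line newline-terminated.
    text = "".join(line + "\n" for line in lines)
    # Policy: UTF-8 text; mtime=0 keeps the current time out of the gzip header.
    return gzip.compress(text.encode("utf-8"), mtime=0)


# Hypothetical events used only for this example.
events = [
    {"id": 1, "city": "Zürich", "status": "ok"},
    {"id": 2, "city": "Lima", "status": "retry"},
]

payload = to_ndjson_gz(events, field_order=["id", "status", "city"])
decoded = gzip.decompress(payload).decode("utf-8")
print(decoded)
```

Every choice a consumer could trip over (field order, encoding, line termination, compression) is written down in one function, which is what makes the output a contract rather than an accident.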

Key terms
serialization: Converting in-memory values into a file, stream, or wire representation.
row-oriented: Records stored one row at a time, common in CSV, JSON, and NDJSON.
columnar: Values stored by column, useful for analytical scans and compression.
manifest: File-level evidence such as path, format, row count, schema version, and status.
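The row-oriented versus columnar distinction can be seen with plain Python structures. This is a toy in-memory illustration only: it does not reflect how Parquet or ORC lay out bytes on disk, and the helper name to_columns and the sample rows are hypothetical.

```python
# Row-oriented: each record keeps all of its fields together.
rows = [
    {"user": "a", "country": "DE", "amount": 10},
    {"user": "b", "country": "DE", "amount": 25},
    {"user": "c", "country": "FR", "amount": 7},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name in columns:
            columns[name].append(row[name])
    return columns


# Columnar: each field's values sit together.
columns = to_columns(rows)

# An analytical scan reads one column and ignores the rest.
total = sum(columns["amount"])
print(total)  # 42

# Similar values end up adjacent, which is what tends to help compression.
print(columns["country"])  # ['DE', 'DE', 'FR']
```

Appending one new record is natural in the row layout, while summing one field is natural in the column layout. That is the access-pattern tradeoff the two terms describe.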

Common mistake

Treating file writing as an afterthought.

Downstream jobs then discover type and schema issues only after the data has landed.

Better habit

  • Choose format by consumer and access pattern.
  • Serialize Decimal and datetime deliberately.
  • Write a manifest for every produced file.
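The last habit can be sketched with the standard library as follows. The function name write_with_manifest, the ".manifest.json" sidecar naming, and the size_bytes and sha256 fields are assumptions added for this example. The path, format, row count, schema version, and status fields follow the manifest definition in the key terms.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_with_manifest(path, payload, *, file_format, row_count, schema_version):
    """Write payload bytes, then a JSON manifest file next to them."""
    path = Path(path)
    path.write_bytes(payload)
    manifest = {
        "path": path.name,
        "format": file_format,
        "row_count": row_count,
        "schema_version": schema_version,
        # Assumed extras: size and checksum let a consumer verify the file.
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "status": "complete",
    }
    # Hypothetical naming convention: <file name>.manifest.json
    manifest_path = path.with_name(path.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest


with tempfile.TemporaryDirectory() as tmp:
    payload = b'{"id":1}\n{"id":2}\n'  # two NDJSON rows
    manifest = write_with_manifest(
        Path(tmp) / "events.ndjson",
        payload,
        file_format="ndjson",
        row_count=2,
        schema_version="1",
    )
    saved = json.loads(
        (Path(tmp) / "events.ndjson.manifest.json").read_text(encoding="utf-8")
    )
    print(saved["row_count"], saved["size_bytes"])  # 2 18
```

The manifest is written after the payload, so a consumer that waits for the manifest never sees a file that is still being written.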

What to say

I would pick the format based on consumer, schema needs, row vs column access, compression, and partition layout, then emit a manifest with row counts and schema version.

Practice prompts

  • Choose a format for raw API events and explain why.
  • List manifest fields for a partitioned export.

Remember this

File format choice is data architecture, not a save-as detail.