DATA TRANSFORMS

Records, Collections & Keyed Data

Chapter 02FoundationsCollections

Orientation

What You'll Master Here

Chapter 1 said a record is a dict and a dataset is an iterable of records. This chapter is about the containers that hold and reshape those records: lists, tuples, dicts, and sets. Choosing the right one is most of what makes a transform fast, correct, and easy to read.

The moves here are the everyday vocabulary of data work: build a lookup map so you can join two datasets, group records by a key, dedupe by a key while keeping the right copy, and check membership with sets. These are the pure-Python versions of the joins and group-bys you already know from SQL.

We keep using the marketplace dataset, buyers and orders and order items, so the focus stays on the data structures, not the domain.

Why data engineers care

The difference between a pipeline that finishes in seconds and one that times out is usually a data-structure choice: a dict lookup instead of a nested loop, a set instead of a list scan.

Core mental model

Pick the container by the question you ask of it: order matters → list, identity/lookup → dict, membership/uniqueness → set, fixed shape → tuple.

Key terms
collection
A container that holds many values: list, tuple, dict, or set.
lookup map
A dict keyed by an id so you can fetch a related record in constant time.
grouping
Bucketing records by a shared key so each bucket can be aggregated.
dedupe
Removing repeats, deciding which copy of a key to keep.

Common mistake

Defaulting to a list for everything, then scanning it repeatedly with `in`.

Membership and lookup become O(n) each, and the job slows to a crawl as data grows.

Better habit

  • Ask "what question will I ask of this collection?" before choosing it.
  • Reach for a dict the moment you need to look something up by id.
  • Reach for a set the moment you ask "is it in here?" or "is it unique?".
How to read this chapter

Each section maps a SQL idea you know (join, group by, distinct) to its pure-Python container move. If you know the SQL, you already know the goal.

Interview note

Reaching for the right container unprompted ("I would index buyers into a dict first") signals you think about complexity, not just correctness.

Practice prompts

  • List four questions about data and name the best container for each.
  • Find a list in your code that is searched with `in` and decide whether it should be a set.

Remember this

The container you choose encodes the question you are asking; pick it deliberately and the transform almost writes itself.