DATA TRANSFORMS

Strings, Regex & Text Normalization

Chapter 03 · Foundations · Text

Orientation

What You'll Master Here

Text is where data is dirtiest. Names arrive with stray whitespace and inconsistent casing, files come in unknown encodings, free-text fields hide the values you need, and two strings that look identical on screen can be different sequences of code points. Many "the join lost rows" mysteries are really text problems.

This chapter gives you the toolkit to tame it: decode bytes to text safely, clean and normalize strings so equal-looking values actually compare equal, split and parse fields without surprises, and use regular expressions to extract, validate, and rewrite text deliberately.

The throughline is a single goal: produce stable, comparable values. A key you group or join on is only useful if every record that should share it produces the exact same string.

Why data engineers care

A join key of "café" and a join key of "Café " are different strings, so the rows silently fail to match. Text normalization is the difference between a correct join and a quietly broken one.
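A minimal sketch of that silent failure, assuming a hypothetical in-memory lookup of customers keyed by name. The names `customers` and `orders` and the IDs are illustrative only:

```python
# Hypothetical lookup table keyed by customer name.
customers = {"café": "C-001"}

# Two orders that a person would read as the same customer.
orders = [("Café ", 12.50), ("café", 4.00)]

# A naive join: keep only orders whose raw key exists in the lookup.
matched = [(name, amount) for name, amount in orders if name in customers]
dropped = [(name, amount) for name, amount in orders if name not in customers]

print(matched)  # [('café', 4.0)]
print(dropped)  # [('Café ', 12.5)]
```

No exception is raised and no warning is logged; the first order simply never appears in the joined result.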

Core mental model

Decode bytes to text at the edge, normalize to one canonical form, then compare. Looks-the-same is not is-the-same.
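The three steps can be sketched as follows. The helper name `canonical` and the choice of NFC plus `strip()` plus `casefold()` are assumptions made for illustration; a real pipeline might pick a different normalization form or extra rules:

```python
import unicodedata


def canonical(text: str) -> str:
    """Hypothetical helper: reduce text to one comparable form."""
    text = unicodedata.normalize("NFC", text)  # one code point sequence per character
    return text.strip().casefold()             # drop edge whitespace, ignore case


# 1. Decode at the edge: these are UTF-8 bytes as they might arrive from a file.
raw = b"Caf\xc3\xa9 "
decoded = raw.decode("utf-8")

# The same word typed as "e" followed by a combining acute accent.
decomposed = "cafe\u0301"

# 2 and 3. Normalize, then compare.
print(decoded == decomposed)                        # False
print(canonical(decoded) == canonical(decomposed))  # True
```

The raw comparison fails on case, trailing whitespace, and code point sequence at once; the normalized comparison succeeds because both inputs reduce to the same string.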

Key terms

  • encoding: The rule mapping bytes to characters (UTF-8, Latin-1). Bytes are meaningless without it.
  • normalization: Reducing a string to one canonical form so equal-looking values compare equal.
  • regex: A pattern language for matching, extracting, and rewriting text.
  • canonical key: The single normalized form of a value used for grouping, joining, and dedupe.

Common mistake

Trusting that a value is clean because it prints fine.

Hidden whitespace, casing, or alternate code points break joins and dedupe while everything looks right on screen.
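One way to see what printing hides is to inspect each value's length and its escaped form. This is a small sketch using only built-ins; the three sample values are made up for illustration:

```python
a = "caf\u00e9"        # precomposed e-acute
b = "cafe\u0301"       # "e" followed by a combining acute accent
c = "caf\u00e9\u00a0"  # precomposed, plus a trailing no-break space

# print() renders all three alike; len() and ascii() expose the differences.
for value in (a, b, c):
    print(value, len(value), ascii(value))

print(a == b, a == c)  # False False
```

`ascii()` escapes every non-ASCII character, so the combining accent and the no-break space show up as `\u0301` and `\xa0` instead of blending into the rendered text.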

Better habit

  • Decode to str at the boundary; never process raw bytes deep in logic.
  • Normalize before you compare, group, or join.
  • When two equal-looking values do not match, suspect the text first.
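The first habit can be sketched as a single boundary function. The name `read_text` and the UTF-8 default are assumptions for illustration, not a prescribed API:

```python
def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical boundary function: bytes in, str out, fail loudly."""
    # errors="strict" raises on invalid bytes instead of guessing.
    return raw.decode(encoding, errors="strict")


utf8_bytes = "caf\u00e9".encode("utf-8")
latin1_bytes = "caf\u00e9".encode("latin-1")

name = read_text(utf8_bytes)                          # decoded correctly
mojibake = read_text(utf8_bytes, encoding="latin-1")  # wrong codec, no error

try:
    read_text(latin1_bytes)  # Latin-1 bytes are not valid UTF-8 here
    failed = False
except UnicodeDecodeError:
    failed = True

print(name, ascii(mojibake), failed)
```

Decoding with the wrong codec does not always raise: UTF-8 bytes read as Latin-1 quietly become garbled text. That is why the encoding should be stated once, at the boundary, rather than guessed deeper in the logic.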

Interview note

Asked why a join dropped rows, a strong first guess is "key formatting: whitespace, casing, or encoding." It shows you suspect dirty formatting before you suspect wrong values.

How to read this chapter

Each section ends one step closer to a reusable key() function. The final section assembles the pieces into the canonicalizer you will paste into real pipelines.

Practice prompts

  • Name two ways two equal-looking strings can fail to compare equal.
  • Describe where in a pipeline decoding to text should happen, and why there.

Remember this

Text work exists to produce stable, comparable values; "looks the same" must be turned into "is the same" before any key is used.