DATA TRANSFORMS

APIs, Databases & External Boundaries

Chapter 11 · Advanced · Boundaries

Orientation

The External Boundary

Chapters 1-10 lived inside your process, where code is fast, reliable, and fully in your control. This chapter crosses the boundary into the outside world: APIs and databases you do not own. The moment you cross that line, three things become true that were never true before: calls are slow, calls can fail at any time, and you will be rate-limited if you ask too fast.

Most techniques here exist to handle one of those three facts. Timeouts handle slowness. Retries handle failure. Backoff and rate-limit handling handle throttling. The rest cover what the boundary adds on top: pagination handles size, parameterized queries and cursors handle databases safely, and secrets handling keeps your credentials out of logs.
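
As a small illustration of the database side, here is a minimal sketch of a parameterized query read through a cursor in batches. It uses Python's built-in sqlite3 module with an in-memory database. The orders table, its columns, and the function names fetch_orders and demo_connection are assumptions made up for this example.

```python
import sqlite3


def fetch_orders(conn, customer_id, batch_size=2):
    """Yield rows for one customer using a parameterized query."""
    cur = conn.cursor()
    # The ? placeholder lets the driver bind the value as data.
    # The value is never formatted into the SQL string itself.
    cur.execute(
        "SELECT id, customer_id, amount FROM orders "
        "WHERE customer_id = ? ORDER BY id",
        (customer_id,),
    )
    # Pull rows in small batches instead of loading the whole result at once.
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield from rows


def demo_connection():
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (id, customer_id, amount) VALUES (?, ?, ?)",
        [(1, "a", 10.0), (2, "b", 5.0), (3, "a", 7.5)],
    )
    return conn
```

Because the customer id is bound as a value, a hostile string such as "a' OR '1'='1" is compared literally and matches no rows, rather than rewriting the query.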

The goal is a source that the rest of your pipeline can treat like any other iterable of records, with all the boundary mess hidden behind one clean, testable function. We build toward exactly that by the end of the chapter.
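
To make that goal concrete, here is a rough sketch of what such a source could look like. It is not the chapter's final version. It assumes a hypothetical fetch_page callable that takes a page token and returns a dict with "items" and "next" keys; a real API will have its own shape. Timeouts and retries would live inside fetch_page, so the caller only ever sees records.

```python
from typing import Callable, Iterator, Optional


def records_source(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield records one by one, hiding pagination from the caller."""
    token = None  # None means "first page" in this assumed protocol
    while True:
        page = fetch_page(token)
        yield from page["items"]
        token = page.get("next")
        if token is None:  # no further pages
            break


def fake_fetch_page(token):
    """Stand-in for a real API call, so the sketch runs without a network."""
    pages = {
        None: {"items": [{"id": 1}, {"id": 2}], "next": "p2"},
        "p2": {"items": [{"id": 3}], "next": None},
    }
    return pages[token]
```

The rest of the pipeline can then write `for record in records_source(fake_fetch_page): ...` and never learn that pages exist. Passing fetch_page in as an argument is also what keeps the source testable.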

Why data engineers care

Code that ignores the boundary works in a demo and fails in production: it hangs on a slow call, crashes on a flaky one, gets banned for ignoring rate limits, or leaks a secret into a log. Treating the boundary with respect is what makes a pipeline survive contact with reality.

Core mental model

Inside your process: trust it. Across the boundary: assume slow, flaky, and rate-limited, and code defensively for all three.

Inside your process

fast · reliable · in your control

Pure transforms over records, as in chapters 1-10.

External world (API / DB)

  • slow (network latency)
  • flaky (can fail any call)
  • rate-limited (will throttle you)
  • changes without warning

Every rule in this chapter exists because the external world is slow, flaky, and rate-limited. Code that ignores that works in a demo and fails in production.
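
Coding defensively for "flaky" and "rate-limited" often comes down to a retry loop with exponential backoff. The sketch below is one possible shape, with assumptions: TransientError is a hypothetical exception that the wrapped function raises for retryable failures, and the default attempt counts and delays are placeholders, not recommendations. Handling "slow" is left to the wrapped function, which should set its own timeout.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeout, 429, 5xx)."""


def call_with_retry(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    jitter=0.1, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Double the wait after each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A little random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * jitter))
```

The sleep function is injectable so that tests can record the delays instead of actually waiting. Any other exception is treated as permanent and propagates immediately.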

Key terms

boundary
The line between your process and an external system (API or database) you do not control.
transient failure
A temporary error (timeout, HTTP 429, 5xx) where the same call often succeeds if retried.
idempotent read
A read that returns the same slice when repeated, so a rerun does no harm.
source
A function that hides boundary concerns and yields records to the rest of the pipeline.
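
The "transient failure" definition can be written down as a small predicate. This is a sketch that follows the definition above literally; the function name and parameters are assumptions for illustration, and a real client might choose to treat some 5xx codes as permanent.

```python
def is_transient(status_code=None, timed_out=False):
    """Return True if a failure looks temporary and is worth retrying."""
    if timed_out:
        return True  # timeouts are retryable by the definition above
    if status_code is None:
        return False  # nothing to classify
    # 429 means "too many requests"; 5xx means the server failed, not the request.
    return status_code == 429 or 500 <= status_code <= 599
```

A 404 or 400 returns False here: repeating the same request would fail the same way, so retrying only wastes time and rate-limit budget.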

Common mistake

Treating an API or database call like a local function that always returns instantly and succeeds.

The first slow or failed call hangs or crashes the whole job, often at 3am when the upstream has a bad day.

Better habit

  • Assume every external call can be slow, fail, or be throttled.
  • Decide the failure and retry policy before writing the happy path.
  • Hide boundary details behind a single source function.
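
One way to practice the second habit is to write the policy down as data before any request code exists. The sketch below assumes a hypothetical BoundaryPolicy class; the field names and default values are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryPolicy:
    """Failure and retry decisions, stated once and passed to the source."""
    timeout_s: float = 10.0     # give up on a single call after this long
    max_attempts: int = 4       # total tries, including the first
    base_delay_s: float = 0.5   # wait before the first retry
    max_delay_s: float = 30.0   # never wait longer than this between tries

    def delays(self):
        """Waits between attempts: one fewer than max_attempts, doubling each time."""
        return [
            min(self.max_delay_s, self.base_delay_s * 2 ** i)
            for i in range(self.max_attempts - 1)
        ]
```

With the defaults, `BoundaryPolicy().delays()` gives `[0.5, 1.0, 2.0]`. Keeping the policy in one frozen object makes it easy to review, log, and change without touching the request code.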

Production reality

Upstream systems have outages, deploys, and rate-limit changes you never hear about. Your job has to absorb that without losing or duplicating data.
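
The "duplicating" half of that problem can be illustrated with a small filter. When a page is fetched again after a retry, the same record can arrive twice; the sketch below drops repeats by key. It assumes every record is a dict with a unique "id" field, which is an assumption about the data and not something every upstream guarantees. It does nothing about lost records.

```python
def dedupe_by_key(records, key="id"):
    """Yield each record once, skipping any whose key was already seen."""
    seen = set()
    for record in records:
        k = record[key]
        if k in seen:
            continue  # repeat from a retried or overlapping read
        seen.add(k)
        yield record
```

The seen set grows with the number of distinct keys, so this suits one bounded run rather than an endless stream.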

Interview note

When asked to "pull data from an API", a senior candidate names timeouts, retries with backoff, pagination, and rate limits before writing a single line of code. That checklist is the signal.

Practice prompts

  • List the three facts that become true once a call crosses the boundary.
  • Name which technique in this chapter addresses each of those three facts.

Remember this

Crossing the boundary changes the rules: code for slow, flaky, and rate-limited from the first line, and hide it behind a clean source.