Orientation
The External Boundary
Chapters 1-10 lived inside your process, where code is fast, reliable, and fully in your control. This chapter crosses the boundary into the outside world: APIs and databases you do not own. The moment you cross that line, three things become true that were never true before: calls are slow, calls can fail at any time, and you will be rate-limited if you ask too fast.
Most techniques here exist to handle one of those three facts. Timeouts handle slowness. Retries handle failure. Backoff and rate-limit handling handle throttling. The rest cover what the boundary adds on top: pagination handles results too large for one response, parameterized queries and cursors handle databases safely, and secrets handling keeps your credentials out of logs.
The goal is a source that the rest of your pipeline can treat like any other iterable of records, with all the boundary mess hidden behind one clean, testable function. We build toward exactly that by the end of the chapter.
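As a rough picture of that end state, here is a minimal sketch of a source written as a generator. The names `records_source`, `fetch_page`, and `fake_fetch_page` are hypothetical, and the cursor-based page shape is an assumption for illustration, not any particular API's contract.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Record = Dict[str, object]
# Assumed page shape: (records, next_cursor); next_cursor is None on the last page.
Page = Tuple[List[Record], Optional[str]]


def records_source(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Record]:
    """Yield records one at a time, hiding pagination from the caller."""
    cursor: Optional[str] = None
    while True:
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            return


def fake_fetch_page(cursor: Optional[str]) -> Page:
    """Stand-in for a real API call: two pages of in-memory data."""
    pages: Dict[Optional[str], Page] = {
        None: ([{"id": 1}, {"id": 2}], "page-2"),
        "page-2": ([{"id": 3}], None),
    }
    return pages[cursor]


# Downstream code sees a plain iterable of records.
ids = [record["id"] for record in records_source(fake_fetch_page)]
```

Passing `fetch_page` in as an argument is what keeps the source testable: a test supplies a fake, while production code would supply a function that also applies timeouts and retries.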
Why data engineers care
Code that ignores the boundary works in a demo and fails in production: it hangs on a slow call, crashes on a flaky one, gets banned for ignoring rate limits, or leaks a secret into a log. Treating the boundary with respect is what makes a pipeline survive contact with reality.
Core mental model
Inside your process: trust it. Across the boundary: assume slow, flaky, and rate-limited, and code defensively for all three.
Inside your process
fast · reliable · in your control
Pure transforms over records, as in chapters 1-10.
External world (API / DB)
- slow (network latency)
- flaky (can fail any call)
- rate-limited (will throttle you)
- changes without warning
Every rule in this chapter exists because the external world is slow, flaky, and rate-limited. Code that ignores that works in a demo and fails in production.
- boundary: The line between your process and an external system (API or database) you do not control.
- transient failure: A temporary error (timeout, HTTP 429, 5xx) where the same call often succeeds if retried.
- idempotent read: A read that returns the same slice when repeated, so a rerun does no harm.
- source: A function that hides boundary concerns and yields records to the rest of the pipeline.
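The definition of a transient failure can be written down as a small predicate. This is a sketch only: `is_transient` is a hypothetical helper, and treating every 5xx status as retryable follows the definition above, though a given API may use some 5xx codes for permanent errors.

```python
from typing import Optional


def is_transient(status: Optional[int] = None,
                 error: Optional[BaseException] = None) -> bool:
    """Return True when retrying the same call has a fair chance of succeeding."""
    # Timeouts and dropped connections are usually temporary.
    if isinstance(error, (TimeoutError, ConnectionError)):
        return True
    if status is None:
        return False
    # 429 means "slow down"; 5xx means the server had a problem.
    return status == 429 or 500 <= status <= 599
```

Everything else, such as a 404 or a 400, is treated as permanent here: retrying would repeat the same mistake.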
Common mistake
Treating an API or database call like a local function that always returns instantly and succeeds.
The first slow or failed call hangs or crashes the whole job, often at 3am when the upstream has a bad day.
Better habit
- Assume every external call can be slow, fail, or be throttled.
- Decide the failure and retry policy before writing the happy path.
- Hide boundary details behind a single source function.
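Deciding the retry policy first can be as simple as writing it down as data before any request code exists. The sketch below assumes a hypothetical `TransientError` exception and `RetryPolicy` class, with illustrative default numbers; it uses exponential backoff with random jitter, which is one common choice rather than the only one.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeout, 429, 5xx)."""


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 4      # total tries, including the first
    base_delay_s: float = 0.5  # ceiling for the first wait
    max_delay_s: float = 8.0   # cap on any single wait


def call_with_retry(call: Callable[[], T], policy: RetryPolicy,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying transient failures according to policy."""
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == policy.max_attempts:
                raise  # out of attempts: let the caller see the failure
            # The ceiling doubles each attempt, capped at max_delay_s;
            # the random wait below it spreads out competing clients.
            ceiling = min(policy.max_delay_s,
                          policy.base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so that tests can record delays instead of actually waiting. Non-transient exceptions are not caught, so they surface immediately.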
Upstream systems have outages, deploys, and rate-limit changes you never hear about. Your job has to absorb that without losing or duplicating data.
When asked to "pull data from an API", a senior engineer names timeouts, retries with backoff, pagination, and rate limits before writing a single line of code. That checklist is the signal.
Practice prompts
- List the three facts that become true once a call crosses the boundary.
- Name which technique in this chapter addresses each of those three facts.
Remember this
Crossing the boundary changes the rules: code for slow, flaky, and rate-limited from the first line, and hide it behind a clean source.
