D8LooP - Practise Data Engineering

Chapter 07

Orientation: SQL as a table-maintenance language

Analytical SQL asks a question and returns an answer. Data-engineering SQL keeps a warehouse table correct after thousands or millions of upstream changes arrive. That shift changes the job: the query is no longer just a report, it is a contract for how target state evolves.

This chapter treats SQL as the language of safe table maintenance. You will learn how to decide which rows are new, changed, unchanged, deleted, late, or historically important before a pipeline mutates a target table.

A source table is where new data arrives. A target table is the trusted warehouse table users query. A staging query sits between them: it cleans the source, removes bad keys, chooses one row per business key, and prepares an action plan.

A business key is the real-world identifier for one entity, such as `customer_id`. “One row per business key” means customer 104 should appear once in the target current table, even if the source sent customer 104 twice in one batch.

Why data engineers care

Production tables are reused by dashboards, ML features, finance workflows, and downstream models. A bad merge key, a missing replay window, or an accidental append-only model can silently corrupt every consumer.

Core mental model

Every load compares a source slice to an existing target state. Your SQL should make the intended action for each key visible before the target is changed.

A SELECT-based action plan before mutation

Input data

customer_stage8 rows

customer_id	email	tier	updated_at	is_deleted	batch_id
101	ana@example.com	gold	2026-03-05 10:00	0	7001
102	ben@example.com	silver	2026-03-05 11:00	0	7001
103	cy@example.com	bronze	2026-03-05 12:00	1	7001
104	dee-old@example.com	gold	2026-03-05 09:00	0	7000
104	dee@example.com	platinum	2026-03-05 12:30	0	7001
105	eli@example.com	bronze	2026-03-05 08:00	0	7001
NULL	ghost@example.com	silver	2026-03-05 13:00	0	7001
106	fay@example.com	silver	2026-03-04 23:30	0	6999

This incoming batch intentionally includes one duplicate key, one NULL key, one unchanged row, and one tombstone.

dim_customer_current5 rows

customer_id	email	tier	updated_at	is_active
101	ana@example.com	silver	2026-02-10 09:00	1
103	cy@example.com	bronze	2026-01-15 09:00	1
104	dee@example.com	gold	2026-02-20 09:00	1
105	eli@example.com	bronze	2026-02-21 09:00	1
107	gus@example.com	silver	2026-02-25 09:00	1

This is the trusted current-state table before the incoming batch is applied.

WITH ranked_stage AS (
  SELECT
    customer_id,
    email,
    tier,
    updated_at,
    is_deleted,
    batch_id,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY updated_at DESC, batch_id DESC
    ) AS row_num
  FROM customer_stage
  WHERE customer_id IS NOT NULL
),
clean_stage AS (
  SELECT customer_id, email, tier, updated_at, is_deleted
  FROM ranked_stage
  WHERE row_num = 1
)
SELECT
  s.customer_id,
  CASE
    WHEN t.customer_id IS NULL AND s.is_deleted = 0 THEN 'insert'
    WHEN s.is_deleted = 1 THEN 'delete_candidate'
    WHEN s.email != t.email OR s.tier != t.tier THEN 'update'
    ELSE 'unchanged'
  END AS load_action
FROM clean_stage AS s
LEFT JOIN dim_customer_current AS t
  ON t.customer_id = s.customer_id
ORDER BY s.customer_id;

Query result6 rows

customer_id	load_action
101	update
102	insert
103	delete_candidate
104	update
105	unchanged
106	insert

101 changes tier from silver to gold. 102 and 106 do not exist in the target. 103 is a tombstone. 104 is deduped to the platinum row. 105 matches the target and is unchanged.

1. source batch

what arrived from the app, CRM, or ingestion job

2. staging query

clean types, reject bad keys, dedupe retries

3. action plan

label each key insert, update, unchanged, or delete

4. target table

the warehouse table consumers trust

raw batch

stage clean

dedupe keys

classify actions

mutate target

audit counts

Key terms

target table: The warehouse table the load is maintaining.
source slice: The incoming records considered by this run.
load action: The insert, update, delete, or no-op decision for one key.

Common mistake

Writing the mutation before proving the action plan.

You discover duplicate keys, tombstones, or changed attributes only after the target is already wrong.

Better habit

Start with a SELECT that classifies actions per key.
Count actions before mutating.
Keep the source slice, target comparison, and final action labels in separate CTEs.
Always explain at least one output row aloud: what source row met what target row, and why did that produce this action?

Production habit

A senior load query is auditable: someone can read it and explain why each row will be inserted, updated, deleted, or ignored.

Remember this

Before table maintenance SQL changes data, it should prove the change plan.

Data Engineering SQL Usage

Orientation: SQL as a table-maintenance language