QUERY SYSTEMS

Data Engineering SQL Usage

Chapter 07AdvancedWarehouse

Chapter 07

Orientation: SQL as a table-maintenance language

Analytical SQL asks a question and returns an answer. Data-engineering SQL keeps a warehouse table correct after thousands or millions of upstream changes arrive. That shift changes the job: the query is no longer just a report, it is a contract for how target state evolves.

This chapter treats SQL as the language of safe table maintenance. You will learn how to decide which rows are new, changed, unchanged, deleted, late, or historically important before a pipeline mutates a target table.

A source table is where new data arrives. A target table is the trusted warehouse table users query. A staging query sits between them: it cleans the source, removes bad keys, chooses one row per business key, and prepares an action plan.

A business key is the real-world identifier for one entity, such as `customer_id`. “One row per business key” means customer 104 should appear once in the target current table, even if the source sent customer 104 twice in one batch.

Why data engineers care

Production tables are reused by dashboards, ML features, finance workflows, and downstream models. A bad merge key, a missing replay window, or an accidental append-only model can silently corrupt every consumer.

Core mental model

Every load compares a source slice to an existing target state. Your SQL should make the intended action for each key visible before the target is changed.

A SELECT-based action plan before mutation
Input data
customer_stage8 rows
customer_idemailtierupdated_atis_deletedbatch_id
101ana@example.comgold2026-03-05 10:0007001
102ben@example.comsilver2026-03-05 11:0007001
103cy@example.combronze2026-03-05 12:0017001
104dee-old@example.comgold2026-03-05 09:0007000
104dee@example.complatinum2026-03-05 12:3007001
105eli@example.combronze2026-03-05 08:0007001
NULLghost@example.comsilver2026-03-05 13:0007001
106fay@example.comsilver2026-03-04 23:3006999

This incoming batch intentionally includes one duplicate key, one NULL key, one unchanged row, and one tombstone.

dim_customer_current5 rows
customer_idemailtierupdated_atis_active
101ana@example.comsilver2026-02-10 09:001
103cy@example.combronze2026-01-15 09:001
104dee@example.comgold2026-02-20 09:001
105eli@example.combronze2026-02-21 09:001
107gus@example.comsilver2026-02-25 09:001

This is the trusted current-state table before the incoming batch is applied.

WITH ranked_stage AS (
  SELECT
    customer_id,
    email,
    tier,
    updated_at,
    is_deleted,
    batch_id,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY updated_at DESC, batch_id DESC
    ) AS row_num
  FROM customer_stage
  WHERE customer_id IS NOT NULL
),
clean_stage AS (
  SELECT customer_id, email, tier, updated_at, is_deleted
  FROM ranked_stage
  WHERE row_num = 1
)
SELECT
  s.customer_id,
  CASE
    WHEN t.customer_id IS NULL AND s.is_deleted = 0 THEN 'insert'
    WHEN s.is_deleted = 1 THEN 'delete_candidate'
    WHEN s.email != t.email OR s.tier != t.tier THEN 'update'
    ELSE 'unchanged'
  END AS load_action
FROM clean_stage AS s
LEFT JOIN dim_customer_current AS t
  ON t.customer_id = s.customer_id
ORDER BY s.customer_id;
Query result6 rows
customer_idload_action
101update
102insert
103delete_candidate
104update
105unchanged
106insert

101 changes tier from silver to gold. 102 and 106 do not exist in the target. 103 is a tombstone. 104 is deduped to the platinum row. 105 matches the target and is unchanged.

1. source batch

what arrived from the app, CRM, or ingestion job

2. staging query

clean types, reject bad keys, dedupe retries

3. action plan

label each key insert, update, unchanged, or delete

4. target table

the warehouse table consumers trust

raw batch
stage clean
dedupe keys
classify actions
mutate target
audit counts
Key terms
target table
The warehouse table the load is maintaining.
source slice
The incoming records considered by this run.
load action
The insert, update, delete, or no-op decision for one key.

Common mistake

Writing the mutation before proving the action plan.

You discover duplicate keys, tombstones, or changed attributes only after the target is already wrong.

Better habit

  • Start with a SELECT that classifies actions per key.
  • Count actions before mutating.
  • Keep the source slice, target comparison, and final action labels in separate CTEs.
  • Always explain at least one output row aloud: what source row met what target row, and why did that produce this action?
Production habit

A senior load query is auditable: someone can read it and explain why each row will be inserted, updated, deleted, or ignored.

Remember this

Before table maintenance SQL changes data, it should prove the change plan.