Python Knowledge Base

Lists, Dicts, Sets, Lookups, Dedupe

Chapter 2

Records, Collections & Keyed Data

Lists, dicts, sets, tuples, grouping, lookup maps, dedupe keys, and nested records used in real transforms.
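As a small taste of what this chapter covers, the sketch below shows a dedupe key, a lookup map, and a grouping built from a list of dict records. The sample records, the field names, and the choice of a normalized email as the dedupe key are assumptions made for illustration; they do not come from the chapter itself.

```python
from collections import defaultdict

# Hypothetical sample records; field names are illustrative only.
records = [
    {"id": 1, "email": "Ann@Example.com ", "city": "Oslo"},
    {"id": 2, "email": "bob@example.com", "city": "Lima"},
    {"id": 3, "email": "ann@example.com", "city": "Oslo"},
]


def dedupe_key(record):
    """Normalize the email so case and padding differences share one key."""
    return record["email"].strip().lower()


def dedupe(rows):
    """Keep the first record seen for each dedupe key, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = dedupe_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


unique_records = dedupe(records)

# Lookup map: one dict access by id instead of scanning the list each time.
by_id = {row["id"]: row for row in unique_records}

# Grouping: collect record ids under each city.
ids_by_city = defaultdict(list)
for row in unique_records:
    ids_by_city[row["city"]].append(row["id"])
```

Record 3 is dropped because its normalized email matches record 1. Keeping the first occurrence is one possible policy; a real transform might prefer the most recently updated record instead.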

Encodings, Cleaning, Regex, Parsing, Normalization

Chapter 3

Strings, Regex & Text Normalization

Clean, parse, and normalize messy text: encodings, whitespace, casing, delimiters, and regex for real-world fields.

pathlib, CSV, JSON, NDJSON, Manifests

Chapter 4

Files, Paths & Payload Ingestion

Read files and payloads safely with pathlib, CSV, JSON, NDJSON, encodings, manifests, and malformed-row handling.

None, datetime, Time zones, Decimal, Typing

Chapter 5

Types, Time, Nulls & Numeric Correctness

Handle None, datetime values, time zones, Decimal math, rounding, optional fields, and type hints deliberately.
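To make "deliberately" concrete, here is a minimal sketch of parsing an optional amount field with Decimal and an explicit rounding mode. The helper name `parse_amount`, the two-decimal precision, and the choice of ROUND_HALF_UP are assumptions for illustration, not rules stated by the chapter.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional

# Assumed precision: two decimal places (for example, currency cents).
CENT = Decimal("0.01")


def parse_amount(raw: Optional[str]) -> Optional[Decimal]:
    """Return a rounded Decimal, or None when the field is missing or blank.

    A missing value stays None rather than becoming 0, so "unknown" and
    "zero" remain distinguishable downstream.
    """
    if raw is None or raw.strip() == "":
        return None
    # Build the Decimal from the string, never from a float, so no binary
    # floating-point error is introduced before rounding.
    return Decimal(raw.strip()).quantize(CENT, rounding=ROUND_HALF_UP)


amount = parse_amount("2.675")
missing = parse_amount(None)
zero = parse_amount("0")
```

Here `amount` is `Decimal("2.68")`, whereas `round(2.675, 2)` on a float yields `2.67` because 2.675 has no exact binary representation. Checking `raw is None` explicitly, rather than relying on truthiness, keeps the intent visible to the next reader.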

Learning level

Intermediate

Chapter 6

Functions, Modules & Configuration

Structure transform code with small functions, modules, config objects, environment rules, and dependency boundaries.

Functions, Modules, Config, Environments, Boundaries

Iterators, yield, itertools, Streaming, Memory

Chapter 7

Iterators, Generators & Streaming Data

Process data that does not fit in memory using lazy evaluation, yield, itertools, and streaming file pipelines.
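The idea can be sketched as a two-stage lazy pipeline: a generator that yields cleaned lines one at a time, and a batching helper built on `itertools.islice`. The function names `read_lines` and `batched` and the in-memory `io.StringIO` source are assumptions for illustration; a real pipeline would pass an open file handle instead.

```python
import io
from itertools import islice


def read_lines(handle):
    """Yield non-empty, stripped lines one at a time instead of loading all."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


def batched(iterable, size):
    """Yield lists of up to `size` items, pulling lazily from the source."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# StringIO stands in for a large file; only one batch is held at a time.
source = io.StringIO("a\nb\n\nc\nd\ne\n")
batches = list(batched(read_lines(source), 2))
```

Collecting the batches into a list is done here only to inspect the result. In a streaming job each batch would be processed and discarded inside the loop, so peak memory stays proportional to the batch size.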

Batch, Joins, Aggregates, Scans, Windows

Chapter 8

Pure Python Batch Transform Patterns

Use map/filter/reduce-style flows, in-memory joins, aggregations, stateful scans, and sorted-window logic.

Parquet, Avro, Columnar, Compression, Schemas

Chapter 9

Serialization & File Formats

Move data between CSV, JSON, Parquet, Avro, and ORC: row vs columnar, compression, and schema-aware reads/writes.

Validation, Schemas, Rejected rows, Warnings, Errors

Chapter 10

Validation, Contracts & Error Handling

Design required-field checks, schema validation, rejected-row reports, warnings, exceptions, and contract evidence.

Learning level

Advanced

Chapter 11

APIs, Databases & External Boundaries

Pull from APIs and databases with pagination, retries, rate limits, secrets boundaries, and idempotent source reads.

APIs, Pagination, Retries, Databases, Secrets

Threads, Processes, asyncio, GIL, Profiling

Chapter 12

Concurrency, Parallelism & Performance

Choose threads, processes, or asyncio with the GIL in mind, then profile and tune before reaching for a cluster.

pytest, Fixtures, Golden outputs, Logging, Metrics

Chapter 13

Testing, Debugging, Logging & Operability

Use pytest-style thinking, fixtures, golden outputs, logging, metrics, and traceable failures for production confidence.

Cursors, Watermarks, Replay, Checkpoints, Tasks

Chapter 14

Incremental, Idempotent & Orchestrated Pipelines

Build cursor, watermark, replay-window, checkpoint, manifest, and task-boundary habits for rerunnable jobs.