Most pipeline failures aren’t caused by a single bug — they’re caused by missing guardrails: unclear ownership, no backfill strategy, silent quality regressions, or fragile schedules. This checklist is a practical way to review a data pipeline (batch or streaming) before promoting it to production.
1) Define the contract and the owners
- Owner: there is a named on-call/rotation for incidents.
- Consumers: downstream dashboards/models are documented.
- Schema: expected fields + types are declared (and versioned).
- SLOs: freshness and completeness expectations are written (e.g., ‘99% of runs finish within 30 minutes of their scheduled start’).
Tip: If you can’t state who owns a dataset and what ‘good’ looks like, you can’t operate it reliably.
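One way to make the contract concrete is to keep it as a small, versioned record next to the pipeline code. The sketch below is only an illustration: the `DataContract` class, its field names, the owner alias, and the column types are all assumptions, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract record; every field name here is illustrative."""
    dataset: str
    owner: str                    # on-call rotation or team alias
    schema_version: str
    fields: dict                  # column name -> expected Python type
    consumers: tuple = ()         # downstream dashboards/models
    freshness_minutes: int = 30   # freshness SLO for the dataset


def validate_row(contract, row):
    """Return a list of human-readable violations for one row (empty if valid)."""
    problems = []
    for name, expected_type in contract.fields.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(row[name]).__name__}"
            )
    return problems


# Example contract; the owner and consumer names are made up.
orders_contract = DataContract(
    dataset="mart.orders",
    owner="data-platform-oncall",
    schema_version="1.0.0",
    fields={"order_id": int, "customer_id": int, "amount": float},
    consumers=("revenue_dashboard",),
)
```

Calling `validate_row(orders_contract, row)` on a sample of incoming rows gives a cheap first check that the declared schema and the actual data still agree.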
2) Reliability and idempotency
Production pipelines must tolerate retries without duplicating records or corrupting state. This means deterministic keys, safe upserts, and clear run boundaries.
- Idempotent writes (merge/upsert) or dedupe strategy is implemented.
- Retries are enabled with exponential backoff.
- Failure modes are categorized (transient vs. permanent).
- Timeouts are set for external calls (APIs, warehouses, object storage).
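```sql
-- Example: idempotent load with merge.
-- Re-running the same batch updates existing rows instead of duplicating them.
-- UPDATE SET * / INSERT * is Delta Lake-style shorthand; other engines may
-- require explicit column lists.
MERGE INTO mart.orders t
USING staging.orders s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```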
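The retry and failure-category items in this section can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for your orchestrator's retry policy: the `TransientError` and `PermanentError` classes and the `run_with_retries` helper are hypothetical names, and timeouts are assumed to be set inside `task` by whichever client library it calls.

```python
import random
import time


class TransientError(Exception):
    """Worth retrying: timeouts, throttling, brief network failures."""


class PermanentError(Exception):
    """Not worth retrying: bad credentials, malformed input, missing table."""


def run_with_retries(task, max_attempts=5, base_delay=1.0, max_delay=60.0,
                     sleep=time.sleep):
    """Call task() until it succeeds, retrying only on TransientError.

    The delay ceiling doubles on each attempt (capped at max_delay) and the
    actual wait is drawn uniformly below that ceiling ("full jitter").
    PermanentError and any other exception propagate immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last transient failure
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Retrying like this is only safe when `task` is idempotent, which is why it pairs with a merge/upsert write rather than a plain append.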
3) Scheduling, backfills, and reprocessing
Backfills are where pipelines go to die if you don’t plan for them. Make reprocessing a first-class feature — not a manual emergency procedure.
- Run window is explicit (e.g., process partitions by day/hour).
- Backfill supports ‘start date → end date’ parameters.
- Data is partitioned to enable efficient reprocessing.
- Late arriving data handling is defined (watermarks / lookback windows).
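To make the run-window and backfill items concrete, here is a small sketch that assumes daily partitions. The function names and the two-day default lookback are assumptions chosen for illustration; the right lookback depends on how late your data actually arrives.

```python
from datetime import date, timedelta


def daily_partitions(start, end):
    """Return every daily partition from start to end, inclusive."""
    if end < start:
        raise ValueError("end date must not precede start date")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def partitions_for_run(run_date, lookback_days=2):
    """Partitions a scheduled run should (re)process to pick up late data."""
    return daily_partitions(run_date - timedelta(days=lookback_days), run_date)
```

A backfill is then the same job called with `daily_partitions(start, end)`, while a regular run uses `partitions_for_run(run_date)`. Both paths share one code path, so the backfill gets exercised every day instead of only during an incident.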
4) Data quality checks
Quality regressions should fail fast or at least alert loudly. Minimum checks: row counts, null constraints, uniqueness, and accepted value ranges for key fields.
- Uniqueness constraint for primary keys.
- Not-null checks for critical fields.
- Volume anomaly detection (e.g., +/− 30% vs baseline).
- Schema drift detection with explicit approval path.
```yaml
# Example: quality checks (conceptual)
checks:
  - name: orders_primary_key
    type: unique
    column: order_id
  - name: orders_customer_id_not_null
    type: not_null
    column: customer_id
  - name: orders_daily_volume
    type: volume_anomaly
    threshold_pct: 30
```
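For a sense of what the three conceptual checks might do under the hood, here is a plain-Python sketch over rows represented as dicts. The function names are hypothetical, and a real deployment would more likely push these checks into the warehouse or a testing tool than loop over rows in memory.

```python
def check_unique(rows, column):
    """Return the sorted values that appear more than once in `column`."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row[column]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return sorted(duplicates)


def check_not_null(rows, column):
    """Return the indexes of rows where `column` is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def check_volume(row_count, baseline, threshold_pct=30):
    """True if row_count is within +/- threshold_pct of a positive baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    # Multiply before dividing to keep the comparison exact for whole numbers.
    deviation_pct = abs(row_count - baseline) * 100 / baseline
    return deviation_pct <= threshold_pct
```

Each function returns the offending values (or a boolean) rather than raising, so the caller can decide whether a given failure should block the run or only raise an alert.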
5) Observability: metrics, logs, traces
When something breaks at 2 a.m., you need immediate answers: what failed, where, why, and what changed. Observability is the difference between a 5‑minute fix and a 5‑hour outage.
- Each run emits: start/end time, status, processed rows, error category.
- Dashboards exist for freshness, volume, and failure rate.
- Alerts are actionable (include run id, partition, and last successful run).
- Run metadata is retained long enough for audits (e.g., 30–90 days).
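The per-run metadata listed above could be emitted as one structured log line per run. The sketch below is an assumption about shape rather than a standard: the key names, the `run_record` helper, and the 30-minute default in `is_stale` are illustrative.

```python
import json
from datetime import datetime, timedelta, timezone


def run_record(run_id, partition, started_at, ended_at, status,
               processed_rows, error_category=None):
    """Serialize one run's metadata as a single JSON log line."""
    return json.dumps({
        "run_id": run_id,
        "partition": partition,
        "started_at": started_at.isoformat(),
        "ended_at": ended_at.isoformat(),
        "duration_s": (ended_at - started_at).total_seconds(),
        "status": status,
        "processed_rows": processed_rows,
        "error_category": error_category,
    }, sort_keys=True)


def is_stale(last_success_at, now, max_age=timedelta(minutes=30)):
    """True if the last successful run is older than the freshness target."""
    return now - last_success_at > max_age
```

Because every record carries the run id and partition, an alert built on these lines can point straight at the failing run instead of sending someone to search the logs.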
6) Security and access controls
Security doesn’t have to slow you down — but missing access boundaries will. Bake in least privilege, secrets management, and PII controls.
- Secrets are stored in a vault/secret manager (not env files committed to git).
- Service principals have least privilege roles.
- PII classification is documented and enforced (masking/tokenization).
- Audit logs exist for sensitive dataset access.
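As a rough illustration of the masking/tokenization item, the sketch below uses Python's standard `hmac` and `hashlib` modules. The helper names are hypothetical, and the key is assumed to come from your secret manager at runtime rather than from source code.

```python
import hashlib
import hmac


def tokenize(value, key):
    """Deterministic keyed token (HMAC-SHA256) for a PII value.

    The same value and key always produce the same token, so joins on the
    tokenized column still work. `key` must be bytes.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "***"  # not a recognizable address: mask it entirely
    return f"{local[0]}***@{domain}"
```

Tokenization suits columns that downstream consumers need to join or count on, while masking suits values that only need to stay recognizable to a human reading a report.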
7) Deployment and change management
- CI runs tests (unit + integration) before merge.
- Deployments are repeatable (IaC) and environment-specific.
- Rollback plan exists (previous artifact + data correction steps).
- Breaking changes require version bump and consumer notification.
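The version-bump rule can be enforced in CI with a small schema comparison. This sketch assumes schemas are plain dicts of column name to type name and that versions follow a MAJOR.MINOR.PATCH pattern; both are assumptions, and the function names are hypothetical.

```python
def breaking_changes(old_schema, new_schema):
    """List changes that would break consumers: removed or retyped columns."""
    changes = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            changes.append(f"removed: {column}")
        elif new_schema[column] != old_type:
            changes.append(f"retyped: {column} {old_type} -> {new_schema[column]}")
    return changes


def major(version):
    """Extract the major component from a 'MAJOR.MINOR.PATCH' string."""
    return int(version.split(".")[0])


def change_is_allowed(old_schema, new_schema, old_version, new_version):
    """Breaking changes pass only when the major version was bumped."""
    if not breaking_changes(old_schema, new_schema):
        return True
    return major(new_version) > major(old_version)
```

Added columns are treated as non-breaking here. Teams whose consumers select every column or rely on column order may want to treat additions as breaking too.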
“A pipeline is production-ready when you can change it safely — not when it ‘works on your laptop’.”
Closing: make the checklist executable
The biggest upgrade you can make is turning this checklist into automation: tests, monitors, and guardrails that run every deploy. If you’d like, we can convert these checks into a repeatable framework (dbt tests, orchestration policies, data contracts, and alerting) tailored to your stack.
