Home

Sunday, December 7, 2025

Incremental Load in ETL — CDC, Checksum & Timestamp Explained with Interview Questions

Incremental Load in ETL — CDC, Checksum & Timestamp Explained with Interview Questions

📌 Introduction

Running a full load for large datasets becomes slow, expensive, and inefficient. That’s why Incremental Load is used — it loads only the data that has changed since the last load.

Three widely used Incremental Load techniques:

  • Change Data Capture (CDC)
  • Checksum / Hash Comparison
  • Timestamp / Watermark

🔹 What is Incremental Load?

Incremental load means loading only new and updated records.

💡 Why Incremental Load is Important?

  • Avoids full table scan
  • Reduces time and cost
  • Improves ETL efficiency
  • Less impact on source systems

1️⃣ Change Data Capture (CDC)

CDC captures Inserts, Updates & Deletes from the transaction log and stores it in change tables.

Advantages:

  • Tracks deletes automatically
  • High performance
  • Complete change history

Limitations:

  • Must be supported on source DB
  • Extra storage required

2️⃣ Checksum / Hash Based Method

Generates a hash value from important columns. If hash changes → row updated.

Advantages:

  • Works without timestamp or CDC
  • Detects updated rows effectively

Limitations:

  • Hash calculation overhead
  • Delete detection needs key comparison

3️⃣ Timestamp / Watermark

Loads rows where ModifiedDate > LastWatermark.

Advantages:

  • Very simple and common approach
  • Best for small to medium datasets

Limitations:

  • Does not detect deletes
  • Late-arriving data issue

🔄 Quick Comparison

FeatureCDCChecksumTimestamp
Detect Deletes✔ Yes⚠ Partial❌ No
Implementation EffortMediumMediumEasy
Source DependencyCDC RequiredNoTimestamp Required

💬 Interview Questions

Q1: Why do we prefer Incremental Load?
A: Less data movement → better performance & lower cost.

Q2: Which method captures deletes?
A: CDC.

Q3: Timestamp method disadvantages?
A: No delete tracking & late-arriving data problems.

Q4: What if timestamp is not updated during changes?
A: Use Checksum or CDC.

📌 Summary

  • CDC → Best for full accuracy and delete tracking
  • Checksum → Best when timestamp not reliable
  • Timestamp → Best for simplicity

👉 Comment which method your project uses!

No comments: