Incremental Load in ETL — CDC, Checksum & Timestamp Explained with Interview Questions
📌 Introduction
Running a full load for large datasets becomes slow, expensive, and inefficient. That’s why Incremental Load is used — it loads only the data that has changed since the last load.
Three widely used Incremental Load techniques:
- Change Data Capture (CDC)
- Checksum / Hash Comparison
- Timestamp / Watermark
🔹 What is Incremental Load?
Incremental load means loading only new and updated records.
💡 Why Incremental Load is Important?
- Avoids full table scan
- Reduces time and cost
- Improves ETL efficiency
- Less impact on source systems
1️⃣ Change Data Capture (CDC)
CDC captures Inserts, Updates & Deletes from the transaction log and stores it in change tables.
Advantages:
- Tracks deletes automatically
- High performance
- Complete change history
Limitations:
- Must be supported on source DB
- Extra storage required
2️⃣ Checksum / Hash Based Method
Generates a hash value from important columns. If hash changes → row updated.
Advantages:
- Works without timestamp or CDC
- Detects updated rows effectively
Limitations:
- Hash calculation overhead
- Delete detection needs key comparison
3️⃣ Timestamp / Watermark
Loads rows where ModifiedDate > LastWatermark.
Advantages:
- Very simple and common approach
- Best for small to medium datasets
Limitations:
- Does not detect deletes
- Late-arriving data issue
🔄 Quick Comparison
| Feature | CDC | Checksum | Timestamp |
|---|---|---|---|
| Detect Deletes | ✔ Yes | ⚠ Partial | ❌ No |
| Implementation Effort | Medium | Medium | Easy |
| Source Dependency | CDC Required | No | Timestamp Required |
💬 Interview Questions
Q1: Why do we prefer Incremental Load?
A: Less data movement → better performance & lower cost.
Q2: Which method captures deletes?
A: CDC.
Q3: Timestamp method disadvantages?
A: No delete tracking & late-arriving data problems.
Q4: What if timestamp is not updated during changes?
A: Use Checksum or CDC.
📌 Summary
- CDC → Best for full accuracy and delete tracking
- Checksum → Best when timestamp not reliable
- Timestamp → Best for simplicity
👉 Comment which method your project uses!
No comments:
Post a Comment