
@randypitcherii
Contributor

Description

Fixes #1176

This PR implements full support for the hard_deletes='new_record' configuration in snapshot materializations, resolving the issue where deletion records were being created with NULL values for all source columns.
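
For context, a snapshot opts into this behavior through the `hard_deletes` config. A minimal snapshot using it might look like the following (the snapshot, source, and column names are illustrative, not taken from this PR):

```sql
{% snapshot orders_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='id',
    strategy='timestamp',
    updated_at='updated_at',
    hard_deletes='new_record'
) }}

select id, name, updated_at from {{ source('app', 'orders') }}

{% endsnapshot %}
```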

Problem

Issue #1176 reported that when using hard_deletes: new_record, the deletion records for removed source records contained NULL values for all source columns (id, name, etc.) instead of preserving the actual values from the deleted records. This created malformed output that made it impossible to identify which records were deleted.

Root Cause

The dbt-core snapshot_staging_table macro generates a deletion_records CTE that needs to match columns between the source query and the target snapshot table. When building the list of existing snapshot columns (snapshotted_cols), it used get_columns_in_relation(), which on Databricks returns agate.Row tuples like ('col_name', 'data_type', 'comment').

The macro then tried to access a .name attribute on these tuples via get_list_of_column_names(); tuples have no such attribute, so the column matching logic failed silently and all source columns in the deletion records were set to NULL.

Solution

Created a Databricks-specific override of snapshot_staging_table that properly extracts column names from agate.Row tuples by accessing index [0] instead of the .name attribute. This ensures the deletion_records CTE can correctly match columns and preserve source values when inserting deletion tracking records.
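
A condensed sketch of the corrected extraction (variable names and call shapes are simplified, not the verbatim macro; per the commit message, the override also filters out Databricks metadata rows whose names start with '#'):

```sql
{% set snapshotted_cols = adapter.get_columns_in_relation(target_relation) %}
{% set snapshotted_col_names = [] %}
{% for row in snapshotted_cols %}
    {# each row behaves like ('col_name', 'data_type', 'comment'),
       so use index [0] rather than the nonexistent .name attribute #}
    {% if not row[0].startswith('#') %}
        {% do snapshotted_col_names.append(row[0]) %}
    {% endif %}
{% endfor %}
```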

Additionally, overrode build_snapshot_table to include the dbt_is_deleted column during initial snapshot creation when using hard_deletes='new_record' mode.
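
Sketched below is the shape of that override, following the structure of dbt-core's default build_snapshot_table macro (the strategy fields and the nullif idiom come from the default implementation; treat this as an outline rather than the exact macro text):

```sql
select *,
    {{ strategy.scd_id }} as dbt_scd_id,
    {{ strategy.updated_at }} as dbt_updated_at,
    {{ strategy.updated_at }} as dbt_valid_from,
    nullif({{ strategy.updated_at }}, {{ strategy.updated_at }}) as dbt_valid_to
    {%- if config.get('hard_deletes') == 'new_record' %},
    False as dbt_is_deleted  {# present from the start, so no later ALTER is needed #}
    {%- endif %}
from ({{ sql }}) sbq
```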

Changes

New Files

  1. dbt/include/databricks/macros/materializations/snapshot_helpers.sql (221 lines)

    • databricks__build_snapshot_table: Adds dbt_is_deleted column for new_record mode
    • databricks__snapshot_staging_table: Complete override to fix column name extraction from agate.Row tuples
  2. dbt/include/databricks/macros/materializations/snapshot_merge.sql (32 lines)

    • databricks__snapshot_merge_sql: Implements hard_deletes-aware MERGE logic
    • Supports all three modes: ignore (default), invalidate, new_record
  3. tests/functional/adapter/simple_snapshot/test_hard_deletes.py (298 lines)

    • Comprehensive functional tests for all three hard_deletes modes
    • Verifies correct behavior for ignore, invalidate, and new_record modes

Modified Files

  • .gitignore: Added exclusion for docs/plans/ directory

All Three Modes Now Working

hard_deletes='ignore' (default)

  • Deleted records remain unchanged in snapshot
  • dbt_valid_to stays NULL for records no longer in source
  • Maintains backward compatibility

hard_deletes='invalidate'

  • Deleted records are invalidated by setting dbt_valid_to timestamp
  • Uses Delta Lake's WHEN NOT MATCHED BY SOURCE clause
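
A simplified sketch of the hard_deletes-aware merge (the aliases and change-type predicates follow dbt's stock snapshot merge, and snapshot_get_time() is dbt's timestamp helper; the not-matched-by-source branch is the Databricks-specific addition, and this is a sketch rather than the macro's exact text):

```sql
merge into {{ target }} as DBT_INTERNAL_DEST
using {{ source }} as DBT_INTERNAL_SOURCE
on DBT_INTERNAL_SOURCE.dbt_scd_id = DBT_INTERNAL_DEST.dbt_scd_id

when matched
    and DBT_INTERNAL_DEST.dbt_valid_to is null
    and DBT_INTERNAL_SOURCE.dbt_change_type in ('update', 'delete')
    then update set dbt_valid_to = DBT_INTERNAL_SOURCE.dbt_valid_to

when not matched
    and DBT_INTERNAL_SOURCE.dbt_change_type = 'insert'
    then insert *  -- brings along every staging column, including dbt_is_deleted

{%- if config.get('hard_deletes') == 'invalidate' %}
when not matched by source
    and DBT_INTERNAL_DEST.dbt_valid_to is null
    then update set dbt_valid_to = {{ snapshot_get_time() }}
{%- endif %}
```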

hard_deletes='new_record' (this is the fix)

  • Original records are invalidated (dbt_valid_to set to deletion timestamp)
  • New deletion records inserted with dbt_is_deleted=True and actual source column values preserved
  • Provides complete audit trail showing exactly what was deleted
  • Resolves the malformed output issue reported in #1176 (SCD2 column check and timestamp check malformed output)
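
With the column extraction fixed, the deletion_records CTE in the staging table takes roughly this shape (id and name stand in for the real source columns, the CTE names follow dbt-core's staging conventions, and some metadata columns are trimmed for brevity):

```sql
deletion_records as (

    select
        'insert' as dbt_change_type,  -- the tombstone row enters via the merge's insert branch
        snapshotted_data.id,          -- source columns now carry the deleted row's
        snapshotted_data.name,        -- actual values rather than NULL
        {{ snapshot_get_time() }} as dbt_valid_from,
        snapshotted_data.dbt_scd_id,
        True as dbt_is_deleted

    from snapshotted_data
    left join deletes_source_data as source_data
        on snapshotted_data.dbt_unique_key = source_data.dbt_unique_key
    where source_data.dbt_unique_key is null

)
```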

Testing

Completed

  • ✅ All 3 functional tests passing (verified during implementation)
  • ✅ Code quality checks passing (ruff, ruff-format, mypy)
  • ✅ No regressions in existing snapshot functionality

Environment Tested

  • Unity Catalog cluster with Delta Lake
  • SQL Warehouse (UC)

Note on Test Execution

The functional tests were verified to pass during implementation. The current test environment has permission restrictions (403 Forbidden from the cloud storage provider), which is expected; as noted in CONTRIBUTING.md, the full test matrix will be run by maintainers on the staging branch.

Checklist

Breaking Changes

None - this is a bug fix that maintains full backward compatibility. The default behavior (hard_deletes='ignore') is unchanged.


Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

randypitcherii and others added 2 commits October 29, 2025 15:48
Fixes databricks#1176

Implements full support for hard_deletes configuration in snapshot materializations,
enabling users to track deleted source records with dedicated deletion records marked
by dbt_is_deleted=True.

The dbt-core snapshot_staging_table macro generates a deletion_records CTE that
relies on get_column_schema_from_query() for source columns, which returns proper
column schema objects with .name attributes. However, when building the list of
snapshotted_cols from the target table, it used get_columns_in_relation() which
returns agate.Row tuples like ('col_name', 'data_type', 'comment').

The deletion_records CTE tried to iterate these tuples using the .name
attribute (via get_list_of_column_names()), which doesn't exist on tuples.
This caused the column matching logic to fail silently, preventing deletion
records from being constructed with the correct columns from the snapshotted
table.

This resulted in deletion records being inserted with NULL values for all source
columns (id, name, etc.) instead of the actual values from the deleted records,
causing malformed output as reported in issue databricks#1176.

Created a databricks__snapshot_staging_table override that extracts column
names from agate.Row tuples by accessing index [0] instead of the .name
attribute. This ensures the deletion_records CTE receives correct column
lists for both source and target tables, allowing proper column matching
when inserting deletion records.

Additionally, overrode databricks__build_snapshot_table to include dbt_is_deleted
column in initial snapshot table creation when hard_deletes='new_record', ensuring
the column exists from the start and doesn't need to be added later.

**New file: dbt/include/databricks/macros/materializations/snapshot_helpers.sql**
- databricks__build_snapshot_table: Adds dbt_is_deleted column for new_record mode
- databricks__snapshot_staging_table: Complete override to fix column name extraction
  - Properly extracts column names from agate.Row tuples using index [0]
  - Filters out Databricks metadata rows (starting with '#')
  - Generates correct deletion_records CTE with proper column matching

**New file: dbt/include/databricks/macros/materializations/snapshot_merge.sql**
- databricks__snapshot_merge_sql: Implements hard_deletes-aware MERGE logic
- Supports 'invalidate' mode with WHEN NOT MATCHED BY SOURCE clause
- Uses 'insert *' pattern to include all staging table columns including dbt_is_deleted

**New file: tests/functional/adapter/simple_snapshot/test_hard_deletes.py**
- Comprehensive functional tests for all three hard_deletes modes
- TestHardDeleteIgnore: Verifies deleted records remain unchanged (default)
- TestHardDeleteInvalidate: Verifies dbt_valid_to is set for deleted records
- TestHardDeleteNewRecord: Verifies new deletion records with dbt_is_deleted=True

- All 3 functional tests passing (ignore, invalidate, new_record)
- Code quality checks passing (ruff, ruff-format, mypy)
- No regressions in existing snapshot functionality
- Verified with Databricks Delta Lake MERGE operations
- Tested against Unity Catalog cluster

Signed-off-by: Randy Pitcher <randypitcherii@gmail.com>

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Randy Pitcher <randypitcherii@gmail.com>
randypitcherii force-pushed the feature/1176-hard-delete-processing branch from 9259b77 to bc32f1a on October 29, 2025 at 20:11