Skip to content

Conversation

@sgomezvillamor
Copy link
Contributor

@sgomezvillamor sgomezvillamor commented Nov 10, 2025

Context

This PR removes all Pydantic v1 compatibility code and completes the migration to Pydantic v2. This is now safe because:

  • All Python projects in the repo have pydantic>=2.0 as a dependency
    • metadata-ingestion: pydantic>=2.4.0,<3.0.0
    • datahub-actions: pydantic>=2.0.0,<3.0.0
    • smoke-test: uses metadata-ingestion (Pydantic v2)
    • Note: The codebase maintains compatibility with Pydantic 2.0+ (some CI environments use older constraint files with Pydantic 2.0-2.3)
  • The codebase no longer needs to support Pydantic v1 users
  • Pydantic v2 has been stable since June 2023 (over a year and a half ago)
  • This simplifies the codebase by removing 254 lines of compatibility/branching code

What Changed

Deleted compatibility layer files:

  • metadata-ingestion/src/datahub/configuration/pydantic_migration_helpers.py (provided v1_* aliases)
  • metadata-ingestion/src/datahub/pydantic/compat.py (version detection utilities)

Migrated to native Pydantic v2 APIs (38 files):

  • v1_ConfigModelBaseModel
  • v1_FieldField
  • v1_validator@field_validator
  • v1_root_validator@model_validator
  • Extra.forbidmodel_config = {"extra": "forbid"}
  • __fields__model_fields
  • parse_obj()model_validate()
  • _init_private_attributesmodel_post_init

Removed all version checks:

  • Deleted conditional code for PYDANTIC_VERSION, PYDANTIC_VERSION_2, PYDANTIC_SUPPORTS_CALLABLE_DISCRIMINATOR
  • Kept only Pydantic v2 code paths

Updated discriminated unions:

  • Leverages Pydantic v2's smart union matching for assertion types and search filters
  • Avoids Discriminator/Tag (Pydantic 2.4+) to maintain compatibility with Pydantic 2.0-2.3 used in some Airflow constraint files
  • Added comments explaining where smart union matching is used

Updated configurations:

  • Removed pydantic.v1.mypy from mypy plugins
  • Removed Pydantic v1 deprecation warnings from pytest config

Breaking Changes

None for users - all projects already require Pydantic v2. This only affects:

  • Internal development environments still on Pydantic v1 (shouldn't exist given dependencies)
  • Any external code importing private compatibility utilities (were never meant to be public APIs)

Testing

  • All 4109 unit tests pass (8 skipped)
  • All mypy type checks pass (1651 source files)
  • All linting checks pass (ruff format + ruff check)
  • ✅ Updated test fixtures for Pydantic v2 error message formats
  • ✅ Updated golden files for test outputs
  • Compatible with Pydantic 2.0+ (tested in CI with various Airflow constraint files)

Files Changed

38 files changed: +1132 insertions, -1386 deletions (net reduction of 254 lines)

Key areas:

  • Assertion API models (8 concrete assertion types)
  • Data contract models
  • Configuration models
  • Search filters (discriminated unions)
  • Test files and fixtures

@alwaysmeticulous
Copy link

alwaysmeticulous bot commented Nov 10, 2025

🔴 Meticulous spotted visual differences in 2 of 939 screens tested: view and approve differences detected.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit 30532e7. This comment will update as new commits are pushed.


class InOperator(v1_ConfigModel):
class InOperator(BaseModel):
model_config = {"extra": "forbid"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason we are not using a v2 ConfigModel where we set extra forbiddens by default, and all the models inherit it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really good catch
addressed in commit 3a360ce

model_config = {"extra": "forbid"}

type: Literal["count", "percentage"] = Field(default="count")
value: int = Field(default=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it is not part of this pr, but at line 42 the comparison is not correct ->
It is comparing self.type (which is a string at runtime) to Literal["count"] (which is a type annotation, not a value). This comparison will always be False.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

)


class SqlMetricAssertion(BaseEntityAssertion):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we explicity specify it implements BaseAssertionProtocol which makes explicit to implement get_id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is BaseAssertionProtocol is not being used at all in the model hierarchy. Not before the PR and so not now.

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Nov 11, 2025
- Migrate ConfigModel from Pydantic v1 style (class Config) to v2 style (model_config = ConfigDict)
- Replace 35 instances of manual 'model_config = {"extra": "forbid"}' with ConfigModel inheritance across 9 files
- Simplify Filter union in search_filters.py by removing TYPE_CHECKING conditional
- Fix redshift.py to use parse_obj_allow_extras() instead of direct Config.extra manipulation

This eliminates code duplication and centralizes Pydantic configuration policy.
@codecov
Copy link

codecov bot commented Nov 11, 2025

Bundle Report

Changes will increase total bundle size by 31 bytes (0.0%) ⬆️. This is within the configured threshold ✅

Detailed changes
Bundle name Size Change
datahub-react-web-esm 28.63MB 31 bytes (0.0%) ⬆️

Affected Assets, Files, and Routes:

view changes for bundle: datahub-react-web-esm

Assets Changed:

Asset Name Size Change Total Size Change (%)
assets/index-*.js 31 bytes 19.0MB 0.0%

This commit completes the migration of all Pydantic v1 legacy syntax to v2 across
the entire DataHub Python codebase.

**Configuration Migration (18 instances):**
- Migrate `class Config:` → `model_config = ConfigDict(...)`
- Updated ConfigModel, PermissiveConfigModel, ConnectionModel base classes
- Migrated 13 additional model classes across multiple files

**Method Migrations (76 instances):**
- `.parse_obj()` → `.model_validate()` (38 instances)
- `.parse_raw()` → `.model_validate_json()` (2 instances)
- `.dict()` → `.model_dump()` (27 instances)
- `.json()` → `.model_dump_json()` (4 instances)
- `.update_forward_refs()` → `.model_rebuild()` (3 instances)
- `.copy()` and `.schema()` - all false positives (dicts/lists/HTTP responses)

**Scope of Changes:**
- metadata-ingestion/src: 50 instances
- metadata-ingestion/tests: 30 instances
- datahub-actions: 12 instances
- smoke-test: 2 instances

**Total: 94 Pydantic v1 calls migrated**

**Key Files Updated:**
- common.py: Base ConfigModel classes
- Multiple source files: sql_queries.py, datahub_classifier.py, schema_assertion.py, etc.
- Multiple CLI files: structuredproperties_cli.py, forms_cli.py, dataset_cli.py, etc.
- Test files: RDS IAM tests, Unity Catalog tests, assertion tests, etc.
- datahub-actions: propagation_action.py, filter tests, consumer offsets
- smoke-test: stateful ingestion tests

All Pydantic v2 deprecation warnings have been resolved. The codebase is now
fully compliant with Pydantic v2 with no remaining v1 syntax.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge smoke_test Contains changes related to smoke tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants