Skip to content

Conversation

@treff7es
Copy link
Contributor

Summary

Fixes DataFlow entities (e.g., Fivetran pipelines) appearing in a "Default" folder in DataHub's browse tree by implementing proper BrowsePathsV2 support in the DataFlow SDK class.

Problem

It was reported in Fivetran pipeline entities were appearing in a "Default" folder in the browse tree, while their individual DataJob tasks appeared in the correct locations. This created confusion and duplicate-like entries in the UI.

Root Cause:

  • The DataFlow SDK class did not set a BrowsePathsV2 aspect during initialization
  • Without this aspect, DataHub's backend defaults entities to a "Default" folder
  • In contrast, DataJob entities properly inherit and extend their parent DataFlow's browse path via _set_browse_path_from_flow()

Changes

Core Implementation

src/datahub/sdk/dataflow.py:

  • Added _set_default_browse_path() method to set browse paths based on platform instance
  • Modified __init__ to call _set_default_browse_path(platform_instance) during initialization
  • Added DataPlatformInstanceUrn import

Behavior:

  • DataFlows with platform_instance set will appear under that platform instance in browse tree (e.g., /prod_instance/)
  • DataFlows without platform_instance will have an empty browse path (not "Default")
  • DataJobs will have hierarchical paths: [platform_instance, dataflow_urn] when platform_instance exists

Testing

tests/unit/sdk_v2/test_dataflow.py:

  • Added test_dataflow_browse_path_with_platform_instance() - verifies browse path is set correctly with platform_instance
  • Added test_dataflow_browse_path_without_platform_instance() - verifies empty browse path without platform_instance

tests/unit/sdk_v2/test_datajob.py:

  • Updated test_datajob_complex() assertion to match new hierarchical browse path behavior

Golden files updated:

  • All DataFlow and DataJob golden files updated to include new browsePathsV2 aspects

Impact

Affected Sources

  • Fivetran: Primary beneficiary - pipelines will now appear under platform instance instead of "Default"
  • Any future source using DataFlow SDK: Will automatically get proper browse paths

User Experience

  • Before: Fivetran pipelines appeared in /Default/ folder
  • After: Fivetran pipelines appear in /[platform_instance]/ (when platform_instance is configured)
  • DataJobs maintain correct hierarchical structure within their DataFlow

Backward Compatibility

  • ✅ Existing DataFlow entities without platform_instance will have empty browse paths (acceptable)
  • ✅ All existing tests pass with updated golden files
  • ✅ No breaking changes to DataFlow or DataJob APIs

Testing

  • ✅ All 128 SDK v2 unit tests pass
  • ✅ Linting passes (./gradlew lintFix)
  • ✅ New tests verify browse path behavior with/without platform_instance
  • ✅ Golden files updated and validated

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Nov 11, 2025
@codecov
Copy link

codecov bot commented Nov 11, 2025

❌ 2 Tests Failed:

Tests completed Failed Passed Skipped
9402 2 9400 58
View the top 2 failed test(s) by shortest run time
tests.integration.fivetran.test_fivetran::test_fivetran_with_snowflake_dest
Stack Traces | 0.266s run time
Metadata files differ (use `pytest --update-golden-files` to update):
Urn changed, urn:li:dataFlow:(fivetran,my_confluent_cloud_connector_id,PROD):
<browsePathsV2> added

Urn changed, urn:li:dataFlow:(fivetran,calendar_elected,PROD):
<browsePathsV2> added
tests.integration.fivetran.test_fivetran::test_fivetran_with_snowflake_dest_and_null_connector_user
Stack Traces | 0.268s run time
Metadata files differ (use `pytest --update-golden-files` to update):
Urn changed, urn:li:dataFlow:(fivetran,my-fivetran.my_confluent_cloud_connector_id,PROD):
<browsePathsV2> added

Urn changed, urn:li:dataFlow:(fivetran,my-fivetran.calendar_elected,PROD):
<browsePathsV2> added

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@alwaysmeticulous
Copy link

alwaysmeticulous bot commented Nov 11, 2025

🔴 Meticulous spotted visual differences in 2 of 939 screens tested: view and approve differences detected.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit a2c3885. This comment will update as new commits are pushed.

Copy link
Contributor

@sgomezvillamor sgomezvillamor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM in the context of SDK

However, if this issue was found during Fivetran ingestion, shouldn't that be automagically fixed by the auto browse path processor?

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Nov 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants