feat(ingestion): Add secret masking framework #15188
Conversation
kyungsoo-datahub commented on Nov 3, 2025
- Thread-safe SecretRegistry with copy-on-write pattern (see the sketch after this list)
- Logging layer masking filter
- Bootstrap initialization for different contexts
- CLI integration
- Unit and integration tests
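As a rough illustration of the first two bullets, here is a minimal sketch of a copy-on-write registry feeding a logging-layer filter. This is a hedged sketch, not the PR's actual code: the class names mirror the description, but the register()/mask() methods and the "********" placeholder are assumptions.

    import logging
    import threading


    class SecretRegistry:
        """Thread-safe registry of secret values using copy-on-write.

        Writers take a lock and swap in a fresh frozenset; readers just
        grab the current reference, so the hot masking path is lock-free.
        """

        def __init__(self) -> None:
            self._lock = threading.Lock()
            self._secrets: frozenset = frozenset()

        def register(self, value: str) -> None:
            if not value:
                return
            with self._lock:
                # Copy-on-write: build a new set, then swap the reference.
                self._secrets = self._secrets | {value}

        def mask(self, text: str) -> str:
            for secret in self._secrets:  # lock-free snapshot read
                text = text.replace(secret, "********")
            return text


    class MaskingFilter(logging.Filter):
        """Logging-layer filter that masks registered secrets in records."""

        def __init__(self, registry: SecretRegistry) -> None:
            super().__init__()
            self._registry = registry

        def filter(self, record: logging.LogRecord) -> bool:
            # Format once, mask, and clear args so handlers don't re-format.
            record.msg = self._registry.mask(record.getMessage())
            record.args = None
            return True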
| """Ingest metadata into DataHub.""" | ||
|
|
||
| # Initialize secret masking (before any logging) | ||
| initialize_secret_masking() |
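For orientation, a plausible sketch of what a bootstrap function like initialize_secret_masking() could do, reusing the SecretRegistry and MaskingFilter sketch above; the actual implementation in this PR may differ.

    import logging

    _REGISTRY = SecretRegistry()  # assumed shared, process-wide instance


    def initialize_secret_masking() -> None:
        # Filters attached to a logger only apply to records logged directly
        # on it, so install on each root handler to also catch records
        # propagated from child loggers.
        for handler in logging.getLogger().handlers:
            if not any(isinstance(f, MaskingFilter) for f in handler.filters):
                handler.addFilter(MaskingFilter(_REGISTRY))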
Does this mean that masking only happens in the datahub ingest command?
Instead, shouldn't we make it the default for all output when using the Python SDK?
Some downstream libraries got stuck when I broadened the scope of this change. We can broaden the scope in the next phase.
    class ExecutionContext(Enum):
        """Execution context for DataHub ingestion."""

        CLI = "cli"
        UI_BACKEND = "ui_backend"
        REMOTE_EXECUTOR = "remote"
        SCHEDULED = "scheduled"
        UNKNOWN = "unknown"
I understand different contexts will have different sources of secrets. So far, the CLI masks env vars, the remote executor will mask secrets from the secret store, and so on.
I'm wondering whether being aware of the context is the responsibility of the masking component, or whether the target component should instead be responsible for registering its secrets.
Basically, I would like to reduce the coupling of this masking component; ideally, it should depend only on logging.
Thank you for the comment. I removed contexts and sources as we discussed.
        uninstall_masking_filter,
    )
    from datahub.ingestion.masking.secret_registry import SecretRegistry
We could have some mechanism (an env var) to disable masking.
I added the flag to disable this.
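A minimal sketch of such a kill switch, assuming an env var named DATAHUB_DISABLE_SECRET_MASKING (the actual flag name in the PR may differ; a later comment refers to an is_masking_enabled helper):

    import os


    def is_masking_enabled() -> bool:
        # DATAHUB_DISABLE_SECRET_MASKING is an assumed flag name for this sketch.
        return os.environ.get("DATAHUB_DISABLE_SECRET_MASKING", "").lower() not in (
            "1",
            "true",
        )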
    # Detect UI backend
    if os.environ.get("DATAHUB_UI_INGESTION") == "1":
        return ExecutionContext.UI_BACKEND
UI_BACKEND, SCHEDULED: what are these contexts? Do both refer to ingestion from the managed ingestion UI?
Managed ingestion can be scheduled or not; is that relevant for the masking?
Ingestion can be triggered from the managed UI or from the CLI; shouldn't we also cover the latter?
Removed. Thanks
    # Detect remote executor
    if os.environ.get("DATAHUB_EXECUTOR_ID"):
        return ExecutionContext.REMOTE_EXECUTOR
Is it relevant for the masking whether the executor is remote or embedded?
I removed it following our discussion.
    Args:
        context: Execution context (auto-detected if None)
        secret_sources: List of sources to load (auto-detected if None)
        max_message_size: Maximum log message size before truncation
Truncating logs... that's sort of a new feature; is it required?
After the revision, messages are truncated only when log masking is enabled. FYI, in debug mode all debug logs are enabled and log masking is disabled.
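A sketch of what masking-gated truncation could look like; max_message_size matches the docstring parameter above, while the helper name and suffix are assumptions (is_masking_enabled as sketched earlier):

    def _truncate(message: str, max_message_size: int) -> str:
        # Truncate only while masking is active; debug mode disables masking
        # and therefore also truncation.
        if not is_masking_enabled() or len(message) <= max_message_size:
            return message
        return message[:max_message_size] + "... [truncated]"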
    # Disable HTTP debug output (prevent deadlock)
    try:
        import http.client

        http.client.HTTPConnection.debuglevel = 0
    except Exception:
        pass

    # Set HTTP-related loggers to INFO (not DEBUG)
    for logger_name in [
        "urllib3",
        "urllib3.connectionpool",
        "urllib3.util.retry",
        "requests",
    ]:
        try:
            logging.getLogger(logger_name).setLevel(logging.INFO)
        except Exception:
            pass
Do we lose the ability to get debug logs from these 3rd-party libs?
After the revision, the 3rd-party libs' debug logs show up when log masking is disabled.
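In other words, the clamping above presumably becomes conditional, along these lines (a sketch reusing the is_masking_enabled helper from the earlier sketch):

    import logging

    if is_masking_enabled():
        # Clamp noisy third-party debug output only while masking is active;
        # with masking disabled (e.g. debug mode), leave their levels alone.
        for name in ["urllib3", "urllib3.connectionpool", "urllib3.util.retry", "requests"]:
            logging.getLogger(name).setLevel(logging.INFO)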
    elif source == "datahub_secrets_store":
        # TODO: Implement when backend API available
        logger.debug("DataHub secrets store not yet implemented")
    else:
What about SecretStr values from configs? Are we covering them?
Good point. I added them. Thanks.
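A hypothetical sketch of harvesting SecretStr values from a pydantic v2 config model; the helper name and recursive traversal are assumptions, not the PR's exact approach:

    from pydantic import BaseModel, SecretStr


    def register_config_secrets(model: BaseModel, registry: SecretRegistry) -> None:
        # Walk the model's fields and register every SecretStr value found.
        for field_name in type(model).model_fields:
            value = getattr(model, field_name, None)
            if isinstance(value, SecretStr):
                registry.register(value.get_secret_value())
            elif isinstance(value, BaseModel):
                register_config_secrets(value, registry)  # nested config sections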
    # Note: Never add variables containing secrets, passwords, keys, or tokens.
    # Examples to NEVER add: AWS_ACCESS_KEY_ID, DATABASE_PASSWORD, API_KEY,
    # SECRET_KEY. If unsure, don't add it.
    _SYSTEM_ENV_VARS = {
We may rename this with a more descriptive name such as:
- DATAHUB_MASKING_ENV_VARS_ALLOWED
- DATAHUB_MASKING_ENV_VARS_SKIPPED
- ...
or something like that.
Additionally, we may want a mechanism to incorporate values without requiring a release:
- DATAHUB_MASKING_ENV_VARS_ALLOWED_PATTERN
so we can provide a regex pattern to skip additional env vars.
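A sketch of the suggested pattern mechanism; DATAHUB_MASKING_ENV_VARS_ALLOWED_PATTERN is the name proposed in this comment, not a confirmed flag:

    import os
    import re


    def should_mask_env_var(name: str, allowed: set) -> bool:
        # Skip masking for allowlisted env vars, by exact name or by pattern.
        if name in allowed:
            return False
        raw_pattern = os.environ.get("DATAHUB_MASKING_ENV_VARS_ALLOWED_PATTERN")
        if raw_pattern and re.fullmatch(raw_pattern, name):
            return False
        return True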
Thank you for the suggestion. Revised.
sgomezvillamor left a comment
I left some comments.
My main concern is the context handling, because it couples this masking component to e.g. the DataHub secret store and so on 🤔
    if PYDANTIC_VERSION_2:
        from pydantic import SecretStr, model_validator
    else:
        from pydantic import SecretStr, root_validator
I recently merged #15057, so there should be no more pydantic v1 root_validator or validator.
We should stop introducing pydantic v1 compatibility code and just assume pydantic v2. If we find some component in the codebase still supporting pydantic v1, masking will simply only work when pydantic v2 is present at runtime.
We need to move and push towards pydantic v2 only.
Thank you for the info. After rebasing onto master, the problem with pydantic v1 is gone, so I removed the compatibility code.
    try:
        from datahub.masking.secret_registry import SecretRegistry

        _MASKING_AVAILABLE = True
    except ImportError:
        _MASKING_AVAILABLE = False
In which scenario would this import fail?
IMO the optional dependency is not necessary.
As per my understanding:
- there should be no code dependent on import availability
- if we have some conditional code, it should depend on is_masking_enabled
Makes sense. I removed this. Thanks.
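Concretely, the agreed pattern is a plain import with behavior gated on runtime configuration; a sketch (get_instance() is an assumed accessor, and is_masking_enabled is per the earlier sketch):

    from datahub.masking.secret_registry import SecretRegistry  # no try/except


    def maybe_register_secret(value: str) -> None:
        # Conditional behavior hangs off runtime config, not import success.
        if is_masking_enabled():
            SecretRegistry.get_instance().register(value)  # get_instance() assumed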
Start services in background and poll health endpoints until ready before container startup completes. This ensures services are responding when Docker runs the immediate healthcheck (ignoring start_period).
- hex/datahub mock APIs: poll for 30s
- s3mock: wrap entrypoint, poll for 60s
- move test_hex_ingestion to Batch 7
Docker Compose interprets $VAR as Compose variables. Escape with $$ so shell variables are passed literally to the container shell. Fixes Batches 6, 7, 8 failures caused by variable interpolation.
Add detailed logging to debug container exit code 1 issues:
- set -x for shell command tracing
- Log every step: package install, file checks, server start
- Check if Python/Java processes are alive
- Show netstat output for port binding
- Log PID and process status
- Output all wget/health check results
- Add error messages for each failure point
This will help identify why containers exit with code 1.
Root cause: wget --spider uses HEAD requests for health checks, but mock servers only implemented do_GET(). This caused 404 responses, failing health checks and causing containers to exit with code 1. Added do_HEAD() methods to both mock servers to handle HEAD requests for /health endpoint used by container startup scripts and Docker healthchecks. This fixes all Batch 6 and 8 test failures.
The adobe/s3mock image has a special entrypoint that launches the Java application. Overriding it with /bin/sh breaks the container completely. Solution: Remove all custom startup logic and healthcheck for s3mock. Without a healthcheck, docker compose --wait will start the container and move on. The test's wait_until_responsive() will handle readiness. This fixes Batch 6 kafka-connect test failures.
Now that the root cause is fixed (missing HEAD method handlers), remove all the verbose debug logging from container startup scripts. Keep only the essential startup logic:
- Install wget
- Start Python server in background
- Poll health endpoint for 30 seconds
- Exit successfully when ready
Move all Batch 8 tests to Batch 6 and remove Batch 8 from CI:
- test_hex_ingestion_with_lineage
- test_kafka_connect_bigquery_sink_ingest
- test_kafka_connect_debezium_postgres
- test_kafka_connect_debezium_sqlserver
This consolidates the test batches, reducing CI jobs from 9 to 8.
- Move hex tests from batch_6 back to batch_2
- Move kafka-connect tests from batch_6 back to batch_1
This reverts the test rebundling from commit 16c3d3a that moved these tests to batch_6.
Revert workflow and build.gradle to only run batches 0-5 now that tests have been moved back to their original locations.
- Trino tests were moved from batch_1 to batch_7; move them back
- Remove remaining batch_7 markers from kafka-connect tests that were added after the initial rebundling
- Thread-safe SecretRegistry with copy-on-write pattern
- Logging layer masking filter
- Bootstrap initialization for different contexts
- CLI integration
- Unit and integration tests
- Move masking to upper level
- Remove context and source dependency
- Address issues
- Remove conditional Pydantic v1/v2 imports
- Remove try/except wrapper for SecretRegistry import
- Add test for __init__.py imports to cover module exports
- Add tests for is_bootstrapped() and get_bootstrap_error() functions
- Add test for initialization with masking disabled
- Add test for exception hook masking failure handling
These tests improve coverage of edge cases and error paths in the masking framework.