Unity Catalog config for predictive optimization #1333
Conversation
📝 Walkthrough

The PR updates the Databricks Unity Catalog + Iceberg documentation in modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc. It adds a NOTE on required workspace configuration for predictive optimization/compaction (lazyClustering, backfillStats, autoConflictResolution) and recommends an explicit OPTIMIZE. It expands Unity Catalog permissions (ALL PRIVILEGES, EXTERNAL USE SCHEMA). It broadens cluster configuration guidance (rpk cluster config edit, REST catalog properties, secret handling, cloud login) with copyable examples and placeholders. It enhances SQL examples for catalog/table name parsing and shows result output. Placeholder handling distinguishes cloud vs. non-cloud and includes a Unity Catalog name placeholder.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant RP as Redpanda Cluster
    participant DB as Databricks Workspace
    participant UC as Unity Catalog
    U->>DB: Set workspace configs (lazyClustering, backfillStats, autoConflictResolution)
    Note over DB: Predictive optimization prerequisites configured
    U->>RP: rpk cluster config edit (iceberg_* properties, REST endpoint, auth, warehouse)
    U->>DB: Store secrets (service principal tokens) if needed
    U->>RP: Reference stored secrets in cluster properties
    U->>DB: Grant permissions in UC (ALL PRIVILEGES, EXTERNAL USE SCHEMA)
    DB->>UC: Apply grants
    U->>DB: Run SQL using fully-qualified names<br/>and optional OPTIMIZE command
    DB->>UC: Resolve catalog/schema/table
    DB->>RP: Read/Write Iceberg topic data via REST catalog
    DB-->>U: Query results / compaction outcome
```
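As a rough, non-authoritative sketch of the cluster properties this walkthrough refers to (only `iceberg_rest_catalog_endpoint` and `iceberg_rest_catalog_warehouse` are named elsewhere in this review; the placeholder values and the endpoint path are assumptions drawn from the review comments, not copied from the PR):

```yaml
# Hypothetical cluster-property fragment; placeholders, not values from the PR.
# Iceberg REST endpoint exposed by a Databricks workspace:
iceberg_rest_catalog_endpoint: https://<workspace-host>/api/2.1/unity-catalog/iceberg-rest
# Must be set to the Unity Catalog catalog name (the catalog that
# contains the redpanda schema):
iceberg_rest_catalog_warehouse: <unity-catalog-name>
```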
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 1

🧹 Nitpick comments (7)

modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc (7)

Lines 31-37: AsciiDoc code fence, not Markdown backticks

Even if you keep a code example here, switch from Markdown ```sql fences to AsciiDoc:

````diff
-```sql
-OPTIMIZE <table-name>;
-```
+[,sql]
+----
+OPTIMIZE <catalog-name>.redpanda.<table-name>;
+----
````
Lines 91-93: Permissions: good call-out on EXTERNAL USE SCHEMA; consider least-privilege example

- Including `EXTERNAL USE SCHEMA` is correct and necessary; it isn't implied by `ALL PRIVILEGES`. (docs.databricks.com)
- To promote least privilege, consider showing a minimal GRANT example (avoiding broad `ALL PRIVILEGES`) alongside the current text:

```sql
GRANT USE CATALOG ON CATALOG `<catalog-name>` TO `<sp>`;
GRANT USE SCHEMA ON SCHEMA `<catalog-name>`.redpanda TO `<sp>`;
GRANT SELECT ON ALL TABLES IN SCHEMA `<catalog-name>`.redpanda TO `<sp>`;
GRANT EXTERNAL USE SCHEMA ON SCHEMA `<catalog-name>`.redpanda TO `<sp>`;
```
Lines 100-118: Endpoint and OAuth settings look correct; add one clarifier

- The Iceberg REST endpoint and OAuth server URI match Databricks docs (`/api/2.1/unity-catalog/iceberg-rest` and `/oidc/v1/token` on the workspace host). (docs.databricks.com, learn.microsoft.com)
- Minor clarity improvement: add a comment that `iceberg_rest_catalog_warehouse` must be set to the Unity Catalog catalog name (you already explain later, but repeating inline near the sample reduces foot-guns).

```diff
 iceberg_rest_catalog_warehouse: <unity-catalog-name>
+# Set this to your Unity Catalog catalog name.
```
Lines 120-142: Cloud example is solid; optionally show post-apply verification

Consider adding a quick verification step after `rpk cluster config set` so users can confirm what was applied:

```bash
rpk cluster config get iceberg_rest_catalog_endpoint
rpk cluster config get iceberg_rest_catalog_warehouse
```
Lines 146-156: Placeholders are clear; tie 'warehouse' explicitly to catalog name

To eliminate ambiguity, explicitly state here that `iceberg_rest_catalog_warehouse` equals `<unity-catalog-name>` (the catalog where the `redpanda` schema appears). This echoes Databricks' "warehouse" parameter for REST catalog clients. (docs.databricks.com)
Lines 204-206: Query example: nice use of quoted identifiers; consider LIMIT

Small UX tweak to avoid returning large datasets by default:

```diff
-SELECT * FROM `<catalog-name>`.redpanda.`<table-name>`;
+SELECT * FROM `<catalog-name>`.redpanda.`<table-name>` LIMIT 10;
```
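Since the query examples lean on backtick-quoted identifiers, a small illustrative helper (not part of the PR; `quoted_fqn` is a hypothetical name) shows how a fully-qualified Spark SQL identifier can be assembled, escaping embedded backticks by doubling them as Spark SQL identifier quoting expects:

```python
def quoted_fqn(catalog: str, schema: str, table: str) -> str:
    """Build a fully-qualified, backtick-quoted Spark SQL identifier."""
    def quote(part: str) -> str:
        # A literal backtick inside a quoted identifier is escaped
        # by doubling it.
        return "`" + part.replace("`", "``") + "`"
    return ".".join(quote(p) for p in (catalog, schema, table))

print(quoted_fqn("my-catalog", "redpanda", "orders.v1"))
# -> `my-catalog`.`redpanda`.`orders.v1`
```

Quoting every part keeps names containing hyphens or dots (common in topic-derived table names) from breaking the three-part name resolution.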
Lines 210-219: Block role and language

This is a rendered result grid, not SQL. Consider `[source,text,role="no-copy no-wrap"]` to avoid confusing syntax highlighting:

```diff
-[,sql,role="no-copy no-wrap"]
+[,text,role="no-copy no-wrap"]
```
📒 Files selected for processing (1)

- modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc (1 hunks)
The added NOTE in the diff:

````diff
+[NOTE]
+====
+When you enable predictive optimization, you must also set the following configurations in your workspace. These configurations allow predictive optimization to automatically generate column statistics and carry out background compaction for Iceberg tables:
+
+```sql
+SET spark.databricks.delta.liquid.lazyClustering.backfillStats=true;
+SET spark.databricks.delta.computeStats.autoConflictResolution=true;
+
+-- Explicitly run predictive optimization on an existing table
+OPTIMIZE <table-name>;
+```
+====
````
🛠️ Refactor suggestion

Remove undocumented Spark configs; fix code-block format; recommend supported steps

- The two Spark configs shown (`spark.databricks.delta.liquid.lazyClustering.backfillStats` and `spark.databricks.delta.computeStats.autoConflictResolution`) are not publicly documented, and setting non-`spark.sql.*` properties via SQL `SET` in a notebook/SQL Editor is not supported. Databricks' guidance is that only `spark.sql.*` can be set from SQL; other Spark configs must be set at compute level. Including these here will mislead users and likely won't take effect. (kb.databricks.com)
- Predictive Optimization already collects statistics and runs maintenance (OPTIMIZE/VACUUM/ANALYZE) automatically for Unity Catalog managed tables; no extra workspace Spark configs are required. For Iceberg, Managed Iceberg tables are explicitly optimized by Predictive Optimization (including Liquid Clustering). (databricks.com)
- "Explicitly run predictive optimization on an existing table" is imprecise. Users manually trigger compaction with `OPTIMIZE`; Predictive Optimization is the automated service. (docs.databricks.com)
- The code block uses Markdown triple backticks inside AsciiDoc, which won't render as intended. Use AsciiDoc source blocks.

Apply this rewrite to the NOTE:

````diff
-When you enable predictive optimization, you must also set the following configurations in your workspace. These configurations allow predictive optimization to automatically generate column statistics and carry out background compaction for Iceberg tables:
-
-```sql
-SET spark.databricks.delta.liquid.lazyClustering.backfillStats=true;
-SET spark.databricks.delta.computeStats.autoConflictResolution=true;
-
--- Explicitly run predictive optimization on an existing table
-OPTIMIZE <table-name>;
-```
+When you enable predictive optimization for Unity Catalog, Databricks automatically manages statistics collection and background maintenance (for example, OPTIMIZE and VACUUM) for managed tables, including Managed Iceberg. No additional Spark configuration is required.
+
+To manually trigger maintenance on an existing table, run:
+
+[,sql]
+----
+-- Compaction / incremental clustering
+OPTIMIZE <catalog-name>.redpanda.<table-name>;
+-- Optional: explicitly compute optimizer stats
+ANALYZE TABLE <catalog-name>.redpanda.<table-name> COMPUTE STATISTICS;
+----
````

References: Predictive Optimization behavior and scope; OPTIMIZE command; Managed Iceberg support. (databricks.com, docs.databricks.com)
Context from the diff:

```sql
/*
Optionally trigger compaction and liquid clustering on an existing table.
Run OPTIMIZE to check the effect of predictive optimization on the table.
```
I don't get this phrase. We need to run to check the effect? That doesn't sound right.

```diff
-Run OPTIMIZE to check the effect of predictive optimization on the table.
+Run OPTIMIZE to trigger compaction and liquid clustering on the table.
```

Consider this suggestion. Adjust to match your needs.
@mattschumpert How does this sound: "Run OPTIMIZE to see any immediate effects of predictive optimization on the table"
Force-pushed from 9d27ebc to 1c95408 (Compare)
lgtm
Description
This pull request adds a new note to the Databricks Unity Catalog setup documentation, clarifying required configuration steps when enabling predictive optimization. The note provides specific SQL commands that users must set in their workspace to ensure predictive optimization works correctly with Iceberg tables.
Documentation improvements:
- Added a note to iceberg-topics-databricks-unity.adoc, detailing required workspace configurations and SQL commands for predictive optimization to generate column statistics and perform background compaction on Iceberg tables.

Resolves https://redpandadata.atlassian.net/browse/
Review deadline:
Page previews
Query Iceberg Topics Using Databricks and Unity Catalog
Checks