Unity Catalog config for predictive optimization #1333
Conversation
📝 Walkthrough

The PR updates the Databricks Unity Catalog + Iceberg documentation in modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc. It adds a NOTE on required workspace configuration for predictive optimization/compaction (lazyClustering, backfillStats, autoConflictResolution) and recommends an explicit OPTIMIZE. It expands Unity Catalog permissions (ALL PRIVILEGES, EXTERNAL USE SCHEMA). It broadens cluster configuration guidance (rpk cluster config edit, REST catalog properties, secret handling, cloud login) with copyable examples and placeholders. It enhances SQL examples for catalog/table name parsing and shows result output. Placeholder handling distinguishes cloud vs. non-cloud and includes a Unity Catalog name placeholder.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant RP as Redpanda Cluster
    participant DB as Databricks Workspace
    participant UC as Unity Catalog
    U->>DB: Set workspace configs (lazyClustering, backfillStats, autoConflictResolution)
    Note over DB: Predictive optimization prerequisites configured
    U->>RP: rpk cluster config edit (iceberg_* properties, REST endpoint, auth, warehouse)
    U->>DB: Store secrets (service principal tokens) if needed
    U->>RP: Reference stored secrets in cluster properties
    U->>DB: Grant permissions in UC (ALL PRIVILEGES, EXTERNAL USE SCHEMA)
    DB->>UC: Apply grants
    U->>DB: Run SQL using fully-qualified names<br/>and optional OPTIMIZE command
    DB->>UC: Resolve catalog/schema/table
    DB->>RP: Read/Write Iceberg topic data via REST catalog
    DB-->>U: Query results / compaction outcome
```
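As a rough, non-authoritative sketch of the cluster properties this walkthrough refers to (only `iceberg_rest_catalog_endpoint` and `iceberg_rest_catalog_warehouse` are named elsewhere in this review; the placeholder values and the endpoint path are assumptions drawn from the review comments, not copied from the PR):

```yaml
# Hypothetical cluster-property fragment; placeholders, not values from the PR.
# Iceberg REST endpoint exposed by a Databricks workspace:
iceberg_rest_catalog_endpoint: https://<workspace-host>/api/2.1/unity-catalog/iceberg-rest
# Must be set to the Unity Catalog catalog name (the catalog that
# contains the redpanda schema):
iceberg_rest_catalog_warehouse: <unity-catalog-name>
```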
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 1

🧹 Nitpick comments (7)

modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc (7)

Lines 31-37: AsciiDoc code fence, not Markdown backticks

Even if you keep a code example here, switch from Markdown ```sql fences to AsciiDoc:

````diff
-```sql
-OPTIMIZE <table-name>;
-```
+[,sql]
+----
+OPTIMIZE <catalog-name>.redpanda.<table-name>;
+----
````
Lines 91-93: Permissions: good call-out on EXTERNAL USE SCHEMA; consider least-privilege example

- Including `EXTERNAL USE SCHEMA` is correct and necessary; it isn't implied by `ALL PRIVILEGES`. (docs.databricks.com)
- To promote least privilege, consider showing a minimal GRANT example (avoiding broad `ALL PRIVILEGES`) alongside the current text:

```sql
GRANT USE CATALOG ON CATALOG `<catalog-name>` TO `<sp>`;
GRANT USE SCHEMA ON SCHEMA `<catalog-name>`.redpanda TO `<sp>`;
GRANT SELECT ON ALL TABLES IN SCHEMA `<catalog-name>`.redpanda TO `<sp>`;
GRANT EXTERNAL USE SCHEMA ON SCHEMA `<catalog-name>`.redpanda TO `<sp>`;
```
Lines 100-118: Endpoint and OAuth settings look correct; add one clarifier

- The Iceberg REST endpoint and OAuth server URI match Databricks docs (`/api/2.1/unity-catalog/iceberg-rest` and `/oidc/v1/token` on the workspace host). (docs.databricks.com, learn.microsoft.com)
- Minor clarity improvement: add a comment that `iceberg_rest_catalog_warehouse` must be set to the Unity Catalog catalog name (you already explain later, but repeating inline near the sample reduces foot-guns).

```diff
 iceberg_rest_catalog_warehouse: <unity-catalog-name>
+# Set this to your Unity Catalog catalog name.
```
Lines 120-142: Cloud example is solid; optionally show post-apply verification

Consider adding a quick verification step after `rpk cluster config set` so users can confirm what was applied:

```bash
rpk cluster config get iceberg_rest_catalog_endpoint
rpk cluster config get iceberg_rest_catalog_warehouse
```
Lines 146-156: Placeholders are clear; tie 'warehouse' explicitly to catalog name

To eliminate ambiguity, explicitly state here that `iceberg_rest_catalog_warehouse` equals `<unity-catalog-name>` (the catalog where the `redpanda` schema appears). This echoes Databricks' "warehouse" parameter for REST catalog clients. (docs.databricks.com)
Lines 204-206: Query example: nice use of quoted identifiers; consider LIMIT

Small UX tweak to avoid returning large datasets by default:

```diff
-SELECT * FROM `<catalog-name>`.redpanda.`<table-name>`;
+SELECT * FROM `<catalog-name>`.redpanda.`<table-name>` LIMIT 10;
```
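Since the query examples lean on backtick-quoted identifiers, a small illustrative helper (not part of the PR; `quoted_fqn` is a hypothetical name) shows how a fully-qualified Spark SQL identifier can be assembled, escaping embedded backticks by doubling them as Spark SQL identifier quoting expects:

```python
def quoted_fqn(catalog: str, schema: str, table: str) -> str:
    """Build a fully-qualified, backtick-quoted Spark SQL identifier."""
    def quote(part: str) -> str:
        # A literal backtick inside a quoted identifier is escaped
        # by doubling it.
        return "`" + part.replace("`", "``") + "`"
    return ".".join(quote(p) for p in (catalog, schema, table))

print(quoted_fqn("my-catalog", "redpanda", "orders.v1"))
# -> `my-catalog`.`redpanda`.`orders.v1`
```

Quoting every part keeps names containing hyphens or dots (common in topic-derived table names) from breaking the three-part name resolution.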
Lines 210-219: Block role and language

This is a rendered result grid, not SQL. Consider `[source,text,role="no-copy no-wrap"]` to avoid confusing syntax highlighting:

```diff
-[,sql,role="no-copy no-wrap"]
+[,text,role="no-copy no-wrap"]
```
📒 Files selected for processing (1)

- modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc (1 hunks)
The added NOTE in the diff:

````diff
+[NOTE]
+====
+When you enable predictive optimization, you must also set the following configurations in your workspace. These configurations allow predictive optimization to automatically generate column statistics and carry out background compaction for Iceberg tables:
+
+```sql
+SET spark.databricks.delta.liquid.lazyClustering.backfillStats=true;
+SET spark.databricks.delta.computeStats.autoConflictResolution=true;
+
+-- Explicitly run predictive optimization on an existing table
+OPTIMIZE <table-name>;
+```
+====
````
🛠️ Refactor suggestion

Remove undocumented Spark configs; fix code-block format; recommend supported steps

- The two Spark configs shown (`spark.databricks.delta.liquid.lazyClustering.backfillStats` and `spark.databricks.delta.computeStats.autoConflictResolution`) are not publicly documented, and setting non-`spark.sql.*` properties via SQL `SET` in a notebook/SQL Editor is not supported. Databricks' guidance is that only `spark.sql.*` can be set from SQL; other Spark configs must be set at compute level. Including these here will mislead users and likely won't take effect. (kb.databricks.com)
- Predictive Optimization already collects statistics and runs maintenance (OPTIMIZE/VACUUM/ANALYZE) automatically for Unity Catalog managed tables; no extra workspace Spark configs are required. For Iceberg, Managed Iceberg tables are explicitly optimized by Predictive Optimization (including Liquid Clustering). (databricks.com)
- "Explicitly run predictive optimization on an existing table" is imprecise. Users manually trigger compaction with `OPTIMIZE`; Predictive Optimization is the automated service. (docs.databricks.com)
- The code block uses Markdown triple backticks inside AsciiDoc, which won't render as intended. Use AsciiDoc source blocks.

Apply this rewrite to the NOTE:

````diff
-When you enable predictive optimization, you must also set the following configurations in your workspace. These configurations allow predictive optimization to automatically generate column statistics and carry out background compaction for Iceberg tables:
-
-```sql
-SET spark.databricks.delta.liquid.lazyClustering.backfillStats=true;
-SET spark.databricks.delta.computeStats.autoConflictResolution=true;
-
--- Explicitly run predictive optimization on an existing table
-OPTIMIZE <table-name>;
-```
+When you enable predictive optimization for Unity Catalog, Databricks automatically manages statistics collection and background maintenance (for example, OPTIMIZE and VACUUM) for managed tables, including Managed Iceberg. No additional Spark configuration is required.
+
+To manually trigger maintenance on an existing table, run:
+
+[,sql]
+----
+-- Compaction / incremental clustering
+OPTIMIZE <catalog-name>.redpanda.<table-name>;
+-- Optional: explicitly compute optimizer stats
+ANALYZE TABLE <catalog-name>.redpanda.<table-name> COMPUTE STATISTICS;
+----
````

References: Predictive Optimization behavior and scope; OPTIMIZE command; Managed Iceberg support. (databricks.com, docs.databricks.com)
Context from the diff:

```sql
/*
Optionally trigger compaction and liquid clustering on an existing table.
Run OPTIMIZE to check the effect of predictive optimization on the table.
```
I don't get this phrase. We need to run to check the effect? That doesn't sound right.

```diff
-Run OPTIMIZE to check the effect of predictive optimization on the table.
+Run OPTIMIZE to trigger compaction and liquid clustering on the table.
```

Consider this suggestion. Adjust to match your needs.
@mattschumpert How does this sound: "Run OPTIMIZE to see any immediate effects of predictive optimization on the table"
Force-pushed from 9d27ebc to 1c95408 (Compare)
lgtm
Description
This pull request adds a new note to the Databricks Unity Catalog setup documentation, clarifying required configuration steps when enabling predictive optimization. The note provides specific SQL commands that users must set in their workspace to ensure predictive optimization works correctly with Iceberg tables.
Documentation improvements:
- Added a note to iceberg-topics-databricks-unity.adoc, detailing required workspace configurations and SQL commands for predictive optimization to generate column statistics and perform background compaction on Iceberg tables.

Resolves https://redpandadata.atlassian.net/browse/
Review deadline:
Page previews
Query Iceberg Topics Using Databricks and Unity Catalog
Checks