
Conversation

kbatuigas
Contributor

@kbatuigas kbatuigas commented Aug 21, 2025

Description

This pull request adds a new note to the Databricks Unity Catalog setup documentation, clarifying required configuration steps when enabling predictive optimization. The note provides specific SQL commands that users must set in their workspace to ensure predictive optimization works correctly with Iceberg tables.

Documentation improvements:

  • Added a [NOTE] block to the Databricks Unity Catalog setup instructions in iceberg-topics-databricks-unity.adoc, detailing required workspace configurations and SQL commands for predictive optimization to generate column statistics and perform background compaction on Iceberg tables.
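
For reference, a sketch of how such a NOTE could be written in AsciiDoc. The two `spark.databricks.*` settings and the `OPTIMIZE` command are taken from the PR diff quoted in the review; the surrounding block syntax is a suggestion, not the exact committed text:

```asciidoc
[NOTE]
====
When you enable predictive optimization, also set the following configurations in your workspace. These allow predictive optimization to generate column statistics and run background compaction on Iceberg tables:

[,sql]
----
SET spark.databricks.delta.liquid.lazyClustering.backfillStats=true;
SET spark.databricks.delta.computeStats.autoConflictResolution=true;

-- Optionally trigger compaction on an existing table
OPTIMIZE <table-name>;
----
====
```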

Resolves https://redpandadata.atlassian.net/browse/
Review deadline:

Page previews

Query Iceberg Topics Using Databricks and Unity Catalog

Checks

  • New feature
  • Content gap
  • Support Follow-up
  • Small fix (typos, links, copyedits, etc.)

@kbatuigas kbatuigas requested a review from a team as a code owner August 21, 2025 18:18

netlify bot commented Aug 21, 2025

Deploy Preview for redpanda-docs-preview ready!

Name Link
🔨 Latest commit 65a1b53
🔍 Latest deploy log https://app.netlify.com/projects/redpanda-docs-preview/deploys/68ac9be8fe554b00084e9e7d
😎 Deploy Preview https://deploy-preview-1333--redpanda-docs-preview.netlify.app


coderabbitai bot commented Aug 21, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.


📝 Walkthrough

The PR updates the Databricks Unity Catalog + Iceberg documentation in modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc:

  • Adds a NOTE on the workspace configuration required for predictive optimization/compaction (lazyClustering backfillStats, computeStats autoConflictResolution) and recommends an explicit OPTIMIZE.
  • Expands Unity Catalog permissions (ALL PRIVILEGES, EXTERNAL USE SCHEMA).
  • Broadens cluster configuration guidance (rpk cluster config edit, REST catalog properties, secret handling, cloud login) with copyable examples and placeholders.
  • Enhances the SQL examples with catalog/table name parsing and shows result output.
  • Distinguishes placeholder handling for cloud vs. non-cloud deployments and adds a Unity Catalog name placeholder.
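
The cluster configuration changes described above can be sketched as a properties fragment. The endpoint and warehouse property names, and the two URL paths, appear elsewhere in this thread; any other keys needed (for example, OAuth client credentials) are assumptions to verify against the Redpanda documentation:

```yaml
# Unity Catalog's Iceberg REST endpoint on the workspace host
iceberg_rest_catalog_endpoint: https://<workspace-host>/api/2.1/unity-catalog/iceberg-rest
# Must be set to the Unity Catalog catalog name
iceberg_rest_catalog_warehouse: <unity-catalog-name>
# OAuth token endpoint on the workspace host (property name is an assumption)
# iceberg_rest_catalog_oauth2_server_uri: https://<workspace-host>/oidc/v1/token
```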

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant RP as Redpanda Cluster
  participant DB as Databricks Workspace
  participant UC as Unity Catalog

  U->>DB: Set workspace configs (lazyClustering, backfillStats, autoConflictResolution)
  Note over DB: Predictive optimization prerequisites configured

  U->>RP: rpk cluster config edit (iceberg_* properties, REST endpoint, auth, warehouse)
  U->>DB: Store secrets (service principal tokens) if needed
  U->>RP: Reference stored secrets in cluster properties

  U->>DB: Grant permissions in UC (ALL PRIVILEGES, EXTERNAL USE SCHEMA)
  DB->>UC: Apply grants

  U->>DB: Run SQL using fully-qualified names<br/>and optional OPTIMIZE command
  DB->>UC: Resolve catalog/schema/table
  DB->>RP: Read/Write Iceberg topic data via REST catalog
  DB-->>U: Query results / compaction outcome

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Assessment against linked issues

Objective Addressed Explanation
Document custom configuration required to enable Databricks compaction with Iceberg Topics, including predictive optimization prerequisites and explicit OPTIMIZE guidance DOC-1572

Assessment against linked issues: Out-of-scope changes

Code Change | Explanation

  • Added Unity Catalog permission steps for ALL PRIVILEGES and EXTERNAL USE SCHEMA (modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc): these permissions are broader than documenting compaction-specific configuration and are not explicitly required by DOC-1572.
  • Expanded REST catalog cluster properties and authentication instructions, including secret storage and rpk cloud login (modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc): general cluster/auth configuration guidance exceeds the scope of documenting the two workspace settings needed for compaction.
  • Enhanced SQL examples for catalog/table quoting and result grid (modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc): SQL usage examples are ancillary and not directly tied to the compaction configuration objective.

Suggested reviewers

  • mattschumpert
  • micheleRP
  • nvartolomei
  • Feediver1
  • rpdevmp



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (7)
modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc (7)

31-37: AsciiDoc code fence, not Markdown backticks

Even if you keep a code example here, switch from ```sql fences to AsciiDoc:

-```sql
-OPTIMIZE <table-name>;
-```
+[,sql]
+----
+OPTIMIZE <catalog-name>.redpanda.<table-name>;
+----

91-93: Permissions: good call-out on EXTERNAL USE SCHEMA; consider least-privilege example

  • Including EXTERNAL USE SCHEMA is correct and necessary; it isn’t implied by ALL PRIVILEGES. (docs.databricks.com)
  • To promote least privilege, consider showing a minimal GRANT example (avoiding broad ALL PRIVILEGES) alongside the current text:

[,sql]
----
GRANT USE CATALOG ON CATALOG <catalog-name> TO <sp>;
GRANT USE SCHEMA ON SCHEMA <catalog-name>.redpanda TO <sp>;
GRANT SELECT ON ALL TABLES IN SCHEMA <catalog-name>.redpanda TO <sp>;
GRANT EXTERNAL USE SCHEMA ON SCHEMA <catalog-name>.redpanda TO <sp>;
----
(docs.databricks.com)


100-118: Endpoint and OAuth settings look correct; add one clarifier

  • The Iceberg REST endpoint and OAuth server URI match Databricks docs (/api/2.1/unity-catalog/iceberg-rest and /oidc/v1/token on the workspace host). (docs.databricks.com, learn.microsoft.com)
  • Minor clarity improvement: add a comment that iceberg_rest_catalog_warehouse must be set to the Unity Catalog catalog name (you already explain later, but repeating inline near the sample reduces foot-guns).
 iceberg_rest_catalog_warehouse: <unity-catalog-name>
+# Set this to your Unity Catalog catalog name.

120-142: Cloud example is solid; optionally show post-apply verification

Consider adding a quick verification step after rpk cluster config set so users can confirm what was applied:

[,bash]
----
rpk cluster config get iceberg_rest_catalog_endpoint
rpk cluster config get iceberg_rest_catalog_warehouse
----


146-156: Placeholders are clear; tie ‘warehouse’ explicitly to catalog name

To eliminate ambiguity, explicitly state here that iceberg_rest_catalog_warehouse equals <unity-catalog-name> (the catalog where the redpanda schema appears). This echoes Databricks’ “warehouse” parameter for REST catalog clients. (docs.databricks.com)


204-206: Query example: nice use of quoted identifiers; consider LIMIT

Small UX tweak to avoid returning large datasets by default:

-SELECT * FROM `<catalog-name>`.redpanda.`<table-name>`;
+SELECT * FROM `<catalog-name>`.redpanda.`<table-name>` LIMIT 10;

210-219: Block role and language

This is a rendered result grid, not SQL. Consider [source,text,role="no-copy no-wrap"] to avoid confusing syntax highlighting.

-[,sql,role="no-copy no-wrap"]
+[,text,role="no-copy no-wrap"]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 4ce3a22 and a5b5637.

📒 Files selected for processing (1)
  • modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Redirect rules - redpanda-docs-preview
  • GitHub Check: Header rules - redpanda-docs-preview
  • GitHub Check: Pages changed - redpanda-docs-preview

Comment on lines 26 to 42
+
[NOTE]
====
When you enable predictive optimization, you must also set the following configurations in your workspace. These configurations allow predictive optimization to automatically generate column statistics and carry out background compaction for Iceberg tables:

```sql
SET spark.databricks.delta.liquid.lazyClustering.backfillStats=true;
SET spark.databricks.delta.computeStats.autoConflictResolution=true;

-- Explicitly run predictive optimization on an existing table
OPTIMIZE <table-name>;
```
====

🛠️ Refactor suggestion

⚠️ Potential issue

Remove undocumented Spark configs; fix code-block format; recommend supported steps

  • The two Spark configs shown (spark.databricks.delta.liquid.lazyClustering.backfillStats and spark.databricks.delta.computeStats.autoConflictResolution) are not publicly documented, and setting non-spark.sql.* properties via SQL SET in a notebook/SQL Editor is not supported. Databricks’ guidance is that only spark.sql.* can be set from SQL; other Spark configs must be set at compute level. Including these here will mislead users and likely won’t take effect. (kb.databricks.com)
  • Predictive Optimization already collects statistics and runs maintenance (OPTIMIZE/VACUUM/ANALYZE) automatically for Unity Catalog managed tables; no extra workspace Spark configs are required. For Iceberg, Managed Iceberg tables are explicitly optimized by Predictive Optimization (including Liquid Clustering). (databricks.com)
  • “Explicitly run predictive optimization on an existing table” is imprecise. Users manually trigger compaction with OPTIMIZE; Predictive Optimization is the automated service. (docs.databricks.com)
  • The code block uses Markdown triple backticks inside AsciiDoc, which won’t render as intended. Use AsciiDoc source blocks.

Apply this rewrite to the NOTE:

-When you enable predictive optimization, you must also set the following configurations in your workspace. These configurations allow predictive optimization to automatically generate column statistics and carry out background compaction for Iceberg tables:
-
-```sql
-SET spark.databricks.delta.liquid.lazyClustering.backfillStats=true;
-SET spark.databricks.delta.computeStats.autoConflictResolution=true;
-
--- Explicitly run predictive optimization on an existing table
-OPTIMIZE <table-name>;
-```
+When you enable predictive optimization for Unity Catalog, Databricks automatically manages statistics collection and background maintenance (for example, OPTIMIZE and VACUUM) for managed tables, including Managed Iceberg. No additional Spark configuration is required.
+
+To manually trigger maintenance on an existing table, run:
+
+[,sql]
+----
+-- Compaction / incremental clustering
+OPTIMIZE <catalog-name>.redpanda.<table-name>;
+-- Optional: explicitly compute optimizer stats
+ANALYZE TABLE <catalog-name>.redpanda.<table-name> COMPUTE STATISTICS;
+----

References: Predictive Optimization behavior and scope; OPTIMIZE command; Managed Iceberg support. (databricks.com, docs.databricks.com)


/*
Optionally trigger compaction and liquid clustering on an existing table.
Run OPTIMIZE to check the effect of predictive optimization on the table.

@paulohtb6 paulohtb6 Aug 22, 2025


I don't get this phrase. We need to run to check the effect? That doesn't sound right.

Suggested change
Run OPTIMIZE to check the effect of predictive optimization on the table.
Run OPTIMIZE to trigger compaction and liquid clustering on the table.


Contributor Author


@mattschumpert How does this sound: "Run OPTIMIZE to see any immediate effects of predictive optimization on the table"

@kbatuigas kbatuigas force-pushed the DOC-1572-task-document-custom-configuration-to-enable-datab branch from 9d27ebc to 1c95408 on August 25, 2025 04:29
@kbatuigas kbatuigas requested a review from paulohtb6 August 25, 2025 15:28

@paulohtb6 paulohtb6 left a comment


lgtm

@kbatuigas kbatuigas merged commit 1126e17 into main Aug 25, 2025
7 checks passed
@kbatuigas kbatuigas deleted the DOC-1572-task-document-custom-configuration-to-enable-datab branch August 25, 2025 17:38