60 commits
ff6b33b
Working prototype
chi-yang-db Aug 7, 2025
02da2bf
Add basic tool scripts
chi-yang-db Aug 7, 2025
a3a7683
Parameterize raw table name
chi-yang-db Aug 7, 2025
1d79e2b
Finish the dev scripts
chi-yang-db Aug 7, 2025
6f2b5a4
Fix an identifier in script
chi-yang-db Aug 8, 2025
37700e8
User file trigger
chi-yang-db Aug 8, 2025
2e0cbbf
Fix file trigger path
chi-yang-db Aug 8, 2025
4fa17eb
Fix gitignore
chi-yang-db Aug 8, 2025
2a6062b
Unpause file trigger by default
chi-yang-db Aug 8, 2025
26ff855
Switch to streaming query
chi-yang-db Aug 11, 2025
c5212cf
Successful dynamic table prototype
chi-yang-db Aug 12, 2025
efb8e4a
Eliminate connector name
chi-yang-db Aug 12, 2025
44e4ce6
Fix shell script
chi-yang-db Aug 12, 2025
9a9e93b
Working prototype for dynamic table kernel
chi-yang-db Aug 12, 2025
6eb2c63
Fix comma and trigger job instead
chi-yang-db Aug 12, 2025
69436f6
Update property to all resource
chi-yang-db Aug 13, 2025
28a2891
Fix order number
chi-yang-db Aug 13, 2025
c8e5d0e
Enable bundle target in tool scripts
chi-yang-db Aug 13, 2025
54faae0
Embed bundle target to all scripts
chi-yang-db Aug 14, 2025
96811b8
Added CRUD
chi-yang-db Aug 14, 2025
fd73830
Add new default options to kernel
chi-yang-db Aug 16, 2025
0409893
Add helper script to open all resource
chi-yang-db Aug 20, 2025
644345e
Before migrating to new CUJ
chi-yang-db Sep 11, 2025
b089638
Successful deployment after dab conversion
chi-yang-db Sep 12, 2025
7b5c85f
Add path property if possible
chi-yang-db Sep 12, 2025
1465734
Successfully pass in parameters
chi-yang-db Sep 15, 2025
9eb37b7
Create basic folder structure in volume
chi-yang-db Sep 15, 2025
5378823
Infer data path
chi-yang-db Sep 15, 2025
ebff04a
tidy up directory
chi-yang-db Sep 15, 2025
802549c
Migrate filed and add initialization script
chi-yang-db Sep 16, 2025
5ef1e05
Fix relative path issue
chi-yang-db Sep 16, 2025
b9546d0
Add config manager and a working debug notebook
chi-yang-db Sep 16, 2025
299997e
Working debug notebook
chi-yang-db Sep 16, 2025
e7eb8ef
Refactor managers
chi-yang-db Sep 16, 2025
a4cbc52
Successfully create placeholder table
chi-yang-db Sep 17, 2025
322bc48
Solve the empty table DLT resolve issue
chi-yang-db Sep 17, 2025
04b1662
Better way to solve the empty DLT resolve
chi-yang-db Sep 17, 2025
d1f8373
Fix a flow name conflict issue
chi-yang-db Sep 17, 2025
29b8609
Working format manager
chi-yang-db Sep 18, 2025
1caa333
Fix multi table bug
chi-yang-db Sep 18, 2025
3e906e6
Fix option merge issue
chi-yang-db Sep 18, 2025
1c5ba0f
Add corrupted record columm to CSV and JSON option
chi-yang-db Sep 18, 2025
11540ce
Refactor and add expectation
chi-yang-db Sep 19, 2025
b0d27c8
Add warning for default storage
chi-yang-db Sep 19, 2025
b0beb24
Clean up workspace
chi-yang-db Sep 19, 2025
36bad25
Create cleansource move folder
chi-yang-db Sep 22, 2025
b88869b
Include cleansource move destination
chi-yang-db Sep 22, 2025
7a89c08
Fix issue on _corrupted_record column
chi-yang-db Sep 22, 2025
c26bf42
Tidy up
chi-yang-db Sep 23, 2025
628753e
Fix file name
chi-yang-db Sep 23, 2025
2fdaea3
Update default target and decription
chi-yang-db Sep 23, 2025
c17b732
Print environment to console
chi-yang-db Sep 23, 2025
e3fb633
More doc in debug notebook
chi-yang-db Sep 23, 2025
7dc7dcf
Add README
chi-yang-db Sep 23, 2025
e5ff8f4
Beautify README
chi-yang-db Sep 23, 2025
94b394a
Fix a display issue in README
chi-yang-db Sep 23, 2025
9003a6c
newline
chi-yang-db Sep 23, 2025
4df6380
Update example name
chi-yang-db Sep 23, 2025
2f55c3d
Fix typo in doc and enrich instructions
chi-yang-db Sep 24, 2025
ce62a8b
Add more comments
chi-yang-db Sep 29, 2025
3 changes: 2 additions & 1 deletion CODEOWNERS
@@ -11,6 +11,7 @@ conversational-agent-app @vivian-xie-db @yuanchaoma-db
database-diagram-builder @alexott
downstreams @nfx @alexott
feature-registry-app @yang-chengg @mparkhe @mingyangge-db @stephanielu5
filepush @chi-yang-db
go-libs @nfx @alexott
ip_access_list_analyzer @alexott
ka-chat-bot @taiga-db
@@ -19,4 +20,4 @@ runtime-packages @nfx @alexott
sql_migration_copilot @robertwhiffin
tacklebox @Jonathan-Choi
uc-catalog-cloning @esiol-db @vasco-lopes
.github @nfx @alexott @gueniai
.github @nfx @alexott @gueniai
1 change: 1 addition & 0 deletions filepush/.gitignore
@@ -0,0 +1 @@
.vscode
179 changes: 179 additions & 0 deletions filepush/README.md
@@ -0,0 +1,179 @@
---
title: "Managed File Push"
language: python
author: "Chi Yang"
date: 2025-08-07

tags:
- ingestion
- file
- nocode
---

# Managed File Push

A lightweight, no‑code file ingestion workflow. Configure a set of tables, get a volume path for each, and drop files into those paths—your data lands in Unity Catalog tables via Auto Loader.

## Table of Contents
- [Quick Start](#quick-start)
  - [Step 1. Configure tables](#step-1-configure-tables)
  - [Step 2. Deploy & set up](#step-2-deploy--set-up)
  - [Step 3. Retrieve endpoint & push files](#step-3-retrieve-endpoint--push-files)
- [Debug Table Issues](#debug-table-issues)
  - [Step 1. Configure tables to debug](#step-1-configure-tables-to-debug)
  - [Step 2. Deploy & set up in dev mode](#step-2-deploy--set-up-in-dev-mode)
  - [Step 3. Retrieve endpoint & push files to debug](#step-3-retrieve-endpoint--push-files-to-debug)
  - [Step 4. Debug table configs](#step-4-debug-table-configs)
  - [Step 5. Fix the table configs in production](#step-5-fix-the-table-configs-in-production)

---

## Quick Start

### Step 1. Configure tables
In `./dab/databricks.yml`, define the catalog and the name of a **new** schema where the tables will land:

```yaml
variables:
  catalog_name:
    description: The existing catalog where the NEW schema will be created.
    default: main
  schema_name:
    description: The name of the NEW schema where the tables will be created.
    default: filepushschema
```

Edit table configs in `./dab/src/configs/tables.json`. Only `name` and `format` are required.

For supported `format_options`, see the [Auto Loader options](https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options). Not all options are supported here. If unsure, specify only `name` and `format`, or follow [Debug Table Issues](#debug-table-issues) to discover the correct options.

```json
[
  {
    "name": "table1",
    "format": "csv",
    "format_options": { "escape": "\"" },
    "schema_hints": "id int, name string"
  },
  {
    "name": "table2",
    "format": "json"
  }
]
```

> **Tip:** Keep `schema_hints` minimal; Auto Loader can evolve the schema as new columns appear.
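
Under the hood, each table config drives an Auto Loader stream. The sketch below illustrates that mapping, assuming a Lakeflow Declarative (DLT) pipeline; the `make_flow` helper is hypothetical and not the project's actual `ingestion.py`, but `cloudFiles.format` and `cloudFiles.schemaHints` are real Auto Loader options.

```python
# Minimal sketch (not the project's ingestion.py) of how a tables.json entry
# could be turned into an Auto Loader flow inside a DLT pipeline.
import dlt  # available only inside a Databricks pipeline


def make_flow(spark, table_config: dict, volume_path_data: str):
    name = table_config["name"]

    @dlt.table(name=name)
    def _ingest():
        reader = (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", table_config["format"])
            .options(**table_config.get("format_options", {}))
        )
        hints = table_config.get("schema_hints")
        if hints:
            # Auto Loader merges the hints with the inferred schema and evolves it
            # as new columns appear, so the hints can stay minimal.
            reader = reader.option("cloudFiles.schemaHints", hints)
        return reader.load(f"{volume_path_data}/{name}")
```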

### Step 2. Deploy & set up

```bash
cd dab
databricks bundle deploy
databricks bundle run configuration_job
```

Wait for the configuration job to finish before moving on.
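
If you want to confirm programmatically that the schema was created, a quick check with the Databricks Python SDK might look like this (a sketch, not part of the bundle; assumes `databricks-sdk` is installed and default authentication is configured):

```python
# Optional sanity check: verify the configuration job created the target schema.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up DATABRICKS_HOST/DATABRICKS_TOKEN or a configured profile
schema = w.schemas.get(full_name="main.filepushschema")  # adjust to your catalog and schema
print(f"Schema ready: {schema.full_name}")
```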

### Step 3. Retrieve endpoint & push files
Fetch the volume path for uploading files to a specific table (example: `table1`):

```bash
databricks tables get main.filepushschema.table1 --output json \
| jq -r '.properties["filepush.table_volume_path_data"]'
```

Example output:

```text
/Volumes/main/filepushschema/main_filepushschema_filepush_volume/data/table1
```

Upload files to the path above using any of the [Volumes file APIs](https://docs.databricks.com/aws/en/volumes/volume-files#methods-for-managing-files-in-volumes).

**REST API example**:

```bash
# prerequisites: export DATABRICKS_HOST and DATABRICKS_TOKEN (personal access token)
curl -X PUT "$DATABRICKS_HOST/api/2.0/fs/files/Volumes/main/filepushschema/main_filepushschema_filepush_volume/data/table1/datafile1.csv" \
-H "Authorization: Bearer $DATABRICKS_TOKEN" \
-H "Content-Type: application/octet-stream" \
--data-binary @"/local/file/path/datafile1.csv"
```

**Databricks CLI example** (destination uses the `dbfs:` scheme):

```bash
databricks fs cp /local/file/path/datafile1.csv \
dbfs:/Volumes/main/filepushschema/main_filepushschema_filepush_volume/data/table1
```
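
**Python SDK example** (a sketch using `databricks-sdk`; assumes the package is installed and authenticated, and reuses the volume path retrieved above):

```python
# Upload a local file to the table's volume path via the Files API.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
dest = "/Volumes/main/filepushschema/main_filepushschema_filepush_volume/data/table1/datafile1.csv"
with open("/local/file/path/datafile1.csv", "rb") as f:
    w.files.upload(dest, f, overwrite=True)
```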

Within about a minute, the data should appear in the table `main.filepushschema.table1`.
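
To spot-check the result, run a quick preview from any notebook in the workspace (assumes the notebook environment, where `spark` and `display` are available):

```python
# Preview the first rows of the ingested table.
display(spark.table("main.filepushschema.table1").limit(10))
```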

---

## Debug Table Issues
If data isn’t parsed as expected, use **dev mode** to iterate on table options safely.

### Step 1. Configure tables to debug
Configure tables as in [Step 1 of Quick Start](#step-1-configure-tables).

### Step 2. Deploy & set up in **dev mode**

```bash
cd dab
databricks bundle deploy -t dev
databricks bundle run configuration_job -t dev
```

Wait for the configuration job to finish. Example output:

```text
2025-09-23 22:03:04,938 [INFO] initialization - ==========
catalog_name: main
schema_name: dev_chi_yang_filepushschema
volume_path_root: /Volumes/main/dev_chi_yang_filepushschema/main_filepushschema_filepush_volume
volume_path_data: /Volumes/main/dev_chi_yang_filepushschema/main_filepushschema_filepush_volume/data
volume_path_archive: /Volumes/main/dev_chi_yang_filepushschema/main_filepushschema_filepush_volume/archive
==========
```

> **Note:** In **dev mode**, the schema name is **prefixed**. Use the printed schema name for the remaining steps.

### Step 3. Retrieve endpoint & push files to debug
Get the dev volume path (note the prefixed schema):

```bash
databricks tables get main.dev_chi_yang_filepushschema.table1 --output json \
| jq -r '.properties["filepush.table_volume_path_data"]'
```

Example output:

```text
/Volumes/main/dev_chi_yang_filepushschema/main_filepushschema_filepush_volume/data/table1
```

Then follow the upload instructions from [Quick Start → Step 3](#step-3-retrieve-endpoint--push-files) to send test files.

### Step 4. Debug table configs
Open the pipeline in the workspace:

```bash
databricks bundle open refresh_pipeline -t dev
```

Click **Edit pipeline** to launch the development UI. Open the `debug_table_config` notebook and follow its guidance to refine the table options. When satisfied, copy the final config back to `./dab/src/configs/tables.json`.
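
For example, if debugging shows the CSV files have a header row and multi-line records, the config you iterate on in the notebook (and later copy back) might look like the sketch below; `header` and `multiLine` are standard Auto Loader CSV options, shown here purely as an illustration:

```python
# Hypothetical refined config for the debug notebook; only "name" and "format" are required.
table_config = r'''
{
  "name": "table1",
  "format": "csv",
  "format_options": {
    "escape": "\"",
    "header": "true",
    "multiLine": "true"
  },
  "schema_hints": "id int, name string"
}
'''
```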

### Step 5. Fix the table configs in production
Redeploy the updated config and run a full refresh to correct existing data for an affected table:

```bash
cd dab
databricks bundle deploy
databricks bundle run refresh_pipeline --full-refresh table1
```

---

**That’s it!** You now have a managed file‑push workflow with debuggable table configs and repeatable deployments.

36 changes: 36 additions & 0 deletions filepush/dab/databricks.yml
@@ -0,0 +1,36 @@
# databricks.yml
# This is the configuration for the file push DAB (bundle name: dab).

bundle:
  name: dab

include:
  - resources/*.yml

targets:
  # The deployment targets. See https://docs.databricks.com/en/dev-tools/bundles/deployment-modes.html
  dev:
    mode: development
    workspace:
      host: https://e2-dogfood.staging.cloud.databricks.com

  prod:
    mode: production
    default: true
    workspace:
      host: https://e2-dogfood.staging.cloud.databricks.com
      root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
    permissions:
      - user_name: ${workspace.current_user.userName}
        level: CAN_MANAGE

variables:
  catalog_name:
    description: The existing catalog where the NEW schema will be created.
    default: chi_catalog
  schema_name:
    description: The name of the NEW schema where the tables will be created.
    default: filepushschema
  resource_name_prefix:
    description: The prefix for the resource names.
    default: ${var.catalog_name}_${var.schema_name}_
46 changes: 46 additions & 0 deletions filepush/dab/resources/job.yml
@@ -0,0 +1,46 @@
# The main job for schema dab
# The file trigger job runs the schema refresh pipeline when new files arrive

resources:
  jobs:
    filetrigger_job:
      name: ${var.resource_name_prefix}filetrigger_job
      tasks:
        - task_key: pipeline_refresh
          pipeline_task:
            pipeline_id: ${resources.pipelines.refresh_pipeline.id}
      trigger:
        file_arrival:
          url: ${resources.volumes.filepush_volume.volume_path}/data/
    configuration_job:
      name: ${var.resource_name_prefix}configuration_job
      tasks:
        - task_key: initialization
          spark_python_task:
            python_file: ../src/utils/initialization.py
            parameters:
              - "--catalog_name"
              - "{{job.parameters.catalog_name}}"
              - "--schema_name"
              - "{{job.parameters.schema_name}}"
              - "--volume_path_root"
              - "{{job.parameters.volume_path_root}}"
              - "--logging_level"
              - "${bundle.target}"
          environment_key: serverless
        - task_key: trigger_refresh
          run_job_task:
            job_id: ${resources.jobs.filetrigger_job.id}
          depends_on:
            - task_key: initialization
      environments:
        - environment_key: serverless
          spec:
            client: "3"
      parameters:
        - name: catalog_name
          default: ${var.catalog_name}
        - name: schema_name
          default: ${resources.schemas.main_schema.name}
        - name: volume_path_root
          default: ${resources.volumes.filepush_volume.volume_path}
15 changes: 15 additions & 0 deletions filepush/dab/resources/pipeline.yml
@@ -0,0 +1,15 @@
# The table refresh pipeline for schema dab

resources:
  pipelines:
    refresh_pipeline:
      name: ${var.resource_name_prefix}refresh_pipeline
      catalog: ${var.catalog_name}
      schema: ${resources.schemas.main_schema.name}
      serverless: true
      libraries:
        - file:
            path: ../src/ingestion.py
      root_path: ../src
      configuration:
        filepush.volume_path_root: ${resources.volumes.filepush_volume.volume_path}
7 changes: 7 additions & 0 deletions filepush/dab/resources/schema.yml
@@ -0,0 +1,7 @@
# The schema dab

resources:
  schemas:
    main_schema:
      name: ${var.schema_name}
      catalog_name: ${var.catalog_name}
8 changes: 8 additions & 0 deletions filepush/dab/resources/volume.yml
@@ -0,0 +1,8 @@
# The file staging volume for schema dab

resources:
  volumes:
    filepush_volume:
      name: ${var.resource_name_prefix}filepush_volume
      catalog_name: ${var.catalog_name}
      schema_name: ${var.schema_name}
10 changes: 10 additions & 0 deletions filepush/dab/src/configs/tables.json
@@ -0,0 +1,10 @@
[
  {
    "name": "example_table",
    "format": "csv",
    "format_options": {
      "escape": "\""
    },
    "schema_hints": "id int, name string"
  }
]
63 changes: 63 additions & 0 deletions filepush/dab/src/debug_table_config.py
@@ -0,0 +1,63 @@
# Databricks notebook source
# MAGIC %md
# MAGIC ## Paste the table config JSON you would like to debug from `./configs/tables.json` and assign to variable `table_config`
# MAGIC For example,
# MAGIC ```
# MAGIC table_config = r'''
# MAGIC {
# MAGIC   "name": "all_employees",
# MAGIC   "format": "csv",
# MAGIC   "format_options": {
# MAGIC     "escape": "\"",
# MAGIC     "multiLine": "false"
# MAGIC   },
# MAGIC   "schema_hints": "id int, name string"
# MAGIC }
# MAGIC '''
# MAGIC ```
# MAGIC Only `name` and `format` are required for a table.

# COMMAND ----------

table_config = r'''
{
  "name": "employees",
  "format": "csv",
  "format_options": {
    "escape": "\""
  },
  "schema_hints": "id int, name string"
}
'''

# COMMAND ----------

# MAGIC %md
# MAGIC ## Click `Run all` and inspect the parsed result. Iterate on the config until the result looks good

# COMMAND ----------

import json
import tempfile
from utils import tablemanager
from utils import envmanager

if not envmanager.has_default_storage():
    print("WARNING: Current catalog is not using default storage, some file push features may not be available")

# Load table config
table_config_json = json.loads(table_config)
tablemanager.validate_config(table_config_json)
table_name = table_config_json["name"]
table_volume_path_data = tablemanager.get_table_volume_path(table_name)

assert tablemanager.has_data_file(table_name), f"No data file found in {table_volume_path_data}. Please upload at least 1 file to {table_volume_path_data}"

# Put schema location in temp directory
with tempfile.TemporaryDirectory() as tmpdir:
    display(tablemanager.get_df_with_config(spark, table_config_json, tmpdir))

# COMMAND ----------

# MAGIC %md
# MAGIC ## Copy and paste the modified config back to the `./configs/tables.json` in the DAB folder