
Databricks
This guide explains how to export data from a Databricks workspace on AWS to an Amazon S3 bucket you own, from which Auxia reads on a schedule.
Before proceeding, review the Source Data Requirements to ensure your tables are ready to be connected with Auxia.
1. Overview
Auxia reads data from your Databricks workspace via partitioned Parquet files written to an Amazon S3 bucket in your AWS account. You retain full control of the data — Auxia is granted read-only access to the bucket via a standard S3 bucket policy, and no credentials are exchanged. Auxia handles transferring data from your S3 bucket into our internal pipeline using GCS Storage Transfer Service (the same path documented in the Amazon S3 connection guide).
Flow:
Your environment (AWS) Auxia
│
│ Databricks notebook
│ │
│ │ scheduled to run on a cadence
│ ▼
│ S3: s3://your-auxia-export/
│ │
│ │ read-only access granted to Auxia via bucket policy
│ └──────────────────────────────────────────────────────────>
│ │
│ │ Auxia reads from your bucket
│ │ on your schedule
│ ▼
Time required: ~30 minutes.
Prerequisites:
- Databricks workspace on AWS with Unity Catalog enabled
- An AWS account where you can create an S3 bucket and grant bucket policy permissions
- Source tables in Delta format
- Workspace permissions to create Storage Credentials, External Locations, schemas, and notebooks
2. Export Patterns
Three export patterns are supported, depending on how your source table behaves. You can mix patterns across tables — most integrations do.
| Table type | Pattern | Cadence | CDF needed? |
|---|---|---|---|
| Append-only event / fact tables (new rows only) | Watermark-based incremental | Hourly or daily | No |
| Mutable tables (rows can be updated or deleted) | CDC via Delta Change Data Feed | Hourly or daily | Yes |
| Dimension / lookup tables (slowly changing) | Full snapshot (OVERWRITE) | Daily | No |
When in doubt for high-volume event data, default to the append-only pattern — it is the cheapest and simplest.
3. Create an S3 Bucket for Auxia Exports
Create a new S3 bucket in your AWS account dedicated to Auxia exports. Keep Block Public Access enabled (the default).
Confirm the bucket region with your Auxia solutions engineer before creating. Co-locating with Auxia's read infrastructure reduces cross-region transfer cost.
Either SSE-S3 or SSE-KMS encryption is supported. If you use SSE-KMS, you will also need to grant the Auxia IAM role kms:Decrypt on the key — see the Encryption section of the Amazon S3 guide.
4. Grant Auxia Read Access on the Bucket
Auxia reads from your S3 bucket using an Auxia-managed AWS IAM role (the same role described in the Amazon S3 connection guide). Your Auxia solutions engineer will provide the exact Role ARN during onboarding.
Add a bucket policy granting that role s3:GetObject, s3:ListBucket, and s3:GetBucketLocation:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "<auxia-role-arn-provided-by-auxia>"
},
"Action": [
"s3:GetObject",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::<your-auxia-export-bucket>",
"arn:aws:s3:::<your-auxia-export-bucket>/*"
]
}
]
}
5. Configure Databricks to Write to Your Bucket
Use Unity Catalog Storage Credentials + External Locations so the cloud credential is managed centrally rather than configured per cluster.
-
Create an AWS IAM role in your AWS account that Databricks can assume. Trust policy and required S3 write permissions are documented in the Databricks Storage Credentials guide.
-
Create a Storage Credential pointing at that role: Catalog → External Data → Storage Credentials → Create → AWS IAM Role.
-
Create an External Location at
s3://<your-auxia-export-bucket>/using that credential: Catalog → External Data → External Locations → Create. Run Test connection to verify. -
Create a destination schema in Unity Catalog whose managed location is the external location above:
CREATE SCHEMA IF NOT EXISTS <your_catalog>.auxia_export
MANAGED LOCATION 's3://<your-auxia-export-bucket>/';
6. Enable Change Data Feed (CDC Tables Only)
Only enable Change Data Feed (CDF) on tables where rows are updated or deleted and you need those changes propagated to Auxia. Skip this step for:
- Append-only event / fact tables (use a watermark filter instead — see Pattern A)
- Dimension / lookup tables you fully replace each run (Pattern C)
For each mutable source table, enable CDF:
ALTER TABLE <your_catalog>.<your_schema>.<your_table>
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
7. Configure the Export
Each export is implemented as a Databricks notebook scheduled to run on a cadence. Choose the pattern that matches how each source table behaves.
To create a notebook, open the Databricks left navigation and click New → Notebook:
Set the language to Python when prompted. The examples below are illustrative starting points — adapt table names, watermark columns, and bootstrap windows to your data.
7.0 One-Time Setup: State-Tracking Table
For Patterns A and B, create a small Delta table that records progress across runs. Skip this for Pattern C (no state needed).
CREATE TABLE IF NOT EXISTS <your_catalog>.auxia_export._export_state (
source_table STRING,
last_exported_end_time TIMESTAMP,
last_exported_version BIGINT,
last_run_timestamp TIMESTAMP
) USING DELTA;
Each scheduled run reads this table at the start, writes its export, and updates the corresponding row on success.
7.1 Pattern A — Append-only Incremental
Use for: event / fact tables where rows are only ever inserted, never updated or deleted. This is the most common pattern for high-volume data.
Examples: user events, transactions, page views, log records, sensor data — anything where each row represents an immutable past event.
from datetime import datetime, timezone
from pyspark.sql.functions import date_format
SOURCE_TABLE = '<your_catalog>.<your_schema>.<your_table>'
STATE_TABLE = '<your_catalog>.auxia_export._export_state'
EXPORT_PATH = 's3://<your-auxia-export-bucket>/<your_table>_incremental/'
WATERMARK_COL = 'event_timestamp' # replace with your column
BOOTSTRAP_START = datetime(2026, 1, 1, tzinfo=timezone.utc) # first-run lower bound
# 1. Read the last successful end_time from state.
state = spark.sql(f"""
SELECT last_exported_end_time
FROM {STATE_TABLE}
WHERE source_table = '{SOURCE_TABLE}'
""").collect()
start_time = state[0]['last_exported_end_time'] if state else BOOTSTRAP_START
end_time = datetime.now(timezone.utc).replace(minute=0, second=0, microsecond=0)
if start_time >= end_time:
print('Nothing new to export.')
dbutils.notebook.exit('no_new_data')
print(f'Exporting window: {start_time} -> {end_time}')
# 2. Read source, add dt + hr partition columns, write Parquet.
df = (spark.table(SOURCE_TABLE)
.filter(f"{WATERMARK_COL} > TIMESTAMP '{start_time}'")
.filter(f"{WATERMARK_COL} <= TIMESTAMP '{end_time}'")
.withColumn('dt', date_format(WATERMARK_COL, 'yyyy-MM-dd'))
.withColumn('hr', date_format(WATERMARK_COL, 'HH')))
(df.write
.format('parquet')
.mode('append')
.partitionBy('dt', 'hr')
.save(EXPORT_PATH))
# 3. Update state on success (explicit columns to keep other state fields untouched).
spark.sql(f"""
MERGE INTO {STATE_TABLE} target
USING (
SELECT '{SOURCE_TABLE}' AS source_table,
TIMESTAMP '{end_time}' AS last_exported_end_time,
current_timestamp() AS last_run_timestamp
) updates
ON target.source_table = updates.source_table
WHEN MATCHED THEN UPDATE SET
target.last_exported_end_time = updates.last_exported_end_time,
target.last_run_timestamp = updates.last_run_timestamp
WHEN NOT MATCHED THEN INSERT
(source_table, last_exported_end_time, last_run_timestamp)
VALUES
(updates.source_table, updates.last_exported_end_time, updates.last_run_timestamp)
""")
print(f'Exported window committed: {start_time} -> {end_time}')
Choosing the watermark column. event_timestamp is a placeholder — replace with whichever monotonically-increasing column the table has. Common names: event_timestamp, created_at, ingestion_time, _etl_loaded_at. The column must never be back-dated for new rows.
Bootstrap considerations. The first run exports everything from BOOTSTRAP_START through "now." For high-volume historical data, consider:
- Setting
BOOTSTRAP_STARTconservatively (for example, 60 days ago) - Splitting the bootstrap into chunked windows before enabling the schedule
- Starting fresh from "now" if backfill is not required
Output layout (hourly schedule):
s3://<your-auxia-export-bucket>/<your_table>_incremental/
dt=YYYY-MM-DD/hr=HH/*.parquet
If you schedule daily rather than hourly, you can drop the hr partition by removing .withColumn('hr', …) and the 'hr' argument to partitionBy(...).
No CDF is required for this pattern — append-only tables don't need a change log.
7.2 Pattern B — CDC Incremental
Use for: mutable tables where rows can be updated or deleted after creation and you need those changes reflected in Auxia. Requires CDF enabled on the source table (see Step 6).
Examples: user profiles (when mutable), subscription status, account settings, anything stateful that changes over time.
The structure mirrors Pattern A — a scheduled notebook reads state from _export_state, exports a window of new data, and updates state — but the window is defined by Delta commit versions instead of an event-time watermark:
- Track
last_exported_version(BIGINT) in the same_export_statetable. - Read the source table's latest version via
DESCRIBE HISTORY <table>and select_commit_version. - Read the change feed for the new range with
table_changes(<table>, last_exported_version + 1, current_version). - Add a partition column derived from
_commit_timestamp(for example_commit_date), then write Parquet withmode('append')andpartitionBy('_commit_date'). - On success,
MERGE INTO _export_stateto recordlast_exported_version = current_version.
The output includes the _change_type column (insert / update_preimage / update_postimage / delete) along with _commit_version and _commit_timestamp, and is partitioned by commit date.
Output layout:
s3://<your-auxia-export-bucket>/<your_table>_cdc/
_commit_date=YYYY-MM-DD/*.parquet
Bootstrap. CDF only captures changes from the version at which it was enabled — there is no "historical" CDF for rows committed before that. If you need to seed Auxia with the existing table state, run Pattern C once as a one-time full snapshot, then start the CDF pipeline from the version at which CDF was enabled.
7.3 Pattern C — Full Snapshot
Use for: small dimension / lookup tables where it is simpler to send the full state each run than to track changes.
Examples: product catalog, country codes, role definitions, configuration tables.
The notebook reads the source table and writes its full contents as Parquet to the export path on each run. No state tracking is required.
To avoid Auxia reading a partial snapshot during the swap, write each run into a timestamped subdirectory rather than overwriting the same location in place:
s3://<your-auxia-export-bucket>/<your_dim_table>/snapshot_ts=YYYY-MM-DDTHH-MM-SSZ/*.parquet
Auxia reads the latest snapshot_ts=… directory on each run. Old directories can be vacuumed on whatever retention you prefer (typically a handful of recent snapshots are kept for rollback).
8. Schedule the Export
Save the notebook and schedule it:
- Open the notebook and click Schedule in the top right.
- Set the cadence:
- Append-only event tables (Pattern A): hourly or daily, matching how often new data arrives.
- CDC tables (Pattern B): hourly or daily, depending on freshness requirements.
- Dimension / lookup tables (Pattern C): daily or weekly.
- Choose a compute profile (a small all-purpose or serverless compute is typically sufficient).
- Configure notifications for failures (email or Slack).
- Save the schedule.
See the Databricks scheduled notebooks documentation for full options including retries, timeouts, and concurrent-run limits.
9. Notify Auxia
Once your exports are running and files are landing in the bucket, contact your Auxia solutions engineer with:
- Bucket URL (e.g.,
s3://<your-auxia-export-bucket>/) - AWS region of the bucket
- List of tables being exported, with the pattern for each (append-only, CDC, or snapshot)
- Schema for each table
- Run cadence for each table
Auxia will configure the connection on our side.
10. Validation
To confirm files are landing in the bucket:
aws s3 ls s3://<your-auxia-export-bucket>/ --recursive --human-readable --summarize
You should see:
dt=YYYY-MM-DD/hr=HH/partition directories for append-only tables (Pattern A)_commit_date=YYYY-MM-DD/partition directories for CDC tables (Pattern B)- Parquet files at the table root for snapshot tables (Pattern C)
11. FAQ
What S3 region should my bucket be in? Confirm with your Auxia solutions engineer before creating the bucket. Auxia maintains S3 read infrastructure in specific regions for performance and cost reasons.
What if my source tables already have Change Data Feed enabled? No need to re-enable. Skip Step 6 for those tables.
Can I export only a subset of columns?
Yes. Modify the SELECT (or spark.table(...).select(...)) projection in your export notebook to include only the columns you want to share. Auxia reads whatever schema you write.
What happens if a scheduled run fails? Databricks retries the scheduled run per your configuration and notifies the recipients you set up. The next successful run picks up from the last successfully recorded watermark (Pattern A) or commit version (Pattern B), so no data is lost. For Pattern C, the next run simply overwrites the snapshot.
How do I add a new table to the integration?
- Decide which pattern fits the table — append-only, CDC, or snapshot (see Section 2).
- (Pattern B only) Enable CDF on the source table (Step 6).
- Add a new export notebook for the table (Step 7).
- Schedule it (Step 8).
- Notify your Auxia solutions engineer with the new table details (Step 9).
How do I choose the right pattern for a table?
| Question | Pattern |
|---|---|
| Table only gets new rows — never updated or deleted? | Pattern A (append-only). Use a watermark column. |
| Rows get updated or deleted, and those changes need to reach Auxia? | Pattern B (CDC). Enable CDF first. |
| Small reference data, slowly changing, simpler to send the whole table each run? | Pattern C (snapshot). Full overwrite. |
When in doubt for high-volume event data, default to Pattern A — it is the cheapest and simplest.
What is the cost on my side?
- S3 storage: standard rates for the exported Parquet files (typically small relative to the source data).
- Databricks compute: depends on cluster / SQL warehouse size and run frequency.
- No cross-cloud egress on your side — Auxia handles AWS → GCP transfer internally.
Need Help?
Contact support@auxia.io or your Auxia solutions engineer.