# Quick Start Guide
Complete guide for deploying, configuring, and operating DQX Data Quality Manager.
## Prerequisites

### Required Tools
| Tool | Installation |
|---|---|
| Databricks CLI v0.200+ | See the Install Guide (the legacy `pip install databricks-cli` package does not provide v0.200+) |
| Python 3.13+ | For local development |
### Databricks Workspace Requirements
| Requirement | Description |
|---|---|
| Unity Catalog | Must be enabled |
| SQL Warehouse | Any warehouse (Serverless recommended) |
| Databricks Apps | Must be enabled on workspace |
| Model Serving (optional) | For AI analysis feature |
| Lakebase (optional) | For saving rules with versioning |
### User Permissions
| Resource | Permission |
|---|---|
| Unity Catalog | USE CATALOG, USE SCHEMA, SELECT on target tables |
| SQL Warehouse | CAN USE |
| Jobs | CAN MANAGE RUN (granted automatically via DAB) |
| Lakebase | OAuth access (if using save feature) |
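These Unity Catalog grants can be applied with standard SQL; the catalog, schema, table, and group names below are placeholders:

```sql
-- Replace main.sales.orders and the group with your own objects and principal
GRANT USE CATALOG ON CATALOG main TO `data-quality-users`;
GRANT USE SCHEMA ON SCHEMA main.sales TO `data-quality-users`;
GRANT SELECT ON TABLE main.sales.orders TO `data-quality-users`;
```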
## Quick Start
```bash
# 1. Clone and configure
git clone https://github.com/dediggibyte/databricks_dqx_agent.git
cd databricks_dqx_agent
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"

# 2. Deploy the bundle (notebooks are deployed automatically)
databricks bundle validate -t dev
databricks bundle deploy -t dev

# 3. Access your app
open https://your-workspace.cloud.databricks.com/apps/dqx-rule-generator-dev
```
That's it!
DAB automatically deploys notebooks to ${workspace.root_path}/notebooks/ and configures job permissions. No manual notebook upload required.
## Step-by-Step Deployment
### Step 1: Set Workspace
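Point the CLI at your workspace (the same value used in the Quick Start):

```bash
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
```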
### Step 2: Validate Bundle
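Validate the bundle against the dev target:

```bash
databricks bundle validate -t dev
```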
This checks:

- YAML syntax
- Variable resolution
- Resource definitions
### Step 3: Deploy
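Deploy to the dev target:

```bash
databricks bundle deploy -t dev
```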
This creates:
- Databricks App (dqx-rule-generator-dev)
- Generation Job (DQ Rule Generation - Dev)
- Validation Job (DQ Rule Validation - Dev)
- Workspace notebooks (synced to ${workspace.root_path}/notebooks/)
### Step 4: Access Application
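Open the deployed app in your browser (the URL matches the Quick Start; `open` is macOS-specific, so paste the URL into a browser on other platforms):

```bash
open https://your-workspace.cloud.databricks.com/apps/dqx-rule-generator-dev
```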
## Configuration

### Required Configuration
After deployment, verify the required settings are present in src/app.yaml.
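The full reference is in the Configuration doc; as a rough sketch, the Databricks Apps env section wires up the job IDs and warehouse the app needs (key names mirror the Local Development variables below; values are placeholders):

```yaml
# Illustrative only: confirm the actual keys in src/app.yaml after deployment
env:
  - name: DQ_GENERATION_JOB_ID
    value: "<generation-job-id>"
  - name: DQ_VALIDATION_JOB_ID
    value: "<validation-job-id>"
  - name: SQL_WAREHOUSE_ID
    value: "<sql-warehouse-id>"
```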
### Optional Configuration
| Feature | Configuration |
|---|---|
| Lakebase | Set LAKEBASE_HOST for rule storage |
| AI Analysis | Set MODEL_SERVING_ENDPOINT (default: databricks-claude-sonnet-4-5) |
See Configuration for full details.
## Multi-Environment Deployment
| Target | Command | App Name |
|---|---|---|
| Development | databricks bundle deploy -t dev | dqx-rule-generator-dev |
| Staging | databricks bundle deploy -t stage | dqx-rule-generator-stage |
| Production | databricks bundle deploy -t prod | dqx-rule-generator |
## Using the Application

### Workflow
```mermaid
flowchart LR
    A[1. Select Table] --> B[2. Generate Rules]
    B --> C[3. Review & Edit]
    C --> D[4. Validate]
    D --> E[5. Save]
```
### Step 1: Select Table
- Select Catalog from dropdown
- Select Schema
- Select Table
- Preview sample data
### Step 2: Generate Rules
- Enter natural language prompt describing your requirements
- Click "Generate Rules"
- Wait for job completion (typically 1-3 minutes)
Example Prompts:
"Ensure customer_id is not null and unique"
"Validate email format and check age is between 0 and 120"
"Check order_date is not in the future and amount is positive"
Step 3: Review & Edit¶
- Review generated rules in JSON format
- Edit rules as needed
- Use "Format JSON" to clean up formatting
### Step 4: Validate (Optional)
- Click "Validate Rules" to test against actual data
- Review pass/fail statistics
- Fix rules with high violation counts
### Step 5: Save to Lakebase
- Click "Analyze with AI" for rule coverage insights
- Click "Confirm & Save" to store rules
- Rules are versioned automatically
## CI/CD Pipeline

### Automated Deployment
| Environment | Trigger | Workflow |
|---|---|---|
| dev | Push to main, PR | .github/workflows/ci-cd-dev.yml |
| stage | Manual | .github/workflows/ci-cd-stage.yml |
| prod | Manual | .github/workflows/ci-cd-prod.yml |
### GitHub Secrets Required
Configure per environment in GitHub Settings → Secrets:
| Secret | Description |
|---|---|
| DATABRICKS_HOST | Workspace URL |
| DATABRICKS_CLIENT_ID | Service Principal Client ID |
| SQL_WAREHOUSE_ID | SQL Warehouse ID |
| LAKEBASE_HOST | (Optional) Lakebase host |
### GitHub OIDC Setup
- Create a Service Principal in the Databricks Account Console
- Create a Federation Policy for GitHub Actions OIDC (see the illustrative example below)
- Grant workspace access to the Service Principal
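The policy below is illustrative only; consult the CI/CD Pipeline doc for the exact format your account expects. A GitHub OIDC federation policy generally pins GitHub's token issuer and names your repository as the subject (the org, repo, and branch values are placeholders):

```json
{
  "issuer": "https://token.actions.githubusercontent.com",
  "audiences": ["https://github.com/<your-org>"],
  "subject": "repo:<your-org>/databricks_dqx_agent:ref:refs/heads/main"
}
```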
See CI/CD Pipeline for detailed setup.
## Troubleshooting

### Common Issues
| Issue | Cause | Solution |
|---|---|---|
| "No catalogs available" | Missing permissions or warehouse down | Check USE CATALOG permission, verify SQL Warehouse is running |
| "Job failed: Unable to access notebook" | Notebook path issue | Ensure using ${workspace.root_path}/notebooks/... in variables.yml |
| "Lakebase connection failed" | Wrong host or not authenticated | Verify LAKEBASE_HOST, ensure user is logged in via Databricks Apps |
| "AI Analysis unavailable" | Endpoint not configured | Verify MODEL_SERVING_ENDPOINT exists and user has access |
| Bundle validation fails | Missing DATABRICKS_HOST | Run export DATABRICKS_HOST="..." |
### Check Logs
App Logs:
```bash
# Via CLI
databricks apps logs dqx-rule-generator-dev

# Or via Console:
# Databricks Console → Compute → Apps → Select app → Logs
```
Job Logs:
```bash
# List recent runs
databricks jobs list-runs --job-id <JOB_ID> --limit 5

# Get run output
databricks jobs get-run-output --run-id <RUN_ID>
```
### Health Check
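The health endpoint is the same one listed in the Quick Reference; substitute your deployed app URL:

```bash
curl https://<your-app-url>/health
```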
Expected response:
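The exact payload is application-specific; a healthy instance is expected to return a small JSON status document along these lines (illustrative only):

```json
{ "status": "healthy" }
```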
### Debug Endpoint
## Local Development
For local development without Databricks Apps:
```bash
# Required environment variables
export DATABRICKS_HOST="https://your-workspace.cloud.databricks.com"
export DATABRICKS_TOKEN="your-personal-access-token"
export DQ_GENERATION_JOB_ID="your-generation-job-id"
export DQ_VALIDATION_JOB_ID="your-validation-job-id"
export SQL_WAREHOUSE_ID="your-warehouse-id"

# Optional
export LAKEBASE_HOST="your-lakebase-host"

# Run the app
cd src
pip install -r requirements.txt
python wsgi.py
```
Access at: http://localhost:8000
## Maintenance

### Update Application
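Updating is a redeploy: pull the latest code and run the same bundle deploy used during setup, and DAB updates the app, jobs, and notebooks in place.

```bash
git pull
databricks bundle deploy -t dev
```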
### View Saved Rules
```sql
-- Query Lakebase directly
SELECT table_name, version, created_at, is_active
FROM dq_rules_events
WHERE table_name = 'catalog.schema.table'
ORDER BY version DESC;
```
### Destroy Deployment

**Destructive Operation:** This removes the app, jobs, and all deployed resources. Saved rules in Lakebase are preserved.
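The command matches the Quick Reference entry below:

```bash
databricks bundle destroy -t dev
```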
## Quick Reference
| Action | Command |
|---|---|
| Set workspace | export DATABRICKS_HOST="https://..." |
| Validate | databricks bundle validate -t dev |
| Deploy | databricks bundle deploy -t dev |
| List jobs | databricks jobs list |
| Check health | curl <app-url>/health |
| View app logs | databricks apps logs <app-name> |
| Destroy | databricks bundle destroy -t dev |
## Related Documentation
- Configuration - Detailed config reference
- Authentication - OBO and security details
- Architecture - System design
- API Reference - REST endpoints
- CI/CD - GitHub Actions setup
- DQX Checks - Available check functions