Architecture¶

This document describes the system architecture and design of DQX Data Quality Manager.

System Overview¶

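DQX Data Quality Manager is a multi-tier application built on the Databricks platform. It combines a Flask web application, a SQL warehouse for queries, serverless jobs that run the DQX notebooks, and Lakebase (PostgreSQL) for rule storage.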

Application Flow¶

Rule Generation Flow¶

Rule Validation Flow¶

Component Architecture¶

Flask Application (`src/app/`)¶

The web application handles UI rendering and REST API endpoints:

src/app/
├── __init__.py          # Flask app factory
├── config.py            # Configuration management
├── routes/
│   ├── catalog.py       # Unity Catalog browsing endpoints
│   ├── rules.py         # DQ rule generation/validation endpoints
│   └── lakebase.py      # Lakebase connection endpoints
└── services/
    ├── databricks.py    # Databricks SDK wrapper (SQL + Jobs)
    ├── ai.py            # AI analysis service
    └── lakebase.py      # PostgreSQL operations

Services Layer¶

| Service | Responsibility | Authentication |
|---------|----------------|----------------|
| `DatabricksService` | SQL queries, job triggering | OBO for SQL, SP for jobs |
| `AIAnalysisService` | AI-powered rule analysis | OBO via Statement Execution |
| `LakebaseService` | Rule storage and versioning | User OAuth |

Notebooks (`notebooks/`)¶

Long-running compute tasks run as serverless jobs, one notebook per job:

| Notebook | Purpose | Key Libraries |
|----------|---------|---------------|
| `generate_dq_rules_fast.py` | Profile data and generate DQ rules | DQX Profiler, DQX Generator |
| `validate_dq_rules.py` | Apply rules and return results | DQX Engine |

Project Structure¶

databricks_dqx_agent/
├── databricks.yml                    # DAB bundle configuration (main)
├── README.md                         # Quick start guide
│
├── src/                              # App source code (deployed to Databricks Apps)
│   ├── app.yaml                      # Databricks App runtime configuration
│   ├── wsgi.py                       # WSGI entry point (gunicorn)
│   ├── requirements.txt              # Python dependencies
│   │
│   ├── app/                          # Flask application package
│   │   ├── __init__.py               # Flask app factory
│   │   ├── config.py                 # Configuration management
│   │   ├── routes/                   # API endpoints
│   │   │   ├── catalog.py            # Unity Catalog routes
│   │   │   ├── rules.py              # DQ Rules routes
│   │   │   └── lakebase.py           # Lakebase routes
│   │   └── services/                 # Business logic
│   │       ├── databricks.py         # Databricks SDK service
│   │       ├── lakebase.py           # Lakebase service
│   │       └── ai.py                 # AI analysis service
│   │
│   ├── templates/                    # HTML templates
│   │   ├── base.html                 # Base template with navigation
│   │   ├── generator.html            # DQ rule generator page
│   │   └── validator.html            # DQ rule validator page
│   │
│   └── static/                       # Static assets
│       ├── css/main.css              # Styles
│       └── js/                       # JavaScript files
│           ├── common.js             # Shared utilities
│           ├── generator.js          # Generator page logic
│           └── validator.js          # Validator page logic
│
├── notebooks/                        # Databricks notebooks
│   ├── generate_dq_rules_fast.py     # DQ rule generation notebook
│   └── validate_dq_rules.py          # DQ rule validation notebook
│
├── resources/                        # DAB resource definitions
│   ├── apps.yml                      # App definition + permissions
│   ├── generation_job.yml            # Generation job (Serverless)
│   └── validation_job.yml            # Validation job (Serverless)
│
├── environments/                     # Per-environment configurations
│   ├── dev/
│   │   ├── targets.yml               # Dev target config
│   │   ├── variables.yml             # Dev variables
│   │   └── permissions.yml           # Dev permissions
│   ├── stage/
│   │   └── ...
│   └── prod/
│       └── ...
│
├── .github/                          # CI/CD workflows
│   ├── workflows/
│   │   ├── ci-cd-dev.yml             # Dev pipeline
│   │   ├── ci-cd-stage.yml           # Stage pipeline
│   │   ├── ci-cd-prod.yml            # Prod pipeline
│   │   └── docs.yml                  # Documentation deployment
│   └── actions/
│       ├── databricks-setup/         # GitHub OIDC setup
│       └── deploy-dab/               # Bundle deployment
│
└── docs/                             # Documentation (MkDocs)
    ├── index.md                      # Home page
    ├── runbook.md                    # Deployment guide
    ├── authentication.md             # Auth documentation
    ├── architecture.md               # This file
    ├── configuration.md              # Config reference
    ├── api-reference.md              # API endpoints
    ├── dqx-checks.md                 # DQX check functions
    └── ci-cd.md                      # CI/CD pipeline

Authentication Architecture¶

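DQX Data Quality Manager uses a dual authentication model for Databricks operations (user token and service principal), plus a separate OAuth path for Lakebase: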

User Token (OBO) Path¶

Used for operations that should respect user permissions:

Operations:

SHOW CATALOGS/SCHEMAS/TABLES
SELECT * FROM table
SELECT ai_query(...)
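
As a minimal sketch of the OBO path, a request handler might read the user token from the incoming request headers before running any of the statements above. The sketch assumes the Databricks Apps proxy forwards the token in an `X-Forwarded-Access-Token` header; the helper `get_user_token` and the exception type are hypothetical names, not taken from the project.

```python
from typing import Mapping


class MissingUserTokenError(Exception):
    """Raised when a request carries no forwarded user token."""


# Assumption: the Databricks Apps proxy forwards the caller's token in this header.
USER_TOKEN_HEADER = "X-Forwarded-Access-Token"


def get_user_token(headers: Mapping[str, str]) -> str:
    """Return the forwarded user token, matching the header name case-insensitively."""
    for name, value in headers.items():
        if name.lower() == USER_TOKEN_HEADER.lower() and value.strip():
            return value.strip()
    raise MissingUserTokenError(f"{USER_TOKEN_HEADER} header not present")
```

Failing loudly when the header is missing keeps the handler from silently falling back to the service principal, which would bypass the user's Unity Catalog permissions.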

Service Principal Path¶

Used for operations that the app's user API scopes do not cover, such as the Jobs API:

Operations:

jobs.run_now()
jobs.get_run()
jobs.get_run_output()
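
The calls above suggest a trigger-then-poll pattern. The following is a rough sketch of the polling half only. It assumes a hypothetical `get_state` callable that wraps `jobs.get_run()` with service principal credentials and returns the run's life-cycle state as a string; the function name, intervals, and timeout are illustrative.

```python
import time
from typing import Callable

# Life-cycle states after which a Databricks job run will not change again.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_for_run(
    get_state: Callable[[int], str],
    run_id: int,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state(run_id) until a terminal state or the timeout is reached."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state(run_id)
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still {state} after {timeout_seconds}s")
        sleep(poll_seconds)
```

Passing `get_state` and `sleep` as parameters keeps the loop testable without a workspace. A web handler would more likely return the run ID to the browser and let the page poll, rather than block a worker for the whole run.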

OAuth Path (Lakebase)¶

Used for PostgreSQL connections to Lakebase, which authenticate with the user's OAuth token.

For detailed authentication documentation, see Authentication.

Data Flow¶

Rule Storage Schema¶

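Rules are stored in Lakebase with versioning. Each row holds one version of the rule set for a table, and the `(table_name, version)` pair is unique: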

CREATE TABLE dq_rules_events (
    id UUID PRIMARY KEY,
    table_name VARCHAR(500) NOT NULL,
    version INTEGER NOT NULL,
    rules JSONB NOT NULL,
    user_prompt TEXT,
    ai_summary JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    created_by VARCHAR(255),
    is_active BOOLEAN DEFAULT TRUE,
    metadata JSONB,
    UNIQUE(table_name, version)
);
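
To illustrate how the `UNIQUE(table_name, version)` constraint supports versioning, here is a small sketch that saves each rule set as the next version for its table. It uses the standard-library `sqlite3` module as a local stand-in for Lakebase, with a simplified copy of the table; the `save_rules` helper and the policy of keeping only the newest version active are assumptions, not behavior confirmed by the project.

```python
import json
import sqlite3
import uuid

# Simplified copy of dq_rules_events; SQLite stores the JSONB columns as text.
SCHEMA = """
CREATE TABLE dq_rules_events (
    id TEXT PRIMARY KEY,
    table_name TEXT NOT NULL,
    version INTEGER NOT NULL,
    rules TEXT NOT NULL,
    created_by TEXT,
    is_active INTEGER DEFAULT 1,
    UNIQUE(table_name, version)
)
"""


def save_rules(conn, table_name, rules, created_by):
    """Insert the rules as the next version for table_name and return that version."""
    with conn:  # one transaction: commit on success, roll back on error
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM dq_rules_events WHERE table_name = ?",
            (table_name,),
        ).fetchone()
        # Assumption: only the newest version of a table's rules stays active.
        conn.execute(
            "UPDATE dq_rules_events SET is_active = 0 WHERE table_name = ?",
            (table_name,),
        )
        conn.execute(
            "INSERT INTO dq_rules_events (id, table_name, version, rules, created_by) "
            "VALUES (?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), table_name, latest + 1, json.dumps(rules), created_by),
        )
    return latest + 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first = save_rules(conn, "main.sales.customers", [{"name": "customer_id_not_null"}], "alice")
second = save_rules(conn, "main.sales.customers", [], "bob")
active = conn.execute(
    "SELECT version FROM dq_rules_events WHERE is_active = 1"
).fetchall()
```

If two writers race for the same version number, the unique constraint rejects the second insert, so the caller can retry instead of overwriting a version.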

Rule JSON Format¶

{
  "criticality": "error",
  "check": {
    "function": "is_not_null",
    "arguments": {
      "col_name": "customer_id"
    }
  },
  "name": "customer_id_not_null"
}
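
A structural check on this format could look like the sketch below. It only inspects the shape shown above (`criticality`, `check.function`, `check.arguments`); the `validate_rule` helper is hypothetical, and treating a missing `criticality` as `"error"` is an assumption about the default.

```python
import json
from typing import List

# DQX criticality levels; a rule that omits the field is treated as "error" here.
ALLOWED_CRITICALITY = {"error", "warn"}


def validate_rule(rule: dict) -> List[str]:
    """Return a list of structural problems; an empty list means the rule looks well-formed."""
    problems = []
    if rule.get("criticality", "error") not in ALLOWED_CRITICALITY:
        problems.append("criticality must be 'error' or 'warn'")
    check = rule.get("check")
    if not isinstance(check, dict):
        problems.append("check must be an object")
        return problems
    if not isinstance(check.get("function"), str) or not check.get("function"):
        problems.append("check.function must be a non-empty string")
    if not isinstance(check.get("arguments", {}), dict):
        problems.append("check.arguments must be an object")
    return problems


rule = json.loads("""
{
  "criticality": "error",
  "check": {"function": "is_not_null", "arguments": {"col_name": "customer_id"}},
  "name": "customer_id_not_null"
}
""")
problems = validate_rule(rule)
```

This does not verify that the named function exists in DQX or that its arguments are valid; that check belongs to the DQX engine itself.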

Deployment Architecture¶

Databricks Asset Bundles (DAB)¶

The application is deployed using DAB with modular configuration:

# databricks.yml
bundle:
  name: dqx-rule-generator

include:
  - ./resources/*.yml           # App + Job definitions
  - ./environments/dev/*.yml    # Dev config
  - ./environments/stage/*.yml  # Stage config
  - ./environments/prod/*.yml   # Prod config

Resource Bindings¶

# resources/apps.yml
resources:
  apps:
    dqx_app:
      user_api_scopes:
        - sql                   # Enable OBO for SQL

      resources:
        - name: "sql-warehouse"
          sql_warehouse:
            id: ${var.sql_warehouse_id}
            permission: "CAN_USE"

        - name: "generation-job"
          job:
            id: ${resources.jobs.dq_rule_generation.id}
            permission: "CAN_MANAGE_RUN"

Environment Isolation¶

| Environment | App Name | Workspace Path |
|-------------|----------|----------------|
| Development | `dqx-rule-generator-dev` | `/Users/<user>/.bundle/.../dev` |
| Staging | `dqx-rule-generator-stage` | `/Users/<user>/.bundle/.../stage` |
| Production | `dqx-rule-generator` | `/Users/<user>/.bundle/.../prod` |

Security Architecture¶

Defense in Depth¶

Network Layer: All traffic over HTTPS/TLS
Authentication: OAuth tokens with limited lifetime
Authorization: User permissions enforced via OBO
Data Access: Unity Catalog access controls
Audit: All operations logged with user identity

Token Flow¶

No Stored Credentials¶

No passwords in configuration files
OAuth tokens from request headers only
Service principal via managed identity
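
One way to keep credentials out of configuration files is to read them from the process environment at startup. The sketch below assumes the runtime injects `DATABRICKS_HOST`, `DATABRICKS_CLIENT_ID`, and `DATABRICKS_CLIENT_SECRET` for the app's service principal; the `RuntimeSettings` class and `load_settings` helper are hypothetical and may not match the project's `config.py`.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass(frozen=True)
class RuntimeSettings:
    host: str
    client_id: str
    # Excluded from repr() so the secret does not leak into logs or tracebacks.
    client_secret: str = field(repr=False)


def load_settings(env: Optional[Mapping[str, str]] = None) -> RuntimeSettings:
    """Build settings from environment variables; fail fast if one is missing."""
    env = os.environ if env is None else env
    # Assumption: the runtime injects these variables for the app's service principal.
    required = ("DATABRICKS_HOST", "DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return RuntimeSettings(
        host=env["DATABRICKS_HOST"],
        client_id=env["DATABRICKS_CLIENT_ID"],
        client_secret=env["DATABRICKS_CLIENT_SECRET"],
    )
```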

Scalability Considerations¶

| Component | Scaling Strategy |
|-----------|------------------|
| Flask App | Horizontal (multiple workers via Gunicorn) |
| SQL Warehouse | Serverless auto-scaling |
| Jobs | Serverless compute (auto-provisioned) |
| Lakebase | Managed PostgreSQL scaling |
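
For the Flask row, worker count is usually set in a Gunicorn configuration file, which is itself a Python module. The sketch below is a hypothetical `gunicorn.conf.py`, not a file from the project tree; the `DATABRICKS_APP_PORT` variable, the fallback port, and the timeout value are assumptions.

```python
# gunicorn.conf.py (hypothetical file; not part of the project tree above)
import multiprocessing
import os

# Common starting point from the Gunicorn docs: (2 x cores) + 1 worker processes.
workers = int(os.environ.get("WEB_CONCURRENCY", multiprocessing.cpu_count() * 2 + 1))

# Assumption: the platform provides the listening port in DATABRICKS_APP_PORT;
# 8000 is an assumed fallback for local runs.
bind = "0.0.0.0:" + os.environ.get("DATABRICKS_APP_PORT", "8000")

# Seconds a worker may stay silent before Gunicorn restarts it.
timeout = 120
```

Assuming `wsgi.py` exposes the Flask instance as `app`, this would be started with `gunicorn -c gunicorn.conf.py wsgi:app`. Because long-running work is delegated to serverless jobs, the workers mostly handle short requests, so a modest worker count is likely sufficient.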

Configuration - Environment variables and settings
Authentication - Detailed auth documentation
API Reference - REST API endpoints
CI/CD Pipeline - Deployment automation