Data Connectors - Delta Forge

Visual Flattener

Turn any nested format into a SQL table — visually configure, automatically flatten

Delta Forge includes a visual schema discovery and configuration tool that transforms complex, nested data formats into flat, queryable SQL tables. One unified experience across six formats: JSON, XML, EDI, HL7, FHIR, and Protobuf.

How It Works

1. Discover

Scan files and automatically discover all nested paths, types, and sample values

2. Configure

Use an interactive tree view to select which fields to include, exclude, explode, or keep as JSON

3. Query

Flattened data appears as a standard SQL table. Missing paths become NULL. Configuration persists across queries.

Five Selection Modes Per Field

INCLUDE

Whitelist specific paths into the output

EXCLUDE

Remove entirely from output

EXPLODE

Create one row per array element (like SQL UNNEST)

JSON

Keep subtree as a JSON string column instead of flattening

Default

Automatic flattening behavior — all paths included with standard naming
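
The five modes can be pictured with a small Python sketch. It is an illustration of the semantics described above, not Delta Forge's implementation: the `flatten` function, the dotted-path keys in `modes`, and the sample record are all hypothetical, and column names are assumed to join path segments with underscores.

```python
import json
from typing import Any, Optional


def flatten(record: dict, modes: Optional[dict] = None) -> list:
    """Flatten one nested record into a list of row dicts.

    `modes` maps a dotted path (for example "user.contact") to a mode name.
    """
    modes = modes or {}
    whitelist = [path for path, mode in modes.items() if mode == "INCLUDE"]
    rows = [{}]

    def allowed(dotted: str) -> bool:
        # With no INCLUDE entries, every path is allowed (default behaviour).
        if not whitelist:
            return True
        return any(dotted == w or dotted.startswith(w + ".") for w in whitelist)

    def walk(value: Any, path: tuple) -> None:
        nonlocal rows
        dotted, column = ".".join(path), "_".join(path)
        mode = modes.get(dotted, "DEFAULT")
        if mode == "EXCLUDE":
            return  # drop the whole subtree
        if mode == "JSON":
            for row in rows:
                row[column] = json.dumps(value)  # keep subtree as a JSON string
        elif mode == "EXPLODE" and isinstance(value, list):
            # One output row per array element, similar to SQL UNNEST.
            rows = [{**row, column: item} for row in rows for item in value]
        elif isinstance(value, dict):
            for key, child in value.items():
                walk(child, path + (key,))
        elif allowed(dotted):
            for row in rows:
                row[column] = value

    walk(record, ())
    return rows


record = {
    "user": {"name": "Alice", "contact": {"email": "alice@example.com"}},
    "tags": ["vip", "active"],
    "metadata": {"source": "api"},
}
rows = flatten(record, {"tags": "EXPLODE", "metadata": "JSON"})
```

Here `rows` holds two rows, one per tag, each carrying `user_name`, `user_contact_email`, and `metadata` as a JSON string.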

Six Formats, One Visual Experience

JSON

JSONPath, SIMD-accelerated parsing

XML

XPath-like expressions, attribute handling, namespace support

EDI

Segment-based flattening, composite elements

HL7

Component flattening, friendly aliases

FHIR

Resource type discovery, bundle unbundling

Protobuf

Enum decoding, repeated field handling

Schema Evolution Built In

Automatic Structure Merging

Files with different structures handled automatically
Path aliases map multiple source paths to one output column
Missing paths become NULL — consistent schema across all files

-- Input (nested JSON):
-- {
--   "user": { "name": "Alice", "contact": { "email": "alice@example.com" } },
--   "tags": ["vip", "active"],
--   "metadata": { "source": "api", "raw": {...} }
-- }

-- Output (flattened SQL table):
-- user_name | user_contact_email  | tags             | metadata
-- Alice     | alice@example.com   | ["vip","active"] | (kept as JSON)

SELECT user_name, user_contact_email, tags, metadata
FROM customers;  -- flattened table, ready to query
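
The merging behaviour listed under Automatic Structure Merging can be sketched in Python. This is a minimal illustration under assumptions: `merge_rows`, the `aliases` mapping, and the sample rows are hypothetical names, not part of Delta Forge.

```python
from typing import Optional


def merge_rows(rows: list, aliases: Optional[dict] = None) -> tuple:
    """Union the columns of already-flattened rows into one consistent table.

    `aliases` maps a source column name to the output column it should feed.
    """
    aliases = aliases or {}
    renamed = [{aliases.get(k, k): v for k, v in row.items()} for row in rows]

    # Column order follows first appearance across the input rows.
    columns = []
    for row in renamed:
        for name in row:
            if name not in columns:
                columns.append(name)

    # dict.get returns None for a missing column, which stands in for SQL NULL.
    table = [[row.get(name) for name in columns] for row in renamed]
    return columns, table


file_a = {"user_name": "Alice", "user_contact_email": "alice@example.com"}
file_b = {"user_name": "Bob", "user_email": "bob@example.com", "tags": "vip"}
columns, table = merge_rows(
    [file_a, file_b], aliases={"user_email": "user_contact_email"}
)
```

Both source paths land in `user_contact_email`, and the first row gets `None` for `tags` because that path does not exist in its file.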

One visual experience

All six formats share the same tree view, so no format-specific tooling is needed

Persistent configuration

Configuration persists to the table — query results are always consistent

SIMD-accelerated

500 MB/s+ throughput for JSON processing

No code required

Point, click, query — flatten nested data without writing any transformation code

Database Connectors

Connect to relational and NoSQL databases with predicate pushdown. All connection credentials are stored securely in the OS keychain or Azure Key Vault, never in config files.

PostgreSQL

Full-featured connectivity with SSL, connection pooling, and predicate pushdown

RDBMS SSL

MySQL / MariaDB

MySQL 5.7+ and MariaDB support with binary protocol

RDBMS Replication

SQL Server

Microsoft SQL Server with Windows and Azure AD authentication

Enterprise Azure AD

Oracle Database

Oracle 12c+ with TNS and Easy Connect naming

Enterprise RAC

MongoDB

Document database with aggregation pipeline pushdown

NoSQL Document

Redis

Key-value store with cluster and sentinel support

Cache Cluster

Cloud Object Storage

Native integration with all major cloud providers

Amazon Web Services

Amazon S3 (all storage classes)
S3 Express One Zone
AWS Glue Catalog integration
IAM roles & STS credentials
Cross-account access

Microsoft Azure

Azure Blob Storage
Data Lake Storage Gen2
Azure Active Directory auth
Managed identity support
SAS token authentication

Google Cloud Platform

Google Cloud Storage
BigQuery external tables
Service account auth
Workload identity federation
Multi-regional buckets

File Format Support

Native support for all major data formats with optimized readers

Columnar Formats

Parquet: Column pruning, predicate pushdown

ORC: Hive-compatible, ACID support

Arrow IPC: Zero-copy reads

Avro: Schema evolution

Text, Semi-Structured & Binary

CSV: Auto-dialect detection

JSON: NDJSON, subtree capture

XML: XPath, subtree capture

Excel: Multi-sheet, XLSX/XLS/ODS

Protobuf: Proto3 binary parsing

Protocol Buffers

Query Proto3 binary data with SQL — a capability most engines simply don't have

Schema-Driven Parsing

Read Proto3 binary files directly with a .proto descriptor
Specify the message type to decode from the schema
Glob patterns for multi-file ingestion
Streaming reads for large binary datasets

Nested Messages & Repeated Fields

Nested messages flattened into dot-notation columns
Repeated fields mapped to Arrow list arrays
Map fields decoded as key-value struct arrays
Oneof fields with automatic null-filling

Enum Decoding

Enum values decoded to human-readable string names
Unknown enum values preserved as integer fallbacks
Optional raw integer mode for performance
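
A rough Python sketch of the three enum behaviours above. The `Status` enum and the `decode_enum` helper are hypothetical and exist only to illustrate name decoding, the integer fallback, and the raw mode.

```python
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical enum, standing in for one declared in a .proto file."""
    UNKNOWN = 0
    ACTIVE = 1
    SUSPENDED = 2


def decode_enum(raw: int, enum_type=Status, raw_mode: bool = False):
    """Return the enum name for `raw`, or the integer itself as a fallback."""
    if raw_mode:
        return raw  # skip the lookup entirely
    try:
        return enum_type(raw).name
    except ValueError:
        # Value not declared in this schema version: keep the integer.
        return raw


decoded = [decode_enum(value) for value in (1, 2, 7)]
```

The value 7 is not declared in `Status`, so it passes through unchanged instead of raising an error or becoming NULL.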

Well-Known Types

google.protobuf.Timestamp → Arrow TIMESTAMP
google.protobuf.Duration → Arrow INTERVAL
google.protobuf.StringValue & wrapper types
google.protobuf.Struct as JSON columns
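
For orientation, `google.protobuf.Timestamp` and `google.protobuf.Duration` each carry a `seconds` and a `nanos` field. The Python sketch below shows the conversion such a mapping implies. The helper names are hypothetical, and Python's datetime types keep only microsecond precision, whereas an Arrow column could keep nanoseconds.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamp_to_datetime(seconds: int, nanos: int = 0) -> datetime:
    """Convert Timestamp fields (seconds and nanos since the Unix epoch) to UTC."""
    # Timestamp nanos are always non-negative, so floor division is safe here.
    return EPOCH + timedelta(seconds=seconds, microseconds=nanos // 1000)


def duration_to_timedelta(seconds: int, nanos: int = 0) -> timedelta:
    """Convert Duration fields to a timedelta."""
    # Duration nanos share the sign of seconds, so truncate toward zero.
    return timedelta(seconds=seconds, microseconds=int(nanos / 1000))


recorded_at = timestamp_to_datetime(1710460800, 500_000_000)
elapsed = duration_to_timedelta(90, 500_000_000)
```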

-- Read IoT sensor data from Proto3 binary files
SELECT device_id, temperature, humidity, recorded_at
FROM read_protobuf(
    'sensors/*.pb',
    'sensor.proto',
    'SensorReading'
)
WHERE temperature > 35.0
ORDER BY recorded_at DESC;

Apache ORC

Production-grade ORC reading for Hive data warehouses — battle-tested across 6 industry demos

Hive-Compatible Reading

Read ORC files from Hive-managed and external tables
Full ACID transaction support (insert, update, delete)
Partition pruning with Hive-style directory layouts
Proven across banking, clinical trials, energy, insurance, server logs, and warehouse demos

Complex Types

STRUCT fields mapped to nested Arrow structs
MAP fields as key-value list arrays
ARRAY fields as Arrow list columns
Deeply nested combinations of all three

Stripe-Level Statistics

Min/max statistics per stripe for predicate pushdown
Bloom filters for high-cardinality column filtering
Row-group-level skipping for large files
Column-level statistics for query optimization
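
The idea behind min/max pruning can be shown in a few lines of Python. This is a conceptual sketch, not the reader's actual code: `StripeStats` and `stripes_to_read` are hypothetical names, and the predicate is fixed to `value > lower_bound` for brevity.

```python
from dataclasses import dataclass


@dataclass
class StripeStats:
    """Per-stripe statistics for one column (hypothetical structure)."""
    stripe_id: int
    min_value: float
    max_value: float


def stripes_to_read(stats: list, lower_bound: float) -> list:
    """Return ids of stripes that may contain rows with value > lower_bound.

    A stripe whose maximum is at or below the bound cannot match,
    so it is skipped without being decompressed or decoded.
    """
    return [s.stripe_id for s in stats if s.max_value > lower_bound]


stats = [
    StripeStats(0, 10.0, 30.0),
    StripeStats(1, 28.0, 41.0),
    StripeStats(2, 36.0, 50.0),
]
selected = stripes_to_read(stats, lower_bound=35.0)
```

For a filter such as `temperature > 35.0`, stripe 0 is skipped entirely. Stripe 1 is still read because its range overlaps the predicate, and the rows inside it are filtered as usual.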

Compression Codecs

ZLIB — maximum compression ratio
Snappy — balanced speed and size
LZ4 — fastest decompression
ZSTD — best overall compression
Automatic codec detection per file

Apache Avro

Schema evolution across files with automatic type promotion and null-filling

Schema Evolution

Read files written with different schema versions together
New columns in newer files automatically NULL-filled for older rows
Removed columns gracefully excluded from the merged schema
Type promotion: int → long, float → double

Logical Types

date → Arrow DATE32
timestamp-millis / timestamp-micros → Arrow TIMESTAMP
decimal with precision and scale preserved
uuid, time-millis, time-micros

Compression Codec Mixing

Each Avro file can use a different codec
Snappy, Deflate, ZSTD, Bzip2 detected per-file
Transparent decompression during query execution
No configuration needed — codecs detected automatically

Nested Records

Avro records mapped to Arrow struct columns
Arrays mapped to Arrow list columns
Maps mapped to key-value struct arrays
Unions decoded with type-tag discrimination

JSON & NDJSON

Flexible JSON reading with subtree capture for semi-structured analytics

Subtree Capture with `json_paths`

Preserve nested objects as queryable JSON blob columns
Extract flat fields while keeping complex structures intact
Ideal for semi-structured data with variable nesting
JSON blob columns queryable with json_extract functions

Format Variants

NDJSON (newline-delimited) for streaming workloads
JSON arrays for bulk exports
Mixed-type arrays with automatic type widening
Deeply nested objects with configurable flatten depth

-- Keep nested 'address' as a JSON blob, extract flat fields normally
SELECT name, email, address
FROM read_json('customers/*.json',
    json_paths := '{address}'
);

-- Result: 'address' column contains full JSON objects
-- {"street": "123 Main St", "city": "Denver", "state": "CO", "zip": "80202"}

-- Then query into the captured subtree
SELECT name, json_extract(address, '$.city') AS city
FROM read_json('customers/*.json',
    json_paths := '{address}'
);

XML

Structured XML reading with subtree capture and schema evolution

Subtree Capture

Preserve nested XML elements as string columns
Extract parent-level attributes while keeping child trees intact
XPath-based element selection for targeted reading
Mixed content handling with text and element children

Schema Evolution

Merge schemas across XML files with different structures
New elements in newer files NULL-filled for older rows
Attribute and element unification in the output schema
Namespace-aware parsing for enterprise XML formats

RSS & Feed Parsing

RSS 2.0 and Atom feed ingestion as relational tables
Channel metadata extracted alongside item rows
Enclosure and media elements captured
Date normalization across feed date formats
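
To make the feed-to-table mapping concrete, here is a small Python sketch using only the standard library. The function name and output column names are assumptions, it handles RSS 2.0 only, and it relies on `email.utils.parsedate_to_datetime` because RSS dates follow the RFC 822 style.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def rss_to_rows(xml_text: str) -> list:
    """Turn an RSS 2.0 document into one row per <item>."""
    channel = ET.fromstring(xml_text).find("channel")
    channel_title = channel.findtext("title")
    rows = []
    for item in channel.findall("item"):
        pub_date = item.findtext("pubDate")
        enclosure = item.find("enclosure")
        rows.append({
            # Channel metadata is repeated on every item row.
            "channel_title": channel_title,
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            # Normalize the feed date to ISO 8601; missing dates stay None.
            "published_at": (
                parsedate_to_datetime(pub_date).isoformat() if pub_date else None
            ),
            "enclosure_url": enclosure.get("url") if enclosure is not None else None,
        })
    return rows
```

Items without a `pubDate` or `enclosure` element yield `None` in those columns, mirroring the NULL-filling used elsewhere.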

Excel Workbooks

Multi-sheet reading with intelligent header detection and per-sheet type inference

Multi-Sheet Reading

Read specific sheets by name or index
Read all sheets at once into separate tables
Sheet name available as a metadata column
Support for XLSX, XLS (legacy), and ODS formats

Header Row Detection

Automatic header row identification
Skip leading blank rows and title rows
Configurable header row offset for non-standard layouts
Column name sanitization and deduplication
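
Column name sanitization and deduplication might look roughly like the Python sketch below. The exact rules Delta Forge applies are not documented here, so the choices in this example (lowercase, underscores, `column_N` for blanks, numeric suffixes for repeats) are assumptions.

```python
import re


def sanitize_headers(headers: list) -> list:
    """Turn raw header cells into unique, SQL-friendly column names."""
    seen = {}
    result = []
    for index, raw in enumerate(headers):
        text = str(raw or "").strip().lower()
        # Collapse every run of non-alphanumeric characters into one underscore.
        name = re.sub(r"[^0-9a-z]+", "_", text).strip("_")
        if not name:
            name = f"column_{index + 1}"  # blank header cell
        if name[0].isdigit():
            name = f"_{name}"  # identifiers should not start with a digit
        count = seen.get(name, 0)
        seen[name] = count + 1
        # The first occurrence keeps its name; repeats get _2, _3, and so on.
        result.append(name if count == 0 else f"{name}_{count + 1}")
    return result


names = sanitize_headers(["Order ID", "order id", "Total ($)", "", "2024 Sales"])
```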

Type Inference Per Sheet

Independent type inference for each sheet
Excel date serial numbers decoded to proper dates
Currency and percentage formatting preserved
Formula cells read as computed values
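
Excel stores dates as day counts, which is why decoding is needed. A minimal Python sketch, assuming the workbook uses the 1900 date system (the helper name is hypothetical):

```python
from datetime import date, timedelta

# Using 1899-12-30 as day zero absorbs Excel's fictitious 1900-02-29,
# so serials from 61 (1900-03-01) onward map to the correct calendar date.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial: float) -> date:
    """Convert an Excel 1900-system serial number to a date (time part dropped)."""
    return EXCEL_EPOCH + timedelta(days=int(serial))


order_date = excel_serial_to_date(45366)
```

Serials below 61 would be off by one day with this epoch, and workbooks saved with the 1904 date system need a different base date.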

Streaming Connectors

Real-time data ingestion from event streams

Apache Kafka

High-throughput distributed event streaming with consumer groups

Amazon Kinesis

AWS managed streaming with automatic scaling

Azure Event Hubs

Azure-native event ingestion at scale

Google Pub/Sub

GCP messaging with exactly-once delivery

Intelligent Schema Inference

Automatic type detection across 40+ locales with auto-generated transform views — no manual schema definitions

Culture-Aware Parsing

German dates: DD.MM.YYYY, US dates: MM/DD/YYYY
French decimals: 1 234 567,89
German grouping: 1.234.567,89
Spanish month names: Enero, Febrero, Marzo...
AM/PM designators and negative number formats across locales
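
The separator and date-order rules above can be sketched in Python as a lookup table per locale. This is an illustration only: the `LOCALES` table covers three locales rather than 40+, the French date pattern is an assumption not stated above, and the function names are hypothetical.

```python
from datetime import date, datetime
from decimal import Decimal

# Illustrative subset: grouping separator, decimal separator, date pattern.
LOCALES = {
    "de-DE": {"group": ".", "decimal": ",", "date": "%d.%m.%Y"},
    "en-US": {"group": ",", "decimal": ".", "date": "%m/%d/%Y"},
    "fr-FR": {"group": " ", "decimal": ",", "date": "%d/%m/%Y"},
}


def parse_decimal(text: str, locale: str) -> Decimal:
    """Parse a locale-formatted number into an exact Decimal."""
    rules = LOCALES[locale]
    cleaned = text.strip()
    # Treat no-break and narrow no-break spaces like ordinary spaces.
    for space in ("\u00a0", "\u202f"):
        cleaned = cleaned.replace(space, " ")
    # Drop grouping first, then normalize the decimal separator to a period.
    cleaned = cleaned.replace(rules["group"], "").replace(rules["decimal"], ".")
    return Decimal(cleaned)


def parse_date(text: str, locale: str) -> date:
    """Parse a locale-formatted date string."""
    return datetime.strptime(text.strip(), LOCALES[locale]["date"]).date()


totals = [
    parse_decimal("1.234.567,89", "de-DE"),
    parse_decimal("1,234,567.89", "en-US"),
    parse_decimal("1 234 567,89", "fr-FR"),
]
```

All three inputs resolve to the same value, which is the point of culture-aware parsing: the same column can be typed consistently regardless of the locale that produced the file.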

12 Detected Types

Boolean, SmallInt, Int, BigInt, Decimal, Float
Date, Time, DateTime, UUID, Varchar
Configurable confidence thresholds (default 80%)
Automatic fallback to VARCHAR when confidence is low
SQL cast expression generation for each column
Auto-generated transform views from inferred types
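
Confidence-based inference with a VARCHAR fallback can be sketched as follows. The pattern list is deliberately tiny and the function names are hypothetical. Only the 80% default threshold and the fallback rule come from the list above.

```python
import re

# Checked in order; a small illustrative subset of the detected types.
PATTERNS = [
    ("BOOLEAN", re.compile(r"true|false", re.IGNORECASE)),
    ("INT", re.compile(r"[+-]?\d+")),
    ("DECIMAL", re.compile(r"[+-]?\d+\.\d+")),
    ("DATE", re.compile(r"\d{4}-\d{2}-\d{2}")),
]


def infer_type(samples: list, threshold: float = 0.80) -> str:
    """Pick the first type whose match ratio reaches the threshold."""
    values = [s for s in samples if s not in (None, "")]
    if not values:
        return "VARCHAR"
    for type_name, pattern in PATTERNS:
        matches = sum(1 for v in values if pattern.fullmatch(v))
        if matches / len(values) >= threshold:
            return type_name
    return "VARCHAR"  # confidence too low for every candidate


def cast_expression(column: str, samples: list) -> str:
    """Build the SQL cast for a column from its inferred type."""
    return f"CAST({column} AS {infer_type(samples)})"


expr = cast_expression("order_date", ["2024-03-15", "2024-03-16"])
```

With four of five samples matching, a column such as `["1", "2", "3", "4", "n/a"]` still infers as INT at the default threshold, while an even split falls back to VARCHAR.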

Schema Merging & Evolution

Three modes: Merge (union), Strict, Intersection
Type widening: int → bigint, float → double
Null-filling for columns missing in older files
Column order preservation from first schema
Force-nullable mode for safe dynamic evolution
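
One plausible reading of the three merge modes, sketched in Python. The precise semantics of Strict and Intersection are not spelled out above, so this example assumes Strict rejects any difference and Intersection keeps only shared columns. `merge_schemas` and `widen` are hypothetical names.

```python
# Widening pairs taken from the list above.
WIDENING = {("INT", "BIGINT"): "BIGINT", ("FLOAT", "DOUBLE"): "DOUBLE"}


def widen(left: str, right: str) -> str:
    """Return the wider of two compatible types."""
    if left == right:
        return left
    wider = WIDENING.get((left, right)) or WIDENING.get((right, left))
    if wider is None:
        raise TypeError(f"cannot reconcile {left} and {right}")
    return wider


def merge_schemas(first: dict, second: dict, mode: str = "merge") -> dict:
    """Combine two {column: type} schemas, keeping the first schema's order."""
    if mode == "strict":
        if first != second:  # assumed meaning: any difference is an error
            raise ValueError("schemas differ")
        return dict(first)
    if mode == "intersection":
        names = [n for n in first if n in second]
    else:  # "merge": union of columns, new ones appended at the end
        names = list(first) + [n for n in second if n not in first]
    merged = {}
    for name in names:
        if name in first and name in second:
            merged[name] = widen(first[name], second[name])
        else:
            # Column present on one side only; rows from the other get NULL.
            merged[name] = first.get(name) or second.get(name)
    return merged


old = {"id": "INT", "amount": "FLOAT"}
new = {"id": "BIGINT", "amount": "DOUBLE", "region": "VARCHAR"}
merged = merge_schemas(old, new)
```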

Parallel Processing

Rayon-based parallel inference across all CPU cores
Configurable sample sizes (1K fast to 100K+ thorough)
Compiled regex patterns cached for zero re-compilation
Schema fingerprinting for O(1) change detection
Automatic catalog sync without manual "Scan Files"
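
Schema fingerprinting is commonly done by hashing a canonical form of the schema. The Python sketch below assumes SHA-256 over sorted `name:type` pairs, which may differ from what Delta Forge actually hashes. The function name is hypothetical.

```python
import hashlib


def schema_fingerprint(schema: dict) -> str:
    """Hash a {column: type} schema into a stable hex digest."""
    # Sorting makes the digest independent of column order.
    canonical = "\n".join(f"{name}:{dtype}" for name, dtype in sorted(schema.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


stored = schema_fingerprint({"id": "BIGINT", "amount": "DECIMAL"})
same = schema_fingerprint({"amount": "DECIMAL", "id": "BIGINT"})
changed = schema_fingerprint({"id": "BIGINT", "amount": "DOUBLE"})
```

Once a fingerprint is stored with the table, detecting a change is a single string comparison, regardless of how many columns the schema has.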

-- Same column, different locales — Delta Forge infers correctly

-- German (de-DE): period groups, comma decimal
order_total:  1.234.567,89  →  DECIMAL
order_date:   15.03.2024    →  DATE

-- US English (en-US): comma groups, period decimal
order_total:  1,234,567.89  →  DECIMAL
order_date:   03/15/2024    →  DATE

-- French (fr-FR): space groups, comma decimal
order_total:  1 234 567,89  →  DECIMAL

-- Auto-generated transform view based on inference
CREATE VIEW v_orders AS
SELECT
    CAST(order_total AS DECIMAL(12,2)) AS order_total,
    CAST(order_date AS DATE) AS order_date,
    customer_name
FROM raw_orders;
