Table Format

Complete Delta Lake Protocol Implementation

The most complete native Delta Lake engine. Full protocol support with ACID transactions, deletion vectors, change data capture, and predictive optimization. No managed runtime required.

Full Reader V1-V3 & Writer V2-V7
UniForm Iceberg interoperability
Zero Spark dependency
[Diagram: anatomy of a Delta table. The transaction log (_delta_log/) holds JSON commits (000.json, 001.json, 002.json) plus checkpoint.parquet; Parquet data files (~128-256 MB each) hold the rows; deletion vectors (dv-0002, dv-0005, dv-0007) mask deleted rows; column statistics (min/max, nullCount, histogram) drive data skipping. Together these enable ACID, time travel, schema evolution, and clustering.]

Delta Protocol Support

Complete implementation of Delta Lake protocol specifications

Reader V1

Basic Reader Features

  • Add/Remove file actions
  • Partition pruning
  • Schema evolution
  • Time travel
Reader V2

Column Mapping

  • Column mapping (id, name)
  • Rename without data rewrite
  • Drop without data rewrite
  • Special characters in names
Reader V3

Table Features

  • Deletion vectors
  • Timestamp without timezone
  • V2 checkpoints
  • Vacuum protocol check
  • Type widening
Writer V2-V7

Full Writer Support

  • Append-only tables
  • Invariants & constraints
  • Change Data Feed
  • Generated & identity columns
  • Clustering & Z-order

Deletion Vectors

Surgical row-level deletes without file rewrites

Deletion vectors represent a fundamental advancement in lakehouse architecture. Instead of rewriting entire Parquet files to delete rows, Delta Forge tracks deleted row positions in compact bitmap structures.

How It Works

  1. DELETE Statement - Identify rows matching predicate
  2. Bitmap Creation - Record row positions in a compact bitmap
  3. DV File Write - Store compact deletion vector file
  4. Transaction Log - Link DV to original data file
  5. Read Filtering - Apply DV during scan to skip deleted rows
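The read-side filtering in step 5 can be sketched in Python. A production engine stores the positions in a compressed roaring bitmap rather than a plain set, but the logic is the same (illustrative only, not Delta Forge's actual implementation):

```python
def scan_with_deletion_vector(rows, deleted_positions):
    """Yield only rows whose position is not marked in the deletion vector."""
    for pos, row in enumerate(rows):
        if pos not in deleted_positions:
            yield row

rows = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7"]
dv = {1, 3, 6}  # bitmap of deleted row positions
live = list(scan_with_deletion_vector(rows, dv))
```

Because the Parquet file is never rewritten, the DELETE commits as soon as the tiny DV file and log entry are written.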

Performance Impact

  • 10-100x faster deletes vs file rewrite
  • Minimal storage overhead - DVs are highly compressed
  • Instant commits - No Parquet file I/O for delete
  • Efficient MERGE - Delete side uses DVs automatically
[Diagram: a Parquet file with rows 0-7, where rows 1, 3, and 6 are marked deleted. The deletion vector stores only the bitmap {1, 3, 6}, roughly 12 bytes.]

Time Travel

Query any historical state of your data

Version-Based Access

  • Query specific version numbers
  • Compare data between versions
  • Restore to previous versions
  • Clone tables at any version
SELECT * FROM events VERSION AS OF 42

Timestamp-Based Access

  • Query data as of specific timestamp
  • Point-in-time recovery
  • Audit trail reconstruction
  • Compliance reporting
SELECT * FROM events TIMESTAMP AS OF '2024-01-15 10:30:00'
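Timestamp-based access resolves to the latest commit at or before the requested time. A minimal sketch of that resolution over the transaction log (commit timestamps and versions here are illustrative):

```python
import bisect

def version_at(commits, ts):
    """commits: list of (commit_timestamp, version) sorted by timestamp.
    Return the latest version committed at or before ts."""
    timestamps = [c[0] for c in commits]
    i = bisect.bisect_right(timestamps, ts)
    if i == 0:
        raise ValueError("timestamp predates table creation")
    return commits[i - 1][1]

commits = [(100, 0), (200, 1), (350, 2)]
v = version_at(commits, 300)  # falls between versions 1 and 2
```

The query then reads exactly the files listed as live at that version, which is why time travel needs no snapshots or copies.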

Transaction Log

  • Complete operation history
  • User and application tracking
  • Operation metrics
  • Schema evolution history
DESCRIBE HISTORY events

Restore Operations

  • Instant restore to any version
  • Selective table restore
  • Schema-aware restore
  • Restore with constraints
RESTORE TABLE events TO VERSION AS OF 100

Schema Evolution

Evolve your schema without breaking pipelines

Add Columns

Add new columns at any position. Existing files return NULL for new columns.

ALTER TABLE t ADD COLUMN new_col STRING

Rename Columns

Rename columns while preserving column IDs. Zero data movement.

ALTER TABLE t RENAME COLUMN old_name TO new_name

Drop Columns

Remove columns from schema. Data remains until compaction.

ALTER TABLE t DROP COLUMN deprecated_col

Change Types

Widen column types (int → bigint, float → double).

ALTER TABLE t ALTER COLUMN amount TYPE DECIMAL(20,4)

Reorder Columns

Change column order for better organization.

ALTER TABLE t ALTER COLUMN col FIRST

Nested Evolution

Evolve struct and map types. Add fields to nested structures.

ALTER TABLE t ADD COLUMN address STRUCT<zip: STRING, city: STRING>

Change Data Feed

Track every row-level change for downstream processing

Change Types

  • insert - New rows added
  • update_preimage - Row before update
  • update_postimage - Row after update
  • delete - Rows removed

Change Metadata

  • _change_type - Type of change
  • _commit_version - Transaction version
  • _commit_timestamp - When change occurred

Query Changes

  • Version range queries
  • Timestamp range queries
  • Incremental processing
  • Streaming consumption

Use Cases

  • ETL pipeline triggers
  • Real-time analytics sync
  • Audit trail generation
  • Cache invalidation
Change Data Feed Query
-- Get changes between versions 100 and 150
SELECT * FROM table_changes('customers', 100, 150)
WHERE _change_type IN ('update_postimage', 'insert');

-- Get changes in time range
SELECT customer_id, email, _change_type, _commit_timestamp
FROM table_changes('customers',
    '2024-01-01 00:00:00',
    '2024-01-31 23:59:59')
ORDER BY _commit_timestamp;
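A downstream consumer applies change rows in commit order; here is a minimal sketch of the cache-invalidation use case above (row shape and column names are illustrative):

```python
def apply_changes(cache, changes):
    """Apply CDF rows (dicts carrying _change_type) to a key -> email cache."""
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        key = row["customer_id"]
        if row["_change_type"] in ("insert", "update_postimage"):
            cache[key] = row["email"]
        elif row["_change_type"] == "delete":
            cache.pop(key, None)
        # update_preimage rows carry the old value; nothing to apply
    return cache

changes = [
    {"customer_id": "c1", "email": "a@x.io", "_change_type": "insert", "_commit_version": 101},
    {"customer_id": "c1", "email": "b@x.io", "_change_type": "update_postimage", "_commit_version": 105},
    {"customer_id": "c2", "email": "c@x.io", "_change_type": "insert", "_commit_version": 106},
    {"customer_id": "c2", "email": None, "_change_type": "delete", "_commit_version": 110},
]
cache = apply_changes({}, changes)
```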

UniForm: Delta + Iceberg Interoperability

Write once as Delta. Read anywhere as Iceberg. No data duplication.

Enable UniForm compatibility and Delta Forge automatically generates Iceberg metadata alongside the Delta transaction log. The same physical data files are readable by both Delta and Iceberg clients. No ETL pipeline to maintain a second copy, no storage duplication, no synchronization overhead.

How UniForm Works

  • Single Write Path - Data is written once as Delta Parquet files
  • Automatic Metadata - Iceberg manifest and metadata files generated on commit
  • Zero Data Duplication - Both formats point to the same physical files
  • Full Version Support - Compatible with Iceberg format V1, V2, and V3
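With UniForm enabled, a table directory might look like this (layout illustrative; exact file names vary):

```
events/
├── _delta_log/                   # Delta transaction log (source of truth)
│   ├── 00000000000000000000.json
│   └── 00000000000000000001.json
├── metadata/                     # Iceberg metadata, generated on commit
│   ├── v1.metadata.json
│   └── snapshot manifests
└── part-0001.parquet             # shared data files, written once
```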

Cross-Engine Access

Any Iceberg-compatible query engine can read your Delta tables directly. Use Delta Forge for writes and maintenance, while downstream consumers use whichever engine fits their workflow.

ALTER TABLE events SET TBLPROPERTIES ('delta.universalFormat.enabledFormats' = 'iceberg')

Learn more about Apache Iceberg support →

[Diagram: the Delta table's _delta_log/ and the Iceberg metadata/ generated by the UniForm translation layer both point to the same shared Parquet data files. Same physical files, no duplication.]

Variant Type for Semi-Structured Data

JSON-like flexibility with columnar performance

Store semi-structured data natively in Delta tables using the Variant type. Unlike raw JSON strings, Variant uses an efficient binary encoding with automatic shredding into columnar storage, delivering up to 10x better query performance while preserving full schema flexibility.

Key Capabilities

  • Automatic Shredding - Frequently accessed fields are extracted into columnar storage for fast reads
  • Path-Based Extraction - Access nested fields without parsing the entire document
  • Zero Precision Loss - Numeric types preserve exact values in binary encoding
  • Schema Discovery - Automatic inference of structure from semi-structured data
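Shredding can be pictured as hoisting frequently accessed paths out of each document into their own columns. A toy sketch in pure Python (the real format uses a binary encoding, not dicts):

```python
def shred(docs, paths):
    """Extract the given dotted paths from each document into columnar lists."""
    def get(doc, path):
        for part in path.split("."):
            if not isinstance(doc, dict) or part not in doc:
                return None
            doc = doc[part]
        return doc
    return {path: [get(d, path) for d in docs] for path in paths}

docs = [
    {"user": {"name": "ada", "region": "us-east"}, "action": "click"},
    {"user": {"name": "lin"}, "action": "view"},
]
cols = shred(docs, ["user.name", "user.region"])
```

Once a path lives in its own column, a predicate on it reads that column alone instead of decoding every document.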

Use Cases

  • Event streams with varying payloads
  • API response archival with evolving schemas
  • IoT sensor data with heterogeneous formats
  • Log aggregation across diverse sources
Variant Type Queries
-- Create table with Variant column
CREATE TABLE events (
  id BIGINT,
  event_date DATE,
  payload VARIANT
);

-- Path-based field extraction
SELECT
  payload:user.name AS user_name,
  payload:user.email AS email,
  payload:action AS action_type
FROM events
WHERE payload:user.region = 'us-east';

-- Automatic shredding means this runs
-- at columnar speed, not JSON parsing speed

GDPR Data Erasure

Right-to-be-forgotten compliance built into the storage layer

Delta Lake's combination of targeted DELETE, deletion vectors, and VACUUM gives you a complete, auditable pipeline for GDPR right-to-be-forgotten requests. Delete the logical record, then physically remove the underlying files so the data is unrecoverable from storage.

Compliance Workflow

  1. Targeted DELETE - Remove specific user data by predicate without rewriting unrelated files
  2. Deletion Vector - Rows are immediately invisible to all readers via bitmap
  3. DRY RUN - Preview which files will be physically removed before committing
  4. VACUUM - Permanently erase the old Parquet files containing the deleted data

Why Delta Lake

  • Surgical precision - delete one user without touching millions of unrelated rows
  • Verifiable removal - once VACUUM completes, the old files no longer exist on storage
  • Audit trail - the transaction log records exactly when the deletion occurred
  • No downtime - concurrent readers continue without interruption
-- GDPR: Right to be forgotten
DELETE FROM customers
WHERE customer_id = 'user-12345';

-- Preview files that will be removed
VACUUM customers RETAIN 0 HOURS DRY RUN;

-- Permanently remove old data files
VACUUM customers RETAIN 0 HOURS;

Constraints & Data Quality

Enforce data quality at the storage layer

NOT NULL Constraints

Prevent null values in critical columns.

ALTER TABLE t ALTER COLUMN id SET NOT NULL

CHECK Constraints

Custom validation expressions.

ALTER TABLE t ADD CONSTRAINT valid_price CHECK (price > 0)

Generated Columns

Auto-computed columns from expressions.

year INT GENERATED ALWAYS AS (YEAR(event_date))

Identity Columns

Auto-incrementing unique identifiers.

id BIGINT GENERATED ALWAYS AS IDENTITY

Default Values

Automatic value assignment.

created_at TIMESTAMP DEFAULT current_timestamp()

Column Invariants

Validation enforced on write.

status STRING CHECK (status IN ('active', 'inactive'))

Table Maintenance Operations

Keep tables optimized and performant


OPTIMIZE

Compact small files into larger ones. Improves read performance by reducing file count.

OPTIMIZE events WHERE date > '2024-01-01'
  • Bin-packing algorithm
  • Target file size: 1GB
  • Predicate-based scoping
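Compaction can be sketched with first-fit-decreasing bin packing against the target file size (file sizes below are the illustrative MB figures from the diagram above):

```python
def bin_pack(file_sizes, target):
    """Group files into compaction bins of at most `target` each
    (first-fit decreasing heuristic)."""
    bins = []  # each bin: [remaining_capacity, [sizes]]
    for size in sorted(file_sizes, reverse=True):
        for b in bins:
            if b[0] >= size:
                b[0] -= size
                b[1].append(size)
                break
        else:
            bins.append([target - size, [size]])
    return [b[1] for b in bins]

# Eight small files (MB) packed toward a 1 GB target
groups = bin_pack([128, 256, 192, 210, 180, 220, 145, 200], 1024)
```

Each group is rewritten as one large file, so eight files here collapse into two.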

VACUUM

Remove old files no longer referenced by any version. Reclaim storage space.

VACUUM events RETAIN 168 HOURS
  • Safe retention period
  • Dry-run mode available
  • Respects time travel

ANALYZE

Compute column statistics for query optimization. Updates histograms and NDV estimates.

ANALYZE TABLE events COMPUTE STATISTICS FOR ALL COLUMNS
  • Column-level stats
  • Histogram generation
  • Incremental updates

Z-ORDER

Co-locate related data for better data skipping. Optimizes multi-dimensional queries.

OPTIMIZE events ZORDER BY (user_id, event_type)
  • Space-filling curves
  • Multi-column optimization
  • Improved file pruning
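The space-filling curve behind Z-ordering is classically the Morton (Z) curve, which interleaves the bits of each dimension. A simplified two-column sketch (production engines also range-encode values and handle more types):

```python
def morton2(x, y, bits=16):
    """Interleave the low `bits` of x and y into a single Z-curve key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x occupies even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit positions
    return z

# Sorting rows by morton2(user_id, event_type_id) keeps rows that are close
# in either dimension close on disk, so per-file min/max stats prune well
# for predicates on either column.
key = morton2(3, 5)
```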

Proactive Table Intelligence

Preventive monitoring and actionable recommendations, not reactive debugging

Health Score

  • Single 0-100 health score per table
  • Specific issue identification
  • File size distribution analysis
  • Deletion vector density tracking
  • Clustering quality assessment
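A score of this kind can be sketched as a weighted penalty over a few normalized table metrics. The weights and inputs below are purely illustrative, not Delta Forge's actual model:

```python
def health_score(small_file_ratio, dv_density, clustering_quality):
    """Combine three 0-1 metrics into a 0-100 score.
    small_file_ratio:   fraction of files below target size (lower is better)
    dv_density:         fraction of rows masked by DVs (lower is better)
    clustering_quality: layout quality, 0-1 (higher is better)
    """
    penalty = (0.5 * small_file_ratio
               + 0.3 * dv_density
               + 0.2 * (1 - clustering_quality))
    return round(100 * (1 - penalty))

score = health_score(small_file_ratio=0.4, dv_density=0.1, clustering_quality=0.8)
```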

Audit & Integrity

  • Detect corruption early
  • Find orphaned files
  • Missing checkpoint detection
  • Protocol violation alerts
  • On-demand or automated checks

Storage Analytics

  • Storage breakdown by file type
  • Efficiency metrics and trends
  • Small file ratio monitoring
  • DV overhead tracking
  • Cost attribution per table

Recommendations

  • Auto-generated optimization suggestions
  • Priority ranking by estimated benefit
  • Timeline analysis of table evolution
  • Write volume and pattern insights
  • Expected improvement estimates

Most platforms leave table health monitoring to you. Delta Forge monitors continuously and tells you what needs attention before it becomes a problem.

Predictive Optimization

Automatic, workload-aware maintenance scheduling with zero manual tuning

Predictive Optimization analyzes table activity patterns and automatically schedules maintenance operations at the right time. The system estimates the benefit of each operation and prioritizes accordingly. Your tables stay healthy without manual intervention.

Automatic Triggers

  • Auto-VACUUM - Schedules cleanup based on file accumulation rate and retention policy
  • Auto-OPTIMIZE - Triggers compaction when small file ratio exceeds thresholds
  • Auto-ANALYZE - Refreshes statistics after significant data changes
  • Auto-Cluster - Re-clusters data when new writes degrade layout quality

Workload-Aware Scheduling

  • Activity Monitoring - Tracks write volume, query patterns, and access frequency
  • Benefit Estimation - Predicts performance improvement before running operations
  • Priority Ranking - Most impactful operations run first
  • Resource Budgeting - Maintenance respects configurable resource limits
Predictive Optimization
-- Enable predictive optimization
ALTER TABLE events SET TBLPROPERTIES (
  'delta.enablePredictiveOptimization' = 'true'
);

-- The system automatically:
--   Monitors write patterns
--   Triggers OPTIMIZE when small files accumulate
--   Runs VACUUM after retention window passes
--   Refreshes ANALYZE after significant changes
--   Re-clusters when layout quality degrades

-- Check optimization status
DESCRIBE DETAIL events;
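The scheduling loop described above can be pictured as a priority queue of candidate operations ranked by estimated benefit per unit cost, drained within a resource budget (operation names, benefits, and costs are illustrative):

```python
import heapq

def schedule(candidates, budget):
    """Pick maintenance ops in descending benefit/cost order within a budget.
    candidates: list of (name, estimated_benefit, cost)."""
    heap = [(-benefit / cost, cost, name) for name, benefit, cost in candidates]
    heapq.heapify(heap)
    plan, spent = [], 0
    while heap:
        _, cost, name = heapq.heappop(heap)
        if spent + cost <= budget:
            plan.append(name)
            spent += cost
    return plan

candidates = [
    ("OPTIMIZE events", 9.0, 3),  # big win: many small files accumulated
    ("VACUUM events", 2.0, 1),    # retention window just passed
    ("ANALYZE events", 4.0, 1),   # stats stale after a large write
]
plan = schedule(candidates, budget=4)
```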

Enterprise-grade Delta Lake. Native performance. Zero Spark dependency.

Full protocol support. ACID transactions. Predictive optimization. Production-tested and enterprise-ready.