Skip to content
SQL Engine

The Most Complete SQL Implementation for Delta Lake

200+ functions, full PL/pgSQL procedural language, advanced window functions, and production-grade query optimization. This isn't just SQL support - it's PostgreSQL-level capability on lakehouse architecture.

Query Engine Architecture

A ground-up implementation delivering enterprise SQL semantics

Query Interface
SQL Parser PL/pgSQL Parser Prepared Statements Parameter Binding
Logical Planning
AST Analysis Type Inference Semantic Validation View Expansion
Optimization
Predicate Pushdown Projection Pruning Join Reordering Partition Pruning Statistics-Based CBO
Physical Execution
Vectorized Engine Hash Aggregation Sort-Merge Join Parallel Scan
Storage Layer
Delta Protocol Parquet Reader Column Pruning Row Group Filtering

200+ Built-in Functions

Comprehensive function library covering every analytical need

Mathematical Functions (40+)

abs, ceil, floor, round, trunc Basic arithmetic
power, sqrt, cbrt, exp, ln, log, log10 Exponential & logarithmic
sin, cos, tan, asin, acos, atan, atan2 Trigonometric
sinh, cosh, tanh, asinh, acosh, atanh Hyperbolic
degrees, radians, pi Angle conversion
gcd, lcm, factorial, sign Number theory
random, setseed Random generation
width_bucket, div, mod Bucketing & division

String Functions (50+)

length, char_length, octet_length, bit_length Length operations
upper, lower, initcap Case conversion
trim, ltrim, rtrim, btrim Whitespace handling
substring, left, right, substr Extraction
position, strpos, locate Search
replace, translate, overlay Replacement
concat, concat_ws, repeat, reverse Concatenation
lpad, rpad, format Formatting
split_part, string_to_array, array_to_string Splitting & joining
regexp_match, regexp_replace, regexp_split_to_array Regular expressions
ascii, chr, encode, decode Character encoding
md5, sha256, sha512 Hashing

Date/Time Functions (35+)

now, current_timestamp, current_date, current_time Current values
date_trunc, date_part, extract Component extraction
date_add, date_sub, date_diff Arithmetic
to_timestamp, to_date, to_char Conversion
year, month, day, hour, minute, second Component access
dayofweek, dayofyear, weekofyear, quarter Calendar functions
age, isfinite, justify_days, justify_hours Interval operations
timezone, at time zone Timezone handling

Aggregate Functions (30+)

count, sum, avg, min, max Basic aggregates
stddev, stddev_pop, stddev_samp Standard deviation
variance, var_pop, var_samp Variance
corr, covar_pop, covar_samp Correlation
regr_slope, regr_intercept, regr_r2 Linear regression
percentile_cont, percentile_disc Percentiles
mode, median Statistical mode
array_agg, string_agg, json_agg Collection aggregates
bool_and, bool_or, every Boolean aggregates
bit_and, bit_or, bit_xor Bitwise aggregates

Advanced Window Functions

Full SQL:2011 window function specification with frame support

Ranking Functions

  • ROW_NUMBER() - Sequential row numbering
  • RANK() - Ranking with gaps
  • DENSE_RANK() - Ranking without gaps
  • NTILE(n) - Bucket distribution
  • PERCENT_RANK() - Relative rank
  • CUME_DIST() - Cumulative distribution

Value Functions

  • FIRST_VALUE() - First in partition
  • LAST_VALUE() - Last in partition
  • NTH_VALUE(n) - Nth row value
  • LAG(n) - Previous row access
  • LEAD(n) - Next row access

Frame Specifications

  • ROWS BETWEEN - Physical row frames
  • RANGE BETWEEN - Logical value frames
  • GROUPS BETWEEN - Peer group frames
  • UNBOUNDED PRECEDING
  • CURRENT ROW
  • UNBOUNDED FOLLOWING

Frame Exclusion

  • EXCLUDE NO OTHERS - Include all
  • EXCLUDE CURRENT ROW
  • EXCLUDE GROUP - Exclude peers
  • EXCLUDE TIES - Exclude peer ties
Advanced Window Query Example
SELECT
    customer_id,
    order_date,
    amount,
    -- Running total with frame
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) as running_total,
    -- 7-day moving average
    AVG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
        RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
    ) as moving_avg_7d,
    -- Percent of customer total
    amount / SUM(amount) OVER (PARTITION BY customer_id) * 100 as pct_of_total,
    -- Compare to previous order
    amount - LAG(amount, 1, 0) OVER (
        PARTITION BY customer_id ORDER BY order_date
    ) as diff_from_prev
FROM orders;

Full PL/pgSQL Procedural Language

Enterprise stored procedure support with complete control flow

Variable Declaration

  • Scalar variables with types
  • Record types for row data
  • %TYPE column type reference
  • %ROWTYPE table row type
  • Array variables
  • Composite types
  • Default value initialization
  • CONSTANT declaration

Control Structures

  • IF/THEN/ELSIF/ELSE
  • CASE WHEN expressions
  • LOOP/EXIT/CONTINUE
  • WHILE loops
  • FOR loops (integer range)
  • FOREACH array iteration
  • FOR ... IN SELECT query loops
  • RETURN NEXT for set-returning

Exception Handling

  • EXCEPTION WHEN blocks
  • Predefined exception types
  • RAISE EXCEPTION
  • RAISE NOTICE/WARNING
  • SQLSTATE error codes
  • SQLERRM error messages
  • Exception propagation
  • Transaction savepoints

Cursors

  • Implicit cursor FOR loops
  • Explicit cursor declaration
  • Bound cursor variables
  • Parameterized cursors
  • OPEN/FETCH/CLOSE
  • FOUND attribute
  • SCROLL cursors
  • REFCURSOR return type
PL/pgSQL Stored Procedure Example
CREATE OR REPLACE FUNCTION process_monthly_billing(
    p_month DATE,
    p_discount_threshold DECIMAL DEFAULT 1000.00
) RETURNS TABLE(
    customer_id INT,
    total_amount DECIMAL,
    discount_applied DECIMAL,
    final_amount DECIMAL
) AS $$
DECLARE
    v_customer RECORD;
    v_total DECIMAL;
    v_discount DECIMAL;
    v_processed INT := 0;
BEGIN
    -- Loop through all active customers
    FOR v_customer IN
        SELECT id, tier FROM customers WHERE status = 'active'
    LOOP
        -- Calculate monthly total
        SELECT COALESCE(SUM(amount), 0) INTO v_total
        FROM orders
        WHERE customer_id = v_customer.id
          AND date_trunc('month', order_date) = date_trunc('month', p_month);

        -- Apply tiered discount logic
        v_discount := CASE
            WHEN v_customer.tier = 'platinum' AND v_total > p_discount_threshold THEN v_total * 0.15
            WHEN v_customer.tier = 'gold' AND v_total > p_discount_threshold THEN v_total * 0.10
            WHEN v_total > p_discount_threshold * 2 THEN v_total * 0.05
            ELSE 0
        END;

        -- Return row
        customer_id := v_customer.id;
        total_amount := v_total;
        discount_applied := v_discount;
        final_amount := v_total - v_discount;
        RETURN NEXT;

        v_processed := v_processed + 1;
    END LOOP;

    RAISE NOTICE 'Processed % customers for month %', v_processed, p_month;
EXCEPTION
    WHEN OTHERS THEN
        RAISE EXCEPTION 'Billing error: % - %', SQLSTATE, SQLERRM;
END;
$$ LANGUAGE plpgsql;

Advanced Query Optimization

Cost-based optimizer with Delta Lake-aware pruning

🎯

Predicate Pushdown

Filters pushed to storage layer, leveraging Delta Lake file-level statistics to skip entire Parquet files that can't contain matching rows.

WHERE date > '2024-01-01' → Skip files with max_date < '2024-01-01'
📊

Partition Pruning

Automatic detection of partition columns in WHERE clauses, eliminating entire partition directories from scan.

WHERE region = 'US' → Read only /region=US/ partition
📐

Projection Pruning

Only requested columns are read from Parquet files. Column metadata used to skip unnecessary row groups.

SELECT name, email → Read only 2 of 50 columns
🔄

Join Reordering

Statistics-based join order optimization. Smaller tables joined first, reducing intermediate result sizes.

A ⋈ B ⋈ C → Reorder to (A ⋈ C) ⋈ B if |C| < |B|

Expression Simplification

Constant folding, dead code elimination, and Boolean expression simplification at compile time.

WHERE 1=1 AND active → WHERE active
🔗

Subquery Decorrelation

Correlated subqueries transformed to efficient joins. EXISTS/IN converted to semi-joins.

WHERE id IN (SELECT...) → Semi-join

Intelligent Join Execution

Multiple join algorithms selected based on data characteristics

Hash Join

Default for equality joins. Build side chosen based on estimated cardinality. Supports spill-to-disk for large builds.

Sort-Merge Join

Optimal when inputs are pre-sorted or when sort can be reused. Handles range joins efficiently.

Nested Loop Join

Used for cross joins and complex predicates that cannot use equality-based algorithms.

Semi/Anti Joins

Optimized execution for EXISTS, NOT EXISTS, IN, and NOT IN patterns with early termination.

Rich Type System

Comprehensive data types with full coercion rules

Numeric Types

TINYINT (i8) SMALLINT (i16) INT (i32) BIGINT (i64) FLOAT (f32) DOUBLE (f64) DECIMAL(p,s)

String Types

VARCHAR(n) CHAR(n) TEXT BINARY VARBINARY

Temporal Types

DATE TIME TIMESTAMP TIMESTAMPTZ INTERVAL

Complex Types

ARRAY<T> MAP<K,V> STRUCT<...> JSON UUID

Experience the power of enterprise SQL on Delta Lake

200+ functions. Full PL/pgSQL. Production-grade optimization.