Apache Spark — Reimagined

Spark Queries That Take Hours.
Now Done in Minutes.

TabbyDB is a drop-in fork of Apache Spark that eliminates compile-time blowup, OutOfMemory failures, and nested join underperformance — at the optimizer level. Zero code changes. Zero cluster changes.

13% TPC-DS Improvement at 1TB–2TB
46%+ Iceberg Performance Gain
20+ Apache Spark JIRA Fixes
The Problem

The Real Bottleneck Is in Apache Spark

Production workloads hit walls that benchmarks never reveal. Compile time, optimizer failures, and bad workarounds compound silently until it's too late.

Root Cause: Query Planning

Compile time can take minutes to hours — but Spark's UI only registers a query after plan submission. The bottleneck is in planning, not execution, and it's invisible to your metrics.

Why Tuning Fails

Compile-time problems are routinely misdiagnosed as runtime issues. Runtime tuning doesn't reduce planning time, and more compute doesn't mean faster planning. The fixes you try never touch the root cause.

Workarounds Backfire

Disabling optimizer rules can reduce compile time — but at the cost of runtime performance. Every workaround forces a tradeoff: faster planning vs. slower execution. TabbyDB eliminates this tradeoff.
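For reference, the workaround in stock Spark typically looks like the settings below in `spark-defaults.conf`. Both property keys are real Spark configuration options; the specific rule shown is one common culprit, used here only to illustrate the tradeoff:

```properties
# Workaround in stock Spark: turn off constraint propagation to cut
# planning time. Filters that would have been inferred are lost, so
# execution scans more data.
spark.sql.constraintPropagation.enabled   false

# Or exclude individual optimizer rules (comma-separated list):
spark.sql.optimizer.excludedRules   org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints
```

Either setting shortens planning at the direct cost of runtime filtering, which is exactly the tradeoff described above.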

The result: minutes to hours of delay, wasted compute, and missed SLAs — with no clear path to fix it in stock Spark.

The Solution

TabbyDB — Turbocharged Apache Spark

A strict superset of Apache Spark 4.0.1 and 4.1.1. Same APIs. Same clusters. Better engine. Drop in the jars and get back hours.

01
Compile Time

Intelligent Compile-Time Optimizations

Fundamental improvements to critical optimizer rules: optimized constraint propagation, early project collapsing during analysis, reduced Hive Metastore calls, and targeted rule application to avoid expensive tree traversals. Complex queries that took 8 hours now compile in minutes.
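To see why constraint propagation can dominate compile time, consider a toy model in plain Python (this is an illustration of the blowup pattern, not Spark internals or TabbyDB's fix): a single constraint over n columns, where each column also has one alias, must be rewritten for every combination of equivalent names, so the number of rewrites is 2^n.

```python
from itertools import product

def expand_constraint(cols, aliases):
    """All alias-combinations of one constraint over `cols`.
    `aliases` maps each column to its list of equivalent names
    (the column itself included)."""
    return [tuple(combo) for combo in product(*(aliases[c] for c in cols))]

cols = [f"c{i}" for i in range(12)]
aliases = {c: [c, c + "_alias"] for c in cols}   # one alias per column
variants = expand_constraint(cols, aliases)
print(len(variants))   # 2**12 = 4096 rewrites of a single constraint
```

With a handful of joins and aliased columns, this combinatorial growth is enough to push planning from seconds into hours, which is why the fix has to live in the optimizer rules themselves.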

02
Memory

Scalable Query Tree Management

Safely collapses project nodes early in the query lifecycle, preventing unbounded query plan growth and reducing memory pressure during compilation. Faster compilation and dramatically reduced risk of out-of-memory failures on deeply nested workloads.
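The effect of project collapsing can be sketched with a toy plan model (plain Python, not Spark's implementation; real collapsing must also substitute expressions that reference earlier aliases, which this simplified version ignores by treating each projection's columns as independent):

```python
# Toy model: a chain of withColumn-style projections normally nests one
# Project node per call; collapsing adjacent projections keeps the plan
# one node deep instead of a thousand.

def collapse_projects(plan):
    """Merge runs of consecutive ("project", mapping) nodes by combining
    their column mappings, left to right."""
    out = []
    for node in plan:
        if out and out[-1][0] == "project" and node[0] == "project":
            merged = dict(out[-1][1])
            merged.update(node[1])          # later projection wins
            out[-1] = ("project", merged)
        else:
            out.append(node)
    return out

# 1,000 chained projections over a single scan:
plan = [("scan", {"table": "t"})] + [
    ("project", {f"col{i}": f"expr{i}"}) for i in range(1000)
]
print(len(collapse_projects(plan)))   # 2: one scan + one merged project
```

Keeping the tree shallow is what bounds both plan memory and the cost of every later rule that walks the tree.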

03
Runtime

Advanced Broadcast Hash Join Handling

Dynamic file pruning using Broadcast Hash Join data on non-partitioned columns — the fix Spark never shipped. Pruning files for joins on non-partitioned columns cuts data scan time, yielding a 13% improvement on TPC-DS at 1TB and 2TB.
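The idea behind dynamic file pruning can be sketched in a few lines of plain Python (a conceptual model, not Spark's or TabbyDB's code): once the broadcast side of a join is materialized, its join keys bound which files on the probe side can possibly match, so files whose min/max key statistics miss every broadcast key are skipped without being read.

```python
def prune_files(files, broadcast_keys):
    """Keep only files whose [min, max] key range contains at least one
    key from the broadcast side of the join."""
    def may_match(f):
        return any(f["min"] <= k <= f["max"] for k in broadcast_keys)
    return [f for f in files if may_match(f)]

files = [
    {"path": "part-0", "min": 0,   "max": 99},
    {"path": "part-1", "min": 100, "max": 199},
    {"path": "part-2", "min": 200, "max": 299},
]
kept = prune_files(files, broadcast_keys={150, 160})
print([f["path"] for f in kept])   # ['part-1'] -- two of three files skipped
```

On large non-partitioned tables, skipping files this way is where the scan-time savings come from.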

04
Cache

Improved Cache Lookup Efficiency

Enhances how cached in-memory query plans are matched and reused, increasing the likelihood of successful cache hits. Higher cache reuse and lower execution overhead — especially for repeated or structurally similar queries.
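Why plan matching needs normalization can be shown with a toy example (plain Python, not Spark's canonicalization logic): two plans that differ only in operand order for a commutative operator should hit the same cache entry, but a lookup keyed on the raw tree misses them.

```python
def canonicalize(plan):
    """Normalize an (op, left, right) expression tree: recursively order
    the operands of commutative ops so equivalent trees become the same
    cache key."""
    if isinstance(plan, tuple):
        op, left, right = plan
        left, right = canonicalize(left), canonicalize(right)
        if op in {"+", "*", "=", "and", "or"} and repr(left) > repr(right):
            left, right = right, left
        return (op, left, right)
    return plan

cache = {canonicalize(("+", "a", "b")): "cached-result"}
# A structurally different but equivalent plan still hits the cache:
print(canonicalize(("+", "b", "a")) in cache)   # True
```

The more equivalent plans collapse to one key, the higher the cache hit rate for repeated or structurally similar queries.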

05
Iceberg

Apache Iceberg — 46%+ Faster

Early testing on 50GB non-partitioned Iceberg tables shows 46% improvement. Gains are expected to grow beyond 46% at 1TB–2TB scale. The Iceberg performance layer the ecosystem has been waiting for.

06
Deployment

Seamless Spark Compatibility. No Lock-In.

Maintains full compatibility with Apache Spark APIs, features, and tooling while delivering every performance improvement above. Replace Spark jars with TabbyDB jars. To revert: swap back. No code rewrites. No cluster changes. No disabling of optimizer rules.

Performance That Redefines Complex Query Execution

13%
TPC-DS Runtime Improvement

1TB & 2TB on AWS r6gd.4xlarge — non-partitioned Hive Parquet tables

46%+
Iceberg Performance Gain

50GB non-partitioned tables. Gains expected to increase at 1TB–2TB scale

8 hrs → mins
Compile Time Reduction

Complex DataFrame API queries — hours to compile, now minutes

TPC-DS gains shown on non-partitioned tables. Standard TPC-DS (partitioned) shows equivalent performance to stock Spark by design.

See the performance difference for yourself. Click below to open our Zeppelin notebooks, where you can run the same query on both Stock Spark and TabbyDB side by side. Note: running the Stock Spark paragraph may take 5–12 minutes.

Compare Performance Now →

TPC-DS Benchmark Results

1 TB TPC-DS Benchmark — 6 nodes AWS r6gd.4xlarge
2 TB TPC-DS Benchmark — 6 nodes AWS r6gd.4xlarge
50 GB TPC-DS Benchmark — Mac Mini M4, single node
Under the Hood

Spark SQL Modules Optimized in TabbyDB

TabbyDB Spark SQL processing pipeline
Stock Spark SQL processing pipeline
What We Fixed

20+ Apache Spark JIRA Issues — Resolved

Performance Issues
SPARK-33152: Constraint Propagation causing compile times to run into hours
SPARK-36786: Inefficiency in PushDownPredicates for complex expressions
SPARK-44662: Dynamic file pruning for non-partition column joins
SPARK-45373: Minimizing calls to HMS layer for repeated table references
SPARK-45866: Reuse of Exchange broken in AQE when runtime filters pushed down
SPARK-45959: Uncapped tree size in analysis phase — compilation runs into hours
SPARK-46671: Redundant filter creation from buggy Constraint Propagation rule
SPARK-47609: Cached Plan lookup may miss valid plans
SPARK-49618: Canonicalization differences in Union causing failure in re-use of exchange or cached plans
SPARK-49881: Minimizing cost of DeduplicateRelations in the analyzer
SPARK-54881: BooleanSimplification using transformExpressionsUp instead of Down — inefficient in some cases
SPARK-55072: Inferring new Constraint misses IsNotNull on Left Leg when Outer Join converts to Inner Join
SPARK-55110: Order of BooleanSimplification and SimplifyBinaryComparison rules is suboptimal for idempotency
Functional Issues
SPARK-47320: Self-join inconsistencies and exceptions
SPARK-47217: DeduplicateRelations may cause failure in plan resolution
SPARK-49727: Data loss when POJO Dataset converted to DataFrame and back
SPARK-49789: Exception encoding POJOs with generic type fields
SPARK-51016: Incorrect results during retry when join column is indeterminate
SPARK-45658: Canonicalization of DynamicPruningSubquery is broken
SPARK-53264: Incorrect nullability when correlated subquery converted to Left Outer Join
SPARK-55185: Adding InferFiltersFromConstraints to Optimization batch causes idempotency break
SPARK-55241: Idempotency of SQL Streaming with Joins broken when InferFiltersFromConstraints and PropagateEmptyRelation are added as Optimization rules
Why Choose TabbyDB

More Than a Faster Engine

TabbyDB is built for teams running complex, production-critical Apache Spark workloads, where query compilation time, optimizer behavior, and correctness matter as much as execution speed.

Engine-Level Enhancements

Improvements are made at the optimizer and execution engine level — not as application-layer patches. This means every workload benefits, without requiring any tuning or code changes.

Built for Stability & Correctness

Beyond raw performance, TabbyDB addresses 20+ long-standing Spark correctness issues — self-join inconsistencies, data loss in POJO conversions, broken idempotency in streaming joins. Stability and predictability matter as much as speed.

Compatible With Everything You Have

TabbyDB maintains full compatibility with existing Spark APIs, tooling, and workflows. No migration. No vendor lock-in. If you ever need to roll back, swap the jars.

Open to Collaboration

Partner With the Team Behind TabbyDB

If you are encountering functional or performance issues in Apache Spark — particularly within the SQL or optimizer layer — we're open to collaborating on solutions tailored to your workload or codebase.

Whether it's diagnosing a bottleneck, validating a fix, or contributing targeted improvements or customizations, we're happy to engage.

Get Started Today

Try TabbyDB on Your Actual Workloads

Speak with our experts to get a solution tailored to your business goals and data needs. We help you plan the right strategy for faster growth and better results.

Live in 15 Minutes. Revert in 30 Seconds.

01

Download TabbyDB jars (4.0.1 or 4.1.1)

02

Replace your existing Spark jars

03

Run your existing pipelines — unchanged

To revert to stock Spark: remove TabbyDB jars, restore Spark jars. No configuration changes. No code changes. No cluster changes.
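The three steps above, plus the rollback, can be sketched with Python's standard library. The paths here are hypothetical placeholders; in a real deployment they would point at your actual `$SPARK_HOME/jars` and the downloaded TabbyDB jar directory:

```python
import shutil
from pathlib import Path

def swap_jars(spark_jars: Path, tabbydb_jars: Path) -> Path:
    """Back up the stock jars directory, then install TabbyDB's jars."""
    backup = spark_jars.parent / (spark_jars.name + ".stock-backup")
    shutil.move(str(spark_jars), str(backup))            # preserve stock jars
    shutil.copytree(str(tabbydb_jars), str(spark_jars))  # drop in TabbyDB
    return backup

def revert(spark_jars: Path, backup: Path) -> None:
    """Restore stock Spark: remove TabbyDB jars, move the backup back."""
    shutil.rmtree(str(spark_jars))
    shutil.move(str(backup), str(spark_jars))
```

Because the swap is a directory rename plus a copy, the rollback is symmetric, which is what makes the "revert in 30 seconds" claim mechanical rather than a migration.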

Stop Waiting for Spark to Compile.

Download the trial. Run it on your actual queries. See the difference on your own cluster.

100% Apache Spark API compatible. No code changes. Full rollback in seconds.