Apache Spark — Reimagined

Spark Queries That Take Hours.
Now Done in Minutes.

TabbyDB is a drop-in fork of Apache Spark that eliminates compile-time blowup, OutOfMemory failures, and nested join underperformance — at the optimizer level. Zero code changes. Zero cluster changes.

13% TPC-DS Improvement at 1TB–2TB
46%+ Iceberg Performance Gain
20+ Apache Spark JIRA Fixes
The Problem

The Real Bottleneck Is in Apache Spark

Production workloads hit walls that benchmarks never reveal. Compile time, optimizer failures, and bad workarounds compound silently until it's too late.

Root Cause: Query Planning

Compile time can take minutes to hours — but Spark's UI only registers a query after plan submission. The bottleneck is in planning, not execution, and it's invisible to your metrics.

Why Tuning Fails

Compile-time problems are routinely misdiagnosed as runtime issues. Runtime tuning doesn't reduce planning time, and more compute doesn't mean faster planning. The fixes you try never touch the root cause.

Workarounds Backfire

Disabling optimizer rules can reduce compile time — but at the cost of runtime performance. Every workaround forces a tradeoff: faster planning vs. slower execution. TabbyDB eliminates this tradeoff.
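For reference, the workaround in stock Spark typically looks like the settings below in `spark-defaults.conf`. Both property keys are real Spark configuration options; the specific rule shown is one common culprit, used here only to illustrate the tradeoff:

```properties
# Workaround in stock Spark: turn off constraint propagation to cut
# planning time. Filters that would have been inferred are lost, so
# execution scans more data.
spark.sql.constraintPropagation.enabled   false

# Or exclude individual optimizer rules (comma-separated list):
spark.sql.optimizer.excludedRules   org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints
```

Either setting shortens planning at the direct cost of runtime filtering, which is exactly the tradeoff described above.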

The result: minutes to hours of delay, wasted compute, and missed SLAs — with no clear path to fix it in stock Spark.

The Solution

TabbyDB — Turbocharged Apache Spark

A strict superset of Apache Spark 4.0.1 and 4.1.1. Same APIs. Same clusters. Better engine. Drop in the jars and get back hours.

01
Compile Time

Intelligent Compile-Time Optimizations

Fundamental improvements to critical optimizer rules: optimized constraint propagation, early project collapsing during analysis, reduced Hive Metastore calls, and targeted rule application to avoid expensive tree traversals. Complex queries that took 8 hours now compile in minutes.
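To see why constraint propagation can dominate compile time, consider a toy model in plain Python (this is an illustration of the blowup pattern, not Spark internals or TabbyDB's fix): a single constraint over n columns, where each column also has one alias, must be rewritten for every combination of equivalent names, so the number of rewrites is 2^n.

```python
from itertools import product

def expand_constraint(cols, aliases):
    """All alias-combinations of one constraint over `cols`.
    `aliases` maps each column to its list of equivalent names
    (the column itself included)."""
    return [tuple(combo) for combo in product(*(aliases[c] for c in cols))]

cols = [f"c{i}" for i in range(12)]
aliases = {c: [c, c + "_alias"] for c in cols}   # one alias per column
variants = expand_constraint(cols, aliases)
print(len(variants))   # 2**12 = 4096 rewrites of a single constraint
```

With a handful of joins and aliased columns, this combinatorial growth is enough to push planning from seconds into hours, which is why the fix has to live in the optimizer rules themselves.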

02
Memory

Scalable Query Tree Management

Safely collapses project nodes early in the query lifecycle, preventing unbounded query plan growth and reducing memory pressure during compilation. Faster compilation and dramatically reduced risk of out-of-memory failures on deeply nested workloads.
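The effect of project collapsing can be sketched with a toy plan model (plain Python, not Spark's implementation; real collapsing must also substitute expressions that reference earlier aliases, which this simplified version ignores by treating each projection's columns as independent):

```python
# Toy model: a chain of withColumn-style projections normally nests one
# Project node per call; collapsing adjacent projections keeps the plan
# one node deep instead of a thousand.

def collapse_projects(plan):
    """Merge runs of consecutive ("project", mapping) nodes by combining
    their column mappings, left to right."""
    out = []
    for node in plan:
        if out and out[-1][0] == "project" and node[0] == "project":
            merged = dict(out[-1][1])
            merged.update(node[1])          # later projection wins
            out[-1] = ("project", merged)
        else:
            out.append(node)
    return out

# 1,000 chained projections over a single scan:
plan = [("scan", {"table": "t"})] + [
    ("project", {f"col{i}": f"expr{i}"}) for i in range(1000)
]
print(len(collapse_projects(plan)))   # 2: one scan + one merged project
```

Keeping the tree shallow is what bounds both plan memory and the cost of every later rule that walks the tree.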

03
Runtime

Advanced Broadcast Hash Join Handling

Dynamic file pruning using Broadcast Hash Join data on non-partitioned columns — the fix Spark never shipped. Pruning files for joins on non-partitioned columns cuts data scan time, yielding a 13% improvement on TPC-DS at 1TB and 2TB.
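The idea behind dynamic file pruning can be sketched in a few lines of plain Python (a conceptual model, not Spark's or TabbyDB's code): once the broadcast side of a join is materialized, its join keys bound which files on the probe side can possibly match, so files whose min/max key statistics miss every broadcast key are skipped without being read.

```python
def prune_files(files, broadcast_keys):
    """Keep only files whose [min, max] key range contains at least one
    key from the broadcast side of the join."""
    def may_match(f):
        return any(f["min"] <= k <= f["max"] for k in broadcast_keys)
    return [f for f in files if may_match(f)]

files = [
    {"path": "part-0", "min": 0,   "max": 99},
    {"path": "part-1", "min": 100, "max": 199},
    {"path": "part-2", "min": 200, "max": 299},
]
kept = prune_files(files, broadcast_keys={150, 160})
print([f["path"] for f in kept])   # ['part-1'] -- two of three files skipped
```

On large non-partitioned tables, skipping files this way is where the scan-time savings come from.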

04
Cache

Improved Cache Lookup Efficiency

Enhances how cached in-memory query plans are matched and reused, increasing the likelihood of successful cache hits. Higher cache reuse and lower execution overhead — especially for repeated or structurally similar queries.
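Why plan matching needs normalization can be shown with a toy example (plain Python, not Spark's canonicalization logic): two plans that differ only in operand order for a commutative operator should hit the same cache entry, but a lookup keyed on the raw tree misses them.

```python
def canonicalize(plan):
    """Normalize an (op, left, right) expression tree: recursively order
    the operands of commutative ops so equivalent trees become the same
    cache key."""
    if isinstance(plan, tuple):
        op, left, right = plan
        left, right = canonicalize(left), canonicalize(right)
        if op in {"+", "*", "=", "and", "or"} and repr(left) > repr(right):
            left, right = right, left
        return (op, left, right)
    return plan

cache = {canonicalize(("+", "a", "b")): "cached-result"}
# A structurally different but equivalent plan still hits the cache:
print(canonicalize(("+", "b", "a")) in cache)   # True
```

The more equivalent plans collapse to one key, the higher the cache hit rate for repeated or structurally similar queries.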

05
Iceberg

Apache Iceberg — 46%+ Faster

Early testing on 50GB non-partitioned Iceberg tables shows 46% improvement. Gains are expected to grow beyond 46% at 1TB–2TB scale. The Iceberg performance layer the ecosystem has been waiting for.

06
Deployment

Seamless Spark Compatibility. No Lock-In.

Maintains full compatibility with Apache Spark APIs, features, and tooling while delivering every performance improvement above. Replace Spark jars with TabbyDB jars. To revert: swap back. No code rewrites. No cluster changes. No disabling of optimizer rules.

Performance That Redefines Complex Query Execution

13%
TPC-DS Runtime Improvement

1TB & 2TB on AWS r6gd.4xlarge — non-partitioned Hive Parquet tables

46%+
Iceberg Performance Gain

50GB non-partitioned tables. Gains expected to increase at 1TB–2TB scale

8 hrs → mins
Compile Time Reduction

Complex DataFrame API queries — hours to compile, now minutes

TPC-DS gains shown on non-partitioned tables. Standard TPC-DS (partitioned) shows equivalent performance to stock Spark by design.

See the performance difference for yourself. Click below to open our Zeppelin notebooks, where you can run the same query on both Stock Spark and TabbyDB side by side. Note: running the Stock Spark paragraph may take 5–12 minutes.

Compare Performance Now →

TPC-DS Benchmark Results

1 TB TPC-DS Benchmark — 6 nodes AWS r6gd.4xlarge
2 TB TPC-DS Benchmark — 6 nodes AWS r6gd.4xlarge
50 GB TPC-DS Benchmark — Mac Mini M4, single node
Under the Hood

Spark SQL Modules Optimized in TabbyDB

TabbyDB Spark SQL processing pipeline
Stock Spark SQL processing pipeline
What We Fixed

20+ Apache Spark JIRA Issues — Resolved

Performance Issues
SPARK-33152: Constraint Propagation causing compile times to run into hours
SPARK-36786: Inefficiency in PushDownPredicates for complex expressions
SPARK-44662: Dynamic file pruning for non-partition column joins
SPARK-45373: Minimizing calls to HMS layer for repeated table references
SPARK-45866: Reuse of Exchange broken in AQE when runtime filters pushed down
SPARK-45959: Uncapped tree size in analysis phase — compilation runs into hours
SPARK-46671: Redundant filter creation from buggy Constraint Propagation rule
SPARK-47609: Cached Plan lookup may miss valid plans
SPARK-49618: Canonicalization differences in Union causing failure in re-use of exchange or cached plans
SPARK-49881: Minimizing cost of DeduplicateRelations in the analyzer
SPARK-54881: BooleanSimplification using transformExpressionsUp instead of Down — inefficient in some cases
SPARK-55072: Inferring new Constraint misses IsNotNull on Left Leg when Outer Join converts to Inner Join
SPARK-55110: Order of BooleanSimplification and SimplifyBinaryComparison rules is suboptimal for idempotency
Functional Issues
SPARK-47320: Self-join inconsistencies and exceptions
SPARK-47217: DeduplicateRelations may cause failure in plan resolution
SPARK-49727: Data loss when POJO Dataset converted to DataFrame and back
SPARK-49789: Exception encoding POJOs with generic type fields
SPARK-51016: Incorrect results during retry when join column is indeterminate
SPARK-45658: Canonicalization of DynamicPruningSubquery is broken
SPARK-53264: Incorrect nullability when correlated subquery converted to Left Outer Join
SPARK-55185: Adding InferFiltersFromConstraints to Optimization batch causes idempotency break
SPARK-55241: Idempotency of SQL Streaming with Joins broken when InferFiltersFromConstraints and PropagateEmptyRelation are added as Optimization rules
Why Choose TabbyDB

More Than a Faster Engine

TabbyDB is built for teams running complex, production-critical Apache Spark workloads, where query compilation time, optimizer behavior, and correctness matter as much as execution speed.

Engine-Level Enhancements

Improvements are made at the optimizer and execution engine level — not as application-layer patches. This means every workload benefits, without requiring any tuning or code changes.

Built for Stability & Correctness

Beyond raw performance, TabbyDB addresses 20+ long-standing Spark correctness issues — self-join inconsistencies, data loss in POJO conversions, broken idempotency in streaming joins. Stability and predictability matter as much as speed.

Compatible With Everything You Have

TabbyDB maintains full compatibility with existing Spark APIs, tooling, and workflows. No migration. No vendor lock-in. If you ever need to roll back, swap the jars.

Open to Collaboration

Partner With the Team Behind TabbyDB

If you are encountering functional or performance issues in Apache Spark — particularly within the SQL or optimizer layer — we're open to collaborating on solutions tailored to your workload or codebase.

Whether it's diagnosing a bottleneck, validating a fix, or contributing targeted improvements or customizations, we're happy to engage.

Get Started Today

Try TabbyDB on Your Actual Workloads

Speak with our experts to get a solution tailored to your business goals and data needs. We help you plan the right strategy for faster growth and better results.

Live in 15 Minutes. Revert in 30 Seconds.

01

Download TabbyDB jars (4.0.1 or 4.1.1)

02

Replace your existing Spark jars

03

Run your existing pipelines — unchanged

To revert to stock Spark: remove TabbyDB jars, restore Spark jars. No configuration changes. No code changes. No cluster changes.
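The three steps above, plus the rollback, can be sketched with Python's standard library. The paths here are hypothetical placeholders; in a real deployment they would point at your actual `$SPARK_HOME/jars` and the downloaded TabbyDB jar directory:

```python
import shutil
from pathlib import Path

def swap_jars(spark_jars: Path, tabbydb_jars: Path) -> Path:
    """Back up the stock jars directory, then install TabbyDB's jars."""
    backup = spark_jars.parent / (spark_jars.name + ".stock-backup")
    shutil.move(str(spark_jars), str(backup))            # preserve stock jars
    shutil.copytree(str(tabbydb_jars), str(spark_jars))  # drop in TabbyDB
    return backup

def revert(spark_jars: Path, backup: Path) -> None:
    """Restore stock Spark: remove TabbyDB jars, move the backup back."""
    shutil.rmtree(str(spark_jars))
    shutil.move(str(backup), str(spark_jars))
```

Because the swap is a directory rename plus a copy, the rollback is symmetric, which is what makes the "revert in 30 seconds" claim mechanical rather than a migration.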

Stop Waiting for Spark to Compile.

Download the trial. Run it on your actual queries. See the difference on your own cluster.

100% Apache Spark API compatible. No code changes. Full rollback in seconds.