Join algorithms

CrateDB supports various SQL join types, including:

CROSS JOIN
INNER JOIN
EQUI JOIN
LEFT JOIN
RIGHT JOIN
FULL JOIN

These joins are executed using either the nested loop join or the hash join algorithm. CrateDB applies additional optimizations based on the query structure and data distribution to ensure efficient execution at scale.

Nested Loop Join

The nested loop join is a simple and flexible join algorithm used for all join types except equi-joins.

How It Works

One table (relation) is designated as the outer relation, and the other as the inner relation. For each row in the outer relation, the algorithm compares it to each row in the inner relation. If the join condition is met, the combined row is added to the result set.

Pseudocode:

for each tuple l in L do
    for each tuple r in R do
        if l.a Θ r.b then
            put tuple(l, r) into result

Variants

1. Primitive Nested Loop Join

In simple cases, such as CROSS JOIN or joins on system tables (information_schema), the nested loop is executed directly on the handler node. Each shard sends relevant data to this node, which then executes the join, applies limits, and returns the results.

Nested joins are also supported, where input rows may result from previous joins or table functions rather than direct table scans.

2. Distributed Nested Loop Join

In most production scenarios, relations are distributed across nodes. CrateDB optimizes performance by broadcasting the smaller dataset to the nodes that hold the larger one.

Each receiving node performs a local nested loop join on its partition of the data. The results are then sent to the handler node, which merges and returns the final output to the client.

The smaller relation is broadcast to nodes holding the larger relation for parallel join execution.

Performance Optimizations

CrateDB can terminate nested loop joins early in the following cases:

ORDER BY allows short-circuiting when sorted rows are no longer needed.
LIMIT sets an upper bound on the number of rows.
For INNER JOIN and EQUI JOIN, conditions can be leveraged to avoid unnecessary comparisons.

These optimizations significantly reduce processing time and memory usage.

Hash Join

The hash join algorithm is used for EQUI JOINs and is typically more efficient than nested loops, especially for large datasets.

Basic Algorithm

All rows from the left relation are loaded into a hash table using a hash function on the join key.
Rows from the right relation are read one-by-one. For each row:
- The same hash function is applied.
- The resulting hash value is used to look up matches in the hash table.
- If matches are found and validated (in case of hash collisions), the combined row is added to the result set.

Hash table built on the left relation, scanned against the right relation.

Block Hash Join

Hash joins require the entire left relation to be stored in memory. When this isn't feasible, CrateDB uses a block hash join:

The left relation is divided into memory-sized blocks.
Each block is processed in memory, and the right relation is scanned repeatedly for each block.
The hash table is rebuilt for each iteration.

This strategy avoids memory overflow at the cost of scanning the right relation multiple times.

Memory-Aware Block Sizing

CrateDB dynamically determines block size based on:

Available memory on the node
Number and size of rows in the relation

Switch Tables Optimization

To reduce repeated scanning of large tables, CrateDB ensures the smaller table becomes the right relation in the join. If the planner detects that the original right relation is larger than the left, it switches them automatically for optimal performance.

Distributed Block Hash Join

CrateDB extends the hash join to operate in parallel across all nodes in a cluster:

A hash function is applied to rows from both relations.
A modulo operation on the hash value (based on the number of nodes) determines the destination node.
Each node receives a partition of both relations containing all candidate matches.
Each node performs a block hash join on its local partition.
Results are sent back to the handler node, which merges, sorts, and limits them as needed.

Hash-based partitioning distributes data evenly across nodes for parallel join processing.

This is the default algorithm used for equi-joins in CrateDB, except when joining on subqueries with LIMIT or OFFSET, in which case execution remains local.

PreviousJoin types NextJoin optimizations

Last updated 4 months ago

Good evening