Storage and consistency
This page explains how CrateDB stores and distributes data across the cluster, and the consistency and durability guarantees it provides.
Data Storage
All CrateDB tables are automatically sharded: each table is split into smaller parts (shards) that are distributed across the nodes in the cluster. Each shard is a Lucene index made up of immutable segments stored on the file system under one of the node’s configured data directories.
Lucene follows an append-only write model, meaning it never mutates data in place. This enables efficient replication and recovery: a shard can be synchronized simply by fetching new data starting from a known marker.
Each table can be configured with zero or more replicas. Every replica maintains a fully synchronized copy of its corresponding primary shard.
Read operations can be routed to either the primary or any of the replicas. By default, CrateDB distributes read load randomly across available shards. You can fine-tune this behavior for multi-zone deployments (see the multi-zone best practices guide).
Write operations follow a coordinated and synchronous pattern:
The cluster state is consulted to locate the primary and all active replicas.
The operation is routed to the primary shard.
The operation is executed on the primary.
If successful, the change is sent to all replicas for parallel execution.
Once all replicas have acknowledged, the result is returned.
If any replica fails to apply the change (or times out), it is marked as unavailable.
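The write steps above can be sketched as a small Python model. This is only an illustration under stated assumptions, not CrateDB's implementation: the `ShardCopy` class, its `healthy` flag, and the `write` function are hypothetical names, and replicas are updated one after another here rather than in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class ShardCopy:
    name: str
    healthy: bool = True                 # hypothetical flag to simulate a failing replica
    docs: dict = field(default_factory=dict)

    def apply(self, doc_id, doc):
        if not self.healthy:
            raise IOError(f"{self.name} failed to apply the write")
        self.docs[doc_id] = doc


def write(primary, replicas, doc_id, doc):
    """Apply a write to the primary, then to every active replica."""
    primary.apply(doc_id, doc)           # execute on the primary first
    still_active = []
    for replica in replicas:             # sequential here; parallel in the real system
        try:
            replica.apply(doc_id, doc)
            still_active.append(replica)
        except IOError:
            pass                         # a failing replica leaves the active set
    return still_active                  # returned once every replica has answered


primary = ShardCopy("primary")
replicas = [ShardCopy("replica-1"), ShardCopy("replica-2", healthy=False)]
active = write(primary, replicas, "42", {"name": "Arthur"})
print([r.name for r in active])
```

In this model the write still succeeds when one replica fails; the failing copy is simply no longer treated as active, which mirrors the last point above.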
Document-Level Atomicity
Each row in CrateDB is stored as a semi-structured document, with support for deeply nested fields via objects and arrays.
Write operations are atomic at the document level: they either complete entirely or not at all, regardless of the size or complexity of the document.
Durability
Every shard in CrateDB maintains a write-ahead log (WAL), also known as the translog. This ensures that operations are persisted to disk before being written into Lucene’s index segments.
The translog is periodically flushed to Lucene. After flushing, the translog is cleared.
In the event of an unclean shutdown, CrateDB replays the translog to recover any operations that had not yet been flushed.
When a new replica is initialized, the translog is transferred directly from the primary. This avoids the need to flush segments just for replica recovery.
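The interplay of logging, flushing, and replay can be modelled in a few lines of Python. This is a toy sketch with hypothetical names (`Shard`, `buffer`, `crash_and_recover`): lists and dictionaries stand in for the on-disk translog and Lucene segments, and it makes no claim about how CrateDB structures these internally.

```python
class Shard:
    """Toy model: a translog list plus a dict standing in for committed segments."""

    def __init__(self):
        self.translog = []     # modelled as durable: survives a crash
        self.segments = {}     # data committed by the last flush (also survives)
        self.buffer = {}       # in-memory state, lost on an unclean shutdown

    def write(self, doc_id, doc):
        self.translog.append((doc_id, doc))   # log the operation first
        self.buffer[doc_id] = doc             # then apply it in memory

    def flush(self):
        self.segments.update(self.buffer)     # commit buffered changes
        self.buffer.clear()
        self.translog.clear()                 # logged operations are no longer needed

    def crash_and_recover(self):
        self.buffer = {}                      # unclean shutdown: memory is gone
        for doc_id, doc in self.translog:     # replay operations not yet flushed
            self.buffer[doc_id] = doc

    def get(self, doc_id):
        return self.buffer.get(doc_id, self.segments.get(doc_id))


shard = Shard()
shard.write("1", {"v": 1})
shard.flush()                  # document 1 is now committed, translog is empty
shard.write("2", {"v": 2})     # document 2 exists only in memory and the translog
shard.crash_and_recover()
```

After recovery both documents are readable: the first from the committed data, the second because it was replayed from the log.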
Document Addressing
Documents in CrateDB are assigned a unique internal identifier:
If a primary key is defined, it is used to generate the document ID.
Otherwise, an auto-generated ID is assigned during insertion.
Each document is routed to a specific shard based on a routing column:
By default, the primary key is used as the routing column.
Alternatively, you can define a custom routing column using the CLUSTERED BY clause when creating the table.
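To illustrate how a routing value can determine a shard, here is a minimal Python sketch. It assumes a simple hash-modulo scheme with CRC32 as a stand-in hash; the hash function CrateDB actually uses is not stated on this page, and `shard_for`, `NUM_SHARDS`, and the `tenant` column are hypothetical.

```python
import zlib


def shard_for(routing_value, num_shards):
    """Map a routing value to a shard number with a stable hash (CRC32 as a stand-in)."""
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % num_shards


NUM_SHARDS = 4
rows = [{"id": i, "tenant": f"tenant-{i % 2}"} for i in range(6)]

# Default behaviour in this model: route by primary key.
by_pk = [shard_for(row["id"], NUM_SHARDS) for row in rows]

# Custom routing column: all rows of one tenant land on the same shard.
by_tenant = [shard_for(row["tenant"], NUM_SHARDS) for row in rows]
```

The point of the second variant is co-location: rows that share a routing value always map to the same shard, whatever their primary key.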
Access Methods
CrateDB supports two methods to access documents:
Get (direct access by ID): The most efficient option. Used when the routing key and ID are fully specified (e.g., via a complete primary key match). Only one shard is accessed, and a fast index lookup is performed.
Search (query across shards): Used when matching based on field values. Involves scanning all candidate shards.
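The difference between the two access methods can be sketched with a toy in-memory model in Python. All names here (`shards`, `shard_for`, `insert`, `get`, `search`) are hypothetical, and the CRC32-modulo routing is an assumption made for the example rather than CrateDB's actual scheme.

```python
import zlib

NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]   # each shard maps document ID -> document


def shard_for(doc_id):
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def insert(doc_id, doc):
    shards[shard_for(doc_id)][doc_id] = doc


def get(doc_id):
    """Direct access: the ID identifies one shard, followed by one key lookup."""
    return shards[shard_for(doc_id)].get(doc_id)


def search(predicate):
    """Query by field values: every shard has to be scanned."""
    return [doc for shard in shards for doc in shard.values() if predicate(doc)]


for i in range(10):
    insert(f"user-{i}", {"id": i, "even": i % 2 == 0})

one = get("user-3")
evens = sorted(doc["id"] for doc in search(lambda d: d["even"]))
```

A get touches exactly one shard regardless of cluster size, while the cost of a search grows with the number of shards that may hold matching rows.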
Consistency Model
Search Consistency
CrateDB provides eventual consistency for search operations:
Searches are executed on Lucene IndexReaders, which are bound to specific segments.
Changes only become visible to search after a refresh. This happens periodically or can be triggered manually using the REFRESH command.
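The effect of a refresh on visibility can be modelled as a snapshot that searches read from. The following Python sketch uses hypothetical names (`SearchableShard`, `indexed`, `visible`) and assumes that direct lookups by ID are not subject to the refresh cycle, as the caveats on this page imply; it is not a description of Lucene's reader internals.

```python
class SearchableShard:
    def __init__(self):
        self.indexed = {}    # everything written so far
        self.visible = {}    # snapshot that searches read from

    def write(self, doc_id, doc):
        self.indexed[doc_id] = doc

    def refresh(self):
        self.visible = dict(self.indexed)    # open a new snapshot for searches

    def search(self, predicate):
        return [d for d in self.visible.values() if predicate(d)]

    def get(self, doc_id):
        return self.indexed.get(doc_id)      # assumed: lookup by ID sees the latest write


shard = SearchableShard()
shard.write("1", {"city": "Berlin"})

before = shard.search(lambda d: d["city"] == "Berlin")   # written, but not yet refreshed
direct = shard.get("1")
shard.refresh()
after = shard.search(lambda d: d["city"] == "Berlin")
```

Here `before` is empty while `direct` and `after` both contain the document, which is the eventual-consistency behaviour described above.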
Write Consistency
Writes are synchronously replicated to all active replicas.
All replicas contain the same data once the write completes.
Reads from any replica yield consistent results, as long as the replicas have gone through the same refresh cycle.
Query Behavior Caveats
In specific cases, get queries may fall back to a full scan (collect operator):
When the WHERE clause contains complex filters or large lists of values for composite primary keys.
In these cases the query behaves like a search, so recently written data may not be visible until a REFRESH has occurred.
Example:
SELECT * FROM t
WHERE pk1 IN (<large_list>) AND pk2 = 42 AND pk3 = 'value';
Cluster Metadata and State
CrateDB maintains cluster-wide metadata in the cluster state, including:
Table schemas
Primary and replica shard mappings
Shard status (e.g., active, initializing, unassigned)
Node discovery and health information
Configuration settings
Master Node Election
Every node holds a full copy of the cluster state, but only one node (the master) can apply updates to it. The master node:
Receives change requests (e.g., ALTER TABLE)
Applies the change locally
Broadcasts the updated cluster state to all nodes
Each node then applies the update and takes any required local actions.
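The update flow can be sketched in Python as a single writer that publishes versioned state to every node. This is a simplified model with hypothetical names (`Node`, `Master`, `update`); the version counter and the rule for ignoring stale publications are assumptions made for the example, and election and failure handling are left out entirely.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.state = {"version": 0, "tables": {}}

    def apply(self, new_state):
        # Assumed rule: only accept a state newer than the one already held.
        if new_state["version"] > self.state["version"]:
            self.state = new_state


class Master(Node):
    def update(self, nodes, table, schema):
        """Handle a change request such as a schema change for one table."""
        tables = {**self.state["tables"], table: schema}
        new_state = {"version": self.state["version"] + 1, "tables": tables}
        self.apply(new_state)        # apply the change locally
        for node in nodes:           # broadcast the new state to all nodes
            node.apply(new_state)


master = Master("node-1")
others = [Node("node-2"), Node("node-3")]
master.update(others, "t", {"pk1": "INTEGER"})
master.update(others, "t", {"pk1": "INTEGER", "pk2": "INTEGER"})
```

Because only the master creates new versions, every node ends up holding the same state without having to coordinate with its peers.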
For more on this, see the master election documentation.