Storage and consistency

This page explains how CrateDB stores and distributes data across the cluster, and the consistency and durability guarantees it provides.

CrateDB builds on core technologies from Elasticsearch and Lucene. As a result, many of the concepts outlined here will be familiar to users of Elasticsearch.

Data Storage

All CrateDB tables are automatically sharded, that is, split into smaller parts (shards) that are distributed across the nodes of the cluster. Each shard is a Lucene index made up of immutable segments stored on the file system under one of the node’s configured data directories.

Lucene follows an append-only write model, meaning it never mutates data in place. This enables efficient replication and recovery: a shard can be synchronized by fetching only the data written after a known checkpoint.

Each table can be configured with zero or more replicas. Every replica maintains a fully synchronized copy of its corresponding primary shard.
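
As a sketch, the shard and replica counts are set per table at creation time, and the replica count can be changed later; the table name and columns here are illustrative:

```sql
-- Hypothetical table: 4 primary shards, each with 1 replica
-- (8 shard copies in total).
CREATE TABLE metrics (
    ts TIMESTAMP,
    value DOUBLE PRECISION
) CLUSTERED INTO 4 SHARDS
WITH (number_of_replicas = 1);

-- The replica count can be adjusted on a live table:
ALTER TABLE metrics SET (number_of_replicas = 2);
```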

  • Read operations can be routed to either the primary or any of the replicas. By default, CrateDB distributes read load randomly across available shards. You can fine-tune this behavior for multi-zone deployments (see the multi-zone best practices guide).

  • Write operations follow a coordinated and synchronous pattern:

    1. The cluster state is consulted to locate the primary and all active replicas.

    2. The operation is routed to the primary shard.

    3. The operation is executed on the primary.

    4. If successful, the change is sent to all replicas for parallel execution.

    5. Once all replicas have acknowledged, the result is returned.

    If any replica fails to apply the change (or times out), it is marked as unavailable.


Document-Level Atomicity

Each row in CrateDB is stored as a semi-structured document, with support for deeply nested fields via objects and arrays.

Write operations are atomic at the document level: they either complete entirely or not at all, regardless of the size or complexity of the document.

CrateDB does not support multi-statement transactions. To manage concurrent updates, use the document versioning system and apply patterns like Optimistic Concurrency Control (OCC).
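
A minimal OCC sketch, assuming a hypothetical accounts table and using the _seq_no and _primary_term system columns (older CrateDB versions expose _version for the same purpose):

```sql
-- Read the row together with its versioning metadata.
SELECT _seq_no, _primary_term, balance
FROM accounts
WHERE id = 1;

-- Suppose the read returned _seq_no = 7 and _primary_term = 1.
-- The update only succeeds if the row is unchanged since then;
-- a row count of 0 signals a conflict, and the caller retries.
UPDATE accounts
SET balance = balance - 100
WHERE id = 1
  AND _seq_no = 7
  AND _primary_term = 1;
```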


Durability

Every shard in CrateDB maintains a write-ahead log (WAL), also known as the translog. This ensures that operations are persisted to disk before being written into Lucene’s index segments.

  • The translog is periodically flushed: pending operations are committed to Lucene segments on disk, after which the corresponding translog entries are removed.

  • In the event of an unclean shutdown, CrateDB replays the translog to restore any in-flight operations.

  • When a new replica is initialized, the translog is transferred directly from the primary. This avoids the need to flush segments just for replica recovery.
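
How aggressively the translog is fsynced can be tuned per table. A sketch, assuming the translog.* setting names inherited from the Elasticsearch lineage and an illustrative table named metrics:

```sql
-- Fsync the translog on an interval instead of on every request,
-- trading a small durability window for higher write throughput.
ALTER TABLE metrics SET (
    "translog.durability" = 'ASYNC',
    "translog.sync_interval" = '5s'
);
```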


Document Addressing

Documents in CrateDB are assigned a unique internal identifier:

  • If a primary key is defined, it is used to generate the document ID.

  • Otherwise, an auto-generated ID is assigned during insertion.

Each document is routed to a specific shard based on a routing column:

  • By default, the primary key is used as the routing column.

  • Alternatively, you can define a custom routing column using the CLUSTERED BY clause when creating the table.
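
For example, routing by a column other than the primary key keeps related rows on the same shard; table and columns here are illustrative:

```sql
-- All rows with the same user_id land on the same shard, so queries
-- filtered by user_id only need to touch a single shard.
CREATE TABLE events (
    user_id TEXT,
    ts TIMESTAMP,
    payload OBJECT
) CLUSTERED BY (user_id) INTO 6 SHARDS;
```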

Access Methods

CrateDB supports two methods to access documents:

  • Get (direct access by ID): The most efficient option. Used when the routing key and ID are fully specified (e.g., via a complete primary key match). Only one shard is accessed, and a fast index lookup is performed.

  • Search (query across shards): Used when matching based on field values. Involves scanning all candidate shards.
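
Assuming a hypothetical users table with id as its primary key, the two access paths look like this:

```sql
-- Get: the primary key is fully specified, so the query is routed
-- to exactly one shard and answered by a direct index lookup.
SELECT * FROM users WHERE id = 42;

-- Search: filtering on a non-key column scans all candidate shards.
SELECT * FROM users WHERE name = 'Alice';
```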


Consistency Model

Search Consistency

CrateDB provides eventual consistency for search operations:

  • Searches are executed on Lucene IndexReaders, which are bound to specific segments.

  • Changes only become visible to search after a refresh. This happens periodically or can be triggered manually using the REFRESH command.

For immediate visibility, use get queries with a complete primary key. These read directly from the translog and always reflect the most recent changes.
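
A sketch of the visibility difference, again assuming a users table keyed by id:

```sql
INSERT INTO users (id, name) VALUES (43, 'Bob');

-- Get by complete primary key: sees the row immediately,
-- because it can read from the translog.
SELECT * FROM users WHERE id = 43;

-- A search on a regular column may miss the row until the next
-- refresh; forcing one makes the row visible:
REFRESH TABLE users;
SELECT * FROM users WHERE name = 'Bob';
```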

Write Consistency

  • Writes are synchronously replicated to all active replicas.

  • All replicas contain the same data once the write completes.

  • Reads from any replica yield consistent results, provided the shard copies reflect the same refresh state.

Note: CrateDB nodes ping the master node every second (by default). If a node becomes isolated (e.g., due to a network partition), it continues to serve queries from its local, possibly stale shard copies until the failed pings reveal the isolation. During this short window, stale or not-yet-replicated data may be returned ("dirty reads"). Once the node detects that it is isolated, it stops accepting operations.

Query Behavior Caveats

In specific cases, get queries may fall back to a full scan (collect operator):

  • When the WHERE clause contains complex filters or large lists of values for composite primary keys.

  • In these cases, data may not be visible until a REFRESH has occurred.

Example:

SELECT * FROM t
WHERE pk1 IN (<large_list>) AND pk2 = 42 AND pk3 = 'value';

Cluster Metadata and State

CrateDB maintains cluster-wide metadata in the cluster state, including:

  • Table schemas

  • Primary and replica shard mappings

  • Shard status (e.g., active, initializing, unassigned)

  • Node discovery and health information

  • Configuration settings
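
Much of this state can be inspected through the sys tables; for example (the column selection here is a sketch):

```sql
-- One row per shard copy, including its allocation state
-- (e.g. STARTED, INITIALIZING, UNASSIGNED).
SELECT table_name, id, "primary", state
FROM sys.shards;
```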

Master Node Election

Every node holds a full copy of the cluster state, but only one node (the master) can apply updates to it. The master node:

  1. Receives change requests (e.g., ALTER TABLE)

  2. Applies the change locally

  3. Broadcasts the updated cluster state to all nodes

  4. Each node then applies the update and takes any required local actions

Master eligibility is automatic: any node can become master by default. If the current master goes offline, CrateDB automatically elects a new one using a quorum-based algorithm to prevent split-brain scenarios.
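
The outcome of the election can be inspected via the sys.cluster table (assuming its standard schema):

```sql
-- The id of the node currently acting as master.
SELECT name, master_node FROM sys.cluster;
```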

For more on this, see the master election documentation.
