Resiliency

CrateDB is designed for fault tolerance and high availability in real-world, distributed environments. Whether it’s a network issue, disk failure, or node outage, CrateDB is built to continue operating reliably and recover automatically whenever possible.

When deployed using best practices, CrateDB consistently delivers resilient, production-grade performance.

Monitoring Cluster Status

The CrateDB Admin UI includes a cluster status indicator, providing real-time insights into the health and stability of your deployment.

Green status: All shards are available, replicated, and not undergoing relocation. This is the ideal state.
Yellow status: Indicates elevated risk due to issues such as a node failure or temporary network outage. Some replica shards may be unassigned, or shards may be relocating, but the cluster remains operational.

Status information is updated automatically every few seconds, helping operators respond quickly if an issue arises.
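The same signal can also be read programmatically. CrateDB exposes per-table health in the `sys.health` system table, which reports `GREEN`, `YELLOW`, or `RED` for each table. The following is a minimal sketch, not an official client API: `overall_status` is a hypothetical helper that reduces rows shaped like the result of `HEALTH_QUERY` to a single status. Executing the query is left out and assumes some DB-API cursor connected to the cluster.

```python
# Query against CrateDB's sys.health table; run it with any DB-API cursor.
HEALTH_QUERY = (
    "SELECT table_schema, table_name, health, "
    "missing_shards, underreplicated_shards "
    "FROM sys.health ORDER BY severity DESC"
)

# Ranking of the health values from best to worst.
_RANK = {"GREEN": 0, "YELLOW": 1, "RED": 2}


def overall_status(rows):
    """Return the worst health value found in rows of HEALTH_QUERY.

    Each row is a sequence whose third element is the health string.
    An empty result (no tables) is treated as GREEN, which is an
    assumption of this sketch.
    """
    worst = "GREEN"
    for row in rows:
        health = row[2]
        if _RANK[health] > _RANK[worst]:
            worst = health
    return worst
```

A monitoring job could call `overall_status(cursor.fetchall())` after executing `HEALTH_QUERY` and alert on anything other than `GREEN`.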

Storage and Consistency

CrateDB does not offer full ACID transactional guarantees like traditional relational databases (e.g. MySQL). Instead, it favors high availability and eventual consistency, a trade-off common in distributed systems.

Writes are atomic at the row level.
Data changes are eventually consistent across nodes. This means that a row written on one node may not be immediately visible when queried from another node until the table refreshes (typically within one second).
Once you receive an INSERT OK, the data is stored successfully. However, due to asynchronous propagation, immediate reads may not reflect this on all nodes.

Applications should be built with this model in mind, especially if strong read-after-write consistency is required.
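Where a read must see a row that was just written, one option is to issue `REFRESH TABLE` after the write. The sketch below is illustrative only: `insert_and_refresh` is a hypothetical helper, and it assumes a DB-API cursor that uses the `?` placeholder style (as the `crate` Python client does).

```python
def insert_and_refresh(cursor, table, columns, values):
    """Insert one row, then force a refresh so later queries can see it.

    `cursor` is any DB-API cursor using the qmark ("?") parameter style.
    `table` and `columns` are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    cursor.execute(
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})",
        list(values),
    )
    # REFRESH TABLE makes changes written so far visible to queries.
    cursor.execute(f"REFRESH TABLE {table}")
```

An explicit refresh adds load, so it is best reserved for the few code paths that genuinely need read-after-write behavior rather than applied to every insert.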

For more details, see Clustering.

Deployment Strategies

Designing a resilient CrateDB deployment requires balancing availability, durability, and operational complexity. Here are best practices and considerations:

Horizontal Scaling: CrateDB is designed to scale out. Use SSDs, up to 64 GB of RAM, and multi-core CPUs for optimal performance. Add or remove nodes dynamically based on workload.
High Availability: To tolerate zone-level failures, deploy nodes across multiple availability zones (e.g. different data centers or regions). This improves fault tolerance and system uptime.
Replication: Increase the number of table replicas to enhance read performance and reduce the risk of data loss. Be aware this consumes more disk space and network bandwidth.
Disaster Recovery: Use snapshots to back up your data regularly. Snapshots can be stored in cold storage (e.g. object storage) for long-term recovery scenarios.
Resilient Ingestion Pipelines: When using CrateDB as part of a high-volume data pipeline, employ a message queue with backup and replay capabilities. This allows you to replay missed or failed writes in case of temporary outages.
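The replay idea in the last item can be sketched as follows. This is a simplified, in-memory stand-in: in a real pipeline the pending records would live in a durable message queue, and `ReplayingWriter` and its `write` callable are hypothetical names introduced only for illustration.

```python
from collections import deque


class ReplayingWriter:
    """Keeps records queued until they have been written successfully.

    `write` is any callable that stores one record and raises on failure,
    for example a function that runs an INSERT against CrateDB.
    """

    def __init__(self, write):
        self._write = write
        self._pending = deque()

    def submit(self, record):
        """Queue a record for writing."""
        self._pending.append(record)

    def flush(self):
        """Write pending records in order; stop at the first failure.

        Returns the number of records written. The failed record and
        everything after it stay queued, so a later flush() replays them.
        """
        written = 0
        while self._pending:
            record = self._pending[0]
            try:
                self._write(record)
            except Exception:
                break  # outage: keep this record and all later ones
            self._pending.popleft()
            written += 1
        return written

    @property
    def pending(self):
        """Number of records still waiting to be written."""
        return len(self._pending)
```

Because a replayed record may already have reached the database before the failure was reported, writes in such a pipeline should be idempotent, for example by keying rows on a primary key.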

This pipeline approach is also the recommended strategy for handling rare edge cases involving data inconsistency or loss due to hardware or network issues.
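For the replication and disaster recovery items in the list above, the corresponding SQL statements can be generated as shown in this sketch. The helper names, table name, and repository name are hypothetical. The snapshot statement assumes a repository has already been created with `CREATE REPOSITORY`.

```python
def set_replicas_sql(table, replicas):
    """Build an ALTER TABLE statement that changes the replica count.

    `replicas` may be an int (e.g. 2) or a range string such as "0-1".
    Range strings are quoted; integers are emitted as-is.
    """
    value = str(replicas) if isinstance(replicas, int) else f"'{replicas}'"
    return f"ALTER TABLE {table} SET (number_of_replicas = {value})"


def create_snapshot_sql(repository, snapshot):
    """Build a CREATE SNAPSHOT statement covering all tables.

    wait_for_completion makes the statement return only once the
    snapshot has finished.
    """
    return (
        f"CREATE SNAPSHOT {repository}.{snapshot} ALL "
        "WITH (wait_for_completion = true)"
    )
```

For example, `set_replicas_sql("doc.metrics", 2)` yields `ALTER TABLE doc.metrics SET (number_of_replicas = 2)`, which could be run from a scheduled job or a migration script.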


Last updated 3 months ago
