Infrastructure
Understanding the architecture of a CrateDB cluster is crucial when right-sizing it. CrateDB employs a distributed shared-nothing architecture, meaning each node in the cluster operates independently without relying on shared storage or central coordination. This allows for high availability, horizontal scalability, and fault tolerance, making it an optimal choice for workloads requiring real-time analytics, machine learning, or high-throughput data ingestion.
A shared-nothing database architecture ensures that:
Each node owns its compute and storage resources, eliminating bottlenecks caused by centralized data control.
The system can scale horizontally by adding new nodes, with CrateDB handling data redistribution automatically.
Nodes fail independently, meaning the cluster remains operational even if individual nodes go offline.
CrateDB simplifies cluster management by handling sharding and replication out of the box:
Sharding: Each table in CrateDB is automatically broken into shards, which are distributed across nodes for parallel execution of queries. Sharding optimizes performance by:
Distributing the data load across nodes.
Enabling efficient read and write parallelism.
Allowing queries to be executed across multiple shards simultaneously, reducing query response time.
Replication: CrateDB replicates shards across multiple nodes to provide high availability and data redundancy. If a node fails, its replica shard is automatically promoted to ensure uninterrupted operation. This also boosts read performance, as multiple replicas can serve queries.
Storage
CrateDB is designed for high-volume concurrent reads and writes, making SSDs the best choice for most workloads. The key reasons include:
High IOPS (Input/Output Operations Per Second): SSDs can handle CrateDB's parallel query execution and large-scale indexing far better than HDDs.
Low Latency: Ensures quick access to data, making them ideal for real-time analytics and search queries.
Fast Random Read/Write Performance: Columnar storage benefits significantly from fast random access speeds.
Efficient Sharding & Replication: Reduces the performance overhead caused by CrateDB's distributed architecture.
Compute
CrateDB is designed for distributed, high-performance query execution, meaning CPU selection and optimization are crucial for cluster efficiency. The compute power of each node directly impacts query execution speed, indexing performance, sharding efficiency, and concurrency handling.
As a rule of thumb:
Use high-core count CPUs (8+ cores for production clusters)
Prioritize CPUs with high clock speeds (3.0 GHz or higher)
Memory
RAM is critical in ensuring high performance, particularly for query execution, indexing, and caching. Since CrateDB is built on Java (JVM), memory needs to be allocated wisely between:
JVM Heap Memory (for processing)
OS File System Cache (for Memory Mapped Indexes and disk operations)
As a rule of thumb, you should not allocate more than 25% of the total RAM to the CrateDB_Heap.
Network
Latency (we cannot defy the laws of nature, so make sure you think about the geographical location of the cluster and the other components in the "pipeline"). And if you run in the cloud (whether full or self-managed), try to minimize latency by setting up a private link between the tenants. The CrateDB Cloud team can assist you with that.
Last updated