Sizing
Before deploying CrateDB at scale, estimating your expected data volume and ingestion rate is essential to determine the correct cluster size.
Estimating Cluster Size
To plan your cluster, consider the following factors:
Ingest Rate – How much data will be written per second? (e.g., 10K, 100K, 1M+ rows/sec)
Data Retention – How long will the data be stored? (e.g., days, months, years)
Storage Needs – How much storage is required per day, month, or year?
Query Load – How many concurrent users and queries will the system handle?
Data Volume
Raw size
Based on the raw volume of data, you can start sizing the cluster. CrateDB uses Apache Lucene as its underlying indexing engine, which includes built-in compression mechanisms. Depending on the type of data you are storing in CrateDB, you can assume the following factors:
CSV – 0.4x (40% of raw size)
JSON – 0.5x (50% of raw size)
Parquet – 3.0x (300% of raw size)
Disclaimer: The values presented here are based on typical data sets and should serve as general guidance. In some cases, results may differ depending on the characteristics of your specific data model.
Example Calculation
Let's assume you have 4 TiB of raw JSON data:
Storage in CrateDB = 4 TiB × 0.5 = 2 TiB
If using CSV, it would be:
Storage in CrateDB = 4 TiB × 0.4 = 1.6 TiB
If using Parquet, it would be:
Storage in CrateDB = 4 TiB × 3.0 = 12 TiB
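The same estimate can be expressed in a few lines of Python; this is only a sketch, and the factors are the guidance values from the table above, not measurements of your data.

```python
# Approximate storage factors relative to raw data size (guidance values only).
STORAGE_FACTORS = {"csv": 0.4, "json": 0.5, "parquet": 3.0}

def storage_in_cratedb(raw_tib: float, fmt: str) -> float:
    """Estimate the storage footprint in CrateDB (TiB) for a given raw volume."""
    return raw_tib * STORAGE_FACTORS[fmt]

print(storage_in_cratedb(4, "json"))     # 2.0 TiB
print(storage_in_cratedb(4, "csv"))      # 1.6 TiB
print(storage_in_cratedb(4, "parquet"))  # 12.0 TiB
```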
Retention
How long you plan to keep the data in CrateDB is also important for rightsizing your cluster. Scaling horizontally and vertically is one of the advantages and strengths of using CrateDB.
Example Calculation
Assume:
Raw data = 4 TiB (JSON)
Retention = 12 months
New data = 4 TiB per month
Storage for retention = 2 TiB stored per month × 12 months = 24 TiB
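The retention requirement is simply added on top of the per-month estimate; a minimal sketch, reusing the JSON factor from above:

```python
RAW_PER_MONTH_TIB = 4    # new raw JSON data per month
JSON_FACTOR = 0.5        # storage factor for JSON (guidance value)
RETENTION_MONTHS = 12

stored_per_month = RAW_PER_MONTH_TIB * JSON_FACTOR          # 2.0 TiB
print(stored_per_month * RETENTION_MONTHS)                   # 24.0 TiB
```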
Sharding
Each table is split into a configurable number of primary shards. Because each shard can be written to independently, aim for a number that provides enough write concurrency. Shards are distributed evenly across the nodes automatically.
When using partitioned tables, sharding is applied to each partition separately, meaning each partition will have its own set of primary shards. This ensures scalability and balanced distribution across nodes as partitions grow.
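To make this concrete, the sketch below creates a monthly partitioned table with four primary shards per partition and one replica, using the CrateDB Python client. The table name, columns, and connection URL are illustrative assumptions, not part of the sizing guidance.

```python
from crate import client  # CrateDB Python DB API client

# Hypothetical table: 4 primary shards per partition, partitioned by month, 1 replica.
DDL = """
CREATE TABLE IF NOT EXISTS sensor_readings (
    ts TIMESTAMP WITH TIME ZONE NOT NULL,
    month TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('month', ts),
    sensor_id TEXT,
    value DOUBLE PRECISION
) CLUSTERED INTO 4 SHARDS
  PARTITIONED BY (month)
  WITH (number_of_replicas = 1)
"""

connection = client.connect("http://localhost:4200")
cursor = connection.cursor()
cursor.execute(DDL)
connection.close()
```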
Shard Sizing Rules
Shards per node – soft limit of 1,000 shards per node
Shard size – optimally 10 GB to 50 GB per shard
Documents per shard – should stay below 200 million per shard
For example, the soft limit for a 3-node cluster is 3,000 shards in total. If four tables are each configured with four shards and one replica, partitioned by month, and retained for 10 years, this results in 4 tables × 4 shards × 2 (primary + replica) × 120 monthly partitions = 3,840 shards in total, or 960 shards per node on a 4-node cluster.
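The same shard budget can be checked with a quick calculation; a sketch under the assumptions stated above (four tables, four shards, one replica, monthly partitions, 10 years of retention, four nodes):

```python
tables = 4
shards_per_partition = 4
replication_factor = 2    # 1 primary + 1 replica
partitions = 12 * 10      # monthly partitions over 10 years
nodes = 4

total_shards = tables * shards_per_partition * replication_factor * partitions
print(total_shards)           # 3840 shards in total
print(total_shards / nodes)   # 960.0 shards per node
```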
Replicas
At least one replica should always be set up in production environments. This ensures availability and enables operations like rolling upgrades.
This is a replication factor of 2 (1 primary + 1 replica).
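If a table was created without replicas, the replica count can be raised afterwards; a minimal sketch, reusing the hypothetical table from above:

```python
from crate import client

connection = client.connect("http://localhost:4200")
cursor = connection.cursor()
# Ensure at least one replica per shard (replication factor 2 in total).
cursor.execute("ALTER TABLE sensor_readings SET (number_of_replicas = 1)")
connection.close()
```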
Buffer
CrateDB advises having at least a 10% buffer when sizing a cluster.
Example Full Calculation
For the calculation of cluster size and the number of nodes needed, we can use these formulas:
Total data volume = Raw data volume * raw format factor * replication factor * buffer
Number of nodes = ROUNDUP(Total data volume / maximum disk size per node)
Let's size a cluster for:
10 TiB raw JSON data
Total data volume = 10 TiB × 0.5 (JSON factor) × 2 (replication) × 1.1 (buffer) = 11 TiB
Number of nodes = ROUNDUP(11 TiB / 2 TiB per node) = 6 nodes
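The two formulas translate directly into code; a sketch of the sizing worksheet, where the 2 TiB of usable disk per node is an assumption of the example:

```python
import math

def total_data_volume(raw_tib: float, format_factor: float,
                      replication_factor: int = 2, buffer: float = 1.1) -> float:
    """Total storage to provision, including replicas and a 10% safety buffer."""
    return raw_tib * format_factor * replication_factor * buffer

def number_of_nodes(total_tib: float, disk_per_node_tib: float) -> int:
    """Minimum node count so the data fits on the cluster's disks."""
    return math.ceil(total_tib / disk_per_node_tib)

volume = total_data_volume(10, 0.5)                  # 10 TiB raw JSON data
print(round(volume, 2), number_of_nodes(volume, 2))  # 11.0 TiB -> 6 nodes
```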
By Ingest
Peak Ingest/Sec
The expected peak ingest rate in rows per second is a significant sizing factor.
As a rule of thumb, use the following per-node throughput figures as a starting point to calculate the minimum number of nodes needed.
CR1 – 30k rows/s/node without replication, 18k rows/s/node with 1 replica
CR2 – 60k rows/s/node without replication, 36k rows/s/node with 1 replica
CR3 – 120k rows/s/node without replication, 72k rows/s/node with 1 replica
CR4 – 240k rows/s/node without replication, 144k rows/s/node with 1 replica
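These per-node figures can be turned into a rough node count; a sketch using the table values above, where the target ingest rate and node type are assumptions chosen for the example:

```python
import math

# Rule-of-thumb throughput per node in rows/s, taken from the table above.
THROUGHPUT = {
    "CR1": {"no_replica": 30_000, "one_replica": 18_000},
    "CR2": {"no_replica": 60_000, "one_replica": 36_000},
    "CR3": {"no_replica": 120_000, "one_replica": 72_000},
    "CR4": {"no_replica": 240_000, "one_replica": 144_000},
}

def nodes_for_ingest(peak_rows_per_sec: int, node_type: str, replicated: bool = True) -> int:
    """Minimum node count needed to sustain the peak ingest rate."""
    per_node = THROUGHPUT[node_type]["one_replica" if replicated else "no_replica"]
    return math.ceil(peak_rows_per_sec / per_node)

print(nodes_for_ingest(500_000, "CR3"))  # 7 nodes at 72k rows/s/node with 1 replica
```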
Avg row size
The number of inserts per second and the average row size affect the storage footprint, indexing performance, and memory usage; validate your assumptions with ingest tests.
Number of (indexed) columns
CrateDB indexes everything by default, providing excellent query performance without requiring manual indexing. However, excessive indexing can impact storage, write speed, and resource utilization.
All columns are automatically indexed when inserted.
CrateDB optimizes indexing using Lucene-based columnar storage.
A soft limit of 1,000 indexed columns (or JSON fields) per table exists.
Going beyond this limit may impact performance.
In cases with many fields and columns, it is advised to determine which columns actually need to be indexed and to disable indexing where it is not needed.
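When a column is only ever returned and never searched, indexing can be switched off for it at table-creation time. A minimal sketch, with the table and column names as illustrative assumptions:

```python
from crate import client

# Hypothetical wide table: keep the filter columns indexed, turn indexing
# off for a large payload column that is returned but never searched.
DDL = """
CREATE TABLE IF NOT EXISTS device_events (
    device_id TEXT,
    ts TIMESTAMP WITH TIME ZONE,
    payload TEXT INDEX OFF
)
"""

connection = client.connect("http://localhost:4200")
cursor = connection.cursor()
cursor.execute(DDL)
connection.close()
```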
Batching
Batching reduces overhead by grouping multiple inserts into a single statement or request. It is essential to find the right batch size for your cluster and workload.
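A common pattern is to use the DB API's executemany() with a fixed batch size and measure how throughput changes as the batch size grows. A sketch, reusing the hypothetical table from above; the batch size of 5,000 is only a starting point:

```python
from crate import client

BATCH_SIZE = 5_000  # starting point; tune per cluster and workload

def insert_batched(rows):
    """Insert rows in fixed-size batches instead of one statement per row."""
    connection = client.connect("http://localhost:4200")
    cursor = connection.cursor()
    stmt = "INSERT INTO sensor_readings (ts, sensor_id, value) VALUES (?, ?, ?)"
    for start in range(0, len(rows), BATCH_SIZE):
        cursor.executemany(stmt, rows[start:start + BATCH_SIZE])
    connection.close()
```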