Sizing

Before deploying CrateDB at scale, estimating your expected data volume and ingestion rate is essential to determine the correct cluster size.

Estimating Cluster Size

To plan your cluster, consider the following factors:

  1. Ingest Rate – How much data will be written per second? (e.g., 10K, 100K, 1M+ rows/sec)

  2. Data Retention – How long will the data be stored? (e.g., days, months, years)

  3. Storage Needs – How much storage is required per day, month, or year?

  4. Query Load – How many concurrent users and queries will the system handle?

Data Volume

Raw size

Based on the raw volume of data, you can start sizing the cluster. CrateDB uses Apache Lucene as its underlying indexing engine, which includes built-in compression mechanisms. Depending on the type of data you are storing in CrateDB, you can assume the following factors:

Data Format     CrateDB Storage Factor*
CSV             0.4x (40% of raw size)
JSON            0.5x (50% of raw size)
PARQUET         3.0x (3x raw size)

  • Disclaimer: The values presented here are based on typical data sets and should serve as general guidance. In some cases, results may differ depending on the characteristics of your specific data model.

Example Calculation

Let's assume you have 4 TiB of raw JSON data:

Storage in CrateDB = 4 TiB × 0.5 = 2 TiB

If using CSV, it would be:

Storage in CrateDB = 4 TiB × 0.4 = 1.6 TiB

If using Parquet, it would be:

Storage in CrateDB = 4 TiB × 3.0 = 12 TiB
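The same arithmetic can be expressed as a small helper. A minimal sketch in Python; the function name is illustrative and the factors simply restate the guidance table above:

```python
# Rough CrateDB storage estimate from raw data size and source format.
# The factors are the guidance values from the table above; actual results
# depend on your data model and should be validated with a test ingest.
STORAGE_FACTORS = {"csv": 0.4, "json": 0.5, "parquet": 3.0}

def cratedb_storage_tib(raw_tib: float, data_format: str) -> float:
    return raw_tib * STORAGE_FACTORS[data_format]

print(cratedb_storage_tib(4, "json"))     # 2.0 TiB
print(cratedb_storage_tib(4, "csv"))      # 1.6 TiB
print(cratedb_storage_tib(4, "parquet"))  # 12.0 TiB
```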

Retention

How long you plan to keep the data in CrateDB is also important for rightsizing your cluster. The ability to scale both horizontally and vertically is one of CrateDB's strengths.

Example Calculation

Assume:

  • Raw data = 4 TiB (JSON)

  • Retention = 12 months

  • New data = 4 TiB per month

  • Storage for retention = 2 TiB × 12 months = 24 TiB
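A minimal sketch of the retention arithmetic in Python, using the example figures above (4 TiB of raw JSON per month, kept for 12 months):

```python
# Storage needed to keep 12 months of data, assuming 4 TiB of raw JSON
# per month and the 0.5x JSON storage factor from the table above.
monthly_raw_tib = 4
json_factor = 0.5
retention_months = 12

storage_for_retention_tib = monthly_raw_tib * json_factor * retention_months
print(storage_for_retention_tib)  # 24.0 TiB
```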

Sharding

Each table is divided into a number of primary shards. Since each shard can be written to independently, aim for a shard count that allows enough write concurrency. Shards are distributed evenly across the nodes automatically.

When using partitioned tables, sharding is applied to each partition separately, meaning each partition will have its own set of primary shards. This ensures scalability and balanced distribution across nodes as partitions grow.

Shard Sizing Rules

Rule                        Recommendation
Shards per node             Soft limit of 1,000 per node
Shard size                  Optimal 10 GB to 50 GB per shard
Max documents per shard     Below 200 million per shard

For example, a 3-node cluster has a soft limit of 3 × 1,000 = 3,000 shards in total. If four tables are each configured with four shards and one replica, partitioned by month and retained for 10 years (120 partitions per table), the cluster holds 4 × 4 × 120 × 2 = 3,840 shards, or 1,280 per node, which already exceeds the per-node soft limit; the shard count or partition granularity would have to be reduced.
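The shard arithmetic from that example can be written out as a short sketch; the table count, shard count, retention, and node count are the assumptions stated above:

```python
import math

tables = 4
shards_per_partition = 4   # primary shards per partition
partitions = 12 * 10       # monthly partitions kept for 10 years
replication_factor = 2     # 1 primary + 1 replica
nodes = 3

total_shards = tables * shards_per_partition * partitions * replication_factor
shards_per_node = math.ceil(total_shards / nodes)
print(total_shards, shards_per_node)  # 3840 1280 -> above the 1,000-per-node soft limit
```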

Replicas

At least one replica should always be set up in production environments. This ensures availability and enables operations like rolling upgrades.

This is a replication factor of 2 (1 primary + 1 replica).
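The replica count is a per-table setting in CrateDB. A minimal sketch using the crate Python client; the connection URL and the table name doc.sensor_data are placeholders:

```python
from crate import client

# Connect to any node of the cluster (placeholder URL).
connection = client.connect("http://localhost:4200")
cursor = connection.cursor()

# Make sure every shard of a production table has at least one replica.
cursor.execute("ALTER TABLE doc.sensor_data SET (number_of_replicas = 1)")
connection.close()
```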

Buffer

CrateDB advises having at least a 10% buffer when sizing a cluster.

Example Full Calculation

For the calculation of cluster size and the number of nodes needed, we can use these formulas:

Total data volume = Raw data volume * raw format factor * replication factor * buffer

Number of nodes = ROUNDUP(Total data volume / maximum disk size)

Let's size a cluster for:

  • 10 TiB raw JSON data

  • Maximum disk size per node: 2 TiB

Total data volume = 10 TiB × 0.5 × 2 × 1.1 = 11 TiB

Number of nodes = ROUNDUP(11 / 2) = 6 nodes
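Both formulas can be combined into one small helper. A minimal sketch in Python; the default values mirror the example above (JSON factor 0.5, one replica, 10% buffer, 2 TiB of usable disk per node):

```python
import math

def cluster_size(raw_tib: float, format_factor: float = 0.5,
                 replication_factor: float = 2.0, buffer: float = 1.1,
                 max_disk_tib_per_node: float = 2.0) -> tuple[float, int]:
    """Return (total data volume in TiB, number of nodes needed)."""
    total_tib = round(raw_tib * format_factor * replication_factor * buffer, 2)
    nodes = math.ceil(total_tib / max_disk_tib_per_node)
    return total_tib, nodes

print(cluster_size(10))  # (11.0, 6) -> 11 TiB of data on 6 nodes
```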

By Ingest

Peak Ingest/Sec

The expected peak ingest rate (rows per second) has a significant impact on sizing.

As a rule of thumb, use the per-node throughput figures below as a starting point to calculate the minimum number of nodes needed.

Node type   Throughput in rows/s/node (no replication)   Throughput in rows/s/node (1 replica)
CR1         30k                                           18k
CR2         60k                                           36k
CR3         120k                                          72k
CR4         240k                                          144k
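A minimal sketch of the ingest-based node estimate, assuming CR3-class nodes with one replica (72k rows/s/node from the table) and a hypothetical peak of 500k rows/s:

```python
import math

peak_rows_per_sec = 500_000     # expected peak ingest rate (assumption)
rows_per_sec_per_node = 72_000  # CR3 node with 1 replica, from the table above

min_nodes_for_ingest = math.ceil(peak_rows_per_sec / rows_per_sec_per_node)
print(min_nodes_for_ingest)  # 7
```

In practice, the larger of this value and the node count from the storage-based calculation above is typically used as the starting cluster size.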

Avg row size

The number of inserts per second and the average row size affect the storage footprint, indexing performance, and memory usage; validate your assumptions with ingest tests.

Amount of (indexed) columns

CrateDB indexes everything by default, providing excellent query performance without requiring manual indexing. However, excessive indexing can impact storage, write speed, and resource utilization.

  • All columns are automatically indexed when inserted.

  • CrateDB optimizes indexing using Lucene-based columnar storage.

  • A soft limit of 1,000 indexed columns (or JSON fields) per table exists.

  • Going beyond this limit may impact performance.

For tables with many columns or object fields, determine explicitly which columns actually need to be indexed and disable indexing on the rest.
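Indexing is controlled per column in the table definition. A hedged sketch using the crate Python client: the table and column names are placeholders, and INDEX OFF disables indexing for a payload column that is only stored and read back, never filtered on:

```python
from crate import client

connection = client.connect("http://localhost:4200")  # placeholder URL
cursor = connection.cursor()

# Keep indexes on the columns you filter or aggregate on; switch indexing
# off for payload-style columns that are only ever read back.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS doc.readings (
        sensor_id TEXT,
        ts TIMESTAMP WITH TIME ZONE,
        value DOUBLE PRECISION,
        raw_payload TEXT INDEX OFF
    )
""")
connection.close()
```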

Batching

Batching reduces overhead by grouping many rows into a single insert statement or bulk request. Finding the right batch size for your cluster and workload is essential.
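A minimal batching sketch with the crate Python client, reusing the hypothetical doc.readings table from above. executemany() sends the whole batch as one bulk request; the batch size of 5,000 is only a starting point to tune for your own cluster and workload:

```python
from crate import client

connection = client.connect("http://localhost:4200")  # placeholder URL
cursor = connection.cursor()

# One example batch of 5,000 rows (in a real pipeline these come from your source).
rows = [("sensor-1", "2024-01-01T00:00:00Z", 21.5)] * 5_000

# A single bulk request instead of 5,000 individual INSERT statements.
cursor.executemany(
    "INSERT INTO doc.readings (sensor_id, ts, value) VALUES (?, ?, ?)",
    rows,
)
connection.close()
```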
