Bulk inserts

When importing large volumes of pre-existing data into CrateDB, using optimized bulk insert strategies can dramatically improve performance and reduce system load.

This guide covers best practices, configuration tips, and common pitfalls when performing high-throughput imports.

Key Recommendations (TL;DR)

  • Temporarily reduce the number of replicas to 0

  • Use only as many shards as necessary

  • Disable table refreshes by setting refresh_interval = 0

  • Tune bulk_size for optimal memory and performance balance

  • Partition tables and use the PARTITION clause during imports

  • Use compressed input files where possible

Tip: Upgrading your disks (e.g., to SSDs) or adding more nodes will almost always improve bulk insert speed. CrateDB scales horizontally.

1. Table Configuration

Define Your Table Structure

Before starting your import, define your table thoughtfully to minimize costly schema changes later. For example:

CREATE TABLE users (
  id INT PRIMARY KEY,
  name TEXT,
  day_joined TIMESTAMP,
  bio TEXT INDEX USING fulltext,
  address OBJECT (dynamic) AS (
    city TEXT,
    country TEXT
  )
);

Shards and Replicas

By default:

  • Shards: Automatically determined by the number of data nodes

  • Replicas: Set to 1

For bulk imports:

  • Set number_of_replicas to 0 to eliminate redundant write overhead

  • Choose the minimum number of shards necessary for your expected data volume (a good rule of thumb is two shards per node)

CREATE TABLE users (
  ...
) CLUSTERED INTO 12 SHARDS
  WITH (number_of_replicas = 0);

Disable Table Refresh During Import

The refresh_interval determines how frequently data is made searchable. Setting it to 0 disables automatic refreshes:

ALTER TABLE users SET (refresh_interval = 0);

Or set it at creation time:

CREATE TABLE users (
  ...
) CLUSTERED INTO 12 SHARDS
  WITH (
    number_of_replicas = 0,
    refresh_interval = 0
  );

After the import, reset refresh_interval to a reasonable value (e.g., 1000 ms):

ALTER TABLE users SET (refresh_interval = 1000);

2. Importing the Data

Supported Format: JSON Lines

CrateDB supports JSON Lines format for bulk ingest. Each line represents a single JSON object:

{"id": 1, "name": "foo", ...}
{"id": 2, "name": "bar", ...}

Use COPY FROM to load the data:

COPY users FROM '/tmp/users.jsonl';

CrateDB reads the file locally on each node. Ensure the file exists at the specified path on every node.
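As a sketch, a JSON Lines file matching the users table above could be generated with Python's standard library (the sample values are illustrative):

```python
import json

# Illustrative sample rows matching the users table sketched above.
users = [
    {"id": 1, "name": "foo", "day_joined": 1408312800000,
     "bio": "Likes databases.", "address": {"city": "Berlin", "country": "DE"}},
    {"id": 2, "name": "bar", "day_joined": 1408312800000,
     "bio": "Writes docs.", "address": {"city": "Vienna", "country": "AT"}},
]

# JSON Lines: one JSON object per line, no enclosing array, no trailing commas.
with open("/tmp/users.jsonl", "w", encoding="utf-8") as f:
    for user in users:
        f.write(json.dumps(user) + "\n")
```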

On Windows, include the full drive path (e.g., C://tmp/users.jsonl).

Bulk Size

You can control how many rows are read per batch using the bulk_size option:

COPY users FROM '/tmp/users.jsonl'
  WITH (bulk_size = 2000);

  • Default: 10,000 lines per batch

  • Reduce bulk_size if individual records are large

  • A batch size that’s too high may exhaust memory; one that’s too low may increase overhead
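One rough way to pick a starting bulk_size is to work backwards from an average serialized row size and a per-batch memory budget. The helper and numbers below are illustrative assumptions, not CrateDB APIs:

```python
def suggest_bulk_size(avg_row_bytes, memory_budget_bytes, cap=10_000):
    """Rough starting point: keep one batch within a fixed memory budget,
    never exceeding the default of 10,000 lines per batch."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return max(1, min(cap, memory_budget_bytes // avg_row_bytes))

# Example: ~2 KiB rows with a 16 MiB per-batch budget -> 8192 rows per batch.
print(suggest_bulk_size(2048, 16 * 1024 * 1024))
```

Treat the result as a first guess and adjust based on observed memory usage and throughput.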

Compression

CrateDB supports importing gzip-compressed JSON Lines:

COPY users FROM '/tmp/users.jsonl.gz'
  WITH (compression = 'gzip');

CrateDB does not auto-detect compression; you must set it explicitly.
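As a sketch, a gzip-compressed JSON Lines file can be written directly with Python's standard library (the rows and path are illustrative):

```python
import gzip
import json

rows = [{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}]

# gzip.open in text mode writes a compressed JSON Lines file directly,
# ready for COPY ... WITH (compression = 'gzip').
with gzip.open("/tmp/users.jsonl.gz", "wt", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```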

3. Partitioned Tables

Partitioning can improve performance by splitting large tables into manageable subsets.

Create a Partitioned Table

CREATE TABLE users (
  id INT PRIMARY KEY,
  name TEXT,
  day_joined TIMESTAMP PRIMARY KEY,
  ...
) CLUSTERED INTO 6 SHARDS
  PARTITIONED BY (day_joined)
  WITH (number_of_replicas = 0);

If a PRIMARY KEY is defined, the partition column must be included in it.

Import into Specific Partitions

Use the PARTITION clause with COPY FROM:

COPY users PARTITION (day_joined = 1408312800)
  FROM '/tmp/users_1408312800.jsonl';

When using this clause:

  • CrateDB skips resolving partition metadata for every row

  • The JSON data must exclude the partition column

{"id": 1, "name": "foo", ...}
{"id": 2, "name": "bar", ...}

Avoid importing partitioned data without the PARTITION clause; it can hurt performance.
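A common preparation step is to split the source data into one file per partition value, stripping the partition column from each record since the PARTITION clause supplies it at import time. A minimal sketch with illustrative rows and paths:

```python
import json
from collections import defaultdict

# Illustrative rows that still contain the partition column.
rows = [
    {"id": 1, "name": "foo", "day_joined": 1408312800},
    {"id": 2, "name": "bar", "day_joined": 1408312800},
    {"id": 3, "name": "baz", "day_joined": 1408399200},
]

# Group rows by partition value and drop the partition column,
# since COPY ... PARTITION (day_joined = ...) provides it.
partitions = defaultdict(list)
for row in rows:
    day = row.pop("day_joined")
    partitions[day].append(row)

# One JSON Lines file per partition, named after the partition value.
for day, part_rows in partitions.items():
    with open(f"/tmp/users_{day}.jsonl", "w", encoding="utf-8") as f:
        for row in part_rows:
            f.write(json.dumps(row) + "\n")
```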

4. Advanced Tuning

Adjusting Shards for Future Partitions

To change the number of shards used for new partitions:

ALTER TABLE users SET (number_of_shards = 12);

Too many shards can degrade performance. Aim for a balance between the data volume per shard and the number of nodes in your cluster.

Refresh Settings for New Shards

If your table has existing partitions, you can disable refresh_interval for new ones only:

ALTER TABLE ONLY users SET (refresh_interval = 0);

This prevents newly created partitions from refreshing automatically during import.

5. Cautions and Gotchas

  • CrateDB does not cast data types during COPY FROM. Ensure JSON values match your schema.

  • If a file is missing, CrateDB still reports success with COPY OK, 0 rows affected.

  • Network file systems or inconsistencies between nodes can cause partial or silent failures. Validate your results post-import.
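For the last point, one simple sanity check is to compare the number of records in the source file against the row count reported by SELECT COUNT(*) after a refresh. Counting the source records is straightforward; the helper and path below are illustrative:

```python
def count_jsonl_records(path):
    """Count non-empty lines in a JSON Lines file: the expected row count."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Write a tiny sample file, then count its records.
with open("/tmp/users_check.jsonl", "w", encoding="utf-8") as f:
    f.write('{"id": 1, "name": "foo"}\n{"id": 2, "name": "bar"}\n')

print(count_jsonl_records("/tmp/users_check.jsonl"))  # expected: 2
```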
