Bulk inserts
When importing large volumes of pre-existing data into CrateDB, using optimized bulk insert strategies can dramatically improve performance and reduce system load.
This guide covers best practices, configuration tips, and common pitfalls when performing high-throughput imports.
Key Recommendations (TL;DR)
- Temporarily reduce the number of replicas to 0
- Use only as many shards as necessary
- Disable table refreshes by setting refresh_interval = 0
- Tune bulk_size for an optimal balance of memory use and performance
- Partition tables and use the PARTITION clause during imports
- Use compressed input files where possible
1. Table Configuration
Define Your Table Structure
Before starting your import, define your table thoughtfully to minimize costly schema changes later. For example:
CREATE TABLE users (
  id INT PRIMARY KEY,
  name TEXT,
  day_joined TIMESTAMP,
  bio TEXT INDEX USING fulltext,
  address OBJECT (dynamic) AS (
    city TEXT,
    country TEXT
  )
);
Shards and Replicas
By default:
- Shards: automatically determined by the number of data nodes
- Replicas: set to 1

For bulk imports:
- Set number_of_replicas to 0 to eliminate redundant write overhead
- Choose the minimum number of shards necessary for your expected data volume (a good rule of thumb is 2 shards per node)

For example:
CREATE TABLE users (
  ...
) CLUSTERED INTO 12 SHARDS
WITH (number_of_replicas = 0);
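Once the import has finished and the data is verified, restore replication. For example, assuming a single replica suits your cluster:
-- restore replication after the import (one replica is an assumption; adjust to your needs)
ALTER TABLE users SET (number_of_replicas = 1);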
Disable Table Refresh During Import
The refresh_interval setting determines how frequently data is made searchable. Setting it to 0 disables automatic refreshes:
ALTER TABLE users SET (refresh_interval = 0);
Or set it at creation time:
CREATE TABLE users (
  ...
) CLUSTERED INTO 12 SHARDS
WITH (
  number_of_replicas = 0,
  refresh_interval = 0
);
After the import, reset refresh_interval to a reasonable value (e.g., 1000 ms):
ALTER TABLE users SET (refresh_interval = 1000);
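If imported rows need to be searchable immediately, rather than after the next automatic refresh, you can also trigger a refresh manually:
-- make all rows written so far visible to queries
REFRESH TABLE users;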
2. Importing the Data
Supported Format: JSON Lines
CrateDB supports JSON Lines format for bulk ingest. Each line represents a single JSON object:
{"id": 1, "name": "foo", ...}
{"id": 2, "name": "bar", ...}
Use COPY FROM to load the data:
COPY users FROM '/tmp/users.jsonl';
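COPY FROM also supports a RETURN SUMMARY clause, which reports per node how many rows succeeded and which failed, making it a convenient way to catch malformed lines early:
-- returns, per node and file, success_count, error_count, and error details
COPY users FROM '/tmp/users.jsonl'
RETURN SUMMARY;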
Bulk Size
You can control how many rows are read per batch using the bulk_size option:
COPY users FROM '/tmp/users.jsonl'
WITH (bulk_size = 2000);
- Default: 10,000 lines per batch
- Reduce bulk_size if your records are large or numerous
- A batch size that is too high may exhaust memory; one that is too low may increase overhead
Compression
CrateDB supports importing gzip-compressed JSON Lines:
COPY users FROM '/tmp/users.jsonl.gz'
WITH (compression = 'gzip');
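Multiple options can be combined in a single WITH clause, for example compression together with a custom bulk_size:
-- compressed input read in batches of 2,000 lines
COPY users FROM '/tmp/users.jsonl.gz'
WITH (compression = 'gzip', bulk_size = 2000);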
3. Partitioned Tables
Partitioning can improve performance by splitting large tables into manageable subsets.
Create a Partitioned Table
CREATE TABLE users (
  id INT PRIMARY KEY,
  name TEXT,
  day_joined TIMESTAMP PRIMARY KEY,
  ...
) CLUSTERED INTO 6 SHARDS
PARTITIONED BY (day_joined)
WITH (number_of_replicas = 0);
Import into Specific Partitions
Use the PARTITION clause with COPY FROM:
COPY users PARTITION (day_joined = 1408312800)
FROM '/tmp/users_1408312800.jsonl';
When using this clause:
- CrateDB skips resolving partition metadata for every row
- The JSON data must exclude the partition column:
{"id": 1, "name": "foo", ...}
{"id": 2, "name": "bar", ...}
4. Advanced Tuning
Adjusting Shards for Future Partitions
To change the number of shards used for new partitions:
ALTER TABLE users SET (number_of_shards = 12);
Refresh Settings for New Shards
If your table has existing partitions, you can disable refresh_interval for new ones only:
ALTER TABLE ONLY users SET (refresh_interval = 0);
This prevents newly created partitions from refreshing automatically during import.
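Note that ALTER TABLE ONLY changes the setting for the table itself (and hence future partitions) without touching existing partitions. After the import, restore the interval for the table and all of its partitions by omitting ONLY:
-- re-enable periodic refreshes on the table and every partition
ALTER TABLE users SET (refresh_interval = 1000);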
5. Cautions and Gotchas
- CrateDB does not cast data types during COPY FROM. Ensure JSON values match your schema.
- If a file is missing, CrateDB still reports success with COPY OK, 0 rows affected.
- Network file systems or inconsistencies between nodes can cause partial or silent failures. Validate your results post-import.
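A simple post-import sanity check is to refresh the table and compare the row count against the number of lines in your source files:
-- make all imported rows visible, then verify the total
REFRESH TABLE users;
SELECT COUNT(*) FROM users;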