Testing Insert Performance

To test insert performance in CrateDB, we recommend using the cr8 command-line tool. This tool is purpose-built for benchmarking CrateDB and provides a simple yet powerful interface to run insert performance tests.

Why Start Small?

Insert performance generally scales linearly with cluster size. Therefore, you should begin by testing performance on a single node. Once you establish a baseline, you can expand your tests to include larger clusters and measure scalability.

Step 1: Prepare Your Test Data

Option 1: Use Real Data

If you already have a table containing real data, it’s best to test insert performance using a representative sample of that data.

  1. Export the data using COPY TO:

    COPY my_table TO DIRECTORY '/tmp/crate';

    You should see output similar to:

    COPY OK, 1000000 rows affected (... sec)
  2. Capture the table schema:

    SHOW CREATE TABLE my_table;

    Use the output to create a new test table (e.g., my_table_test) that mirrors the production table.

Option 2: Generate Fake Data

If you don’t have real data available, you can generate synthetic data using cr8 insert-fake-data.

  1. Create a test table:

CREATE TABLE my_table_test (
  id INTEGER,
  name TEXT
);
  1. Insert fake data:

cr8 insert-fake-data \
  --hosts localhost:4200 \
  --table my_table_test \
  --num-records 1000000

This command automatically detects the table schema and inserts 1 million records of appropriate fake data. You can change the --num-records value to match your needs.

  1. Export the inserted fake data:

COPY my_table_test TO DIRECTORY '/tmp/crate';
  1. Clear the table (to prepare for performance testing):

DELETE FROM my_table_test;

Step 2: Run the Insert Performance Test

Use cr8 insert-json to replay the exported JSON data and measure insert performance.

cat /tmp/crate/my_table_*.json | cr8 insert-json \
  --hosts localhost:4200 \
  --table my_table_test \
  --bulk-size 1000 \
  --concurrency 25

The flags --bulk-size and --concurrency are optional. If omitted, they default to 1000 and 25, respectively.

Example Output

Executing inserts: bulk_size=1000 concurrency=25
1000 requests [00:35, 27.84 requests/s]
Runtime (in ms):
    mean:    103.556 ± 3.957
    min/max: 11.587 → 521.434
Percentile:
    50:   89.764 ± 63.851 (stdev)
    95:   220.739
    99.9: 475.568

Step 3: Analyze and Iterate

With performance test results in hand:

  • Adjust the --bulk-size and --concurrency values to observe how performance changes.

  • Plot results (e.g., throughput vs. concurrency) to visualize trends.

  • Experiment with different node setups or configurations (e.g., refresh intervals, indexing strategies, replicas).

  • Repeat the test after making changes to get a comprehensive performance profile.

Additional Tips

  • Avoid measuring insert and data generation at the same time; always generate data first, then insert.

  • When benchmarking, ensure the cluster is not under additional load from other operations.

  • Use --dry-run with cr8 commands to preview what the tool will do before executing a full test.

By following this methodical approach, you'll gain a deeper understanding of your system’s insert performance under different conditions and be better equipped to tune it for production workloads.

Last updated