Rolling Upgrade

CrateDB supports rolling upgrades, which let you upgrade a cluster with zero downtime by upgrading one node at a time.

A rolling upgrade is possible only between compatible versions—typically between consecutive feature releases. Some examples:

  • ✅ You can do a rolling upgrade from X.5.z to X.6.0

  • ✅ You can do a rolling upgrade from the last X version release to the first (X+1) version release

  • ❌ You cannot upgrade directly from X.5.x to X.8.x unless explicitly stated in the release notes


How It Works

Rolling upgrades involve stopping and upgrading one node at a time using CrateDB’s graceful stop mechanism. This ensures ongoing operations complete before the node shuts down.

Graceful Stop Behavior

  • The node stops accepting new requests

  • It completes all in-progress operations

  • It then reallocates shards based on your availability configuration


Data Availability Options

CrateDB offers three levels of minimum data availability during a graceful stop, configurable via the cluster.graceful_stop.min_availability setting:

  • full: All primary and replica shards are moved off the node before it stops. The cluster stays green.

  • primaries: Only primary shards are moved off the node; replicas stay. The cluster may go yellow.

  • none: No guarantees; the node stops even if data becomes temporarily unavailable. The cluster may go red.
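
For example, to request full availability for the next graceful stop, the setting can be changed at runtime with the same SET GLOBAL TRANSIENT syntax used later in this procedure:

SET GLOBAL TRANSIENT "cluster.graceful_stop.min_availability" = 'full';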


Requirements

For full Minimum Availability

Your cluster must have enough nodes and disk space to hold the full replica count even after one node shuts down.

  • Rule of thumb: number_of_nodes > max_number_of_replicas + 1

Examples:

  • If a table has 1 replica, you need at least 3 nodes

  • If a table allows a range of replicas (e.g., 0-1), CrateDB uses the maximum number for allocation logic

If the requirements are not met, the graceful stop will fail.
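
As a quick sanity check, you can compare the current node count against the replica settings of your tables (a sketch; note that number_of_replicas may be returned as text containing a range such as '0-1'):

-- Current number of nodes in the cluster
SELECT count(*) AS nodes FROM sys.nodes;

-- Replica settings per table
SELECT table_schema, table_name, number_of_replicas
FROM information_schema.tables
WHERE table_schema IN ('blob', 'doc')
ORDER BY table_schema, table_name;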

For primaries Minimum Availability

  • Ensure that enough shards remain to maintain write consistency

  • By default, CrateDB requires a quorum of active shards: quorum = floor(replicas / 2) + 1
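
For example, a table configured with 2 replicas has 3 copies of each shard; writes need floor(2 / 2) + 1 = 2 active copies, so one copy can be offline without blocking writes.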


Rolling Upgrade Procedure

Step 1: Disable Allocations (Optional)

To prevent CrateDB from reallocating shards while nodes are offline, temporarily restrict routing:

SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'new_primaries';

Skip this step if you are using min_availability = full, as CrateDB will handle shard movement internally.
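
To confirm the change took effect, you can inspect the runtime settings exposed by sys.cluster (a quick check, assuming the setting is reported under the settings column):

SELECT settings['cluster']['routing']['allocation']['enable'] AS allocation_enable
FROM sys.cluster;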


Step 2: Gracefully Stop the Node

Use the DECOMMISSION SQL command to initiate a graceful shutdown:

  • Moves shards off the node according to the min_availability setting

  • Ensures ongoing operations complete before the node shuts down
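
For example, using ALTER CLUSTER DECOMMISSION (assuming the node to be stopped is named 'your_node_name', the same placeholder used in the monitoring queries below):

ALTER CLUSTER DECOMMISSION 'your_node_name';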

Avoid stopping nodes using TERM signals (e.g., Ctrl+C or systemctl stop) unless you want a non-graceful shutdown.

Monitor Reallocation

You can track shard reallocation progress with:

-- Remaining shards on the node
SELECT count(*) AS remaining_shards
FROM sys.shards
WHERE _node['name'] = 'your_node_name';

-- Detailed view
SELECT schema_name AS schema, table_name AS "table", id, "primary", state
FROM sys.shards
WHERE _node['name'] = 'your_node_name' AND schema_name IN ('blob', 'doc')
ORDER BY schema, "table", id, "primary", state;

-- Tables with 0 replicas (primaries that will be moved)
SELECT table_schema AS schema, table_name AS "table"
FROM information_schema.tables
WHERE number_of_replicas = 0 AND table_schema IN ('blob', 'doc')
ORDER BY schema, "table";

Note: When using the Admin UI, you may briefly see a red cluster state during shutdown. This is usually a UI timing artifact, not an actual failure.


Step 3: Upgrade CrateDB

Once the node is stopped, perform the upgrade using your preferred method.

Examples:

Tarball:

Download and extract the new release tarball, then point your installation path (or symlink) at the new version, keeping your existing configuration and data directories.

RHEL/YUM:

yum update -y crate

Refer to your OS or package manager documentation for specific upgrade instructions.


Step 4: Restart the Node

After upgrading, restart CrateDB:

Tarball:

/path/to/bin/crate

RHEL/YUM:

service crate start
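
After the node has restarted, you can verify that it rejoined the cluster and reports the new version:

SELECT name, version['number'] AS version
FROM sys.nodes
ORDER BY name;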

Step 5: Repeat for All Nodes

Repeat steps 2–4 for each remaining node in your cluster.


Step 6: Re-enable Allocations

Once all nodes are upgraded and running:

SET GLOBAL TRANSIENT "cluster.routing.allocation.enable" = 'all';
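
You can then confirm that all tables are fully replicated again by checking sys.health (every row should report GREEN):

SELECT health, count(*) AS num
FROM sys.health
GROUP BY health;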

Final Notes

  • Always test the upgrade in a staging environment first

  • Monitor logs and metrics during and after each upgrade step

  • Consider enabling alerts for cluster health changes during upgrades

  • If using snapshots, verify their validity before beginning the upgrade (see the example below)
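
One way to do the snapshot check is to list recent snapshots and their state via sys.snapshots (a sketch; a healthy snapshot reports the state SUCCESS):

SELECT repository, name, state, finished
FROM sys.snapshots
ORDER BY finished DESC
LIMIT 10;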
