Circuit breaker

CrateDB’s circuit breakers protect cluster stability by preventing queries and background processes from exhausting a node’s memory. They do this by estimating the memory required for each operation and aborting the operation before the JVM heap is exhausted.


What Is a Circuit Breaker?

A circuit breaker is a software safeguard designed to halt operations when resource usage crosses a critical threshold. The concept is similar to a household fuse box: if too many appliances draw power from a single line, the circuit trips to prevent damage. In a software system, the stressed resource might be memory, CPU, file descriptors, or even external services.

In CrateDB, the primary resource under pressure is RAM. Queries often run in parallel across many shards. A single complex aggregation or JOIN can allocate gigabytes of memory in milliseconds. CrateDB’s circuit breakers detect this and proactively stop the query by throwing a CircuitBreakingException, preventing an out-of-memory crash that could bring down the node.


How Circuit Breakers Work in CrateDB

Each query in CrateDB is executed as a series of logical operations. Before executing each step, CrateDB performs a best-effort estimate of the additional memory required. If the projected usage exceeds the configured limit for a circuit breaker, the operation is aborted immediately, and a CircuitBreakingException is returned.

This preemptive behavior protects the JVM from reaching an unrecoverable out-of-memory state that would destabilize the node or cluster.
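The estimate-then-check behavior described above can be illustrated with a small toy model. This is a conceptual sketch only, not CrateDB's implementation: the ToyBreaker class, the CircuitBreakingError name, and the byte values are all hypothetical and chosen purely for illustration.

```python
class CircuitBreakingError(Exception):
    """Hypothetical stand-in for CrateDB's CircuitBreakingException."""


class ToyBreaker:
    """Toy memory breaker: reserve an estimate before doing the work."""

    def __init__(self, name, limit_bytes):
        self.name = name
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def reserve(self, estimate_bytes, label):
        # Check the projected usage before any memory is actually allocated.
        projected = self.used_bytes + estimate_bytes
        if projected > self.limit_bytes:
            raise CircuitBreakingError(
                f"Allocating {estimate_bytes} bytes for '{self.name}: {label}' "
                f"failed, breaker would use {projected} bytes in total. "
                f"Limit is {self.limit_bytes} bytes."
            )
        self.used_bytes = projected

    def release(self, estimate_bytes):
        # Hand the reservation back once the step has finished.
        self.used_bytes = max(0, self.used_bytes - estimate_bytes)


# Illustrative numbers only (assumption): a 100-byte limit.
breaker = ToyBreaker("query", limit_bytes=100)
breaker.reserve(60, "collect")  # 60 <= 100, so the reservation succeeds

try:
    breaker.reserve(50, "mergeOnHandler")  # 60 + 50 = 110 > 100, so it trips
    tripped = False
except CircuitBreakingError:
    tripped = True
```

The point of the sketch is that the rejected step never gets its reservation: after the failed call, the breaker still accounts only for the first 60 bytes.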


Types of Circuit Breakers

CrateDB includes six types of circuit breakers, each guarding a specific component or resource:

  • query – Tracks memory used during query execution.

  • request – Covers memory used during request handling.

  • jobs_log – Tracks memory used when writing to the jobs log.

  • operations_log – Tracks memory used when writing to the operations log.

  • total – Also known as the parent breaker, it tracks overall memory usage across all other breakers.

  • accounting – Deprecated; will be removed in a future version.

The total breaker acts as a global safety net. Even if every individual breaker is within its own limit, the total breaker can still trip when their combined usage exceeds the overall limit (indices.breaker.total.limit).
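A small worked example shows how the total breaker can trip while every child breaker stays within its own limit. This is a simplified sketch: the parent_would_trip helper and all numbers are hypothetical, and it assumes the total is a plain sum of the child breakers' usage.

```python
def parent_would_trip(child_usage, total_limit):
    """Return True if the combined child usage exceeds the total limit."""
    return sum(child_usage.values()) > total_limit


# Hypothetical values in megabytes (assumption, not CrateDB defaults).
child_limits = {"query": 600, "request": 600, "jobs_log": 50, "operations_log": 50}
child_usage = {"query": 500, "request": 450, "jobs_log": 40, "operations_log": 40}
total_limit = 950

# Every child breaker is below its own limit ...
each_child_ok = all(child_usage[name] <= child_limits[name] for name in child_usage)

# ... but the combined usage (500 + 450 + 40 + 40 = 1030) exceeds 950.
total_tripped = parent_would_trip(child_usage, total_limit)
```

In this scenario no single breaker would reject a request on its own, yet the combined usage is already past the total limit.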

For details on configuring breaker limits, see the cluster settings documentation.


Monitoring and Observability

To monitor circuit breaker behavior:

  • Use JMX metrics. Refer to the JMX Monitoring Guide, particularly the CircuitBreakers MXBean section.

  • For hosted deployments, follow the CrateDB Cloud monitoring documentation.

  • For self-managed clusters, refer to the on-prem monitoring guide, which includes setup instructions for collecting metrics and visualizing them in Grafana.


Exception Handling

When a circuit breaker is triggered, CrateDB returns a CircuitBreakingException. For example:

CircuitBreakingException[Allocating 2mb for 'query: mergeOnHandler' failed, breaker would use 976.4mb in total. Limit is 972.7mb. Either increase memory and limit, change the query or reduce concurrent query load]

Interpreting the Error

This exception indicates that allocating a further 2 MB for the mergeOnHandler operation would have raised the query breaker's total usage to 976.4 MB, above its configured limit of 972.7 MB (indices.breaker.query.limit). As a result, CrateDB aborted the operation to protect the node.
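For logging or alerting on the client side, the numbers in such a message can be extracted with a regular expression. The sketch below is built only from the example message shown above; the parse_breaker_message helper is hypothetical, the message format may differ between CrateDB versions, and treating "mb" as 1024 * 1024 bytes is an assumption.

```python
import re

# Assumption: size suffixes are binary multiples.
UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}

# Pattern derived from the single example message in this document.
PATTERN = re.compile(
    r"Allocating (?P<req>[\d.]+)(?P<req_unit>[kmg]?b) for '(?P<label>[^']+)' failed, "
    r"breaker would use (?P<use>[\d.]+)(?P<use_unit>[kmg]?b) in total\. "
    r"Limit is (?P<lim>[\d.]+)(?P<lim_unit>[kmg]?b)\."
)


def parse_breaker_message(message):
    """Return the breaker name, operation, and sizes in bytes, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    # The label looks like "<breaker>: <operation>".
    breaker, _, operation = match["label"].partition(": ")
    requested = float(match["req"]) * UNITS[match["req_unit"]]
    would_use = float(match["use"]) * UNITS[match["use_unit"]]
    limit = float(match["lim"]) * UNITS[match["lim_unit"]]
    return {
        "breaker": breaker,
        "operation": operation,
        "requested_bytes": requested,
        "would_use_bytes": would_use,
        "limit_bytes": limit,
        "over_limit_bytes": would_use - limit,
    }


message = (
    "CircuitBreakingException[Allocating 2mb for 'query: mergeOnHandler' failed, "
    "breaker would use 976.4mb in total. Limit is 972.7mb. Either increase memory "
    "and limit, change the query or reduce concurrent query load]"
)
info = parse_breaker_message(message)
```

For the example message, the parsed breaker is query, the operation is mergeOnHandler, and the projected usage is about 3.7 MB over the limit.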

Immediate Actions

1. Optimize the Query

Poorly written or overly complex queries can trigger breakers. See the Performance tuning guide for practical tips.

2. Identify High-Memory Queries

You can identify the most memory-intensive active queries by running:

SELECT  js.id,
        stmt,
        username,
        sum(used_bytes) AS sum_bytes
FROM sys.operations op
JOIN sys.jobs js ON op.job_id = js.id
GROUP BY js.id, stmt, username
ORDER BY sum_bytes DESC;

To inspect completed jobs and operations, use the sys.jobs_log and sys.operations_log system tables. Note that table access permissions apply.

3. Scale the Cluster

If breakers continue to trip even after optimizing queries, consider scaling out your cluster to provide additional resources.

Similar exceptions can occur for other breakers, with labels such as [request], [parent], or [jobs_log] in the message. A [parent] exception comes from the total (parent) breaker: the combined memory usage of queries and background tasks exceeded the overall limit (indices.breaker.total.limit).


Summary

Circuit breakers are an essential safety mechanism in CrateDB, helping maintain performance and reliability under high memory pressure. By monitoring breaker metrics, tuning queries, and scaling resources as needed, you can avoid unexpected interruptions and ensure smooth cluster operation.
