Ever had your Elasticsearch cluster suddenly throw a CircuitBreakerException
and leave you scrambling to figure out what went wrong? This error might look simple on the surface, but it’s actually pointing to critical memory management issues that can cripple your search infrastructure.
Let’s dive into the real causes behind these exceptions and walk through proven solutions that actually work in production environments.
1. What is CircuitBreakerException?
Elasticsearch’s circuit breaker acts like an electrical circuit breaker in your home—it’s a safety mechanism designed to prevent catastrophic failures. When Elasticsearch estimates that an operation might consume more memory than available, it throws a CircuitBreakerException
instead of risking an OutOfMemoryError that could crash the entire node.
Think of it as Elasticsearch saying: “I’d rather reject this request than let the whole system go down.”
Typical error message:
{
  "error": {
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [16355096754/15.2gb], which is larger than the limit of [16213167308/15gb]",
    "bytes_wanted": 16355096754,
    "bytes_limit": 16213167308
  },
  "status": 429
}
2. Circuit Breaker Types and Their Limits
Elasticsearch operates several types of circuit breakers, each serving a specific purpose:
Types and default limits:
| Circuit Breaker Type | Default Limit | Purpose |
|---|---|---|
| Parent Breaker | 95% of JVM heap | Controls total memory usage across all breakers |
| Request Breaker | 60% of JVM heap | Limits memory for request processing (aggregations, etc.) |
| Fielddata Breaker | 40% of JVM heap | Controls fielddata cache memory usage |
| In-flight Requests | 100% of JVM heap | Manages memory for ongoing requests |
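If you want to confirm the limits actually in force on your cluster (defaults plus any overrides), you can pull them from the cluster settings API. A quick sketch, assuming the same localhost:9200 endpoint used throughout this post:
# List every breaker-related setting, defaults included
curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep breaker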
3. Diagnosing the Root Cause
Before jumping into solutions, you need to understand what’s actually happening. Here’s how to get the real picture:
Check current memory usage:
# Get heap usage per node
curl -X GET "localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max"
# Detailed circuit breaker stats
curl -X GET "localhost:9200/_nodes/stats/breaker"
Examine fielddata usage:
# Check fielddata memory consumption
curl -X GET "localhost:9200/_nodes/stats/indices/fielddata"
Review slow logs for problematic queries:
# Check for resource-intensive queries
tail -f /var/log/elasticsearch/elasticsearch_index_search_slowlog.log
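Keep in mind that search slow logs stay empty unless thresholds are configured on the index. A minimal way to turn them on, where the index name and thresholds are just examples:
# Log queries slower than 5s at warn, 2s at info
curl -X PUT "localhost:9200/your_index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s"
}'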
4. Immediate Emergency Fixes
When your cluster is down and users are complaining, here’s what you do first:
4-1. Increase JVM Heap Size
The most direct approach is expanding available memory. The golden rule: give Elasticsearch no more than 50% of the machine's RAM, and keep the heap below roughly 31GB so the JVM can still use compressed object pointers.
For Docker deployments:
# docker-compose.yml
services:
  elasticsearch:
    environment:
      ES_JAVA_OPTS: "-Xmx16g -Xms16g"  # Increased from 8g
For traditional installations:
# Edit jvm.options
-Xms16g
-Xmx16g
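After restarting the node, it is worth confirming the new heap size actually took effect; this just reuses the _cat/nodes endpoint from the diagnostics section:
# heap.max should now report the new size
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent"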
4-2. Temporarily Raise Circuit Breaker Limits
Quick fix without restart:
# Increase the request circuit breaker
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.breaker.request.limit": "65%"
  }
}'

# Adjust the parent circuit breaker (its default is 95% of heap when
# use_real_memory is true, 70% otherwise)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.breaker.total.limit": "80%"
  }
}'
Warning: This is a temporary fix. Simply raising limits without addressing root causes can lead to OutOfMemory errors.
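Once the underlying problem is fixed, roll the overrides back. Setting a cluster setting to null restores its default:
# Remove the temporary overrides and fall back to the defaults
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.breaker.request.limit": null,
    "indices.breaker.total.limit": null
  }
}'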
5. Permanent Circuit Breaker Configuration
For long-term stability, configure these settings in your elasticsearch.yml:
# Recommended production settings
indices.breaker.total.limit: 80%
indices.breaker.request.limit: 65%
indices.breaker.fielddata.limit: 45%
indices.breaker.total.use_real_memory: true
6. Tackling Fielddata Issues
Fielddata is often the culprit behind circuit breaker exceptions, especially with high-cardinality text fields.
6-1. Clear Fielddata Cache Immediately
# Clear all fielddata cache (cluster-wide)
curl -X POST "localhost:9200/*/_cache/clear?fielddata=true"
# Clear specific index fielddata
curl -X POST "localhost:9200/your_index/_cache/clear?fielddata=true"
6-2. Fix Your Mappings
The real solution is preventing fielddata usage in the first place:
// Instead of this problematic mapping:
{
  "mappings": {
    "properties": {
      "category": {
        "type": "text",
        "fielddata": true  // Memory killer!
      }
    }
  }
}

// Use this optimized version:
{
  "mappings": {
    "properties": {
      "category": {
        "type": "keyword"  // Much more memory-efficient
      }
    }
  }
}
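Note that an existing field's type can't be changed in place. The usual path is to create a new index with the corrected mapping and copy the data over with the reindex API; the index names below are placeholders:
# Copy documents from the old index into one that uses the keyword mapping
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "your_index" },
  "dest": { "index": "your_index_v2" }
}'
Once the new index is verified, point your aliases or application at it and drop the old one.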
7. Query Optimization
7-1. Limit Aggregation Sizes
Large aggregations are circuit breaker killers:
// Problematic query
{
  "aggs": {
    "all_categories": {
      "terms": {
        "field": "category",
        "size": 50000  // Way too large!
      }
    }
  }
}

// Optimized query
{
  "aggs": {
    "top_categories": {
      "terms": {
        "field": "category",
        "size": 100  // Reasonable limit
      }
    }
  }
}
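If all you actually need is the number of distinct values rather than the buckets themselves, a cardinality aggregation is far cheaper than a huge terms aggregation. A quick sketch against the same category field used above:
// Approximate distinct count instead of materializing 50,000 buckets
{
  "size": 0,
  "aggs": {
    "category_count": {
      "cardinality": { "field": "category" }
    }
  }
}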
7-2. Use Composite Aggregations for Large Datasets
When you need more than a few hundred buckets:
{
  "aggs": {
    "categories": {
      "composite": {
        "size": 100,
        "sources": [
          { "category": { "terms": { "field": "category" } } }
        ]
      }
    }
  }
}
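Composite aggregations are paginated: each response includes an after_key, which you pass back as after in the next request instead of asking for everything at once. The after value below is illustrative:
{
  "aggs": {
    "categories": {
      "composite": {
        "size": 100,
        "sources": [
          { "category": { "terms": { "field": "category" } } }
        ],
        "after": { "category": "electronics" }  // copied from the previous response's after_key
      }
    }
  }
}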
7-3. Implement Time-Based Filtering
Always limit your search scope:
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-24h",
              "lte": "now"
            }
          }
        }
      ]
    }
  }
}
8. Monitoring and Prevention
8-1. Set Up Proactive Monitoring
Don’t wait for circuit breakers to trip. Monitor heap usage continuously:
# Monitor heap usage every 30 seconds
watch -n 30 'curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent" | column -t'
8-2. Configure Alerting
Set up alerts when memory usage consistently exceeds 85%:
Using Elasticsearch Watcher:
{
  "trigger": {
    "schedule": { "interval": "5m" }
  },
  "input": {
    "search": {
      "request": {
        "indices": [".monitoring-es-*"],
        "body": {
          "query": {
            "bool": {
              "filter": [
                { "range": { "timestamp": { "gte": "now-10m" } } },
                { "range": { "node_stats.jvm.mem.heap_used_percent": { "gte": 85 } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 0 } }
  },
  "actions": {
    "send_email": {
      "email": {
        "to": ["admin@yourcompany.com"],
        "subject": "High heap usage detected"
      }
    }
  }
}
The compare condition keeps the email from firing when no nodes matched in the last 10 minutes.
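If Watcher isn't available in your deployment, a small cron-driven script can cover the basics. This is a minimal sketch, assuming curl is installed, the node listens on localhost:9200, and 85% is your alert threshold:
#!/usr/bin/env bash
# Warn when any node's heap usage crosses the threshold
THRESHOLD=85
curl -s "localhost:9200/_cat/nodes?h=name,heap.percent" | while read -r name heap; do
  [ -z "$heap" ] && continue
  if [ "${heap%.*}" -ge "$THRESHOLD" ]; then
    echo "WARNING: node ${name} heap at ${heap}%"
  fi
done
Wire the echo into your paging or chat tooling of choice.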
9. When to Scale Your Cluster
Sometimes the solution isn’t tuning—it’s adding resources:
Scale horizontally when:
- Memory usage consistently exceeds 85% despite optimization
- Query response times are degrading
- You’re processing more data than a single node can handle efficiently
Scale vertically when:
- You have complex aggregations that require more heap
- Your working dataset exceeds current node capacity
Scaling best practices:
- Add data nodes to distribute shard load (see the allocation check below)
- Consider dedicated master nodes for large clusters (3+ data nodes)
- Use instance types with at least 32GB RAM for production
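Before adding nodes, check how shards and disk are distributed today; the allocation cat API gives a quick per-node picture:
# Shard counts and disk usage per data node
curl -s "localhost:9200/_cat/allocation?v"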
10. Emergency Response Checklist
When circuit breakers are firing:
Immediate Assessment (< 5 minutes):
- [ ] Check heap usage: curl "localhost:9200/_cat/nodes?v&h=heap.percent"
- [ ] Identify breaker type from error logs
- [ ] Look for running heavy queries in slow logs
- [ ] Check fielddata usage if fielddata breaker tripped
Emergency Actions (< 15 minutes):
- [ ] Cancel resource-intensive queries if identified (task API example after this list)
- [ ] Clear fielddata cache if applicable
- [ ] Temporarily increase circuit breaker limits
- [ ] Restart nodes with increased heap if necessary
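For the query-cancellation step, the task management API lets you find long-running searches and cancel them; the task ID below is a placeholder:
# Find long-running search tasks
curl -s "localhost:9200/_tasks?actions=*search&detailed=true"
# Cancel a specific task (node_id:task_number format)
curl -X POST "localhost:9200/_tasks/NODE_ID:TASK_NUMBER/_cancel"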
Root Cause Resolution (within hours):
- [ ] Optimize problematic queries
- [ ] Fix fielddata mappings
- [ ] Implement proper monitoring
- [ ] Plan capacity upgrades if needed
11. Production-Ready Settings
Here's a battle-tested baseline for production environments; treat the breaker percentages as a starting point and tune them to your own workload:
# elasticsearch.yml
indices.breaker.total.limit: 85%
indices.breaker.request.limit: 60%
indices.breaker.fielddata.limit: 40%
indices.breaker.total.use_real_memory: true
# JVM settings (jvm.options)
-Xms16g
-Xmx16g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
Circuit breaker exceptions aren’t just errors—they’re warnings that your cluster is under memory pressure. The key is treating them as early warning signals rather than problems to simply work around. Fix the underlying causes, monitor proactively, and your Elasticsearch cluster will thank you with stable, predictable performance.