If you’ve been working with Elasticsearch for any length of time, you’ve probably encountered the dreaded master_not_discovered_exception error. This error can be particularly frustrating because it effectively renders your cluster unusable until resolved.

I’ve spent countless hours debugging this issue across different environments, and I’m here to share the most effective solutions that actually work. Let’s dive into the root causes and systematic fixes that will get your cluster back up and running.

1. Understanding ‘master_not_discovered_exception’ Error

The master_not_discovered_exception occurs when Elasticsearch nodes cannot locate or elect a master node within the cluster. When this happens, your cluster becomes unresponsive:

{
  "error": {
    "root_cause": [{
      "type": "master_not_discovered_exception",
      "reason": null
    }],
    "type": "master_not_discovered_exception",
    "reason": null
  },
  "status": 503
}

This error essentially means your cluster is in a state where no node can be designated as the master, preventing any cluster-level operations from executing.
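
A quick way to confirm you are in this state: node-local endpoints typically still answer, while anything that needs an elected master returns HTTP 503. A minimal check, assuming the default host and port:

# Node-level info usually still responds even without a master
curl -s "localhost:9200/"

# Cluster-level APIs return 503 until a master is elected
curl -s -o /dev/null -w "%{http_code}\n" "localhost:9200/_cluster/health"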

2. Root Cause Analysis

Network Connectivity Issues

The most common culprit is network connectivity problems between nodes. If nodes can’t communicate with each other, they can’t perform master election.

Configuration Problems

Starting with Elasticsearch 7.x, a new cluster must be explicitly bootstrapped: in production mode at least one of discovery.seed_hosts, discovery.seed_providers, or cluster.initial_master_nodes has to be set. Missing or incorrect bootstrap configuration is a frequent trigger for this error.

Firewall and Security Group Restrictions

Cloud environments, particularly AWS EC2, often have restrictive security settings that block the necessary ports for inter-node communication.

Split-Brain Scenarios

In multi-node clusters, an incorrect quorum configuration can lead to split-brain situations in which more than one node believes it is the master. This is mostly a concern on 6.x and earlier, where the quorum had to be configured by hand; 7.x and later manage the voting configuration automatically.
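
A minimal legacy sketch of the manual setting, using the usual (master_eligible_nodes / 2) + 1 formula for a cluster with three master-eligible nodes (illustrative only; the setting was removed in 7.x):

# elasticsearch.yml on Elasticsearch 6.x
# Quorum for 3 master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2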

3. Step-by-Step Resolution Guide

3-1. Network Connectivity Verification

Before diving into configuration changes, verify that your nodes can actually communicate with each other.

Step 1: Check HTTP connectivity

# Test the HTTP API (port 9200)
curl -X GET "localhost:9200/_cluster/health?pretty"

Step 2: Verify transport layer connectivity

# Test inter-node communication (port 9300)
telnet [target_node_ip] 9300

If either of these fails, you have a network connectivity issue that needs to be resolved first.
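
If telnet is not installed on the host, nc (netcat) or bash's built-in /dev/tcp redirection can run the same transport-port check (replace the IP with your target node):

# netcat alternative to telnet
nc -zv 192.168.1.11 9300

# bash built-in TCP check, no extra tools required
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/192.168.1.11/9300' && echo "port 9300 reachable"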

Resolution Steps:

  1. Configure firewall rules (Linux)
    # For CentOS/RHEL systems
    sudo firewall-cmd --permanent --add-port=9200/tcp
    sudo firewall-cmd --permanent --add-port=9300/tcp
    sudo firewall-cmd --reload
    
    # For Ubuntu systems
    sudo ufw allow 9200
    sudo ufw allow 9300
    sudo ufw reload
    
  2. AWS EC2 Security Group Configuration
    • Navigate to EC2 → Security Groups
    • Add inbound rules for TCP ports 9200 and 9300
    • Set source to either specific IP ranges or the security group itself for internal communication (a CLI equivalent is sketched below)
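
If you prefer the AWS CLI, the same rules can be added with authorize-security-group-ingress. A minimal sketch, assuming a hypothetical security group ID of sg-0123456789abcdef0 that all cluster nodes share:

# Allow transport (9300) and HTTP (9200) traffic between members of the same security group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 9300 \
  --source-group sg-0123456789abcdef0

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 9200 \
  --source-group sg-0123456789abcdef0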

3-2. Single Node Configuration Fix

For single-node deployments, edit your /etc/elasticsearch/elasticsearch.yml file with these essential settings:

# Cluster identification
cluster.name: my-elasticsearch-cluster

# Node identification
node.name: node-1

# Network binding (be careful with 0.0.0.0 in production)
network.host: 0.0.0.0
http.port: 9200

# Critical: Bootstrap setting for ES 7.x+
cluster.initial_master_nodes: ["node-1"]

# Discovery configuration
discovery.seed_hosts: ["127.0.0.1"]

Important Configuration Notes:

  • Setting network.host to 0.0.0.0 puts Elasticsearch into production mode, which enforces additional system checks
  • The cluster.initial_master_nodes setting is mandatory for Elasticsearch 7.x and later
  • Node name in cluster.initial_master_nodes must match the node.name setting exactly

After making changes:

# Restart Elasticsearch service
sudo systemctl restart elasticsearch

# Verify the fix
curl -X GET "localhost:9200/_cluster/health?pretty"
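
A healthy single-node cluster should now return something roughly like the abbreviated response below (your values will differ; the key fields to look for are status and number_of_nodes):

{
  "cluster_name" : "my-elasticsearch-cluster",
  "status" : "green",
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  ...
}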

3-3. Multi-Node Cluster Configuration

For multi-node clusters, each node requires specific configuration. Here’s how to set up a 3-node cluster properly:

Master Node Configuration (IP: 192.168.1.10)

cluster.name: production-cluster
node.name: master-node-1
node.roles: ["master"]

# Network settings
network.host: 192.168.1.10
http.port: 9200
transport.port: 9300

# Discovery settings - use the same seed list on every node
discovery.seed_hosts: ["192.168.1.10:9300", "192.168.1.11:9300", "192.168.1.12:9300"]

# Bootstrap settings - only specify master-eligible nodes
cluster.initial_master_nodes: ["master-node-1", "master-node-2"]

Data Node Configuration (IP: 192.168.1.11)

cluster.name: production-cluster
node.name: data-node-1
node.roles: ["data"]

# Network settings
network.host: 192.168.1.11
http.port: 9200
transport.port: 9300

# Discovery settings - same as master nodes
discovery.seed_hosts: ["192.168.1.10:9300", "192.168.1.11:9300", "192.168.1.12:9300"]

# Note: Data nodes don't need cluster.initial_master_nodes
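
The example above implies a second master-eligible node at 192.168.1.12 (master-node-2, as listed in cluster.initial_master_nodes). A sketch of its configuration, assuming it mirrors master-node-1:

Second Master Node Configuration (IP: 192.168.1.12)

cluster.name: production-cluster
node.name: master-node-2
node.roles: ["master"]

# Network settings
network.host: 192.168.1.12
http.port: 9200
transport.port: 9300

# Discovery settings - same list on every node
discovery.seed_hosts: ["192.168.1.10:9300", "192.168.1.11:9300", "192.168.1.12:9300"]

# Bootstrap settings - must match master-node-1 exactly
cluster.initial_master_nodes: ["master-node-1", "master-node-2"]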

Critical Multi-Node Setup Rules:

  1. All nodes must have identical cluster.name
  2. Each node must have a unique node.name
  3. discovery.seed_hosts should point at the master-eligible nodes (listing every cluster node, as in the examples above, also works)
  4. Only master-eligible nodes should be in cluster.initial_master_nodes

3-4. Version-Specific Considerations

Elasticsearch Version | Key Changes                              | Required Settings
6.x and earlier       | Uses discovery.zen.* settings            | discovery.zen.minimum_master_nodes
7.x                   | Introduces cluster.initial_master_nodes  | cluster.initial_master_nodes (mandatory)
8.x                   | Removes all discovery.zen.* settings     | cluster.initial_master_nodes only

For Elasticsearch 8.x users: Remove any discovery.zen.* configurations from your elasticsearch.yml file, as they’re no longer supported and will cause startup failures.
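
As a concrete example, a 6.x-era discovery block and its 7.x/8.x replacement look roughly like this (host names and node names are illustrative):

# Legacy 6.x settings - remove these before upgrading to 8.x
# discovery.zen.ping.unicast.hosts: ["192.168.1.10", "192.168.1.11"]
# discovery.zen.minimum_master_nodes: 2

# 7.x / 8.x equivalents
discovery.seed_hosts: ["192.168.1.10", "192.168.1.11"]
cluster.initial_master_nodes: ["master-node-1", "master-node-2"]   # only for the first bootstrap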

4. Advanced Troubleshooting

4-1. Log Analysis

Elasticsearch logs contain crucial information for diagnosing master discovery issues:

# Monitor logs in real-time
sudo tail -f /var/log/elasticsearch/[cluster-name].log

# Search for specific error patterns
sudo grep -i "master_not_discovered\|zen discovery\|not enough master nodes" /var/log/elasticsearch/*.log
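
On systemd-based package installs, the same startup and discovery messages are also available through journald:

# Follow Elasticsearch service logs via systemd
sudo journalctl -u elasticsearch.service -f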

Common log patterns and their meanings:

  • "not enough master nodes discovered during pinging" → Network connectivity or configuration issue
  • "master not discovered yet" → Cluster initialization problem
  • "failed to join" → Node cannot connect to existing cluster

4-2. Cluster State Diagnostics

Use these commands to diagnose cluster issues:

# Check overall cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# List all nodes in the cluster
curl -X GET "localhost:9200/_cat/nodes?v&h=name,role,master,ip"

# Identify current master
curl -X GET "localhost:9200/_cat/master?v"

# Check cluster settings
curl -X GET "localhost:9200/_cluster/settings?pretty"

4-3. Emergency Recovery Procedures

When your cluster is completely unresponsive:

Step 1: Stop all Elasticsearch services

# On each node, stop Elasticsearch
sudo systemctl stop elasticsearch

Step 2: Backup your data (critical step)

# Create a backup of your data directory
sudo cp -r /var/lib/elasticsearch /var/lib/elasticsearch_backup_$(date +%Y%m%d)

Step 3: Clean cluster state (if necessary)

# Only if you're sure the cluster metadata is corrupted - this is a last resort
# On ES 7.x+ use the bundled elasticsearch-node tool instead of deleting state files by hand
# (run it as the elasticsearch user while the node is stopped; data loss is possible)
sudo -u elasticsearch /usr/share/elasticsearch/bin/elasticsearch-node unsafe-bootstrap
# Other nodes that should join the newly bootstrapped cluster need:
# sudo -u elasticsearch /usr/share/elasticsearch/bin/elasticsearch-node detach-cluster

Step 4: Fix configuration and restart

# On the master-eligible nodes, start Elasticsearch first
sudo systemctl start elasticsearch

# Wait 30-60 seconds, then run the same command on the data nodes
sudo systemctl start elasticsearch
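
After restarting, an election can take a little while to complete. A small polling loop (default host and port assumed) saves guessing when the cluster is usable again:

# Poll until the cluster-level health API stops returning 503 (give up after ~2 minutes)
for i in $(seq 1 24); do
    CODE=$(curl -s -o /dev/null -w "%{http_code}" "localhost:9200/_cluster/health")
    if [ "$CODE" = "200" ]; then
        echo "Master elected, cluster is responding."
        break
    fi
    echo "Still waiting for a master... (attempt $i, HTTP $CODE)"
    sleep 5
done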

5. Prevention and Monitoring

5-1. Automated Health Monitoring

Implement automated monitoring to catch issues early:

#!/bin/bash
# elasticsearch-health-check.sh
# Requires curl and jq

CLUSTER_STATUS=$(curl -s -X GET "localhost:9200/_cluster/health" | jq -r '.status')
MASTER_NODE=$(curl -s -X GET "localhost:9200/_cat/master?h=node")

# An unreachable cluster is treated the same as a non-green status
if [ "$CLUSTER_STATUS" != "green" ]; then
    echo "ALERT: Cluster status is ${CLUSTER_STATUS:-unreachable}"
    echo "Current master: ${MASTER_NODE:-none}"
    # Add your notification logic here (Slack, email, etc.)
fi
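
To run the check periodically, a simple cron entry is enough (the script path below is an assumption; adjust it to wherever you install the script):

# /etc/cron.d/elasticsearch-health - run the health check every 5 minutes as root
*/5 * * * * root /usr/local/bin/elasticsearch-health-check.sh >> /var/log/es-health.log 2>&1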

5-2. Recommended Monitoring Metrics

Monitor these key indicators (a quick _cat command for pulling most of them follows the list):

  • Cluster status (green/yellow/red)
  • Master node stability (frequent changes indicate problems)
  • Node connectivity (transport layer failures)
  • JVM heap usage (high memory pressure can cause master issues)
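
Most of these indicators can be read straight from the _cat APIs; for example, per-node heap pressure and the current master in one call:

# Show heap usage, RAM usage, roles, and which node is master
curl -X GET "localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,node.role,master"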

5-3. Best Practices for Cluster Stability

  1. Use dedicated master nodes in production clusters with heavy workloads
  2. Maintain an odd number of master-eligible nodes to avoid split-brain scenarios
  3. Set appropriate heap sizes (at most 50% of RAM, and below ~32 GB so compressed object pointers stay enabled; see the sketch below)
  4. Take regular cluster snapshots for disaster recovery
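
As an illustration of points 1 and 3, a dedicated master carries only the master role, and heap is pinned through a file under jvm.options.d (the 8 GB value below is an assumption; size it for your hardware, keeping Xms equal to Xmx):

# /etc/elasticsearch/elasticsearch.yml - dedicated master node
node.roles: ["master"]

# /etc/elasticsearch/jvm.options.d/heap.options
-Xms8g
-Xmx8g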

6. Frequently Asked Questions

Q: Do I need cluster.initial_master_nodes for a single node? A: For the default discovery mode, yes: starting with Elasticsearch 7.x, set it to your node’s name so the node can elect itself. The alternative is discovery.type: single-node, in which case cluster.initial_master_nodes must not be set at all.

Q: Why does this error keep happening on AWS? A: Usually due to security group restrictions. Ensure TCP port 9300 is open for inter-node communication within your cluster’s security group.

Q: Can I remove cluster.initial_master_nodes after cluster startup? A: Yes. In fact, Elastic recommends removing it from every node once the cluster has formed; the setting is only consulted during the very first bootstrap and should not be present on restarting nodes or on nodes joining an existing cluster.

Q: What’s the difference between discovery.seed_hosts and cluster.initial_master_nodes? A: discovery.seed_hosts helps nodes find each other, while cluster.initial_master_nodes specifies which nodes can be elected as master during initial cluster bootstrap.
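
A side-by-side snippet makes the distinction concrete (addresses and node names are illustrative):

# Needed on every node, every restart: how to find the other nodes
discovery.seed_hosts: ["192.168.1.10", "192.168.1.11", "192.168.1.12"]

# Needed only for the very first cluster formation, then removed
cluster.initial_master_nodes: ["master-node-1", "master-node-2"]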

The master_not_discovered_exception error, while intimidating, follows predictable patterns and has reliable solutions. Most cases stem from network connectivity issues or configuration oversights that can be systematically resolved.

Start with network verification, then move to configuration fixes, and always test your changes in a controlled environment before applying them to production. With proper monitoring and the troubleshooting steps outlined above, you can minimize downtime and maintain cluster stability. Remember: in production environments, prevention through proper monitoring and regular health checks is far more valuable than reactive troubleshooting.
