[VMware] ESXi "Host not responding" Troubleshooting - 헤이든의 전산실 (Hayden's Server Room)

Running a VMware environment, you’ll eventually encounter ESXi hosts showing as “Not Responding” in vCenter. While this can be alarming at first, most cases are resolvable through systematic troubleshooting. This guide walks you through diagnosing and resolving these issues step by step.

ESXi hosts can enter a “not responding” state for various reasons, from network connectivity issues to management service failures and storage problems. The key is methodically checking each potential cause to identify the root issue.

Common symptoms include:

ESXi host appears grayed out in vCenter
“Cannot synchronize host” error messages
Virtual machines displayed as inactive (grayed out)
Direct connection attempts fail


1. Basic Status Verification

Start with fundamental checks of your ESXi host’s condition.

Physical Power Status Check

Verify the ESXi server is physically operational:

Check server power LED status
Access remote management console (iDRAC, iLO, etc.)
Look for PSOD (Purple Screen of Death) occurrence

If a PSOD occurred, look up its error codes in VMware KB 343033 to determine whether the cause is hardware or software related.

Reconnect from vCenter

Try this simple but often effective first step:

Right-click the ESXi host in the vSphere Client
Select Connection > Connect
Check if reconnection succeeds

If this resolves the issue, it was likely a temporary communication problem.

2. Network Connectivity Testing

Verify network communication between ESXi and vCenter.

Basic Ping Tests

From your vCenter server, test connectivity to the ESXi host:

# Test by IP address
ping 192.168.1.100

# Test by FQDN
ping esxi-host.domain.local

Port 902 Connectivity Check

ESXi hosts send heartbeat packets to vCenter every 10 seconds via UDP port 902. If vCenter doesn’t receive a heartbeat within 60 seconds, it marks the host as “Not Responding.”
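To make the timing concrete: with a 10-second interval and a 60-second timeout, six heartbeats in a row have to go missing before vCenter flags the host. The sketch below converts a "time since last heartbeat" value in milliseconds, like the one in the vpxd.log excerpt in section 6, into whole seconds and missed intervals, then compares it with the timeout. `heartbeat_gap` is a hypothetical helper name, and the 10 s and 60 s defaults are the values stated above. Pass different values if you have changed the timeout.

```shell
#!/bin/sh
# heartbeat_gap: hypothetical helper that interprets a heartbeat gap.
# Usage: heartbeat_gap GAP_MS [INTERVAL_S] [TIMEOUT_S]
# Defaults follow the values described above: 10 s interval, 60 s timeout.
heartbeat_gap() {
    gap_ms=$1
    interval_s=${2:-10}
    timeout_s=${3:-60}
    gap_s=$((gap_ms / 1000))          # whole seconds since the last heartbeat
    missed=$((gap_s / interval_s))    # complete intervals with no heartbeat
    if [ "$gap_s" -ge "$timeout_s" ]; then
        state="timeout reached: vCenter would mark the host Not Responding"
    else
        state="within timeout"
    fi
    echo "gap=${gap_s}s missed=${missed} ${state}"
}

# Prints: gap=6745s missed=674 timeout reached: vCenter would mark the host Not Responding
heartbeat_gap 6745344
```

A gap of several thousand seconds, as in that example, points to a sustained outage rather than an occasional dropped packet.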

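Windows port 902 test (TCP):

telnet 192.168.1.100 902

Linux port 902 test (TCP, then UDP):

nc -zv 192.168.1.100 902
nc -zuv 192.168.1.100 902

Note that telnet, and nc without -u, only test TCP 902. Heartbeats use UDP 902 and travel from the host to vCenter. Because UDP is connectionless, nc -zu can only report that no rejection came back, not that packets actually arrived. If the TCP test fails, or heartbeats are still being missed (see the log patterns in section 6), check any firewall between the host and vCenter for rules blocking port 902, especially UDP.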

DNS Configuration Check

Test for DNS resolution issues causing connectivity problems:

SSH to the ESXi host
Check /etc/hosts file contents
Verify the DNS server configuration (in the vSphere Client: host > Configure > Networking > TCP/IP configuration)
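When you review /etc/hosts, what matters is whether the names involved (the host's own FQDN and the vCenter FQDN) map to the addresses you expect. The sketch below is a small awk filter that reads a hosts-format file. `hosts_lookup` and the name `vcenter.domain.local` are hypothetical. The function only reads the file and does not query DNS.

```shell
#!/bin/sh
# hosts_lookup: hypothetical helper. Prints every IP address that a hosts-format
# file maps to NAME (case-insensitive), one per line. Comments are ignored.
# Usage: hosts_lookup NAME [FILE]    (FILE defaults to /etc/hosts)
hosts_lookup() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                  # strip comments; fields are re-split
        NF >= 2 {
            for (i = 2; i <= NF; i++)       # field 1 is the IP, the rest are names
                if (tolower($i) == tolower(name)) { print $1; break }
        }
    ' "${2:-/etc/hosts}"
}

# Example (hypothetical FQDN): which address does this host have for vCenter?
# hosts_lookup vcenter.domain.local
```

If no address is printed, the name is not pinned in /etc/hosts, and resolution depends entirely on the configured DNS servers.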

3. Management Agent Restart

ESXi has two primary management agents. hostd manages most host operations. vpxa, the vCenter agent, is activated when the host joins vCenter and relays requests between vCenter and hostd.

Restart Management Services via DCUI

Through physical or remote console access:

Press F2 on the DCUI screen and log in (if another console screen is showing, Alt + F2 switches back to the DCUI)
Select Troubleshooting Options
Choose Restart Management Agents
Press F11 to confirm

Individual Service Restart via SSH

For more precise control:

# Restart hostd service
/etc/init.d/hostd restart

# Restart vpxa service
/etc/init.d/vpxa restart
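When you restart agents one at a time, it helps to confirm each one is back before moving on. The sketch below wraps the init scripts shown above. It assumes each script accepts `restart` and `status`, and that `status` exits 0 once the agent is running. That is the usual init-script convention, but check it on your build. The function name `restart_agent` and the 30-second limit are arbitrary choices for illustration.

```shell
#!/bin/sh
# restart_agent: hypothetical helper. Restarts one management agent through its
# init script, then polls "status" once per second until it reports running.
# Usage: restart_agent INIT_SCRIPT [MAX_WAIT_S]   e.g. restart_agent /etc/init.d/hostd
restart_agent() {
    script=$1
    max_wait=${2:-30}
    "$script" restart || return 1
    waited=0
    until "$script" status >/dev/null 2>&1; do
        if [ "$waited" -ge "$max_wait" ]; then
            echo "$script: not running after ${max_wait}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "$script: running"
}

# Restart hostd first, then vpxa only if hostd came back:
# restart_agent /etc/init.d/hostd && restart_agent /etc/init.d/vpxa
```

Restarting hostd before vpxa mirrors the order above. vpxa talks to vCenter on behalf of hostd, so there is little point restarting it while hostd is still down.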

⚠️ Important Notes:

Never use /sbin/services.sh restart in NSX environments, as it temporarily interrupts network connections
In VDI environments using shared graphics, restart only the individual services (hostd, vpxa) so the xorg service is not interrupted

Remote Restart via PowerCLI

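# Connect directly to the ESXi host (hostd must still answer for this to work)
Connect-VIServer -Server 192.168.1.100

# Restart the vCenter agent (vpxa); hostd is not exposed as a host service here
Get-VMHostService | Where-Object { $_.Key -eq "vpxa" } | Restart-VMHostService -Confirm:$false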

4. Disk Space and Resource Check

If the root (/) or /var/log partition is full, hostd cannot start properly.

Check Disk Space

SSH to the ESXi host and run:

# Check disk usage
vdf -h

# Show the space used by each item under /var/log
du -sh /var/log/*
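To spot the culprit quickly, you can filter the `vdf -h` (or `df -h`) output for anything above a threshold. This minimal sketch assumes only that each data row contains a `NN%` usage column. The 80% default and the function name `over_threshold` are illustrative choices.

```shell
#!/bin/sh
# over_threshold: hypothetical filter. Reads df/vdf-style output on stdin and
# prints the rows whose usage percentage is at or above PCT (default 80).
# Usage: vdf -h | over_threshold 90
over_threshold() {
    awk -v limit="${1:-80}" '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+%$/) {     # first field that looks like NN%
                    if ((substr($i, 1, length($i) - 1) + 0) >= limit + 0) print
                    break
                }
            }
        }
    '
}
```

The header row's `Use%` label does not match the pattern, so only data rows are printed.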

CPU and Memory Utilization Check

# Real-time system resource monitoring
esxtop

Sustained CPU usage above 90% may indicate resource constraints causing the issue.
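"Sustained" is the key word here, because a single spike above 90% is normal. One way to make that concrete is to require every one of the last N samples to exceed the threshold. This sketch assumes you have already extracted CPU utilization samples as one percentage per line. For example, you could collect a minute of data with `esxtop -b -d 5 -n 12 > cpu.csv` and then extract the utilization column you care about (that extraction step is not shown). The function name, the 90% threshold and the window of 6 samples are illustrative assumptions.

```shell
#!/bin/sh
# sustained_high: hypothetical check. Reads one CPU utilization value per line
# (decimals allowed) on stdin and succeeds only if each of the last WINDOW
# samples is above LIMIT percent. Fewer than WINDOW samples is "not sustained".
# Usage: sustained_high [LIMIT] [WINDOW] < samples.txt
sustained_high() {
    awk -v limit="${1:-90}" -v window="${2:-6}" '
        BEGIN { limit += 0; window += 0 }
        NF { buf[n % window] = $1 + 0; n++ }   # ring buffer of the last WINDOW values
        END {
            if (n < window) exit 1             # not enough samples to judge
            for (k in buf) if (buf[k] <= limit) exit 1
            exit 0
        }
    '
}
```

The function reports through its exit status, so it can be used directly in a condition, for example `if sustained_high 90 6 < samples.txt; then ...`.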

5. Storage Connectivity Verification

ESXi hosts can disconnect from vCenter due to shared storage problems.

Storage Mount Status Check

# Check VMFS volume status
ls /vmfs/volumes

# Check storage adapter status
esxcli storage core adapter list

If commands take excessive time or return errors, storage connectivity issues are likely.
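Because a storage path problem tends to show up as a command that simply hangs, it helps to run these checks with a time limit so your SSH session is not stuck. A `timeout` utility may not be available in the ESXi shell, so this sketch polls in plain POSIX sh (it also assumes `mktemp` is present). The function name, the 1-second polling and exit code 124 (borrowed from GNU `timeout`) are assumptions. A command stuck in storage I/O may keep running after the wrapper gives up, so the function reports the timeout and returns instead of waiting for it.

```shell
#!/bin/sh
# run_with_timeout: hypothetical helper. Runs a command in the background and
# gives up after LIMIT seconds instead of blocking the shell.
# Returns the command's exit status, or 124 on timeout.
# Usage: run_with_timeout 15 ls /vmfs/volumes
run_with_timeout() {
    limit=$1
    shift
    work=$(mktemp -d) || return 1
    # Capture output in a file; write the exit status only once the command ends.
    (
        "$@" >"$work/out" 2>&1
        echo $? >"$work/rc.tmp" && mv "$work/rc.tmp" "$work/rc"
    ) 2>/dev/null &
    bg=$!
    waited=0
    while [ ! -f "$work/rc" ]; do
        if [ "$waited" -ge "$limit" ]; then
            kill "$bg" 2>/dev/null   # stop the wrapper; a command stuck in I/O may linger
            echo "TIMEOUT: '$*' still running after ${limit}s" >&2
            rm -rf "$work"
            return 124
        fi
        sleep 1
        waited=$((waited + 1))
    done
    cat "$work/out"
    rc=$(cat "$work/rc")
    rm -rf "$work"
    return "$rc"
}
```

For example, `run_with_timeout 15 esxcli storage core adapter list` either prints the adapter list or, after 15 seconds, prints a TIMEOUT message. The timeout message is itself a useful data point.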

iSCSI/FC Connection Status Check

# Check iSCSI session status
esxcli iscsi session list

# Check FC adapter status (including port state)
esxcli storage san fc list

6. Log Analysis for Detailed Diagnosis

Key Log File Locations

| Log File | Path | Purpose |
|---|---|---|
| hostd log | `/var/log/hostd.log` | Host daemon related issues |
| vpxa log | `/var/log/vpxa.log` | vCenter agent related problems |
| vmkernel log | `/var/log/vmkernel.log` | Kernel level errors and hardware issues |
| vCenter log | `/var/log/vmware/vpxd/vpxd.log` (on the vCenter Server Appliance) | vCenter server side problems |

Common Error Patterns

Heartbeat loss:

Missed 2 heartbeats for host esx.example.com
No heartbeats received from host; time since last heartbeat: 6745344ms

hostd unresponsive:

hostd detected to be non-responsive

Certificate issues (after upgrading to vCenter 8.0U2):

Discarding non-CA certificate
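To see at a glance which of these patterns appear, and how often, you can count them across the logs listed above. Run it on whichever system holds each log: the heartbeat messages are in vpxd.log on vCenter, and the hostd and certificate messages are on the host. In this sketch the function name is hypothetical. The search strings are the messages quoted above, with only the heartbeat count turned into a regular expression. Only the files you pass are searched, so rotated or compressed copies are not included.

```shell
#!/bin/sh
# count_patterns: hypothetical helper. For each known "Not Responding" pattern,
# prints how many matching lines exist in the given log files.
# Usage: count_patterns /var/log/hostd.log /var/log/vpxa.log
count_patterns() {
    if [ "$#" -eq 0 ]; then
        echo "usage: count_patterns LOGFILE..." >&2
        return 2
    fi
    for pattern in \
        'Missed [0-9]+ heartbeats' \
        'No heartbeats received from host' \
        'hostd detected to be non-responsive' \
        'Discarding non-CA certificate'
    do
        # Missing files are skipped silently; grep -c counts matching lines.
        n=$(cat "$@" 2>/dev/null | grep -Ec "$pattern")
        echo "$n  $pattern"
    done
}
```

A non-zero count only tells you where to start reading. The timestamps around the matches are what connect them to the outage.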

7. Advanced Troubleshooting

vCenter 8.0U2 Certificate Issues

After an upgrade to vCenter 8.0U2, non-CA certificates in the TRUSTED_ROOTS store can cause hostd to discard certificates and restart vpxa, so the host keeps connecting and dropping in a loop.

Resolution:

# On the vCenter Server Appliance: list TRUSTED_ROOTS entries with their Key Usage
/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOTS --text | egrep 'Alias|Key Usage' -A 1

# Remove a non-CA certificate if needed (back it up before deleting)
/usr/lib/vmware-vmafd/bin/vecs-cli entry delete --store TRUSTED_ROOTS --alias <certificate_alias>
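Reading the `vecs-cli ... | egrep` output by eye gets tedious with many entries. This awk sketch makes some assumptions about the output format:

* Each entry starts with an `Alias : <name>` line.
* Within the same entry, an `X509v3 Key Usage` line is followed by its value on the next line.

The sketch flags every alias whose Key Usage never mentions `Certificate Sign`, which is the usage CA certificates carry. Entries with no Key Usage extension at all are also listed. Treat the result as a list to review, not a list to delete. The function name is hypothetical.

```shell
#!/bin/sh
# non_ca_aliases: hypothetical filter. Reads the output of
#   vecs-cli entry list --store TRUSTED_ROOTS --text | egrep 'Alias|Key Usage' -A 1
# on stdin and prints each alias whose entry never mentions "Certificate Sign".
# The output is a review list; confirm each certificate before deleting it.
non_ca_aliases() {
    awk '
        /^Alias[ \t]*:/ {
            alias = $0
            sub(/^Alias[ \t]*:[ \t]*/, "", alias)
            order[++count] = alias         # remember aliases in input order
            next
        }
        /Certificate Sign/ && count { is_ca[alias] = 1 }
        END {
            for (i = 1; i <= count; i++)
                if (!(order[i] in is_ca)) print order[i]
        }
    '
}

# Usage on the vCenter Server Appliance:
# /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOTS --text \
#     | egrep 'Alias|Key Usage' -A 1 | non_ca_aliases
```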

Increase Heartbeat Timeout

As a temporary workaround, increase the heartbeat timeout in vCenter:

Select the vCenter Server object in the vSphere Client
Go to Configure > Advanced Settings
Change config.vpxd.heartbeat.hostTimeout from its default of 60 seconds to a higher value

Management Network Restart

For network configuration issues:

Select Configure Management Network in DCUI
Choose Restart Management Network
Run network tests to verify connectivity
