vCenter raised several alarms because its /storage/log partition was full. The errors were:

  • Alert: Log Disk Exhaustion
  • Alert: RBD Health Alarm
  • Warning: PostgreSQL Service Health Alarm
(Screenshot: vCenter error overview)

Find the large logs

Logging into the vCenter appliance using SSH is a good starting point for troubleshooting.
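
For completeness, connecting looks roughly like this (the hostname is just a placeholder, and the shell command is only needed if the appliance shell is the default instead of bash):

ssh root@vcenter.example.com
shell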

root@vcenter [ ~ ]# df -h
Filesystem                                   Size  Used Avail Use% Mounted on
...
/dev/mapper/archive_vg-archive                49G   29G   19G  61% /storage/archive
/dev/mapper/core_vg-core                      25G  2.6G   21G  11% /storage/core
/dev/mapper/log_vg-log                       9.8G  9.8G    5M 100% /storage/log
/dev/mapper/db_vg-db                         9.8G  698M  8.6G   8% /storage/db
/dev/mapper/seat_vg-seat                     541G  279M  513G   1% /storage/seat
...
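
If df alone does not make it obvious where the space went, du can summarize the largest subdirectories first (a quick helper step, not part of the original output; assumes the GNU coreutils shipped with the appliance):

du -h --max-depth=2 /storage/log | sort -h | tail -n 15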

Searching for the largest files in the /storage/log partition showed:

root@vcenter [ /storage/log ]# find /storage/log -type f -size +25M
/storage/log/vmware/vc-ws1a-broker/federation-gc.log.1
/storage/log/vmware/vpxd/vpxd-34.log
/storage/log/vmware/vpxd/vpxd-profiler-126.log
/storage/log/vmware/wcp/stdstream.log-0.stderr
/storage/log/vmware/sso/vmware-identity-sts-default-9.log
/storage/log/vmware/vsphere-ui/logs/threadmonitor.log
/storage/log/vmware/vsphere-ui/logs/threadmonitor1.log
/storage/log/vmware/vsphere-ui/logs/vsphere_client_virgo.log
/storage/log/vmware/rbd/preUpgradeLogs.zip
/storage/log/vmware/vmware-updatemgr/updatemgr-vmon.log.stderr
/storage/log/vmware/procstate

Of the files listed above, updatemgr-vmon.log.stderr was by far the largest:

-rw------- 1 updatemgr updatemgr 5.8G Dec 4 01:51 updatemgr-vmon.log.stderr
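
To see all of these sizes at a glance instead of checking files one by one, the same find can be combined with du and sort (just a convenience, again assuming GNU coreutils):

find /storage/log -type f -size +25M -exec du -h {} + | sort -rh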

Log cleanup

Now let’s delete all files larger than 25 MB in the /storage/log/vmware directory:

find /storage/log/vmware -type f -size +25M -delete
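
One caveat: if a service still holds a deleted log file open, the space is not released until that service is restarted. For files that are actively being written to, such as updatemgr-vmon.log.stderr, truncating them in place avoids that problem (an alternative worth considering, not what was done here):

truncate -s 0 /storage/log/vmware/vmware-updatemgr/updatemgr-vmon.log.stderr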

While browsing through the other directories and partitions, I noticed that two more locations also contained old log files. I decided that all files older than 30 days could safely be removed.

The directories are:

  • /storage/log/vmware/vc-ws1a-broker
  • /storage/archive/vpostgres

find /storage/log/vmware/vc-ws1a-broker -type f -mtime +30 -delete
find /storage/archive/vpostgres -type f -mtime +30 -delete
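
Before running these, it does not hurt to preview what matches and how much space it represents; a quick dry-run sketch (replace -delete with -print, or sum the sizes with du):

find /storage/log/vmware/vc-ws1a-broker -type f -mtime +30 -print
find /storage/archive/vpostgres -type f -mtime +30 -exec du -ch {} + | tail -n 1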

Cleanup result

Before:

/dev/mapper/archive_vg-archive 49G 29G 19G 61% /storage/archive
/dev/mapper/log_vg-log 9.8G  9.8G    5M 100% /storage/log

After:

/dev/mapper/archive_vg-archive 49G 3.1G 44G 7% /storage/archive
/dev/mapper/log_vg-log 9.8G  2.3G  7.1G 25% /storage/log

In total this freed about 33 GB of unneeded logs. Unfortunately, the /storage/log partition quickly started filling up again at a rate of around 100 MB per 15 minutes. It turned out that the updatemgr-vmon.log.stderr file had started growing again as soon as space became available.
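
To keep an eye on the growth rate, a simple loop that prints the file size every five minutes is enough (a sketch of how this can be watched, not the exact method used at the time):

while true; do
    ls -lh /storage/log/vmware/vmware-updatemgr/updatemgr-vmon.log.stderr
    sleep 300
done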

Fixing the vLCM Baselines

While tailing (tail -f) the updatemgr-vmon.log.stderr file, it became clear that vLCM had severe issues compiling the vLCM image baseline for two of my clusters, which flooded the screen with error messages. Unfortunately, I no longer have the exact logs. The errors were related to the Mellanox ConnectX (nmlx5_core) and Intel E810 (icen) ESXi drivers.
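
Since the flood made the output hard to read, filtering the tail for the driver names helps isolate the relevant lines (the pattern below is only illustrative, as the exact messages are gone):

tail -f /storage/log/vmware/vmware-updatemgr/updatemgr-vmon.log.stderr | grep -iE 'nmlx5_core|icen'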

I knew that I had added a couple of async drivers to the vLCM image baseline (using a custom JSON file) a while ago. That was done to make sure the hosts had the latest NIC drivers and VMware Tools.

After deleting both of the ESXi NIC drivers (nmlx5_core and icen) from the vLCM image baseline component list, the flooding of log messages stopped.

FYI: this issue occurred on vCenter version 8.0 Update 1.

