Nutanix Cluster – Enabling Maintenance mode on ESXi Host

Let's have an overview of the Nutanix Virtual Computing Platform before jumping directly into the steps for enabling maintenance mode.

Hyper-Converged Infra (HCI)

HCI is a software-based architecture that tightly integrates compute, storage, network, and virtualization. The vital part is that the local storage of the physical servers in the cluster is converged and presented as a pool of shared storage, which enables the virtualization features that depend on shared storage.

Nutanix Virtual Computing Platform

Nutanix Virtual Computing Platform is a hyper-converged infrastructure. Nutanix uses its own NDFS (Nutanix Distributed File System) for converging storage resources.

In general, the physical hosts that are part of a Nutanix cluster are installed with a standard hypervisor (assume ESXi in this case) and have their own hardware resources such as processor, memory, storage, and network.

Nutanix places its Controller VM (CVM) on each host/node of the cluster. The CVM is responsible for forming the unified shared storage resource and serving I/O from the hypervisor. So the CVM is the key component that enables storage-level convergence.

Logically, we can say:

  • Compute-level clustering happens with the help of the hypervisor's vSphere HA & DRS.
  • Storage-level convergence/clustering happens with the help of the Nutanix CVM.

So, while taking a host/node down for an activity, two levels of maintenance have to be placed:

  • Hypervisor-level maintenance
  • CVM-level maintenance

Consider a Nutanix cluster with 5 ESXi hosts whose resiliency factor is set to withstand a single node failure. In that case, we can safely take one node down for a maintenance activity.

Steps to enable maintenance mode:

It is good to collect the NCC log and have it verified with Nutanix to ensure that there are no existing critical issues in the cluster.
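If you want to run the checks yourself before engaging Nutanix, NCC can be invoked from any CVM. A minimal sketch (the full check set can take a while on larger clusters):

# From any CVM: run the complete NCC health check suite.
ncc health_checks run_all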

Verify the “Data Resiliency” status of the Nutanix cluster in the Prism portal; it should be Normal prior to starting the activity.
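The same can also be verified from the command line. One way, assuming your AOS/NOS version ships this ncli verb, is to query the node fault tolerance status from any CVM:

# From any CVM: shows how many node failures the cluster can currently tolerate.
ncli cluster get-domain-fault-tolerance-status type=node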

As a first step, we have to migrate all the user VMs residing on the target host (except the CVM) to the other hosts available in the cluster.
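To double-check that only the CVM is still running on the target host before proceeding, you can list the running VMs from an SSH session on the ESXi host itself:

# On the target ESXi host: list the VMs that are still running.
# Only the CVM (typically named NTNX-<block>-<position>-CVM) should remain.
esxcli vm process list | grep "Display Name"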

Connect to the CVM of the target ESXi host via SSH and execute the command below to find its UUID.

ncli host ls | grep -C7 [IP-Address of CVM]

Place the CVM in maintenance mode using the UUID fetched in the previous step.

ncli host edit id=[UUID] enable-maintenance-mode="true"

Verify that the CVM has been placed in maintenance mode using the following command. At this stage, CVM-level maintenance mode is enabled and confirmed.

cluster status | grep CVM

Now shut down the CVM using the command below.

cvm_shutdown -h now

Now it is safe to enable maintenance mode at the hypervisor level: all the user VMs have already been migrated to other nodes, and the CVM has been brought down gracefully in the previous steps.

Place the target ESXi host in maintenance mode and proceed with your maintenance activity.
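This can be done from vCenter, or from an SSH session on the host itself. A minimal sketch, assuming the host has already been evacuated (otherwise the command will not complete):

# On the target ESXi host: enter maintenance mode, then confirm the state.
esxcli system maintenanceMode set --enable true
esxcli system maintenanceMode get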

Steps to exit from maintenance mode:

Once the maintenance activity is completed, we have to add the node back to the cluster.

Exit the ESXi host from maintenance mode and power on the CVM.
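Again, this can be done from vCenter or from the host's own shell. A sketch (the grep pattern assumes the usual NTNX-*-CVM naming, and [VMID] is the ID from the first column of the getallvms output):

# On the ESXi host: exit maintenance mode.
esxcli system maintenanceMode set --enable false
# Find the CVM's VM ID, then power it on.
vim-cmd vmsvc/getallvms | grep -i cvm
vim-cmd vmsvc/power.on [VMID]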

Connect via SSH to a neighboring CVM available in the cluster.

Check the status of the CVM that we powered on. At this stage, it should be reported as being in maintenance mode.

ncli host ls | grep -C7 [IP-Address of CVM]

Exit the CVM from maintenance mode using the UUID fetched in the previous step.

ncli host edit id=[UUID] enable-maintenance-mode="false"

Verify that the CVM has been removed from maintenance mode using the following command.

cluster status | grep CVM

The CVM is now out of maintenance mode.

Ensure that the “Data Resiliency” and metadata sync statuses have come back to Normal in the Prism portal. It may take a few minutes to reflect.

Note: In the given commands, parameters in brackets [ ] should be replaced with the appropriate value.

For example –

ncli host ls | grep -C7 [IP-Address of CVM]   ->   ncli host ls | grep -C7 169.254.20.1

Intended Audience – Administrators of Nutanix Virtual Computing Platform with vSphere ESXi.

Thanks for reading the post and do share your views 🙂


Never Stop Learning !


ESXi Host Sync Issue with vCenter – A general System error occurred: Failed to login with vim administrator password

A few weeks back, we experienced an ESXi host in a not-responding state. The symptoms were:

  • Host reported as not responding in vCenter
  • Unable to connect to the host directly through the vSphere Client
  • Not responding to any of the commands in the DCUI
  • But the VMs were running without any issues and were accessible on the network

The ESXi host was found with lockdown mode enabled, SSH disabled, and no remote logging configured. So there was no option available to connect to it remotely during that scenario to troubleshoot.

A few days later, the ESXi host started responding to commands in the DCUI. So we enabled SSH and disabled lockdown mode, which would help us troubleshoot further in this state.

Fortunately, we could connect to the ESXi host directly through the vSphere Client after disabling lockdown mode. But the host was still reported as “Not responding” in vCenter.

We tried reconnecting the host back to vCenter, since it was perfectly accessible through the vSphere Client directly. But the reconnect operation failed with the error below:

Error – A general System error occurred: Failed to login with vim administrator password

We could see vim.fault.InvalidLogin event traces in the vpxa log reporting “Cannot complete login due to incorrect user name and password”.

Considering the above error, we suspected an issue with the vpxa agent. So we were planning to completely remove the host from the vCenter inventory and add it back.

But at the same time, while processing the vCenter connect request on the hypervisor end, the event below was found logged in the ESXi events:

Error – The ramdisk ‘var’ is full. As a result, the file /var/run/vmware/tickets/vmtck could not be written

The above error clearly indicates that the var partition is full, and we confirmed the same by running the following command in a PuTTY session on the ESXi host:

vdf -h

Ramdisk          Size   Used  Available  Use%  Mounted on
root              32M     2M        29M    9%  --
etc               28M   396K        27M    1%  --
opt               32M     0B        32M    0%  --
var               48M    48M         0M  100%  --
tmp              256M   180K       255M    0%  --
iofilters         32M     0B        32M    0%  --
hostdstats      2053M     7M      2045M    0%  --
stagebootbank    250M     4K       249M    0%  --

The above output confirms that the var partition is full.

So we moved the old log files that were consuming the space in the var partition to a local datastore, after which we could see sufficient free space available in the var partition. If you are not familiar with the commands to move the files, you can use the WinSCP tool for the same.
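For reference, a minimal sketch of doing it from the shell (the datastore name and the rotated-log pattern here are examples; adjust them to your environment):

# On the ESXi host: inspect /var/log for large, old files ...
ls -lh /var/log
# ... and move rotated logs out to a local datastore (names are examples).
mkdir -p /vmfs/volumes/datastore1/old-logs
mv /var/log/*.gz /vmfs/volumes/datastore1/old-logs/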

Ramdisk          Size   Used  Available  Use%  Mounted on
root              32M     2M        29M    9%  --
etc               28M   396K        27M    1%  --
opt               32M     0B        32M    0%  --
var               48M     1M        47M    2%  --
tmp              256M   180K       255M    0%  --
iofilters         32M     0B        32M    0%  --
hostdstats      2053M     7M      2045M    0%  --
stagebootbank    250M     4K       249M    0%  --

Going back to the main problem, we then tried connecting the ESXi host back to vCenter. The task completed successfully, and the host also came back in sync with vCenter.

So whenever you face a host sync issue with vCenter, please have a check on the ramdisk usage as well.
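A quick way to spot-check it, building on the vdf output shown above:

# On the ESXi host: show ramdisk usage for the var partition.
vdf -h | grep -i var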


Never Stop Learning !