CISCO UCSC-C240-M4SX – RAID Write Policy Consideration

Hi All –

Recently, we deployed ESXi on CISCO UCSC-C240-M4SX servers in our infrastructure and used them to migrate VM workloads from old hardware that had reached End of Life.

During the migration, we experienced slow data transfer. The transfer speed was approximately 16 MBps, and a VM migration with 250 GB of used space took close to 10 hours. The migration was happening within the same datacenter, and the management network of the ESXi hosts was part of the same VLAN.

In our past experience, a similar 250 GB VM migration within the datacenter would complete in 2 hours at most. So, we decided to trace the cause of the slowness experienced with this CISCO server model.

At first, we checked the speed and connectivity status of the network ports used by the management network. No issues were found.

Next, we moved on to the performance statistics of the local SCSI controller, since the VM data migration targeted a local RAID datastore. The performance chart indicated that latency was consistently above 20 ms during the migration writes. This made us suspect an issue with the SCSI controller.
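
If you want to cross-check the latency from the host itself (a quick sketch, assuming SSH access to the ESXi host), esxtop is handy: run esxtop, press “d” for the disk adapter view (or “u” for the device view) and watch the DAVG/cmd column, which shows the latency reported by the device/controller.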

These servers come with a CISCO 12G SAS Modular RAID Controller with 1 GB Flash-Backed Write Cache. 24 x 300 GB SAS magnetic disks were attached to it, partitioned as follows: 2 disks in a RAID 1 store, and the remaining disks split into 3 RAID 5 stores. Even though the controller has a flash-backed cache, the RAID write policy was set to “Write Through” by default. This was the cause of the slowness during VM migration.

So, what are the Flash-Backed Write Cache and the RAID write policy setting?

Flash-Backed Write Cache (FBWC) is a component that helps the controller move data held in cache memory to a non-volatile flash device in case of an unexpected power failure.

For reference, a white paper from HP explaining FBWC – https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c00687518

The RAID write policy is the setting that instructs the controller whether or not to use cache memory for write operations.

Basically, there are two settings available – Write Through and Write Back.

Write Through – Data is written directly to the disks. The controller sends the data-transfer completion signal to the host once the disks have received all the data in the transaction.

Write Back – Data is first written to the controller’s cache memory and then destaged to the disks. The controller sends the data-transfer completion signal to the host once the cache has received all the data in the transaction.

We changed the write policy of the RAID stores to “Write Back”, since the SCSI controller has a flash-backed write cache. After this change, we re-initiated the migration of a VM with 250 GB of used space to see the difference. This time the migration speed was around 200 MBps, and the migration took approximately 23 minutes to complete 🙂
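
For reference, here is a minimal sketch of how the write policy can be checked and changed from the command line with the LSI/Avago storcli utility, assuming the tool is available and the controller enumerates as /c0 (the same change can also be made through CIMC or the controller option ROM; validate the exact syntax against your storcli version):

storcli /c0/vall show all   –>  shows the properties, including the write cache policy, of all virtual drives
storcli /c0/v0 set wrcache=wb   –>  sets virtual drive 0 to Write Back (wt = Write Through, awb = Always Write Back)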

In essence, CISCO UCSC-C240-M4SX servers ship with the “Write Through” RAID write policy by default. Hence, we suggest validating and changing it according to your requirements and usage before hosting workloads.

A white paper is available at the link below for more insight:

https://www.cisco.com/c/en/us/products/servers-unified-computing/ucs-c-series-rack-servers/whitepaper-c11-738090.html

Thanks for reading the post and do share your views 🙂


Never Stop Learning !

 


VM Power-ON Operation fails with Error – Device ‘Bootstrap’ is not available

In this post, we are going to look at an issue that occurred during the power-on operation of a virtual server. Recently, we experienced an APD (All Paths Down) scenario in a cluster; the issue was found to be at the storage layer and was resolved.

After that, we observed that some of the virtual servers which were online before the APD scenario were left in a powered-off state.

Symptoms pertaining to those VMs are listed below:

  • The power-on operation of the VM fails with the error – Device ‘Bootstrap’ is not available.
  • Mostly, the lock is reported on the .vmx.lck (or) .vswp file. You can see this information in the ‘Events’ section of the virtual machine.
  • The VM continuously roams around the hosts in the cluster.

How did we resolve the issue?

As the VM was roaming around, we disabled both HA and DRS for the time being.

Next, we decided to find out which host was still holding the lock, using the steps below.

  • Logged in to one of the hosts in the cluster via PuTTY. Navigated to the VM folder location using the cd command [ e.g. cd /vmfs/volumes/<Datastore-Name>/<VM-Name> ]. Once there, used the ls command to verify the presence of the files related to the virtual machine.
  • Identified the owner of the locked file using the following command.
  • Command – vmkfstools -D <VM-Name>.vmx.lck. For example, if the VM name is Test01, then the command would be – vmkfstools -D Test01.vmx.lck
  • In the output, the MAC address of the lock-owner host appears in the last segment of the owner field.
  • e.g. owner 00000000-00000000-0000-000000000000 –> the last segment shows the MAC address associated with the VMkernel management network port group of the lock owner.
  • If the owner field shows all zeros, then no lock is held. In our case, we got the MAC address of the locking owner (see the note after this list for mapping the MAC to a host).
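
To map that MAC address back to a host, one simple approach (a sketch, assuming shell access to the hosts in the cluster) is to list the VMkernel ports on each host and compare their MAC addresses with the one shown in the owner field:

esxcfg-vmknic -l   –>  lists the VMkernel interfaces of the host along with their MAC addresses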

We manually migrated the VM to the lock-owner host identified in the previous step and powered it on. This time, the VM came online successfully, since the lock ownership resides with this host.

Once the above remedial actions were completed, we re-enabled HA and DRS on the cluster.

 

Thanks for reading the post and do share your views 🙂


Never Stop Learning !

HCI – A Path to SDDC

In the current situation, we need a software plane that enables the following key requirements:

  • Virtualize, unify, optimize and manage DC components such as compute, storage and network.
  • Integrate with automation and cloud platforms.
  • Support disaster recovery.

Here We Go …

Hyper-Converged Infrastructure [HCI]

HCI is a software-based architecture that tightly integrates compute, storage, network and virtualization. The vital part is that the local storage of the physical servers in the cluster is unified and presented as a pool of shared storage to enable virtualization features. By adding automation and network virtualization capabilities to HCI, you can turn it into a complete SDDC.

In HCI, there are two variants –

  • A software piece that converges the storage resources, built directly into the hypervisor layer.
  • A third-party software piece that converges the storage resources, running in a virtual machine that sits on top of the hypervisor.

In this post, we are going to look further at the hot features of the built-in HCI variant – vSAN.

How does hyper-converged infrastructure pave the path to SDDC?

Data Locality

Traditional infrastructure comprises rack/blade servers, purpose-built storage arrays and network components as separate units. To form a vSphere cluster, we need shared LUNs from the storage array. In this model, data flowing from server to storage array has to pass through the intermediate fabric / network switches. The data path is long, and multiple components are involved in this approach.

But in HCI, compute and storage are tightly coupled; data is stored in local storage [flash and magnetic disks], which greatly reduces the length of the data path and the latency.

In recent times, technology advancements focused on bringing data closer to the processing unit have resulted in flash devices on the PCIe bus and in memory channels [DIMM slots]. These kinds of hardware-level developments further encourage the evolution of hyper-converged infrastructure.

Scalability

Here we have the flexibility to scale out and scale up as per demand. Scale-out is done by adding a new host to the cluster; scale-up by adding drives to a host. In both cases, we can expand the capacity and cache layers in a uniform manner. This scalability mechanism helps optimize cost, since we expand the cluster based on demand.

Performance

HCI delivers superior performance through the following two things:

  • Hot VM data resides in a cache layer backed by flash devices / SSDs, which enables high-speed data access.
  • SPBM (Storage Policy Based Management) provides a granular, per-VM approach. This allows the administrator to change a policy on demand to meet application requirements. SPBM definitions take capacity, performance and availability metrics into account. It’s cool, right?

Efficiency

HCI has started to offer storage efficiency features such as erasure coding, compression and deduplication. This means the storage resources are utilized effectively, in line with the demand of the user VMs.

Availability

Policy definition is key here to maintaining data availability in case of a node/host failure. For example, if the policy is defined to tolerate one host failure, then two copies of the VM data are maintained across different nodes. In essence, to tolerate “N” failures, “N+1” copies of the VM data are maintained. So, HCI is capable of providing enterprise-class availability.
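
As a quick worked example with mirroring: a VM with a 100 GB virtual disk and a policy tolerating one failure (N = 1) consumes roughly 200 GB of raw capacity, since two full copies of the data are kept on different nodes (plus a small witness component on a third node for quorum).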

Management

Unified management is available with vCenter and the Web Client. You can manage the vSAN cluster with the same user interface you use to take care of your virtual infrastructure. Even if you extend vSAN with network virtualization capability (VMware NSX), it is manageable from the same platform – a single window to manage your SDDC.

Disaster Recovery

vSphere Replication is a solution that can be used to replicate VMs asynchronously with the required RPO; HCI is a well-suited and cost-effective platform to support it. We can customize the replication configuration per VM, so a granular approach is available here as well.

To conclude, I see all of these points as solid proof that HCI is going to play a vital role in the rise of the SDDC.

Thanks for reading the post and do share your views 🙂


Never Stop Learning !

Nutanix Cluster Shutdown and Start-Up

Hi all –

In this post, we are going to look at the steps involved in shutting down and starting up a Nutanix cluster.

In general, we will not very often be in a scenario where we need to bring down a Nutanix cluster. Most of the time, for host/node-level maintenance activities, we can take one (or) two hosts at a time, based on the Replication Factor (RF) configured.

However, some DC infrastructure and network-layer activities may disturb the network availability and power state of all the nodes/hosts in the Nutanix cluster. In those scenarios, we need to shut down the Nutanix cluster prior to the maintenance and bring it up once the maintenance is completed.

Steps to shut down the Nutanix Cluster:

  1. Ensure that the “Data Resiliency” status of the target cluster is Normal and that there are no active critical alerts.
  2. Shut down the user VMs present on the Nutanix-provided datastores.
  3. Log in to any one of the CVMs through SSH.
  4. Run the command “cluster stop” to bring down the Nutanix cluster.
  5. After executing the above command, ensure that all services except Zeus and Scavenger have come down on all the CVMs. You can check this by running the command “cluster status” (a condensed sketch of these CVM commands is shown after this list).
  6. Once the cluster is down, the Nutanix-provided datastores mounted on the ESXi hosts will become inaccessible.
  7. Now you can bring down all the CVMs associated with that cluster gracefully from the vSphere Client (or) Web Client [using the Shut Down Guest option].
  8. All the VMs and the Nutanix cluster are now brought down properly. You can place the ESXi hosts in maintenance mode.
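
For reference, the CVM-side portion of the above procedure boils down to a short command sequence (a sketch, executed in an SSH session to any CVM):

cluster status   –>  confirm the state of the services before stopping
cluster stop   –>  stop the Nutanix cluster (the command asks for confirmation)
cluster status   –>  verify that only Zeus and Scavenger remain up on every CVM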

Steps to start up the Nutanix Cluster:

  1. Exit the ESXi hosts from maintenance mode.
  2. Power on the CVMs and wait some time for them to complete booting up.
  3. Once all the CVMs are up, log in to any one of them via SSH.
  4. Run the command “cluster start” to bring up the Nutanix cluster.
  5. After executing the above command, ensure that all services have come up by running the command “cluster status”.
  6. Within a few minutes, the PRISM portal will become accessible, and you can see the “Data Resiliency” status reporting back as Normal.
  7. All the Nutanix-provided datastores will become accessible to the ESXi hosts again; now you can power on the required user VMs.

Intended Audience – Administrators of Nutanix Virtual Computing Platform with vSphere ESXi.

Thanks for reading the post and do share your views 🙂


Never Stop Learning !

 

 

 

Nutanix Cluster – Enabling Maintenance mode on ESXi Host

Let’s have an overview of the Nutanix Virtual Computing Platform before jumping directly into the steps for enabling maintenance mode.

Hyper-Converged Infra (HCI)

HCI is a software-based architecture that tightly integrates compute, storage, network and virtualization. The vital part is that the local storage of the physical servers in the cluster is converged and presented as a pool of shared storage to enable virtualization features.

Nutanix Virtual Computing Platform

The Nutanix Virtual Computing Platform is a hyper-converged infrastructure. Here, Nutanix uses its own NDFS (Nutanix Distributed File System) to converge the storage resources.

In general, the physical hosts in a Nutanix cluster are installed with a standard hypervisor (in this case, assume ESXi) and have their own hardware resources such as processor, memory, storage and network.

Nutanix places its Controller VM (CVM) on each host/node of the cluster. It is responsible for forming the unified shared storage resource and serving the I/O from the hypervisor. So, the CVM is the key piece that enables storage-level convergence.

Logically, we can say:

  • Compute-level clustering happens with the help of vSphere HA and DRS in the hypervisor.
  • Storage-level convergence / clustering happens with the help of the Nutanix CVM.

So, when taking a host/node out for an activity, two levels of maintenance have to be applied:

  • Hypervisor-level maintenance
  • CVM-level maintenance

Consider a Nutanix cluster with 5 ESXi hosts where the replication factor is set to withstand a single node failure. In that case, we can safely take one node out for maintenance.

Steps to enable maintenance mode:

It is good to collect the NCC output and have it verified with Nutanix to ensure that there are no existing critical issues in the cluster.
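
If you want to run the health checks yourself before engaging Nutanix, a commonly used command (run from any CVM; assuming NCC is installed, which it normally is by default) is:

ncc health_checks run_all   –>  runs the full set of NCC health checks and summarizes any failures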

Verify the “Data Resiliency” status of the Nutanix cluster in the PRISM portal; it should be Normal before starting the activity.

As a first step, we have to migrate all the user VMs residing on the target host (except the CVM) to the other hosts available in the cluster.

Connect to the CVM of the target ESXi host via SSH and execute the command below to find its UUID.

ncli host ls | grep -C7 [IP-Address of CVM]

Place the CVM in maintenance mode using the UUID fetched in the previous step.

ncli host edit id=[UUID] enable-maintenance-mode="true"

Verify that the CVM has been placed in maintenance mode using the following command. At this stage, CVM-level maintenance mode is enabled and confirmed.

cluster status | grep CVM

Now shut down the CVM using the command below.

cvm_shutdown -h now

Now it is safe to enable maintenance mode at the hypervisor level, since all the user VMs have been migrated to other nodes and the CVM has been brought down gracefully in the previous steps.

Place the target ESXi host in maintenance mode and proceed with your maintenance activity.
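
If you prefer to do this from the command line instead of the vSphere client, a minimal sketch (run in an SSH session to the target host) is:

esxcli system maintenanceMode set --enable true   –>  place the host in maintenance mode
esxcli system maintenanceMode get   –>  confirm that the host reports “Enabled”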

Steps to exit from maintenance mode:

Once the maintenance activity is completed, we have to add the node back to the cluster.

Exit the ESXi host from maintenance mode and power on the CVM.

Connect to a neighboring CVM in the cluster via SSH.

Check the status of the CVM we powered on. At this stage, it should be reported as being in maintenance mode.

ncli host ls | grep -C7 [IP-Address of CVM]

Take the CVM out of maintenance mode using the UUID fetched in the previous step.

ncli host edit id=[UUID] enable-maintenance-mode="false"

Verify that the CVM has been removed from maintenance mode using the following command.

cluster status | grep CVM

The CVM is now out of maintenance mode.

Ensure that the “Data Resiliency” and metadata sync status come back to Normal in the PRISM portal. It may take a few minutes to reflect.

Note: In the commands above, parameters in brackets [ ] should be replaced with the appropriate values.

For example –

ncli host ls | grep -C7 [IP-Address of CVM]   –>   ncli host ls | grep -C7 169.254.20.1

Intended Audience – Administrators of Nutanix Virtual Computing Platform with vSphere ESXi.

Thanks for reading the post and do share your views 🙂


Never Stop Learning !

ESXi Host Sync Issue with vCenter – A general System error occurred: Failed to login with vim administrator password

A few weeks back, we experienced an ESXi host “not responding” issue. The symptoms were –

  • The host was reported as not responding in vCenter
  • We were unable to connect to the host directly through the vSphere Client
  • The host was not responding to any commands in the DCUI
  • But the VMs were running without any issues and were accessible on the network

The ESXi host had lockdown mode enabled, SSH disabled and no remote logging configured. So, there was no option available to connect to it remotely in that scenario to troubleshoot.

A few days later, the ESXi host started responding to commands in the DCUI. So, we enabled SSH and disabled lockdown mode, which would help us troubleshoot further in this state.

Fortunately, we could connect to the ESXi host directly through the vSphere Client after disabling lockdown mode. But the host was still reported as “Not responding” in vCenter.

We tried reconnecting the host back to vCenter, since it was perfectly accessible directly through the vSphere Client. But the reconnect operation failed with the error below:

Error – A general System error occurred: Failed to login with vim administrator password

We could see vim.fault.InvalidLogin event traces in the vpxa log reporting “Cannot complete login due to incorrect user name and password”.

Considering the above error, we suspected an issue with the vpxa agent. So we were planning to completely remove the host from the vCenter inventory and add it back.

But at the same time, while the vCenter connect request was being processed on the hypervisor end, the event below was logged in the ESXi events:

Error – The ramdisk ‘var’ is full. As a result, the file /var/run/vmware/tickets/vmtck  could not be written

The above error clearly indicates that the var ramdisk is full, and we confirmed the same with the following command –> vdf -h in a PuTTY session on the ESXi host.

Ramdisk                Size    Used  Available Use% Mounted on
root                         32M     2M       29M         9%        —
etc                            28M  396K      27M         1%        —
opt                            32M      0B      32M         0%        —
var                            48M   48M        0M     100%        —
tmp                         256M  180K    255M         0%        —
iofilters                    32M       0B       32M        0%        —
hostdstats            2053M      7M   2045M       0%         —
stagebootbank      250M      4K      249M       0%         —

The above output confirms that the var partition is full.

So, we moved the old log files that were blocking space in the var partition to a local datastore, after which we could see sufficient free space available in var. If you are not familiar with the commands to move files, you can use the WinSCP tool for the same.
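
If you prefer the command line over WinSCP, a minimal sketch (the paths are only examples; adjust them to the files actually consuming space under /var) is:

mkdir /vmfs/volumes/<Datastore-Name>/old-var-logs   –>  create a folder on a local datastore
mv /var/log/<old-log-file> /vmfs/volumes/<Datastore-Name>/old-var-logs/   –>  move the old log file out of the var ramdisk
vdf -h   –>  re-check the ramdisk usage

After the cleanup, vdf -h reported the usage below: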

Ramdisk                Size    Used  Available Use% Mounted on
root                         32M     2M       29M         9%        —
etc                            28M  396K      27M         1%        —
opt                            32M      0B      32M         0%        —
var                            48M      1M     47M        2%        —
tmp                         256M  180K    255M         0%        —
iofilters                    32M       0B       32M        0%        —
hostdstats            2053M      7M   2045M       0%         —
stagebootbank      250M      4K      249M       0%         —

Going back to the main problem, we then tried reconnecting the ESXi host in vCenter. The task was successful, and the host came back in sync with vCenter.

So whenever you face a host sync issue with vCenter, please check the ramdisk usage as well.


Never Stop Learning !