I am out of ideas - High Latency on a LUN - on hosts with no VMs

jibbajabba

This has been quite an eventful week with not much sleep. At the moment we are in a situation where no one knows what else we can do. Let me first explain what happened.

We introduced an additional blade to our infrastructure. It was load-tested for 10 days, all stable and nice. Then on Monday that host disappeared from vCenter. The host itself was still up, it just could not connect to vCenter / the Client. The VMs were up too, so that was a bonus. After hours with VMware support they basically gave up and we had no choice but to bounce the host - well, to add insult to injury, HA didn't work and did not fail the VMs over. The problem in scenarios like that is that while the (disconnected) host is still in vCenter, so are its VMs, which show as disconnected but powered on - which they are not. So you cannot even migrate them (like you can with powered-off VMs).

Next "solution" was to remove the host from vCenter. At this stage we were finally able to add the VMs back to the inventory using other hosts. Of course there were some corruptions / broken VMs / Fricked up VMDK descriptor files and the list (and hours) go on.

We initially thought that was it - far from it ... we continued to see latencies on all datastores / hosts of 250k-700k ms ... yepp .. 700.000 ms ...
A power-on operation (or even adding VMs back into the inventory) took up to 30 minutes / VM.

Anyway ... we obviously opened tickets with the storage vendor as well and they of course blamed VMware .. I actually managed to get both in a phone conference, VMware and Storage vendor with VMware confirming yet again a storage issue. Three days later still no result.

At some point we had a hunch - all these VMs, which were affected, were also migrated using DRS (when you least need it) which bombed out when the host crashed the second time (before we finally pulled the blade).

Locks - our guess .. So some VMs we expected to be the culprits were rebooted .. and voilà ... latency gone.

No one can explain what happens, why that "fixed" some issues, but heh - we were happy ...

Well now the weirdest thing ... and to actually finally get to the point, we have two hosts .. EMPTY hosts .. no VMs, showing the same sort of device latency on ONE particular datastore. As soon as you put the hosts back into maintenance mode, the latency goes down to nothing

Above shows where the host was taken out of maintenance mode and put back in again.

Now the VMkernel logs show some SCSI aborts, and yes, this is likely due to storage issues which we may still have. But how can the only hosts showing latency be the ones with no VMs on them, and only while they are out of maintenance mode, when they look fine in maintenance mode and all the other hosts, with VMs running, are fine?

Now we are in a blame loop - storage vendor blames vmware, vmware blames storage vendor.

VMware Support also just shrugs when I try to get an explanation of how rebooting a VM can cause the latency to calm down, as it surely shouldn't make a difference if the storage back end is to blame ....

So I hope someone here can give me some pointers, because right now we are out of ideas (and clearly so are the vendors)

Comments

blargoe

Just to be clear, the datastore in question is shared among all hosts, but only these two particular hosts that have no VMs are complaining about latency?

Honestly, I would also post this over on the VMware VMTN forums, there are far more eyes and ears over there and quite a few official vExperts lurking who have 1000's of vSphere implementations collectively amongst them.

lsud00d

Have you tried to reproduce the issue and capture the traffic with something like Wireshark? A pcap might give you a better idea of what's going on at the network layers, potentially giving you more grounds for blaming storage vs. VMware and making them fix it

bighornsheep

iSCSI storage or FC? Check the storage adapter config of the hosts to make sure your paths are active.

jibbajabba

FC and yes, these are shared datastores and no other host is showing this issue. My worry is that if a host reboots it will start showing the same symptoms, considering that I even reinstalled one host and it started showing the latency as soon as I added it to the cluster.

Mmmm... as soon as I added it to the cluster - makes me wonder .. I THINK I didn't see the issues while the host was on its own (with LUNs) and yes, I did post over at the VMware forums...

kj0

So is there anything running on these hosts at the moment? are you able to reinstall ESXi and update?

If no other hosts are showing this, but are connected, it sounds like something is corrupted or trying to access the datastore from the hypervisor. I could well be wrong and confused, but 15 minutes to reinstall would knock off another possibility.

jibbajabba

kj0 wrote: »

So is there anything running on these hosts at the moment? are you able to reinstall ESXi and update? .

jibbajabba wrote: »

considering that I even reinstalled one host and it started showing the latency as soon as I added it to the cluster.

jibbajabba wrote: »

Well now the weirdest thing ... and to actually finally get to the point, we have two hosts .. EMPTY hosts .. no VMs, showing the same sort of device latency on ONE particular datastore. As soon as you put the hosts back into maintenance mode, the latency goes down to nothing

Thanks though ...

kj0

jibbajabba wrote: »

Thanks though ...

HAHA... I even read it a second time before posting and I still missed that line.

The next step then would be Wireshark I guess.

jibbajabba

Not sure Wireshark helps. Storage is purely FC and hosts are empty.

EV42TMAN

Have you checked the FC SAN you're connecting to? You could have a dying drive or something

TheProf

Could it be something to do with HA datastore heartbeat on the cluster? maybe that's doing something weird? if you turn off HA, do you still see this issue?

blargoe

Are the two problem hosts physically connected to the same FC switch or switch module, whereas the others perhaps are not? Same rack (sharing a FC patch panel perhaps) or anything else shared between only these two hosts that the rest may not have in common?

kj0

jibbajabba wrote: »

Not sure Wireshark helps. Storage is purely FC and hosts are empty.

Dammit, That's me out now. HAHA, I'll take notes next time before replying with a suggestion - Been playing with iSCSI way too much lately.

What about a dodgy cable or two or Controller?

jibbajabba

@ EV42TMAN
Yea - checked all that. Storage vendor cannot see anything / no failed disk

@TheProf
Mmm... You might be onto something ... I removed the host from the cluster and put it by itself and like putting the host into maintenance mode, it seems to be fixing it.

I might turn HA off tonight and see what happens.

@blargoe
These are blades so all hosts are connected to the same fabric. One thing we have scheduled for next week is rebooting one storage head (active / passive unfortunately) because we are also missing standby paths

@kj0
These are blades so the cables are connected to the blade switch and not to the host itself.
So unfortunately that is not it either, otherwise we'd see the issues on the other hosts as well

Everyone

I don't know what support level you have with these vendors... but this is the kind of thing where you should have someone from all parties involved onsite. I've gone onsite to customers and also had VMware + Network vendor + Storage vendor there as well. Little less blame game, little more working together to resolve the issue for the customer. I would suggest SAN oversubscription as a possibility here, but typically you'd see performance problems across all hosts connected to the SAN on the same fabric if that were the case.

jibbajabba

TheProf wrote: »

Could it be something to do with HA datastore heartbeat on the cluster? maybe that's doing something weird? if you turn off HA, do you still see this issue?

Kudos Sir ... that did the trick ... I just chose different datastores for heartbeat (the one in question was indeed being used for heartbeat) and bang .. all down now (latency that is, not the hosts) ...

We still have other issues, but this was one concern of many .. which now seems to be solved.... which again, VMware blamed on our storage vendor ... Sigh ....

Guess what time I made the change

@Everyone .. we got 24/7 Production Support (VSPP) .. Never even heard that VMware goes onsite ... ?!?
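
For anyone wanting to check the same thing on their own cluster: a minimal Python sketch that reports which datastores a cluster has configured as preferred HA heartbeat datastores. It assumes objects shaped like pyVmomi's `vim.ClusterComputeResource`, and uses the vSphere API fields `configurationEx.dasConfig.enabled`, `hBDatastoreCandidatePolicy` and `heartbeatDatastore`. Connecting to vCenter is left out, and the cluster and datastore names below are hypothetical stand-ins so the sketch runs on its own.

```python
from types import SimpleNamespace


def heartbeat_report(cluster):
    """Summarise the HA heartbeat datastore settings of one cluster.

    `cluster` is assumed to look like a pyVmomi vim.ClusterComputeResource:
    it needs .name and .configurationEx.dasConfig.
    """
    das = cluster.configurationEx.dasConfig
    # heartbeatDatastore may be unset (None) when no preference is configured.
    preferred = [ds.name for ds in (das.heartbeatDatastore or [])]
    return {
        "cluster": cluster.name,
        "ha_enabled": bool(das.enabled),
        "policy": das.hBDatastoreCandidatePolicy,
        "preferred_datastores": preferred,
    }


def uses_datastore(report, datastore_name):
    """True if HA is on and the datastore is a configured heartbeat preference."""
    return report["ha_enabled"] and datastore_name in report["preferred_datastores"]


# Stand-in object so the sketch runs without a vCenter (hypothetical names).
fake_cluster = SimpleNamespace(
    name="cluster-a",
    configurationEx=SimpleNamespace(
        dasConfig=SimpleNamespace(
            enabled=True,
            hBDatastoreCandidatePolicy="userSelectedDs",
            heartbeatDatastore=[
                SimpleNamespace(name="lun-07"),
                SimpleNamespace(name="lun-09"),
            ],
        )
    ),
)

report = heartbeat_report(fake_cluster)
print(report)
print("suspect LUN used for heartbeat:", uses_datastore(report, "lun-07"))
```

Note that this only shows the configured preference. Which datastores HA actually picked at runtime is separate state that the sketch does not read, so treat a negative result with caution.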

TheProf

jibbajabba wrote: »

Kudos Sir ... that did the trick ... just chosen different datastores (the one in question was indeed used for heartbeat) for heartbeat and bang .. all down now (latency that is, not the hosts) ...

We still have other issues, but this was one concern of many .. which now seems to be solved.... which again, VMware blamed on our storage vendor ... Sigh ....

Guess what time I made the change

@Everyone .. we got 24/7 Production Support (VSPP) .. Never even heard that VMware goes onsite ... ?!?

Good stuff! Glad you got it working!

Everyone

jibbajabba wrote: »

@Everyone .. we got 24/7 Production Support (VSPP) .. Never even heard that VMware goes onsite ... ?!?

They do, and I only know that because I’ve been onsite with a customer and also had my equivalent from VMware there. Sounds like you got it sorted out and maybe don’t need it now though.

jibbajabba

Everyone wrote: »

They do, and I only know that because I’ve been onsite with a customer and also had my equivalent from VMware there. Sounds like you got it sorted out and maybe don’t need it now though.

We still have a lot of issues, including the blame war. I was surprised I got them onto a conference call .. Then VMware said yet again it is a storage issue, and once we were off the phone, we got an email from the storage vendor saying it's a VMware issue.

What support level do you need to get the guys onsite ? I can imagine you get that by default with VCE / FlexPod ...

Edit: Never mind : http://www.vmware.com/uk/support/services/mission-critical.html

Everyone

There you go, I would have had no idea. With Microsoft it's "Premier" support to get someone onsite. If you were running Hyper-V and had a CritSit like this and a Premier contract... you would have had someone onsite ASAP. I just knew VMware had a similar support offering for their products, and you seem to have found it.

jibbajabba

Yea and it is only $50k+ / year

Everyone

For the VMware support? If they offer anything close to what MS does, there's way more value in it than you might think. It goes way beyond dropping in and saving the day when you have a critical outage. Lots of opportunities for proactive work and to learn from some of the best in the business on a specific technology.

jibbajabba

Everyone wrote: »

For the VMware support? If they offer anything close to what MS does, there's way more value in it than you might think. It goes way beyond dropping in and saving the day when you have a critical outage. Lots of opportunities for proactive work and to learn from some of the best in the business on a specific technology.

Oh I agree ... That's my view - just not the view of the director with the chequebook

jibbajabba

VMware Support never ceases to amaze me

From: "VMware Technical Support"
Sent: 16 October 2013 18:58
To:
Cc:
Subject: VMware Support Request 1

** Please do not change the subject line of this email if you wish to respond. **

Hello Michael,

Another thought, if latency re-occurs, you could try disabling HA,
VMware KB: Disabling VMware High Availability (HA)
To disable VMware HA:
1. In the vSphere Client, right-click the cluster and click Edit Settings.
2. Deselect the Turn On VMware HA check box.
3. Click OK.
Note: The process may take some time to complete.

TheProf

They're probably watching this thread

jibbajabba

TheProf wrote: »

They're probably watching this thread

Possibly ...

ElevenBravo

Curious - What storage vendor are you using and what model is it?

jibbajabba

ElevenBravo wrote: »

Curious - What storage vendor are you using and what model is it?

Doesn't really matter now .. we will be replacing it within 6 weeks.