I am out of ideas - High Latency on a LUN - on hosts with no VMs
jibbajabba
Member Posts: 4,317 ■■■■■■■■□□
This has been quite an eventful week with not much sleep. At the moment we are in a situation where no one knows what else we can do. Let me first explain what happened.
We introduced an additional blade to our infrastructure. It was load-tested for 10 days, all stable and nice. Then on Monday that host disappeared from vCenter. The host itself was still up, it just could not connect to vCenter / the Client. The VMs were up too, so that was a bonus. After hours with VMware support they basically gave up and we had no choice but to bounce the host - and, to add insult to injury, HA didn't work and did not fail the VMs over. The problem in scenarios like that is that while the (disconnected) host is still in vCenter, so are its VMs - disconnected but showing as powered on, which they are not. So you cannot even migrate them (like you can with powered-off VMs).
Next "solution" was to remove the host from vCenter. At this stage we were finally able to add the VMs back to the inventory using other hosts. Of course there were some corruptions / broken VMs / Fricked up VMDK descriptor files and the list (and hours) go on.
We initially thought that was it - far from it ... we continued to see latencies on all datastores / hosts of 250k-700k ms ... yep .. 700,000 ms ...
A power-on operation (or even adding VMs back into the inventory) took up to 30 minutes / VM.
Anyway ... we obviously opened tickets with the storage vendor as well and they of course blamed VMware .. I actually managed to get both in a phone conference, VMware and Storage vendor with VMware confirming yet again a storage issue. Three days later still no result.
At some point we had a hunch - all these VMs, which were affected, were also migrated using DRS (when you least need it) which bombed out when the host crashed the second time (before we finally pulled the blade).
Locks - our guess .. So some VMs we suspected of being the culprits were rebooted .. and voilà ... latency gone.
No one can explain what happened or why that "fixed" some issues, but hey - we were happy ...
Well now the weirdest thing ... and to actually finally get to the point: we have two hosts .. EMPTY hosts .. no VMs, showing the same sort of device latency on ONE particular datastore. As soon as you put the hosts back into maintenance mode, the latency goes down to nothing.
(Screenshot: the latency graph shows where the host was taken out of maintenance mode and put back in again.)
Now the VMkernel logs show some SCSI aborts and yes, this is likely due to storage issues which we may still have. However, how can the only hosts now showing latency be the ones with no VMs on them - and only while they are out of maintenance mode, looking fine again once they are back in it - while all the other hosts, with VMs running, are fine?
Now we are in a blame loop - the storage vendor blames VMware, VMware blames the storage vendor.
VMware Support also just shrugs when I try to get an explanation of how rebooting a VM can make the latency calm down, as it surely shouldn't make a difference if the storage back end is to blame ....
So I hope someone here can give me some pointers, because right now we are out of ideas (and clearly so are the vendors).
My own knowledge base made public: http://open902.com
Comments
-
blargoe Member Posts: 4,174 ■■■■■■■■■□
Just to be clear, the datastore in question is shared among all hosts, but only these two particular hosts that have no VMs are complaining about latency?
Honestly, I would also post this over on the VMware VMTN forums; there are far more eyes and ears over there, and quite a few official vExperts lurking who collectively have thousands of vSphere implementations among them.
IT guy since 12/00
Recent: 11/2019 - RHCSA (RHEL 7); 2/2019 - Updated VCP to 6.5 (just a few days before VMware discontinued the re-cert policy...)
Working on: RHCE/Ansible
Future: Probably continued Red Hat Immersion, Possibly VCAP Design, or maybe a completely different path. Depends on job demands... -
lsud00d Member Posts: 1,571
Have you tried to reproduce the issue and capture the traffic with something like Wireshark? A pcap might give you a better idea of what's going on at the network layers, potentially giving you more evidence for blaming storage vs. VMware and making them fix it.
-
bighornsheep Member Posts: 1,506
iSCSI storage or FC? Check the storage adapter config of the hosts to make sure your paths are active.
Jack of all trades, master of none
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
FC and yes, these are shared datastores and no other host is showing this issue. My worry is that if a host reboots it will start showing the same symptoms, considering that I even reinstalled one host and it started showing the latency as soon as I added it to the cluster.
Mmmm... as soon as I added it to the cluster - makes me wonder .. I THINK I didn't see the issues while the host was on its own (with LUNs), and yes, I did post over at the VMware forums...
-
kj0 Member Posts: 767
So is there anything running on these hosts at the moment? Are you able to reinstall ESXi and update?
If no other hosts are showing this, but are connected, it sounds like something is corrupted or something on the hypervisor is trying to access the datastore. I could well be wrong and confused, but 15 minutes to reinstall would knock off another possibility.
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
kj0 wrote: »So is there anything running on these hosts at the moment? are you able to reinstall ESXi and update?
jibbajabba wrote: »considering that I even reinstalled one host and it started showing the latency as soon as I added it to the cluster.
jibbajabba wrote: »Well now the weirdest thing ... and to actually finally get to the point, we have two hosts .. EMPTY hosts .. no VMs, showing the same sort of device latency on ONE particular datastore. As soon as you put the hosts back into maintenance mode, the latency goes down to nothing
Thanks though ...
-
kj0 Member Posts: 767
jibbajabba wrote: »Thanks though ...
The next step then would be Wireshark, I guess.
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
Not sure Wireshark helps. Storage is purely FC and hosts are empty.
-
EV42TMAN Member Posts: 256
Have you checked the FC SAN you're connecting to? You could have a dying drive or something.
Current Certification Exam: ???
Future Certifications: CCNP Route Switch, CCNA Datacenter, random vendor training. -
TheProf Users Awaiting Email Confirmation Posts: 331 ■■■■□□□□□□
Could it be something to do with HA datastore heartbeat on the cluster? Maybe that's doing something weird? If you turn off HA, do you still see this issue?
-
blargoe Member Posts: 4,174 ■■■■■■■■■□
Are the two problem hosts physically connected to the same FC switch or switch module, whereas the others perhaps are not? Same rack (sharing an FC patch panel perhaps) or anything else shared between only these two hosts that the rest may not have in common?
-
kj0 Member Posts: 767
jibbajabba wrote: »Not sure Wireshark helps. Storage is purely FC and hosts are empty.
What about a dodgy cable or two, or a controller?
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
@EV42TMAN
Yea - checked all that. Storage vendor cannot see anything / no failed disk.
@TheProf
Mmm... You might be onto something ... I removed the host from the cluster and put it by itself and, just like putting the host into maintenance mode, that seems to fix it.
I might turn HA off tonight and see what happens.
@blargoe
These are blades so all hosts are connected to the same fabric. One thing we have scheduled for next week is rebooting one storage head (active / passive unfortunately) because we are also missing standby paths.
@kj0
These are blades so the cables are connected to the blade switch and not to the hosts themselves.
So unfortunately that is not it either, otherwise we'd see the issues on the other hosts as well.
-
Everyone Member Posts: 1,661
I don't know what support level you have with these vendors... but this is the kind of thing where you should have someone from all parties involved onsite. I've gone onsite to customers and also had VMware + Network vendor + Storage vendor there as well. Little less blame game, little more working together to resolve the issue for the customer. I would suggest SAN oversubscription as a possibility here, but typically you'd see performance problems across all hosts connected to the SAN on the same fabric if that were the case.
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
TheProf wrote: »Could it be something to do with HA datastore heartbeat on the cluster? maybe that's doing something weird? if you turn off HA, do you still see this issue?
Kudos Sir ... that did the trick ... I just chose different datastores for heartbeat (the one in question was indeed used for heartbeat) and bang .. all down now (latency that is, not the hosts) ...
We still have other issues, but this was one concern of many .. which now seems to be solved.... which again, VMware blamed on our storage vendor ... Sigh ....
Guess what time I made the change
@Everyone .. we got 24/7 Production Support (VSPP) .. Never even heard that VMware goes onsite ... ?!?
-
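The pattern that cracked this case (idle hosts, one datastore, latency only outside maintenance mode, and that datastore being an HA heartbeat datastore) can be turned into a quick cross-check. The sketch below is illustrative only: the host names, datastore names, latency figures and the 50 ms threshold are all made up, and in practice the inputs would come from whatever monitoring and cluster settings you have at hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyReport:
    host: str          # hypothetical host name
    datastore: str     # hypothetical datastore name
    vm_count: int      # VMs registered on the host
    latency_ms: float  # observed device latency on that datastore


def suspect_heartbeat_datastores(reports, heartbeat_datastores, threshold_ms=50.0):
    """Map each HA heartbeat datastore to the idle hosts reporting high latency on it.

    A host with no VMs should generate almost no datastore I/O of its own,
    so high latency there points at cluster-level activity such as HA
    datastore heartbeating rather than at guest workloads.
    """
    suspects = {}
    for r in reports:
        idle_host = r.vm_count == 0
        slow = r.latency_ms >= threshold_ms
        if idle_host and slow and r.datastore in heartbeat_datastores:
            suspects.setdefault(r.datastore, []).append(r.host)
    return {ds: sorted(hosts) for ds, hosts in suspects.items()}


# Made-up sample data loosely shaped like the situation in this thread.
reports = [
    LatencyReport("esx-a", "ds-07", 0, 250_000.0),
    LatencyReport("esx-b", "ds-07", 0, 410_000.0),
    LatencyReport("esx-c", "ds-07", 12, 4.0),
    LatencyReport("esx-a", "ds-03", 0, 2.0),
]
heartbeat = {"ds-07", "ds-02"}

print(suspect_heartbeat_datastores(reports, heartbeat))
# prints {'ds-07': ['esx-a', 'esx-b']}
```

A hit here proves nothing by itself, but it is a cheap reason to try different heartbeat datastores before spending more days in the vendor blame loop.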
TheProf Users Awaiting Email Confirmation Posts: 331 ■■■■□□□□□□
jibbajabba wrote: »Kudos Sir ... that did the trick ... I just chose different datastores for heartbeat (the one in question was indeed used for heartbeat) and bang .. all down now (latency that is, not the hosts) ...
We still have other issues, but this was one concern of many .. which now seems to be solved.... which again, VMware blamed on our storage vendor ... Sigh ....
Guess what time I made the change
@Everyone .. we got 24/7 Production Support (VSPP) .. Never even heard that VMware goes onsite ... ?!?
Good stuff! Glad you got it working!
-
Everyone Member Posts: 1,661
jibbajabba wrote: »@Everyone .. we got 24/7 Production Support (VSPP) .. Never even heard that VMware goes onsite ... ?!?
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
Everyone wrote: »They do, and I only know that because I've been onsite with a customer and also had my equivalent from VMware there. Sounds like you got it sorted out and maybe don't need it now though.
We still have a lot of issues, including the blame war. I was surprised I got them onto a conference call .. Then VMware said yet again that it is a storage issue and, once we were off the phone, we got an email from the storage vendor saying it's a VMware issue.
What support level do you need to get the guys onsite? I can imagine you get that by default with VCE / FlexPod ...
Edit: Never mind: http://www.vmware.com/uk/support/services/mission-critical.html
-
Everyone Member Posts: 1,661There you go, I would have had no idea. With Microsoft it's "Premier" support to get someone onsite. If you were running Hyper-V and had a CritSit like this and a Premier contract... you would have had someone onsite ASAP. I just knew VMware had a similar support offering for their products, and you seem to have found it.
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
Yea and it is only $50k+ / year
-
Everyone Member Posts: 1,661
For the VMware support? If they offer anything close to what MS does, there's way more value in it than you might think. It goes way beyond dropping in and saving the day when you have a critical outage. Lots of opportunities for proactive work and to learn from some of the best in the business on a specific technology.
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
Everyone wrote: »For the VMware support? If they offer anything close to what MS does, there's way more value in it than you might think. It goes way beyond dropping in and saving the day when you have a critical outage. Lots of opportunities for proactive work and to learn from some of the best in the business on a specific technology.
Oh I agree ... That's my view - just not the view of the director holding the cheque.
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
VMware Support never ceases to amaze me
From: "VMware Technical Support"
Sent: 16 October 2013 18:58
To:
Cc:
Subject: VMware Support Request 1
** Please do not change the subject line of this email if you wish to respond. **
Hello Michael,
Another thought, if latency re-occurs, you could try disabling HA,
VMware KB: Disabling VMware High Availability (HA)
To disable VMware HA:
1. In the vSphere Client, right-click the cluster and click Edit Settings.
2. Deselect the Turn On VMware HA check box.
3. Click OK.
Note: The process may take some time to complete.
-
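For anyone who would rather script the steps from the email above than click through the vSphere Client, here is a minimal Python sketch using pyVmomi. It is an illustration under assumptions, not something taken from the thread or the KB article: the vCenter address, credentials and cluster name are placeholders, pyVmomi has to be installed, and certificate validation is skipped purely to keep the example short.

```python
import ssl


def pick_cluster(clusters, name):
    """Return the first object whose .name matches, or None if there is none."""
    return next((c for c in clusters if c.name == name), None)


def disable_ha(cluster):
    """Ask vCenter to turn HA off on the given cluster and return the task."""
    from pyVmomi import vim  # lazy import so pick_cluster works without pyVmomi

    spec = vim.cluster.ConfigSpecEx()
    # "das" is the API-side name for HA.
    spec.dasConfig = vim.cluster.DasConfigInfo(enabled=False)
    # modify=True applies only the fields set in the spec and leaves the rest alone.
    return cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)


def main():
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Placeholder connection details (hypothetical values).
    si = SmartConnect(
        host="vcenter.example.local",
        user="administrator",
        pwd="changeme",
        sslContext=ssl._create_unverified_context(),
    )
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.ClusterComputeResource], True
        )
        cluster = pick_cluster(view.view, "Production-Cluster")  # hypothetical name
        view.Destroy()
        if cluster is None:
            raise SystemExit("cluster not found")
        task = disable_ha(cluster)
        print("Reconfigure task submitted:", task.info.key)
    finally:
        Disconnect(si)
```

Calling `main()` against a lab vCenter first is the sensible way to try this. Note that in this thread the actual fix was choosing different heartbeat datastores, not switching HA off, so this only covers the fallback that support suggested.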
TheProf Users Awaiting Email Confirmation Posts: 331 ■■■■□□□□□□
They're probably watching this thread
-
ElevenBravo Member Posts: 6 ■□□□□□□□□□
Curious - What storage vendor are you using and what model is it?
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
ElevenBravo wrote: »Curious - What storage vendor are you using and what model is it?
Doesn't really matter now .. we will be replacing it within 6 weeks.