
I am out of ideas - High Latency on a LUN - on hosts with no VMs

jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
This has been quite an eventful week with not much sleep. At the moment we are in a situation where no one knows what else we can do. Let me first explain what happened.

We introduced an additional blade to our infrastructure. It was load-tested for 10 days, all stable and nice. Then on Monday that host disappeared from vCenter. The host itself was still up, it just could not connect to vCenter / the Client. The VMs were up too, so that was a bonus. After hours with VMware support they basically gave up and we had no choice but to bounce the host. To add insult to injury, HA didn't work and did not fail the VMs over. The problem in scenarios like that is that while the (disconnected) host is still in vCenter, so are its VMs, which are disconnected but showing as powered on, which they are not. So you cannot even migrate them (like you can with powered-off VMs).

Next "solution" was to remove the host from vCenter. At this stage we were finally able to add the VMs back to the inventory using other hosts. Of course there were some corruptions / broken VMs / Fricked up VMDK descriptor files and the list (and hours) go on.

We initially thought that was it - far from it ... we continued to see latencies on all datastores / hosts of 250k-700k ms ... yep .. 700,000 ms ...
A power-on operation (or even adding VMs back into the inventory) took up to 30 minutes / VM.

Anyway ... we obviously opened tickets with the storage vendor as well and they of course blamed VMware .. I actually managed to get both in a phone conference, VMware and Storage vendor with VMware confirming yet again a storage issue. Three days later still no result.

At some point we had a hunch - all these VMs, which were affected, were also migrated using DRS (when you least need it) which bombed out when the host crashed the second time (before we finally pulled the blade).

Locks - our guess .. So some VMs we suspected to be the culprits were rebooted .. and voilà ... latency gone.

No one can explain what happened or why that "fixed" some issues, but hey - we were happy ...

Well now the weirdest thing ... and to actually finally get to the point, we have two hosts .. EMPTY hosts .. no VMs, showing the same sort of device latency on ONE particular datastore. As soon as you put the hosts back into maintenance mode, the latency goes down to nothing

10zzdc9.jpg

Above shows where the host was taken out of maintenance mode and put back in again.

Now the VMkernel logs show some SCSI aborts and yes, this is likely due to storage issues which we may still have. However, why are the only hosts now showing latency the ones with no VMs on them, and only while they are out of maintenance mode? They look fine in maintenance mode, and all the other hosts, with VMs running, are fine.
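To put numbers on that comparison rather than eyeballing the chart, here is a minimal sketch. It assumes you have exported latency samples for one host and one datastore and tagged each sample with whether the host was in maintenance mode at the time. The function name and the sample values are hypothetical, not taken from our environment.

```python
from statistics import mean


def latency_by_mode(samples):
    """Average latency (ms) grouped by maintenance-mode state.

    samples: iterable of (in_maintenance_mode: bool, latency_ms: float).
    Returns a dict keyed by True/False; a group with no samples maps to None.
    """
    groups = {True: [], False: []}
    for in_maintenance, latency_ms in samples:
        groups[in_maintenance].append(latency_ms)
    return {mode: (mean(values) if values else None) for mode, values in groups.items()}


# Hypothetical samples for one empty host and one datastore.
samples = [
    (True, 2.0),
    (True, 3.0),
    (False, 250000.0),
    (False, 700000.0),
    (True, 1.0),
]
result = latency_by_mode(samples)
print(result)
```

If the two averages differ by orders of magnitude on a host with no VMs, that points at something the host itself does when it is active in the cluster, not at guest I/O.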

Now we are in a blame loop - storage vendor blames vmware, vmware blames storage vendor.

VMware Support also just shrugs when I try to get an explanation of how rebooting a VM can cause the latency to calm down, as it surely shouldn't make a difference if the storage back end is to blame ....

So I hope someone here can give me some pointers, because right now we are out of ideas (and clearly so are the vendors)
My own knowledge base made public: http://open902.com :p

Comments

  • Options
    blargoeblargoe Member Posts: 4,174 ■■■■■■■■■□
    Just to be clear, the datastore in question is a shared datastore among all hosts, but only these two particular hosts that have no VMs are complaining about latency?

    Honestly, I would also post this over on the VMware VMTN forums, there are far more eyes and ears over there and quite a few official vExperts lurking who have 1000's of vSphere implementations collectively amongst them.
    IT guy since 12/00

    Recent: 11/2019 - RHCSA (RHEL 7); 2/2019 - Updated VCP to 6.5 (just a few days before VMware discontinued the re-cert policy...)
    Working on: RHCE/Ansible
    Future: Probably continued Red Hat Immersion, Possibly VCAP Design, or maybe a completely different path. Depends on job demands...
  • Options
    lsud00dlsud00d Member Posts: 1,571
    Have you tried to repeat the issue and capture the traffic with something like Wireshark? A pcap might give you a better idea of what's going on at the network layers, potentially giving you more grounds for blaming storage vs. VMware and making them fix it ;)
  • Options
    bighornsheepbighornsheep Member Posts: 1,506
    iSCSI storage or FC? Check the storage adapter config of the hosts to make sure your paths are active.
    Jack of all trades, master of none
  • Options
    jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    FC and yes, these are shared datastores and no other host is showing this issue. My worry is that if a host reboots it starts showing the same symptoms, considering that I even reinstalled one host and it started showing the latency as soon as I added it to the cluster.

    Mmmm... as soon as I added it to the cluster - makes me wonder .. I THINK I didn't see the issues while the host was on its own (with LUNs) and yes, I did post over at the VMware forums...
  • Options
    kj0kj0 Member Posts: 767
    So is there anything running on these hosts at the moment? are you able to reinstall ESXi and update?

    If no other hosts are showing this, but are connected, it sounds like something is corrupted or trying to access the datastore from the hypervisor. I could well be wrong and confused, but 15 minutes to reinstall would knock off another possibility.
    2017 Goals: VCP6-DCV | VCIX
    Blog: https://readysetvirtual.wordpress.com
  • Options
    jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    kj0 wrote: »
    So is there anything running on these hosts at the moment? are you able to reinstall ESXi and update? .
    jibbajabba wrote: »
    considering that I even reinstalled one host and it started showing the latency as soon as I added it to the cluster.
    jibbajabba wrote: »
    Well now the weirdest thing ... and to actually finally get to the point, we have two hosts .. EMPTY hosts .. no VMs, showing the same sort of device latency on ONE particular datastore. As soon as you put the hosts back into maintenance mode, the latency goes down to nothing

    10zzdc9.jpg

    ;)

    Thanks though ...
  • Options
    kj0kj0 Member Posts: 767
    jibbajabba wrote: »
    ;)

    Thanks though ...
    HAHA... I even read it a second time before posting and I still missed that line.

    The next step then would be Wireshark, I guess.
  • Options
    jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    Not sure Wireshark helps. Storage is purely FC and hosts are empty.
  • Options
    EV42TMANEV42TMAN Member Posts: 256
    Have you checked the FC SAN you're connecting to? You could have a dying drive or something.
    Current Certification Exam: ???
    Future Certifications: CCNP Route Switch, CCNA Datacenter, random vendor training.
  • Options
    TheProfTheProf Users Awaiting Email Confirmation Posts: 331 ■■■■□□□□□□
    Could it be something to do with HA datastore heartbeat on the cluster? maybe that's doing something weird? if you turn off HA, do you still see this issue?
  • Options
    blargoeblargoe Member Posts: 4,174 ■■■■■■■■■□
    Are the two problem hosts physically connected to the same FC switch or switch module, whereas the others perhaps are not? Same rack (sharing a FC patch panel perhaps) or anything else shared between only these two hosts that the rest may not have in common?
  • Options
    kj0kj0 Member Posts: 767
    jibbajabba wrote: »
    Not sure Wireshark helps. Storage is purely FC and hosts are empty.
    Dammit, That's me out now. HAHA, I'll take notes next time before replying with a suggestion - Been playing with iSCSI way too much lately.

    What about a dodgy cable or two or Controller?
  • Options
    jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    @ EV42TMAN
    Yea - checked all that. Storage vendor cannot see anything / no failed disk

    @TheProf
    Mmm... You might be onto something ... I removed the host from the cluster and put it by itself and like putting the host into maintenance mode, it seems to be fixing it.

    I might turn HA off tonight and see what happens.

    @blargoe
    These are blades so all hosts are connected to the same fabric. One thing we have scheduled for next week is rebooting one storage header (active / passive unfortunately) because we are also missing standby-paths

    @kj0
    These are blades so the cables themselves are connected to the blade switch and not to the host itself.
    So unfortunately that is not it either, otherwise we'd see the issues on the other hosts as well :(
  • Options
    EveryoneEveryone Member Posts: 1,661
    I don't know what support level you have with these vendors... but this is the kind of thing where you should have someone from all parties involved onsite. I've gone onsite to customers and also had VMware + Network vendor + Storage vendor there as well. Little less blame game, little more working together to resolve the issue for the customer. I would suggest SAN oversubscription as a possibility here, but typically you'd see performance problems across all hosts connected to the SAN on the same fabric if that were the case.
  • Options
    jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    TheProf wrote: »
    Could it be something to do with HA datastore heartbeat on the cluster? maybe that's doing something weird? if you turn off HA, do you still see this issue?

    Kudos Sir ... that did the trick ... I just chose different datastores for heartbeat (the one in question was indeed used for heartbeat) and bang .. all down now (latency that is, not the hosts) ...

    We still have other issues, but this was one concern of many .. which now seems to be solved.... which again, VMware blamed on our storage vendor ... Sigh ....

    Guess what time I made the change :)

    2vae49s.jpg

    @Everyone .. we got 24/7 Production Support (VSPP) .. Never even heard that VMware goes onsite ... ?!?
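For anyone hitting something similar, the check that would have saved us days is simply: is the datastore with the crazy latency also one of the HA heartbeat datastores? A minimal sketch of that cross-check follows. The datastore names, the latency figures and the 50 ms threshold are all assumptions for illustration, and you would have to fill in the two inputs from your own cluster settings and performance data.

```python
def suspect_heartbeat_datastores(heartbeat_datastores, latency_ms_by_datastore, threshold_ms=50.0):
    """Return heartbeat datastores whose observed latency exceeds the threshold.

    heartbeat_datastores: names of datastores currently used for HA heartbeating.
    latency_ms_by_datastore: observed device latency per datastore name, in ms.
    threshold_ms: assumed cut-off for "suspicious"; tune it for your environment.
    """
    return sorted(
        name
        for name in heartbeat_datastores
        if latency_ms_by_datastore.get(name, 0.0) > threshold_ms
    )


# Hypothetical inputs, not our real datastore names or measurements.
heartbeat = {"datastore-A", "datastore-B"}
latency = {"datastore-A": 475000.0, "datastore-B": 3.0, "datastore-C": 2.0}

suspects = suspect_heartbeat_datastores(heartbeat, latency)
print(suspects)
```

Any name that comes back is a candidate for being swapped out of the heartbeat selection, which is what fixed it here.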
  • Options
    TheProfTheProf Users Awaiting Email Confirmation Posts: 331 ■■■■□□□□□□
    jibbajabba wrote: »
    Kudos Sir ... that did the trick ... I just chose different datastores for heartbeat (the one in question was indeed used for heartbeat) and bang .. all down now (latency that is, not the hosts) ...

    We still have other issues, but this was one concern of many .. which now seems to be solved.... which again, VMware blamed on our storage vendor ... Sigh ....

    Guess what time I made the change :)

    2vae49s.jpg

    @Everyone .. we got 24/7 Production Support (VSPP) .. Never even heard that VMware goes onsite ... ?!?

    Good stuff! Glad you got it working! :)
  • Options
    EveryoneEveryone Member Posts: 1,661
    jibbajabba wrote: »
    @Everyone .. we got 24/7 Production Support (VSPP) .. Never even heard that VMware goes onsite ... ?!?
    They do, and I only know that because I’ve been onsite with a customer and also had my equivalent from VMware there. Sounds like you got it sorted out and maybe don’t need it now though. ;)
  • Options
    jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    Everyone wrote: »
    They do, and I only know that because I’ve been onsite with a customer and also had my equivalent from VMware there. Sounds like you got it sorted out and maybe don’t need it now though. ;)

    We still have a lot of issues, including the blame war. I was surprised I got them onto a conference call .. Then VMware said yet again it is a storage issue and once we were off the phone, we got an email from the storage vendor saying it's a VMware issue.

    What support level do you need to get the guys onsite ? I can imagine you get that by default with VCE / FlexPod ...

    Edit: Never mind : http://www.vmware.com/uk/support/services/mission-critical.html
  • Options
    EveryoneEveryone Member Posts: 1,661
    There you go, I would have had no idea. With Microsoft it's "Premier" support to get someone onsite. If you were running Hyper-V and had a CritSit like this and a Premier contract... you would have had someone onsite ASAP. I just knew VMware had a similar support offering for their products, and you seem to have found it. ;)
  • Options
    jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    Yea and it is only $50k+ / year :s
  • Options
    EveryoneEveryone Member Posts: 1,661
    For the VMware support? If they offer anything close to what MS does, there's way more value in it than you might think. It goes way beyond dropping in and saving the day when you have a critical outage. Lots of opportunities for proactive work and to learn from some of the best in the business on a specific technology.
  • Options
    jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    Everyone wrote: »
    For the VMware support? If they offer anything close to what MS does, there's way more value in it than you might think. It goes way beyond dropping in and saving the day when you have a critical outage. Lots of opportunities for proactive work and to learn from some of the best in the business on a specific technology.

    Oh I agree ... That's my view - just not that of the director with the cheque :)
  • Options
    jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    VMware Support never ceases to amaze me
    From: "VMware Technical Support"
    Sent: 16 October 2013 18:58
    To:
    Cc:
    Subject: VMware Support Request 1

    ** Please do not change the subject line of this email if you wish to respond. **

    Hello Michael,

    Another thought, if latency re-occurs, you could try disabling HA,
    VMware KB: Disabling VMware High Availability (HA)
    To disable VMware HA:
    1. In the vSphere Client, right-click the cluster and click Edit Settings.
    2. Deselect the Turn On VMware HA check box.
    3. Click OK.
    Note: The process may take some time to complete.
  • Options
    TheProfTheProf Users Awaiting Email Confirmation Posts: 331 ■■■■□□□□□□
    They're probably watching this thread :)
  • Options
    jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    TheProf wrote: »
    They're probably watching this thread :)

    Possibly ...
  • Options
    ElevenBravoElevenBravo Member Posts: 6 ■□□□□□□□□□
    Curious - What storage vendor are you using and what model is it?
  • Options
    jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    Curious - What storage vendor are you using and what model is it?

    Doesn't really matter now .. we will be replacing it within 6 weeks.