OK boys, put on your thinking caps for this one...

We have 2 SANs: a VNX5600 and a 5700. For a month we have been experiencing crippling disk latency in our VMware environment. Using SolarWinds VM manager we are able to see a pattern in the latency: it starts every day at 9:40 and finishes approx. 3 hrs later. Yesterday we found that the CPU on our fabric switch B was pegged at 100% and rebooted the switch. That fixed the latency for the rest of the day, but it started up again this morning at 9:40. We have paid experts from HP (blade enclosures), EMC and VMware to assess our environment, and everyone claims their piece of this puzzle is working fine. Does anyone have any ideas what could be happening? I can provide more info if needed.

Oh, and the strangest part of this: a LUN with retired VMs, VMs that are turned completely off, is the biggest offender for latency. How could that be?

EDIT: Forgot to mention, all scheduled backups have been disabled for 2+ weeks as part of troubleshooting this.

Comments

cruwl

Were there any CPU issues on the fabrics today when the issue came back?

d4nz1g

What processes managed to raise B's CPU? Maybe SNMP?

elTorito

What kind of workload is running on that VMware environment? Virtual desktops, servers, databases, or a mix of all? Do you observe an abnormal increase in IOPS during the bouts of increased latency? Going by your description, the occurrence of the problem is fairly predictable, so perhaps there is some scheduled event kicking off every day? Antivirus pattern updates, Windows updates, some housekeeping/defragmentation/maintenance task?

Latency spikes on the storage network can be a right b***h to troubleshoot, but I would start by trying to correlate that latency with a surge in I/O demand somewhere in the environment.
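
One rough way to do that correlation, assuming you can export latency and per-source IOPS samples taken at the same timestamps from your monitoring tool, is to rank each source by how closely its IOPS track the latency. This is only a sketch: the host names and numbers below are made up, and `pearson` and `rank_suspects` are hypothetical helpers, not part of any vendor tool.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a perfectly flat series cannot explain the spikes
    return cov / (sx * sy)

def rank_suspects(latency_ms, iops_by_source):
    """Sort I/O sources by how closely their IOPS follow the latency curve."""
    scores = {name: pearson(latency_ms, series)
              for name, series in iops_by_source.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up samples, one per polling interval, covering a latency spike.
latency_ms = [5, 6, 48, 52, 50, 7]
iops_by_source = {
    "host-a": [900, 950, 4100, 4300, 4200, 1000],     # spikes with the latency
    "host-b": [1200, 1150, 1180, 1220, 1190, 1210],   # roughly flat
}

ranked = rank_suspects(latency_ms, iops_by_source)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

A high score only says the source is busy at the same time as the latency, not that it causes it, so treat the top of the list as a place to start looking.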

BGraves

Any chance you have room to move the retired VMs off that LUN? If so, are you able to move each of them to a different LUN temporarily and see if the issue moves to a different LUN as well?

gespenstern

I would check all scheduled tasks first. Actually, such a pattern means the problem is already 75% solved. Check all the backups that might run in the background, optimization tasks, stats gathering and reporting functionality, all the crontabs, etc. I would turn off everything that's not clear and leave only those tasks that certainly start in a different time frame.
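
If you can dump the job inventory into a list of names and start times, a few lines are enough to flag everything that starts in or just before the 9:40 to 12:40 window. A minimal sketch, with made-up job names; `suspects` and the 15-minute lead margin are assumptions for illustration, and it does not handle windows that cross midnight.

```python
from datetime import time

def minutes(t):
    """Minutes since midnight for a datetime.time."""
    return t.hour * 60 + t.minute

def suspects(jobs, window_start, window_end, lead_minutes=15):
    """Return names of jobs starting inside the window or shortly before it."""
    lo = minutes(window_start) - lead_minutes
    hi = minutes(window_end)
    return [name for name, start in jobs if lo <= minutes(start) <= hi]

# Made-up inventory: (job name, daily start time).
jobs = [
    ("av-pattern-update", time(9, 30)),
    ("stats-report", time(2, 0)),
    ("defrag", time(10, 15)),
    ("log-rotate", time(23, 0)),
]

flagged = suspects(jobs, time(9, 40), time(12, 40))
print(flagged)
```

Anything flagged is a candidate to disable first; anything whose start time is unknown should be treated as flagged too.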

tdean

Hi guys, I will update more later. I've shut down all the ports on the B leg of that switch and I'm re-enabling 2 at a time every 5 minutes.

The environment is 2 clusters, 45 hosts, approx. 1100 VMs, all servers. We have SQL boxes on their own LUNs for the most part. No jobs running. Funny that it just started a month ago. I think it might be one of our monitoring software apps gone crazy. I will keep you up to date. Thank you for the responses.