How is this possible?

Trying to wrap my head around this problem. Hoping the combined intelligence at TE can help me out.

How can a SAN admin who logs into the management console on a regular basis not notice 3 disks failing over 4 months?

Comments

sratakhin

Checked my two SANs just in case. So far so good

veritas_libertas

This should be an interesting story/rant...

demonfurbie

Logging in and checking things out are totally different.

Some places check the logs to see logins but not activity.

webgeek

Either dumped exams or feels secure they won't fail. Either way, he's going to have a bad time.

QHalo

My storage notifies me if I have a disk failure. Not sure how you can miss it, honestly. If you don't have autosupport enabled or any notifications, well, then you'd have to be specifically looking at the disks, which in most cases you're not; you're only looking at the volumes on the device. That's about as much devil's advocate as I can manage. Get proper notifications set up, track your failures, and this is a non-issue. Sounds like laziness or ineptitude. Hot spares save lives, it looks like.

blargoe

Every SAN management application GUI I've ever used would indicate any failures on the first screen as soon as you log in, or have some kind of flag or something on some part of the window begging you to click on it.

RouteMyPacket

Laziness or incompetence. Most likely both!

Welcome to IT where the majority fit that exact description.

coffeeluvr

RouteMyPacket wrote: »

Laziness or incompetence. Most likely both!

Welcome to IT where the majority fit that exact description.

lmao!!!

Master Of Puppets

RouteMyPacket wrote: »

Laziness or incompetence. Most likely both!

Welcome to IT where the majority fit that exact description.

Sad but true! Although I'm just starting, I have already noticed it.

UnixGuy

It is possible because he is not doing his job

ptilsen

Just to play devil's advocate: how can there not be a reliable alerting system that proactively detects this? I've never made a point of checking disks on servers or SANs I've configured, because I would never willingly configure them without some sort of alerting system.

Don't get me wrong. If the SAN admin is clearly shirking, he should be fired. But whoever is in charge of this should be asking, "How can we make sure there is an alerting system that reliably detects failures? What will it take to audit or verify the system works?" I sincerely doubt the answer is manually checking the management console frequently. Most of the SAN equipment I've worked with could send SMTP notifications on failures, and would also respond to SNMP. It shouldn't be difficult to get it so the appropriate team is alerted of a disk failure.
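For what it's worth, the SMTP side of that can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: the disk list, slot numbers, status strings, and example.com addresses are all made up for illustration, and how disk state actually gets pulled from a given array is vendor-specific and not shown here.

```python
from email.message import EmailMessage
import smtplib


def failed_disks(disks):
    """Return the disks whose status is anything other than 'ok'."""
    return [d for d in disks if d["status"] != "ok"]


def build_alert(disks, sender, recipient):
    """Build an alert email for unhealthy disks, or return None if all are healthy."""
    bad = failed_disks(disks)
    if not bad:
        return None
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"SAN alert: {len(bad)} disk(s) not healthy"
    msg.set_content("\n".join(f"slot {d['slot']}: {d['status']}" for d in bad))
    return msg


def send_alert(msg, host="localhost", port=25):
    """Send the alert through an SMTP relay (host and port are placeholders)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


# Hypothetical inventory; a real script would query the array instead.
disks = [
    {"slot": 0, "status": "ok"},
    {"slot": 1, "status": "failed"},
    {"slot": 2, "status": "ok"},
]
alert = build_alert(disks, "san@example.com", "storage-team@example.com")
```

Calling `send_alert(alert)` would hand the message to whatever relay is configured. The point is that the check runs on a schedule and pushes to a team mailbox, so nobody has to remember to look.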

undomiel

I am with ptilsen on this. First line of defense should be proper monitoring/alerting setup on the SANs. But also to blargoe's point every admin software I've used would generally have some indicator of a failure on the first screen. The admin will be more alert from this point forward.

RouteMyPacket

undomiel wrote: »

I am with ptilsen on this. First line of defense should be proper monitoring/alerting setup on the SANs. But also to blargoe's point every admin software I've used would generally have some indicator of a failure on the first screen. The admin will be more alert from this point forward.

Yeah, but the problem you and ptilsen have is that you both think logically, like a System Administrator should. However, this is IT we are talking about. I have seen Sys Admins and Network Engineers work in an environment and not even have a monitoring system in place. I assume it's an "out of sight, out of mind" approach?

I would fire someone on the spot that even began to show such a lack of common sense. I am becoming more and more jaded the longer I work in this field. The positive is that it is easy to stand above these types.

I tell you what, I'm just gonna dive on the next person I hear say "Yeah, but it's been a while" when asked if they have experience with a technology.

"Yeah, but it's been a while" = I have absolutely no clue about said technology

VAHokie56

webgeek wrote: »

Either dumped exams or feels secure they won't fail. Either way, he's going to have a bad time.

"if you don't pizza you're gonna have a bad time"

lulz awesome... your comment made me go to YouTube to see that clip, which in turn took 1 hour of my time watching related videos... damn you interwebs!

About7Narwhal

I just want to go on record and state that one of our SANs had 2 drives fail and never alerted anyone. It passed daily HW checks for weeks before the vendor came out on routine maintenance and noticed it. Maybe the admin saw it, maybe not. But either way, the HW check designed by the vendor never found any problems.

QHalo

I hope there was some head smashing over that.

blargoe

undomiel wrote: »

I am with ptilsen on this. First line of defense should be proper monitoring/alerting setup on the SANs. But also to blargoe's point every admin software I've used would generally have some indicator of a failure on the first screen. The admin will be more alert from this point forward.

My point was more toward Dave's initial question - how could someone log in to the console every day, and know nothing about multiple disk failures? Yeah, there should have been alerting, but you should have noticed it every day when you logged in to the management console.

When Dave returns in 2 or 3 days, after he finishes rebuilding the VMware infrastructure for his customer that doesn't replace failed drives, hopefully he'll share his thoughts

dave330i

blargoe wrote: »

My point was more toward Dave's initial question - how could someone log in to the console every day, and know nothing about multiple disk failures? Yeah, there should have been alerting, but you should have noticed it every day when you logged in to the management console.

When Dave returns in 2 or 3 days, after he finishes rebuilding the VMware infrastructure for his customer that doesn't replace failed drives, hopefully he'll share his thoughts

That's my problem. The few SANs I've worked with had built-in alerts that showed on the management console. If it was a sudden failure, I can understand the admin not catching it, but it was an ongoing problem for a few months.

Fortunately there wasn't too much damage. It was a new site being built out, so there wasn't too much stuff running on the SAN.

tpatt100

Maybe the person wasn't actually doing their job? Years ago our "backup admin" (Tech who had backups as part of his duties) was documenting that he was testing the backups being performed. One day the tax department for the City had issues with their database and asked us to do a restore from the previous day. He did it and it failed. Ok, try the next one, nope that failed also.

We went back weeks and nothing worked, even though the log said he was testing them. The tech was fired soon after for something else, but the backup failure was the second-to-last failure on his part, so he was well on his way out the door. Major cluster. Soon after, our manager sent two of us to training and took a more active role in auditing roles and responsibilities.

sratakhin

We have 2 SANs from EqualLogic, and when you log in, it shows a bunch of management tasks but doesn't show the hard drives' health. To get there, you would have to click like 5 times. Frankly, we have never had any disk failures, so I'm not sure if something will pop up on the screen if it happens.

QHalo

I believe it only shows up in the Errors tray at the bottom. It's been a while since I had a failure in one of mine, but I'm pretty sure it doesn't come with 4th of July style warnings. Neither does my NetApp, but if I miss it, it doesn't really matter because it phones home, generates a ticket, and then I get a phone call asking when to send the replacement.

GAngel

As mentioned already it depends on the SAN model as much as the admin managing it.

3-4 failed disks in a large array may mean nothing to the system, and may not trip an alert at all if the trigger is defined as a percentage.
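To make the percentage point concrete, here is a small Python sketch comparing a percentage-based trigger with a simple count-based one. The array size (400 disks) and the 5% threshold are invented numbers for illustration, not settings from any particular SAN.

```python
def percent_trigger(failed, total, threshold_pct):
    """Fire only when the failed share of the array reaches threshold_pct."""
    return (failed / total) * 100 >= threshold_pct


def count_trigger(failed, threshold_count=1):
    """Fire as soon as the number of failed disks reaches threshold_count."""
    return failed >= threshold_count


# Hypothetical large array: 3 failed disks out of 400 is 0.75%,
# which stays under a 5% threshold, so the percentage trigger is silent.
large_array_pct = percent_trigger(3, 400, 5)

# The same 3 failures in a 24-disk shelf is 12.5%, which does fire.
small_array_pct = percent_trigger(3, 24, 5)

# A count-based trigger fires in both cases.
by_count = count_trigger(3)
```

With these made-up numbers, the large array never alerts on percentage alone, which is one way three failures could sit unnoticed for months.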

it_consultant

If we didn't have a Hitachi monitoring server, nobody here would notice a failing drive. When you have racks full of shelves with bezels on them it is hard to find the one drive in pre-failure.

tbhouston

RouteMyPacket wrote: »

Laziness or incompetence. Most likely both!

Welcome to IT where the majority fit that exact description.

Great quote, I love it, LOL. So many people are just following steps on a sheet without knowing what they are doing.
There are also so many people who see a problem and know they are the only one who would fix it, so they let someone else point it out at some point.

jibbajabba

Based on the responses here it sounds like the majority never missed anything / made mistakes / couldn't be arsed. Impressive

I would rather ask the question: who implemented the SAN, didn't set up mail / SNMP notifications, and ignored the failure alerts for that long?

southerne

I can't quite understand the point that "If we didn't have a Hitachi monitoring server, nobody here would notice a failing drive." Any explanation would be welcome.

southerne

Thanks for some very good and interesting stuff. I am glad to be here, as I enhanced my abilities and general understanding in this regard.

QHalo

jibbajabba wrote: »

Based on the responses here it sounds like the majority never missed anything / made mistakes / couldn't be arsed. Impressive

I would rather ask the question: who implemented the SAN, didn't set up mail / SNMP notifications, and ignored the failure alerts for that long?

I know I've made plenty of mistakes. However, you learn from them and apply appropriate corrections. Three missed drive failures is a pattern.

UnixGuy

jibbajabba wrote: »

Based on the responses here it sounds like the majority never missed anything / made mistakes / couldn't be arsed. Impressive

I would rather ask the question: who implemented the SAN, didn't set up mail / SNMP notifications, and ignored the failure alerts for that long?

Our management team expects the admin to log in to the SAN and servers every day and do a thorough health check. Having 3+ drive failures means that the admin simply wasn't checking the system.

it_consultant

southerne wrote: »

I can't quite understand the point that "If we didn't have a Hitachi monitoring server, nobody here would notice a failing drive." Any explanation would be welcome.

We have hundreds of drives, so we rely on the HITRAK server to alarm if there is a drive failure. This is actually a requirement to get Hitachi support; when there is a failure, the server automatically opens a service request with Hitachi.

If we didn't have a monitor, there is a very good possibility that a bad drive would go unnoticed. We rarely look at the individual disks in the DPs, and the physical disks are covered by a bezel, so you wouldn't necessarily see a bad drive LED.