How is this possible?
Trying to wrap my head around this problem. Hoping the combined intelligence at TE can help me out.
How can a SAN admin who logs into the management console on a regular basis not notice 3 disks failing over 4 months' time?
2018 Certification Goals: Maybe VMware Sales Cert
"Simplify, then add lightness" -Colin Chapman
Comments
-
demonfurbie Member Posts: 1,819 ■■■■■□□□□□
Logging in and checking things out are totally different.
Some places check the logs to see logins but not activity.
wgu undergrad: done ... woot!!
WGU MS IT Management: done ... double woot :cheers: -
webgeek Member Posts: 495 ■■■■□□□□□□
Either dumped exams or feels secure they won't fail. Either way, he's going to have a bad time.
BS in IT: Information Assurance and Security (Capella) CISSP, GIAC GSEC, Net+, A+
-
QHalo Member Posts: 1,488
My storage notifies me if I have a disk failure. Not sure how you can miss it, honestly. If you don't have autosupport or any notifications enabled, well, then you'd have to be specifically looking at the disks, which in most cases you're not; you're only looking at the volumes on the device. That's about as devil's advocate as I can get. Get proper notifications set up, track your failures, and this is a non-issue. Sounds like laziness or ineptitude. Hot spares save lives, it looks like.
-
blargoe Member Posts: 4,174 ■■■■■■■■■□
Every SAN management application GUI I've ever used would indicate any failures on the first screen as soon as you log in, or have some kind of flag or something on some part of the window begging you to click on it.
IT guy since 12/00
Recent: 11/2019 - RHCSA (RHEL 7); 2/2019 - Updated VCP to 6.5 (just a few days before VMware discontinued the re-cert policy...)
Working on: RHCE/Ansible
Future: Probably continued Red Hat Immersion, Possibly VCAP Design, or maybe a completely different path. Depends on job demands... -
RouteMyPacket Member Posts: 1,104
Laziness or Incompetence. Most likely both!
Welcome to IT where the majority fit that exact description.
Modularity and Design Simplicity:
Think of the 2:00 a.m. test—if you were awakened in the
middle of the night because of a network problem and had to figure out the
traffic flows in your network while you were half asleep, could you do it? -
coffeeluvr Member Posts: 734 ■■■■■□□□□□
RouteMyPacket wrote: »Laziness or Incompetence. Most likely both!
Welcome to IT where the majority fit that exact description.
lmao!!!
"Something feels funny, I must be thinking too hard. - Pooh" -
Master Of Puppets Member Posts: 1,210
RouteMyPacket wrote: »Laziness or Incompetence. Most likely both!
Welcome to IT where the majority fit that exact description.
Sad but true! Although I'm just starting, I have already noticed it.
Yes, I am a criminal. My crime is that of curiosity. My crime is that of judging people by what they say and think, not what they look like. My crime is that of outsmarting you, something that you will never forgive me for. -
ptilsen Member Posts: 2,835 ■■■■■■■■■■
Just to play devil's advocate: how can there not be a reliable alerting system that proactively detects this? I've never made a point of checking disks on servers or SANs I've configured, because I would never willingly configure them without some sort of alerting system.
Don't get me wrong. If the SAN admin is clearly shirking, he should be fired. But whoever is in charge of this should be asking, "How can we make sure there is an alerting system that reliably detects failures? What will it take to audit or verify that the system works?" I sincerely doubt the answer is manually checking the management console frequently. Most of the SAN equipment I've worked with could send SMTP notifications on failures, and would also respond to SNMP. It shouldn't be difficult to set things up so the appropriate team is alerted to a disk failure. -
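For anyone who wants to see the shape of the alerting ptilsen is describing, here is a minimal Python sketch. It is not any vendor's method: the disk IDs, the "online" state string, the array name, and the example.com addresses are all made up for illustration, and a real array would report disk state through its own API or SNMP MIB. The sketch only builds the alert message; actually sending it would need a reachable mail relay.

```python
from email.message import EmailMessage

# Hypothetical state string; real arrays use their own vocabulary for disk health.
HEALTHY = "online"


def failed_disks(disk_states):
    """Return the sorted IDs of disks whose state is anything other than healthy."""
    return sorted(disk for disk, state in disk_states.items() if state != HEALTHY)


def build_alert(array_name, failed, sender, recipient):
    """Build (but do not send) an email listing the failed disks."""
    msg = EmailMessage()
    msg["Subject"] = f"[{array_name}] {len(failed)} disk(s) need attention"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("Failed or degraded disks: " + ", ".join(failed))
    return msg


# Made-up poll result standing in for whatever the array actually reports.
states = {
    "shelf1-disk03": "failed",
    "shelf1-disk04": "online",
    "shelf2-disk11": "failed",
}

failed = failed_disks(states)
if failed:
    alert = build_alert("san01", failed, "san-alerts@example.com", "storage-team@example.com")
    print(alert["Subject"])
    # Sending is left out on purpose; with a mail relay available it would be
    # roughly: smtplib.SMTP("mail.example.com").send_message(alert)
```

Run on a schedule, something like this covers the "alert the appropriate team" half of the question. The "audit or verify that the system works" half still needs a periodic test alert, since a silent checker looks exactly like a healthy array.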
undomiel Member Posts: 2,818
I am with ptilsen on this. First line of defense should be proper monitoring/alerting setup on the SANs. But also to blargoe's point every admin software I've used would generally have some indicator of a failure on the first screen. The admin will be more alert from this point forward.
Jumping on the IT blogging band wagon -- http://www.jefferyland.com/
-
RouteMyPacket Member Posts: 1,104
undomiel wrote: »I am with ptilsen on this. First line of defense should be proper monitoring/alerting setup on the SANs. But also to blargoe's point every admin software I've used would generally have some indicator of a failure on the first screen. The admin will be more alert from this point forward.
Yeah, but the problem you and ptilsen have is that you both think logically, like a System Administrator should. However, this is IT we are talking about. I have seen Sys Admins and Network Engineers work in an environment and not even have a monitoring system in place. I assume it's an "out of sight, out of mind" approach?
I would fire someone on the spot who even began to show such a lack of common sense. I am becoming more and more jaded the longer I work in this field. The positive is that it is easy to stand above these types.
I tell you what, I'm just gonna dive on the next person I hear say "Yeah, but it's been awhile" when asked if they have experience with a technology.
"Yeah, but it's been awhile" = I have absolutely no clue about said technology
Modularity and Design Simplicity:
Think of the 2:00 a.m. test—if you were awakened in the
middle of the night because of a network problem and had to figure out the
traffic flows in your network while you were half asleep, could you do it? -
VAHokie56 Member Posts: 783
webgeek wrote: »Either dumped exams or feels secure they won't fail. Either way, he's going to have a bad time.
"if you don't pizza you're gonna have a bad time"
lulz awesome... your comment made me go to YouTube to see that clip and in turn took 1 hour of my time watching related videos.... dam you interwebs!
.ιlι..ιlι.
CISCO
"A flute without holes, is not a flute. A donut without a hole, is a Danish" - Ty Webb
Reading:NX-OS and Cisco Nexus Switching: Next-Generation Data Center Architectures -
About7Narwhal Member Posts: 761
I just want to go on record and state that one of our SANs had 2 drives fail and never alerted anyone. It passed daily HW checks for weeks before the vendor came out on routine maintenance and noticed it. Maybe the admin saw it, maybe not. But either way, the HW check designed by the vendor never found any problems.
-
blargoe Member Posts: 4,174 ■■■■■■■■■□
undomiel wrote: »I am with ptilsen on this. First line of defense should be proper monitoring/alerting setup on the SANs. But also to blargoe's point every admin software I've used would generally have some indicator of a failure on the first screen. The admin will be more alert from this point forward.
My point was more toward Dave's initial question - how could someone log in to the console every day, and know nothing about multiple disk failures? Yeah, there should have been alerting, but you should have noticed it every day when you logged in to the management console.
When Dave returns in 2 or 3 days, after he finishes rebuilding the VMware infrastructure for his customer that doesn't replace failed drives, hopefully he'll share his thoughts.
IT guy since 12/00
Recent: 11/2019 - RHCSA (RHEL 7); 2/2019 - Updated VCP to 6.5 (just a few days before VMware discontinued the re-cert policy...)
Working on: RHCE/Ansible
Future: Probably continued Red Hat Immersion, Possibly VCAP Design, or maybe a completely different path. Depends on job demands... -
dave330i Member Posts: 2,091 ■■■■■■■■■■
blargoe wrote: »My point was more toward Dave's initial question - how could someone log in to the console every day, and know nothing about multiple disk failures? Yeah, there should have been alerting, but you should have noticed it every day when you logged in to the management console.
When Dave returns in 2 or 3 days, after he finishes rebuilding the VMware infrastructure for his customer that doesn't replace failed drives, hopefully he'll share his thoughts
That's my problem. The few SANs I've worked with had built-in alerts that showed on the management console. If it was a sudden failure, I can understand the admin not catching it, but it was an ongoing problem for a few months.
Fortunately there wasn't too much damage. It was a new site being built out, so there wasn't too much stuff running on the SAN.
2018 Certification Goals: Maybe VMware Sales Cert
"Simplify, then add lightness" -Colin Chapman -
tpatt100 Member Posts: 2,991 ■■■■■■■■■□
Maybe the person wasn't actually doing their job? Years ago our "backup admin" (Tech who had backups as part of his duties) was documenting that he was testing the backups being performed. One day the tax department for the City had issues with their database and asked us to do a restore from the previous day. He did it and it failed. Ok, try the next one, nope that failed also.
We went back weeks and nothing worked except the log said he was testing them...... The tech was fired soon after for something else but the backup failure issue was the second to last failure on his part so he was well on his way out the door. Major cluster, soon after our manager sent two of us to training and took a more active role in auditing roles and responsibilities. -
sratakhin Member Posts: 818
We have 2 SANs from EqualLogic, and when you log in, it shows a bunch of management tasks but doesn't show the hard drives' health. To get there, you would have to click like 5 times. Frankly, we have never had any disk failures, so I'm not sure if something will pop up on the screen if it happens.
-
QHalo Member Posts: 1,488
I believe it only shows up in the Errors tray at the bottom. It's been a while since I had a failure in one of mine, but I'm pretty sure it doesn't come with 4th of July style warnings. Neither does my NetApp, but if I miss it, it doesn't really matter, because it phones home and generates a ticket, and then I get a phone call asking when to send the replacement.
-
GAngel Member Posts: 708 ■■■■□□□□□□
As mentioned already, it depends on the SAN model as much as the admin managing it.
3-4 disks in a large array means nothing and may not trip a trigger if the trigger is set by %. -
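To put rough numbers on GAngel's point, here is a small Python sketch comparing a percentage-based trigger with a plain count-based one. The array size (480 disks) and the 1% threshold are invented for illustration and do not come from any particular SAN.

```python
def pct_trigger(failed, total, threshold_pct):
    """Fire only when the share of failed disks reaches threshold_pct percent."""
    return 100.0 * failed / total >= threshold_pct


def count_trigger(failed, threshold_count=1):
    """Fire as soon as the number of failed disks reaches threshold_count."""
    return failed >= threshold_count


# Hypothetical array: 3 failed disks out of 480 is 0.625%, under a 1% threshold.
total_disks = 480
failed_count = 3

pct_fired = pct_trigger(failed_count, total_disks, threshold_pct=1.0)
count_fired = count_trigger(failed_count)

print(pct_fired)    # False
print(count_fired)  # True
```

Under these assumed numbers the percentage trigger stays quiet for all three failures, while a count-based trigger would have fired on the first one.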
it_consultant Member Posts: 1,903
If we didn't have a Hitachi monitoring server, nobody here would notice a failing drive. When you have racks full of shelves with bezels on them, it is hard to find the one drive in pre-failure.
-
tbhouston Member Posts: 32 ■■□□□□□□□□
RouteMyPacket wrote: »Laziness or Incompetence. Most likely both!
Welcome to IT where the majority fit that exact description.
Great quote, I love it, LOL.. so many people just following steps on a sheet and not knowing what they are doing.
Also so many people who see a problem and know they are the only one who would fix it, so they let someone else point it out at some point. -
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
Based on the responses here it sounds like the majority never missed anything / made mistakes / couldn't be arsed. Impressive.
I would rather ask the question: who implemented the SAN, didn't set up mail / SNMP notifications, and ignored the failure alerts for that long?
My own knowledge base made public: http://open902.com -
southerne Member Posts: 5 ■□□□□□□□□□
I can't understand the point very well that "If we didn't have a Hitachi monitoring server, nobody here would notice a failing drive." Any explanation would be welcome.
trademark attorneys -
southerne Member Posts: 5 ■□□□□□□□□□
Thanks for some very good and interesting stuff. I am glad to be here, as it has improved my abilities and general understanding in this regard.
-
QHalo Member Posts: 1,488
jibbajabba wrote: »Based on the responses here it sounds like the majority never missed anything / made mistakes / couldn't be arsed. Impressive
I would rather ask the question - who implemented the SAN, didn't setup mail / SNMP notifications and ignored the failure alerts for that long
I know I've made plenty of mistakes. However, you learn from them and apply appropriate corrections. Three missed drive failures is a pattern. -
UnixGuy Mod Posts: 4,570 Mod
jibbajabba wrote: »Based on the responses here it sounds like the majority never missed anything / made mistakes / couldn't be arsed. Impressive
I would rather ask the question - who implemented the SAN, didn't setup mail / SNMP notifications and ignored the failure alerts for that long
Our management team expects the admin to log in to the SAN and servers every day and do a thorough health check. Having 3+ drive failures means that the admin simply wasn't checking the system. -
it_consultant Member Posts: 1,903
southerne wrote: »I can't understand the point ver well that If we didn't have a Hitachi monitoring server, nobody here would notice a failing drive.Anyone with some explanation will be welcomed.
We have hundreds of drives, so we rely on the Hi-Track server to alarm if there is a drive failure. This is actually a requirement to get Hitachi support: when there is a failure, the server automatically opens a service request with Hitachi.
If we didn't have a monitor, there is a very good possibility that a bad drive would go unnoticed. We rarely look at the individual disks in the DPs, and the physical disks are covered by a bezel, so you wouldn't necessarily see a bad drive LED.