How is this possible?

dave330i Member Posts: 2,091 ■■■■■■■■■■
Trying to wrap my head around this problem. Hoping the combined intelligence at TE can help me out.

How can a SAN admin who logs into the management console on a regular basis not notice 3 disks failing over 4 months' time?
2018 Certification Goals: Maybe VMware Sales Cert
"Simplify, then add lightness" -Colin Chapman

Comments

  • sratakhin Member Posts: 818
    Checked my two SANs just in case. So far so good :)
  • veritas_libertas CISSP, GIAC x5, CompTIA x5 Greenville, SC USA Member Posts: 5,738 ■■■■■■■■■■
    This should be an interesting story/rant... ;)
    Currently working on: Linux and Python
  • demonfurbie Member Posts: 1,819
    logging in and checking things out are totally different

    some places check the logs to see logins but not activity
    wgu undergrad: done ... woot!!
    WGU MS IT Management: done ... double woot :cheers:
  • webgeek Member Posts: 495
    Either dumped exams or feels secure they won't fail. Either way, he's going to have a bad time.
    BS in IT: Information Assurance and Security (Capella) ETA 2013/Early 2014
    2013 Goals: CISSP [:cheers:] ITIL Foundations [ ] Project+ [ ] Linux+ [ ] CCNA (Maybe) [ ]
  • QHalo Member Posts: 1,488
    My storage notifies me if I have a disk failure. Not sure how you can miss it, honestly. If you don't have autosupport enabled or any notifications, well, then you'd have to be specifically looking at the disks, which in most cases you're not; you're only looking at the volumes on the device. That's about as devil's advocate as I can get. Get proper notifications set up, track your failures, and this is a non-issue. Sounds like laziness or ineptitude. Hot spares save lives, it looks like.
  • blargoe Self-Described Huguenot NC, USA Member Posts: 4,174 ■■■■■■■■■□
    Every SAN management application GUI I've ever used would indicate any failures on the first screen as soon as you log in, or have some kind of flag or something on some part of the window begging you to click on it.
    IT guy since 12/00

    Recent: 11/2019 - RHCSA (RHEL 7); 2/2019 - Updated VCP to 6.5 (just a few days before VMware discontinued the re-cert policy...)
    Working on: RHCE/Ansible
    Future: Probably continued Red Hat Immersion, Possibly VCAP Design, or maybe a completely different path. Depends on job demands...
  • RouteMyPacket Member Posts: 1,104
    Laziness or incompetence. Most likely both!

    Welcome to IT where the majority fit that exact description.
    Modularity and Design Simplicity:

    Think of the 2:00 a.m. test—if you were awakened in the
    middle of the night because of a network problem and had to figure out the
    traffic flows in your network while you were half asleep, could you do it?
  • coffeeluvr Senior Member NC Member Posts: 734 ■■■■■□□□□□
    Laziness or incompetence. Most likely both!

    Welcome to IT where the majority fit that exact description.

    lmao!!!
    "Something feels funny, I must be thinking too hard. - Pooh"
  • Master Of Puppets Member Posts: 1,210
    Laziness or incompetence. Most likely both!

    Welcome to IT where the majority fit that exact description.

    Sad but true! Although I'm just starting, I have already noticed it :D
    Yes, I am a criminal. My crime is that of curiosity. My crime is that of judging people by what they say and think, not what they look like. My crime is that of outsmarting you, something that you will never forgive me for.
  • UnixGuy Are we having fun yet? Mod Posts: 4,282 Mod
    It is possible because he is not doing his job
    Certs: GPEN, GCFA, CISM, CRISC, RHCE
    In Progress: MBA
  • ptilsen Member Posts: 2,835 ■■■■■■■■■■
    Just to play devil's advocate: how can there not be a reliable alerting system that proactively detects this? I've never made a point of checking disks on servers or SANs I've configured, because I would never willingly configure them without some sort of alerting system.

    Don't get me wrong. If the SAN admin is clearly shirking, he should be fired. But whoever is in charge of this should be asking "How can we make sure there is an alerting system that reliably detects failures? What will it take to audit or verify the system works?" I sincerely doubt the answer is manually checking the management console frequently. Most of the SAN equipment I've worked with could send SMTP notifications on failures, and would also respond to SNMP. It shouldn't be difficult to get it so the appropriate team is alerted of a disk failure.
    Working B.S., Computer Science
    Complete: 55/120 credits SPAN 201, LIT 100, ETHS 200, AP Lang, MATH 120, WRIT 231, ICS 140, MATH 215, ECON 202, ECON 201, ICS 141, MATH 210, LING 111, ICS 240
    In progress: CLEP US GOV,
    Next up: MATH 211, ECON 352, ICS 340
  • undomiel Member Posts: 2,818
    I am with ptilsen on this. First line of defense should be proper monitoring/alerting setup on the SANs. But also to blargoe's point every admin software I've used would generally have some indicator of a failure on the first screen. The admin will be more alert from this point forward.
    Jumping on the IT blogging band wagon -- http://www.jefferyland.com/
  • RouteMyPacket Member Posts: 1,104
    undomiel wrote: »
    I am with ptilsen on this. First line of defense should be proper monitoring/alerting setup on the SANs. But also to blargoe's point every admin software I've used would generally have some indicator of a failure on the first screen. The admin will be more alert from this point forward.


    Yeah but the problem you and ptilsen have is that you both think logically like a System Administrator should. However, this is IT we are talking about. I have seen Sys Admins, Network Engineers work in an environment and not even have a monitoring system in place. I assume it's an "Out of sight out of mind" approach?

    I would fire someone on the spot that even began to show such a lack of common sense. I am becoming more and more jaded the longer I work in this field. The positive is that it is easy to stand above these types.

    I tell you what, I'm just gonna dive on the next person I hear say "Yeah, but it's been awhile" when asked if they have experience with a technology.

    "Yeah, but it's been awhile" = I have absolutely no clue about said technology
    Modularity and Design Simplicity:

    Think of the 2:00 a.m. test—if you were awakened in the
    middle of the night because of a network problem and had to figure out the
    traffic flows in your network while you were half asleep, could you do it?
  • VAHokie56 Member Posts: 783
    webgeek wrote: »
    Either dumped exams or feels secure they won't fail. Either way, he's going to have a bad time.


    "if you don't pizza you're gonna have a bad time"

    lulz awesome... your comment made me go to YouTube to see that clip, which in turn took 1 hour of my time watching related videos... damn you interwebs!
    .ιlι..ιlι.
    CISCO
    "A flute without holes, is not a flute. A donut without a hole, is a Danish" - Ty Webb
    Reading:NX-OS and Cisco Nexus Switching: Next-Generation Data Center Architectures
  • About7Narwhal Member Posts: 761
    I just want to go on record and state that one of our SANs had 2 drives fail and never alerted anyone. It passed daily HW checks for weeks before the vendor came out on routine maintenance and noticed it. Maybe the admin saw it, maybe not. But either way, the HW check designed by the vendor never found any problems.
  • QHalo Member Posts: 1,488
    I hope there was some head smashing over that.
  • blargoe Self-Described Huguenot NC, USA Member Posts: 4,174 ■■■■■■■■■□
    undomiel wrote: »
    I am with ptilsen on this. First line of defense should be proper monitoring/alerting setup on the SANs. But also to blargoe's point every admin software I've used would generally have some indicator of a failure on the first screen. The admin will be more alert from this point forward.

    My point was more toward Dave's initial question - how could someone log in to the console every day, and know nothing about multiple disk failures? Yeah, there should have been alerting, but you should have noticed it every day when you logged in to the management console.

    When Dave returns in 2 or 3 days, after he finishes rebuilding the VMware infrastructure for his customer that doesn't replace failed drives, hopefully he'll share his thoughts :)
    IT guy since 12/00

    Recent: 11/2019 - RHCSA (RHEL 7); 2/2019 - Updated VCP to 6.5 (just a few days before VMware discontinued the re-cert policy...)
    Working on: RHCE/Ansible
    Future: Probably continued Red Hat Immersion, Possibly VCAP Design, or maybe a completely different path. Depends on job demands...
  • dave330i Member Posts: 2,091 ■■■■■■■■■■
    blargoe wrote: »
    My point was more toward Dave's initial question - how could someone log in to the console every day, and know nothing about multiple disk failures? Yeah, there should have been alerting, but you should have noticed it every day when you logged in to the management console.

    When Dave returns in 2 or 3 days, after he finishes rebuilding the VMware infrastructure for his customer that doesn't replace failed drives, hopefully he'll share his thoughts :)

    That's my problem. The few SANs I've worked with had built-in alerts that showed on the management console. If it was a sudden failure, I can understand the admin not catching it, but it was an ongoing problem for a few months.

    Fortunately there wasn't too much damage. It was a new site being built out, so there wasn't too much stuff running on the SAN.
    2018 Certification Goals: Maybe VMware Sales Cert
    "Simplify, then add lightness" -Colin Chapman
  • tpatt100 Member Posts: 2,991 ■■■■■■■■■□
    Maybe the person wasn't actually doing their job? Years ago our "backup admin" (Tech who had backups as part of his duties) was documenting that he was testing the backups being performed. One day the tax department for the City had issues with their database and asked us to do a restore from the previous day. He did it and it failed. Ok, try the next one, nope that failed also.

    We went back weeks and nothing worked, yet the log said he was testing them... The tech was fired soon after for something else, but the backup failure issue was the second to last failure on his part, so he was well on his way out the door. Major cluster. Soon after, our manager sent two of us to training and took a more active role in auditing roles and responsibilities.
  • sratakhin Member Posts: 818
    We have 2 SANs from EqualLogic and when you log in, it shows a bunch of management tasks, but doesn't show the hard drives' health. To get there, you would have to click like 5 times. Frankly, we never had any disk failures, so I'm not sure if something will pop up on the screen if it happens.
  • QHalo Member Posts: 1,488
    I believe it only shows up in the Errors tray at the bottom. It's been a while since I had a failure in one of mine, but I'm pretty sure it doesn't come with 4th of July style warnings. Neither does my NetApp, but if I miss it, it doesn't really matter because it phones home, generates a ticket, and then I get a phone call asking when to send the replacement.
  • GAngel Member Posts: 708
    As mentioned already it depends on the SAN model as much as the admin managing it.


    3-4 failed disks in a large array mean nothing, and may not trip an alert at all if the trigger is set as a percentage of disks.
  • it_consultant Member Posts: 1,903
    If we didn't have a Hitachi monitoring server, nobody here would notice a failing drive. When you have racks full of shelves with bezels on them it is hard to find the one drive in pre-failure.
  • tbhouston Member Posts: 32 ■■□□□□□□□□
    Laziness or incompetence. Most likely both!

    Welcome to IT where the majority fit that exact description.

    Great quote, I love it, LOL... so many people just following steps on a sheet and not knowing what they are doing
    also so many people who see a problem and know they are the only one who would fix it, so they let someone else point it out at some point
  • jibbajabba Member Posts: 4,317 ■■■■■■■■□□
    Based on the responses here it sounds like the majority never missed anything / made mistakes / couldn't be arsed. Impressive :)

    I would rather ask the question - who implemented the SAN, didn't set up mail / SNMP notifications and ignored the failure alerts for that long :)
    My own knowledge base made public: http://open902.com :p
  • southerne Member Posts: 5 ■□□□□□□□□□
    I don't quite understand the point that if we didn't have a Hitachi monitoring server, nobody here would notice a failing drive. Any explanation would be welcome.


  • southerne Member Posts: 5 ■□□□□□□□□□
    Thanks for some very good and interesting stuff. I am glad to be here, as it has improved my abilities and general understanding in this regard.
  • QHalo Member Posts: 1,488
    jibbajabba wrote: »
    Based on the responses here it sounds like the majority never missed anything / made mistakes / couldn't be arsed. Impressive :)

    I would rather ask the question - who implemented the SAN, didn't set up mail / SNMP notifications and ignored the failure alerts for that long :)

    I know I've made plenty of mistakes. However, you learn from them and apply appropriate corrections. Three missed drive failures is a pattern.
  • UnixGuy Are we having fun yet? Mod Posts: 4,282 Mod
    jibbajabba wrote: »
    Based on the responses here it sounds like the majority never missed anything / made mistakes / couldn't be arsed. Impressive :)

    I would rather ask the question - who implemented the SAN, didn't set up mail / SNMP notifications and ignored the failure alerts for that long :)


    Our management team expects the admin to log in to the SAN and servers every day and do a thorough health check. Having 3+ drive failures go unnoticed means that the admin simply wasn't checking the system.
    Certs: GPEN, GCFA, CISM, CRISC, RHCE
    In Progress: MBA
  • it_consultant Member Posts: 1,903
    southerne wrote: »
    I don't quite understand the point that if we didn't have a Hitachi monitoring server, nobody here would notice a failing drive. Any explanation would be welcome.

    We have hundreds of drives, so we rely on the HITRAK server to raise an alarm if there is a drive failure. This is actually a requirement to get Hitachi support: when there is a failure, the server automatically opens a service request with Hitachi.

    If we didn't have a monitor then there is a very good possibility that a bad drive would go unnoticed. We rarely look at the individual disks in the DP's and the physical disks are covered by a bezel so you wouldn't necessarily see a bad drive LED.
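
Two of the points raised in this thread can be put side by side in a rough sketch: GAngel's remark that a percentage-based trigger may never fire for 3-4 disks in a large array, and the mail notification that ptilsen and jibbajabba ask about. Everything named here is hypothetical and made up for illustration (the `Array` class, both trigger functions, the 5% threshold, the 240-disk array, the example.com addresses); none of it comes from any SAN vendor's tooling.

```python
from dataclasses import dataclass
from email.message import EmailMessage


@dataclass
class Array:
    """Hypothetical snapshot of one storage array's disk health."""
    name: str
    total_disks: int
    failed_disks: int


def percent_trigger(array: Array, threshold_pct: float) -> bool:
    """Alert only when the failed share of disks reaches threshold_pct."""
    return 100.0 * array.failed_disks / array.total_disks >= threshold_pct


def count_trigger(array: Array, max_failed: int = 0) -> bool:
    """Alert as soon as more than max_failed disks have failed."""
    return array.failed_disks > max_failed


def build_alert(array: Array, sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) a notification mail for the storage team."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[SAN] {array.name}: {array.failed_disks} failed disk(s)"
    msg.set_content(
        f"{array.failed_disks} of {array.total_disks} disks on {array.name}\n"
        "are reported as failed.\n"
    )
    return msg


if __name__ == "__main__":
    big = Array(name="san01", total_disks=240, failed_disks=3)
    # 3 of 240 disks is 1.25%, so an assumed 5% trigger stays silent.
    print("percent trigger fires:", percent_trigger(big, threshold_pct=5.0))
    # A count-based trigger fires on the first failed disk.
    print("count trigger fires:", count_trigger(big))
    if count_trigger(big):
        alert = build_alert(big, "san-monitor@example.com", "storage-team@example.com")
        print(alert["Subject"])
```

Actually delivering the message would mean handing it to `smtplib.SMTP(...).send_message(msg)` against whatever mail relay the site uses; that part is left out so the sketch runs without a server. The design point is simply that a per-disk count trigger catches what a percentage trigger on a large array can hide.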