
Sysadmins: How would you have dealt with this critical server outage?

Everyone,

I'd like to get everyone's opinion. It's hard to get feedback at my job, so I'm asking here.

An app server went down at 5:30 AM, and this is how I handled it. Keep in mind I'm a Jr. Sysadmin, so obviously I'm not the most experienced.

- Tried pinging: Nothing
- Tried RDP with host name and IP address: Nothing
- Tried going in with Labtech: Nothing
- Tried going in through vSphere: Black screen
- Tried sending multiple CTRL+ALT+DEL requests to the server through console: Nothing
- Tried connecting remotely through remote manager: Nothing
- Powered down VM and brought it back up: Came back up just fine.
- Went through Event Viewer. Only saw one error at the time of the crash: Event ID 6008.
- Searched and saw on Microsoft's website that this is due to the server locking up. There is a hotfix for this. I will speak with the team first.
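
The first few reachability checks above could be scripted so they run the same way every time. This is only a minimal sketch: the function names are made up for illustration, it covers TCP probes only (not ICMP ping, Labtech, or the vSphere console), and the ports are the usual defaults for RDP (3389) and SMB (445), which may not match your environment.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def triage(host: str, ports: dict[str, int], timeout: float = 3.0) -> dict[str, bool]:
    """Probe each named service port and report which ones answer."""
    return {name: tcp_port_open(host, port, timeout) for name, port in ports.items()}


def looks_hung(results: dict[str, bool]) -> bool:
    """Every probe failing points to a down or hung guest rather than one broken service."""
    return bool(results) and not any(results.values())
```

For example, `triage("app01.example.local", {"rdp": 3389, "smb": 445})` (a hypothetical host name) returns a dictionary of results, and `looks_hung` on that dictionary tells you whether it is time to move on to the console.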


So, how did I do? Did I handle this correctly?

Comments

  • OctalDump Member Posts: 1,722
    Was there any documentation for how to handle this? Going by the book comes first. Sometimes there are specific procedures already in place, and sometimes certain boxes shouldn't be restarted or have dependencies that need to be looked after.

    Without knowing anything apart from what you wrote, I would probably have done much the same thing, although I'd likely have jumped ahead to the vSphere console after RDP failed. I'd also double-check the vSphere metrics to see if the server is actually doing anything (like high CPU, network, or disk activity).

    But it would probably be the same outcome: restart, check Event Viewer, escalate back to the team. Other options might be to increase logging levels and ensure a kernel dump is kept after a crash.

    It's also worth cross-correlating other logs, from VMware, the network, whatever. But all that depends on how important this server is.
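
    Cross-correlating can be as simple as merging events from each source into one timeline and looking at the minutes before the crash. A rough sketch, assuming you have already exported the events into Python objects (the `LogEvent` type and the function name are made up for illustration, and the 15-minute window is an arbitrary default):

```python
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple


class LogEvent(NamedTuple):
    timestamp: datetime
    source: str   # e.g. "windows", "vmware", "network"
    message: str


def events_before_crash(
    sources: Iterable[Iterable[LogEvent]],
    crash_time: datetime,
    window: timedelta = timedelta(minutes=15),
) -> list[LogEvent]:
    """Merge events from all sources and keep those in the window before the crash, oldest first."""
    start = crash_time - window
    merged = [
        event
        for events in sources
        for event in events
        if start <= event.timestamp <= crash_time
    ]
    return sorted(merged, key=lambda event: event.timestamp)
```

    One thing to keep in mind on the Windows side: Event ID 6008 is written at the next boot to record that the previous shutdown was unexpected, so a hard power-off of the VM will produce one by itself. That entry alone may not point to a root cause.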

    Restarting (in some form) is probably 90% of fixes.
    2017 Goals - Something Cisco, Something Linux, Agile PM
  • systemstech Member Posts: 120
    If there was anything like high CPU, network, or disk, we would have gotten a critical alert for that. We have it all set up in Opsview. And no, there's no handbook for this type of stuff.

    Thanks for the response. I've never been in a situation where the console is just a black screen. Usually there's some sort of error message.

    What do you mean by checking other logs in VMware?
  • dave330i Member Posts: 2,091
    VM HA should have been enabled. Then the crashed VM would have been restarted by vSphere.
    2018 Certification Goals: Maybe VMware Sales Cert
    "Simplify, then add lightness" -Colin Chapman
  • systemstech Member Posts: 120
    What's VM HA?
  • OctalDump Member Posts: 1,722
    What's VM HA?

    Pretty much what you would imagine. VMware checks the guest OS for a heartbeat; when there is no heartbeat, it can shock the OS back to life (a virtual power cycle). It's crash-consistent recovery, but still recovery.

    https://www.vmware.com/au/products/vsphere/features/high-availability

    It costs money since it isn't in all license levels for VMware. And there are reasons why you might not implement it.

    If you want really cool, then check out Fault Tolerance:

    vSphere Fault Tolerance: VMware | VMware United Kingdom

    Although that probably wouldn't have helped in this instance.
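
    To make the heartbeat idea concrete, here is a toy watchdog in Python. It is not how vSphere HA is implemented (the real feature relies on VMware Tools heartbeats and has its own settings); the class name, the `reset` callback, and the timeout are all invented for illustration.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Trigger a reset when no heartbeat has been seen for timeout_s seconds."""

    def __init__(
        self,
        timeout_s: float,
        reset: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s
        self.reset = reset          # hypothetical hook, e.g. "power-cycle the VM"
        self.clock = clock
        self.last_beat = clock()
        self.resets = 0

    def beat(self) -> None:
        """Record a heartbeat from the guest."""
        self.last_beat = self.clock()

    def check(self) -> bool:
        """Reset the guest if the heartbeat is overdue; return True if a reset was triggered."""
        if self.clock() - self.last_beat > self.timeout_s:
            self.reset()
            self.resets += 1
            # Restart the timer so the guest gets time to boot before the next check.
            self.last_beat = self.clock()
            return True
        return False
```

    The monitor calls `check()` on a schedule while the guest calls `beat()`. A hung guest stops beating, the timeout passes, and the reset fires without anyone getting paged at 5:30 AM.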
  • OctalDump Member Posts: 1,722
    What do you mean by checking other logs in VMware?

    I mean checking the utilisation history, events, anything that might show something unusual prior to the crash. Which reminds me: VMware also has Ops Manager, which can be useful for alerts, notifications, etc. in these kinds of circumstances. It does automatic baselines and can report on anomalies. Again, it costs extra, but if you have enough VMs it can be useful.
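
    As a very rough illustration of what "baseline plus anomaly report" means, the sketch below flags samples that sit far from the average of an initial baseline period. This is a generic standard-deviation check, not VMware's algorithm; the function name, the baseline size, and the threshold are arbitrary choices.

```python
from statistics import mean, stdev


def anomalies(samples: list[float], baseline_size: int = 12, threshold: float = 3.0) -> list[int]:
    """Return indexes of later samples more than `threshold` standard deviations from the baseline mean."""
    baseline = samples[:baseline_size]
    if len(baseline) < 2:
        return []  # not enough data to compute a spread
    mu = mean(baseline)
    sigma = stdev(baseline)
    later = list(enumerate(samples[baseline_size:], start=baseline_size))
    if sigma == 0:
        # A perfectly flat baseline: any different value counts as unusual.
        return [i for i, x in later if x != mu]
    return [i for i, x in later if abs(x - mu) / sigma > threshold]
```

    Fed with something like CPU percentages sampled every five minutes, it would point at the samples worth looking at before the crash.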
  • techfiend Member Posts: 1,481
    I think you handled it fine. With a down server there's really only one thing that can be done. Did you check the vSphere logs? Some clues might be in there. That might be the only thing you didn't mention that someone might get on you about, but that's how we improve.

    Virtualization saved a lot of hassle this time. Is it Server 2008 or 2012?
    2018 AWS Solutions Architect - Associate (Apr) 2017 VCAP6-DCV Deploy (Oct) 2016 Storage+ (Jan)
    2015 Start WGU (Feb) Net+ (Feb) Sec+ (Mar) Project+ (Apr) Other WGU (Jun) CCENT (Jul) CCNA (Aug) CCNA Security (Aug) MCP 2012 (Sep) MCSA 2012 (Oct) Linux+ (Nov) Capstone/BS (Nov) VCP6-DCV (Dec) ITILF (Dec)