Sysadmins: How would you have dealt with this server critical?

systemstech · September 2015

Everyone,

I'd like to get everyone's opinion. It's hard to get feedback at my job, so I'd like to ask here.

An app server went down at 5:30 AM, and this is how I handled it. Keep in mind that I'm a Jr. Sysadmin, so obviously I'm not the most experienced.

- Tried pinging: Nothing
- Tried RDP with host name and IP address: Nothing
- Tried going in with Labtech: Nothing
- Tried going in through vSphere: Black screen
- Tried sending multiple CTRL+ALT+DEL requests to the server through console: Nothing
- Tried connecting remotely through remote manager: Nothing
- Powered down VM and brought it back up: Came back up just fine.
- Went through Event Viewer. Only saw one error at the time of the crash: Event ID 6008.
- Searched and saw on Microsoft's website that this can be due to the server locking up. There is a hotfix for this, but I will speak with the team first.
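
The first few reachability checks in the list above could be scripted so they run the same way every time. This is only a minimal sketch in Python, assuming the server normally listens on the standard RDP port (3389); the function names and the port list are illustrative, not anything from the environment described here.

```python
import socket
from typing import Dict


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def triage(host: str, ports: Dict[str, int]) -> Dict[str, bool]:
    """Check each named service port and report which ones answer."""
    return {name: tcp_port_open(host, port) for name, port in ports.items()}
```

Calling something like `triage("appserver01", {"RDP": 3389, "SMB": 445})` (hostname is a placeholder) returns a dict of which ports answered. If everything comes back `False`, that points toward a hung guest or a network problem rather than a single failed service.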

So, how did I do? Did I handle this correctly?

OctalDump · September 2015

Was there any documentation for how to handle this? Going by the book comes first. Sometimes there are specific procedures already in place. Sometimes certain boxes shouldn't be restarted, or have dependencies that need to be looked after, etc.

Without knowing anything else apart from what you wrote, I would probably have done much the same thing, although I'd likely have jumped ahead to the vSphere console after RDP failed. I'd also double-check the vSphere metrics to see if the server is actually doing anything (like high CPU, network, or disk).

But probably the same outcome: restart, check Event Viewer, escalate back to the team. Other options might be to increase logging levels and ensure a kernel dump is being kept after a crash.

Also, it's worth cross-correlating other logs, from VMware, the network, whatever. But all that depends on how important this server is.
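
As a rough illustration of what cross-correlating means in practice: collect events from each source, then keep only the ones that fall close to the crash time. The sketch below assumes the events have already been exported and parsed into timestamped records; the `LogEvent` type, the `events_near` helper, and the 15-minute window are all hypothetical choices.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class LogEvent(NamedTuple):
    timestamp: datetime
    source: str   # e.g. "windows", "vmware", "network" (labels are up to you)
    message: str


def events_near(events: List[LogEvent], crash_time: datetime,
                window_minutes: int = 15) -> List[LogEvent]:
    """Return events within window_minutes of crash_time, oldest first."""
    window = timedelta(minutes=window_minutes)
    nearby = [e for e in events if abs(e.timestamp - crash_time) <= window]
    return sorted(nearby, key=lambda e: e.timestamp)
```

Merging the sources into one timeline makes it easier to spot, say, a host-side event that lines up with the guest going silent.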

Restarting (in some form) is probably 90% of fixes.

systemstech · September 2015

OctalDump wrote: »

Was there any documentation for how to handle this? Going by the book comes first. Sometimes there are specific procedures already in place. Sometimes certain boxes shouldn't be restarted, or have dependencies that need to be looked after, etc.

Without knowing anything else apart from what you wrote, I would probably have done much the same thing, although I'd likely have jumped ahead to the vSphere console after RDP failed. I'd also double-check the vSphere metrics to see if the server is actually doing anything (like high CPU, network, or disk).

But probably the same outcome: restart, check Event Viewer, escalate back to the team. Other options might be to increase logging levels and ensure a kernel dump is being kept after a crash.

Also, it's worth cross-correlating other logs, from VMware, the network, whatever. But all that depends on how important this server is.

Restarting (in some form) is probably 90% of fixes.

If there was anything like high CPU, network, or disk, we would have gotten a critical for that. We have it all set up in Opsview. Nope, there's no handbook for this type of stuff.

Thanks for the response. I've never been in a situation where the console is just a black screen. Usually there's some sort of error or something.

What do you mean by checking other logs in VMware?

dave330i · September 2015

VM HA should have been enabled. Then the crashed VM would have been restarted by vSphere.

systemstech · September 2015

dave330i wrote: »

VM HA should have been enabled. Then the crashed VM would have been restarted by vSphere.

What's VM HA?

OctalDump · September 2015

systemstech wrote: »

What's VM HA?

Pretty much what you would imagine. VMware checks the guest OS for a heartbeat; when there is no heartbeat, it can shock the OS back to life (a virtual power cycle). It's crash-consistent recovery, but still recovery.
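
To make the heartbeat idea concrete, here is a toy watchdog in Python. It is only a sketch of the general pattern (no heartbeat for longer than a timeout triggers a reset) and is not how vSphere implements it; the class name, the timeout value, and the reset counter are all made up for illustration.

```python
from typing import Optional


class HeartbeatMonitor:
    """Toy watchdog: flags a reset when heartbeats stop for too long."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat: Optional[float] = None
        self.resets = 0

    def heartbeat(self, now: float) -> None:
        """Record that the guest reported in at time `now` (seconds)."""
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Return True if a reset is triggered at time `now`."""
        if self.last_heartbeat is None:
            # No heartbeat seen yet (e.g. guest still booting): do nothing.
            return False
        if now - self.last_heartbeat > self.timeout_seconds:
            self.resets += 1
            # Stop monitoring until the guest reports in again after the reset.
            self.last_heartbeat = None
            return True
        return False
```

With a 30-second timeout, a guest that last reported at t=0 would be reset on the first check after t=30. Anything in memory at that point is lost, which is the "crash-consistent" part.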

https://www.vmware.com/au/products/vsphere/features/high-availability

It costs money since it isn't in all license levels for VMware. And there are reasons why you might not implement it.

If you want something really cool, then check out Fault Tolerance:

vSphere Fault Tolerance: VMware | VMware United Kingdom

Although that probably wouldn't have helped in this instance.

OctalDump · September 2015

systemstech wrote: »

What do you mean by checking other logs in VMware?

I mean checking the utilisation history, events, anything that might show something unusual prior to the crash. Which reminds me: VMware also has Ops Manager, which can be useful for alerts, notifications, etc. in these kinds of circumstances. It does automatic baselines and can report on anomalies and whatnot. Again, it costs extra, but if you have enough VMs it can be useful.
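
For a feel of what "automatic baselines" means, here is a very simplified Python sketch: treat the previous N samples of a metric as the baseline and flag any sample that sits far outside it. This is not how Ops Manager works internally; the function name, the window size, and the three-standard-deviation threshold are assumptions picked for illustration.

```python
from statistics import mean, stdev
from typing import List


def find_anomalies(samples: List[float], baseline_size: int = 20,
                   threshold: float = 3.0) -> List[int]:
    """Return indexes of samples far outside the preceding baseline window."""
    anomalies = []
    for i in range(baseline_size, len(samples)):
        baseline = samples[i - baseline_size:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all counts as unusual.
            if samples[i] != mu:
                anomalies.append(i)
        elif abs(samples[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies
```

Fed with CPU or disk latency history, something like this would flag a sudden spike before a crash even when the value never crossed a fixed alert threshold.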

techfiend · September 2015

I think you handled it fine. With a down server, there's really only one thing that can be done. Did you check the vSphere logs? Some clues might be in there. That might be the only thing you didn't mention that someone might get on you about, but that's how we improve.

Virtualization saved a lot of hassle this time. Is it Server 2008 or 2012?
