Hesitant to reboot a device?

MAC_Addy · May 2012

I have a NAS at work that had to be rebooted this morning. We were having a power outage and our battery backup was running low, so I had to decide what to turn off first and last. I consoled into it and saw that it had been up for 411 days. Now, I knew it needed to be turned off for a couple of minutes, or at least until the power came back on, but have you ever looked at the days up and had your inner voice scream at you to stop? Bah... there goes my 411 days up!

ccnxjr · May 2012

hehe, yeah!
I would be a little skeptical too, wondering: if I made any changes that haven't been saved in a config file, would it start up correctly? Or is there a particular power-on sequence for other systems that this one is linked to?

here's the current uptime for a logging server at work:
10:20:39 up 530 days, 3:22, 2 users, load average: 0.00, 0.02, 0.00

It really would be a shame to see this system rebooted or turned off...
530 days, it should get a medal or a watch or something for working this long with no days off :P

the_Grinch · May 2012

When I was on the night NOC we would have scheduled reboots for different devices, usually because someone had an issue during the day and the engineer troubleshooting believed a reboot would fix it. I always found it funny because I would read over the ticket and think "yup, a reboot isn't going to fix this issue", but I would do it anyway. But more to your topic, since I was remote I would always worry about the device not coming back. Sitting there on another server waiting several minutes for your ping to start again can be the worst 5 minutes of your life. One customer had their stuff at a data center where the night staff's only skill was knowing how to power cycle. The worst part was they had no idea where any equipment was located. It took me two hours to get them to find the server in the designated rack, and when I asked them for the rack number they said there wasn't one (which led me to ask why they asked me for one if the rack didn't have one?). The second part that was mind-boggling was they asked me where the key was to unlock the cage. Um, not my data center, how would I know? They had to take apart a pen to bring the server back up.

It was a terrible night...
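
For anyone who would rather not stare at a ping window, here is a minimal sketch of that wait in Python. It is only an illustration: the `tcp_probe` and `wait_for_return` names, the port, and the timeouts are all assumptions of mine, and it uses a TCP connect rather than ICMP ping so it can run without root. It waits until the host is first seen down and then seen up again, so it will not report success while the box is still shutting down.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_return(probe, max_wait=600.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll probe() until the host has gone down and then come back up.

    Returns the elapsed seconds when the host is seen up again after having
    been seen down, or None if max_wait passes first. If the host never goes
    down (for example the reboot never started), this also returns None.
    """
    start = clock()
    went_down = False
    while clock() - start < max_wait:
        if not probe():
            went_down = True          # reboot has actually taken the host down
        elif went_down:
            return clock() - start    # down earlier, reachable again now
        sleep(interval)
    return None


# Example use (192.0.2.10 is a documentation address, not a real server):
# elapsed = wait_for_return(lambda: tcp_probe("192.0.2.10", 22))
# print("back after", elapsed, "seconds" if elapsed is not None else "- never came back")
```

A `None` result is the cue to start looking for your out-of-band access instead of waiting longer.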

Akaricloud · May 2012

I'm always hesitant to reboot devices, but not because it kills the uptime counter. A lot of our servers are used for multiple purposes that we don't know about until they're offline.

Also, I've done the "I'll just reboot this server real quick" and had hardware problems that prevented startup again. Like the_Grinch mentioned, remote devices can be one of the worst. I once hunted all over for a server that wasn't coming back online, only to find that it was virtual.

Tackle · May 2012

Installing Windows updates then rebooting a Server is one of my worst fears (Well, not really if it's a VM or if it gets rebooted fairly often). If it's a physical server that is up and running, at least most of the time I can get some sort of indication that something is going to fail by watching the logs.

Some of the ESX Servers we run haven't been turned off in a couple years. Deathly scared to turn them off. Who knows if the drives will spin back up?

ccnxjr - I feel you on the not saving a config file. Really makes me nervous.

Hypntick · May 2012

As long as the device is virtual or has a remote access card, AND has current good backups I usually have no issues with rebooting if it's needed. But I verify I can restore from the backups before even considering it as an option.

One of our guys had to bounce a box one night to replace faulty hardware, corrupted the EDB, AD data store, and a half a dozen other things. Also when he was finally able to get back in, he had issues with restoring some of the backups. Solid 16+ hours he was on site rebuilding this box, all the while with the owner of the company sitting right there mad as heck.

Trifidw · May 2012

Yes, but only because I fear they wouldn't come back up... (Not an unjust fear, as this happened yesterday with one of our servers).

WafflesAndRootbeer · May 2012

the_Grinch wrote: »

When I was on the night NOC we would have scheduled reboots for different devices, usually because someone had an issue during the day and the engineer troubleshooting believed a reboot would fix it. I always found it funny because I would read over the ticket and think "yup, a reboot isn't going to fix this issue", but I would do it anyway. But more to your topic, since I was remote I would always worry about the device not coming back. Sitting there on another server waiting several minutes for your ping to start again can be the worst 5 minutes of your life. One customer had their stuff at a data center where the night staff's only skill was knowing how to power cycle. The worst part was they had no idea where any equipment was located. It took me two hours to get them to find the server in the designated rack, and when I asked them for the rack number they said there wasn't one (which led me to ask why they asked me for one if the rack didn't have one?). The second part that was mind-boggling was they asked me where the key was to unlock the cage. Um, not my data center, how would I know? They had to take apart a pen to bring the server back up. It was a terrible night...

And that right there is why a lot of people leave NOC/Server room jobs. Organization is one of the top three important things that must always be covered and it's usually lacking to the point where you'd have to do it yourself to get it done, which may or may not go over well with your supervisors. I've known a lot of them to take it as a slap in the face if you show anything but complete acceptance for their failures.

kalebksp · May 2012

Trifidw wrote: »

Yes, but only because I fear they wouldn't come back up... (Not an unjust fear, as this happened yesterday with one of our servers).

My thoughts exactly. Multiple times I have seen devices that have been running for a long time not come back from a reboot, usually due to hardware failure. The worst is when it's during remediation of a new client network and backups are questionable.

Essendon · May 2012

Say your prayers when you hit that Restart button, or have iLO/DRAC access to the server. A colleague once had to drive 2 hours at 11pm one night to the sticks to reboot an Exchange server whose iLO he had assumed was working, but wasn't (the server couldn't properly shut down after Windows updates). No one was happy, as you can imagine.

steve_f · May 2012

Yep, terrifies me. Especially when you have only budgeted about 20 minutes for the problem, and the server doesn't come back up!

Once rebooted an Exchange server (THE most scary to reboot IMO). I know Exchange pretty well, but still, its disaster recovery options are not simple.
The server came up fine, but of the 2 HP MSA StorageWorks disk arrays connected to it, only 1 was detected.

Was quite terrifying, but got it working in the end. Whew!

aquilla · May 2012

Hell yeah! On the night shift we sometimes get requests to reboot servers if there have been problems. I hate rebooting Exchange servers. The worst one for me is when you reboot an Exchange server and it takes upwards of 10 minutes just to go down because of the time taken to dismount the stores etc. The RDP session gets terminated and you can't get back on to see what's happening. It can be useful to shut down via iLO if that is available, since at least you can watch it shut down. Then you're praying it comes back up clean.

Forsaken_GA · May 2012

High uptime just means the server likely hasn't been patched and maintained the way it should be. Sure, for your own boxes in your own lab, go for the geek cred. In production? Tsk, tsk.

Forsaken_GA · May 2012

Essendon wrote: »

Say your prayers when you hit that Restart button, or have iLO/DRAC access to the server. A colleague once had to drive 2 hours at 11pm one night to the sticks to reboot an Exchange server whose iLO he had assumed was working, but wasn't (the server couldn't properly shut down after Windows updates). No one was happy, as you can imagine.

Yeah, we have this problem with some network gear occasionally. Whenever an upgrade that is being done requires a reboot, it's standard procedure to ensure we can see the console of both supervisor modules via the term server prior to a reboot. If we can't, we don't do it.

Occasionally, someone slips up and forgets to make that verification ahead of time, and they have cause to regret it. Ensuring proper out-of-band management access prior to performing maintenance is a hallmark of a good operations team.
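
A rough sketch of that "verify first, or don't do it" rule in Python, in case it helps someone script it. Everything here is hypothetical: the `unreachable_oob` and `tcp_probe` names, the target labels, the 192.0.2.x documentation addresses, and the port numbers are placeholders, and a successful TCP connect only shows the console port answers, not that you can actually log in.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_oob(targets, probe=tcp_probe):
    """Return the names of out-of-band targets that do not answer.

    targets maps a label to a (host, port) pair, for example the terminal
    server lines for each supervisor module or an iLO/DRAC address.
    An empty list means every console path answered.
    """
    return [name for name, (host, port) in targets.items()
            if not probe(host, port)]


# Placeholder inventory; substitute your own term server lines and ports.
OOB_TARGETS = {
    "sup1-console": ("192.0.2.10", 2001),
    "sup2-console": ("192.0.2.10", 2002),
}

# Example gate before a maintenance window:
# missing = unreachable_oob(OOB_TARGETS)
# if missing:
#     raise SystemExit("No reboot tonight, OOB unreachable: " + ", ".join(missing))
```

The point of returning the list rather than a plain yes/no is that the person on shift sees exactly which console path needs fixing before the window.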

shodown · May 2012

Forsaken_GA wrote: »

Yeah, we have this problem with some network gear occasionally. Whenever an upgrade that is being done requires a reboot, it's standard procedure to ensure we can see the console of both supervisor modules via the term server prior to a reboot. If we can't, we don't do it.

Occasionally, someone slips up and forgets to make that verification ahead of time, and they have cause to regret it. Ensuring proper out-of-band management access prior to performing maintenance is a hallmark of a good operations team.

I thought this was kinda a no-brainer, but in a lot of shops it isn't. I actually carry an out-of-band box in my car. Most of our critical sites have a backup cable/DSL connection for this very reason, to ensure we can get access if things go wrong during a window.

jmritenour · May 2012

I'm with Forsaken on this one - if you're not rebooting, you're obviously not keeping up with patches, and that's nothing to be proud of. Maybe not a big deal for some network or storage devices, but a server?

Though a couple of our clients have some ancient Dells still running Windows 2000, most of which haven't been rebooted since the last set of updates came out for that. I dread when we do have to reboot one of them, because they almost never come back up easy. They run like a champ once they are up, but getting there can be a battle.

jibbajabba · May 2012

I only once made the mistake of rebooting a Linux-based NFS NAS after 600 days of uptime. I didn't know that Linux forces a filesystem check after x number of days, and the NAS (20TB+) was down for days as it ran fsck on reboot.
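
For what it's worth, on ext2/3/4 that forced check is driven by the maximum mount count and check interval stored in the superblock, which `tune2fs -l` prints. Below is a small Python sketch that reads that output and guesses whether a full fsck is likely on the next boot. The function names are my own, the sample text is made up for illustration, and it only applies to ext filesystems where periodic checking is enabled, so treat it as a rough pre-reboot hint rather than a guarantee.

```python
from datetime import datetime, timedelta


def parse_tune2fs(text):
    """Turn 'Key:   value' lines from `tune2fs -l` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def fsck_reasons(text, now):
    """Return a list of reasons a periodic fsck looks due (empty if none)."""
    f = parse_tune2fs(text)
    reasons = []

    mounts = int(f.get("Mount count", "0"))
    max_mounts = int(f.get("Maximum mount count", "-1"))
    if max_mounts > 0 and mounts >= max_mounts:
        reasons.append("mount count %d has reached the maximum %d" % (mounts, max_mounts))

    # The value looks like '15552000 (6 months)' or '0 (<none>)'.
    interval = int(f.get("Check interval", "0").split()[0])
    if interval > 0 and "Last checked" in f:
        last = datetime.strptime(f["Last checked"], "%a %b %d %H:%M:%S %Y")
        if last + timedelta(seconds=interval) <= now:
            reasons.append("check interval elapsed since %s" % last.date())
    return reasons


# Illustrative sample only, not output from a real system.
SAMPLE = """\
Mount count:              3
Maximum mount count:      30
Last checked:             Sat Oct  2 09:15:00 2010
Check interval:           15552000 (6 months)
"""

# Example: fsck_reasons(SAMPLE, datetime.now()) lists why a check looks due.
```

If it reports anything on a multi-terabyte volume, that is a good sign to schedule a much longer window than "a quick reboot".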

aquilla · May 2012

With regards to patching servers, we prefer to have a local IT engineer on site to do it in case anything goes wrong. We once patched a remote server in New York with an Exchange SP. The installation crashed half way through and trashed the server. Local IT engineer was on the phone to Microsoft support for several hours before MS advised they couldn't recover the server. Had to rebuild the server from scratch and then restore the Exchange database from backups.

jibbajabba · May 2012

aquilla wrote: »

With regards to patching servers, we prefer to have a local IT engineer on site to do it in case anything goes wrong. We once patched a remote server in New York with an Exchange SP. The installation crashed half way through and trashed the server. Local IT engineer was on the phone to Microsoft support for several hours before MS advised they couldn't recover the server. Had to rebuild the server from scratch and then restore the Exchange database from backups.

Patching .... can't wait

We are a big financial corporate (miss my hosting days sometimes) and therefore all servers are mission critical. Patching happens only twice a year; the next one is next month, and with 1k+ systems you can imagine how much fun it is to babysit that with 6 people (and we need to be done in six weeks' time).

Forsaken_GA · May 2012

jmritenour wrote: »

Though a couple of our clients have some ancient Dells still running Windows 2000, most of which haven't been rebooted since the last set of updates came out for that. I dread when we do have to reboot one of them, because they almost never come back up easy. They run like a champ once they are up, but getting there can be a battle.

Worst case scenario I've ever had - customer with flaky Win2k sp2 servers that may or may not come back up on reboot, and we could *not* patch it, because the patches past a certain point broke the entire back office accounting application for their entire company... and the original developer was defunct, so couldn't update it (and they never bothered to get the source code, so they couldn't get anyone else to fix it) and they weren't willing to pay the cash to migrate to something else.

I was very, very happy when they decided to build their own datacenter and take their management in house.

ptilsen · May 2012

Essendon wrote: »

Say your prayers when you hit that Restart button, or have iLO/DRAC access to the server.

Nearly every time I audit a project, someone misses an iLO. Either they don't have it accessible, don't have it configured, or don't have the key installed. Every time. It drives me nuts. I don't do anything unless I know I can get at it remotely if it breaks.

Essendon · May 2012

Or the VLAN was changed and no one remembered to change the iLO's address too. There's a GUI that you can use to edit the iLO's settings, but it requires .NET 3.5, and that cannot be installed without a major change (the server needs a restart too). Thankfully, we upgraded the majority of our physical servers from G5's to G6's and more recently, G7's, and iLO configuration is a key deliverable. Oh, and about 200 of our servers are now virtual, so no more iLO and no more freaking nervous moments.

Forsaken_GA · May 2012

Essendon wrote: »

Or the VLAN was changed and no one remembered to change the iLO's address too.

Ah yes, we find this one all the time too, but with the host entries on the tsv (sometimes it just moved ports and no one bothered to tell us), which is why we like to verify our OOB before we do any maint hehe

Essendon · May 2012

Forsaken_GA wrote: »

which is why we like to verify our OOB before we do any maint hehe

Good idea!
