How to get into the mindset of troubleshooting Linux/Windows

YuckTheFankees · December 2012

I have made it to the final stage of the interview process for a technical support/sysadmin type position (mostly Linux, some Windows). I have a networking background mixed with some Linux tasks (viewing disk usage, deleting files, etc.).

I am used to troubleshooting network connections (ping, traceroute, netstat, etc.) but not servers, and I want to understand how system admins on TechExams look at a problem and resolve it from start to finish (what's your first step, commands you may use, tools, strategy, most common issues, important files).

During my technical interview, I was asked, "How do you troubleshoot a server that keeps rebooting?" I have never experienced this firsthand, so I replied:
* Take down the exact error message and any error codes
* Possibly do a screen print
* See if any particular file/dir/device was mentioned in the error
* Determine whether the server is rebooting or shutting down
* View logs/Event Viewer if possible

How would you go about troubleshooting an issue like this? Any other troubleshooting tips would be helpful.

dbrink · December 2012

For a Windows server that keeps rebooting I would do what you have said, but if it keeps happening you will probably have to get a memory dump and run it through the Windows debugger to figure out which module/DLL is causing the fault.

ptilsen · December 2012

The important thing about troubleshooting is not specifics. Specifics are important, of course, for an SME or specialist on a particular system, but in most positions you will be troubleshooting different things. As a result, what's most important is your approach to troubleshooting, not the specifics of, for example, a server that keeps rebooting. Using a good approach means you can come up with a good answer to that question regardless of its specifics.

I would answer that I would first gather as much information as possible. In the case of a chronic rebooting issue, this means the frequency and timing of the reboots, any possible or identified correlating events, when the problem originated or is suspected to have originated, and any error messages, logs, or other possible indicators as to what could be causing the issue. After collecting information, if the evidence does not point to a conclusive cause, I would then try to determine likely causes and seek to eliminate them as possibilities. That is a general response to a partially specific question, and it doesn't involve specific tools or checking for specific technical issues.

Tools and techniques are going to vary between operating systems, hardware platforms, and even server roles. What's important is that you know how and when to apply them within a more general process. The details of "a rebooting server" are going to vary too much for me to say "well, I would take a screen print, look over the logs, run a chkdsk, and test the power source". When we start getting specific information is when we say "I'm going to check these logs, look for these types of defects, and run these tests."

In general, rebooting servers are not dissimilar to rebooting network devices. The same techniques and even some of the same tools are going to apply. In a related story, troubleshooting issues with network applications will involve some of the same techniques and most of the same tools (in terms of what they do, anyway) as troubleshooting network devices and connections.

ChooseLife · December 2012

YuckTheFankees wrote: »

I want to understand how system admins on TechExams look at a problem and resolve it from start to finish (what's your first step, commands you may use, tools, strategy, most common issues, important files).

Goal: Find the pattern/correlation with other events and/or a way to reproduce the problem reliably
First step: Check logs (dmesg, /var/log/*)
Tactics: Collect data - resource usage trending, process tree snapshots, process traces
Tools: dmesg, syslog, snmp, ps, top, vmstat, iostat, nmon, strace...
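
To make the "collect data" tactic a bit more concrete, here is a minimal Python sketch that parses the text of /proc/meminfo and /proc/loadavg so that samples can be logged over time. This is only an illustration: the function names `parse_meminfo` and `parse_loadavg` are made up for this example, and only the /proc file formats are real.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)
```

Feeding these the contents of the two files every minute or so and appending the result with a timestamp to a file gives a crude trend line to compare against the time of each reboot.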

pram · December 2012

As said, for RHEL the tell-tale signs will usually be in dmesg or /var/log/messages. If it isn't obvious there, then I'd consider using sar/sysstat, which collects system activity data at regular intervals so you can look back at what the box was doing before it went down. The only two things I ever see Linux servers restarting from are the OOM killer going crazy (usually on web servers) and kernel panics. Both of these show up in dmesg/messages.
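
A minimal sketch of scanning log lines for those two culprits, in Python. The substrings are ones the Linux kernel is known to log for OOM kills and panics; the function name `find_reboot_clues` and the labels are made up for this example, and the exact wording of kernel messages varies somewhat between kernel versions, so treat the patterns as a starting point.

```python
import re

# Substrings the Linux kernel logs for OOM kills and panics.
# The dictionary keys are labels invented for this sketch.
PATTERNS = {
    "oom_killer": re.compile(r"invoked oom-killer|Out of memory: Kill(ed)? process"),
    "kernel_panic": re.compile(r"Kernel panic - not syncing"),
}


def find_reboot_clues(lines):
    """Return (label, line) pairs for lines that match a known pattern."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
                break  # one label per line is enough
    return hits
```

You could pass it an open file object for /var/log/messages, or the output of dmesg split into lines.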

the_Grinch · December 2012

On the Windows side, there are a couple of things I would try. First (especially if I am physically there), check the UPS to see if there is some sort of power issue (dead battery). Depending on the setup, you might be able to remote into it via a web interface and check. Second, attempt to get the server into safe mode. If that works, I'd then begin to look at the logs, System and Application specifically. Third, I'd check to see if any updates were recently installed. The big thing is finding out whether it is blue screening before it reboots, since there is an option (automatic restart on system failure) that keeps it from sitting at the error screen. Hopefully in safe mode you'd be able to get to that.
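
When going through the System log after an unexpected restart, a few event IDs are worth filtering on first. Here is a minimal Python sketch, assuming the log has already been exported into a list of records. The `(event_id, source, message)` tuple layout and the function name `restart_events` are hypothetical; the event IDs are standard Windows ones, though which of them appear depends on the Windows version.

```python
# Event IDs commonly associated with unexpected restarts in the System log.
RESTART_EVENT_IDS = {
    41: "Kernel-Power: system rebooted without a clean shutdown",
    1001: "BugCheck: system rebooted from a bugcheck (blue screen)",
    6008: "EventLog: previous shutdown was unexpected",
}


def restart_events(records):
    """Keep only records whose event ID suggests an unclean restart.

    Each record is assumed to be an (event_id, source, message) tuple.
    """
    return [r for r in records if r[0] in RESTART_EVENT_IDS]
```

A 1001 BugCheck entry points toward a blue screen, while a 41 or 6008 with no bugcheck alongside it leans more toward power or hardware.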

The other thing you'll want to consider is that it is probably a hardware memory issue. Typically, this means you would be unable to boot into safe mode. At that point I would run whatever bootable hardware tests are available. I know with HP servers, they have a tool that is installed within the OS which can tell you about any hardware issues (bad drives, memory, etc.).

I always hate those questions just because there are so many answers and I really like to see the issue myself. I always find I can troubleshoot better when I'm right there and can actually see what's happening. It's like being a DO, I gotta "touch" it as it were.

fly2dw · December 2012

Some good advice on this.

ptilsen wrote: »

I would answer that I would first gather as much information as possible. In the case of a chronic rebooting issue, this means the frequency and timing of the reboots, any possible or identified correlating events, when the problem originated or is suspected to have originated, and any error messages, logs, or other possible indicators as to what could be causing the issue. After collecting information, if the evidence does not point to a conclusive cause, I would then try to determine likely causes and seek to eliminate them as possibilities. That is a general response to a partially specific question, and it doesn't involve specific tools or checking for specific technical issues.

I agree with this, as I think we have had the same A+ troubleshooting conditioning.

Just to extend on some of these points, I would recommend the following:

1 - When did the issue begin, when was it first noticed?

2 - How often do the reboots occur? Is there a certain time? Is it random?

3 - What was the last significant thing that was done to/on the machine? Internet browsing? Deleting system files (It happens)? Notice a virus alert from AV? Firmware update (Flash BIOS)? Device driver install?

4 - When does the reboot occur when using the computer?

a - Does the machine complete its POST process? If not, then you may need to listen to the error-code beeps indicating a potential hardware issue and read the motherboard troubleshooting instructions on what they mean (some are universal, but best to check the manual). You may need to investigate the BIOS for settings that have changed (boot order, CPU settings, etc.). Has there been a recent firmware upgrade on the BIOS carried out by the user (hopefully not) or a technician? You may need to open the computer up and check that components are working correctly and are not damaged.

b - Machine completes POST process but reboots just before OS login. This could be incorrect boot partition settings. You could set up some debugging to capture the potential issue. In Windows you could even check GPOs for startup scripts, as there could be some kind of misconfiguration.

c - Machine completes POST and the user can log in, but the reboots happen in the OS. Here you can check Event logs (Application, System, or specific logs), set up debugging, or check AV logs. Some vendors like Dell and HP have software that you can load on the machine to generate hardware error messages. You can use a ton of tools (Sysinternals makes great tools for Windows). You could use a boot CD to analyse the HDD while the OS is not running, such as the Ultimate Boot CD or any number of Linux distros (you could even use the default chkdsk).

5 - Once you have identified and fixed the issue, do not forget to document the problem and solution.
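
For point 2 (how often, at a certain time, or random), a quick way to see the pattern is to line up the boot times and look at the gaps and the hours of day. A minimal Python sketch, assuming you have already collected the boot times as `datetime` objects (for example by hand from `last reboot` output on Linux or from the event log on Windows); the function name `reboot_pattern` is made up for this example.

```python
from collections import Counter
from datetime import datetime


def reboot_pattern(boot_times):
    """Summarise gaps between boots and the hours of day they happen in."""
    times = sorted(boot_times)
    # Gap in hours between each boot and the next one.
    gaps_hours = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(times, times[1:])
    ]
    # How many boots fell into each hour of the day (0-23).
    by_hour = Counter(t.hour for t in times)
    return gaps_hours, by_hour
```

Gaps that cluster around 24 hours, or boots that all land in the same hour, suggest a scheduled job or backup window; gaps all over the place lean more toward hardware or load.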

This works for me, but there are a ton of troubleshooting models out there if you want a source of reference. A+ goes into this kind of stuff pretty well:

CompTIA A+ Troubleshooting Model - Flashcards - Create Free Flashcards

Or check this from Microsoft:

Understanding Troubleshooting

Hope this helps.

W Stewart · December 2012

The first thing I would do is go into the BIOS to see if it reboots there, so that I know whether it's hardware. If it reboots in the BIOS then it's either an overheating processor, a bad motherboard, or a bad PSU. Hopefully the BIOS keeps a log that you can check; otherwise you will need to start swapping parts or the entire system, depending on how your department operates.

If it's an OS issue on Linux, then find out where exactly it reboots. Does it reboot while at the GRUB prompt? Usually if it's software, a particular point in the software would have to trigger the reboot. Try going into single-user mode and seeing if it reboots there as well, as it may be an issue with X11 if the system is even running that. I would imagine in Linux it would reboot the moment the kernel loaded. That's a more likely scenario, so try booting to a different kernel to see if you get the same results. If none of that works, then try booting to a live CD and installing a new kernel or reinstalling GRUB/LILO if that turned out to be the issue. It may also be an issue with the hard drive if it never makes it to GRUB or reboots very early in the boot process. There's a lot of stuff to cover.

With Windows I would go through the same hardware troubleshooting and try booting into safe mode. If safe mode works, then you may want to try a diagnostic startup with msconfig to see what's causing the issue, and maybe try disabling the video driver as well. If it never makes it past the Windows splash screen, then get a recovery CD and run fixmbr and fixboot (on Vista and later these are bootrec /fixmbr, bootrec /fixboot, and bootrec /rebuildbcd).

If it's a hard drive issue then you'll usually know by running diagnostics or when you try to reinstall the OS and it doesn't quite go the way you planned.

Also, if it's a blue screen, get the stop code. I believe Windows Vista has an equivalent to a blue screen; they just got rid of the blue part.

I remember reading somewhere to start troubleshooting from layer 1 on up. It definitely helps to do it that way so you're not trying to read log files while the system is rebooting on you. You've definitely got to know how frequent the reboots are, though, so you know when to rule out the processor, PSU, and mobo.

CodeBlox · December 2012

For servers rebooting (or any Windows workstation for that matter) you could check the files in C:\Windows\Minidump.

When a BSOD occurs, the dump gets stored there if enabled (which it usually is). You will need a third-party tool to get any useful data from the *.dmp files. Oftentimes, the results tell you exactly what's causing the BSOD. On workstations, the BSOD can come up and go pretty quickly, giving the illusion that the workstation just rebooted for no reason. We have a domain controller at work which occasionally BSODs.
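
Before opening anything in an analyzer, it helps to see which dumps exist and when they were written, so you can match them to the reboot times. A minimal Python sketch; the function name `recent_dumps` is made up for this example, and it only lists files, it does not read the dump contents.

```python
from pathlib import Path


def recent_dumps(dump_dir, limit=5):
    """Return up to `limit` .dmp files from dump_dir, newest first."""
    files = [p for p in Path(dump_dir).glob("*.dmp") if p.is_file()]
    # Sort by last-modified time so the most recent crash comes first.
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]
```

On the affected machine you would call it as `recent_dumps(r"C:\Windows\Minidump")`. Several dumps written within minutes of each reboot is a good sign you are looking at a BSOD and not a power problem.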

higherho · December 2012

I moved into a more Linux-based environment (at least for one of our major big-data applications), and you would be surprised how much troubleshooting actually carries over from a Windows environment. Basically what I mean is, it wasn't hard for me to bring over my troubleshooting skills from being a Windows admin to a hybrid Linux/Windows admin.

1. You will always check your logs first; this is pretty much a given no matter what OS you are using.
2. Verify services are running correctly, etc.
3. Use built-in tools or external tools to aid you in your troubleshooting.
4. Follow a logical pattern, write down what steps you have taken so far, and use your knowledge of that OS to figure out a solution, or a workaround until a proper solution is made.

Is Linux harder to troubleshoot than Windows? It depends; I find it easier to work around issues on Linux than on Windows. Is the command line hard? No, it just requires you to remember certain commands. Is Linux software package distribution better than Windows? I personally don't think so, but I can install software on a Linux box much faster than on a Windows box. Is scripting the same? Pretty much, just different commands.

As long as you know your overall concepts such as DNS, SMTP, and TCP, you can apply them on any OS; some just handle them differently. Once you get to know your file system (/etc, /lib, /lib64, /sbin, etc.) and can handle vi well, you will get used to Linux pretty quickly.

cgrimaldo · December 2012

CodeBlox wrote: »

For servers rebooting (or any Windows workstation for that matter) you could check the files in C:\Windows\Minidump.

When a BSOD occurs, the dump gets stored there if enabled (which it usually is). You will need a third-party tool to get any useful data from the *.dmp files. Oftentimes, the results tell you exactly what's causing the BSOD. On workstations, the BSOD can come up and go pretty quickly, giving the illusion that the workstation just rebooted for no reason. We have a domain controller at work which occasionally BSODs.

Do you have any recommendations for third party tools to view minidump files?

the_Grinch · December 2012

I've always used WhoCrashed to analyze minidumps. It does it all automatically and is actually a good tool.

Resplendence Software - WhoCrashed, automatic crash dump analyzer

Obviously, though, you need to confirm that it isn't a hardware issue. I had a laptop that kept crashing, and it reported Symantec along with Mozy as causing the crashes. Ultimately I found out it was bad RAM.

YuckTheFankees · December 2012

Thank you for all the helpful tips. I am currently reading multiple books to come up with a strong troubleshooting strategy. Slowly but surely, I am beginning to get a grasp for troubleshooting.

dbrink · December 2012

If you want to get a deep understanding of Windows then I would recommend the Windows Internals books by Mark Russinovich.

YuckTheFankees · December 2012

I'm more focused on Linux, just basics of Windows.

pram · December 2012

Not directly related to rebooting issues, but strace can be a lifesaver. I've used it to debug quite a few scripts. If you've never used it, it essentially attaches to a process and lets you view the system calls. This can be helpful in determining why a program is having issues, as you can see errors being generated that typically don't end up in any log.

For example, one of my clients had a Java program that ran through JBoss that was causing random humongous load spikes. Rather than wait on the programmer to debug it, I decided to take a look at what it was doing. Ultimately, by using 'strace -e trace=network' on it, I discovered it was having name resolution issues, and the threads were getting stuck in an infinite loop.

It can be fairly daunting to use at first because the output isn't very intuitive, but it's truly invaluable for troubleshooting.
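
One way to tame the output is to save it to a file (for example with `strace -f -p <pid> -e trace=network -o out.txt`) and count which calls are failing and with which errno. A minimal Python sketch, assuming strace's default line format with an optional `[pid N]` prefix and no timestamp options; the function name `count_failures` is made up for this example.

```python
import re
from collections import Counter

# Matches a failed call in default strace output, e.g.
#   connect(3, {...}, 16) = -1 ECONNREFUSED (Connection refused)
FAILED_CALL = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(.*\)\s+= -1 (E[A-Z0-9]+)")


def count_failures(strace_lines):
    """Count (syscall, errno) pairs among failed calls in strace output."""
    counts = Counter()
    for line in strace_lines:
        match = FAILED_CALL.search(line)
        if match:
            counts[match.group(1), match.group(2)] += 1
    return counts
```

`count_failures(open("out.txt")).most_common(5)` then shows the handful of errors that dominate, which is usually where to start reading the raw trace.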

yuddhidhtir · December 2012

Nice thread! Learned a lot, thanks.

ChooseLife · December 2012

pram wrote: »

Not directly related to rebooting issues, but strace can be a lifesaver. I've used it to debug quite a few scripts. If you've never used it, it essentially attaches to a process and lets you view the system calls. This can be helpful in determining why a program is having issues, as you can see errors being generated that typically don't end up in any log.

For example, one of my clients had a Java program that ran through JBoss that was causing random humongous load spikes. Rather than wait on the programmer to debug it, I decided to take a look at what it was doing. Ultimately, by using 'strace -e trace=network' on it, I discovered it was having name resolution issues, and the threads were getting stuck in an infinite loop.

It can be fairly daunting to use at first because the output isn't very intuitive, but it's truly invaluable for troubleshooting.

I agree absolutely. strace can be intimidating, but a server admin not having it in the arsenal is akin to a network engineer unfamiliar with tcpdump. Other tools provide indirect hints; strace provides real data.
