Sometimes you forget simple troubleshooting, eh?

Two issues which made me chuckle.

1. After a Windows update, a server doesn't come back up - boot error

Three people were working on it ... After failing to get into Safe Mode, they swapped the RAID card and the motherboard and were close to re-imaging the server. I get a call-out at 11pm ... no joy doing anything remotely ... get into the car to the datacenter ... walked up to the server, took out the floppy disk, and the server rebooted just fine. Mind you, the round trip to the DC is 4.5 hrs, which you don't really like after 11pm.

... ok, at first I was p**d off of course - but at the end I did chuckle

2. Server disjoined the domain just fine. Fails to rejoin ... people assume network issues and run a cable to a different switch on a different network to rule that out ... nothing. Two hours later I come along, run "nslookup", notice the DNS server is wrong, set the correct one, and it joined just fine ...
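For anyone who wants to make that first check a habit, here is a minimal sketch in Python that reads the header nslookup prints and compares the default server against the DNS servers you expect for the domain. The function names, the sample output, and all addresses are hypothetical, and the parsing assumes the usual "Server:" / "Address:" header layout.

```python
def default_dns_server(nslookup_output):
    """Return the address of the default server from nslookup's header, or None."""
    in_header = False
    for line in nslookup_output.splitlines():
        line = line.strip()
        lowered = line.lower()
        if lowered.startswith(("default server:", "server:")):
            in_header = True
        elif in_header and lowered.startswith("address:"):
            addr = line.split(":", 1)[1].strip()
            # Some nslookup builds append the port as '#53'; drop it.
            return addr.split("#")[0]
    return None


def dns_is_expected(nslookup_output, expected_servers):
    """True if the resolver in use is one of the expected DNS servers."""
    return default_dns_server(nslookup_output) in expected_servers


# Hypothetical output from a box pointing at the wrong resolver.
sample = """Default Server:  isp-resolver.example.net
Address:  203.0.113.53
"""

print(default_dns_server(sample))
print(dns_is_expected(sample, {"10.0.0.10", "10.0.0.11"}))
```

If the second call comes back False, fix the DNS setting before touching any cables.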

Needless to say, a lot of double facepalm images made the rounds ..

Anyone else had to deal with fails like that?

I think once you work in high-end / tech environments you sometimes forget the easiest things ..

I also remember someone fighting with the network team about a server not negotiating at GbE, and only after half a day of playing with it did someone check the network cable, which was broken.

Comments

Monkerz

I had a "duh" moment not too long ago.

We recently had a branch office lose power for a few hours. When the power was restored the router came back up, but everything else was intermittently online. By this I mean almost predictable packet loss.

The site had called our help desk, and the help desk tech walked the user through "powercycling" the router and 48 port L2 switch, then submitted a ticket to our group with this exact wording as the issue: "Office lost network after power surge, had user to power on the cisco router 2600 they were online briefly. pings seems as if their circuit maybe down. Their phones are also down. Bang box is working and scanning server and I am also able to access the server but its moving rather slow."

Now the phrases "pings seems as if their circuit maybe down" and "I am also able to access the server" just didn't make sense to me. If the circuit was down, all gear downstream of that WAN link would be offline.

I get to looking, I connect to the router and see the SVI, WAN port and a few switch ports up and rx/tx'ing packets. So everything connected to this device was working, for the most part. I pull out my "trusty" site diagram and see that this site also has a L2 switch, so I try to ping it. The pings fail, so I check the arp table on the router and the entry is sitting there. I run a "sh lldp nei" and see the device's information, I can see what model it is, what port it is connected to the RTR on and what the management IP of the device is.

I had the branch power cycle the device. Once rebooted, I had a full 30 seconds of constant uninterrupted pings going, then it seemed as if the switch crapped out again and the same ole packet loss began again. Beginning to think this was a corrupted ARP table on the switch, as the switch was constantly ARPing the router like it wasn't saving the reply, we powercycled the switch again and threw in a static ARP statement back to the router. This kept the switch up so we could continue our troubleshooting on the device. Long story short, after an hour of piddling with this, upgrading AOS, swapping out patch cables, and everything else under the sun, we set up a new switch and gave it to our shipping department to be overnighted to the site.

It was only after this that my co-worker decided it would be a good idea to debug ARP on the switch that was having issues. That was it, a simple debug arp showed us the problem. We had two different devices replying to this switch saying they were xxx.xxx.xxx.1. When I saw that, the light switch flipped on and, pissed off at myself, I blurted out, "those f****** installers left the old Cisco router in place when they installed the new Adtran." Then I went back and read the ticket again: the tech had mentioned he "powered on" the Cisco router, not that he had powercycled the unit.

I shut down the port on the switch that the Cisco 2600 was connected to and everything began to work again. I called the site and had them rip out that router and ship it back to me.

So I guess this is a misreading and forgetting the basics story.
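The condition that debug arp exposed, two devices answering for the same IP, is easy to check for once you have the replies in hand. A rough sketch in Python, assuming you have already collected (IP, MAC) pairs from a debug log or packet capture; the function name and all addresses below are made up for illustration.

```python
from collections import defaultdict


def find_duplicate_ip_claims(arp_replies):
    """Return {ip: [macs]} for every IP that more than one MAC has answered for.

    arp_replies is an iterable of (ip, mac) pairs taken from ARP replies.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in arp_replies:
        macs_by_ip[ip].add(mac.lower())  # normalise case so one MAC is not counted twice
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical replies: two different devices both claim to be .1
replies = [
    ("192.0.2.1", "00:11:22:33:44:55"),
    ("192.0.2.1", "00:AA:BB:CC:DD:EE"),
    ("192.0.2.20", "00:11:22:33:44:66"),
    ("192.0.2.1", "00:11:22:33:44:55"),
]

print(find_duplicate_ip_claims(replies))
```

Any IP that shows up in the result has more than one owner on the segment, which is exactly what the forgotten router was doing to the gateway address.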

Everyone

I hope you got paid for the drive time.

While not quite as simple, we had a changeover this last weekend to cut POP3, IMAP, OWA, and ActiveSync over to the new Exchange 2010 servers. Even though I'm the only one on the team with real world experience doing this, I wasn't involved, mostly because the Exchange 2010 migration project was well underway before I even got hired. So I ask the project lead who was doing the implementation how it went, and he tells me the change that was scheduled for 1 hour (in reality it should only take about 15 minutes, if that) turned into 18 hours and it still isn't working right. We have 10 CAS servers in the CAS array, but they had problems with the HLBs so only 2 of those 10 servers are being used right now.

I asked him what kind of HLBs we have, and he tells me Foundry ServerIrons. I said "Dude, you should have called me, I bet I know exactly what is going on, I had issues getting things working with Foundry HLBs at my last job." He tells me that when they turn on HTTP to HTTPS redirect nothing works. Yup, just what I suspected. By default these HLBs only recognize HTTP status codes 200-299 as healthy, and the status code that IIS 7 returns for redirects is 302. There's a simple command to add status codes to the HLB so it won't think the server is down when it isn't. Apparently our network guys who maintain these HLBs didn't know that. There are also a couple of other "gotchas" with configuring the health checks to work right with Exchange 2010 on these HLBs, so I told him about those too. Could have saved a ton of time if they'd just asked me!
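To illustrate why the redirect knocked the servers out of rotation, here is a small Python model of the health-check logic as described above. It is not ServerIron syntax or anything vendor-specific; the function name and the extra_ok_codes parameter are hypothetical and only mimic the idea of adding status codes to the accepted set.

```python
def server_is_up(status_code, extra_ok_codes=frozenset()):
    """Model of a health check that accepts 2xx plus any explicitly added codes."""
    return 200 <= status_code <= 299 or status_code in extra_ok_codes


# Default behaviour: a plain 200 passes, but the 302 redirect looks like a dead server.
print(server_is_up(200))
print(server_is_up(302))

# After adding 302 to the accepted codes, the redirecting server counts as healthy.
print(server_is_up(302, extra_ok_codes={302}))
```

The server never actually went down; the check simply did not count a redirect as a sign of life.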

undomiel

Monkerz's story sounds pretty familiar. We had a site that all of a sudden started experiencing intermittent internet access problems. The other guys would check things, reboot the router and say there you go, fixed. Then 10 minutes later they're experiencing problems again. Unfortunately we have a number of unmanaged switches in that environment, so I had to monitor ARP tables from the servers. Sure enough, my suspicion was confirmed and I saw that .1's MAC address had switched to something unfamiliar. Looking up the address turned up a wireless maker, so I figured I was looking at a WAP of some sort. Fortunately the site was only 15 minutes away from me so it wasn't too bad of a drive. I get onsite and locate the problem unit pretty swiftly.

Checking back on things I eventually found out what happened. There was a WAP onsite from the previous provider that had never worked properly, so we left it disconnected. Well, an enterprising soul at our company decided to remedy the problem and had proceeded to reset the thing to defaults and reconnect it to the network. He could never connect into it to reconfigure it though, and eventually gave up in frustration, of course leaving it connected to the network. Sad thing was that he was doing this to procrastinate rolling out a number of new PCs, since that work was pretty boring and it is much more exciting to reconfigure a WAP.
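Watching ARP tables by hand gets old fast, so here is a rough Python sketch of the same idea: given periodic snapshots of a server's ARP table, report every time the gateway's MAC changes. How the snapshots get collected is left out, and the function name, timestamps, and addresses are all hypothetical.

```python
def gateway_mac_changes(snapshots, gateway_ip):
    """Return (timestamp, old_mac, new_mac) for each change of the gateway's MAC.

    snapshots is a time-ordered list of (timestamp, {ip: mac}) ARP table dumps.
    """
    changes = []
    last_mac = None
    for timestamp, table in snapshots:
        mac = table.get(gateway_ip)
        if mac is None:
            continue  # gateway not in this snapshot, nothing to compare
        mac = mac.lower()
        if last_mac is not None and mac != last_mac:
            changes.append((timestamp, last_mac, mac))
        last_mac = mac
    return changes


# Hypothetical snapshots: .1 flips to an unfamiliar MAC and back again.
snapshots = [
    ("10:00", {"192.0.2.1": "00:11:22:33:44:55"}),
    ("10:05", {"192.0.2.1": "00:AA:BB:CC:DD:EE"}),
    ("10:10", {"192.0.2.1": "00:11:22:33:44:55"}),
]

for change in gateway_mac_changes(snapshots, "192.0.2.1"):
    print(change)
```

An unfamiliar MAC in the output is the one to look up by vendor prefix, which is how the rogue WAP gave itself away here.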