Whats the biggest screw up you have been a part of at work?

--chris-- · June 2014

I am going to leave details out because this might make national news, it just occurred and its too soon to tell if it was an attack or a "oops".

The issue:
Every device in an organization was re-imaged instantly (device count in the 1000's). Every device.

They are paging every tech they employ in a three states to assist in resolving this issue.

I think this will be one of the biggest melt downs I might be involved in (the resolve process not the "oops process).

Anyone else have a similar experience? How was the problem solved? How long did it take?

yzT · June 2014

Setting up modsecurity I let the employment service of my government without service.. xD Almost couldn't disable modsecurity due to the heavy load.

shodown · June 2014

I've brought down quite a few call centers in my day. The largest has prob been about 1000 users on a team of guys I've been working with. The largest on my own was close to 400.

jvrlopez · June 2014

Unplugging a live server while rewiring the racks. Whoops.

kohr-ah · June 2014

Accidently clicked failover on a Cisco NAC unit and the secondary didn't have valid certificates.
I knocked 1800 users offline and to make it better no one knew the login info for the secondary device so I could fail it back.

jvrlopez · June 2014

Here are some more I can remember:

1) Breaking a safe's handle off. The handle needed oil to function smoothly but we never got around to oiling it. I decided to stand on top of it in an effort to force it open. When that didn't work, I ended up kicking it and then jumping on it. I broke it right off. We ended up having to call the only locksmith on the island to come out and save us (time sensitive material was inside).

2) Changing a safe's combination and leaving the change key plugged in while I closed and locked it. This meant the changed combination took but wasn't confirmed (meaning I couldn't open it). I figured the change key could be jostled out so I got about 5 guys to help me rock the safe back and forth in an effort to dislodge it. It wasn't working so we were planning on putting it on a dolly and dropping it as well as throwing it down some stairs (2). This was a 500 lb safe mind you...we ended up making so much noise from rocking the safe back and forth that someone came by to see what was going on. By pure luck, he had encountered the same situation before a few years ago. He said they took a hammer to the face of the dial and ended up destroying the X-07 lock. He also said their local locksmith came by and mentioned that if you just left the combo dial alone for 5 minutes, it would default back to the original combination and you could open it again. Problem solved.

3) My supervisor disabled the networking interfaces on a server that used networking to boot up.

4) Not applying reboot settings correctly on about 20 patches that I pushed on the network. So instead of 1 reboot with 20 patches applied, every user got 20 reboots. Complete work stoppage on a Monday and leadership was not happy.

5) I was inventorying servers/blades in racks while some contractors had a bare drive with a SATA to USB controller attached and were using it to pull down some critical data. I figured I could step over it with no problem. Well my right foot made it across fine, my left caught the USB cable and pulled the drive down from about 5 feet up. Their interface immediately gave an error and the drive was totaled. About 3 hours of data transfer squashed. This was about 15 minutes before they planned on going home so they were not happy at all.

gorebrush · June 2014

I upgraded a server once by upgrading, in place, Windows 2000 to 2003. I'd done the same thing the day previously and it worked absolutely fine. However, on the second day I did it on a slightly different server, it was one generation older.

So, the second day, feeling quite pleased with myself after a good first day work (This was over a weekend) I plop the disk in the drive and off we go. Start installing Windows and next thing I know, the box is stuck in a reboot loop.

Not good. It transpired that for the older generation machine, I *needed* the SmartStart CD (HP boxes) and from there then I had to rebuild the server from scratch.

Oh, did I mention that this server was also a SQL Server. Our only SQL Server. The one that kinda ran the ERP system. ERP that y'know, ran the company.

I defecated large bricks that day but after persevering I managed to recover everything (I was pretty picky with backups, they always worked, and I had a backup from the Friday night)

I think the only thing we had to do on the Monday morning was recreate everyones logins on the SQL Database because the ERP used SQL Authentication, and I think everyone had to re-input about a days work.

I got away with it luckily but boy that was an embarrasing day. Paid off in the end though, the aim was to upgrade Exchange (The SQL Server was a backup exchange server also) - we had Exchange 2000 - so public folder size was our achilles.

DevilWAH · June 2014

shodown wrote: »

I've brought down quite a few call centers in my day. The largest has prob been about 1000 users on a team of guys I've been working with. The largest on my own was close to 400.

its not about the size of the screw up, its about what you learn so that is does not happen again...

also whats the difference between a screw up and an operational error? For example I set up HSRP back in the day and set the hello interval to 100msec (as confirmed by a consultatnt and non where in the documentation does it say you should not.) Worked fine for weeks untill a large back up job ran and the HSRP active and fail over starting flapping and brought down the network. Took the whole site down for about 20 min while I worked out what was up and fixing it.

Also had the same with 3com - Cisco ether channels. set them up and all working fine, but then site had a power cycle and when every thing was coming back up about half the CISCO - 3COM trunks threw errors and the Cisco error disabled them. Thank fully that one I have all access switches set to release error disabled ports after 15min, so after re-configuring the ether-channels it all came back. Strange one though that you can configure the channels and they work fine untill the reboot and loads of errors as the form..

Apart from a very early mistake when i cross patched a live network in to a development network (same IP range and AD domin) that brought the whole site down, I don't think I have ever done some thing that looking back on I messed up. Times when some thing has broke, but even with all the testing in the world until you apply it to the live network you can never be sure.

--chris-- · June 2014

This has turned out to be a great little thread. Its nice to see just about every experience level runs into something like this every now and then.

My personal biggest oops so far has been following the direction of upper management. It was a Monday morning, bright and early. She walked into our groups office and introduced us to a contractor that was working on people soft. He was going to be here for 4 months and needed access to some shares, printers and the intranet. She instructed me to add him to domain, and he will use his network credentials to get into the nescessary server resources.

Well that's a big no-no. It was his personal device that she wanted me to put on the network. The system denied access, and because it was migrated over to the domain (and the user didn't know his credentials for the laptop) we were unable to log in to get it back into a workgroup. I ended up using a linux boot CD to browse to the C:\My Docs\ and look for a user profile, which he then remembered and was able to get me logged in.

That little no no ran up the chain and worked its way back down to me. I was about to get it "pretty good" when my manager stepped in and pointed out that our management requested this. The heat was diverted to the correct person and I was informed of the policy I breached.

wastedtime · June 2014

I had a pretty big one that I lucked out on. The windows network I was working on had 802.1x using TLS running and there was a misconfiguration of it from higher up than me. I went through and tried to figure out how to fix it myself. I thought I had found a fix and it seemed to work on the one computer I was at. So I did it to the rest of the OU. Well for one reason or another we restarted that one machine that I had tested it on and.....it couldn't connect to the network, 802.1x had failed. Now that I had done it to about 300 machines I was a bit worried. Luckily for me as long as the machines didn't get restarted fairly soon after it wouldn't happen. What I had did was modified a registery key that deleted the local machine certificate. I believe the auto certificate renewal restored that key when it would check on it (not sure of why). Needless to say that one machine was the only one down and I eventually did find a fix. Turns out that they had a few registry settings wrong. When a user was logged in it wouldn't try to restart network authentication and it was also left trying to authenticate with a user certificate instead of a computer certificate.

J_86 · June 2014

Years ago when I was green at my first IT job I was moving some equipment and created a broadcast storm once. Luckily I quickly figured out what happen when all of the users around me starting saying they couldn't do anything on their computers . Everything was on the same VLAN at this place, so it stopped connections to servers and everything in the data center

I locked out the root account of some security cameras we had at one place I worked. There just happen to be a problem in that version of firmware that once the account was locked, it would not unlock in 15 minutes like it was supposed to. The only fix was to hard reset the cameras. The problem with that was they were 40 feet in air, on the warehouse ceiling. Had to rent a scissor lift and go around to each camera and reset it. I was the only tech, took me 3 days to fix.

Deleted bunch of user data once at the direction of a manger. Turns out it was not old like it said

. I had to restore everything from a tape backup, after driving 4 hours to go get it from on offsite storage location.

Part of a team and we upgraded IOS versions on all our switches one night after testing one location a few week prior. Unknown to us bug caused issues and brought down 3 distribution centers for about 3 hours.

Cut what we thought was "old" analog phone cabling, which turned out to be in use by the fire alarm system. When I cut it, the fire alarm went off and the fire department came.

Learning is fun

loxleynew · June 2014

This was a tale of a co worker who did a major oops. He was trying to upgrade an ESX box form 4.1 to 5.1 and thought to himself why not do this during the middle of a working day? Vmotion works fine right?? Ok so he started v motioning off about 10 or so servers to another ESX host and in the middle of this the server threw an error and it crashed. When it came back up half of the servers and their data were corrupt and could not be recovered... Bad news about this was they were mostly all database servers so big problem. About 1000 people used these servers daily. another problem was he had the backups set for during the middle of the day. Guess what happened? as they were crashing it was backing up the servers so it backed up corrupt data. You guessed it... All the most recent backups could not be used.

What ended up happening was we lost about 1 days worth of data which was huge as it took about a week to re-create that data manually. He also tried to cover himself by deleting logs on the ESX host and backup server so it took us about 2-3 hours to figure out what had happened. Luckily we restored some of those logs and figured it out. He denied it was him even though he was the only one at the time with admin rights on that ESX host.

Now a tale of my major oops. back in the day I was working with exchange 2000. I installed AV on our exchange server and forgot to add exceptions to certain folders. It quarantined all the logs on the exchange server and then when the screen popped up "do you want to delete the quarantined files" I of course was like yea sure lets do it. So bye bye logs. It took me like a day to recreate the logs while everyone's email was down. Not a good day to say the least.

CodeBlox · June 2014

Accidentally rebooted all servers in our domain...

Mishra · June 2014

I reset a mortgage company's entire work on a Friday...

They had one central SQL server that stored all of their financial data. It was a 2 node, 1 shared storage situation.

One node died early in the month and needed to be rebuilt.

I inserted a CD into this dead SQL node and started through the installation process. I got to the partitioning screen and saw a 70GB volume and 500GB volume. I scrolled up and down, then scrolled up up up up to see what the highlight was really highlighting. I did this for a good 2-5 minutes to make SURE I knew what partition I was selecting.... So I deployed Windows to the 70GB volume.

Hours later the entire org is down and I'm on the phone with Microsoft to understand why the 500GB partition is not available. They said the partition table was removed for some reason..... We did this on a Friday (thankfully planned it that way) So we had to restore from backup and all the bankers had to replay information they did on Friday over the weekend or on Monday. I still have no idea what happened. I'm POSITIVE I selected the 70GB volume.

I will unplug the SAN next time I ever need to do this.

jibbajabba · June 2014

Seen that before where the OS drive was technically drive 1 - so it wrote the MBR on drive 0 - no matter what I have chosen during the install (can't remember if that was 2003 or 2008 already).

Since then - out of habit, I usually disconnect discs I am not supposed to use lol ... What always scares me is if I have to reinstall ESXi and I am not able to mask off the LUNs .. "Is this REALLY the USB drive .. is it .. IS IT .. BUT IS / WAS IT THOUGH ? .. click next .. click back .. IS IT STILL THE USB ???????"

Mishra · June 2014

jibba,

Weird thing was that the OS did get done with it's install and it was on the 70GB partition.

I definitely will disconnect any disks I don't want to use next time.

DevilWAH · June 2014

CodeBlox wrote: »

Accidentally rebooted all servers in our domain...

Mate at work run a "test" network build for desktops, that was meant to go out find a machine and rebuild it with a clean desktop image.. Sadly he forgot to tick a back and started a rebuild on every machine on the next work. servers and desktops

thank fully he had set it only to run on up to 20 machines at a time and it only wiped about 4 servers and 10 PC's before he cancelled it. What was lucky is my PC was one of the first it did so I got to ask him what was going on straight away so he could stop it. Could have been very messy..

fredrikjj · June 2014

Pasting "unconfigure switch all / yes" into the wrong Extreme Networks switch. If I remember correctly, and it was a while since I used Extreme, it erased the entire running config without even requiring a reboot.

SteveLord · June 2014

I recently pasted SQL code in to the wrong procedure in our PRODUCTION database. Everyone complained about how my "minor fix" totally made everything worse. Couldn't believe what I thought I did was so bad. Then when I pulled it all up, I realized what I did. Oops!

deth1k · June 2014

--chris-- wrote: »

I am going to leave details out because this might make national news, it just occurred and its too soon to tell if it was an attack or a "oops".

The issue:
Every device in an organization was re-imaged instantly (device count in the 1000's). Every device.

They are paging every tech they employ in a three states to assist in resolving this issue.

I think this will be one of the biggest melt downs I might be involved in (the resolve process not the "oops process).

Anyone else have a similar experience? How was the problem solved? How long did it take?

[h=1]Emory University by chance?

[/h]

xiny · June 2014

Week 1: i had to upgrade all my branches to ShoreTel VoIP phones.
Week 2: Ran Daily Nessus scans to prepare for an upcoming Audit
Week 3: Couldn't figure out for the life of me on why all my VoIP phones kept crashing every single day.
Week 4: Finally made the connection between VoIP Phones and Nessus Scans.....

instant000 · June 2014

deth1k wrote: »

Emory University by chance?

Oops! Emory University accidentally tells all of its Windows PC to format themselves | WinBeta

CodeBlox · June 2014

Funny thing... I almost made the exact same mistake as the OP during my first week as a Systems Admin. Thank goodness I had not distributed the package for the image files to the distro points. The task sequence could not resolve the package so it failed. We got a few calls about it but once I realized what I had done I sorted it out. That was nearly 2 years ago as a noob lol.

colemic · June 2014

set off a fire alarm in a SCIF. using a heat gun to remove TS/SCI stickers off decommissioned equipment, can't scrape them off. the trick is to heat them until they bubble, then they come right off... I did one, it gave off a curly-q of smoke, and I watched it go right up to the alarm, LOL. Had to sprint to the Ops floor to explain what happened before they emptied a 1600-person building.

In Afghanistan, rebooted a classified chat server, after installing patches. Problem is, it was in use, and was the server soldiers use to coordinate medevacs when they had casualties... and I rebooted it in the middle of a mission to get an injured soldier out. wasn't entirely my fault, (wrong background on the desktop that would have indicated its criticality) but it really upset me.

Also in AFG, wasn't me but several thousand NIPR machines had an SCCM reimage command issued to them, with no profile backup.Supposed to have been to a test OU but went to all of camp eggers. They pulled techs from all over the country for a few weeks to get everything back to normal.

jvrlopez · June 2014

Blow dryers/heat guns were a savior when it came to removing those tamper proof classification stickers. Had to go through an entire squadron once and remove a certain sticker.

--chris-- · June 2014

deth1k wrote: »

Emory University by chance?

No, but how bizarre! This must happen frequently if a published report of one was as recent as May this year!

This never hit the news I am guessing because it was only a major PITA (they are still recruiting techs to go out there and help resolve this) and never completely effected the organizations ability to do business.

The technical bits I was told today: They use SCCM and meant to image a few devices and accidentally selected the wrong group and re-imaged everything. Because of the encryption however, the re-image failed before it was able to format the disks. Because of this, they are able to restore the devices to the original state (user data and all) by decrypting the disk and repairing the MBR.

This process take 4-6 hours per device, and with 5100 devices...well its going to be awhile.

My team lead said he did something similar a few years back at another place. He sent a command via SMS to install a certain app that requires some funky HD settings that made a couple dozen computers get "weird" lol.

networker050184 · June 2014

Working as a backbone engineer at a fairly large service provider I've written changes that have left hundreds of thousands of enterprises with no or degraded service. Mistakes happen!

fredrikjj · June 2014

networker050184 wrote: »

Working as a backbone engineer at a fairly large service provider I've written changes that have left hundreds of thousands of enterprises with no or degraded service. Mistakes happen!

I'd love to hear more about this.

DevilWAH · June 2014

networker050184 wrote: »

Working as a backbone engineer at a fairly large service provider I've written changes that have left hundreds of thousands of enterprises with no or degraded service. Mistakes happen!

Made a metastases like that at my last place and you would be in front of the directors. If they found out you had not followed testing and implementation process or been careless then you where out the door. IF it was an honest mistake you would be expects to write up a report on what occured and write in new processes to insure no one else could do it again.

I don't agree either that mistakes happen, they only happen if you let them, if you do enough testing, plan changes well, and take care implementing them, you should never be in a position when you cause outages.

I managed a global change management team for a large company and we never accepted and excuse of "mistakes happen". Our clients expected a top quality service an if we caused an outage I was the one who had to sit in front of a panel of suits and explain why they had lost millions. "Oh well Mestakes happen" was not going to go down well. A few people tried it though but did not stay around long.

Kelkin · June 2014

DevilWAH wrote: »

I don't agree either that mistakes happen, they only happen if you let them, if you do enough testing, plan changes well, and take care implementing them, you should never be in a position when you cause outages.

I dont know if I buy that either.. Humans make mistakes. I am not saying that we should make light of it. We should do all the prep work required to do our best NOT to make a mistake but again we are Human and bound to make a mistake from time to time..

DevilWAH · June 2014

OK let me clarify.

Case one. Engineer configured BPDU on a core device, due to a bug in the IOS code this triggers a broadcast storm and brings down every core switch in the data centre. This was not seen in testing as the utilisation of the switches/numbers/arrangement was a subset of the full data center and the bug did not raise its head.

Case two. An engineer has been asked to upgrade core switches with a new IOS, has about 50 to do, Switches are arranged in a in to redundant pairs and cells. Plan says only upgrade one cell at a time, and never both in a redundant pair together. This insures that there is always live services running. So engineers try's to make things efficient by having multiply connections open, and uploading the code to the next lot of switches while he is rebooting the last one. Accidentally reboots the two switches he has just wiped the IOS from before uploading and not the two that he has completed. And even worse a pair of redundant switches. Switches don't come up as you would expect and the whole cell goes down......

The first case is not a mistake or a screw-up. nothing in the documentation or testing could have foreseen the issue with the bug. Just one of the things we have to deal with as IT engineers and why we put in redundant systems to try to avoid downtime when we come across them.

Second case is a mistake, and in my experience, including when I have made them, they are caused but not paying attention to detail and often when trying to rush things. A mistake is some thing you look back on and think.. I could have prevented that / I should have know better / that should not have been allowed to happen.

Yes mistakes will happen, but in my view they are never acceptable and mistakes is not the correct term. Carelessness, Rushing, lack of planning, Stupidity, lack of professionalism, lack of respect for clients/users, loss of concentration. Account for about 99% of so called mistakes and none of them are valid excuses. You can put simple steps in place that if people follow will prevent all of these from affecting service delivery.

Mistakes don't just happen, they are caused by some one. to say "mistakes happen" suggests that some one does exactly the same thing 100 times and one of these times a "mistake happens". But it did not just happen, the person must that time have done some thing different.

Whats the biggest screw up you have been a part of at work?

Comments