When IT Changes Go Horribly Wrong!!

As an IT professional working in IT Service Management, reading articles like the one below makes me wonder: how can this happen in this day and age?

Natwest glitch: RBS chief Stephen Hester faces pressure to explain fault to public - Telegraph

Some extracts

The technical problems were triggered by a software update late on Tuesday, which caused the bank’s computer system to fail. As a result, payments going in or out of accounts overnight were not processed, causing a huge backlog. The outage also created some technical instability in the system which exacerbated the problem, sources disclosed.

Any thoughts?

Comments

tprice5

And that's why you run parallel environments: test and production.

shodown

For a small company, sometimes things do happen, but banks have big enough budgets to run test labs like the one described above. When I was working for a bank back in 2008, they ran a full-blown Call Manager/UCCE/Unity setup in a test lab with voice gateways and everything, so I see no reason why this bank doesn't have the same for its internal systems.

DevilWAH

I have had it happen to me, and have seen it happen in a very large environment where lots of time and money had been spent on testing.

In my case I was setting up HSRP (high-availability router failover). Because we have storage running over the network, I wanted to keep the keepalive interval to a minimum, so I had it set to 100 msec. In the test environment, which was a replica of the live network, it ran perfectly. I stress-tested it, tried to break it, and it worked perfectly.

I put it in production, and 4 days in it ground to a halt: the bursty nature of user traffic was stopping the keepalives, and because of the low keepalive times the system went into a nasty feedback loop.
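A rough way to see why short timers and bursty traffic mix badly is a toy simulation. This is only a sketch, not how HSRP is implemented: `count_failovers` is a hypothetical helper, the congestion bursts are made-up numbers, and the 300 msec hold time is an assumption (the post only gives the 100 msec keepalive). The 3 second hello and 10 second hold values are HSRP's usual defaults.

```python
def count_failovers(bursts, hello_ms, hold_ms, duration_ms):
    """Count how often a standby router would declare the active router dead.

    bursts: list of (start_ms, end_ms) windows in which hellos are lost
            to congestion (illustrative values, not measured data).
    hello_ms: interval between hellos; hold_ms: silence tolerated before failover.
    """
    failovers = 0
    last_heard = 0             # time the last hello got through
    peer_considered_up = True
    for t in range(0, duration_ms + 1, hello_ms):
        dropped = any(start <= t < end for start, end in bursts)
        if not dropped:
            last_heard = t
            peer_considered_up = True
        elif peer_considered_up and t - last_heard >= hold_ms:
            # Hold timer expired: the standby takes over (a "flap").
            failovers += 1
            peer_considered_up = False
    return failovers


# Three short bursts of congestion in a 10 second window (hypothetical).
bursts = [(1000, 1500), (4000, 4400), (6000, 6350)]
aggressive = count_failovers(bursts, hello_ms=100, hold_ms=300, duration_ms=10_000)
relaxed = count_failovers(bursts, hello_ms=3000, hold_ms=10_000, duration_ms=10_000)
print(aggressive, relaxed)
```

With these made-up bursts the aggressive timers fail over three times and the default-style timers not at all. The sketch leaves out the feedback loop itself, where each failover adds more disruption and so more lost keepalives.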

A customer of ours also had a test network and spent months testing in dev labs, plus a 6-month rollout of new Cisco IOS versions to all their devices. Months after they started, an issue with how the devices handled BPDU packets caused an entire data centre to go down when a standard change was made to a single device (a change that had been made to dozens of other devices before without issue).

The fact is, no matter how much money or time you put into testing, you can never exactly replicate the conditions of your live network, and your testing can never cover every possible thing that might happen. It's the reason we get bugs in programs and these kinds of unplanned outages.

Test environments do not mean that there will never be problems with deploying to a live system; they simply reduce the chances. IT is a mixed-up world: sometimes a major C*&K up can pass unnoticed, while the most minor of glitches can take out a building.

I would love to know what actually happened, but let's not judge their change management until we know the facts.

YFZblu

DevilWAH wrote: »

I have had it happen to me, and have seen it happen in a very large environment where lots of time and money had been spent on testing.

This. It is far too convenient to look at a situation from the outside and say "blerg, they should have tested it". Oftentimes testing was done.

pumbaa_g

Guys, the industry average for successful changes is 98%. Most change/release and deployment managers I know will not go for the big bang approach, as it involves this kind of risk. Most will run a pilot, then a phased rollout across all CIs.
How can you play Russian roulette with something like this?
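The pilot-then-phases idea can be sketched in a few lines. Everything here is hypothetical and for illustration only: `plan_phases`, `phased_rollout`, the device names and the group sizes are assumptions, and a real release tool would also handle approvals and rollback.

```python
from typing import Callable, List, Sequence, Tuple


def plan_phases(cis: Sequence[str], pilot_size: int, phase_size: int) -> List[List[str]]:
    """Split the CIs into a small pilot group followed by fixed-size phases."""
    phases = [list(cis[:pilot_size])]
    rest = cis[pilot_size:]
    for i in range(0, len(rest), phase_size):
        phases.append(list(rest[i:i + phase_size]))
    return [p for p in phases if p]


def phased_rollout(
    cis: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    pilot_size: int = 1,
    phase_size: int = 3,
) -> Tuple[List[str], List[str]]:
    """Apply the change phase by phase; stop at the first phase with a failure.

    Returns (CIs changed so far, unhealthy CIs in the phase that halted the rollout).
    """
    changed: List[str] = []
    for phase in plan_phases(cis, pilot_size, phase_size):
        for ci in phase:
            apply_change(ci)
            changed.append(ci)
        unhealthy = [ci for ci in phase if not is_healthy(ci)]
        if unhealthy:
            return changed, unhealthy  # later phases are never touched
    return changed, []


# Hypothetical run: eight switches, and the change misbehaves only on sw5.
touched: List[str] = []
cis = [f"sw{i}" for i in range(1, 9)]
changed, unhealthy = phased_rollout(cis, touched.append, lambda ci: ci != "sw5")
print(changed, unhealthy)
```

In this run the rollout halts after the phase containing sw5, so sw8 is never changed. A pilot only limits the damage; it does not catch a fault that shows up on a later device or days afterwards.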

Either way, I feel sorry for the IT guys; they must have been flogged at the crossroads.

atorven

pumbaa_g - Where do you get the 98% success rate from?
It's funny how the general public are putting the blame on the IT guys merely because they are outsourced, like local guys would have done things differently.

People need to understand that despite your best efforts, **** happens.

pumbaa_g

This is the average industry change management success rate across multiple providers. Please understand that a change manager deals with at least 300 changes a month, and this is apart from the pre-approved changes, which are considered low risk, so the numbers are hardly surprising.
I saw the outsourcing part, and I believe what people really need to understand is that if you pay peanuts you get monkeys. In the end you get what you pay for, in my opinion; cut costs anywhere and you will face the same issue sooner or later.
I don't buy the argument of local vs outsourced; minimum wage is the same everywhere.
It's like playing Russian roulette.
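For scale, the two figures quoted in this post can be combined in a quick calculation. It takes the 98% success rate and the 300 changes a month at face value (both are the poster's numbers, not sourced here), and `expected_failed_changes` is just an illustrative helper.

```python
def expected_failed_changes(changes_per_month: int, success_rate: float) -> float:
    """Expected number of unsuccessful changes per month at a given success rate."""
    return changes_per_month * (1.0 - success_rate)


# Figures as quoted in the thread: 300 changes a month, 98% successful.
failed = expected_failed_changes(300, 0.98)
print(f"about {failed:.0f} failed changes a month")
```

That works out to roughly six unsuccessful changes a month for a single change manager, which says nothing about how severe any one of them is.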

Roguetadhg

I would think local IT management could handle the situation better. Most "outsourced" companies don't keep fine-toothed records of what the build is, where things go, or configurations.

But when I read the article (on the 26th), the problem was already fixed, so IT's job was finished. The ongoing issue was the backlog of work that needed catching up on, and that is the bank's job.

Stuff happens. All we can do is try our best to weed out the bugs with testing and by trying to work the system in ways it wasn't designed for. Get 99% of the bugs, and it'll be that 1% that messes everything up.

It's always the IT person's fault.

DevilWAH

Roguetadhg wrote: »

I would think local IT management could handle the situation better. Most "outsourced" companies don't keep fine-toothed records of what the build is, where things go, or configurations.

Um, I would disagree with that (having worked as a change manager for a very large service provider). With the click of a button I could have told you the history of every single change on the devices we looked after, dating back years.

Any large company like a bank will have a detailed CMDB (configuration management database), and any decent change management company will link directly into it and update it as soon as changes are made.
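As a minimal sketch of what "change history per device at the click of a button" amounts to, assume changes are logged against a CI name as they happen. The `ChangeRecord` and `Cmdb` classes, the change IDs and the device names below are all hypothetical; a real CMDB product has a far richer data model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List


@dataclass
class ChangeRecord:
    change_id: str
    ci_name: str          # the configuration item (device) the change touched
    when: date
    summary: str
    successful: bool = True


@dataclass
class Cmdb:
    changes_by_ci: Dict[str, List[ChangeRecord]] = field(default_factory=dict)

    def log_change(self, record: ChangeRecord) -> None:
        """Record a change against its CI as soon as it is made."""
        self.changes_by_ci.setdefault(record.ci_name, []).append(record)

    def history(self, ci_name: str) -> List[ChangeRecord]:
        """Return every logged change for one CI, oldest first."""
        return sorted(self.changes_by_ci.get(ci_name, []), key=lambda r: r.when)


# Hypothetical entries, logged out of date order on purpose.
cmdb = Cmdb()
cmdb.log_change(ChangeRecord("CHG-2", "core-sw-01", date(2012, 6, 19), "IOS upgrade"))
cmdb.log_change(ChangeRecord("CHG-1", "core-sw-01", date(2011, 3, 2), "HSRP timers tuned"))
for record in cmdb.history("core-sw-01"):
    print(record.when, record.change_id, record.summary)
```

The point of keying everything by CI is that the lookup stays a single call no matter which team or supplier made the change.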

Indeed, for our support work, our help desk system would link into many companies' own systems, so any issues we dealt with were logged against the devices automatically.

So while I agree that for small companies it may be the case that records are not kept, this is not true for the likes of banks. Indeed, one bank I worked with had spent £50,000,000 on software licences alone for its change management system, and as much again on configuration and migration. When companies are spending that amount of money, they expect any companies they outsource to to look after it and keep it up to date.

pumbaa_g

Agree with DevilWAH. Change Management does not work like that; every CI has to be accounted for in the CMDB. Config Management has to ensure that every detail, from the time a CI comes into the system until it is phased out, is recorded. Any changes made are linked and can be pulled up within moments. Apart from Change Management, you also have Release and Deployment to ensure that when such releases are carried out in enterprise environments you don't get any surprises, or, even if you do (the 1%), the change can be rolled back and no one is hurt (risk assessment, anyone?).
However, having said that, I have seen a few companies who couldn't care less; they just want the service at the lowest cost possible. In this scenario, why blame the poor guys who made a mistake? Is it not the responsibility of the bank? If you were paying for a service, is it not your responsibility to ensure that you get the best value for money? I would say that they owe it to the customers.
If someone approached you on the street to sell you a Rolex for a few bucks, claiming it's the real deal, would you buy it? It's the same with the IT service industry: you get what you pay for.

higherho

Thank goodness we don't live in an age where the punishment for failure is severe! If something goes wrong, your PM might come up to you and say this:

In the name of "Program Manager" of the House of IT, the first of his name, King of the North, Lord of the Seven Kingdoms, and Protector of the Realm, I charge you to bring justice to the false IT professional pumbaa_g and all those who shared in his crimes. I denounce him and attaint him; I strip him of all ranks and titles, of all lands and holdings, and sentence him to death.

Send a raven to Casterly Rock. Inform the IT team leader that he has been summoned to court to answer for the crimes of his bannermen. He will arrive within the fortnight to answer for his crimes or be branded an enemy of the crown and a traitor to the realm.

In all seriousness, change management and a development environment are needed to minimize issues like the one in the main post. Still, we are human and cannot fix every issue prior to a software deployment, software install, etc. Nothing is ever 100% certain.