
Dealing with hardware errors in OS deployment

ptilsen Member Posts: 2,835 ■■■■■■■■■■
I'm currently working in SCCM 2007 R3 to build a deployment framework to do in-place, zero-touch upgrades of a couple thousand Windows XP systems using USMT task sequences. There have been lots of fun challenges, and some not-so-fun challenges, but I'm running into a serious concern as we're about to kick off upgrades for the first business unit.

I believe we're going to see a high failure rate on migrations due to failing hard drives in the XP machines. The machines vary in age, but the majority are from sometime between 2009 and 2011. Smaller percentages are from 2012 and 2008, with a select few from late 2007. What I am noticing a lot just with test systems is that a machine will appear to "work fine", but if chkdsk is run, it reports fairly serious errors almost always indicative of hard drive problems. If the task sequence starts, it will generally break sometime during the upgrade, resulting in an "NTLDR is missing" error or a completely corrupt boot sector or file system. This is no mystery to me; an operating system running without apparent symptoms is in no way mutually exclusive with hard drive failure, and a full in-place upgrade to Windows 7 is a very understandable trigger of the symptoms.

My concern is with how to respond from a planning perspective. I honestly think we're going to see a 5% failure rate, maybe 10%. I think it's going to be extraordinarily painful, even doing only 20 or 30 upgrades a day. We're talking about multiple hardware replacements a week when, from a business and end-user perspective, we're replacing systems that worked fine on XP, which invariably leads to the perception that the failures are our fault.

One thought I've had is to have the task sequence run chkdsk and stop based on the output. I think that would at least prevent the vast majority of errors, but we're behind schedule and I just know it will take a day or two to code and fully test it (it might only take 30 minutes to code, but testing is key).
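Something like this is what I have in mind -- just a rough sketch, and the exit-code handling is exactly the part I'd want to test properly before trusting it:

    @echo off
    rem Read-only chkdsk gate for the task sequence. No /f switch, so
    rem nothing on the volume is touched. chkdsk returns 0 when it finds
    rem no problems; anything else and we exit non-zero so the step fails
    rem and the upgrade never starts on that machine.
    chkdsk.exe %SystemDrive% > "%TEMP%\chkdsk_gate.log" 2>&1
    if not "%ERRORLEVEL%"=="0" (
        echo chkdsk reported problems on %SystemDrive% -- aborting upgrade >> "%TEMP%\chkdsk_gate.log"
        exit /b 1
    )
    exit /b 0

(One caveat I'd want to verify: a read-only chkdsk on a volume that's in use can apparently report false positives, so this may be better run from WinPE at the start of the sequence.)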

I'm curious if anyone else here has run into a similar problem and seen how it was handled. In some brief Googling I'm not seeing how others have tried to preempt these sorts of issues. I can't imagine I am the first (or the last) to run into this concern. It's happening a lot on the test systems (more than 10%), which are, on average, older than the production systems, but I can't imagine it will be uncommon in production. (I also have a theory that consumer hard drives have become less reliable over the last few years, but it is based on 100% anecdotal evidence.)

I know my scripting method will work, but it seems like an extreme response to a problem I'm not seeing a lot of web talk about.
Working B.S., Computer Science
Complete: 55/120 credits SPAN 201, LIT 100, ETHS 200, AP Lang, MATH 120, WRIT 231, ICS 140, MATH 215, ECON 202, ECON 201, ICS 141, MATH 210, LING 111, ICS 240
In progress: CLEP US GOV,
Next up: MATH 211, ECON 352, ICS 340

Comments

  • sratakhin Member Posts: 818
    That's why it's sometimes easier to just buy new computers :)

    Try running something like HDDScan on all machines you are planning to upgrade. If it shows that everything is good, schedule a chkdsk on next reboot and then do the upgrade. I'm not sure how to automate this process, but you could probably use login scripts or Group Policy.
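    Something like this might work as a computer startup script pushed out through Group Policy (it needs admin rights, so a per-user login script probably wouldn't cut it) -- just a rough, untested sketch:

        @echo off
        rem Set the NTFS dirty bit on the system drive so autochk runs a
        rem full chkdsk at the next reboot. Setting the bit when it is
        rem already set does no harm.
        fsutil dirty set %SystemDrive%

    The autochk output should end up in the Application event log after the reboot (Winlogon events on XP, if I remember right), which could then be collected and reviewed before scheduling the upgrade.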
  • elTorito Member Posts: 102
    Perhaps you're being a little bit too concerned? In our fat-client to VDI migration last year, we reimaged 650+ desktops (previously running Windows 2000 Professional) with a customized Windows 7 Embedded image. At least half of the desktops were systems purchased as far back as 2005. We had our fair share of issues with the migration to Windows 7, but failed installations due to bad hard drives were not among them. All in all, our desktop support team probably had to replace about 1% of the machines, if that.

    Are you able to plan a partial rollout, to test the waters, so to speak? Perhaps starting with the computers in a department that's less critical?

    Finally, I hardly think that a few failed installations are going to change the end users' perception of the IT team all that much ;)
    WIP: CISSP, MCSE Server Infrastructure
    Casual reading:
    CCNP, Windows Sysinternals Administrator's Reference, Network Warrior


  • ptilsen Member Posts: 2,835 ■■■■■■■■■■
    We won't be buying new computers. :) If only.

    I am comfortable with doing a read-only chkdsk as part of the task sequence, and that would be sufficient. Really, it would be fairly trivial but I don't want to lose another day to making sure it works.
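    If I do end up adding it, the plan would basically be a "Run Command Line" step at the very front of the sequence, something like the below, relying on the step failing on a non-zero return code as long as "Continue on error" is left unchecked (I'd still want to double-check which return codes count as success):

        rem Hypothetical command line for the step -- chkdsk with no /f is
        rem read-only and returns non-zero when it reports problems, which
        rem should fail the step and stop the sequence on that machine.
        cmd.exe /c chkdsk.exe C: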

    We will be rolling this out in fairly small groups, so I suppose it's much ado about nothing. If we see a lot of failures, we'll delay a bit while I integrate the chkdsk step.

    elTorito: You wouldn't think such a high failure rate would be expected, but there are a few things to consider. The organization has standardized on various Dell models over the last few years. If any of those used statistically likely-to-fail hard drives, we will see a higher failure rate than you did. There are other differences to consider. These machines are largely in manufacturing facilities, which get dusty (especially metal dust, in this case) and tend to have more failures. The users are also largely power users (AutoCAD), who tend to use local storage more than average, putting more wear on the drives. Windows 7 Embedded is also very different from Windows 7 Professional. We are deploying a larger OS image with more actions -- it will be a lot more I/O and a lot more storage use, meaning a much higher probability of experiencing the symptoms of bad sectors or other drive failure. Finally, going back to my completely unsubstantiated theory that they just don't make 'em like they used to, 2008-2010 era hard drives might actually be more likely to fail than 2005-2007 era drives.

    As far as perceptions go, the initial reaction to a high failure rate will likely be disbelief that it's the hardware, and an assumption that something is wrong with how we've done things. Really, the perception problem will be more with local IT management and the helpdesk than with end users. It will be totally unfounded, but knowing the people in question, it's what I expect.

    Ultimately, I think you're right -- we can afford to go through the first few dozen systems as a trial. They will be a fairly good representation of the failure rate we'll see, and we've already made the decision to set aside one spare PC for every five we upgrade.
    Working B.S., Computer Science
    Complete: 55/120 credits SPAN 201, LIT 100, ETHS 200, AP Lang, MATH 120, WRIT 231, ICS 140, MATH 215, ECON 202, ECON 201, ICS 141, MATH 210, LING 111, ICS 240
    In progress: CLEP US GOV,
    Next up: MATH 211, ECON 352, ICS 340