Dealing with hardware errors in OS deployment
I'm currently working in SCCM 2007 R3 to build a deployment framework to do in-place, zero-touch upgrades to a couple thousand Windows XP systems using USMT task sequences. There have been lots of fun challenges, and some not-so-fun challenges, but I'm running into a serious concern as we're about to kick-off upgrades on the first business unit.
I believe we're going to see a high failure rate of migrations due to failing hard drives on the XP machines. The machines vary in age, but the majority date from sometime between 2009 and 2011. Smaller percentages are from 2012 and 2008, with a select few from late 2007. What I'm noticing a lot with test systems is that a machine will "work fine," but if chkdsk is run, it reports fairly serious errors that are almost always indicative of a failing hard drive. If the task sequence starts, it will generally break sometime during the upgrade, resulting in an "NTLDR is missing" error or a completely corrupt boot sector or file system. This is no mystery to me; an operating system running without apparent symptoms is in no way mutually exclusive with hard drive failure, and a full in-place upgrade to Windows 7 is a very understandable trigger for the symptoms.
My concern is with how to respond from a planning perspective. I honestly think we're going to see a 5% failure rate, maybe 10%. I think it's going to be extraordinarily painful, even doing only 20 or 30 upgrades a day. We're talking about multiple replacements a week when, from a business and end-user perspective, we're replacing systems that worked fine on XP, which invariably leads to the perception that the failures are our fault.
One thought I've had is to have the task sequence run chkdsk and stop based on the output. I think that would prevent the vast majority of errors, but we're behind schedule, and I just know it will take a day or two to code and fully test (it might only take 30 minutes to code, but testing is key).
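For reference, the decision logic I have in mind is just a mapping from chkdsk's exit code to a go/no-go result, using the commonly documented return codes (0 = no errors found; 1 = errors found and fixed; 2 = disk cleanup performed or skipped; 3 = disk could not be checked, or errors could not be fixed). Something like this sketch — written in Python purely for illustration, since the real SCCM 2007 task sequence step would be a batch or VBScript wrapper:

```python
import subprocess

# chkdsk exit codes as commonly documented (worth verifying for XP):
#   0 - no errors found
#   1 - errors were found and fixed
#   2 - disk cleanup was performed, or was skipped because /f was omitted
#   3 - the disk could not be checked, or errors could not be fixed

def safe_to_upgrade(chkdsk_exit_code):
    """Return True only when chkdsk reported a clean file system."""
    return chkdsk_exit_code == 0

def preflight(drive="C:"):
    """Run a read-only chkdsk (no /f or /r) and decide whether the
    task sequence should continue. Illustrative only."""
    result = subprocess.run(["chkdsk", drive], capture_output=True, text=True)
    return safe_to_upgrade(result.returncode)
```

In the actual task sequence, the wrapper would exit nonzero (or set a task sequence variable) on anything other than a clean result, and a condition on the downstream upgrade steps would halt the migration before USMT ever runs.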
I'm curious whether anyone else here has been involved in a similar problem and seen how it was handled. In some brief Googling I'm not seeing how others have tried to preempt these sorts of issues, and I can't imagine I'm the first (or the last) to run into this concern. It's happening a lot on the test systems (more than 10%), which are, on average, older than the production machines, but I can't imagine it will be uncommon in production. (I also have a theory that consumer hard drives have become less reliable over the last few years, but it's based on 100% anecdotal evidence.)
I know my scripting method will work, but it seems like an extreme response to a problem I'm not seeing a lot of web talk about.