Alright, this one's going to hurt some heads...
At my work, we have a customer who needs some SQL Server 2000 databases backed up on a regular basis. He's currently configured his servers (each running Windows 2000 Server, I believe) to back up the databases via the SQL Server 2000 manager to a remote server that we provide to him as a backup location. (That server runs Windows Server 2003.) There are two facts that are key to this whole operation:
One is this, from our hardware engineer, regarding the backup server they rent from us:
This is MS112 server with one GB connection and 2x400GB SATA RAID1.
The RAID1 is a “chipset-integrated” inexpensive desktop solution that is not intended for DB operation.
Your assessment is correct – they should reduce the load on the disk and/or network system.
And the error the customer is seeing when his backups fail is the following:
log from 64.84.59.9 *************Log start
BackupDiskFile::RequestDurableMedia: failure on backup device '\\Nsnovps00521\d$\SQL_Backups\ResTrack_dyn_db_200704040100.BAK'. Operating system error 64(The specified network name is no longer available.).
****************Log end
log from 64.84.59.10 *************Log start
BackupDiskFile::RequestDurableMedia: failure on backup device '\\Nsnovps00521\d$\SQL_Backups\ResTrack_jacobs_db_200704040131.BAK'. Operating system error 64(The specified network name is no longer available.).
****************Log end
So, we're left with a couple of problems. As the databases are being backed up by the SQL backup utility, they are timing out. Sometimes the event logs on the remote server show that the write operation for a file only a few megabytes in size takes up to an hour and a half... and then times out. After much discussion, and with the customer insisting that our network is congested, we've come to the conclusion that the backup server we've provided can't handle having multiple files backed up to it at once, and the server being backed up from can't handle creating multiple SQL Server backups at the same time, for the very same reason: the hard drives simply aren't fast enough, and can't handle concurrent read/write operations of this scope.
We'd like the customer to back up the databases locally, staggering the backup jobs so that they don't cause the hard drive to crap out, and then have the .BAK files picked up by the Veritas NetBackup agent that also runs on the machine and backs up the rest of the customer's files. (Having NetBackup back up their machines regularly is pretty much standard operations at our company.)
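For what it's worth, "staggering" doesn't have to mean spreading the jobs across hours: driving all the backups from one script that runs them back-to-back guarantees only one backup stream ever hits the disk at a time, while still kicking off at the single scheduled time the customer insists on. Here's a minimal sketch of that idea; the database names, the backup directory, and the exact `osql` invocation are my assumptions for illustration, not the customer's actual configuration:

```python
# Sketch: run each SQL Server backup to completion before starting the
# next one, so the backup disk only ever sees one sequential write stream.
# Database names, BACKUP_DIR, and the osql command line are hypothetical.
import subprocess
import time

DATABASES = ["ResTrack_dyn_db", "ResTrack_jacobs_db"]  # hypothetical names
BACKUP_DIR = "D:\\SQL_Backups"  # back up to LOCAL disk, not the UNC share

def backup_command(db):
    """Build an osql call that backs up one database to local disk.
    osql ships with SQL Server 2000; -E uses Windows authentication."""
    stamp = time.strftime("%Y%m%d%H%M")
    sql = "BACKUP DATABASE [%s] TO DISK = N'%s\\%s_%s.BAK' WITH INIT" % (
        db, BACKUP_DIR, db, stamp)
    return ["osql", "-E", "-Q", sql]

def run_staggered(databases, runner=subprocess.run):
    """Run the backups strictly one at a time, in order."""
    for db in databases:
        runner(backup_command(db))  # blocks until this backup finishes
```

NetBackup can then sweep the finished .BAK files off the local disk on its normal schedule, which keeps SQL Server's durable-media requirement entirely on local storage instead of on an SMB share.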
The other solution would be for the client to purchase a faster server, either through us (and have us manage it) or through another vendor (and manage it themselves in our datacenter). The problem there, of course, is that it would cost a lot of money to get a high-end storage server with hard drives fast enough to handle a half-dozen two-hundred-megabyte files being written to it at any given time.
The question I have, after all that rambling, is this: does anyone know of any whitepapers, documentation, or recorded cases we could reference that would show this customer that what we've found is correct? They refuse to back up the databases locally, they claim they can't stagger the backups because they 'have to happen at once', and they adamantly refuse to consider a more powerful server to back their databases up to. We basically need some proof, something to show them that they're going to need to change something, since test after test has shown that it's not the network that's lagging: the hard drives simply can't keep up with the amount of data streaming in, especially when that data is SQL Server backups, which are heavily error-checked and come from databases that are, in effect, still being written to and queried as they're being backed up.
Any links to recommendations by Microsoft to this effect, recommendations by hardware vendors, or simply links and pages that show that this has happened before, would be helpful. Any other solutions to this problem would probably help, too, in case there is some 'gotcha' that our team doesn't see.