EMC - iSCSI corruption
Claymoore
Member Posts: 1,637
This is going to be a bit of a rant, but hopefully I can help some other EMC customers out there.
If you have a Celerra or integrated NS SAN with a 10Gb network link, you are probably going to have to replace the NIC. (EMC Primus documentation to follow)
History:
We purchased an NS82 (a Clariion CX380 with an integrated Celerra) in January. This model has no fibre connectivity (other than for a tape drive), so everything is connected through CIFS/NFS or iSCSI. Each datamover has one 10Gb link and six 1Gb links, so we use the 10Gb link as our primary iSCSI path and four of the 1Gb links bundled into an etherchannel as a secondary path, with each path running through a separate Cisco 4948 switch. Due to a series of unfortunate events in connecting our HPUX servers to the new SAN, I didn't get around to cutting over our Exchange 2003 server until July, and that's when the real fun began.
Problem:
The message stores were corrupted sometime after being moved to the new SAN. I loaded the corrupted stores into the Recovery Storage Group one by one and moved mailboxes to new stores on the new SAN. The stores were corrupted again almost immediately. We involved Microsoft, who told us that the kind of corruption we were seeing is caused by hardware 100% of the time (they were later proven right), so mailboxes were moved back to stores on the old HP XP512 SAN. This entire process lasted almost a week with limited or no mail function, so as the Exchange admin I was taking a beating during the first week of August. The old server had only three expansion slots, two of which were filled with FC HBAs, so it had just one PCI-X TOE NIC connected to the new SAN over the 10Gb path. That server was 5 years old and I had been trying to replace it, and suddenly there was enough money in the budget for a new Exchange server. I built a new server with 2 PCIe TOE NICs, connected it to the iSCSI SAN over both paths using the MCS load-balance policy of least queue depth, and prepared to move the mailboxes to new message stores.
I set up test mailboxes in each store - everything was fine.
I moved the IT message store in September and we tested for a week - all good.
I moved 3 of the remaining stores in one day - still good.
We kicked off a backup at 9:06 - the event log showed message store corruption at 9:07.
I moved all the mailboxes back to the old server the next day.
EMC had been saying everything was fine all along, but by now I was sure the problem was theirs. An EMC engineer told me that they do not support MCS load balancing - just failover - and that this was causing the problem. I asked for the EMC documentation on this (which was never provided, because it isn't true) and gave him a link to the MS iSCSI User Guide that says it is supported. Besides, the original server had only one connection, so neither MPIO nor MCS was involved during the first corruption. EMC then tried to blame our backup software - HP Data Protector - except it backed up fine over FC and will even back up a single message store without error. EMC wasn't being very helpful until they found out we were bringing in Hitachi to investigate a competitive upgrade; now we have a PM and a project team assigned to resolve the issue.
I found that by copying one of the large corrupted message stores to a different location on the same drive I could generate enough I/O to recreate the error, so now I had a way to test. The team includes an engineer who actually knows what he is doing, and we started testing this week. What we found is that the corruption occurs when data is moving over the 10Gb link - either alone or in a load-balanced configuration - but not when the data is moving over the etherchannel (4Gb) path by itself. He captured some network traffic, I sent him the errors from the event log, and that data went on to another team. We got a response the next day.
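Before getting to their answer: for anyone who wants to try this kind of copy-and-verify check without risking a production store, here is roughly what my test boils down to. This is only a minimal sketch - the paths are hypothetical and you would point it at a large file sitting on the iSCSI-mounted volume - but the idea is simply to generate sustained I/O and compare checksums; if the hashes ever differ, something between the host and the array is mangling data in flight.

import hashlib
import shutil
import sys

def sha256_of(path, chunk_size=1024 * 1024):
    # Hash the file in 1 MB chunks so a multi-GB message store copy doesn't exhaust RAM.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()

def copy_and_verify(src, dst, passes=5):
    # Repeatedly copy src to dst on the iSCSI-mounted volume and compare checksums.
    expected = sha256_of(src)
    for i in range(1, passes + 1):
        shutil.copyfile(src, dst)   # sustained sequential I/O over the path under test
        actual = sha256_of(dst)
        ok = actual == expected
        print("pass %d: %s" % (i, "OK" if ok else "MISMATCH - possible corruption"))
        if not ok:
            return False
    return True

if __name__ == "__main__":
    # Hypothetical paths on the iSCSI-mounted drive; adjust for your environment.
    ok = copy_and_verify(r"E:\TestData\store_copy.edb", r"E:\TestData\scratch.edb")
    sys.exit(0 if ok else 1)

Copying one of the large stores this way over the 10Gb path was enough to trip the errors for me; the same copy over the etherchannel path ran clean.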
Their answer: the 10Gb Neterion NIC in the datamover is passing corrupted TCP segments up the stack, where they may or may not be caught. iSCSI has error checking built into the protocol, so it sees the errors and rejects the PDUs, resulting in iScsiPrt errors in the event log - Event ID 7 (initiator could not send a PDU) and Event ID 29 (target rejected a PDU). The CRC check in Celerra Replicator will also generate errors, but we only run iSCSI over the 10Gb link. CIFS and NFS traffic isn't checked, so the corrupted data is simply written to disk.
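To make the difference concrete, here is a small, self-contained sketch of why the iSCSI path complains while the CIFS/NFS path quietly writes the bad data. It is not the Celerra's or the initiator's actual code - just an illustration using CRC32C, the digest algorithm iSCSI can negotiate for its PDUs, applied to a payload with a single flipped bit; the unchecked call models a protocol that carries no payload digest at all.

def crc32c(data):
    # Bitwise CRC32C (Castagnoli), the checksum iSCSI uses for its optional
    # header and data digests. Slow but dependency-free; fine for illustration.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def deliver(payload, digest=None):
    # With a digest, a mismatch means the PDU is rejected (think iScsiPrt Event ID 29).
    # With no digest, whatever arrived - corrupt or not - is accepted and written.
    if digest is not None and crc32c(payload) != digest:
        raise ValueError("PDU rejected: data digest mismatch")
    return payload

if __name__ == "__main__":
    clean = b"mailbox database page"
    sent_digest = crc32c(clean)

    corrupted = bytearray(clean)
    corrupted[5] ^= 0x40            # one bit flipped in flight by the bad NIC

    # CIFS/NFS-style path: no payload digest, the bad data is accepted as-is.
    print("no digest:  accepted ->", deliver(bytes(corrupted)))

    # iSCSI-style path: digest mismatch, the PDU is rejected instead of written.
    try:
        deliver(bytes(corrupted), sent_digest)
    except ValueError as err:
        print("with digest:", err)

Note that iSCSI header and data digests are negotiated between initiator and target, so whether every PDU actually gets checked depends on the session settings.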
The NIC isn't just passing the bad traffic along - it is the source of the corruption!
Solution:
The permanent fix is to replace the NIC, and EMC is working on a plan to address this with the affected customers. In the meantime you can use an alternate network path and avoid the 10Gb link. You can also ask to be patched to the latest NAS code, but that has a downside: the NIC will still generate or pass corrupted TCP segments, but the Celerra will now process all the traffic itself rather than relying on the 10Gb NIC's TCP Offload Engine, resulting in increased processor usage and decreased performance while it handles all the retransmit requests for the corrupt segments.
Putting a bad NIC in a SAN is a big deal. The fact that this model was released a year ago, that we purchased it in January, and that the problem wasn't revealed to us until October is an even bigger deal - not to mention the time wasted and the reputation damage incurred while dealing with the fallout from EMC trying to cut a few dollars by buying a batch of NICs from some Shanghai street market. I hope the money saved was worth it to you, EMC, because it certainly wasn't to us.
Comments
astorrs Member Posts: 3,139
Ouch Claymoore, that was ugly. Far too often these days SAN vendors seem to require at least the threat of a competitor poaching their clients before they actually move, whether on deep discount pricing or on major engineering support issues like the one you experienced. Now that I've said that, and after a little further thought, Microsoft and Symantec are just as useless these days too...
I assume you're now running on one of the patched NAS code releases - what's the plan going forward? Just replace the 10Gb NIC when they have new ones? Are you looking for any compensation from EMC? (I don't know how much business you do with them.)
Claymoore Member Posts: 1,637
Yeah, astorrs, it's been a mess. My first step was to change the MCS settings on all my Windows servers to failover only, with the 4Gb etherchannel path as the active path. Next, an EMC CE will be on site this weekend to patch us to the current NAS code. Finally, new NICs - screened and tested beforehand - will be installed when they become available (no timetable yet).
One of the reasons we went with EMC over other vendors was the availability of a 10Gb connection for better throughput and later expansion. I guess that was a mistake.
I definitely feel like EMC owes us something. They can't give me back the weekends and evenings I spent repairing the data corruption caused by their faulty equipment, nor can they compensate my users for all the downtime they experienced. EMC also failed to disclose the amount of storage space that was required for iSCSI replication (in some cases 3 times the size of the actual data - on each array!) which forced us to use all of our extra capacity just to support the data we have now, even though we thought we were buying enough to last us a few years. We're not a big company, and even if the $400k or so we spent isn't much to EMC, it sure is to us. Giving us a few fully-stocked DAEs as a peace offering would be a start.
This whole implementation experience has left me with a very low opinion of EMC. We even chose their professional services to perform the installation and they did a poor job - shoddy project management, poor communication, and little or no documentation. The instructions I received on setting up the MS iSCSI initiator software were a few pages of photocopied screenshots with handwritten notes. Fortunately, the MS iSCSI user guide is actually useful. An EMC engineer finally emailed me their best practice document for setting up Exchange with iSCSI - documentation that would have been handy several months ago.
This has also affected me personally. As the senior system engineer here, I am responsible for all the Windows servers, Exchange administration and network administration. I have had all of my abilities questioned for several months now. Management all the way up to the executive level has tried to push all the blame onto me, even going behind my back to vendors and asking them to 'check my work'. Everything was correct, of course, but it has really strained my relationships here at the office. I kept trying to blame this on EMC, and they would do a 'scan' and say everything was fine - until we were finally able to prove it wasn't. So, to paraphrase, I want to know what EMC knew and when they knew it.
They at least owe me that much.
Claymoore Member Posts: 1,637
It appears that HPUX 11i is not immune to the iSCSI corruption either - yesterday we couldn't mount the iSCSI LUNs on HPUX because they were too heavily corrupted. We have no concrete evidence that the LUN corruption was caused by the 10Gb NIC, but the pattern of behavior matches.
We know the 10Gb NIC is corrupting data - we've seen that on our Windows servers. It's possible - even likely - that the data was already corrupt before the NAS code update was applied. The patch went on Saturday, and it's also possible the corruption occurred during the patch itself or when the datamovers were deliberately failed over so it could be applied. Everything seemed fine after EMC applied the 5.6.39.506 patch. Then, around 6 AM Monday, we suffered an unrelated (or perhaps related) licensing failure in one of our legacy apps that is most easily resolved by rebooting the Unix server. After the reboot, none of the iSCSI LUNs would mount.
Since this server runs both of our primary line-of-business apps, we had a serious problem on our hands. Our Unix admin worked with EMC and HP all day trying to mount or repair the LUNs, to no avail. I'm not a Unix person (so I'm relaying the following based on what I took away from a conversation with our admin), but from what I understand some of the disk signature information is kept in RAM and has to be loaded from disk on boot. Those signatures were corrupt, so the LUNs would not mount. I'm not sure how, but they recreated the disk signatures, and our admin ran the Unix equivalent of chkdsk to see whether the data was corrupt. Error after error came back, and he finally killed the process after 3 hours.
Data was restored from tape to our rusty, trusty, end-of-life HP XP512, and our Unix box is back up and running on Fibre Channel.
What's the fallout? One weekend's worth of offline processing lost and one entire day of user productivity lost. I'm not sure what we or EMC are going to do next, but I'll keep you posted.
Claymoore Member Posts: 1,637
An EMC Technical Advisory (ETA) was released on 10/17.
Edit - EMC documentation removed at EMC's request.
UnifiedEMCGuy Member Posts: 1
So, EMC replaced the 10GbE cards and patched the Celerra so that future corruption would be detected.
How are things going now that the cards have been replaced?
No further corruption since then?