Hyper-V/General replication troubleshooting help

lsud00d · April 2014

Apologies for the long-winded post. I typed this out in a Word doc to get my ducks in a row, but I feel everything needs to be laid out on the table to be fully understood. I'm not a VMware wiz (will be starting the VCP course in May!) but I'm pretty good with Hyper-V. I think the virtualization principles apply regardless of technology, so please chime in if you have any ideas not mentioned here. Thanks!

ULTIMATE GOAL:
Balance replicated VMs across 2 DR clusters

PROBLEM:
Replication works initially but then fails to one of the clusters at the DR site. One of the DR clusters is currently carrying nearly all of the replica VM load.

SETUP:
Hyper-V 2012
2 sites (HQ and DR)
4 Clusters (2 at each site, HQ-A/HQ-B and DR-A/DR-B, replica brokers configured for each cluster)

Hosts are all R710s. The NIC hardware/drivers/configuration are not identical but nearly the same for the most part. At the root of it, traffic traverses the Management NIC, which is set to auto-negotiate and is operating at full duplex with the core switch it is connected to.

Servers at both sites are 1 hop away via 1Gb SAN switches from the SAS storage where the VMs live on CSVs.

Replication is configured over port 80 (HTTP) with Kerberos authentication.

DIFFERENCES:
Replicas to cluster DR-A were seeded (Initial Replication) via external USB disks to avoid the network cost/time [TBs of data]. Nearly all of the production replicas currently live on DR-A.

Replicas to cluster DR-B were seeded over the wire. These are small test VMs (4 GB to 12 GB each) that replicate in a short amount of time. DR-B was configured after DR-A.
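
For a rough sense of why the big VMs were seeded from USB disks and the small ones were not, here is a back-of-the-envelope transfer time estimate in Python. The link speed and the efficiency factor are assumptions for illustration only (the WAN bandwidth between HQ and DR isn't stated above), and transfer_hours is a hypothetical helper, not part of any Hyper-V tooling.

```python
def transfer_hours(size_gb, link_mbps, efficiency=0.8):
    """Estimate hours needed to copy size_gb gigabytes over a link_mbps link.

    efficiency is an assumed fraction of the nominal link rate that is
    actually usable (protocol overhead, competing traffic).
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be > 0 and efficiency in (0, 1]")
    megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    ASSUMED_LINK_MBPS = 100  # placeholder, NOT the real HQ-to-DR bandwidth
    for label, size_gb in [("12 GB test VM", 12), ("1 TB of production data", 1000)]:
        print(f"{label}: {transfer_hours(size_gb, ASSUMED_LINK_MBPS):.1f} h")
```

Plug in the real link speed to see how the small test VMs compare with the multi-TB production set.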

Replication only fails to cluster DR-B*

*Caveat: I changed the disk size of VM2 at DR-B (it became primary via Planned Failover) to make room for Windows Updates. This broke replication, so I deleted the replica (which started as the primary at HQ) and reinitiated replication from DR. Since then, replication of VM2 has not failed. So, in this instance, replication was initiated from DR.

TROUBLESHOOTING/OBSERVATIONS:
The Test-VMReplicationConnection PowerShell cmdlet, run against both the Replica Server and the Replica Broker, reports the connection as successful.

Replication fails after hours. Of note, this is the time window in which replication is set to retry if it is in a failing state, although I haven’t noticed any VMs that met this criterion (failed during the day, needed to restart after office hours).

Error 32022 in Event Viewer: Hyper-V suspended replication for virtual machine ‘VM1’ due to a non-recoverable failure.*

*I have also seen timeout errors accompany the replication failures.

I captured the timeframe in which it occurred with Network Monitor (sometimes it’s easier than Wireshark with MS-specific technologies). I didn’t see any retransmissions or fragmentation. I didn’t dig super deep into the frame details, but nothing stood out when scanning through.

FURTHER TESTING:
Repeat the above caveat, where replication is initiated from the DR site to HQ. This is fine for testing purposes but is not ideal.

NICs on DR-A cluster hosts are Intel. NICs on DR-B cluster hosts are Broadcom. Again, these all have up-to-date drivers but…something to test. There’s no NIC teaming configured on the Management NIC.

lsud00d · April 2014

Back for more after recent testing...

It appears the scenario in which replication goes on to fail boils down to:

VMs IR'd (Initial Replication) and sync'd over the wire from HQ to DR.

It seems to be independent of the clusters, so that simplifies the troubleshooting. VMs that were IR'd via external storage transported to DR have not had this issue. VMs IR'd and sync'd over the wire in the reverse direction (DR to HQ) don't have this issue.
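
To keep the combinations straight, here is a small Python sketch that lays the observations above out as a matrix of seeding method vs. direction. The data comes straight from this thread; the names (observed, failing, untested) are just illustrative.

```python
from itertools import product

SEED_METHODS = ("over the wire", "external storage")
DIRECTIONS = ("HQ -> DR", "DR -> HQ")

# Observations as reported in this thread:
# (seed method, direction) -> does replication eventually fail?
observed = {
    ("over the wire", "HQ -> DR"): True,
    ("external storage", "HQ -> DR"): False,
    ("over the wire", "DR -> HQ"): False,
}


def failing(obs):
    """Return the combinations where replication was seen to fail."""
    return [combo for combo, fails in obs.items() if fails]


def untested(obs):
    """Return the combinations with no observation recorded yet."""
    return [c for c in product(SEED_METHODS, DIRECTIONS) if c not in obs]


if __name__ == "__main__":
    print("Failing: ", failing(observed))
    print("Untested:", untested(observed))
```

Only one combination fails so far, and one combination (external storage, DR to HQ) has no result reported here, so it can't be ruled in or out yet.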

I chalk it up to network/lack of QoS at HQ. This still doesn't really explain why replication from HQ to DR continues to work without issue for those that were not IR'd over the wire.
