Best practices for using replicated volumes in vSphere 4.0

I am having some trouble testing a proper DR (disaster recovery) scenario.

We have a vSphere cluster setup with two PS4000 series SANs.

Scenario
It may be worth mentioning that we are not an Enterprise VSPP partner and currently do not have Site Recovery Manager available.

Anyway, both SANs have multiple LUNs configured, which are visible to all hosts, and the virtual machines are spread evenly across them.
For certain customers we provide replication. At the moment we have one LUN on SAN1 that is replicated to SAN2.

I am now trying to simulate a SAN failure, for example by shutting down SAN1.

We are aware that virtual machines on that SAN which are not replicated will be inaccessible.

What we have tried so far is to promote the replica on SAN2 to a volume, make the volume permanent, and change the provisioning from thin to thick.
When promoting the volume we kept the IQN.

The problem we are now running into is probably vSphere. As soon as the SAN is down, vSphere does not seem to consider the storage lost and tries to keep those virtual machines online. I believe the default interval at which vSphere checks, for example, disk space on datastores is 30 minutes.

The idea was that all we would have to do is rescan the HBAs, and vSphere should then be able to find those volumes again, as they still have the same IQN.
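To make the rescan step concrete, here is a minimal sketch of how it could be scripted per host. It assumes the classic ESX 4.0 service console, where `esxcfg-rescan <adapter>` is available. The host names and the adapter name `vmhba33` are hypothetical placeholders, not our real setup, and the helper only builds the command lines so they can be reviewed before anything is run.

```python
from typing import List


def build_rescan_commands(hosts: List[str], adapters: List[str]) -> List[str]:
    """Return one ssh command line per (host, adapter) pair.

    Nothing is executed here; the strings are meant to be reviewed first.
    """
    commands = []
    for host in hosts:
        for adapter in adapters:
            # esxcfg-rescan takes the adapter name, e.g. the software iSCSI HBA.
            # "vmhba33" and the host names used below are assumptions.
            commands.append(f"ssh root@{host} esxcfg-rescan {adapter}")
    return commands


if __name__ == "__main__":
    for line in build_rescan_commands(["esx01", "esx02"], ["vmhba33"]):
        print(line)
```

Building the commands without executing them is deliberate here, since in our case a rescan against a dead target is exactly the operation that hangs.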

The problem is that a rescan pretty much hangs and seems unable to remove those orphaned volumes; in fact, it does not consider them orphaned in the first place.

Everything storage-related (for example, removing targets from the hosts) simply hangs. The only solution we have found so far is to create a new IQN and reboot those hosts, which sometimes means pushing the big button, as the system even hangs during shutdown due to the lock on those LUNs.

When creating a new IQN we obviously have to re-add the storage, resignature the volumes, and re-import those VMs. The main problem is the virtual machines running on those hosts whose storage is on the SAN that is still online. Rebooting the whole cluster in order to bring the replicated volumes online is something we have to avoid.
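For reference, the resignature step looks roughly like the sketch below. It assumes the `esxcfg-volume` utility on the ESX 4.0 service console, where `-l` lists volumes detected as snapshots or replicas and `-r` resignatures one by VMFS UUID or label. The datastore label `replica_ds` is a hypothetical placeholder, and the helper only builds the command strings rather than running them.

```python
import shlex
from typing import List


def build_resignature_commands(volume_labels: List[str]) -> List[str]:
    """Return the service console commands for resignaturing replica volumes.

    The first command lists the volumes the host detects as snapshots or
    replicas; each following command resignatures one volume by label.
    """
    commands = ["esxcfg-volume -l"]
    for label in volume_labels:
        # Quote the label in case a datastore name contains spaces.
        commands.append(f"esxcfg-volume -r {shlex.quote(label)}")
    return commands


if __name__ == "__main__":
    # "replica_ds" is an assumed label, used for illustration only.
    for line in build_resignature_commands(["replica_ds"]):
        print(line)
```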

Does all this make any sense? What would be the right way to have both SANs connected to the cluster, with certain LUNs replicated in either direction (SAN1<>SAN2 / SAN2<>SAN1), so that the replicas can be brought online without any downtime of the hosts?
