Snapshot Removal: Anyone ever have a snapshot removal hang @ 99%?

DeathmageDeathmage Banned Posts: 2,496
Hey guys,

Have you guys ever had a snapshot removal hang @ 99% and just stay there? So far it's been at 99% for 2 hours on our 6.7TB file server. I've been doing snapshot cleanup, and all of the other servers take like 30 to 40 minutes max.

Is it normal for large VMs to take a few hours or so and just sit at 99%?

Comments

  • EssendonEssendon Member Posts: 4,546 ■■■■■■■■■■
    How big and old was the delta? Were there multiple levels of deltas?
    NSX, NSX, more NSX..

    Blog >> http://virtual10.com
  • DeathmageDeathmage Banned Posts: 2,496
    okie... So I had to dig on the internet for these commands... but case in point, vCenter lies!!!!!

    CLI rules! It's just taking its jolly old time to delete!!!!!

    Below:

    Picture 1: yes, it's a sliver (white on white lol!!!), so you may not see it.... Notice the time: it's now 5:28 PM here.

    Picture 2:

    It's still working, just really fracking slow.... Notice the percentage....

    Picture 3:

  • ReibeReibe Member Posts: 56 ■■□□□□□□□□
    It can happen, I've had an old snapshot that was a couple of months old and it hung at 99% for about 5 hours before it finished.
  • DeathmageDeathmage Banned Posts: 2,496
    Reibe wrote: »
    It can happen, I've had an old snapshot that was a couple of months old and it hung at 99% for about 5 hours before it finished.

    Yes, that's me; this snapshot is from March. I've been looking into why the IOPS are so bad. I started digging and found a blog about keeping too many snapshots, and we had 5 on the SQL VM. So this is the 1st of 5 to be purged.

    So we shall see what this does for performance; so far the IO has decreased as this sucker is being removed.

    Learned something new about snapshots. I remember it from the exam but never actually put two and two together.
    Essendon wrote: »
    How big and old was the delta? Were there multiple levels of deltas?

    ...a few deltas, as you can see from above... ;)
  • EssendonEssendon Member Posts: 4,546 ■■■■■■■■■■
    I hope you've been backing up this VM.
  • DeathmageDeathmage Banned Posts: 2,496
    it's working fine, the snapshot removal went smoothly, just finished a few minutes ago. Only took 5 hours though.

    The server is backed up with Backup Assist and BE 12.5.
  • EssendonEssendon Member Posts: 4,546 ■■■■■■■■■■
    Did you forget about the snapshots, or were they left behind by the backup software?

    Take a moment to read this > VMware KB: Best practices for virtual machine snapshots in the VMware environment
  • kj0kj0 Member Posts: 767
    Keep an eye on your Snapshots. I run RVTools regularly to keep up with all those metrics.

    I've had it a few times. It got to the point where we couldn't wait any longer as it was affecting a customer's work, so we started migrating everything off the host, and as soon as everything was off except that VM, it completed straight away.

    If you're backing up VMs, be really careful with your Snapshots as you may end up backing up the snapshot, so you would ultimately end up with a "Backup of a backup" situation.
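
    A rough sketch of that kind of regular check, in Python. It assumes you've already exported snapshot details (VM, snapshot name, creation time, size) from whatever tool you use. The `SnapshotRecord` fields, the sample data and the 3-day threshold are all made up for illustration; they are not RVTools' actual column names, so map them to whatever your export really contains.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SnapshotRecord:
    """One row of a snapshot report (hypothetical field names)."""
    vm: str
    name: str
    created: datetime
    size_gb: float


def stale_snapshots(records, now, max_age=timedelta(days=3)):
    """Return the snapshots older than max_age, oldest first."""
    old = [r for r in records if now - r.created > max_age]
    return sorted(old, key=lambda r: r.created)


if __name__ == "__main__":
    # Invented sample data, only to show the call.
    report = [
        SnapshotRecord("sql01", "pre-upgrade", datetime(2015, 3, 10), 410.0),
        SnapshotRecord("print01", "driver-test", datetime(2015, 5, 31), 2.5),
    ]
    for snap in stale_snapshots(report, now=datetime(2015, 6, 1)):
        print(f"{snap.vm}: '{snap.name}' from {snap.created:%Y-%m-%d}, {snap.size_gb} GB")
```

    Run on a schedule, something like this makes a forgotten months-old snapshot show up at the top of the list long before it grows into an hours-long removal.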
    2017 Goals: VCP6-DCV | VCIX
    Blog: https://readysetvirtual.wordpress.com
  • DeathmageDeathmage Banned Posts: 2,496
    Ya, I basically forgot about them.... I just got done with a 3-week wireless project and neglected the cluster for a few weeks. I mean, I'd check vCOPS and graphs, but the snapshots slipped my mind.

    Going to be way more cautious from now on though.

    I am going to sit in the CLI more often though; I hear for the VCAP I need to graft it to my skull ;)
  • VeritiesVerities Member Posts: 1,162
    If you're using Snapshot Manager, the snapshot usually is already deleted at that point (95%-99%), but it can take hours for the host management agents to update. I ran into this issue a few times and found an article that explained what was happening and how to fix it. Of course the commands were different for our version of ESXi:

    ESXi Remove All Snapshots hangs at 99% | Blog-Stack.net

    This article had the correct commands:

    VMware KB: Committing snapshots when there are no snapshot entries in the Snapshot Manager

    You basically end up restarting the host management agents (I end up having to run services.sh restart since the other commands almost never work on 4.1):

    VMware KB: Restarting the Management agents on an ESXi or ESX host

    Don't spend 5 hours picking your nose...follow those links and get the process finished.
  • DeathmageDeathmage Banned Posts: 2,496
    Then I presume migrating the VMs to our other hosts and rebooting the troubled host would be just as good as resetting the agents. The uptime on the hosts is 4 months now; maybe a reboot would be good. ;)
  • iBrokeITiBrokeIT Member Posts: 1,318 ■■■■■■■■■□
    kj0 wrote: »
    Keep an eye on your Snapshots. I run RVTools regularly to keep up with all those metrics.

    I love RVTools for exactly this reason. It is nice to run monthly to make sure your LUNs are staying clean of orphaned files.
    2019: GPEN | GCFE | GXPN | GICSP | CySA+ 
    2020: GCIP | GCIA 
    2021: GRID | GDSA | Pentest+ 
    2022: GMON | GDAT
    2023: GREM  | GSE | GCFA

    WGU BS IT-NA | SANS Grad Cert: PT&EH | SANS Grad Cert: ICS Security | SANS Grad Cert: Cyber Defense Ops SANS Grad Cert: Incident Response
  • DeathmageDeathmage Banned Posts: 2,496
    kj0 wrote: »
    Keep an eye on your Snapshots. I run RVTools regularly to keep up with all those metrics.

    I've had it a few times. It got to the point where we couldn't wait any longer as it was affecting a customer's work, so we started migrating everything off the host, and as soon as everything was off except that VM, it completed straight away.

    If you're backing up VMs, be really careful with your Snapshots as you may end up backing up the snapshot, so you would ultimately end up with a "Backup of a backup" situation.

    RVTools, huh? Is that an add-on or a command-line tool?
  • kj0kj0 Member Posts: 767
    RVTools - Home

    Do you use Twitter? It's very popular on there.
  • iBrokeITiBrokeIT Member Posts: 1,318 ■■■■■■■■■□
    The vHealth tab is a good place to start :)
  • VeritiesVerities Member Posts: 1,162
    Deathmage wrote: »
    Then I presume migrating the VMs to our other hosts and rebooting the troubled host would be just as good as resetting the agents. The uptime on the hosts is 4 months now; maybe a reboot would be good. ;)

    Not to sound like a dick, but you presume wrong; the host management agents are services running on the host, and restarting them doesn't affect the VMs as long as you don't have the automatic startup/shutdown option enabled. 4 months of uptime on a host is really good..... ESXi does not need to be rebooted for no reason.
  • DeathmageDeathmage Banned Posts: 2,496
    Naaa, no disrespect seen. I have pretty thick skin ;)

    I'll give it a try next time.

    With each snapshot being deleted the performance of the array is improving. Just two more to go and I'm done. Going to let them cook while I sleep and check on them in the morning on the remote terminal server.

    Also doing a defrag on all of them except SQL. We're doing a database packing Saturday. :)
  • VeritiesVerities Member Posts: 1,162
    Deathmage wrote: »
    Naaa, no disrespect seen. I have pretty thick skin ;)

    I'll give it a try next time.

    With each snapshot being deleted the performance of the array is improving. Just two more to go and I'm done. Going to let them cook while I sleep and check on them in the morning on the remote terminal server.

    Also doing a defrag on all of them except SQL. We're doing a database packing Saturday. :)

    A few tips for SQL DBs:

    -Verify the size of the logs vs. the size vCenter expects based on the logging level

    -Shrink the DB by reducing white space

    -Truncate tables

    I've run into issues like 91GB transaction logs on a 7 host/56VM setup with level 3 logging. It's one of those things people don't think about until vCenter starts to run slow as mud.
  • DeathmageDeathmage Banned Posts: 2,496
    Thanks for the pointers.

    Over the past few months I've seen that SQL is its own kind of animal; it sometimes defies logic, more so than at my previous employers. Especially concerning are SQL queries that aren't written for speed but just-to-get-r-done copy-n-paste and 'ooo look it works, let's leave it like this' (16 lines of code when it just needs to be 2), and a SQL database that has never been packed in 7+ years, ya know, the things normal people call 'preventative maintenance'. :-X -- Case in point: a few months back I did a purge of temp files and other safely removable system cache files like memory ****, windows/dns/font logs; on the SQL server it was 65.6 GB in size for TEMP FILES!!!!!!!!

    I'm having to read blog after blog on SQL performance because my predecessor had no idea about it (makes me wonder if I should get my 'MCSE: SQL' before SI next before my VCAP), and my co-worker the programmer is shaking his head over the code and how inefficient it is. Some queries take fracking forever, and it's not the array; it's just the length of the poor coding.

    I'm having to think like it's a woman (SQL) and treat it that way, and be very, very cautious about what changes are made to the systems.

    It's funny, our SQL and Syteline ERP VMs are the only problem children; all the others are working fine, but you always have those select few that give you headaches and stress. ;)
  • DeathmageDeathmage Banned Posts: 2,496
    These numbers look much better this morning. Still two snapshots to be removed, but they're from last week. Will purge them tonight after hours and do a consolidation of the logs afterwards.

    Be nice to see what the database packing will do next week. :)

  • VeritiesVerities Member Posts: 1,162
    Deathmage wrote: »
    These numbers look much better this morning. Still two snapshots to be removed, but they're from last week. Will purge them tonight after hours and do a consolidation of the logs afterwards.

    Be nice to see what the database packing will do next week. :)

    These performance charts remind me of a 5 year old trying to draw mountains on a piece of paper.
  • jibbajabbajibbajabba Member Posts: 4,317 ■■■■■■■■□□
    I remember panicking back in the day about a removal getting stuck for ages. It took 36!!! hours to remove. A colleague made the mistake of shutting the VM down in the hope it would speed things up; it didn't. Maybe it did and it would otherwise have taken 72 hrs, but because it was in the middle of the removal you couldn't power it back on, so effectively the server was down for over a day.

    And that was an orphaned Veeam snapshot.
    My own knowledge base made public: http://open902.com :p
  • DeathmageDeathmage Banned Posts: 2,496
    Yup I learned that too. I took down the print server after hours at 7pm in the middle of the removal and I also found you can't power it back on while the removal is happening.

    Glad the last two snapshots were successfully removed from the SQL server last night, so the performance is definitely way better.

    I guess sometimes book smarts don't always teach you real-life stuff with VMware, or anything IT for that matter...

    Won't let this happen again...
  • kj0kj0 Member Posts: 767
    I attempted to migrate a 1.5TB vmdk across two datastores to convert it to thin provisioning before business hours (it was a school, and a video library server)... Let's just say that cancelling it at 13% after an hour still took 2 hours to roll back!
  • DeathmageDeathmage Banned Posts: 2,496
    kj0 wrote: »
    I attempted to migrate a 1.5TB vmdk across two datastores to convert it to thin provisioning before business hours (it was a school, and a video library server)... Let's just say that cancelling it at 13% after an hour still took 2 hours to roll back!

    Any way to make it faster, e.g. more CPU shares or higher IO shares on the LUNs from the SAN, or is it literally just a time-consuming process at the VMware kernel level?
  • joelsfoodjoelsfood Member Posts: 1,027 ■■■■■■□□□□
    Best way to speed up snapshots, clones, etc is to stop anything making changes to that disk. Remember, you're not just consolidating/moving all of the data that was there at the start, but then also having to keep up with any changes made while the move is in process.
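
    To put rough numbers on that, here's a back-of-the-envelope sketch in Python. It's a deliberately simplified model, not how ESXi actually schedules the work: it assumes the merge runs at a constant rate and the guest keeps writing at a constant rate, so the merge only gains ground at the difference between the two. The function name and all the figures are made up for illustration.

```python
def estimate_consolidation_hours(delta_gb, merge_mb_per_s, write_mb_per_s=0.0):
    """Rough hours to merge a snapshot delta while the guest keeps writing.

    Simplified model: total work = existing delta + writes that land during
    the merge, so T = delta / (merge_rate - write_rate).
    """
    net_rate = merge_mb_per_s - write_mb_per_s
    if net_rate <= 0:
        # The guest writes as fast as (or faster than) the merge can absorb.
        return float("inf")
    seconds = delta_gb * 1024 / net_rate
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 450 GB delta merged at 50 MB/s.
    idle = estimate_consolidation_hours(450, 50)        # nothing writing to the disk
    busy = estimate_consolidation_hours(450, 50, 25)    # guest writing at 25 MB/s
    print(f"idle disk: {idle:.2f} h, busy disk: {busy:.2f} h")
```

    With those invented numbers the busy case takes twice as long as the idle one, and if the write rate ever matches the merge rate the estimate never finishes at all, which is the point about stopping whatever is changing the disk.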
  • DeathmageDeathmage Banned Posts: 2,496
    joelsfood wrote: »
    Best way to speed up snapshots, clones, etc is to stop anything making changes to that disk. Remember, you're not just consolidating/moving all of the data that was there at the start, but then also having to keep up with any changes made while the move is in process.

    so in essence, turning off the VM and then doing the snapshot removal. Only downside is the VM is offline till the removal is finished.
  • DeathmageDeathmage Banned Posts: 2,496
    Updated my blog with the steps I used to gather the snapshot information. Might be helpful to others.

    VMware Snapshot's can be good but are also bad is left unchecked. - I.T.HINK ...So you don't have too...