ESXi 6.0: 1 VM of 4 randomly loses ability to ping out

NatePrime Registered Users Posts: 4
I am running an ESXi 6.0 server with four VMs: three RHEL 5 and one Server 2012. About one week ago, without any updates being done to the host and only routine Windows/yum repo updates done on the VMs, one of my RHEL boxes lost all network connectivity. It could ping its own IP and the loopback, but could not ping other systems on the host or its own default gateway. After a bit of troubleshooting and rebooting, I disconnected the network adapter under the VM settings in vCenter, then reconnected it and regained connectivity. I wrote it down in my logbook as a possible fix action and moved on.

Last night a different RHEL box had the same issue. I did the same fix action and it worked; however, this morning the issue is back and the fix no longer works. The machine can ping itself and its loopback, but not the gateway or other machines on the host. All other assets on the host have no issues connecting to the network and are accessible as normal.

Most of the fixes I've read about deal with every VM on a host losing connectivity, not a single one. This system had been working fine for almost a month, and for almost a year before that, prior to moving the systems to a dedicated VM server. No routing or switching configuration that would affect this has been changed.
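
For reference, here is a rough in-guest triage sketch for this symptom (the interface name eth0 and the gateway 10.0.0.1 are placeholders, not the actual values here):

    # What the guest thinks it has for an address and a default route
    ip addr show eth0
    ip route show

    # Loopback and the VM's own IP still answer in this failure mode
    ping -c 3 127.0.0.1

    # Gateway test plus ARP check: an incomplete ARP entry for the gateway
    # points at layer 2 (vNIC/vSwitch) rather than routing
    ping -c 3 10.0.0.1
    arping -I eth0 -c 3 10.0.0.1
    ip neigh show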

Comments

  • iBrokeIT Member Posts: 1,318
    Are you using e1000 vNICs?
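
    If you're not sure, one way to check from inside the guest (eth0 is just an example interface name):

        # The driver line shows e1000 vs vmxnet3 for that interface
        ethtool -i eth0

        # Or look at which PCI device the VM was handed
        lspci | grep -i ethernet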
    2019: GPEN | GCFE | GXPN | GICSP | CySA+ 
    2020: GCIP | GCIA 
    2021: GRID | GDSA | Pentest+ 
    2022: GMON | GDAT
    2023: GREM  | GSE | GCFA

    WGU BS IT-NA | SANS Grad Cert: PT&EH | SANS Grad Cert: ICS Security | SANS Grad Cert: Cyber Defense Ops | SANS Grad Cert: Incident Response
  • NatePrime Registered Users Posts: 4
    Yes, for all assets we are using e1000 vNICs.
  • iBrokeIT Member Posts: 1,318
    VMware best practice is to use VMXNET3 vNICs unless you have a specifically documented use case for the E1000s. They are notorious for being flaky, and I've experienced it personally.
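
    If you want to confirm what each VM is actually configured with, the adapter type is recorded in the VM's .vmx; a quick sketch from the ESXi shell (the datastore path below is just an example):

        # List registered VMs and their .vmx paths
        vim-cmd vmsvc/getallvms

        # Check the configured adapter type per vNIC
        grep -i 'ethernet[0-9].virtualDev' /vmfs/volumes/datastore1/myvm/myvm.vmx
        #   ethernet0.virtualDev = "e1000"     -> E1000
        #   ethernet0.virtualDev = "vmxnet3"   -> VMXNET3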
  • NatePrime Registered Users Posts: 4
    Step 1 in every troubleshooting diagram is rebuild, so we're doing that with the Linux box we're having issues with and giving it VMXNET3 NICs. Due to the nature of the machine, it is no loss for us to redo it, and we can verify whether it's the NICs causing the issue or something else. Thank you.
  • iBrokeIT Member Posts: 1,318
    That's awesome to hear.


    Not mine but: [meme image not preserved]

    I knew saving this meme would come in handy some day!
  • NatePrime Registered Users Posts: 4
    We attempted to rebuild the server with the VMXNET3 NICs and, because our build is on RHEL 5, we can't do it. RHEL5 doesn't work with VMXNET3. It looks like we're going to have to do a manual rebuild onto a server 2012 box once our current slew of activity slows down. Thank you for your support.
  • iBrokeIT Member Posts: 1,318
    NatePrime wrote: »
    ... because our build is on RHEL 5, we can't do it. RHEL5 doesn't work with VMXNET3.

    What facts are you backing that statement up with? VMware's official documentation says otherwise (compatibility guide screenshot not preserved):

    Link: https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=software&testConfig=16&productid=11500&releaseid=273&supRel=273,&deviceCategory=software&details=1&partner=75&releases=273&productNames=15&page=1&display_interval=10&sortColumn=Partner&sortOrder=Asc&testConfig=16


    Again, I'm going out on a limb here and guessing that this is probably your issue, but instead of using a little Google-fu you just threw your hands up and gave up. I understand this is probably frustrating, especially if you have management on your ass, but you need to have a little more determination and persistence with these sorts of things in order to become a good admin.

    NatePrime wrote: »
    It looks like we're going to have to do a manual rebuild onto a server 2012 box once our current slew of activity slows down. Thank you for your support.

    Why? All you have to do is remove the e1000 vNIC, add a new VMXNET3 vNIC, and update the IP information. There are even scripts out there that will do this for you.
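
    If you'd rather not click through the GUI, one way to do the swap from the ESXi shell with the VM powered off (the vmid 42 and datastore path are just examples, and hand-editing the .vmx is only safe while the VM is off):

        # Find the VM's id and .vmx path
        vim-cmd vmsvc/getallvms

        # Power the VM off first
        vim-cmd vmsvc/power.off 42

        # Switch the adapter type in the .vmx
        sed -i 's/ethernet0.virtualDev = "e1000"/ethernet0.virtualDev = "vmxnet3"/' \
          /vmfs/volumes/datastore1/myvm/myvm.vmx

        # Re-read the config and power back on
        vim-cmd vmsvc/reload 42
        vim-cmd vmsvc/power.on 42

    The in-place edit keeps the vNIC's existing MAC address; removing and re-adding the adapter in vCenter generates a new MAC, so a Linux guest may treat it as a brand-new interface.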

    Also, make sure you update any of your VM deployment templates - don't want those e1000s creeping back into the environment!

    Cheers!
  • kj0 Member Posts: 767
    Remember that E1000s generally have drivers built into most (if not all recent) OSs, whereas most OSs don't ship with VMware Tools or open-vm-tools out of the box. Ensure VMware Tools/open-vm-tools is installed and up to date.
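
    A quick way to check from the guest (package and command names vary with the RHEL version and tools flavour, so treat these as examples):

        # open-vm-tools (newer guests) or the classic VMware Tools packages
        rpm -q open-vm-tools || rpm -qa | grep -i vmware-tools

        # If the tools are installed, this prints the version
        vmware-toolbox-cmd -v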

    Also, Server 2012 has an issue where E1000 NICs drop off over time. We had someone build a custom template with E1000s and then these issues started appearing. https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2109922
    2017 Goals: VCP6-DCV | VCIX
    Blog: https://readysetvirtual.wordpress.com
  • jibbajabba Member Posts: 4,317
    iBrokeIT wrote: »
    Why? All you have to do is remove the e1000 vNIC, add a new VMXnet3 vNIC and update the ip information. There are even scripts out there that will do this for you.

    You don't even need scripts. RHEL likely just renames the current eth0 config to .bak because of the MAC change, so you likely just need to remove the MAC from the ifcfg script, drop the .bak suffix, and remove the '/etc/udev/rules.d/70-persistent-net.rules' file - just like you do when preparing a Red Hat template. One restart later you should be back in business ... probably a 2-minute job, if even that.
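
    Roughly, that cleanup on the RHEL guest looks like this (assuming eth0 and the stock RHEL network-scripts layout):

        cd /etc/sysconfig/network-scripts

        # If the old config was renamed after the MAC change, put it back
        mv ifcfg-eth0.bak ifcfg-eth0    # only if a .bak copy exists

        # Drop the hard-coded MAC so the config matches the new vNIC
        sed -i '/^HWADDR=/d' ifcfg-eth0

        # Forget the old NIC-to-name mapping; it is regenerated at boot
        rm -f /etc/udev/rules.d/70-persistent-net.rules

        reboot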
    My own knowledge base made public: http://open902.com :p
  • iBrokeIT Member Posts: 1,318
    Of course you don't NEED a script, but you're a lot cooler if you did use one.