vSphere 4.1 NFS I/O error occurred
slinuxuzer
Member Posts: 665 ■■■■□□□□□□
When trying to upload files to the datastores on my NetApp filer, I get "I/O error occurred". I have already confirmed that it's not a permissions problem, and it seems to happen with big files. I know this was a known issue at one time with older versions, but I have been scouring the net and haven't found a solution yet.
Basic setup is 1 vCenter DL380 G6, 2 ESXi 4.1 Installable, Cisco 3750s running 12.2.
Flow control is set to send on for the filer and ESXi, and to receive on / desired on the switches.
The ESXi servers use two storage connections, load balanced according to NetApp TR-3749; these ports are trunked with two VLANs.
The filer side is set up as an LACP EtherChannel.
The NetApp Virtual Storage Console is installed and the storage controller settings have their recommended values set.
Any ideas?
Thanks in advance.
Comments
-
kalebksp Member Posts: 1,033 ■■■■■□□□□□
So you are uploading through the vSphere Client from a Windows computer? Is the source file on the local computer? Have you tried uploading the files through FastSCP or directly to the NFS share?
-
slinuxuzer Member Posts: 665 ■■■■□□□□□□
Yes, I am trying to upload from the VI Client datastore browser on Windows, and the source file is on the same machine as the client. I have not tried FastSCP, as I am not familiar with it, and I have not tried uploading directly to the share.
My main concern is that this is a symptom of something larger being wrong with my storage connections; I am about to migrate my VMs off my FAS940 onto this FAS2040.
-
kalebksp Member Posts: 1,033 ■■■■■□□□□□
Do you have VMs running on this setup already or is it new? You may want to verify the load balancing on both the vSwitch and the vmkernel ports accessing the NFS storage. The contents of the rc file on the filer could be helpful in diagnosing the problem too.
-
MSNinja Member Posts: 26 ■□□□□□□□□□
I had this problem once... too long ago for me to remember what I did to fix it, but I remember I fixed it.
Even though you said it wasn't permission related, I'm sure it was something regarding permissions, if I remember correctly...
The NetApp I had used some kind of double permission config, one for the share and one for the array or something...
My recommendation is to check the permissions again...
Hope this helps
-
kalebksp Member Posts: 1,033 ■■■■■□□□□□
Good point MSNinja. Check that the security type is correct with 'qtree status'; your NFS volumes should be set to unix style security. You can also check that /etc/exports is configured correctly: you can export a volume with the correct permissions, but if it's not in /etc/exports the permissions won't be retained when the filer is rebooted.
-
ConradJ Member Posts: 83 ■■□□□□□□□□
Did you try doing RDM mapping with SATA drives? The reason I ask is that I tried it unsuccessfully the first time and ended up getting dozens of errors like yours. I did a low level format (3 x 1.5TB drives takes some time!) and that fixed it.
Then I redid it and it has been working fine ever since.
Just an idea...
-
slinuxuzer Member Posts: 665 ■■■■□□□□□□
Opened a VMware support case and, after an escalation, VMware is telling me it is an "NFC" error that was prevalent in version 3.5 and still occasionally pops up in 4.1, especially with NetApp filers. They also mentioned they have had several cases of this in the last week.
They are suggesting I place a laptop in my storage VLAN and attempt the transfer from there to determine whether it is a storage or network issue.
We have two other filers with the same config and same models at two other sites that are not experiencing this issue. The only difference is that they have already been upgraded to the latest Data ONTAP; I am still on 8.0.1RC3 7-Mode.
Planning to resume the case Wednesday, after another staff member upgrades Data ONTAP.
Here are the contents of the rc file from the relevant head.
#Regenerated by registry Fri Jan 14 13:04:10 CST 2011
#Auto-generated by setup Wed Dec 15 11:48:19 CST 2010
hostname {edited for Security}
vif create lacp vif1 -b ip e0c e0d
vif create lacp vif2 -b ip e0a e0b
vlan create vif1 128 144
vlan create vif2 160 176
ifconfig vif1-144 170.4.216.149 netmask 255.255.255.240 mtusize 1500 partner vif1-144
ifconfig vif2-160 170.4.216.165 netmask 255.255.255.240 mtusize 1500 partner vif2-160
ifconfig vif2-176 170.4.216.181 netmask 255.255.255.240 mtusize 1500 partner vif2-176
ifconfig vif1-128 170.4.216.133 netmask 255.255.255.240 mtusize 1500 partner vif1-128
ifconfig e0a untrusted -wins mediatype auto flowcontrol send
ifconfig e0b untrusted -wins mediatype auto flowcontrol send
route add default 170.4.216.129 1
routed on
options dns.domainname {edited for Security}
options dns.enable on
options nis.enable off
-
slinuxuzer Member Posts: 665 ■■■■□□□□□□
Still haven't gotten anywhere on this with VMware; they want me to put my client into the storage VLAN and try the upload from there.
I ran Wireshark today while trying the upload and found that my workstation IP communicates with the service console of an ESXi server during the transfer. Per NetApp and VMware best practices (if I am understanding them correctly), the service console and storage connections should be in different VLANs, so I question the validity of the test.
I created a share pointing to this datastore and uploaded the same folder over CIFS, and it ran fine (I did use the same address I am using for NFS).
During the upload, Wireshark sees several segmented packets and then receives RST (reset) packets from the ESXi service console address. This IS after numerous successful data packets were transferred. I have attached a screenshot of Wireshark.
Sorry to drag this one back up, but I am at the end of my rope with this one.
Thanks in advance.
-
slinuxuzer Member Posts: 665 ■■■■□□□□□□
Still working this issue, unfortunately. VMware tier 1 support hasn't been able to solve it so far. I posted this in the VMware community, but thought I would give you guys a crack at it also, as I have been pulling my hair out for a while now.
Using Wireshark I determined that the upload is being sent to the service console of ESXi, which I guess makes sense since that's where the NFS storage is actually connected, so I have a vpxa.log snippet from the ESXi host.
The lines that concern me are:
[2011-02-26 22:17:08.209 15BFDB90 info 'Libs'] UUID: Unable to open /dev/mem: No such file or directory
[2011-02-26 22:16:58.026 160B5B90 warning 'Libs'] [NFC ERROR] NfcBufWrite: failed to write file
[2011-02-26 22:16:58.029 15CC0B90 warning 'Libs' opID=task-internal-4905-b765816c] [NFC ERROR] NfcBuf_Recv: session error 4
[2011-02-26 22:16:58.029 15CC0B90 warning 'Libs' opID=task-internal-4905-b765816c] [NFC ERROR] NfcServerLoop: failed to receive file data
[2011-02-26 22:16:58.153 15CC0B90 error 'App' opID=task-internal-4905-b765816c] [VPXNFCSERVER] Nfc server failed: File error -- NfcBufWrite: failed to write file
[2011-02-26 22:16:58.153 15CC0B90 verbose 'App' opID=task-internal-4905-b765816c] [VPXNFCSERVER] Closing NFC session
[2011-02-26 22:16:56.071 15CC0B90 verbose 'App' opID=task-internal-4905-b765816c] [VPXNFCSERVER] Starting NFC server loop
[2011-02-26 22:16:56.076 15CC0B90 verbose 'App' opID=task-internal-4905-b765816c] [VpxaVmprovUtil] DsPathToLocalPath conversion: [TEST] ISO/SUSE/SUSE_MBRAlign-disk1.vmdk -> /vmfs/volumes/c8cd2f66-895d1e33/ISO/SUSE/SUSE_MBRAlign-disk1.vmdk
[2011-02-26 22:16:56.076 15CC0B90 verbose 'App' opID=task-internal-4905-b765816c] [VpxaDatastoreContext] Resolved DsPath [TEST] ISO/SUSE/SUSE_MBRAlign-disk1.vmdk to localPath /vmfs/volumes/c8cd2f66-895d1e33/ISO/SUSE/SUSE_MBRAlign-disk1.vmdk
[2011-02-26 22:16:56.081 15CC0B90 verbose 'App' opID=task-internal-4905-b765816c] [VpxaVmprovUtil] DsPathToLocalPath conversion: [TEST] ISO/SUSE/SUSE_MBRAlign-disk1.vmdk -> /vmfs/volumes/c8cd2f66-895d1e33/ISO/SUSE/SUSE_MBRAlign-disk1.vmdk
[2011-02-26 22:16:56.081 15CC0B90 verbose 'App' opID=task-internal-4905-b765816c] [VpxaDatastoreContext] Resolved DsPath [TEST] ISO/SUSE/SUSE_MBRAlign-disk1.vmdk to localPath /vmfs/volumes/c8cd2f66-895d1e33/ISO/SUSE/SUSE_MBRAlign-disk1.vmdk
[2011-02-26 22:16:56.085 15CC0B90 verbose 'App' opID=task-internal-4905-b765816c] [VpxaVmprovUtil] DsPathToLocalPath conversion: [TEST] ISO/SUSE/SUSE_MBRAlign-disk1.vmdk -> /vmfs/volumes/c8cd2f66-895d1e33/ISO/SUSE/SUSE_MBRAlign-disk1.vmdk
[2011-02-26 22:16:56.085 15CC0B90 verbose 'App' opID=task-internal-4905-b765816c] [VpxaDatastoreContext] Resolved DsPath [TEST] ISO/SUSE/SUSE_MBRAlign-disk1.vmdk to localPath /vmfs/volumes/c8cd2f66-895d1e33/ISO/SUSE/SUSE_MBRAlign-disk1.vmdk
[2011-02-26 22:16:57.542 15B3AB90 verbose 'VpxaHalCnxHostagent'] Received callback in WaitForUpdatesDone
[2011-02-26 22:16:57.542 15B3AB90 verbose 'VpxaHalCnxHostagent'] [VpxaHalCnxHostagent::ProcessUpdate] Applying updates from 5539 to 5540 (at 5539)
[2011-02-26 22:16:57.542 15B3AB90 verbose 'VpxaHalResourcePool'] [VpxaHostdSpecSync] Received change notification from hostd ...
[2011-02-26 22:16:57.542 15B3AB90 verbose 'VpxaHalResourcePool'] [VpxaHostdSpecSync] [IsVmListChangeUpdate] returning false
[2011-02-26 22:16:57.542 15B3AB90 verbose 'VpxaHalResourcePool'] [VpxaHostdSpecSync] Setting _syncPending = true ...
[2011-02-26 22:16:57.542 15B3AB90 verbose 'VpxaHalResourcePool'] [VpxaHostdSpecSync] Launching ProcessResourceNotification in new thread...
[2011-02-26 22:16:57.542 15B7BB90 verbose 'VpxaHalResourcePool'] [VpxaHostdSpecSync] Operation is transform ...
[2011-02-26 22:16:57.549 15B7BB90 verbose 'VpxaHalResourcePool'] [VpxaHostdSpecSync] Copied root values ...
[2011-02-26 22:16:57.549 15B7BB90 verbose 'VpxaHalResourcePool'] [VpxaHostdSpecSync] Generated list of transforming operations ...
[2011-02-26 22:16:57.549 15B7BB90 verbose 'VpxaHalResourcePool'] [OverrideTreeOnHostd] oldOperations != new operations, trying with new operation set...
[2011-02-26 22:16:57.549 15B7BB90 info 'VpxaHalResourcePool'] [OverrideTreeOnHostd] Trees are identical, nothing to do
[2011-02-26 22:16:57.549 15B7BB90 verbose 'VpxaHalResourcePool'] [OverrideTreeOnHostd] currOverrideState = 0, application succeeded, exiting...
[2011-02-26 22:16:57.549 15B7BB90 verbose 'VpxaHalResourcePool'] [ProcessResourceNotification] No syncs pending, exiting loop...
[2011-02-26 22:16:58.026 160B5B90 warning 'Libs'] [NFC ERROR] NfcBufWrite: failed to write file
[2011-02-26 22:16:58.029 15CC0B90 warning 'Libs' opID=task-internal-4905-b765816c] [NFC ERROR] NfcBuf_Recv: session error 4
[2011-02-26 22:16:58.029 15CC0B90 warning 'Libs' opID=task-internal-4905-b765816c] [NFC ERROR] NfcServerLoop: failed to receive file data
[2011-02-26 22:16:58.153 15CC0B90 error 'App' opID=task-internal-4905-b765816c] [VPXNFCSERVER] Nfc server failed: File error -- NfcBufWrite: failed to write file
[2011-02-26 22:16:58.153 15CC0B90 verbose 'App' opID=task-internal-4905-b765816c] [VPXNFCSERVER] Closing NFC session
[2011-02-26 22:16:58.153 15CC0B90 info 'App' opID=task-internal-4905-b765816c] [VpxLRO] -- FINISH task-internal-4905 -- -- VpxNfcServerLro --
[2011-02-26 22:16:58.153 15CC0B90 info 'App' opID=task-internal-4905-b765816c] [VpxLRO] -- ERROR task-internal-4905 -- -- VpxNfcServerLro: vmodl.fault.SystemError:
Result:
(vmodl.fault.SystemError) {
dynamicType = <unset>,
faultCause = (vmodl.MethodFault) null,
reason = "Nfc server failed: File error -- NfcBufWrite: failed to write file
",
msg = "",
}
Args:
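For anyone digging through a similar vpxa.log, a small filter can pull out just the NFC warnings and errors in time order. This is only a rough sketch, assuming the line layout matches the snippet above (timestamp, thread ID, level, quoted component, optional opID); the regex and the `nfc_errors` name are illustrative, not anything VMware ships.

```python
import re

# Assumed layout, taken from the snippet above:
# [timestamp threadID level 'Component' opID=...] message
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<thread>[0-9A-Fa-f]+) "
    r"(?P<level>\w+) "
    r"'(?P<component>[^']+)'"
    r"(?: opID=(?P<opid>[^\]\s]+))?\] "
    r"(?P<message>.*)$"
)


def nfc_errors(lines):
    """Return (timestamp, level, message) for NFC warnings/errors, oldest first."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # continuation lines such as the fault dump are skipped
        if match.group("level") not in ("warning", "error"):
            continue
        if "nfc" not in match.group("message").lower():
            continue
        hits.append((match.group("ts"), match.group("level"), match.group("message")))
    # Timestamps in this format sort correctly as plain strings.
    return sorted(hits)
```

Feeding it the open log file (for example `nfc_errors(open("vpxa.log"))`, path being whatever applies on your host) gives a short list that is easier to line up against a Wireshark capture than the full verbose output.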
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
All NFC errors I had were DNS related (which VMware support never picked up either). Make sure you have full DNS resolution (PTR helps too) between vCenter, the ESX hosts hosting the LUN, and the client PC used to upload files to the datastore. If the SAN / client / vSphere host have multiple IPs / DNS entries, use hosts files.
My own knowledge base made public: http://open902.com
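A quick way to test the forward and reverse lookups suggested above is a short script run from the client PC and the vCenter server. This is only a sketch: `check_dns` is a hypothetical helper name, it relies on Python's standard `socket` module, and it compares only the short host name, so treat a mismatch as a hint to look closer rather than proof of a problem.

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname_ex, reverse=socket.gethostbyaddr):
    """Resolve hostname, then reverse-resolve each address.

    Returns a list of (ip, ptr_name, matches) tuples. The resolver
    functions are parameters so the check can be exercised without
    a live DNS server.
    """
    results = []
    try:
        _name, _aliases, addresses = forward(hostname)
    except OSError:
        return results  # the name does not resolve at all

    wanted = hostname.lower().rstrip(".").split(".")[0]
    for ip in addresses:
        try:
            ptr_name = reverse(ip)[0]
        except OSError:
            ptr_name = None  # no PTR record for this address
        matches = (
            ptr_name is not None
            and ptr_name.lower().rstrip(".").split(".")[0] == wanted
        )
        results.append((ip, ptr_name, matches))
    return results
```

Calling `check_dns("esx1.domain.com")` (using the example name from this thread) on each machine involved shows whether they all agree on the addresses and whether each address has a PTR record pointing back at the same host.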
-
slinuxuzer Member Posts: 665 ■■■■□□□□□□
OK, I can give that a try.
I have the following IPs for each host in my environment:
esx1 (service console)
esx1-vmotion
esx1-nfs
esx1-nfs2
esx1-serviceconsole2
Should each IP being used in my environment be assigned a DNS host (A) record and a pointer (PTR) record?
And when you say use a hosts file, do you mean a hosts file on my ESXi servers?
-
jibbajabba Member Posts: 4,317 ■■■■■■■■□□
It depends on your setup, i.e. are you using your own DNS, public DNS, etc.
Also, you have two NFS IPs. Are those two different servers, or does this mean it has a public and a private IP?
Service console 2: is that a public IP as well?
Plus, have you established the connection using the FQDN or the IP?
The reason I am asking is that if you have, say, a public and a private IP using the same hostname, you may end up with round-robin DNS, and if the name resolves to the public IP during the file transfer, you would have to make sure the ports are open on the firewall.
For now I assume the following (as an example):
esx1 (service console) : 192.168.1.1
esx1-vmotion 192.168.1.2
esx1-nfs 192.168.1.10
esx1-nfs2 217.73.12.23
esx1-serviceconsole2 217.73.12.25
What I would do now is add the following to the hosts file on your client PC and vCenter server (if you have one), and also on the ESX / NFS server:
192.168.1.1 esx1 esx1.domain.com
192.168.1.10 esx1-nfs esx1-nfs.domain.com
217.73.12.23 esx1-nfs2 esx1-nfs2.domain.com
If NFS2 is indeed a public IP, make sure you have the necessary ports open; if it is private, change the above accordingly.
The vMotion and second service console IPs do not have to be added to DNS for now. You could add the second service console IP using a different A record, i.e. esx1-1 esx1-1.domain.com etc.
The hosts file on Windows can be found here:
c:\windows\system32\drivers\etc\hosts
and on Linux (i.e. the NFS server) and also on the ESXi box:
/etc/hosts
ESXi seems to clear the hosts file sometimes after a reboot, but it is good enough for testing purposes.
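Since a single mistyped address in a hosts file quietly breaks resolution for that name, it can be worth sanity-checking the entries before copying them to several machines. A minimal sketch using Python's standard `ipaddress` module; `parse_hosts` is a made-up helper name, and it only checks that each line has a valid IP followed by at least one name.

```python
import ipaddress


def parse_hosts(text):
    """Split hosts-file text into valid entries and suspicious lines.

    Returns (entries, errors): entries is a list of (ip, [names]),
    errors is a list of (line_number, offending_first_field).
    """
    entries, errors = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        try:
            ip = ipaddress.ip_address(fields[0])
        except ValueError:
            errors.append((lineno, fields[0]))  # not a valid IPv4/IPv6 address
            continue
        if len(fields) < 2:
            errors.append((lineno, fields[0]))  # address with no host name
            continue
        entries.append((str(ip), fields[1:]))
    return entries, errors
```

Running it over the text of the file you are about to distribute, and fixing anything that lands in `errors`, takes a few seconds and rules out one more variable while chasing the NFC error.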
In general, make sure you read the best practices as well:
http://vmware.com/files/pdf/VMware_NFS_BestPractices_WP_EN.pdf
-
slinuxuzer Member Posts: 665 ■■■■□□□□□□
I have actually narrowed this down to the network card. I am using HP's NC375T card.
I changed over to the NC382i and, without any other changes, it worked. I moved the same cable to the new card, so all other network components were the same.
I have found another post on the NetApp communities where a user was having this issue with the same network card:
http://communities.netapp.com/thread/9127
I was basically seeing the same symptoms as him when running nfsstat on my filer. I have forwarded this information to NetApp and we are working to resolve the issue.
Edit: This all turned out to be an old driver that was bundled with ESXi. In January I downloaded the latest version of ESXi and it autodetected the card, but the driver was one revision out of date. I guess the next revision fixed the issue, because when I installed it, it worked like a champ.
Thanks again to everyone for all their help.