[plug] NFSv4 issues

Bill Kenworthy billk at iinet.net.au
Tue Feb 22 09:38:52 AWST 2022


A long shot ... DNS issues? I've seen a similar access pattern on a master/slave disk pair when one went flaky.
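
If you want to rule DNS in or out, something like this run on a client while it's wedged would show whether lookups for the server stay fast and consistent (the name "srv" is just a placeholder for whatever name is in the mount):

    #!/usr/bin/env python3
    # Rough sketch: repeatedly resolve the NFS server's name and time each lookup.
    # "srv" is a placeholder for the name used in the mount; adjust to suit.
    import socket
    import time

    HOST = "srv"

    while True:
        start = time.time()
        try:
            addrs = sorted({ai[4][0] for ai in socket.getaddrinfo(HOST, 2049)})
            print(f"{time.strftime('%H:%M:%S')} {HOST} -> {addrs} "
                  f"({time.time() - start:.3f}s)")
        except socket.gaierror as err:
            print(f"{time.strftime('%H:%M:%S')} lookup failed after "
                  f"{time.time() - start:.3f}s: {err}")
        time.sleep(5)

If the addresses flip around or the lookups stall during a hang, DNS is worth chasing; if they stay rock solid it's probably not that.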

BillK


On 22 February 2022 9:19:53 am AWST, Brad Campbell <brad at fnarfbargle.com> wrote:
>G'day all,
>
>I have a relatively simple client/server system here with a central 
>server that exports a pile of stuff using NFSv4. No authentication.
>
>This has been flawless since I upgraded from NFSv3 to NFSv4 some 8-10 years ago.
>After a "recentish" kernel update on the server, I've started to get 
>intermittent and hard to reproduce timeouts on the clients. No problem 
>on the server and all other clients remain responsive.
>
>[ 2901.432422] nfs: server srv not responding, still trying
>[ 2901.432423] nfs: server srv not responding, still trying
>[ 2901.592410] nfs: server srv not responding, still trying
>[ 2901.952426] nfs: server srv not responding, still trying
>[ 2902.392426] nfs: server srv not responding, still trying
>[ 2902.392432] nfs: server srv not responding, still trying
>[ 2903.402412] nfs: server srv not responding, still trying
>[ 2903.622411] nfs: server srv not responding, still trying
>[ 2903.892413] nfs: server srv not responding, still trying
>[ 2931.012132] nfs: server srv OK
>[ 2931.012147] nfs: server srv OK
>[ 2931.012220] nfs: server srv OK
>[ 2931.012237] nfs: server srv OK
>[ 2931.012243] nfs: server srv OK
>[ 2931.012255] nfs: server srv OK
>[ 2931.012285] nfs: server srv OK
>[ 2931.012889] nfs: server srv OK
>[ 2931.036638] nfs: server srv OK
>[ 3129.162392] nfs: server srv not responding, still trying
>[ 3129.162399] nfs: server srv not responding, still trying
>[ 3129.702387] nfs: server srv not responding, still trying
>[ 3130.262377] nfs: server srv not responding, still trying
>[ 3130.412397] nfs: server srv not responding, still trying
>[ 3130.482477] nfs: server srv not responding, still trying
>[ 3130.912386] nfs: server srv not responding, still trying
>[ 3130.912392] nfs: server srv not responding, still trying
>[ 3131.412397] nfs: server srv not responding, still trying
>[ 3131.912392] nfs: server srv not responding, still trying
>[ 3157.574579] nfs: server srv OK
>[ 3157.574654] nfs: server srv OK
>[ 3157.574658] nfs: server srv OK
>[ 3157.575214] nfs: server srv OK
>[ 3157.575487] nfs: server srv OK
>[ 3157.575496] nfs: server srv OK
>[ 3157.575501] nfs: server srv OK
>[ 3157.575977] nfs: server srv OK
>[ 3157.631782] nfs: server srv OK
>[ 3157.652340] nfs: server srv OK
>[ 3176.012394] rpc_check_timeout: 1 callbacks suppressed
>[ 3176.012407] nfs: server srv not responding, still trying
>[ 3176.922393] nfs: server srv not responding, still trying
>[ 3177.992389] nfs: server srv not responding, still trying
>[ 3177.992393] nfs: server srv not responding, still trying
>[ 3178.052380] nfs: server srv not responding, still trying
>[ 3178.422382] nfs: server srv not responding, still trying
>[ 3179.202386] nfs: server srv not responding, still trying
>[ 3182.622375] nfs: server srv not responding, still trying
>[ 3183.812376] nfs: server srv not responding, still trying
>[ 3188.052371] nfs: server srv not responding, still trying
>[ 3204.945036] call_decode: 1 callbacks suppressed
>[ 3204.945051] nfs: server srv OK
>[ 3204.945063] nfs: server srv OK
>[ 3204.945176] nfs: server srv OK
>[ 3204.945208] nfs: server srv OK
>[ 3204.945224] nfs: server srv OK
>[ 3204.945229] nfs: server srv OK
>[ 3204.946453] nfs: server srv OK
>[ 3205.035067] nfs: server srv OK
>[ 3205.041453] nfs: server srv OK
>[ 3205.048524] nfs: server srv OK
>
>I do see this on the server when it happens:
>[285997.760395] rpc-srv/tcp: nfsd: sent 509476 when sending 524392 bytes - shutting down socket
>[286884.809688] rpc-srv/tcp: nfsd: sent 131768 when sending 266344 bytes - shutting down socket
>
>So I know it's likely to be a network issue of some kind, but as it also
>happens on a VM hosted on the same server, it's not NIC related. There
>are no firewall rules involved.
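
For what it's worth, correlating those server-side messages with the client-side hangs is easy to script. A rough sketch that follows the server's kernel log and timestamps each occurrence (needs root; the message text is assumed to match the lines quoted above):

    #!/usr/bin/env python3
    # Rough sketch: follow the kernel log on the server and print a wall-clock
    # timestamp for every "shutting down socket" event from nfsd, so the events
    # can be lined up against the clients' "not responding" messages.
    import re
    import subprocess
    import time

    PATTERN = re.compile(r"rpc-srv/tcp: nfsd: sent \d+ when sending \d+ bytes")

    with subprocess.Popen(["dmesg", "--follow"], stdout=subprocess.PIPE,
                          text=True) as proc:
        for line in proc.stdout:
            if PATTERN.search(line):
                print(time.strftime("%Y-%m-%d %H:%M:%S"), line.strip(), flush=True)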
>
>This happens to all clients. So far I have tried:
>- A kvm VM on the server.
>- My desktop
>- My laptop
>- A Raspberry Pi 4
>
>The fault manifests as the affected client freezing all NFS I/O for
>~10 minutes (the log example above was taken after mounting with -o timeo=10).
>
>All of the clients run different kernels. I think it started sometime after
>kernel 5.10.44 on the server, but I had a lot going on at the time and my
>notes are "sparse".
>
>Much reading suggests that a request from the client gets lost, and things
>lock up until the client hits the timeout value and re-sends the request.
>That's borne out by the behaviour changing when I change the mount timeout
>value.
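
For what it's worth, my reading of nfs(5) is that over TCP the client uses linear backoff (each retransmission waits an extra timeo, capped at 600 seconds) and logs "server not responding" after retrans retries, with a hard mount then just carrying on. A toy calculation of that schedule, purely to illustrate what the mount options imply (this models the man page description, not the kernel):

    #!/usr/bin/env python3
    # Toy model of the NFS-over-TCP retransmit schedule as described in nfs(5):
    # linear backoff (each retry waits `timeo` longer than the previous one;
    # the 600-second ceiling is ignored here) with "server not responding"
    # logged after `retrans` retries.  Illustration only, not kernel code.

    def schedule(timeo_ds, retrans, retries=5):
        wait = timeo_ds / 10.0              # timeo is given in deciseconds
        total = 0.0
        for n in range(1, retries + 1):
            total += wait
            note = "  <- 'server not responding'" if n == retrans else ""
            print(f"  retry {n}: after {wait:5.1f}s (cumulative {total:6.1f}s){note}")
            wait += timeo_ds / 10.0         # linear backoff: add timeo each time

    print("timeo=10,retrans=2 (as in the log above):")
    schedule(10, 2)
    print("timeo=600,retrans=2 (the TCP defaults):")
    schedule(600, 2)

Shrinking timeo just shortens how long each stall lasts before the retry goes out, which matches the behaviour described above.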
>
>My real problem is that to reproduce it I need to move a significant amount
>of traffic over the NFS connection, and that makes a packet trace using
>tcpdump "a bit noisy".
>
>If it were UDP then I could understand a request going astray, but as it's
>TCP I can only think it takes a connection drop/reconnect to do that, and
>I've not yet been able to capture one in a usable packet trace.
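
One way to make that tractable is a ring-buffer capture restricted to the NFS port, so only the couple of files spanning a hang ever need to be kept. A rough sketch (interface name, sizes and path are placeholders; needs root):

    #!/usr/bin/env python3
    # Rough sketch: run tcpdump in a ring buffer (10 files x 100 MB) limited to
    # NFS traffic.  When a client logs "not responding", note the wall-clock time
    # and keep only the rotated file(s) covering that window.
    import subprocess

    subprocess.run([
        "tcpdump",
        "-i", "eth0",                     # interface carrying the NFS traffic (placeholder)
        "-s", "256",                      # snaplen: headers plus a little RPC payload
        "-C", "100",                      # rotate the output file every ~100 MB
        "-W", "10",                       # keep at most 10 files (the ring buffer)
        "-w", "/var/tmp/nfs-trace.pcap",  # base name for the rotated files
        "port", "2049",                   # capture filter: NFSv4 only needs port 2049
    ], check=True)

That keeps the trace small enough that a connection reset or retransmission burst around a hang should stand out in wireshark.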
>
>I'm building a test box to try to replicate it in a way I can bisect, but
>as it can take an hour or two to manifest, that's going to be a very slow
>burn, assuming I can reproduce it on the test hardware at all.
>
>Unfortunately the server is a production machine, so I'm looking for ideas
>on how one might debug it. Yes, I've searched, but it's not a common problem
>and there are potentially many causes. Any NFS gurus here?
>
>Regards,
>-- 
>An expert is a person who has found out by his own painful
>experience all the mistakes that one can make in a very
>narrow field. - Niels Bohr
>_______________________________________________
>PLUG discussion list: plug at plug.org.au
>http://lists.plug.org.au/mailman/listinfo/plug
>Committee e-mail: committee at plug.org.au
>PLUG Membership: http://www.plug.org.au/membership

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.