[plug] NFSv4 issues
Brad Campbell
brad at fnarfbargle.com
Tue Feb 22 09:19:53 AWST 2022
G'day all,
I have a relatively simple client/server system here with a central
server that exports a pile of stuff using NFSv4. No authentication.
This has been flawless since I upgraded from NFSv3 to NFSv4 some 8-10
years ago.
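For context it's plain sec=sys the whole way; a rough sketch of the
sort of /etc/exports entry involved (path and subnet here are
placeholders rather than my actual config):
/srv/export 192.168.1.0/24(rw,sync,no_subtree_check,sec=sys)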
After a "recentish" kernel update on the server, I've started to get
intermittent and hard to reproduce timeouts on the clients. No problem
on the server and all other clients remain responsive.
[ 2901.432422] nfs: server srv not responding, still trying
[ 2901.432423] nfs: server srv not responding, still trying
[ 2901.592410] nfs: server srv not responding, still trying
[ 2901.952426] nfs: server srv not responding, still trying
[ 2902.392426] nfs: server srv not responding, still trying
[ 2902.392432] nfs: server srv not responding, still trying
[ 2903.402412] nfs: server srv not responding, still trying
[ 2903.622411] nfs: server srv not responding, still trying
[ 2903.892413] nfs: server srv not responding, still trying
[ 2931.012132] nfs: server srv OK
[ 2931.012147] nfs: server srv OK
[ 2931.012220] nfs: server srv OK
[ 2931.012237] nfs: server srv OK
[ 2931.012243] nfs: server srv OK
[ 2931.012255] nfs: server srv OK
[ 2931.012285] nfs: server srv OK
[ 2931.012889] nfs: server srv OK
[ 2931.036638] nfs: server srv OK
[ 3129.162392] nfs: server srv not responding, still trying
[ 3129.162399] nfs: server srv not responding, still trying
[ 3129.702387] nfs: server srv not responding, still trying
[ 3130.262377] nfs: server srv not responding, still trying
[ 3130.412397] nfs: server srv not responding, still trying
[ 3130.482477] nfs: server srv not responding, still trying
[ 3130.912386] nfs: server srv not responding, still trying
[ 3130.912392] nfs: server srv not responding, still trying
[ 3131.412397] nfs: server srv not responding, still trying
[ 3131.912392] nfs: server srv not responding, still trying
[ 3157.574579] nfs: server srv OK
[ 3157.574654] nfs: server srv OK
[ 3157.574658] nfs: server srv OK
[ 3157.575214] nfs: server srv OK
[ 3157.575487] nfs: server srv OK
[ 3157.575496] nfs: server srv OK
[ 3157.575501] nfs: server srv OK
[ 3157.575977] nfs: server srv OK
[ 3157.631782] nfs: server srv OK
[ 3157.652340] nfs: server srv OK
[ 3176.012394] rpc_check_timeout: 1 callbacks suppressed
[ 3176.012407] nfs: server srv not responding, still trying
[ 3176.922393] nfs: server srv not responding, still trying
[ 3177.992389] nfs: server srv not responding, still trying
[ 3177.992393] nfs: server srv not responding, still trying
[ 3178.052380] nfs: server srv not responding, still trying
[ 3178.422382] nfs: server srv not responding, still trying
[ 3179.202386] nfs: server srv not responding, still trying
[ 3182.622375] nfs: server srv not responding, still trying
[ 3183.812376] nfs: server srv not responding, still trying
[ 3188.052371] nfs: server srv not responding, still trying
[ 3204.945036] call_decode: 1 callbacks suppressed
[ 3204.945051] nfs: server srv OK
[ 3204.945063] nfs: server srv OK
[ 3204.945176] nfs: server srv OK
[ 3204.945208] nfs: server srv OK
[ 3204.945224] nfs: server srv OK
[ 3204.945229] nfs: server srv OK
[ 3204.946453] nfs: server srv OK
[ 3205.035067] nfs: server srv OK
[ 3205.041453] nfs: server srv OK
[ 3205.048524] nfs: server srv OK
I do see this on the server when it happens:
[285997.760395] rpc-srv/tcp: nfsd: sent 509476 when sending 524392 bytes - shutting down socket
[286884.809688] rpc-srv/tcp: nfsd: sent 131768 when sending 266344 bytes - shutting down socket
So I know it's likely to be a network issue of some kind, but since it
also happens on a VM hosted on the same server it's not NIC related.
There are no firewall rules involved.
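One thing I plan to try next time a client wedges is watching the nfsd
TCP sockets on the server to see whether data is piling up in a send
queue, something along the lines of (run repeatedly or under watch;
the default port 2049 assumed):
ss -tno state established '( sport = :2049 or dport = :2049 )'
A growing Send-Q on the stuck client's connection would at least point
the finger at the transport rather than at nfsd itself.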
This happens on every client I've tried:
- A kvm VM on the server.
- My desktop
- My laptop
- A raspberry pi v4
The fault manifests as the affected client freezing all NFS I/O for
~10 minutes (the log example above was after mounting with -o timeo=10).
The clients are all running different kernels. I think it started
sometime after kernel 5.10.44 on the server, but I had a lot going on
at the time and my notes are "sparse".
Much reading suggests that a request from the client gets lost, and
things lock up until the client hits the timeout value and re-sends
the request. That is backed up by the behaviour changing when I change
the mount timeout value.
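For reference, timeo is in tenths of a second, so timeo=10 retries
after roughly a second instead of the usual 60s default for TCP
mounts; the test mount above was roughly (server path and mountpoint
are placeholders):
mount -t nfs4 -o timeo=10 srv:/export /mnt/srv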
My real problem is that to reproduce it I need to move a significant
amount of traffic over the NFS connection, which makes a packet trace
using tcpdump "a bit noisy".
If it were UDP I could understand a request going astray, but as it's
TCP I can only think it requires a connection drop/reconnect to lose
one, and I've not been able to capture that yet in a usable packet
trace.
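The plan for the next attempt is a rotating capture on the server so
the trace stays a manageable size and I can keep just the window
around a stall, something like (interface name is a placeholder):
tcpdump -i eth0 -s 256 -C 100 -W 20 -w /tmp/nfs.pcap port 2049
plus a second, much smaller capture that only keeps connection
setup/teardown so a drop/reconnect stands out:
tcpdump -i eth0 -w /tmp/nfs-syn-rst.pcap \
  'port 2049 and tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0'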
I'm building a test box to attempt to replicate it in a way I can
bisect, but as it can take an hour or two to manifest, that's going to
be a very slow burn if I can reproduce it on the test hardware at all.
Unfortunately the server is a production machine, so I'm looking for
ideas on how one might debug it. Yes, I've searched, but it's not a
common problem and there are potentially many causes. Any NFS gurus
here?
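For completeness, the only reasonably low-impact idea I've come up
with myself is turning up the kernel sunrpc/nfsd debug flags on the
server around a stall and seeing what falls out of dmesg, roughly:
rpcdebug -m rpc -s all
rpcdebug -m nfsd -s all
# ... reproduce the stall, then turn it back off ...
rpcdebug -m rpc -c
rpcdebug -m nfsd -c
but the log volume that generates on a busy server makes me a bit
nervous, hence the appeal for better ideas.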
Regards,
--
An expert is a person who has found out by his own painful
experience all the mistakes that one can make in a very
narrow field. - Niels Bohr