[plug] NFSv4 issues
Brad Campbell
brad at fnarfbargle.com
Tue Feb 22 09:19:53 AWST 2022
G'day all,
I have a relatively simple client/server system here with a central
server that exports a pile of stuff using NFSv4. No authentication.
This has been flawless since I upgraded from NFSv3 to NFSv4 some 8-10
years ago.
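For context it's plain sec=sys the whole way; a rough sketch of the
sort of /etc/exports entry involved (path and subnet here are
placeholders rather than my actual config):
/srv/export 192.168.1.0/24(rw,sync,no_subtree_check,sec=sys)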
After a "recentish" kernel update on the server, I've started to get
intermittent and hard to reproduce timeouts on the clients. No problem
on the server and all other clients remain responsive.
[ 2901.432422] nfs: server srv not responding, still trying
[ 2901.432423] nfs: server srv not responding, still trying
[ 2901.592410] nfs: server srv not responding, still trying
[ 2901.952426] nfs: server srv not responding, still trying
[ 2902.392426] nfs: server srv not responding, still trying
[ 2902.392432] nfs: server srv not responding, still trying
[ 2903.402412] nfs: server srv not responding, still trying
[ 2903.622411] nfs: server srv not responding, still trying
[ 2903.892413] nfs: server srv not responding, still trying
[ 2931.012132] nfs: server srv OK
[ 2931.012147] nfs: server srv OK
[ 2931.012220] nfs: server srv OK
[ 2931.012237] nfs: server srv OK
[ 2931.012243] nfs: server srv OK
[ 2931.012255] nfs: server srv OK
[ 2931.012285] nfs: server srv OK
[ 2931.012889] nfs: server srv OK
[ 2931.036638] nfs: server srv OK
[ 3129.162392] nfs: server srv not responding, still trying
[ 3129.162399] nfs: server srv not responding, still trying
[ 3129.702387] nfs: server srv not responding, still trying
[ 3130.262377] nfs: server srv not responding, still trying
[ 3130.412397] nfs: server srv not responding, still trying
[ 3130.482477] nfs: server srv not responding, still trying
[ 3130.912386] nfs: server srv not responding, still trying
[ 3130.912392] nfs: server srv not responding, still trying
[ 3131.412397] nfs: server srv not responding, still trying
[ 3131.912392] nfs: server srv not responding, still trying
[ 3157.574579] nfs: server srv OK
[ 3157.574654] nfs: server srv OK
[ 3157.574658] nfs: server srv OK
[ 3157.575214] nfs: server srv OK
[ 3157.575487] nfs: server srv OK
[ 3157.575496] nfs: server srv OK
[ 3157.575501] nfs: server srv OK
[ 3157.575977] nfs: server srv OK
[ 3157.631782] nfs: server srv OK
[ 3157.652340] nfs: server srv OK
[ 3176.012394] rpc_check_timeout: 1 callbacks suppressed
[ 3176.012407] nfs: server srv not responding, still trying
[ 3176.922393] nfs: server srv not responding, still trying
[ 3177.992389] nfs: server srv not responding, still trying
[ 3177.992393] nfs: server srv not responding, still trying
[ 3178.052380] nfs: server srv not responding, still trying
[ 3178.422382] nfs: server srv not responding, still trying
[ 3179.202386] nfs: server srv not responding, still trying
[ 3182.622375] nfs: server srv not responding, still trying
[ 3183.812376] nfs: server srv not responding, still trying
[ 3188.052371] nfs: server srv not responding, still trying
[ 3204.945036] call_decode: 1 callbacks suppressed
[ 3204.945051] nfs: server srv OK
[ 3204.945063] nfs: server srv OK
[ 3204.945176] nfs: server srv OK
[ 3204.945208] nfs: server srv OK
[ 3204.945224] nfs: server srv OK
[ 3204.945229] nfs: server srv OK
[ 3204.946453] nfs: server srv OK
[ 3205.035067] nfs: server srv OK
[ 3205.041453] nfs: server srv OK
[ 3205.048524] nfs: server srv OK
I do see this on the server when it happens:
[285997.760395] rpc-srv/tcp: nfsd: sent 509476 when sending 524392 bytes - shutting down socket
[286884.809688] rpc-srv/tcp: nfsd: sent 131768 when sending 266344 bytes - shutting down socket
So I know it's likely to be a network issue of some kind, but since it
also happens on a VM hosted on the same server it's not NIC related.
There are no firewall rules involved.
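One thing I plan to try next time a client wedges is watching the nfsd
TCP sockets on the server to see whether data is piling up in a send
queue, something along the lines of (run repeatedly or under watch;
the default port 2049 assumed):
ss -tno state established '( sport = :2049 or dport = :2049 )'
A growing Send-Q on the stuck client's connection would at least point
the finger at the transport rather than at nfsd itself.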
This happens on every client I've tried:
- A kvm VM on the server.
- My desktop
- My laptop
- A raspberry pi v4
The fault manifests as the affected client freezing all NFS I/O for
~10 minutes (the log example above was after mounting with -o timeo=10).
The clients are all running different kernels. I think it started
sometime after kernel 5.10.44 on the server, but I had a lot going on
at the time and my notes are "sparse".
Much reading suggests that a request from the client gets lost, and
things lock up until the client hits the timeout value and re-sends
the request. That is backed up by the behaviour changing when I change
the mount timeout value.
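For reference, timeo is in tenths of a second, so timeo=10 retries
after roughly a second instead of the usual 60s default for TCP
mounts; the test mount above was roughly (server path and mountpoint
are placeholders):
mount -t nfs4 -o timeo=10 srv:/export /mnt/srv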
My real problem is that to reproduce it I need to move a significant
amount of traffic over the NFS connection, which makes a packet trace
using tcpdump "a bit noisy".
If it were UDP I could understand a request going astray, but as it's
TCP I can only think it requires a connection drop/reconnect to lose
one, and I've not been able to capture that yet in a usable packet
trace.
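The plan for the next attempt is a rotating capture on the server so
the trace stays a manageable size and I can keep just the window
around a stall, something like (interface name is a placeholder):
tcpdump -i eth0 -s 256 -C 100 -W 20 -w /tmp/nfs.pcap port 2049
plus a second, much smaller capture that only keeps connection
setup/teardown so a drop/reconnect stands out:
tcpdump -i eth0 -w /tmp/nfs-syn-rst.pcap \
  'port 2049 and tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0'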
I'm building a test box to attempt to replicate it in a way I can
bisect, but as it can take an hour or two to manifest, that's going to
be a very slow burn if I can reproduce it on the test hardware at all.
Unfortunately the server is a production machine, so I'm looking for
ideas on how one might debug it. Yes, I've searched, but it's not a
common problem and there are potentially many causes. Any NFS gurus
here?
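For completeness, the only reasonably low-impact idea I've come up
with myself is turning up the kernel sunrpc/nfsd debug flags on the
server around a stall and seeing what falls out of dmesg, roughly:
rpcdebug -m rpc -s all
rpcdebug -m nfsd -s all
# ... reproduce the stall, then turn it back off ...
rpcdebug -m rpc -c
rpcdebug -m nfsd -c
but the log volume that generates on a busy server makes me a bit
nervous, hence the appeal for better ideas.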
Regards,
--
An expert is a person who has found out by his own painful
experience all the mistakes that one can make in a very
narrow field. - Niels Bohr