[plug] NFSv4 issues
brad at fnarfbargle.com
Tue Feb 22 18:58:45 AWST 2022
On 22/2/22 6:43 pm, Dean Bergin wrote:
> If you've been able to replicate the issue with bare metal clients, then the issue likely rests at the server side. The spurious TCP events in the logs might suggest you have a network problem.
Yeah, the fault appears to have started after a kernel update on the server last year. The problem is it took me a few months to notice it and I'm not entirely sure exactly when it started.
It looks like a network issue to me as well. The thing that gets me is the server has 2 bridge devices. One carries the external network via part of an Intel e1000-based NIC, and the other is a gaggle of tun devices for the KVM guests. The issue occurs the same on both, so that would appear to rule out the physical devices or drivers.
> My first guess is that you may have fragmentation causing packet loss and latency affecting NFS (not sure if that's a correct diagnosis as NFS is TCP which should be able to handle a fair amount of retransmissions, but I could be wrong).
Hadn't thought of that. It gives me some knobs to tweak anyway. I spent quite a bit of time tuning a 4 port bond between 2 machines, so I have a bit of familiarity with the controls available in the stack.
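One cheap way to test the fragmentation / retransmission theory before touching any knobs is to snapshot the kernel's own counters before and after a big transfer. What follows is only a rough sketch, assuming a Linux box where /proc/net/snmp is readable; the choice of counters in INTERESTING is my own guess at what is relevant here, not a definitive list.

```python
#!/usr/bin/env python3
"""Print TCP retransmission and IP fragmentation counters (Linux only)."""
import os

SNMP_PATH = "/proc/net/snmp"  # standard Linux procfs location

# Assumption: these are the counters worth watching for this fault.
INTERESTING = {
    "Tcp": ["OutSegs", "RetransSegs", "InErrs"],
    "Ip": ["ReasmReqds", "ReasmFails", "FragOKs", "FragFails", "FragCreates"],
}


def parse_snmp(text):
    """Parse /proc/net/snmp text into {protocol: {counter: value}}.

    The file is made of line pairs: a header line naming the counters,
    then a line of values carrying the same protocol prefix.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    table = {}
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, numbers = values.split(":", 1)
        if proto != value_proto:
            continue  # not a matching header/value pair, skip it
        table[proto] = dict(
            zip(names.split(), (int(n) for n in numbers.split()))
        )
    return table


def pick(table):
    """Return only the counters listed in INTERESTING, as 'Proto.Name'."""
    return {
        f"{proto}.{name}": table[proto][name]
        for proto, names in INTERESTING.items()
        for name in names
        if proto in table and name in table[proto]
    }


if __name__ == "__main__":
    if os.path.exists(SNMP_PATH):
        with open(SNMP_PATH) as fh:
            for key, value in pick(parse_snmp(fh.read())).items():
                print(f"{key:20} {value}")
    else:
        print(f"{SNMP_PATH} not found (not a Linux host?)")
```

Run it once before the transfer and once after the stall shows up, then compare. If RetransSegs climbs sharply relative to OutSegs, or ReasmFails / FragFails move at all, that would point the same way as the guess above. If they stay flat, the problem is probably elsewhere.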
> Also, in my line of work I often see issues with MTU causing all kinds of problems. Check that too and let us know.
Definitely not MTU. All devices on both the physical and VM networks have a stock MTU of 1500.
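For anyone wanting to double check the same thing on their own box, here is a minimal sketch that lists every interface whose MTU differs from the expected value. It assumes Linux sysfs (/sys/class/net/<iface>/mtu); the function names are made up for illustration.

```python
#!/usr/bin/env python3
"""List interfaces whose MTU differs from an expected value (Linux sysfs)."""
import glob
import os


def read_mtus(sys_net="/sys/class/net"):
    """Return {interface: mtu} for every interface under sys_net."""
    mtus = {}
    for path in glob.glob(os.path.join(sys_net, "*", "mtu")):
        iface = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as fh:
                mtus[iface] = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed entry, skip it
    return mtus


def odd_ones(mtus, expected=1500):
    """Return only the interfaces whose MTU is not the expected value."""
    return {iface: mtu for iface, mtu in mtus.items() if mtu != expected}


if __name__ == "__main__":
    # The loopback device normally has a much larger MTU, so expect
    # "lo" to show up here even on a healthy machine.
    for iface, mtu in sorted(odd_ones(read_mtus()).items()):
        print(f"{iface}: {mtu}")
```

This only confirms what each interface is configured with on the one host, bridges and tun devices included. It says nothing about what the path in between actually passes.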
> A packet capture might help get the evidence to suggest what's going on by following the conversation from a network perspective at least.
Yeah, I'm trying to get that in a viable manner. The problem is I need to push several (up to several hundred) GB through the filesystem to get it to happen, which makes the captures "large". I need to get my head into Wireshark / tcpdump beyond the "switch it on and watch the lines scroll past" I'm used to.
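One possible way to keep the capture size manageable is a ring buffer of fixed-size files holding truncated packets, filtered to NFS traffic only. The sketch below just builds and prints such a tcpdump command line; it does not run it. The interface name "br0", the output file name and the sizes are placeholders I picked, not values from this setup. The flags themselves (-i, -s, -C, -W, -w) are standard tcpdump options, and 2049 is the usual NFS port.

```python
#!/usr/bin/env python3
"""Build a tcpdump command line for a bounded, rotating NFS capture."""
import shlex


def capture_cmd(iface, out="nfs.pcap", snaplen=256, file_mb=100,
                files=20, port=2049):
    """Return the argv list for a ring-buffer capture.

    -s  truncates each packet to snaplen bytes (headers, not file data)
    -C  starts a new output file after roughly file_mb million bytes
    -W  caps the number of files, so the oldest is overwritten
    """
    return [
        "tcpdump",
        "-i", iface,
        "-s", str(snaplen),
        "-C", str(file_mb),
        "-W", str(files),
        "-w", out,
        "port", str(port),
    ]


if __name__ == "__main__":
    # "br0" is a placeholder; substitute the bridge being tested.
    print(shlex.join(capture_cmd("br0")))
```

With these defaults the capture tops out at about 2 GB on disk no matter how long the transfer runs, and whatever is left when the stall happens covers the most recent traffic. Whether 256 bytes per packet is enough to see the interesting part of the NFS conversation is a guess; raise the snaplen if the headers come out truncated.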
Appreciate the insight.