[plug] Routing with nonat - ssh tunnel and port forwarding

Fri Mar 20 09:44:08 WST 2009

Mike Holland <michael.holland at gmail.com> writes:
> Daniel Pittman wrote:
>
>> Specifically, what happens is that when a packet is lost the outside
>> TCP connection ensures it is retransmitted ... but, so too does the
>> inner connection.
>
> I find that surprising.

That is fair: TCP, IP, and network protocol design are amazingly
complicated and are full of surprises.  Even after many, many years
working relatively intimately with them I am still surprised by some of
the behaviours, too.

> TCP doesn't have any mechanism to cope with that?

No.  Specifically, there is absolutely *NO* way that TCP could possibly
ever be designed to cope with that *and* retain the current design
feature of keeping state only at the endpoints.

> Some way to make the inner TCP connection back-off it's retransmit
> intervals more quickly than the outer?

No.  A simple example is where the TCP connection runs between two
machines, but the TCP tunnel is managed by two different machines.

How do the originating machine and the router, in between them,
communicate about the TCP retransmission timers that are implemented in
the OS?

That information is not carried with the packet, the two systems can be
from completely different vendors, and the originating machine has
*zero* knowledge about the path between the two.[1]

There is also nothing in the protocol, unlike PMTU, that can convey this
information back within the standards.  So, implementing this would
require a completely new, and very low level, protocol be designed,
standardized, tested and implemented.

Also, notably, I am pretty sure it would introduce various denial of
service vulnerabilities, through faked packets instructing arbitrary TCP
connections to back off retransmission to huge periods of time...

> And would you need a very high packet-loss rate before the problem
> "goes critical"?

The level is actually surprisingly low, because there is one other
factor that plays into this: the clocks.

Specifically, the clock that your OS uses to drive retransmission is
surprisingly non-random.

The speed / delay hierarchy of your computer and OS also tends to smush
closely related events into the same or subsequent clock ticks.

So, usually the retransmission timers for the inner and outer TCP
sessions on a single machine are in the same or two adjacent clock
ticks, making them quite well synchronized.
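
As a rough illustration of that tick effect, here is a minimal Python
sketch.  Everything numeric in it is an assumption picked for the
example, not something measured: a 4 ms clock tick, a 200 ms
retransmission timeout shared by both layers, and up to 1 ms of delay
between the outer and inner layer noticing the loss.  The function
name is hypothetical too.

```python
import random


def timer_tick_alignment(tick_ms=4.0, rto_ms=200.0, skew_ms=1.0,
                         trials=10_000, seed=1):
    """Return (same, adjacent): the fraction of trials in which the inner
    and outer retransmit timers fire in the same tick, or in adjacent ticks.

    All parameters are illustrative assumptions, not measured values.
    """
    rng = random.Random(seed)
    same = adjacent = 0
    for _ in range(trials):
        # Both layers start their timers at almost the same moment.
        start = rng.uniform(0.0, 1000.0)
        outer_deadline = start + rto_ms
        inner_deadline = start + rng.uniform(0.0, skew_ms) + rto_ms
        # A timer can only fire when the clock ticks, so each deadline
        # is effectively rounded up to the next tick boundary.
        outer_tick = int(outer_deadline // tick_ms) + 1
        inner_tick = int(inner_deadline // tick_ms) + 1
        gap = abs(inner_tick - outer_tick)
        if gap == 0:
            same += 1
        elif gap == 1:
            adjacent += 1
    return same / trials, adjacent / trials


if __name__ == "__main__":
    same, adjacent = timer_tick_alignment()
    print(f"same tick: {same:.1%}, adjacent tick: {adjacent:.1%}")
```

Under those assumptions the skew is smaller than one tick, so the two
timers can never land more than one tick apart, and most of the time
they land in the very same tick.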

Even on two distinct machines there is a fair degree of
synchronization, since the tunnel is usually only one machine away and
routing of packets normally happens in less than one clock tick
today.[2]

The two machines, sharing the same physical network link, also share
some common clocking through the network cards, which need a common
clock disciplined to the network to be able to communicate efficiently.

One of the surprises I mentioned above is something I learned
relatively recently: this same problem shows up on the Internet
backbone as traffic levels approach saturation:

Because, statistically, all the machines out there share various
relatively common clocking sources, especially over networks, they tend
to synchronize.

So, over the backbone you start to see a sawtooth shaped utilization
graph as things get closer to busy: everyone sends data, packets drop,
everyone backs off, the link has a significant idle period[3], everyone
retransmits and congestion shoots through the roof...
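
The sawtooth can be reproduced with a toy model.  This is only a
sketch under loud assumptions: every sender uses a simple
additive-increase, halve-on-loss rule, the link drops from everyone
at once when it is overloaded, and the sender count, capacity and
step count are made-up numbers.  The function name is hypothetical.

```python
def synchronized_sawtooth(senders=50, capacity=1000.0, steps=200):
    """Return link utilization (0.0 to 1.0) per time step for a toy model
    in which every sender reacts to loss in the same step.

    Rates are in arbitrary units; all defaults are assumptions.
    """
    rates = [capacity / senders / 2.0] * senders  # start at half load
    utilization = []
    for _ in range(steps):
        offered = sum(rates)
        # The link cannot carry more than its capacity.
        utilization.append(min(offered, capacity) / capacity)
        if offered > capacity:
            # Overload: every flow sees drops in the same step,
            # so every flow halves its rate together.
            rates = [r / 2.0 for r in rates]
        else:
            # No loss: every flow probes for a little more bandwidth.
            rates = [r + 1.0 for r in rates]
    return utilization


if __name__ == "__main__":
    for step, u in enumerate(synchronized_sawtooth(steps=40)):
        print(f"{step:3d} {'#' * int(u * 40)}")
```

Because all the senders back off in lockstep, utilization climbs to
the limit, falls to roughly half, and climbs again, rather than
settling near full.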

Part of the current design of backbone router queue management, such
as the "random early drop" (RED, also called random early detection)
model, is to try and smooth out network traffic flows despite the
significant synchronization of computer clocks...
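
For the curious, the core of the random early drop idea fits in a few
lines.  This is a simplified sketch of the basic drop curve only: the
thresholds and maximum probability are illustrative assumptions, the
function names are hypothetical, and a real router also smooths the
queue length over time and spaces its drops out, which is omitted
here.

```python
import random


def red_drop_probability(avg_queue, min_th=5.0, max_th=15.0, max_p=0.1):
    """Drop probability for a given average queue length (in packets).

    Below min_th nothing is dropped, at or above max_th everything is,
    and in between the probability rises linearly up to max_p.
    Threshold values are assumptions chosen for the example.
    """
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)


def should_drop(avg_queue, rng=random):
    """Randomly decide whether to drop one arriving packet."""
    return rng.random() < red_drop_probability(avg_queue)


if __name__ == "__main__":
    for q in (2, 6, 10, 14, 20):
        print(q, red_drop_probability(q))
```

The point of the randomness is that different flows get their drops
at different moments as the queue fills, so they back off at
different times instead of all at once.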

The network, it is a very, very strange place.

Regards,
        Daniel

Footnotes: 
[1]  Technically, PMTU discovery now means that it has one word of
     information about the path, which is the maximum PDU that can be
     transmitted without fragmentation across all machines in the path.

[2]  My laptop, which is not that impressively powerful, has around
     5.2 million CPU cycles to process the packet between individual
     clock ticks.  That covers a lot of processing...

[3]  Fractions of a second, but when you are talking about 10Gbit/second
     or faster link speeds...
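
The arithmetic behind footnotes [2] and [3] can be sketched as
follows.  The clock speed and tick rate are assumptions: the post
does not state them, and 1.3 GHz with a 250 Hz tick is just one
combination that yields about 5.2 million cycles.  The 100 ms idle
period is likewise an assumed stand-in for "fractions of a second",
and both function names are hypothetical.

```python
def cycles_per_tick(cpu_hz, tick_hz):
    """CPU cycles available between two consecutive clock ticks."""
    return cpu_hz / tick_hz


def idle_capacity_bytes(link_bits_per_second, idle_ms):
    """Bytes a link could have carried during an idle period."""
    idle_bits = link_bits_per_second * idle_ms // 1000
    return idle_bits // 8


if __name__ == "__main__":
    # Assumed: 1.3 GHz CPU, 250 Hz kernel tick.
    print(cycles_per_tick(1_300_000_000, 250))
    # Assumed: 10 Gbit/second link idle for 100 ms.
    print(idle_capacity_bytes(10_000_000_000, 100))
```

With those assumed inputs, a 100 ms pause on a 10 Gbit/second link
is 125 million bytes of capacity that went unused.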