[plug] OT: Pool of IPs for testing load-balanced connections

Fri Sep 4 09:18:05 WST 2009

Simon Newton wrote:
> On Wed, Sep 2, 2009 at 8:47 PM, Adrian Woodley<Adrian at diskworld.com.au> wrote:
>   
>> Having worked at an ISP, I know all about "High Availability" and the
>> likelihood of any given IP being actually available, especially when
>> accessed off-net. :P
>>
>> The way the load-balancing works, it sets a static route for the target IP
>> via the gateway being tested. Traffic for this test IP will always be
>> directed out that gateway, even when other connections are available or the
>> gateway being tested is down. This means that a common/important IP can't be
>> used (ie google.com's A record).
>>
>> Because each router connected to the LINC will be doing its own load
>> balancing, with its own static routes for the test IPs, each router will
>> need a unique IP to test against.
>>
>> For example, there are two routers, A and B. Both use 203.0.178.191 to test
>> for internet connectivity via the LINC. If router A is elected the default
>> gateway for the LINC, it will still route all connections for 203.0.178.191
>> back out via the LINC (static route). Test pings from router B will hit
>> router A and loop until the TTL is hit. In this scenario, router B will
>> never decide that the LINC gateway is available.
>>
>> If routers A and B have unique test IPs for testing for Internet
>> connectivity via the LINC, the test ping will hit the opposing router and be
>> directed out its current gateway, thereby establishing the validity of the
>> connection.
>>     
>
> I find whenever I'm faced with an odd looking problem like "find
> unique, highly available public IPs that aren't used by users" it's
> best to take a step back and look at what problem you're actually
> trying to solve. For each pair of machines (A, B) there are the
> following questions:
>
> i) Can machine A reach the internet?
> ii) Can machine B reach machine A?
> iii) Can machine B reach the internet through machine A?
>
> I'd assert that if i) and ii) are true, iii) should also be true (I'm
> assuming here that all machines are under your control). If it's not,
> it's a config issue which should occur infrequently enough that we
> don't mind dropping user traffic until it's detected. With that in
> mind we only have to solve i) and ii).
>
> ii) is trivial.
> i) is also reasonably simple. You can open a data link socket and
> write your own headers to force the health check out the internet link.
> If you don't want to do this you could also mark the health check
> packets with iptables and then use custom routing rules to send them
> out the external interface. Each machine then knows its own health and
> allows others to connect to it and query its health state.
>
> Once that's in place, each machine can build a list of candidate
> gateways by attempting to connect to every other and get the health
> information. You'll need to tweak the health check interval and
> thresholds to strike a balance between convergence and stability.
>
>
> Simon N
G'day Simon,

I've drawn up a quick diagram to show the current design.

http://dump.diskworld.com.au/Incident_Management_Network.png

The routers are all running pfSense[1] on Yawarra[2] embedded routers.

All routers which have their own Internet connection, NextG or Satellite 
(or other), participate in the CARP floating IP. This IP will land on 
the router with the highest priority; Pantec = 10, Satellite = 100, 
NextG = 200.
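
As a rough illustration of where the floating IP lands, here is a small
Python sketch. It assumes the numbers above behave like CARP advskew
values, where the lowest number wins the election; if they are meant the
other way round, swap min for max. The function name and the dictionary
layout are made up for this example and are not pfSense configuration.

```python
def floating_ip_holder(skews):
    """Return the router that should hold the floating IP.

    skews maps router name -> priority number for every router currently
    on the LINC. Assumption: lowest number wins, as with CARP advskew.
    """
    if not skews:
        return None  # no participating routers, so no floating IP holder
    return min(skews, key=skews.get)


on_linc = {"pantec": 10, "satellite": 100, "nextg": 200}
print(floating_ip_holder(on_linc))

# If the pantec leaves the network, the next-best router takes over.
del on_linc["pantec"]
print(floating_ip_holder(on_linc))
```

Note that nothing in this election looks at whether the winner's WAN
link is actually up.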

The routers in the buses and pantec[3] each have two WAN interfaces: one 
via the local connection (ie NextG) and one via the LINC, using 
192.168.254.254 as the default gateway. Each router will load balance 
across these two connections, when available. pfSense uses slbd[4] to 
achieve this.

slbd doesn't modify the default gateway on the router itself; it just 
directs outgoing connections via the available WAN connections. It 
determines the availability of each connection by pinging a test IP via 
a static route.

Because the CARP setup doesn't allow the priority of a router to be 
influenced by its status (ie WAN availability), slbd needs to check 
that the Internet is available via whichever router the floating IP has 
landed on. If all routers used the same IP to check whether their LINC 
WAN connection has Internet access, the ping would hit the static route 
on the router with the floating IP and be looped back out onto the LINC 
subnet. If unique IPs are used for this test, the floating IP router 
will simply pass the ping out via its locally available connection.
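
To make the loop concrete, here is a minimal Python sketch of the two 
cases. It is a toy model, not slbd or pfSense code: the Router class, 
the ping_via_linc function and the second address 192.0.2.1 (a 
documentation-range address) are made up for illustration. Only 
203.0.178.191 comes from the earlier example.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    test_ip: str         # IP this router statically routes out via the LINC
    wan_up: bool = True  # state of the router's own Internet connection


def ping_via_linc(sender: Router, gateway: Router, ttl: int = 64) -> bool:
    """Return True if sender's test ping reaches the Internet via gateway.

    gateway is the router currently holding the floating IP.
    """
    dest = sender.test_ip
    while ttl > 0:
        ttl -= 1
        if dest == gateway.test_ip:
            # The gateway has its own static route for this IP pointing at
            # the LINC default gateway, which is itself: the packet loops.
            continue
        # No static route matches, so the packet leaves via the local WAN.
        return gateway.wan_up
    return False  # TTL expired while looping on the LINC subnet


a = Router("A", "203.0.178.191")
b_shared = Router("B", "203.0.178.191")  # same test IP as router A
b_unique = Router("B", "192.0.2.1")      # hypothetical unique test IP

print(ping_via_linc(b_shared, a))  # False: loops until the TTL runs out
print(ping_via_linc(b_unique, a))  # True: goes out A's local connection
```

With a shared test IP, router B never sees the LINC gateway as up. With 
a unique test IP, the result simply follows the state of router A's own 
connection.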

Simon, your idea is elegant in its simplicity, and I feel it would be 
overall more reliable and require less manual intervention in the event of a 
gateway failure (ie no NextG service). However, the fire season is 
rapidly approaching and I need to have all our comms buses, satellite 
trailers and other communications facilities updated and rolled out to 
regional WA before 1st of Oct.

pfSense offered a pre-packaged and easy to configure system, at the 
right price (ie Free!). Setup of the network at a Level 2 or 3 incident 
can be easily carried out by the Incident Management staff on the 
ground, without needing a comms-tech on site straight away. The only 
scenario requiring manual intervention is if a Comms Bus is the current 
LINC default gateway and its NextG router dies; its priority will need 
to be manually dropped to force the floating IP onto another router. 
(Satellite trailers suffering the same problem can simply be removed 
from the network, without impacting other services).

I'm definitely hanging out for the day when lower-end / consumer grade 
routers support something other than RIP (OSPF would be nice!).

Cheers,

Adrian

[1] - http://pfsense.org/
[2] - http://www.yawarra.com.au/
[3] - http://dump.diskworld.com.au/Pantec.png
[4] - http://slbd.sourceforge.net/