Ok, here is how to enable failover routing in Linux in 3 lines.
http://enterprise.linux.com/enterprise/05/04/10/2132252.shtml?tid=129&ti...
----------- Enabling failover routing
After you have configured your network, the next step is to enable failover routing on your Linux box, so that if the first route dies the router will automatically switch over to the next route. To do so, you'll need to add the default gateway routes provided to you by your ISPs for both your network cards:
# route add default gw 61.16.130.97 dev eth0
# route add default gw 200.15.110.90 dev eth1
Finally, modify the /proc/sys/net/ipv4/route/gc_timeout file. This file contains a numerical value that denotes the time in seconds after which the kernel declares a route to be inactive and automatically switches to the other route if available. Change its default value of 300 to some smaller value, say 10 or 15. Save the changes and exit.
# echo "10" > /proc/sys/net/ipv4/route/gc_timeout
Now your Linux machine is ready to serve as a failover router, automatically and quickly switching to the secondary route every time the primary route fails. --------
On Tue, 10 May 2005, John Lange wrote:
> Now your Linux machine is ready to serve as a failover router, automatically and quickly switching to the secondary route every time the primary route fails.
Unfortunately this seems to monitor only link status, which is a terrible indicator of a connection's health. Cable, DSL, and LAN extension don't pass along the link status.
Sean
Sean,
What do you base this on? Google turns up plenty of references to the gc_timeout method of failover but none are specific on what they mean by "when a route dies".
I can't think of any reason why the kernel would care if the route timed out because the link was down or because the route was down for some other reason.
John Lange wrote:
> Sean,
> What do you base this on? Google turns up plenty of references to the gc_timeout method of failover but none are specific on what they mean by "when a route dies".
Yes, I see that there are a ton of obscure parameters that can be set for routing.
> I can't think of any reason why the kernel would care if the route timed out because the link was down or because the route was down for some other reason.
It is the definition of what is meant by "a route dies". In further reading it seems to imply problems with the NIC. On the other hand, the kernel needs the Ethernet MAC address of the gateway, so when the ARP entry times out it will not be able to reach the gateway; I assume this is another case of the route dying. gc_stale_time sets how long to wait for an ARP entry to become stale.
If you think about it, how does the routing algorithm determine that a gateway is dead? It just forwards packets to it. If the gateway responds to the ARP then there is no way for the routing algorithms to determine that the default route is dead.
I never thought of the ARP timeout before but that would be a good way to determine if the default route is at least up. No idea if this mechanism is actually used.
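For anyone who wants to watch what the ARP cache says about a gateway, here is a minimal sketch. It assumes the iproute2 `ip` tool is installed and that the state is the last field of each `ip neigh show` line; the function name neigh_state is made up for illustration. It only inspects the cache from user space and says nothing about whether the routing code itself consults it.

```shell
#!/bin/sh
# Print the neighbour-cache state (e.g. REACHABLE, STALE, FAILED) reported
# for one address. Reads `ip neigh show` style output on stdin.
# Exits non-zero if the address has no entry at all.
neigh_state() {   # neigh_state ADDRESS
    awk -v ip="$1" '$1 == ip { print $NF; found = 1 } END { exit !found }'
}

# Normal use would be:  ip neigh show | neigh_state 61.16.130.97
# Illustrative input (the MAC address below is made up):
printf '%s\n' '61.16.130.97 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE' \
    | neigh_state 61.16.130.97
```

A missing entry or a FAILED state only says that the gateway did not answer ARP recently; it does not say anything about the path beyond the gateway.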
-- Bill
On Wed, 11 May 2005, John Lange wrote:
> What do you base this on? Google turns up plenty of references to the gc_timeout method of failover but none are specific on what they mean by "when a route dies".
Poke around net/ipv4/route.c. The gc_timeout parameter is a garbage collection timer for clearing out the routing table. The only place the routing table timers seem to be touched is in that file... Every so often the value is decremented, and if it goes down to zero the entry is wiped out.
http://mailman.ds9a.nl/pipermail/lartc/2002q4/005296.html
seems to be a good explanation of the various garbage collection parameters.
> I can't think of any reason why the kernel would care if the route timed out because the link was down or because the route was down for some other reason.
It's not if the kernel cares, it's how the kernel knows. The only thing specified in a route is the prefix and the next hop.
route add default gw 1.1.1.1
route add default gw 2.2.2.2 metric 10
Given no exchange of routing information between the peers, the only thing the kernel has to go on is the link status of the next hop interfaces, which is also in the routing table.
For instance, unplug your cable modem from the coax side, then check your routing table: the 0.0.0.0 route should still have the "U" flag, meaning the route is active.
A more concrete example (edited slightly for brevity)
Here look at 172.16.20/24, out interface vmnet8.
[root@bob root]# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
172.16.20.0     0.0.0.0         255.255.255.0   U         0 0          0 vmnet8
192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0 eth0
0.0.0.0         192.168.1.1     0.0.0.0         UG        0 0          0 eth0
The U flag means the route is up and viable. I'll shut down interface vmnet8, and watch the route go away.
[root@bob root]# ifconfig vmnet8 down; route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
default         poochie.ertw    0.0.0.0         UG    0      0        0 eth0
Route's gone, because the next hop is gone. Bring back up the interface.
[root@bob root]# ifconfig vmnet8 up; route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.16.20.0     *               255.255.255.0   U     0      0        0 vmnet8
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
default         poochie.ertw    0.0.0.0         UG    0      0        0 eth0
Route's back.
It all comes down to "How CAN the kernel know a route is down". The only information it has is the outgoing interface status.
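To repeat the check above without reading the table by eye, something like the following sketch could be used. It assumes the column layout printed by `netstat -rn` or `route` (destination first, flags fourth, interface last); the function name default_routes_up is made up for illustration.

```shell
#!/bin/sh
# List the default routes the kernel still flags as up (U).
# Reads `netstat -rn` or `route` style output on stdin and prints
# "gateway interface" for each matching line.
default_routes_up() {
    awk '($1 == "0.0.0.0" || $1 == "default") && $4 ~ /U/ { print $2, $NF }'
}

# Normal use would be:  netstat -rn | default_routes_up
# Here it is fed the routing table from the example above.
sample='Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
172.16.20.0 0.0.0.0 255.255.255.0 U 0 0 0 vmnet8
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 eth0'
printf '%s\n' "$sample" | default_routes_up   # prints: 192.168.1.1 eth0
```

A default route showing up here only means the outgoing interface is up, which is exactly the limitation being discussed.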
That said, Bill brought up the idea of ARP tables. This has two problems:
1 - If the link is idle for a while, the ARP cache will clear, and be seen as a failure
2 - There are lots of layer 2 networks that don't use ARP. The ethernet code is separate from the ipv4 code.
In addition, this is a common problem on hardware routers. Even a box that's specifically designed to be a router can't do it without resorting to ICMP tests and policy routing.
Sean
Sean A. Walberg wrote:
> That said, Bill brought up the idea of ARP tables. This has two problems:
> 1 - If the link is idle for a while, the ARP cache will clear, and be seen as a failure
> 2 - There are lots of layer 2 networks that don't use ARP. The ethernet code is separate from the ipv4 code.
> In addition, this is a common problem on hardware routers. Even a box that's specifically designed to be a router can't do it without resorting to ICMP tests and policy routing.
Good comments, Sean. I agree, and they match what I saw when I was testing default route failover a few years back. The only way I could get it to work effectively was to have a program ping the gateways to see if they were up and change the default routes when it noticed a failure.
On the other hand, when the NIC dies (layer 1?), layer 3 finds out about it, but routing does need to be aware of the interface. I think it tries to send packets to a host out the same interface that it received packets from the host.
-- Bill
On Wed, 2005-05-11 at 14:38 -0500, Bill Reid wrote:
> Sean A. Walberg wrote:
> > That said, Bill brought up the idea of ARP tables. This has two problems:
> > 1 - If the link is idle for a while, the ARP cache will clear, and be seen as a failure
> > 2 - There are lots of layer 2 networks that don't use ARP. The ethernet code is separate from the ipv4 code.
> > In addition, this is a common problem on hardware routers. Even a box that's specifically designed to be a router can't do it without resorting to ICMP tests and policy routing.
> Good comments, Sean. I agree, and they match what I saw when I was testing default route failover a few years back. The only way I could get it to work effectively was to have a program ping the gateways to see if they were up and change the default routes when it noticed a failure.
> On the other hand, when the NIC dies (layer 1?), layer 3 finds out about it, but routing does need to be aware of the interface. I think it tries to send packets to a host out the same interface that it received packets from the host.
So bottom line is, altering gc_timeout would only be effective if the cable modem itself died or the cable was pulled. Since that type of failure happens only rarely compared to other networking problems this technique will have minimal impact.
Ultimately you end up resorting to a monitoring program/script of some kind.
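A rough sketch of what such a script might look like, in the spirit of the ping-the-gateways approach Bill described. The gateway addresses and interfaces are the ones from the article at the top of the thread. The function names and the PROBE and APPLY hooks are made up for illustration, and it assumes iputils ping and the iproute2 `ip` tool. It is a starting point, not a tested failover solution.

```shell
#!/bin/sh
# Sketch of a ping-based default-route failover check.
# PROBE and APPLY are hypothetical hooks that let the probe and the
# route change be swapped out (for dry runs, for example).

PRIMARY_GW=61.16.130.97
PRIMARY_IF=eth0
SECONDARY_GW=200.15.110.90
SECONDARY_IF=eth1

# Succeeds if the gateway answers at least one of three pings
# sent out through the given interface.
probe() {   # probe GATEWAY INTERFACE
    ping -c 3 -W 2 -I "$2" "$1" >/dev/null 2>&1
}

# Print "gateway interface" for the first gateway that answers,
# preferring the primary. Fails if neither answers.
choose_gateway() {
    if "${PROBE:-probe}" "$PRIMARY_GW" "$PRIMARY_IF"; then
        echo "$PRIMARY_GW $PRIMARY_IF"
    elif "${PROBE:-probe}" "$SECONDARY_GW" "$SECONDARY_IF"; then
        echo "$SECONDARY_GW $SECONDARY_IF"
    else
        return 1
    fi
}

# Point the default route at the chosen gateway (needs root).
apply_route() {   # apply_route GATEWAY INTERFACE
    ip route replace default via "$1" dev "$2"
}

# One monitoring pass.
check_once() {
    choice=$(choose_gateway) || { echo "no gateway reachable" >&2; return 1; }
    set -- $choice
    "${APPLY:-apply_route}" "$1" "$2"
}

# Typical use, as root:  while sleep 10; do check_once; done
```

Pinging only the ISP gateway still misses failures further upstream; probing a host beyond the gateway would catch more, at the cost of needing a route to that host through each interface.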
So I wonder how the Linksys and SonicWall devices do their failover?