180 lines
6.1 KiB
Text
180 lines
6.1 KiB
Text
Documentation on various internal structures.
|
|
|
|
Most important structure use an anonymous shared mmap()
|
|
so that child processes can watch them. (All the cli connections
|
|
are handled in child processes).
|
|
|
|
TODO: Re-investigate threads to see if we can use a thread to handle
|
|
cli connections without killing forwarding performance.
|
|
|
|
session[]
|
|
An array of session structures. This is one of the two
|
|
major data structures that are sync'ed across the cluster.
|
|
|
|
This array is statically allocated at startup time to a
|
|
compile time size (currently 50k sessions). This sets a
|
|
hard limit on the number of sessions a cluster can handle.
|
|
|
|
There is one element per l2tp session. (I.e. each active user).
|
|
|
|
The zero'th session is always invalid.
|
|
|
|
tunnel[]
|
|
An array of tunnel structures. This is the other major data structure
|
|
that's actively sync'ed across the cluster.
|
|
|
|
As per sessions, this is statically allocated at startup time
|
|
to a compile time size limit.
|
|
|
|
There is one element per l2tp tunnel. (normally one per BRAS
|
|
that this cluster talks to).
|
|
|
|
The zero'th tunnel is always invalid.
|
|
|
|
ip_pool[]
|
|
|
|
A table holding all the IP address in the pool. As addresses
|
|
are used, they are tagged with the username of the session,
|
|
and the session index.
|
|
|
|
When they are free'd the username tag ISN'T cleared. This is
|
|
to ensure that were possible we re-allocate the same IP
|
|
address back to the same user.
|
|
|
|
radius[]
|
|
A table holding active radius session. Whenever a radius
|
|
conversation is needed (login, accounting et al), a radius
|
|
session is allocated.
|
|
|
|
char **ip_hash
|
|
|
|
A mapping of IP address to session structure. This is a
|
|
tenary tree (each byte of the IP address is used in turn
|
|
to index that level of the tree).
|
|
|
|
If the value is postive, it's considered to be an index
|
|
into the session table.
|
|
|
|
If it's negative, it's considered to be an index into
|
|
the ip_pool[] table.
|
|
|
|
If it's zero, then there is no associated value.
|
|
|
|
|
|
|
|
============================================================
|
|
|
|
Clustering: How it works.
|
|
|
|
At a high level, the various members of the cluster elect
|
|
a master. All other machines become slaves. Slaves handle normal
|
|
packet forwarding. Whenever a slave get a 'state changing' packet
|
|
(i.e. tunnel setup/teardown, session setup etc) it _doesn't_ handle
|
|
it, but instead forwards it to the master.
|
|
|
|
'State changing' it defined to be "a packet that would cause
|
|
a change in either a session or tunnel structure that isn't just
|
|
updating the idle time or byte counters". In practise, this means
|
|
also all LCP, IPCP, and L2TP control packets.
|
|
|
|
The master then handles the packet normally, updating
|
|
the session/tunnel structures. The changed structures are then
|
|
flooded out to the slaves via a multicast packet.
|
|
|
|
|
|
Heartbeat'ing:
|
|
The master sends out a multicast 'heartbeat' packet
|
|
at least once every second. This packet contains a sequence number,
|
|
and any changes to the session/tunnel structures that have
|
|
been queued up. If there is room in the packet, it also sends
|
|
out a number of extra session/tunnel structures.
|
|
|
|
The sending out of 'extra' structures means that the
|
|
master will slowly walk the entire session and tunnel tables.
|
|
This allows a new slave to catch-up on cluster state.
|
|
|
|
|
|
Each heartbeat has an in-order sequence number. If a
|
|
slave receives a heartbeat with a sequence number other than
|
|
the one it was expecting, it drops the unexpected packet and
|
|
unicasts C_LASTSEEN to tell the master the last heartbeast it
|
|
had seen. The master normally than unicasts the missing packets
|
|
to the slave. If the master doesn't have the old packet any more
|
|
(i.e. it's outside the transmission window) then the master
|
|
unicasts C_KILL to the slave asking it to die. (it should then
|
|
restart, and catchup on state via the normal process).
|
|
|
|
Ping'ing:
|
|
All slaves send out a 'ping' once per second as a
|
|
multicast packet. This 'ping' contains the slave's ip address,
|
|
and most importantly: The number of seconds from epoch
|
|
that the slave started up. (I.e. the value of time(2) at
|
|
that the process started).
|
|
|
|
|
|
Elections:
|
|
|
|
All machines start up as slaves.
|
|
|
|
Each slave listens for a heartbeat from the master.
|
|
If a slave fails to hear a heartbeat for N seconds then it
|
|
checks to see if it should become master.
|
|
|
|
A slave will become master if:
|
|
* It hasn't heard from a master for N seconds.
|
|
* It is the oldest of all it's peers (the other slaves).
|
|
* In the event of a tie, the machine with the
|
|
lowest IP address will win.
|
|
|
|
A 'peer' is any other slave machine that's send out a
|
|
ping in the last N seconds. (i.e. we must have seen
|
|
a recent ping from that slave for it to be considered).
|
|
|
|
The upshot of this is that no special communication
|
|
takes place when a slave becomes a master.
|
|
|
|
On initial cluster startup, the process would be (for example)
|
|
|
|
* 3 machines startup simultaneously, all as slaves.
|
|
* each machine sends out a multicast 'ping' every second.
|
|
* 15 seconds later, the machine with the lowest IP
|
|
address becomes master, and starts sending
|
|
out heartbeats.
|
|
* The remaining two machine hear the heartbeat and
|
|
set that machine as their master.
|
|
|
|
Becoming master:
|
|
|
|
When a slave become master, the only structure maintained up
|
|
to date are the tunnel and session structures. This means
|
|
the master will rebuild a number of mappings.
|
|
|
|
#0. All the session and table structures are marked as
|
|
defined. (Even if we weren't fully up to date, it's
|
|
too late now).
|
|
|
|
#1. All the token bucket filters are re-build from scratch
|
|
with the associated session to tbf pointers being re-built.
|
|
|
|
TODO: These changed tbf pointers aren't flooded to the slave right away!
|
|
Throttled session could take a couple of minutes to start working again
|
|
on master failover!
|
|
|
|
#2. The ipcache to session hash is rebuilt. (This isn't
|
|
strictly needed, but it's a safety measure).
|
|
|
|
#3. The mapping from the ippool into the session table
|
|
(and vice versa) is re-built.
|
|
|
|
|
|
Becoming slave:
|
|
|
|
At startup the entire session and table structures are
|
|
marked undefined.
|
|
|
|
As it seens updates from the master, the updated structures
|
|
are marked as defined.
|
|
|
|
When there are no undefined tunnel or session structures, the
|
|
slave marks itself as 'up-to-date' and starts advertising routes
|
|
(if BGP is enabled).
|