Imported Upstream version 2.1.21

INTERNALS

Documentation on various internal structures.

Most important structures use an anonymous shared mmap() so that
child processes can watch them. (All the cli connections are handled
in child processes.)

TODO: Re-investigate threads to see if we can use a thread to handle
cli connections without killing forwarding performance.
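As a rough illustration (not the project's actual allocation code), a
table created with an anonymous shared mmap() before fork() remains
visible to the children; the struct and size below are invented for
the example:

    /* Minimal sketch: parent and forked child share one table. */
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <stdio.h>

    struct entry { int in_use; };
    #define TABLE_SIZE 1024

    int main(void)
    {
        struct entry *table = mmap(NULL, TABLE_SIZE * sizeof(struct entry),
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (table == MAP_FAILED)
            return 1;

        if (fork() == 0) {              /* child: watches the shared table */
            sleep(1);
            printf("child sees in_use = %d\n", table[0].in_use);
            _exit(0);
        }

        table[0].in_use = 1;            /* parent writes; child observes it */
        wait(NULL);
        return 0;
    }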
session[]

An array of session structures. This is one of the two major data
structures that are sync'ed across the cluster.

This array is statically allocated at startup time to a compile-time
size (currently 50k sessions). This sets a hard limit on the number
of sessions a cluster can handle.

There is one element per l2tp session. (I.e. each active user.)

The zero'th session is always invalid.
tunnel[]

An array of tunnel structures. This is the other major data
structure that's actively sync'ed across the cluster.

As with sessions, this is statically allocated at startup time to a
compile-time size limit.

There is one element per l2tp tunnel. (Normally one per BRAS that
this cluster talks to.)

The zero'th tunnel is always invalid.
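A minimal sketch of the shape of these two tables (the field names
and the tunnel limit below are placeholders, not the project's real
definitions; in the real code the arrays sit in the shared mmap
described above):

    #define MAXSESSION 50000            /* compile-time hard limit          */
    #define MAXTUNNEL  500              /* placeholder limit                */

    typedef struct { int tunnel; char user[129]; /* ... */ } sessiont;
    typedef struct { int state;  /* ... */                 } tunnelt;

    /* Index 0 of each table is reserved and always invalid, so an
     * index value of 0 can safely mean "none". */
    static sessiont session[MAXSESSION];
    static tunnelt  tunnel[MAXTUNNEL];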
ip_pool[]

A table holding all the IP addresses in the pool. As addresses are
used, they are tagged with the username of the session, and the
session index.

When they are freed, the username tag ISN'T cleared. This is to
ensure that where possible we re-allocate the same IP address back
to the same user.
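A sketch of the "sticky" re-allocation this enables (the type and
function names are invented for the example):

    #include <string.h>

    typedef struct {
        int  assigned;              /* currently handed out?               */
        char user[129];             /* last user of this address; kept     */
    } ip_poolt;

    static ip_poolt ip_pool[4096];
    static int pool_size = 4096;

    /* Return a pool index for 'user', preferring the address that user
     * held last time; fall back to any free entry (-1 if exhausted). */
    static int pool_allocate(const char *user)
    {
        int i, first_free = -1;
        for (i = 1; i < pool_size; i++) {
            if (ip_pool[i].assigned)
                continue;
            if (first_free < 0)
                first_free = i;
            if (strcmp(ip_pool[i].user, user) == 0)
                return i;           /* same user gets the same address back */
        }
        return first_free;
    }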
radius[]

A table holding active radius sessions. Whenever a radius
conversation is needed (login, accounting et al), a radius session
is allocated.
char **ip_hash

A mapping of IP address to session structure. This is effectively a
radix tree: each byte of the IP address is used in turn to index
that level of the tree.

If the value is positive, it's considered to be an index into the
session table.

If it's negative, it's considered to be an index into the ip_pool[]
table.

If it's zero, then there is no associated value.
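A hedged sketch of the byte-at-a-time lookup (the real ip_hash packs
positive/negative/zero values as described above; this version uses
an explicit node type invented for the example):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct node {
        struct node *child[256];    /* one branch per possible byte value   */
        int value;                  /* >0 session idx, <0 pool idx, 0 none  */
    } node_t;

    /* Walk one level per byte of the IPv4 address, most significant
     * byte first; 0 means "no associated value". */
    static int ip_lookup(const node_t *root, uint32_t ip)
    {
        const node_t *n = root;
        for (int shift = 24; shift >= 0; shift -= 8) {
            n = n->child[(ip >> shift) & 0xff];
            if (n == NULL)
                return 0;
        }
        return n->value;
    }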
config->cluster_iam_master

If true, indicates that this node is the master for the cluster.
This has many consequences...
config->cluster_iam_uptodate

On the slaves, this indicates whether the slave has seen a full run
of sessions from the master, and thus whether it's safe to be taking
traffic.

On the master, this indicates that all slaves are up to date. If any
of the slaves aren't up to date, this variable is false, and
indicates that we should shift to more rapid heartbeats to bring the
slaves back up to date.
============================================================

Clustering: How it works.

At a high level, the various members of the cluster elect a master.
All other machines become slaves as soon as they hear a heartbeat
from the master. Slaves handle normal packet forwarding. Whenever a
slave gets a 'state changing' packet (i.e. tunnel setup/teardown,
session setup etc.) it _doesn't_ handle it, but instead forwards it
to the master.

'State changing' is defined to be "a packet that would cause a
change in either a session or tunnel structure that isn't just
updating the idle time or byte counters". In practice, this means
almost all LCP, IPCP, and L2TP control packets.

The master then handles the packet normally, updating the
session/tunnel structures. The changed structures are then flooded
out to the slaves via a multicast packet.
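A sketch of the slave's "defer state changes to the master" rule (the
types and helper functions here are stubs invented for the
illustration):

    #include <stdbool.h>

    struct packet { bool changes_state; /* ... */ };
    static bool cluster_iam_master;

    static void process_packet(struct packet *p)        { (void)p; /* ... */ }
    static void master_forward_packet(struct packet *p) { (void)p; /* ... */ }

    static void handle_packet(struct packet *p)
    {
        if (cluster_iam_master || !p->changes_state)
            process_packet(p);          /* handle locally                   */
        else
            master_forward_packet(p);   /* slave defers the state change    */
    }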
Heartbeat'ing:

The master sends out a multicast 'heartbeat' packet at least once
every second. This packet contains a sequence number, and any
changes to the session/tunnel structures that have been queued up.
If there is room in the packet, it also sends out a number of extra
session/tunnel structures.

The sending out of 'extra' structures means that the master will
slowly walk the entire session and tunnel tables. This allows a new
slave to catch up on cluster state.

Each heartbeat has an in-order sequence number. If a slave receives
a heartbeat with a sequence number other than the one it was
expecting, it drops the unexpected packet and unicasts C_LASTSEEN to
tell the master the last heartbeat it had seen. The master normally
then unicasts the missing packets to the slave. If the master
doesn't have the old packet any more (i.e. it's outside the
transmission window) then the master unicasts C_KILL to the slave,
asking it to die. (The slave should then restart, and catch up on
state via the normal process.)

If a slave goes for more than a few seconds without hearing from the
master, it sends out a preemptive C_LASTSEEN. If the master still
exists, this forces the master to unicast the missed heartbeats.
This is a workaround for a temporary multicast problem. (I.e. if an
IGMP probe is missed, the slave will temporarily stop seeing the
multicast heartbeats. This workaround prevents the slave from
becoming master, with horrible consequences.)
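A sketch of the slave-side sequence check described above (the helper
names are invented; the real heartbeat and C_LASTSEEN handling live
in the cluster code):

    #include <stdint.h>

    static uint32_t expected_seq;       /* next heartbeat we expect         */

    static void send_lastseen(uint32_t last)  { (void)last; /* unicast C_LASTSEEN */ }
    static void apply_heartbeat_changes(void) { /* update session/tunnel tables   */ }

    /* Returns 1 if the heartbeat was applied, 0 if it was dropped. */
    static int receive_heartbeat(uint32_t seq)
    {
        if (seq != expected_seq) {
            /* Out of order: drop it and tell the master the last
             * heartbeat we actually saw so it can resend. */
            send_lastseen(expected_seq - 1);
            return 0;
        }
        apply_heartbeat_changes();
        expected_seq++;
        return 1;
    }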
Ping'ing:

All slaves send out a 'ping' once per second as a multicast packet.
This 'ping' contains the slave's IP address and, most importantly,
the number of seconds from the epoch at which the slave started up
(i.e. the value of time(2) when the process started). This is the
'basetime'. Obviously, this is never zero.

There is a special case. The master can send a single ping on
shutdown to indicate that it is dead and that an immediate election
should be held. This special ping is sent from the master with a
'basetime' of zero.
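The information a ping carries, per the description above, amounts to
something like this (the struct is illustrative only; the actual
on-wire format is defined by the cluster protocol code):

    #include <stdint.h>
    #include <time.h>

    struct cluster_ping {
        uint32_t addr;          /* slave's IP address                       */
        time_t   basetime;      /* time(2) at startup; 0 = master shutdown  */
    };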
Elections:

All machines start up as slaves.

Each slave listens for a heartbeat from the master. If a slave fails
to hear a heartbeat for N seconds then it checks to see if it should
become master.

A slave will become master if:
    * It hasn't heard from a master for N seconds.
    * It is the oldest of all its peers (the other slaves).
    * In the event of a tie, the machine with the lowest IP address
      will win.

A 'peer' is any other slave machine that has sent out a ping in the
last N seconds. (I.e. we must have seen a recent ping from that
slave for it to be considered.)

The upshot of this is that no special communication takes place when
a slave becomes a master.

On initial cluster startup, the process would be (for example):

    * 3 machines start up simultaneously, all as slaves.
    * Each machine sends out a multicast 'ping' every second.
    * 15 seconds later, the machine with the lowest IP address
      becomes master, and starts sending out heartbeats.
    * The remaining two machines hear the heartbeat and set that
      machine as their master.
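A hedged sketch of this decision, combining the ping 'basetime' with
the tie-break on IP address (all names are invented; the real code
differs in detail):

    #include <stdint.h>
    #include <time.h>

    struct peer { time_t basetime; uint32_t addr; time_t last_ping; };

    static int should_become_master(time_t now, time_t last_heartbeat,
                                    time_t my_basetime, uint32_t my_addr,
                                    const struct peer *peers, int npeers, int N)
    {
        if (now - last_heartbeat < N)
            return 0;                   /* master still looks alive         */

        for (int i = 0; i < npeers; i++) {
            if (now - peers[i].last_ping > N)
                continue;               /* stale peer: not considered       */
            if (peers[i].basetime < my_basetime)
                return 0;               /* an older peer exists             */
            if (peers[i].basetime == my_basetime && peers[i].addr < my_addr)
                return 0;               /* tie: lowest IP address wins      */
        }
        return 1;
    }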
Becoming master:

When a slave becomes master, the only structures maintained up to
date are the tunnel and session structures. This means the master
will rebuild a number of mappings.

#0. All the session and tunnel structures are marked as defined.
(Even if we weren't fully up to date, it's too late now.)

#1. All the token bucket filters are re-built from scratch, with the
associated session-to-tbf pointers being re-built.

TODO: These changed tbf pointers aren't flooded to the slaves right
away! Throttled sessions could take a couple of minutes to start
working again on master failover!

#2. The IP address to session hash is rebuilt. (This isn't strictly
needed, but it's a safety measure.)

#3. The mapping from the ip_pool into the session table (and vice
versa) is re-built.
Becoming slave:

At startup, the entire session and tunnel tables are marked
undefined.

As the slave sees updates from the master, the updated structures
are marked as defined.

When there are no undefined tunnel or session structures, the slave
marks itself as 'up-to-date' and starts advertising routes (if BGP
is enabled).
STONITH:

Currently, there is very minimal protection from split brain. In
particular, there is no real STONITH protocol to stop two masters
appearing in the event of a network problem.
TODO:

Should slaves that have undefined sessions, and receive a packet
from a non-existent session, then forward it to the master? In
normal practice, a slave with undefined sessions shouldn't be
handling packets, but ...

There is far too much walking of large arrays (in the master
specifically). Although this is mitigated somewhat by the
cluster_high_{sess,tun}, this benefit is lost as that value gets
closer to MAX{SESSION,TUNNEL}. There are two issues here:

    * The tunnel, radius and tbf arrays should probably use a
      mechanism like sessions, where grabbing a new one is a single
      lookup rather than a walk.

    * A list structure (similarly rooted at [0].interesting) is
      required to avoid having to walk tables periodically. As a
      back-stop, the code in the master which *does* walk the arrays
      can mark any entry it processes as "interesting" to ensure it
      gets looked at even if a bug causes it to be otherwise
      overlooked.

Support for more than 64k sessions per cluster. There is currently a
64k session limit because each session gets an id that is global
over the cluster (as opposed to local to the tunnel). Obviously, the
tunnel id needs to be used in conjunction with the session id to
index into the session table. But how?

I think the best way is to use something like page tables. For a
given <tid,sid>, the appropriate session index is

    session[ tunnel[tid].page[sid>>10] + (sid & 1023) ]

where tunnel[].page[] is a 64 element array. As a tunnel fills up
its page block, it allocates a new 1024-session block from the
session table and fills in the appropriate .page[] entry.

This should be a reasonable compromise between wasting memory (on
average 500 wasted sessions per tunnel) and speed. (Still a direct
index without searching, but extra lookups are required.) Obviously
the <6,10> split on the sid can be moved around to tune the size of
the page table vs the session table block size.

This unfortunately means that the tunnel structure HAS to be filled
on the slave before any of the sessions on it can be used.
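A hedged sketch of the proposed <tid,sid> lookup (nothing here is
implemented; the names and the <6,10> split follow the description
above):

    #include <stdint.h>

    #define PAGE_SHIFT 10               /* 1024 sessions per page            */
    #define PAGE_SIZE  (1 << PAGE_SHIFT)
    #define PAGES      64               /* 64 pages * 1024 = 64k per tunnel  */

    struct tunnel_pages {
        uint32_t page[PAGES];           /* base indexes into session[]       */
    };

    /* Map a tunnel-local session id to a global session[] index.
     * Returns 0 (the always-invalid session) if that page of the
     * tunnel hasn't been allocated yet. */
    static uint32_t session_index(const struct tunnel_pages *t, uint16_t sid)
    {
        uint32_t base = t->page[sid >> PAGE_SHIFT];
        if (!base)
            return 0;
        return base + (sid & (PAGE_SIZE - 1));
    }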