Imported Upstream version 2.1.21

INTERNALS

Documentation on various internal structures.

Most important structures use an anonymous shared mmap() so that
child processes can watch them. (All the cli connections are handled
in child processes.)

TODO: Re-investigate threads to see if we can use a thread to handle
cli connections without killing forwarding performance.
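As a rough illustration (not the project's actual allocation code), a
table created with an anonymous shared mmap() before fork() remains
visible to the children; the struct and size below are invented for
the example:

    /* Minimal sketch: parent and forked child share one table. */
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <stdio.h>

    struct entry { int in_use; };
    #define TABLE_SIZE 1024

    int main(void)
    {
        struct entry *table = mmap(NULL, TABLE_SIZE * sizeof(struct entry),
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (table == MAP_FAILED)
            return 1;

        if (fork() == 0) {              /* child: watches the shared table */
            sleep(1);
            printf("child sees in_use = %d\n", table[0].in_use);
            _exit(0);
        }

        table[0].in_use = 1;            /* parent writes; child observes it */
        wait(NULL);
        return 0;
    }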
session[]

An array of session structures. This is one of the two major data
structures that are sync'ed across the cluster.

This array is statically allocated at startup time to a compile-time
size (currently 50k sessions). This sets a hard limit on the number
of sessions a cluster can handle.

There is one element per l2tp session. (I.e. each active user.)

The zero'th session is always invalid.
tunnel[]

An array of tunnel structures. This is the other major data
structure that's actively sync'ed across the cluster.

As with sessions, this is statically allocated at startup time to a
compile-time size limit.

There is one element per l2tp tunnel. (Normally one per BRAS that
this cluster talks to.)

The zero'th tunnel is always invalid.
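A minimal sketch of the shape of these two tables (the field names
and the tunnel limit below are placeholders, not the project's real
definitions; in the real code the arrays sit in the shared mmap
described above):

    #define MAXSESSION 50000            /* compile-time hard limit          */
    #define MAXTUNNEL  500              /* placeholder limit                */

    typedef struct { int tunnel; char user[129]; /* ... */ } sessiont;
    typedef struct { int state;  /* ... */                 } tunnelt;

    /* Index 0 of each table is reserved and always invalid, so an
     * index value of 0 can safely mean "none". */
    static sessiont session[MAXSESSION];
    static tunnelt  tunnel[MAXTUNNEL];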
ip_pool[]

A table holding all the IP addresses in the pool. As addresses are
used, they are tagged with the username of the session, and the
session index.

When they are freed, the username tag ISN'T cleared. This is to
ensure that where possible we re-allocate the same IP address back
to the same user.
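A sketch of the "sticky" re-allocation this enables (the type and
function names are invented for the example):

    #include <string.h>

    typedef struct {
        int  assigned;              /* currently handed out?               */
        char user[129];             /* last user of this address; kept     */
    } ip_poolt;

    static ip_poolt ip_pool[4096];
    static int pool_size = 4096;

    /* Return a pool index for 'user', preferring the address that user
     * held last time; fall back to any free entry (-1 if exhausted). */
    static int pool_allocate(const char *user)
    {
        int i, first_free = -1;
        for (i = 1; i < pool_size; i++) {
            if (ip_pool[i].assigned)
                continue;
            if (first_free < 0)
                first_free = i;
            if (strcmp(ip_pool[i].user, user) == 0)
                return i;           /* same user gets the same address back */
        }
        return first_free;
    }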
radius[]

A table holding active radius sessions. Whenever a radius
conversation is needed (login, accounting et al), a radius session
is allocated.
char **ip_hash

A mapping of IP address to session structure. This is effectively a
radix tree: each byte of the IP address is used in turn to index
that level of the tree.

If the value is positive, it's considered to be an index into the
session table.

If it's negative, it's considered to be an index into the ip_pool[]
table.

If it's zero, then there is no associated value.
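A hedged sketch of the byte-at-a-time lookup (the real ip_hash packs
positive/negative/zero values as described above; this version uses
an explicit node type invented for the example):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct node {
        struct node *child[256];    /* one branch per possible byte value   */
        int value;                  /* >0 session idx, <0 pool idx, 0 none  */
    } node_t;

    /* Walk one level per byte of the IPv4 address, most significant
     * byte first; 0 means "no associated value". */
    static int ip_lookup(const node_t *root, uint32_t ip)
    {
        const node_t *n = root;
        for (int shift = 24; shift >= 0; shift -= 8) {
            n = n->child[(ip >> shift) & 0xff];
            if (n == NULL)
                return 0;
        }
        return n->value;
    }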
config->cluster_iam_master

If true, indicates that this node is the master for the cluster.
This has many consequences...
config->cluster_iam_uptodate

On the slaves, this indicates whether the slave has seen a full run
of sessions from the master, and thus whether it's safe to be taking
traffic.

On the master, this indicates that all slaves are up to date. If any
of the slaves aren't up to date, this variable is false, and
indicates that we should shift to more rapid heartbeats to bring the
slaves back up to date.
============================================================

Clustering: How it works.

At a high level, the various members of the cluster elect a master.
All other machines become slaves as soon as they hear a heartbeat
from the master. Slaves handle normal packet forwarding. Whenever a
slave gets a 'state changing' packet (i.e. tunnel setup/teardown,
session setup etc.) it _doesn't_ handle it, but instead forwards it
to the master.

'State changing' is defined to be "a packet that would cause a
change in either a session or tunnel structure that isn't just
updating the idle time or byte counters". In practice, this means
almost all LCP, IPCP, and L2TP control packets.

The master then handles the packet normally, updating the
session/tunnel structures. The changed structures are then flooded
out to the slaves via a multicast packet.
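A sketch of the slave's "defer state changes to the master" rule (the
types and helper functions here are stubs invented for the
illustration):

    #include <stdbool.h>

    struct packet { bool changes_state; /* ... */ };
    static bool cluster_iam_master;

    static void process_packet(struct packet *p)        { (void)p; /* ... */ }
    static void master_forward_packet(struct packet *p) { (void)p; /* ... */ }

    static void handle_packet(struct packet *p)
    {
        if (cluster_iam_master || !p->changes_state)
            process_packet(p);          /* handle locally                   */
        else
            master_forward_packet(p);   /* slave defers the state change    */
    }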
Heartbeat'ing:

The master sends out a multicast 'heartbeat' packet at least once
every second. This packet contains a sequence number, and any
changes to the session/tunnel structures that have been queued up.
If there is room in the packet, it also sends out a number of extra
session/tunnel structures.

The sending out of 'extra' structures means that the master will
slowly walk the entire session and tunnel tables. This allows a new
slave to catch up on cluster state.

Each heartbeat has an in-order sequence number. If a slave receives
a heartbeat with a sequence number other than the one it was
expecting, it drops the unexpected packet and unicasts C_LASTSEEN to
tell the master the last heartbeat it had seen. The master normally
then unicasts the missing packets to the slave. If the master
doesn't have the old packet any more (i.e. it's outside the
transmission window) then the master unicasts C_KILL to the slave,
asking it to die. (The slave should then restart, and catch up on
state via the normal process.)

If a slave goes for more than a few seconds without hearing from the
master, it sends out a preemptive C_LASTSEEN. If the master still
exists, this forces the master to unicast the missed heartbeats.
This is a workaround for a temporary multicast problem. (I.e. if an
IGMP probe is missed, the slave will temporarily stop seeing the
multicast heartbeats. This workaround prevents the slave from
becoming master, with horrible consequences.)
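A sketch of the slave-side sequence check described above (the helper
names are invented; the real heartbeat and C_LASTSEEN handling live
in the cluster code):

    #include <stdint.h>

    static uint32_t expected_seq;       /* next heartbeat we expect         */

    static void send_lastseen(uint32_t last)  { (void)last; /* unicast C_LASTSEEN */ }
    static void apply_heartbeat_changes(void) { /* update session/tunnel tables   */ }

    /* Returns 1 if the heartbeat was applied, 0 if it was dropped. */
    static int receive_heartbeat(uint32_t seq)
    {
        if (seq != expected_seq) {
            /* Out of order: drop it and tell the master the last
             * heartbeat we actually saw so it can resend. */
            send_lastseen(expected_seq - 1);
            return 0;
        }
        apply_heartbeat_changes();
        expected_seq++;
        return 1;
    }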
Ping'ing:

All slaves send out a 'ping' once per second as a multicast packet.
This 'ping' contains the slave's IP address and, most importantly,
the number of seconds from the epoch at which the slave started up
(i.e. the value of time(2) when the process started). This is the
'basetime'. Obviously, this is never zero.

There is a special case. The master can send a single ping on
shutdown to indicate that it is dead and that an immediate election
should be held. This special ping is sent from the master with a
'basetime' of zero.
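The information a ping carries, per the description above, amounts to
something like this (the struct is illustrative only; the actual
on-wire format is defined by the cluster protocol code):

    #include <stdint.h>
    #include <time.h>

    struct cluster_ping {
        uint32_t addr;          /* slave's IP address                       */
        time_t   basetime;      /* time(2) at startup; 0 = master shutdown  */
    };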
Elections:

All machines start up as slaves.

Each slave listens for a heartbeat from the master. If a slave fails
to hear a heartbeat for N seconds then it checks to see if it should
become master.

A slave will become master if:
    * It hasn't heard from a master for N seconds.
    * It is the oldest of all its peers (the other slaves).
    * In the event of a tie, the machine with the lowest IP address
      will win.

A 'peer' is any other slave machine that has sent out a ping in the
last N seconds. (I.e. we must have seen a recent ping from that
slave for it to be considered.)

The upshot of this is that no special communication takes place when
a slave becomes a master.

On initial cluster startup, the process would be (for example):

    * 3 machines start up simultaneously, all as slaves.
    * Each machine sends out a multicast 'ping' every second.
    * 15 seconds later, the machine with the lowest IP address
      becomes master, and starts sending out heartbeats.
    * The remaining two machines hear the heartbeat and set that
      machine as their master.
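A hedged sketch of this decision, combining the ping 'basetime' with
the tie-break on IP address (all names are invented; the real code
differs in detail):

    #include <stdint.h>
    #include <time.h>

    struct peer { time_t basetime; uint32_t addr; time_t last_ping; };

    static int should_become_master(time_t now, time_t last_heartbeat,
                                    time_t my_basetime, uint32_t my_addr,
                                    const struct peer *peers, int npeers, int N)
    {
        if (now - last_heartbeat < N)
            return 0;                   /* master still looks alive         */

        for (int i = 0; i < npeers; i++) {
            if (now - peers[i].last_ping > N)
                continue;               /* stale peer: not considered       */
            if (peers[i].basetime < my_basetime)
                return 0;               /* an older peer exists             */
            if (peers[i].basetime == my_basetime && peers[i].addr < my_addr)
                return 0;               /* tie: lowest IP address wins      */
        }
        return 1;
    }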
Becoming master:

When a slave becomes master, the only structures maintained up to
date are the tunnel and session structures. This means the master
will rebuild a number of mappings.

#0. All the session and tunnel structures are marked as defined.
(Even if we weren't fully up to date, it's too late now.)

#1. All the token bucket filters are re-built from scratch, with the
associated session-to-tbf pointers being re-built.

TODO: These changed tbf pointers aren't flooded to the slaves right
away! Throttled sessions could take a couple of minutes to start
working again on master failover!

#2. The IP address to session hash is rebuilt. (This isn't strictly
needed, but it's a safety measure.)

#3. The mapping from the ip_pool into the session table (and vice
versa) is re-built.
Becoming slave:

At startup, the entire session and tunnel tables are marked
undefined.

As the slave sees updates from the master, the updated structures
are marked as defined.

When there are no undefined tunnel or session structures, the slave
marks itself as 'up-to-date' and starts advertising routes (if BGP
is enabled).
STONITH:

Currently, there is very minimal protection from split brain. In
particular, there is no real STONITH protocol to stop two masters
appearing in the event of a network problem.
TODO:

Should slaves that have undefined sessions, and receive a packet
from a non-existent session, then forward it to the master? In
normal practice, a slave with undefined sessions shouldn't be
handling packets, but ...

There is far too much walking of large arrays (in the master
specifically). Although this is mitigated somewhat by the
cluster_high_{sess,tun}, this benefit is lost as that value gets
closer to MAX{SESSION,TUNNEL}. There are two issues here:

    * The tunnel, radius and tbf arrays should probably use a
      mechanism like sessions, where grabbing a new one is a single
      lookup rather than a walk.

    * A list structure (similarly rooted at [0].interesting) is
      required to avoid having to walk tables periodically. As a
      back-stop, the code in the master which *does* walk the arrays
      can mark any entry it processes as "interesting" to ensure it
      gets looked at even if a bug causes it to be otherwise
      overlooked.

Support for more than 64k sessions per cluster. There is currently a
64k session limit because each session gets an id that is global
over the cluster (as opposed to local to the tunnel). Obviously, the
tunnel id needs to be used in conjunction with the session id to
index into the session table. But how?

I think the best way is to use something like page tables. For a
given <tid,sid>, the appropriate session index is

    session[ tunnel[tid].page[sid>>10] + (sid & 1023) ]

where tunnel[].page[] is a 64 element array. As a tunnel fills up
its page block, it allocates a new 1024-session block from the
session table and fills in the appropriate .page[] entry.

This should be a reasonable compromise between wasting memory (on
average 500 wasted sessions per tunnel) and speed. (Still a direct
index without searching, but extra lookups are required.) Obviously
the <6,10> split on the sid can be moved around to tune the size of
the page table vs the session table block size.

This unfortunately means that the tunnel structure HAS to be filled
on the slave before any of the sessions on it can be used.
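A hedged sketch of the proposed <tid,sid> lookup (nothing here is
implemented; the names and the <6,10> split follow the description
above):

    #include <stdint.h>

    #define PAGE_SHIFT 10               /* 1024 sessions per page            */
    #define PAGE_SIZE  (1 << PAGE_SHIFT)
    #define PAGES      64               /* 64 pages * 1024 = 64k per tunnel  */

    struct tunnel_pages {
        uint32_t page[PAGES];           /* base indexes into session[]       */
    };

    /* Map a tunnel-local session id to a global session[] index.
     * Returns 0 (the always-invalid session) if that page of the
     * tunnel hasn't been allocated yet. */
    static uint32_t session_index(const struct tunnel_pages *t, uint16_t sid)
    {
        uint32_t base = t->page[sid >> PAGE_SHIFT];
        if (!base)
            return 0;
        return base + (sid & (PAGE_SIZE - 1));
    }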