Router architectures
=====================

* In this lecture, we will understand how IP networking is implemented in a typical end host / router. Most hosts implement IP in the kernel, so we will study IP networking in the Linux kernel (other OSes should be conceptually similar).

* First, assume that the routing and forwarding tables are already populated in the kernel. (How is this done? We will discuss this later.) What happens when a packet is transmitted and received?

* Packet transmission: The application writes data to the kernel. The data is placed into the socket's send buffer. The transport protocol (say, TCP) takes the data, forms a segment, adds its header, and calls the IP transmit function (UDP does something similar). IP adds its header, looks up the destination address to find a route, determines the interface / link to send the packet on, and places the packet in the output queue of the device. All of this is done when you do a "write" into a socket. From here on, the device driver takes over. The kernel schedules the device driver to run at a later time. The driver adds link layer headers and hands the packet over to the network hardware.

* Packet reception: When a packet arrives on the physical medium, the device driver stores the packet in a backlog queue in the kernel. The kernel scheduler schedules the kernel code to handle the packet at a later point. When this code is invoked, the IP layer checks for errors and such. If the packet clears the checks, IP checks whether the destination is the local host or another host. If the packet is destined to the local host, it is handed off to TCP for processing (process TCP data and place it into the receive buffer, update TCP state if it is a TCP ACK, etc.). If the packet is not destined to the local host, the IP module looks up the destination address, updates IP header fields (e.g., decrements the TTL), and places the packet in the output queue of the corresponding interface.

* The kernel only does forwarding. The routing protocols themselves can be implemented as userspace programs (listening on well-known port numbers) that modify the forwarding tables in the kernel based on the messages they send and receive. The classic "routed" daemon runs in userspace and implements a simple intradomain routing protocol (RIP). Another popular option is the "Quagga" software suite, which implements several intradomain and interdomain routing protocols.
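As a concrete illustration of this split, here is a minimal sketch of how a userspace routing process can push a route into the kernel's forwarding table. The prefix, gateway, and interface below are made-up examples, and the sketch shells out to the iproute2 "ip" tool for simplicity; a real daemon like Quagga talks to the kernel directly over a netlink socket.

```python
import subprocess

def install_route(prefix: str, gateway: str, dev: str) -> None:
    """Install (or overwrite) a kernel route via the iproute2 "ip" tool.

    A routing daemon would call something like this whenever its
    protocol logic picks a new next hop. Needs root privileges.
    """
    subprocess.run(
        ["ip", "route", "replace", prefix, "via", gateway, "dev", dev],
        check=True,
    )

# Hypothetical example: forward 10.1.0.0/16 via gateway 192.168.1.1 on eth0.
install_route("10.1.0.0/16", "192.168.1.1", "eth0")
```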
* So we have seen that a Linux box can serve as a simple router. Why do we need specialized routers? Because a Linux machine cannot process packets at rates of several Gbps.

* Main components of a commercial router:
  - Input ports: perform L1/L2 (physical and link layer) functions, then the destination IP lookup (in coordination with the forwarding engine), update headers, and send the packet to the output ports via the switching fabric. Each input port may have a separate forwarding engine to achieve high-speed lookups. Several input ports are packaged together on a single line card.
  - Forwarding engine: performs a longest prefix match using the destination IP address and the forwarding table, and returns the output port / link / interface the packet has to leave on. [In a simple Linux router, this is handled by the CPU.]
  - Switching fabric: responsible for transferring packets from input ports to output ports.
  - Output ports: handle L1/L2 processing during transmission. Also responsible for implementing any scheduling policy.
  - Routing processor: runs routing protocols and computes forwarding tables. It is like a general-purpose CPU.

* Budget to process packets: consider a 10 Gbps link with 64-byte packets. The rate in packets per second (pps) is 10 * 10^9 bits/s divided by 64 * 8 = 512 bits/packet, i.e., approximately 20 million pps. That is, we have 1/(20 million) s = 50 nanoseconds per packet. In this time, we must look up the destination and transfer the packet to the output port. These two tasks form the bottleneck of any router design.

* IP forwarding algorithms: the algorithms responsible for performing a longest prefix match (LPM) on the destination address. Performed by the forwarding engine. Why is this a hard problem? With classful addressing, you had only 3 possible prefix lengths, so lookup was easy. Now, with CIDR, a prefix can have any possible length, so we may have to examine multiple forwarding table entries. With a 50 ns budget per packet, and typical memory access times of a few nanoseconds, we can afford only a handful of memory accesses per lookup.

* First, why is one address covered by multiple prefixes? There are several reasons (we will study them in detail later). One of them is called prefix hole punching: the ISP owns a larger block (shorter prefix) and delegates a smaller block (longer prefix) to a customer. The customer might then announce this longer prefix through other ISPs. So we see the shorter prefix from the ISP and the longer prefix from the customer in the routing tables, and we must match on the longer one for correctness.

* What is the simplest data structure to implement a forwarding table and do LPM? A trie (pronounced "try"). A trie is a tree-like structure. Every node has 2 branches, for 0 and 1. The nodes at any level hold information about the prefix formed by walking down the tree up to that level. Branches that do not lead to any prefixes are pruned. To perform LPM, walk down the tree as far as possible, remembering the last prefix matched along the way (a sketch follows after the next bullet). This method can take O(N) lookups, where N is the length of the IP address, and each level of the tree may be in a separate memory location. So too many memory accesses. Improvements to trie-based lookup algorithms compress portions of the tree that do not have many branches (path-compressed tries).

* Another method is to search by prefix length. Organize prefixes into different sets based on length, do a binary search over the prefix lengths, and find the longest one that matches (also sketched below). For example, for a 32-bit address, start by searching the prefixes of length 16. If a match is found, continue with prefixes in the range 16-32, and so on. This takes O(log N) steps. Yet another method is to represent prefixes as intervals, and find the shortest interval that contains a given address. A lot of work has been done on efficient algorithms using all these approaches.
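To make the trie walk concrete, here is a minimal binary-trie LPM sketch in Python (the node layout, the bit-string encoding of prefixes, and the port names are invented for illustration; real forwarding engines use compressed, multi-bit tries implemented in hardware):

```python
class TrieNode:
    def __init__(self):
        self.children = [None, None]  # branch on next bit: 0 or 1
        self.next_hop = None          # set if a prefix ends at this node

def insert(root, prefix_bits, next_hop):
    """Insert a prefix given as a bit string, e.g. "10" for 128.0.0.0/2."""
    node = root
    for b in prefix_bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.next_hop = next_hop

def lookup(root, addr_bits):
    """Walk down as far as possible, remembering the last next hop seen:
    that corresponds to the longest matching prefix."""
    node, best = root, None
    for b in addr_bits:
        node = node.children[int(b)]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = TrieNode()
insert(root, "10", "port A")     # short prefix
insert(root, "1011", "port B")   # longer, more specific prefix
print(lookup(root, "10111000"))  # -> port B (longest match wins)
print(lookup(root, "10000000"))  # -> port A
```

Each step down the trie is potentially a separate memory access, which is exactly why a plain binary trie cannot meet a 50 ns budget on 32-bit addresses.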
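Here is also a sketch of the organize-by-prefix-length idea, with one hash table per prefix length, reusing the ISP/customer hole-punching example from above. For brevity this version probes lengths longest-first rather than binary searching; the actual binary-search scheme additionally plants "marker" entries at intermediate lengths so that a missed probe can safely rule out all longer lengths.

```python
import ipaddress

def build_tables(prefixes):
    """prefixes maps (network_bits, prefix_length) -> next hop.
    Organize them into one hash table per prefix length."""
    by_len = {}
    for (net, plen), hop in prefixes.items():
        by_len.setdefault(plen, {})[net] = hop
    return by_len

def lookup(by_len, addr):
    """Probe the per-length tables from longest to shortest; the first
    hit is the longest matching prefix."""
    for plen in sorted(by_len, reverse=True):
        net = addr >> (32 - plen)  # top plen bits of the address
        if net in by_len[plen]:
            return by_len[plen][net]
    return None

def bits(dotted, plen):
    """Top plen bits of a dotted-quad network address, as an integer."""
    return int(ipaddress.ip_address(dotted)) >> (32 - plen)

tables = build_tables({
    (bits("10.0.0.0", 16), 16): "ISP port",       # shorter prefix
    (bits("10.0.1.0", 24), 24): "customer port",  # punched hole
})
print(lookup(tables, int(ipaddress.ip_address("10.0.1.7"))))  # customer port
print(lookup(tables, int(ipaddress.ip_address("10.0.2.5"))))  # ISP port
```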
* Now we move on to switching fabric designs. The most common design is a "crossbar". A crossbar consists of N data transfer "buses" running from each of the N inputs, and N buses running to each of the N outputs. The crossbar can be configured to allow any input port to send to any output port. Every input will have packets for some outputs, and a bipartite matching is computed from inputs to outputs to configure the crossbar for each round. In the matching, every input can be paired with only one output, and every output with only one input (a toy matching round is sketched at the end of these notes). A crossbar-based switch therefore needs efficient algorithms to perform this matching. A crossbar can potentially transfer N packets from N input ports to N output ports in parallel, so in the best case it can keep the inputs and outputs fully utilized. The crossbar is the most common fabric design used in high-speed routers today.

* The speedup of the crossbar is defined as the speed of the crossbar I/O bus relative to the input port linespeed. For example, if the inputs are 1 Gbps links, and the crossbar can potentially transfer 10 Gbps between each pair of matched input and output in each round of the crossbar schedule, then the speedup is 10. In practice, you need a speedup > 1 to achieve good performance. Of course, the actual performance depends on which inputs and outputs are popular, how efficiently you can match them, etc. For example, if all inputs always send to the same output, then the crossbar cannot schedule many packets in parallel, and you will see degraded performance even if the speedup is high.

* Now, recall we discussed buffers at routers, buffer drops, scheduling, etc. Where do all of these happen in the router? At the input ports or the output ports?
  - Input ports need to have queues, since the switching fabric may not always be available to transport a packet. Suppose there are N inputs and N outputs, and the switching fabric speed is the same as the input linespeed (speedup = 1). If all inputs have packets for only one output, then the fabric can send the packet of only one input to that output, and all inputs except the one that is matched will have to queue their packets. This is called input queueing (IQ). If the switching fabric speed is N times the input speed (speedup = N), then the fabric can transfer all N packets from all N ports in the time each port receives one packet, so IQ would not be needed. In practice, the switching fabric speed is greater than the input linespeed but not N times it, so a small amount of queueing may occur. A reasonable speedup (say 2x), a reasonable distribution of traffic from inputs to outputs (not all inputs going to the same output all the time), and a good matching algorithm can avoid too much input queueing.
  - What if the first packet in the input queue is headed to a busy output port, but the packets behind it are headed to free output ports? If we have only one queue for all packets destined to all outputs, we get head-of-line (HoL) blocking: we cannot send the first packet (to the busy output port), so we don't even get to the packets destined to less busy ports. To avoid this, inputs keep different queues for packets destined to different output ports. That is, after a packet arrives, the forwarding table is looked up, and the packet is placed in a queue at the input port, with packets destined to different outputs residing in different sub-queues. This is called virtual output queueing (VOQ); an illustration appears at the end of these notes.
  - After crossing the fabric, outputs also have queues, used to implement QoS-related scheduling policies. It makes sense to implement these at the output port because we want to distribute the outgoing link bandwidth between flows.
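Here is the toy crossbar matching round referred to above: each input reports the set of outputs it currently holds packets for, and a greedy pass builds a maximal matching. The request format and the greedy loop are simplifications for illustration; real schedulers use iterative algorithms such as iSLIP.

```python
def schedule_round(requests):
    """requests[i] is the set of output ports input i has packets for.
    Greedily pair each input with one still-free output, so that every
    input gets at most one output and vice versa, as a crossbar requires."""
    taken, match = set(), {}
    for inp, outs in enumerate(requests):
        for out in sorted(outs):
            if out not in taken:
                match[inp] = out
                taken.add(out)
                break
    return match

# 4x4 example: inputs 0 and 1 both want output 2; only one wins this round.
print(schedule_round([{2}, {2, 3}, {0}, {1}]))
# -> {0: 2, 1: 3, 2: 0, 3: 1}: a full matching; input 1 falls back to output 3
```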
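And the promised illustration of head-of-line blocking versus virtual output queueing at a single input port. The packet tags and the `busy` set are invented for the example; the point is which queue heads are even available to offer to the matching algorithm in a given round.

```python
from collections import deque

# Packets queued at one input port, tagged with their destination output.
arrivals = ["out2", "out0", "out1"]
busy = {"out2"}  # output 2 is unavailable in this scheduling round

# Single FIFO: only the head packet can be offered to the scheduler. Its
# output is busy, so this input offers nothing this round, even though
# packets for free outputs sit right behind it (HoL blocking).
fifo = deque(arrivals)
fifo_candidates = [] if fifo[0] in busy else [fifo[0]]

# VOQ: one sub-queue per destination output. Every sub-queue whose output
# is free contributes its head as a candidate; the matching then picks at
# most one of them for this input.
voq = {}
for pkt in arrivals:
    voq.setdefault(pkt, deque()).append(pkt)
voq_candidates = [q[0] for out, q in voq.items() if out not in busy]

print(fifo_candidates)  # -> []
print(voq_candidates)   # -> ['out0', 'out1']
```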