Router Design
==================================

Outline
- Recap of IP router architectures
- MPLS
- Intro to routing protocols

* Main components of a commercial router:
- Input ports: perform L1/L2 functions, then destination IP lookup (in coordination with the forwarding engine), update headers, and send the packet to the output ports via the switching fabric. Each input port may have a separate forwarding engine to achieve high speed lookups. Several input ports are placed on a single line card.
- Forwarding engine: performs a longest prefix match using the destination IP and the forwarding table, and returns the output port / link / interface the packet has to leave on. [In a simple Linux router, this is handled by the CPU.]
- Switching fabric: responsible for transferring packets from input ports to output ports.
- Output ports: handle L1/L2 processing during transmission. Also responsible for implementing any scheduling policy.
- Routing processor: runs routing protocols and computes forwarding tables.

* Budget to process packets: consider a 10 Gbps link with 64-byte packets. The rate in packets per second (pps) is 10*10^9 / (64*8), approximately 20 million pps. That is, we have 1/(20 million) s, or about 50 nanoseconds, per packet (a quick sanity check of this arithmetic appears at the end of this section). In this time, we must look up the destination and transfer the packet to the output port. These two tasks form the bottleneck of any router design.

* IP forwarding algorithms: the algorithms responsible for performing a longest prefix match (LPM) on the destination address, run by the forwarding engine. Why is this a hard problem? With classful addressing, there were only 3 possible prefix lengths, so lookup was easy. Now, with CIDR, a prefix can have any length, so we may have to look up multiple routing table entries. With a 50 ns budget per packet, and typical memory access times of a few nanoseconds, we can afford only a handful of memory accesses per packet.

* First, why is one address covered by multiple prefixes? There are several reasons (we will study them in detail later). One of them is called prefix hole punching: the ISP holds a larger (shorter) prefix and gives a smaller (longer) prefix out of it to a customer. The customer might then announce this longer prefix through other ISPs. So the routing tables contain both the shorter prefix from the ISP and the longer prefix from the customer, and we must match on the longer one for correctness.

* What is the simplest data structure to implement a forwarding table and do LPM? A trie (pronounced "try"). A trie is a tree-like structure in which every node has 2 branches, for 0 and 1. The nodes at any level hold information about the prefix formed by walking down the tree to that level. Branches that do not contain any prefixes are pruned. To perform LPM, walk down the tree as far as possible, remembering the last prefix seen (a minimal sketch appears at the end of this section). This method can take O(N) lookups, where N is the length of the IP address (32 for IPv4), and each level of the tree may live in a separate memory location, so too many memory accesses. Improvements to trie-based lookup algorithms compress portions of the tree that have few branches (path-compressed tries).

* Another method is to search by prefix length. Organize prefixes into different sets based on length, then do a binary search over the lengths to find the longest one that matches. For example, for a 32-bit address, start by searching among prefixes of length 16; if a match is found, continue with prefix lengths in the range 16-32, and so on. This takes O(log N) steps (a simplified sketch appears below as well). Yet another method is to represent prefixes as intervals and find the shortest interval that contains the destination address.
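* A quick sanity check of the packet budget arithmetic above, as a Python sketch (it ignores inter-packet framing overhead, so real numbers are slightly lower):

    # Back-of-the-envelope budget: 10 Gbps link, minimum-size 64-byte packets.
    link_rate_bps = 10e9
    packet_bits = 64 * 8

    pps = link_rate_bps / packet_bits   # ~19.5 million packets per second
    budget_ns = 1e9 / pps               # ~51 ns to handle each packet

    print(f"{pps / 1e6:.1f} Mpps, {budget_ns:.1f} ns per packet")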
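* The binary trie can be made concrete with a short sketch. This is a minimal illustration of the idea, not an optimized implementation; the names (Node, insert, lpm) and the port values are invented here. It also reproduces the hole-punching example: the customer's longer prefix wins.

    # Minimal binary trie for longest prefix match over 32-bit addresses.
    class Node:
        def __init__(self):
            self.children = [None, None]   # branches for bit 0 and bit 1
            self.port = None               # output port if a prefix ends here

    def insert(root, prefix, length, port):
        node = root
        for i in range(length):
            bit = (prefix >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = Node()
            node = node.children[bit]
        node.port = port

    def lpm(root, addr):
        node, best = root, None
        for i in range(32):                # walk down as far as possible,
            bit = (addr >> (31 - i)) & 1   # remembering the last prefix seen
            node = node.children[bit]
            if node is None:
                break
            if node.port is not None:
                best = node.port
        return best

    root = Node()
    insert(root, 0x0A000000, 8, "port1")   # 10.0.0.0/8, ISP's shorter prefix
    insert(root, 0x0A010000, 16, "port2")  # 10.1.0.0/16, customer's longer prefix
    print(lpm(root, 0x0A010203))           # 10.1.2.3 -> port2, the longer match

Each iteration of the lookup loop may chase a pointer into a different memory location, which is exactly why a plain trie can cost up to 32 memory accesses per IPv4 lookup.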
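* The search-by-length idea can also be sketched with one hash table per prefix length. A true binary search over lengths needs extra precomputed "marker" entries to decide which half to continue in; this simplified sketch just scans lengths from longest to shortest, which is enough to show the data layout:

    # One hash table per prefix length, keyed by the masked address.
    tables = {8: {0x0A000000: "port1"}, 16: {0x0A010000: "port2"}}

    def lpm_by_length(addr):
        for length in sorted(tables, reverse=True):        # longest first
            key = addr & ((0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF)
            if key in tables[length]:
                return tables[length][key]                 # first hit is the LPM
        return None

    print(lpm_by_length(0x0A010203))                       # -> port2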
A lot of work has been done on efficient algorithms using all these approaches.

* Now we move on to switching fabric designs.
- Switching fabric using shared memory: what is the switching fabric in a simple Linux router? Main memory. You copy the packet from the device driver to memory, perform routing, and copy it to another device driver. So the connection between input and output ports happens via shared memory. We use direct memory access (DMA) to copy to and from the device drivers without involving the CPU, so the CPU is involved only in the forwarding decision. Still, we are limited by the memory speed. Consider a memory with a 133 MHz clock and a 64-bit bus: 133*10^6 * 64 is approximately 8.5 Gbps. Since each packet needs two copies (in and out), you can only forward at roughly 4 Gbps in ideal settings, across all ports.
- Switching fabric using a shared bus: older router designs had a shared bus from input ports to output ports, so a packet went over the bus only once. So you can get 2X better performance than with shared memory for the same I/O bus speed. We also need to cope with the bus being busy (only one packet transfer at a time). Suppose you have a switch with 10 input ports and 10 output ports of 1 Gbps each, but a 1 Gbps shared bus. Then the total traffic through the switch is only 1 Gbps, not 10 Gbps.
- Switching fabric with a crossbar: instead of one I/O bus connecting N input ports and N output ports, we can have a crossbar of 2N buses, which can be configured to connect any input port to any output port. Every input will have packets for some outputs, and a bipartite matching is done from inputs to outputs to configure the crossbar for each round. In the matching, every input can be paired with only one output, and every output with only one input. A crossbar-based switch needs efficient algorithms to perform this matching. A crossbar can potentially transfer N packets from N input ports to N output ports in parallel, so in the best case it can keep the inputs and outputs fully utilized. The crossbar is the most common fabric design used in high speed routers today.

* The speedup of the crossbar is defined as the speed of the crossbar I/O bus relative to the input port linespeed. For example, if the inputs are 1 Gbps links, and the crossbar can potentially transfer 10 Gbps between each pair of matched input and output in each round of the crossbar schedule, then the speedup is 10. In practice, you need a speedup > 1 to achieve good performance. Of course, the actual performance depends on which inputs and outputs are popular, how efficiently you can match, etc. For example, if all inputs always send to the same output, then the crossbar cannot schedule many packets in parallel, and you will get degraded performance even if the speedup is high.

* A simple matching algorithm for a crossbar switch: parallel iterative matching (PIM). Each iteration of the algorithm has 3 steps: request, grant, accept. Every input sends a request specifying which outputs it has packets for. Among multiple requests, each output selects one at random and sends a grant to that input. Each input may receive several grants, and picks one among them to accept. After one iteration, the unmatched inputs and outputs participate in the next iteration to add more matches (a one-iteration sketch follows). PIM is not guaranteed to give the best possible match every time, but works reasonably well in practice. Several improvements have been made in scheduling / matching algorithms for crossbars.
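* A one-iteration sketch of PIM in Python. This is illustrative only (real schedulers run a few iterations per round, in hardware); representing each input's queues as the set of outputs it has packets for is an assumption made here:

    import random

    # voqs[i] is the set of outputs that input i has packets for.
    def pim_iteration(voqs, free_inputs, free_outputs):
        # Request: each unmatched input asks for every output it has packets for.
        requests = {o: [i for i in free_inputs if o in voqs[i]]
                    for o in free_outputs}
        # Grant: each output picks one of its requesters at random.
        grants = {}
        for o, reqs in requests.items():
            if reqs:
                grants.setdefault(random.choice(reqs), []).append(o)
        # Accept: each input picks one of the grants it received at random.
        return {i: random.choice(outs) for i, outs in grants.items()}

    voqs = {0: {0, 1}, 1: {0}, 2: {2}}
    print(pim_iteration(voqs, free_inputs={0, 1, 2}, free_outputs={0, 1, 2}))
    # e.g. {0: 1, 1: 0, 2: 2}; unmatched ports retry in the next iteration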
* Now, recall we discussed buffers at routers, buffer drops, scheduling, etc. Where do all of these happen in the router? At the input ports or the output ports?
- Input ports need to have queues, since the switching fabric may not always be available to transport a packet. Suppose there are N inputs and N outputs, and the switching fabric speed is the same as the input linespeed (speedup = 1). If all inputs have packets for only one output, then the fabric can send the packet of only one input to that output, and all but the one matched input will have to queue their packets. This is called input queueing (IQ). If the switching fabric speed is N times the input speed (speedup = N), then the fabric can transfer all N packets from all N ports in the time each port receives one packet, so IQ won't be needed. In practice, the fabric speedup is greater than 1 but less than N, so a small amount of input queueing may still occur. A reasonable speedup (say 2X), a reasonable distribution of traffic from inputs to outputs (not all inputs going to the same output all the time), and a good matching algorithm can avoid too much input queueing.
- What if the first packet in an input queue is headed to a busy output port, but the packets behind it are headed to free output ports? If we have only one queue for all packets destined to all outputs, we get head of line (HoL) blocking: we cannot send the first packet (to the busy output port), so we don't even get to the packets destined to less busy ports. To avoid this, inputs keep different queues for packets destined for different output ports. That is, after a packet arrives and the forwarding table is looked up, the packet is placed in a queue at the input port, with packets destined to different outputs residing in different sub-queues. This is called virtual output queueing (VOQ); a sketch follows this list.
- After crossing the fabric, outputs also have queues, which implement QoS-related scheduling policies, RED and other buffer drop policies, etc. It makes sense to implement these at the output queue because that is where we distribute the outgoing link bandwidth between flows.
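* Here is a minimal sketch of VOQ at one input port, assuming a simple queue per (input, output) pair; the class and method names are invented, not a real router API. Note that nonempty_outputs is exactly the request set this input would send to a crossbar scheduler such as PIM.

    from collections import deque

    # One queue per output, so a packet headed to a busy output never
    # blocks packets behind it that are headed elsewhere.
    class InputPort:
        def __init__(self, num_outputs):
            self.voq = [deque() for _ in range(num_outputs)]

        def enqueue(self, packet, out_port):
            # Called after the forwarding lookup decides the output port.
            self.voq[out_port].append(packet)

        def nonempty_outputs(self):
            # The outputs this input would request in a matching round.
            return {o for o, q in enumerate(self.voq) if q}

        def dequeue(self, out_port):
            # Called when the crossbar matches this input to out_port.
            return self.voq[out_port].popleft()

    port = InputPort(num_outputs=4)
    port.enqueue("pkt-A", out_port=2)   # suppose output 2 is busy
    port.enqueue("pkt-B", out_port=3)   # output 3 is free
    print(port.nonempty_outputs())      # {2, 3}: pkt-B is not blocked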
---------------------

* MPLS (multiprotocol label switching) is a technique to design faster routers. MPLS is not circuit switching: it is designed to work with packet switching and IP datagrams. However, it modifies the forwarding logic for IP datagrams to look more like circuit switching, for improved performance. At some point, it was thought that IP lookup algorithms based on longest prefix match would be too slow to forward data on high speed links, and that a fixed label lookup was needed. This was the initial motivation for MPLS.

* When an IP datagram enters an MPLS-enabled network, the first MPLS edge router attaches a label to the packet. Subsequent MPLS routers switch based on the label, not on the destination IP. That is, each MPLS router along the path maintains a mapping from incoming link and label to outgoing link and label, and swaps labels on packets. MPLS routers are therefore also called label switching routers (LSRs). (Sketches of the label-swap lookup and the header layout follow this section.)

* Where is the MPLS label added? MPLS carries a 20-bit label in a 4-byte header, usually placed between the IP and layer 2 headers, so MPLS is considered "layer 2.5".

* While the initial motivation for MPLS didn't hold up (IP lookup algorithms became fast enough), MPLS has found other uses. Some of these are listed below.
- Traffic engineering: MPLS can be used to "pin" different IP flows headed to the same destination onto different label switched paths (LSPs), to distribute load evenly across ISP backbone networks.
- Fast reroute: MPLS can precompute alternate paths, so that when a link fails, traffic moves to the alternate path before other layer 2/3 protocols recover and find a new one.
- MPLS can be used to build layer 2 and layer 3 VPNs. The concept is similar to IP-in-IP tunneling: to connect two private networks, an IP datagram in the private address space is encapsulated with an MPLS header and tunneled to the other end point, where the MPLS header is removed. MPLS-based VPNs are more efficient, though, because the MPLS header (4 bytes) is smaller than the extra IP header (20 bytes) of IP-in-IP tunneling.
- In general, MPLS has found many uses because the labels can be repurposed to mean different things and serve different purposes.

* How are MPLS labels distributed? It depends on what MPLS is being used for. For simple destination-based forwarding, labels can be distributed along with prefixes as part of routing protocols. For traffic engineering, a reservation protocol like RSVP, together with a shortest path computation under bandwidth constraints, can be used to compute paths and set up labels. For VPN services, labels are distributed with BGP. So distributing labels requires some routing protocol, much like normal IP.
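* To make label switching concrete, here is a minimal sketch of the lookup an LSR performs. The table contents and link names are invented for illustration; the same mechanism, with VCIs in place of labels, is what the ATM switches discussed below implement.

    # An LSR's table maps (incoming link, incoming label) to
    # (outgoing link, outgoing label); forwarding is one exact-match
    # lookup plus a label rewrite, with no longest prefix match.
    lfib = {
        ("eth0", 100): ("eth2", 205),
        ("eth1", 100): ("eth2", 317),   # labels need only be unique per link
    }

    def forward(in_link, in_label):
        out_link, out_label = lfib[(in_link, in_label)]
        return out_link, out_label      # send on out_link with the new label

    print(forward("eth0", 100))         # -> ('eth2', 205)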
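* The 4-byte header layout can be shown with a few lines of bit packing. The field widths below are from RFC 3032; the function name and example values are made up:

    import struct

    # MPLS shim header: 20-bit label, 3-bit traffic class, 1-bit
    # bottom-of-stack flag, 8-bit TTL, packed in network byte order.
    def mpls_header(label, tc=0, s=1, ttl=64):
        assert 0 <= label < (1 << 20)    # label must fit in 20 bits
        word = (label << 12) | (tc << 9) | (s << 8) | ttl
        return struct.pack("!I", word)

    print(mpls_header(label=1000).hex())  # -> '003e8140'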
----------------------

* MPLS borrows ideas heavily from ATM networks. We will not cover ATM in detail, but here is a brief overview.

* Note that we have mainly studied IP as the network layer. IP won out over several competing proposals for the network layer, notably ATM (Asynchronous Transfer Mode). ATM was a networking stack (mostly layers 2 and 3) based on virtual circuits, providing reliability and other QoS guarantees at the network layer. In ATM, data was sent as fixed-length 53-byte frames called "cells", so variable-sized packets had to be broken into cells, and the overhead was high. Before a flow started, a virtual circuit had to be set up, with forwarding state installed along the path during setup; forwarding a cell then required only looking up the incoming virtual circuit identifier (VCI) to determine the outgoing VCI and link. This lookup was considered simpler than a longest prefix match on every packet. However, due to various technical and business reasons, ATM didn't really take off as a network layer technology, and IP (with datagrams, as opposed to virtual circuits) eventually won out as the layer 3 technology of choice. Still, ATM and related ideas have uses at layer 2.

* ATM is based on the concept of virtual circuit switching. When A wants to start a flow to destination D, it sends a setup message that is routed along the correct path to D (say, via ATM switches B and C). After the virtual circuit setup completes, state is established along the path on how to forward data of this circuit. Every subsequent cell carries only a VCI, and intermediate switches forward it using this VCI.

* Note that a VCI need not have global scope (i.e., be unique across the network); a VCI in ATM has only "link local" scope, meaning flows over a given link must have distinct VCIs. So, when a node sends a cell on a link for the first time, it picks a VCI that is not in use on that link. The receiving switch notes the incoming VCI and picks an outgoing link and VCI in the same way. Thus, switches in ATM maintain a table mapping incoming link and incoming VCI to outgoing link and outgoing VCI. Whenever a cell arrives on a link, the ATM switch looks up the entry for the incoming link and VCI, finds the outgoing link for that cell, and rewrites the VCI.

* ATM was considered better than datagram-based IP networks because fixed-length VCIs and cells led to bounded latencies, and circuit setup made it possible to provide guaranteed services. However, the best-effort Internet performed reasonably well, and ATM lost out to IP.

* Note that ATM switches can easily be reused as label switching routers (LSRs).

-------------------

* So far, we have only seen the forwarding functionality of a router. Next, we look at how routing protocols (which populate the forwarding tables of routers) work.

* Some background on link state and distance vector routing protocols.
- Link state routing: "tell everyone about your neighbors". Each node collects information about all its neighbors and the link metrics. This LSA (link state announcement) of every node is flooded through the entire network, so at the end each node has a complete view of the network graph. Each node then independently runs Dijkstra's shortest path algorithm to find the shortest path to every destination, from which it derives the next hop for each destination.
- How are link metrics chosen? They can be simple hop counts, or based on bandwidth, latency, physical distance, etc. Load-sensitive metrics are not preferred because they cause oscillations: if a link becomes loaded and its metric worsens, all flows may move away from it and overload some other link; the first link then looks attractive again, and the flows all move back.
- Distance vector routing: "tell your neighbors about everyone". Every node exchanges a distance vector, a vector containing its estimate of the distance to each destination, with its neighbors. Upon receiving a neighbor's distance vector, a node adds the cost of the link to that neighbor; if a better path through the neighbor is found, it updates its best route.
- Distance vector has the "count to infinity" problem. Simple (partial) fixes are "split horizon" and "poisoned reverse".

----------------

* Further reading
- "A 50-Gb/s IP Router", Craig Partridge et al. A classic paper describing in great detail what it takes to build a 50 Gbps router.
- "Survey and Taxonomy of IP Address Lookup Algorithms", Ruiz-Sanchez et al. Describes various trie-based data structures and other techniques to perform longest prefix match lookups efficiently.
- "The iSLIP Scheduling Algorithm for Input-Queued Switches", Nick McKeown. Describes a crossbar scheduling algorithm that extends the PIM algorithm above. The main idea is that instead of granting to a requester chosen at random, the output ports follow a round-robin policy to serve input ports, ensuring that no input is starved.