Router Design
==================================

Outline
- Recap of IP router architectures
- MPLS
- Intro to routing protocols

* Main components of a commercial router:
- Input ports: perform L1/L2 functions, then destination IP lookup (in coordination with the forwarding engine), update headers, and send the packet to the output ports via the switching fabric. Each input port may have a separate forwarding engine to achieve high speed lookups. Several input ports are placed on a single line card.
- Forwarding engine: performs a longest prefix match using the destination IP and the forwarding table, and returns the output port / link / interface the packet has to leave on. [In a simple Linux router, this is handled by the CPU.]
- Switching fabric: responsible for transferring packets from input ports to output ports.
- Output ports: handle L1/L2 processing during transmission. Also responsible for implementing any scheduling policy.
- Routing processor: runs routing protocols and computes forwarding tables.

* Budget to process packets: consider a 10 Gbps link with 64-byte packets. The rate in packets per second (pps) is 10*10^9 / (64*8), approximately 20 million pps. That is, we have 1/(20 million) s, or about 50 nanoseconds, per packet (a quick sanity check of this arithmetic appears at the end of this section). In this time, we must look up the destination and transfer the packet to the output port. These two tasks form the bottleneck of any router design.

* IP forwarding algorithms: the algorithms responsible for performing a longest prefix match (LPM) on the destination address, run by the forwarding engine. Why is this a hard problem? With classful addressing, there were only 3 possible prefix lengths, so lookup was easy. Now, with CIDR, a prefix can have any length, so we may have to look up multiple routing table entries. With a 50 ns budget per packet, and typical memory access times of a few nanoseconds, we can afford only a handful of memory accesses per packet.

* First, why is one address covered by multiple prefixes? There are several reasons (we will study them in detail later). One of them is called prefix hole punching: the ISP holds a larger (shorter) prefix and gives a smaller (longer) prefix out of it to a customer. The customer might then announce this longer prefix through other ISPs. So the routing tables contain both the shorter prefix from the ISP and the longer prefix from the customer, and we must match on the longer one for correctness.

* What is the simplest data structure to implement a forwarding table and do LPM? A trie (pronounced "try"). A trie is a tree-like structure in which every node has 2 branches, for 0 and 1. The nodes at any level hold information about the prefix formed by walking down the tree to that level. Branches that do not contain any prefixes are pruned. To perform LPM, walk down the tree as far as possible, remembering the last prefix seen (a minimal sketch appears at the end of this section). This method can take O(N) lookups, where N is the length of the IP address (32 for IPv4), and each level of the tree may live in a separate memory location, so too many memory accesses. Improvements to trie-based lookup algorithms compress portions of the tree that have few branches (path-compressed tries).

* Another method is to search by prefix length. Organize prefixes into different sets based on length, then do a binary search over the lengths to find the longest one that matches. For example, for a 32-bit address, start by searching among prefixes of length 16; if a match is found, continue with prefix lengths in the range 16-32, and so on. This takes O(log N) steps (a simplified sketch appears below as well). Yet another method is to represent prefixes as intervals and find the shortest interval that contains the destination address.
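* A quick sanity check of the packet budget arithmetic above, as a Python sketch (it ignores inter-packet framing overhead, so real numbers are slightly lower):

    # Back-of-the-envelope budget: 10 Gbps link, minimum-size 64-byte packets.
    link_rate_bps = 10e9
    packet_bits = 64 * 8

    pps = link_rate_bps / packet_bits   # ~19.5 million packets per second
    budget_ns = 1e9 / pps               # ~51 ns to handle each packet

    print(f"{pps / 1e6:.1f} Mpps, {budget_ns:.1f} ns per packet")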
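* The binary trie can be made concrete with a short sketch. This is a minimal illustration of the idea, not an optimized implementation; the names (Node, insert, lpm) and the port values are invented here. It also reproduces the hole-punching example: the customer's longer prefix wins.

    # Minimal binary trie for longest prefix match over 32-bit addresses.
    class Node:
        def __init__(self):
            self.children = [None, None]   # branches for bit 0 and bit 1
            self.port = None               # output port if a prefix ends here

    def insert(root, prefix, length, port):
        node = root
        for i in range(length):
            bit = (prefix >> (31 - i)) & 1
            if node.children[bit] is None:
                node.children[bit] = Node()
            node = node.children[bit]
        node.port = port

    def lpm(root, addr):
        node, best = root, None
        for i in range(32):                # walk down as far as possible,
            bit = (addr >> (31 - i)) & 1   # remembering the last prefix seen
            node = node.children[bit]
            if node is None:
                break
            if node.port is not None:
                best = node.port
        return best

    root = Node()
    insert(root, 0x0A000000, 8, "port1")   # 10.0.0.0/8, ISP's shorter prefix
    insert(root, 0x0A010000, 16, "port2")  # 10.1.0.0/16, customer's longer prefix
    print(lpm(root, 0x0A010203))           # 10.1.2.3 -> port2, the longer match

Each iteration of the lookup loop may chase a pointer into a different memory location, which is exactly why a plain trie can cost up to 32 memory accesses per IPv4 lookup.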
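* The search-by-length idea can also be sketched with one hash table per prefix length. A true binary search over lengths needs extra precomputed "marker" entries to decide which half to continue in; this simplified sketch just scans lengths from longest to shortest, which is enough to show the data layout:

    # One hash table per prefix length, keyed by the masked address.
    tables = {8: {0x0A000000: "port1"}, 16: {0x0A010000: "port2"}}

    def lpm_by_length(addr):
        for length in sorted(tables, reverse=True):        # longest first
            key = addr & ((0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF)
            if key in tables[length]:
                return tables[length][key]                 # first hit is the LPM
        return None

    print(lpm_by_length(0x0A010203))                       # -> port2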
A lot of work has been done on efficient algorithms using all these approaches.

* Now we move on to switching fabric designs.
- Switching fabric using shared memory: what is the switching fabric in a simple Linux router? Main memory. You copy the packet from the device driver to memory, perform routing, and copy it to another device driver. So the connection between input and output ports happens via shared memory. We use direct memory access (DMA) to copy to and from the device drivers without involving the CPU, so the CPU is involved only in the forwarding decision. Still, we are limited by the memory speed. Consider a memory with a 133 MHz clock and a 64-bit bus: 133*10^6 * 64 is approximately 8.5 Gbps. Since each packet needs two copies (in and out), you can only forward at roughly 4 Gbps in ideal settings, across all ports.
- Switching fabric using a shared bus: older router designs had a shared bus from input ports to output ports, so a packet went over the bus only once. So you can get 2X better performance than with shared memory for the same I/O bus speed. We also need to cope with the bus being busy (only one packet transfer at a time). Suppose you have a switch with 10 input ports and 10 output ports of 1 Gbps each, but a 1 Gbps shared bus. Then the total traffic through the switch is only 1 Gbps, not 10 Gbps.
- Switching fabric with a crossbar: instead of one I/O bus connecting N input ports and N output ports, we can have a crossbar of 2N buses, which can be configured to connect any input port to any output port. Every input will have packets for some outputs, and a bipartite matching is done from inputs to outputs to configure the crossbar for each round. In the matching, every input can be paired with only one output, and every output with only one input. A crossbar-based switch needs efficient algorithms to perform this matching. A crossbar can potentially transfer N packets from N input ports to N output ports in parallel, so in the best case it can keep the inputs and outputs fully utilized. The crossbar is the most common fabric design used in high speed routers today.

* The speedup of the crossbar is defined as the speed of the crossbar I/O bus relative to the input port linespeed. For example, if the inputs are 1 Gbps links, and the crossbar can potentially transfer 10 Gbps between each pair of matched input and output in each round of the crossbar schedule, then the speedup is 10. In practice, you need a speedup > 1 to achieve good performance. Of course, the actual performance depends on which inputs and outputs are popular, how efficiently you can match, etc. For example, if all inputs always send to the same output, then the crossbar cannot schedule many packets in parallel, and you will get degraded performance even if the speedup is high.

* A simple matching algorithm for a crossbar switch: parallel iterative matching (PIM). Each iteration of the algorithm has 3 steps: request, grant, accept. Every input sends a request specifying which outputs it has packets for. Among multiple requests, each output selects one at random and sends a grant to that input. Each input may receive several grants, and picks one among them to accept. After one iteration, the unmatched inputs and outputs participate in the next iteration to add more matches (a one-iteration sketch follows). PIM is not guaranteed to give the best possible match every time, but works reasonably well in practice. Several improvements have been made in scheduling / matching algorithms for crossbars.
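* A one-iteration sketch of PIM in Python. This is illustrative only (real schedulers run a few iterations per round, in hardware); representing each input's queues as the set of outputs it has packets for is an assumption made here:

    import random

    # voqs[i] is the set of outputs that input i has packets for.
    def pim_iteration(voqs, free_inputs, free_outputs):
        # Request: each unmatched input asks for every output it has packets for.
        requests = {o: [i for i in free_inputs if o in voqs[i]]
                    for o in free_outputs}
        # Grant: each output picks one of its requesters at random.
        grants = {}
        for o, reqs in requests.items():
            if reqs:
                grants.setdefault(random.choice(reqs), []).append(o)
        # Accept: each input picks one of the grants it received at random.
        return {i: random.choice(outs) for i, outs in grants.items()}

    voqs = {0: {0, 1}, 1: {0}, 2: {2}}
    print(pim_iteration(voqs, free_inputs={0, 1, 2}, free_outputs={0, 1, 2}))
    # e.g. {0: 1, 1: 0, 2: 2}; unmatched ports retry in the next iteration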
* Now, recall we discussed buffers at routers, buffer drops, scheduling, etc. Where do all of these happen in the router? At the input ports or the output ports?
- Input ports need to have queues, since the switching fabric may not always be available to transport a packet. Suppose there are N inputs and N outputs, and the switching fabric speed is the same as the input linespeed (speedup = 1). If all inputs have packets for only one output, then the fabric can send the packet of only one input to that output, and all but the one matched input will have to queue their packets. This is called input queueing (IQ). If the switching fabric speed is N times the input speed (speedup = N), then the fabric can transfer all N packets from all N ports in the time each port receives one packet, so IQ won't be needed. In practice, the fabric speedup is greater than 1 but less than N, so a small amount of input queueing may still occur. A reasonable speedup (say 2X), a reasonable distribution of traffic from inputs to outputs (not all inputs going to the same output all the time), and a good matching algorithm can avoid too much input queueing.
- What if the first packet in an input queue is headed to a busy output port, but the packets behind it are headed to free output ports? If we have only one queue for all packets destined to all outputs, we get head of line (HoL) blocking: we cannot send the first packet (to the busy output port), so we don't even get to the packets destined to less busy ports. To avoid this, inputs keep different queues for packets destined for different output ports. That is, after a packet arrives and the forwarding table is looked up, the packet is placed in a queue at the input port, with packets destined to different outputs residing in different sub-queues. This is called virtual output queueing (VOQ); a sketch follows this list.
- After crossing the fabric, outputs also have queues, which implement QoS-related scheduling policies, RED and other buffer drop policies, etc. It makes sense to implement these at the output queue because that is where we distribute the outgoing link bandwidth between flows.
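* Here is a minimal sketch of VOQ at one input port, assuming a simple queue per (input, output) pair; the class and method names are invented, not a real router API. Note that nonempty_outputs is exactly the request set this input would send to a crossbar scheduler such as PIM.

    from collections import deque

    # One queue per output, so a packet headed to a busy output never
    # blocks packets behind it that are headed elsewhere.
    class InputPort:
        def __init__(self, num_outputs):
            self.voq = [deque() for _ in range(num_outputs)]

        def enqueue(self, packet, out_port):
            # Called after the forwarding lookup decides the output port.
            self.voq[out_port].append(packet)

        def nonempty_outputs(self):
            # The outputs this input would request in a matching round.
            return {o for o, q in enumerate(self.voq) if q}

        def dequeue(self, out_port):
            # Called when the crossbar matches this input to out_port.
            return self.voq[out_port].popleft()

    port = InputPort(num_outputs=4)
    port.enqueue("pkt-A", out_port=2)   # suppose output 2 is busy
    port.enqueue("pkt-B", out_port=3)   # output 3 is free
    print(port.nonempty_outputs())      # {2, 3}: pkt-B is not blocked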
---------------------

* MPLS (multiprotocol label switching) is a technique to design faster routers. MPLS is not circuit switching: it is designed to work with packet switching and IP datagrams. However, it modifies the forwarding logic for IP datagrams to look more like circuit switching, for improved performance. At some point, it was thought that IP lookup algorithms based on longest prefix match would be too slow to forward data on high speed links, and that a fixed label lookup was needed. This was the initial motivation for MPLS.

* When an IP datagram enters an MPLS-enabled network, the first MPLS edge router attaches a label to the packet. Subsequent MPLS routers switch based on the label, not on the destination IP. That is, each MPLS router along the path maintains a mapping from incoming link and label to outgoing link and label, and swaps labels on packets. MPLS routers are therefore also called label switching routers (LSRs). (Sketches of the label-swap lookup and the header layout follow this section.)

* Where is the MPLS label added? MPLS carries a 20-bit label in a 4-byte header, usually placed between the IP and layer 2 headers, so MPLS is considered "layer 2.5".

* While the initial motivation for MPLS didn't hold up (IP lookup algorithms became fast enough), MPLS has found other uses. Some of these are listed below.
- Traffic engineering: MPLS can be used to "pin" different IP flows headed to the same destination onto different label switched paths (LSPs), to distribute load evenly across ISP backbone networks.
- Fast reroute: MPLS can precompute alternate paths, so that when a link fails, traffic moves to the alternate path before other layer 2/3 protocols recover and find a new one.
- MPLS can be used to build layer 2 and layer 3 VPNs. The concept is similar to IP-in-IP tunneling: to connect two private networks, an IP datagram in the private address space is encapsulated with an MPLS header and tunneled to the other end point, where the MPLS header is removed. MPLS-based VPNs are more efficient, though, because the MPLS header (4 bytes) is smaller than the extra IP header (20 bytes) of IP-in-IP tunneling.
- In general, MPLS has found many uses because the labels can be repurposed to mean different things and serve different purposes.

* How are MPLS labels distributed? It depends on what MPLS is being used for. For simple destination-based forwarding, labels can be distributed along with prefixes as part of routing protocols. For traffic engineering, a reservation protocol like RSVP, together with a shortest path computation under bandwidth constraints, can be used to compute paths and set up labels. For VPN services, labels are distributed with BGP. So distributing labels requires some routing protocol, much like normal IP.
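* To make label switching concrete, here is a minimal sketch of the lookup an LSR performs. The table contents and link names are invented for illustration; the same mechanism, with VCIs in place of labels, is what the ATM switches discussed below implement.

    # An LSR's table maps (incoming link, incoming label) to
    # (outgoing link, outgoing label); forwarding is one exact-match
    # lookup plus a label rewrite, with no longest prefix match.
    lfib = {
        ("eth0", 100): ("eth2", 205),
        ("eth1", 100): ("eth2", 317),   # labels need only be unique per link
    }

    def forward(in_link, in_label):
        out_link, out_label = lfib[(in_link, in_label)]
        return out_link, out_label      # send on out_link with the new label

    print(forward("eth0", 100))         # -> ('eth2', 205)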
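* The 4-byte header layout can be shown with a few lines of bit packing. The field widths below are from RFC 3032; the function name and example values are made up:

    import struct

    # MPLS shim header: 20-bit label, 3-bit traffic class, 1-bit
    # bottom-of-stack flag, 8-bit TTL, packed in network byte order.
    def mpls_header(label, tc=0, s=1, ttl=64):
        assert 0 <= label < (1 << 20)    # label must fit in 20 bits
        word = (label << 12) | (tc << 9) | (s << 8) | ttl
        return struct.pack("!I", word)

    print(mpls_header(label=1000).hex())  # -> '003e8140'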
----------------------

* MPLS borrows ideas heavily from ATM networks. We will not cover ATM in detail, but here is a brief overview.

* Note that we have mainly studied IP as the network layer. IP won out over several competing proposals for the network layer, notably ATM (Asynchronous Transfer Mode). ATM was a networking stack (mostly layers 2 and 3) based on virtual circuits, providing reliability and other QoS guarantees at the network layer. In ATM, data was sent as fixed-length 53-byte frames called "cells", so variable-sized packets had to be broken into cells, and the overhead was high. Before a flow started, a virtual circuit had to be set up, with forwarding state installed along the path during setup; forwarding a cell then required only looking up the incoming virtual circuit identifier (VCI) to determine the outgoing VCI and link. This lookup was considered simpler than a longest prefix match on every packet. However, due to various technical and business reasons, ATM didn't really take off as a network layer technology, and IP (with datagrams, as opposed to virtual circuits) eventually won out as the layer 3 technology of choice. Still, ATM and related ideas have uses at layer 2.

* ATM is based on the concept of virtual circuit switching. When A wants to start a flow to destination D, it sends a setup message that is routed along the correct path to D (say, via ATM switches B and C). After the virtual circuit setup completes, state is established along the path on how to forward data of this circuit. Every subsequent cell carries only a VCI, and intermediate switches forward it using this VCI.

* Note that a VCI need not have global scope (i.e., be unique across the network); a VCI in ATM has only "link local" scope, meaning flows over a given link must have distinct VCIs. So, when a node sends a cell on a link for the first time, it picks a VCI that is not in use on that link. The receiving switch notes the incoming VCI and picks an outgoing link and VCI in the same way. Thus, switches in ATM maintain a table mapping incoming link and incoming VCI to outgoing link and outgoing VCI. Whenever a cell arrives on a link, the ATM switch looks up the entry for the incoming link and VCI, finds the outgoing link for that cell, and rewrites the VCI.

* ATM was considered better than datagram-based IP networks because fixed-length VCIs and cells led to bounded latencies, and circuit setup made it possible to provide guaranteed services. However, the best-effort Internet performed reasonably well, and ATM lost out to IP.

* Note that ATM switches can easily be reused as label switching routers (LSRs).

-------------------

* So far, we have only seen the forwarding functionality of a router. Next, we look at how routing protocols (which populate the forwarding tables of routers) work.

* Some background on link state and distance vector routing protocols.
- Link state routing: "tell everyone about your neighbors". Each node collects information about all its neighbors and the link metrics. This LSA (link state announcement) of every node is flooded through the entire network, so at the end each node has a complete view of the network graph. Each node then independently runs Dijkstra's shortest path algorithm to find the shortest path to every destination, from which it derives the next hop for each destination.
- How are link metrics chosen? They can be simple hop counts, or based on bandwidth, latency, physical distance, etc. Load-sensitive metrics are not preferred because they cause oscillations: if a link becomes loaded and its metric worsens, all flows may move away from it and overload some other link; the first link then looks attractive again, and the flows all move back.
- Distance vector routing: "tell your neighbors about everyone". Every node exchanges a distance vector, a vector containing its estimate of the distance to each destination, with its neighbors. Upon receiving a neighbor's distance vector, a node adds the cost of the link to that neighbor; if a better path through the neighbor is found, it updates its best route.
- Distance vector has the "count to infinity" problem. Simple (partial) fixes are "split horizon" and "poisoned reverse".

----------------

* Further reading
- "A 50-Gb/s IP Router", Craig Partridge et al. A classic paper describing in great detail what it takes to build a 50 Gbps router.
- "Survey and Taxonomy of IP Address Lookup Algorithms", Ruiz-Sanchez et al. Describes various trie-based data structures and other techniques to perform longest prefix match lookups efficiently.
- "The iSLIP Scheduling Algorithm for Input-Queued Switches", Nick McKeown. Describes a crossbar scheduling algorithm that extends the PIM algorithm above. The main idea is that instead of granting to a requester chosen at random, the output ports follow a round-robin policy to serve input ports, ensuring that no input is starved.