Routing protocols - BGP
===============================

Outline
- Hierarchical routing - intra and inter domain
- Need for a new interdomain routing protocol, history of BGP.
- Path vector, AS, provider, customer, peer
- BGP route advertisement and route selection
- iBGP and eBGP
- Implementation details

================================================

* Routing in the Internet is typically hierarchical. Within an organization (Autonomous System or AS), hosts are grouped into subnets based on physical proximity. A subnet is connected by a layer 2 technology (e.g., Ethernet) to its first hop IP router. Multiple IP routers are connected to each other by point-to-point links or Ethernet. All these routers in an organization run an "intra-domain" routing protocol or "Interior Gateway Protocol" (IGP) like OSPF (link state routing protocol) or RIP (distance vector protocol) to compute paths among themselves. So every router in an organization knows how to reach other internal IP destinations.

* What about IP destinations outside the organization? There are special "border" routers at the edges of organizations that connect to other border routers. Internet Service Providers (ISPs) run a bunch of border routers that connect various "stub" organizations and other ISPs. These border routers run "inter-domain" routing protocols (BGP is the de facto standard today). These inter-domain routing protocols determine paths between organizations. The intra-domain and inter-domain routing protocols together fill up the forwarding tables in such a way that every IP router along the path can correctly forward packets.

* Why separate inter-domain and intra-domain?
- For scale. The Internet routing tables will become very bulky if every IP router runs shortest path to every IP prefix. Instead, with the interdomain and intradomain separation, each set of protocols needs to handle less information.
- For policy.
Interdomain routing may not want to do simple shortest path, but may instead follow complex policies based on business deals and trust, as we see below.

* History of BGP. Initially, when the Internet comprised a small number of universities, the edge routers at these organizations just ran a simple routing protocol between them. These edge routers, and a few more routers added to connect them (together called the Internet backbone), were managed by the US government for free. Soon, the Internet expanded, and the Internet backbone was commercialized. We now have Internet Service Providers (ISPs) that run a bunch of BGP routers (and intradomain routers to connect their own networks) that connect various organizations. So the backbone of the Internet is now composed of several ISPs. All the backbone routers have used the current version of BGP as the interdomain routing protocol since the 1990s.

* BGP is a "path vector" protocol. That is, it is based on the distance vector philosophy, where you tell your neighbors about all the destinations you know of. However, you don't just tell the cost, you tell the entire path to the destination. Specifying the entire path avoids routing loops and counting to infinity (because if a neighbor is announcing a path that already goes through you, you won't use that path). Like DV protocols, there is a route announcement phase (where you exchange path vectors) and a best route computation phase (where you decide what your best path is). These two phases differ slightly from those of general DV protocols (that do simple shortest path) in order to incorporate policy.

* What is the granularity at which the path is specified? Do you list all the IP routers on the path? No, that's too messy. Internet paths are listed as a sequence of Autonomous System Numbers (ASNs). An AS is an organization that can be considered as one unit for the purpose of interdomain routing. Every AS has a unique AS number. For example, IITB is an AS. An ISP is an AS.
A large organization that is spread across many locations can have different AS numbers for each part. Routers inside an AS run intradomain routing, and routers across ASes run interdomain routing.

* So what does a path vector announcement in BGP look like? For every prefix, you list the AS path so far, and a few extra attributes for policy. For route selection, you pick the shortest AS path, along with some other decision criteria based on policy.

* ASes are mainly of two types: stub ASes or end-user organizations, and ISPs that provide connectivity between these organizations. Of course, there is a thin line between the two. ISPs are classified into tiers 1, 2, and 3 (informally). AS to AS relationships are of two types: transit and peering.

* A transit relationship exists between ISPs and their customers. A customer pays an ISP some money for Internet connectivity. This means that the ISP takes the responsibility of announcing the customer's prefixes over BGP to the rest of the Internet. Note that advertising a route implies a promise to carry traffic, and traffic flows in the reverse direction of route advertisements. This means that traffic to the customer from other hosts in the Internet will flow via the ISP. Similarly, when the customer sends traffic, the ISP will have routes to several destinations, and will forward the customer's traffic onwards. "Buying service from an ISP" means that the ISP is getting paid for announcing your routes, bringing you traffic, and sending your traffic. This type of provider-customer relationship is also called a transit relationship.

* Consider two organizations that have lots of traffic to each other. Instead of both of them paying a provider to forward traffic to each other, they may seek to establish a connection directly, and forward each other's traffic directly. Such a relationship is called a peering relationship.
Peering is usually between similar ASes that have roughly equal traffic to each other, and does not involve any payments. It is intended to carry traffic between peers by cutting out the middleman ISP.

* Is your ISP always guaranteed to be connected to everyone in the entire Internet, and have routes to every destination? Not in the case of smaller ISPs. Usually, the smaller ISPs buy service from bigger ISPs, which in turn buy from even bigger ISPs, and so on. The ISPs at the top of the hierarchy are called tier-1 ISPs. By definition, a tier-1 ISP does not have any other ISP as its provider. All tier-1 ISPs peer amongst themselves. The customers of tier-1 ISPs are tier-2 ISPs and so on. [Draw an example topology with transit and peering relationships.]

* ASes or BGP routers which have BGP sessions between them are "neighbors" as far as the path vector routing protocol is concerned. We will now revisit the question of what gets advertised to neighbors and how the best path is computed.

* Policy decision: when do you advertise a route? Two rules. First, routes from customers and self routes (routes to destinations within an ISP or organization) are announced to all neighbors - because you want to provide as much visibility as possible. Second, routes learned from peers and providers are announced only to your customers (so your customers receive routes to all destinations you know of).

* What is the logic behind these rules? Let's understand. Why not announce peer or provider routes to other peers and providers? Because you don't want to carry traffic on behalf of your peers and providers (you don't make any money from it). The only announcements that happen are intended to provide reachability to self and customers, and let customers and self reach everyone. No other type of connectivity suits business interests. Contrast this with an intradomain routing protocol whose goal is to provide shortest path connectivity between everyone.

* Policy decision: which routes do you prefer during best route selection?
Even before shortest AS hop count, you have a policy based rule. Prefer customer routes > peer routes > provider routes. Why? If you use a customer route, it means you will send traffic through the customer, and the customer pays for it. If you use peer routes, nothing lost nothing gained. If you use a provider route to send traffic, you pay for it. So use the cheapest option. Typically, routes that come in on a link are marked with an attribute called "localpref" to indicate if it is a customer link / peer link / provider link. This attribute is checked first, before the shortest AS hop count.

* How do typical routes look at the AS level? Typically, you have zero or more customer to provider links, then you hit a peering link or a tier-1 provider, then you go down zero or more provider to customer links to reach the other end point. In general, ASes do not divulge who their providers and customers are. So how do you guess the relationships between ASes? The paper in the references below proposes one heuristic. The paper says that AS paths consist of an "uphill" path of customer->provider links, followed by 0 or 1 peering links, followed by a "downhill" path of provider->customer links. So the paper says that we can look at lots of routing table entries, see all the AS paths, and try to match them to this pattern. How do you identify the "top of the hill"? Identify it as the AS with the largest number of neighbors (signifying a large ISP). Then we can map all links before the top of the hill as customer to provider, and all links after the top as provider to customer. Of course, this is a very rough heuristic, but it can work reasonably well in most cases.

* BGP is between border routers. How does this translate to forwarding tables at the interior routers? BGP sessions are of two types: eBGP (external BGP) between border routers in different organizations, and iBGP (internal BGP) between routers in the same organization.
That is, an organization may have more than one BGP router (each potentially talking to several other BGP routers in different organizations). Each BGP router sends all the external routes it learns to all other internal BGP routers using BGP itself - this is called an iBGP session.

* Note that eBGP and iBGP are both just BGP, with not much difference in the protocol. The main difference is as follows. Every BGP route for a prefix has a value called "next hop". In eBGP sessions, the next hop is simply the IP address of the other BGP endpoint, indicating that if this route is used, traffic should be sent to this next hop. This next hop is updated on eBGP sessions. On the other hand, the next hop over iBGP sessions is simply set to the first BGP router that introduced the route into an organization. That is, if BGP router A sends a BGP route to B over iBGP, which in turn sends it to C, then the next hop of the BGP route will be the IP address of A and not that of B. So, all internal BGP routers will have BGP routes from all BGP routers (typically all BGP routers talk to each other), with the next hop being the BGP router that introduced the route. The internal routers will also be running an intradomain IGP to populate routes to internal destinations. So when a packet comes in for an external destination, the internal router looks up the BGP routing table. It may find several routes from several border BGP routers. It will pick the closest border router based on the next hop. Then the path to the border router will be determined by the IGP / intradomain protocol. So the forwarding is done by combining the intradomain routing tables with the interdomain routing tables.

* Please do not confuse iBGP with IGP. In fact, iBGP sessions are run over TCP. So iBGP messages are themselves put inside IP datagrams that are routed and forwarded using IGP. So in some sense, iBGP runs over IGP.

* Why not just introduce the external routes also into the IGP?
For example, send link state announcements for external routes as well? One reason is scale. The other is that BGP routes have lots of additional information (like, is this a customer route or peer route?) which cannot be sent as part of IGP. So the two are kept separate.

* Some implementation details. BGP runs as a userspace process over TCP on a well known port (179). Two routers that want to become BGP neighbors (also called BGP peers, do not confuse with peering between ASes) open a TCP connection between them. Then, they exchange the path vectors (and other information) for all prefixes. This is called the full table dump. From then on, only updates (announcements, withdrawals, any changes) are communicated. Because BGP runs over TCP, reliable delivery of these messages is taken care of by TCP. Periodically, the peers / neighbors also exchange heartbeat / keepalive messages to let the other end point know they are alive.

* The BGP routing table consists of prefixes followed by a set of attributes for each prefix. Here are some important attributes associated with each prefix:
- Localpref: a local variable (specific to each link) which indicates whether the route came from a customer, provider, or peer.
- AS path: sequence of ASes the route has traversed.
- Next hop: the IP address of the next hop BGP router for this route. If this route is picked as the best route, traffic will be sent to this next hop. For eBGP sessions, the next hop is the BGP peer at the other end of the BGP session. For routes learned over iBGP, the next hop is the border BGP router that first introduced the route into the network.
- Multi-exit discriminator (MED): When two ASes are connected at multiple points, the MED attribute is used to indicate a preference for which connection should be used to send packets. Consider a customer and provider that are connected on the east and west. For a prefix located in the east, the customer will expect to receive traffic at the eastern link, and vice versa.
So it will set the MED attribute differently on the east and west BGP session announcements, so that the provider knows of this preference. Normally, every network tries to get rid of traffic as soon as possible - "hot potato" routing. So traffic for the customer that enters the provider near the western link will be pushed to the customer over the western link. However, with MED, the customer can ask the provider to carry the traffic all the way to the east and hand it over there, thus doing "cold potato" routing. MED is not widely respected, especially if there is no money involved (for example, you needn't honor MED from a peer).

* BGP neighbors exchange their current best route with their peers subject to the policy we discussed earlier (announce customer and self routes to everyone, and all best routes to customers). When updates arrive from neighbors, a BGP router updates its best route for every prefix using the following criteria in this order.
- Pick the route with the highest localpref (prefer customer > peer > provider)
- Shortest AS path preferred
- Lowest MED preferred
- eBGP routes preferred over iBGP (if you are directly connected to the external network, use that instead of going via your own network)
- Lowest IGP path cost to the next hop border router
- Smallest router ID or IP address to break ties.

* Demo: BGP routeviews project. The routeviews project is a publicly available BGP routing table repository. The BGP router at routeviews sets up BGP sessions with several BGP routers in many different ASes, and gets a BGP routing table dump from them. The snapshot of the BGP routing table at this router is periodically stored online. Looking through these tables gives us a view of real BGP routing tables at several vantage points in the Internet.

* Further reading:
- "On Inferring Autonomous System Relationships in the Internet", Lixin Gao.
- Lecture notes on BGP by Prof. Hari Balakrishnan.
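
* Appendix sketch: the export rules and the best-route ordering described in these notes can be written down in a few lines of Python. This is only a rough illustration, not how any real BGP implementation is structured. The names (Route, should_export, best_route), the localpref numbers, the prefix, and the AS numbers are all made up for the example; the notes only require that customer > peer > provider. Real routers also apply further rules that the notes skip (for instance, MED is normally compared only between routes from the same neighboring AS).

```python
from dataclasses import dataclass

# Hypothetical localpref values; only the ordering customer > peer > provider matters.
LOCALPREF = {"customer": 300, "peer": 200, "provider": 100}


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: tuple          # sequence of AS numbers the route has traversed
    learned_from: str       # "customer", "peer" or "provider"
    med: int = 0            # multi-exit discriminator
    via_ebgp: bool = True   # False if the route was learned over iBGP
    igp_cost: int = 0       # IGP cost to the next hop border router
    router_id: int = 0      # final tie breaker


def should_export(learned_from, neighbor):
    """Customer and self routes go to everyone; peer and provider
    routes are announced only to customers."""
    if learned_from in ("customer", "self"):
        return True
    return neighbor == "customer"


def preference_key(route):
    """Smaller key means better route, in the order listed in the notes."""
    return (
        -LOCALPREF[route.learned_from],  # highest localpref first
        len(route.as_path),              # then shortest AS path
        route.med,                       # then lowest MED (simplified)
        0 if route.via_ebgp else 1,      # then eBGP over iBGP
        route.igp_cost,                  # then lowest IGP cost to next hop
        route.router_id,                 # then smallest router ID
    )


def best_route(routes):
    return min(routes, key=preference_key)


# Three candidate routes for the same (made-up) prefix.
candidates = [
    Route("10.0.0.0/8", (65001,), "provider"),
    Route("10.0.0.0/8", (65002, 65003, 65004), "customer"),
    Route("10.0.0.0/8", (65005, 65004), "peer"),
]
best = best_route(candidates)
print(best.learned_from, best.as_path)  # customer (65002, 65003, 65004)
```

Note that the customer route wins even though it has the longest AS path, because localpref is checked before AS path length. Similarly, should_export("peer", "peer") is False: a route learned from one peer is not announced to another peer.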