Routing protocols - BGP
===============================

Outline
- Hierarchical routing - intra and inter domain
- Need for a new interdomain routing protocol, history of BGP.
- Path vector, AS, provider, customer, peer
- BGP route advertisement and route selection
- iBGP and eBGP
- Implementation details

================================================

* Routing in the Internet is typically hierarchical. Typically in an
  organization (Autonomous System or AS), hosts are grouped into
  subnets based on physical proximity. A subnet is connected by a
  layer 2 technology (e.g., Ethernet) to its first hop IP
  router. Multiple IP routers are connected to each other by
  point-to-point links or Ethernet. All these routers in an
  organization run an "intra-domain" routing protocol or "Interior
  Gateway Protocol" (IGP) like OSPF (link state routing protocol) or
  RIP (distance vector protocol) to compute paths among
  themselves. As a result, every router in an organization knows how
  to reach other internal IP destinations.

* What about IP destinations outside the organization? There are
  special "border" routers at the edges of organizations that connect
  to other border routers. Internet Service Providers (ISPs) run a
  set of border routers that connect various "stub" organizations and
  other ISPs. These border routers run "inter-domain" routing
  protocols (BGP is the de facto standard today). These inter-domain
  routing protocols determine paths between organizations. The
  intra-domain and inter-domain routing protocols together fill up the
  forwarding tables in such a way that every IP router along the path
  can correctly route packets.

* Why separate inter-domain and intra-domain? 

- For scale. The Internet routing tables would become very bulky if
  every IP router ran shortest path to every IP prefix. Instead, with
  the interdomain and intradomain separation, each set of protocols
  needs to handle less information.

- For policy. Interdomain routing may not want to do simple shortest
  path, but complex policies based on business deals and trust, as we
  see below.

* History of BGP. Initially, when the Internet consisted of a small
  number of universities, the edge routers at these organizations just
  ran a simple routing protocol between them. These edge routers and a
  few more routers added to connect them (together called the
  Internet backbone) were managed by the US government for free. Soon,
  the Internet expanded, and the Internet backbone was
  commercialized. We now have Internet Service Providers (ISPs) that
  run a set of BGP routers (and intradomain routers to connect their
  own organization) that connect various organizations. So the
  backbone of the Internet is now composed of several ISPs. All the
  backbone routers have used the current version of BGP as the
  interdomain routing protocol since the 1990s.

* BGP is a "path vector" protocol. That is, it is based on the
  distance vector philosophy, where you tell your neighbors about all
  the destinations you know of. However, you don't just tell the cost,
  but you tell the entire path to the destination. This addition of
  specifying the entire path avoids routing loops and counting to
  infinity (because if a neighbor is announcing a path through you
  already, you won't use that path). Like DV protocols, there is a
  route announcement phase (where you exchange path vectors) and a
  best route computation phase (where you decide what your best path
  is). These two phases have slight differences with general DV
  protocols (that do simple shortest path) in order to incorporate
  policy.
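
* A minimal Python sketch of the loop check described above. The
  function names and AS numbers are made up for illustration; a real
  BGP speaker does much more than this.

```python
def accept_announcement(my_asn, as_path):
    """Accept a path only if our own AS number is not already on it."""
    return my_asn not in as_path


def readvertise(my_asn, as_path):
    """Prepend our own AS number before passing the path on to a neighbor."""
    return [my_asn] + list(as_path)


# AS 100 hears a path that already goes through AS 100: reject it.
print(accept_announcement(100, [200, 100, 300]))  # False
# A path that does not contain AS 100 is loop-free from our point of view.
print(accept_announcement(100, [200, 300]))       # True
# When AS 100 passes the route on, the path grows by one AS.
print(readvertise(100, [200, 300]))               # [100, 200, 300]
```

  Because the whole path travels with the route, the check is a simple
  membership test; a plain distance vector protocol only sees a cost
  and cannot tell that the route loops back through itself.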

* What is the granularity at which the path is specified? Do you list
  all the IP routers on the path? No, that's too messy. The Internet
  paths are listed as sequences of Autonomous System Numbers (ASNs). An
  AS is an organization that can be considered as one unit for the
  purpose of interdomain routing. Every AS has a unique AS number. For
  example, IITB is an AS. An ISP is an AS. A large organization that
  is spread across many locations can have different AS numbers for
  each part. Routers inside an AS run intradomain routing, and routers
  across ASes run interdomain routing.

* So what does a path vector announcement in BGP look like? For every
  prefix, you list the AS path so far, and a few extra attributes for
  policy. For route selection, you pick the shortest AS path, along
  with some other decision criteria based on policy. 

* ASes are mainly of two types: stub ASes or end-user organizations,
  and ISPs that provide connectivity between these organizations. Of
  course, there is a thin line between the two. ISPs are classified
  into tiers 1, 2, and 3 (informally). AS to AS relationships are of
  two types: transit and peering.

* Transit relationship exists between ISPs and their customers. A
  customer pays an ISP some money for Internet connectivity. This
  means that the ISP takes the responsibility of announcing the
  customer's prefixes over BGP to the rest of the Internet. Note that
  advertising a route implies a promise to carry traffic, and traffic
  flows in the reverse direction of route advertisements. This means
  that traffic to the customer from other hosts in the Internet will
  flow via the ISP. Similarly, when the customer sends traffic, the
  ISP will have routes to several destinations, and will forward the
  customer's traffic onwards. "Buying service from an ISP" means that
  the ISP is getting paid for announcing your routes, bringing you
  traffic, and sending your traffic. This type of a provider-customer
  relationship is also called a transit relationship.

* Consider two organizations that have lots of traffic to each
  other. Instead of both of them paying a provider to forward traffic
  to each other, they may seek to establish a connection directly, and
  forward each other's traffic directly. Such a relationship is called
  a peering relationship. Peering is usually between similar ASes that
  have roughly equal traffic to each other, and does not involve any
  payments. It is intended to carry traffic between peers by cutting
  out the middleman ISP.

* Is your ISP always guaranteed to be connected to everyone in the
  entire Internet, and to have routes to every destination? Not in the
  case of smaller ISPs. Usually, the smaller ISPs buy service from
  bigger ISPs, which buy from still bigger ISPs, and so on. The ISPs at the top of
  the hierarchy are called tier-1 ISPs. By definition, a tier-1 ISP
  does not have any other ISP as its provider. All tier-1 ISPs peer
  amongst themselves. The customers of tier-1 ISPs are tier-2 ISPs and
  so on. [Draw an example topology with transit and peering
  relationships.]

* ASes or BGP routers which have BGP sessions between them are
  "neighbors" as far as the path vector routing protocol is
  concerned. We will now revisit the question of what gets advertised
  to neighbors and how the best path is computed.

* Policy decision: when do you advertise a route? Two rules. First,
  routes from customers and self routes (routes to destinations within
  an ISP or organization) are announced to all neighbors - because you
  want to provide as much visibility as possible. Second, routes that
  you learn from your peers and providers are announced only to your
  customers.

* What is the logic behind these rules? Let's understand. Why not
  announce peer or provider routes to other peers and providers?
  Because you don't want to carry traffic on behalf of your peers and
  providers (you don't make any money from it). The only announcements
  that happen are intended to provide reachability to self and
  customers, and let customers and self reach everyone. No other type
  of connectivity suits business interests. Contrast this with an
  intradomain routing protocol whose goal is to provide shortest path
  connectivity between everyone.
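
* The two advertisement rules can be written as a small Python
  sketch. The relationship labels and the function name are
  assumptions made for illustration, not part of BGP itself.

```python
CUSTOMER, PEER, PROVIDER, SELF = "customer", "peer", "provider", "self"


def should_export(learned_from, neighbor):
    """Decide whether a route is announced to a given neighbor.

    learned_from: where the route came from (SELF for our own prefixes).
    neighbor:     our relationship to the neighbor we might announce to.
    """
    # Rule 1: self routes and customer routes go to all neighbors.
    if learned_from in (SELF, CUSTOMER):
        return True
    # Rule 2: peer and provider routes go only to customers.
    return neighbor == CUSTOMER


for source in (SELF, CUSTOMER, PEER, PROVIDER):
    for neighbor in (CUSTOMER, PEER, PROVIDER):
        print(source, "->", neighbor, should_export(source, neighbor))
```

  The only combinations that return False are peer or provider routes
  offered to another peer or provider, which is exactly the traffic
  nobody pays you to carry.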

* Policy decision: which routes do you prefer during best route
  selection? Even before shortest AS hop count, you have a policy
  based rule. Prefer customer routes > peer routes > provider
  routes. Why? If you use customer route, it means you will send
  traffic through customer, and customer pays for it. If you use peer
  routes, nothing lost nothing gained. If you use provider route to
  send traffic, you pay for it. So use the cheapest option. Typically,
  routes that come in on a link are marked with an attribute called
  "localpref" to indicate if it is a customer link / peer link /
  provider link. This attribute is checked first before seeing
  shortest AS hop count.

* How do typical routes look at AS level? Typically, you have zero or
  more customer to provider links, then you hit a peering link or a
  tier-1 provider, then you go down zero or more provider to customer
  links to reach the other end point. In general, ASes do not divulge
  who their providers and customers are. So how do you guess the
  relationships between ASes? The paper in the references below
  proposes one heuristic. The paper says that AS paths consist of an
  "uphill" path of customer->provider links, followed by 0 or 1 peering
  links, followed by a "downhill" path of provider->customer links. So
  the paper says that we can look at lots of routing table entries,
  see all the AS paths, and try to match them to this pattern. How do
  you identify the "top of the hill"? Identify it as the AS with the
  largest number of neighbors (signifying a large ISP). Then we can
  map all links before the top of hill as customer to provider, and
  all links after the top as provider to customer. Of course, this is
  a very rough heuristic, but can work reasonably well in most cases.
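
* A simplified Python sketch of the "top of the hill" idea. This is
  not the algorithm from the paper: it ignores peering links entirely
  and breaks degree ties arbitrarily. The function name and the toy
  AS paths are assumptions for illustration.

```python
from collections import defaultdict


def infer_customer_provider(as_paths):
    """Return {(customer, provider): votes} inferred from a list of AS paths."""
    # Degree of an AS = number of distinct neighbors seen across all paths.
    neighbors = defaultdict(set)
    for path in as_paths:
        for a, b in zip(path, path[1:]):
            neighbors[a].add(b)
            neighbors[b].add(a)
    degree = {asn: len(peers) for asn, peers in neighbors.items()}

    votes = defaultdict(int)
    for path in as_paths:
        if len(path) < 2:
            continue  # a single AS tells us nothing about links
        # Top of the hill: position of the AS with the most neighbors.
        top = max(range(len(path)), key=lambda i: degree[path[i]])
        for i, (a, b) in enumerate(zip(path, path[1:])):
            if i < top:
                votes[(a, b)] += 1  # uphill link: a is a customer of b
            else:
                votes[(b, a)] += 1  # downhill link: b is a customer of a
    return dict(votes)


# Toy data: AS 10 sits in the middle of every path, so it has the highest degree.
paths = [[1, 10, 2], [3, 10, 1], [2, 10, 3]]
print(infer_customer_provider(paths))
```

  On the toy data every stub AS (1, 2, 3) is inferred to be a customer
  of AS 10. Counting votes per link, rather than deciding from a single
  path, leaves room to discard links on which different paths disagree.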

* BGP is between border routers. How does this translate to forwarding
  table at the interior routers? BGP sessions are of two types: eBGP
  (external BGP) between border routers in different organizations,
  and iBGP (internal BGP) between routers in the same
  organization. That is, an organization may have more than one BGP
  router (each potentially talking to several other BGP routers in
  different organizations). Each BGP router sends all the external
  routes it learns to all other internal BGP routers using BGP itself
  - this is called an iBGP session. 

* Note that eBGP and iBGP are both just BGP, not much difference in
  the protocol. The main difference is as follows. Every BGP route for
  a prefix has a value called "next hop". In eBGP sessions, the next
  hop is simply the IP address of the other BGP endpoint, indicating
  that if this route is used, traffic should be sent to this next
  hop. This next hop is updated on eBGP sessions. On the other hand,
  the next hop over iBGP sessions is simply set to the first BGP
  router that introduced the route into an organization. That is, if
  BGP router A sends a BGP route to B over iBGP, which in turn sends
  it to C, then the next hop of the BGP route will be IP address of A
  and not that of B. So, all internal BGP routers will have BGP routes
  from all BGP routers (typically all BGP routers talk to each other),
  with next hop being the BGP router that introduced the route. The
  internal routers will also be running an intradomain IGP to populate
  routes to internal destinations. So when a packet comes in for the
  external destination, the internal router looks up the BGP routing
  table. It may find several routes from several border BGP
  routers. It will pick the closest border router based on the next
  hop. Then the path to the border router will be determined by the
  IGP / intradomain protocol. So the forwarding is done by combining
  the intradomain routing tables with the interdomain routing tables.
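
* A small Python sketch of how an internal router might combine the
  two tables. The table layouts, router names, and costs are
  hypothetical, and the sketch assumes the candidate BGP routes are
  otherwise equally good, so only the IGP distance matters.

```python
def pick_border_router(bgp_next_hops, igp_cost):
    """Among the border routers offering a route, pick the closest by IGP cost."""
    return min(bgp_next_hops, key=lambda router: igp_cost[router])


def forwarding_next_hop(prefix, bgp_table, igp_cost, igp_next_hop):
    """Resolve an external prefix to the neighbor this router forwards to."""
    border = pick_border_router(bgp_table[prefix], igp_cost)
    # The IGP already knows which neighbor leads towards that border router.
    return igp_next_hop[border]


# BGP (learned over iBGP): border routers A and B both have a route.
bgp_table = {"203.0.113.0/24": ["A", "B"]}
# IGP: cost to reach each border router, and the first hop towards it.
igp_cost = {"A": 30, "B": 10}
igp_next_hop = {"A": "R1", "B": "R2"}

print(forwarding_next_hop("203.0.113.0/24", bgp_table, igp_cost, igp_next_hop))  # R2
```

  BGP answers "which border router?", and the IGP answers "how do I
  get to that border router?".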

* Please do not confuse iBGP with IGP. In fact, iBGP sessions are run
  over TCP. So iBGP messages are themselves put inside IP datagrams
  that are routed and forwarded using IGP. So in some sense, iBGP runs
  over IGP.

* Why not just introduce the external routes also into the IGP? For
  example, send link state announcements for external routes as well?
  One reason is scale. The other is that BGP routes have lots of
  additional information (like, is this a customer route or peer
  route?) which cannot be sent as part of IGP. So the two are kept
  separate.

* Some implementation details. BGP runs as a userspace process over
  TCP on a well known port (179). Two routers that want to become BGP
  neighbors (also called BGP peers, do not confuse with peering
  between ASes) open a TCP connection between them. Then, they
  exchange the path vectors (and other information) for all
  prefixes. This is called the full table dump. From then on, only
  updates (announcements, withdrawals, any changes) are
  communicated. Because BGP runs over TCP, reliability is
  assured. Periodically, the peers / neighbors also exchange heartbeat
  / keepalive messages to let the other end point know they are alive. 
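
* A toy Python sketch of the "full table dump, then only updates"
  behaviour, seen from the receiving side. The tuple format for
  updates is invented for illustration and is not the BGP message
  format; prefixes and AS numbers are made up.

```python
def apply_update(table, update):
    """Apply one announcement or withdrawal to the routes learned from a neighbor."""
    kind, prefix, as_path = update
    if kind == "announce":
        table[prefix] = as_path   # new route, or replacement of the old one
    elif kind == "withdraw":
        table.pop(prefix, None)   # the neighbor no longer offers this route
    else:
        raise ValueError(f"unknown update kind: {kind}")


table = {}

# Right after the session comes up: the neighbor sends everything it has.
full_dump = [
    ("announce", "10.1.0.0/16", (65001,)),
    ("announce", "10.2.0.0/16", (65001, 65002)),
]
# Later: only the changes are sent.
later_updates = [
    ("withdraw", "10.1.0.0/16", None),
    ("announce", "10.2.0.0/16", (65001, 65003, 65002)),
]

for update in full_dump + later_updates:
    apply_update(table, update)

print(table)  # {'10.2.0.0/16': (65001, 65003, 65002)}
```

  Sending only changes is safe because TCP delivers every update
  reliably and in order, so the two ends never silently drift apart.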

* The BGP routing table consists of prefixes followed by a set of
  attributes for each prefix. Here are some important attributes
  associated with each prefix:

- Localpref: a local variable (specific to each link) which indicates
  whether the route came from a customer, provider, or peer.

- AS path: sequence of ASes the route has traversed.

- Next hop: the IP address of the next hop BGP router for this
  route. If this route is picked as the best route, traffic will be
  sent to this next hop. For eBGP sessions, the next hop is the BGP
  peer at the other end of the BGP session. For routes learned over
  iBGP, the next hop is the border BGP router that first introduced
  the route into the network.

- Multi-exit discriminator (MED): When two ASes are connected at
  multiple points, the MED attribute is used to indicate a preference
  for which connection should be used to send packets. Consider a
  customer and provider that are connected on the east and west. For a
  prefix located in the east, the customer will expect to receive
  traffic at the eastern link, and vice versa. So it will set the MED
  attribute differently on the east and west BGP session
  announcements, so that the provider can know of this. Normally,
  every network tries to get rid of traffic as soon as possible - "hot
  potato" routing. So traffic entering the provider near the western
  link will be pushed to the customer over the western link. However, with
  MED, the customer can force the provider to carry the traffic all
  the way to the east and hand it over there, thus doing "cold potato
  routing". MED is not widely respected, especially if there is no
  money involved (for example, you needn't honor MED from peer).
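
* The east/west MED example can be sketched in Python as follows. The
  link names, MED values, and IGP costs are made-up numbers, and the
  sketch assumes (as in the route selection rules in these notes) that
  a lower MED is preferred.

```python
def choose_exit(links, honor_med):
    """Pick the link the provider uses to hand traffic to the customer."""
    if honor_med:
        # Cold potato: follow the customer's preference, then distance.
        best = min(links, key=lambda link: (link["med"], link["igp_cost"]))
    else:
        # Hot potato: get rid of the traffic at the nearest exit.
        best = min(links, key=lambda link: link["igp_cost"])
    return best["name"]


# Traffic enters the provider in the west; the destination prefix is in the east.
# The customer announces a lower MED on the eastern session for this prefix.
links = [
    {"name": "west", "med": 50, "igp_cost": 5},
    {"name": "east", "med": 10, "igp_cost": 40},
]

print(choose_exit(links, honor_med=False))  # west
print(choose_exit(links, honor_med=True))   # east
```

  With MED honored, the provider carries the packet across its own
  network (IGP cost 40 instead of 5), which is why it has little
  incentive to do so unless the customer is paying.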

* BGP neighbors exchange their current best route with their peers
  subject to the policy we discussed earlier (announce customer and
  self routes to everyone, and all best routes to customers). When
  updates arrive from neighbors, a BGP router updates its best route
  for every prefix using the following criteria in this order.

- Pick the route with the highest localpref (prefer customer > peer > provider)

- Shortest AS path preferred

- Lowest MED preferred 

- eBGP routes preferred over iBGP (if you are directly connected to
  the external network, use that instead of going via your own network)

- Lowest IGP cost path to the next hop border router

- Smallest router ID or IP address to break ties.
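
* The ordered criteria above map naturally onto a sort key. The Python
  sketch below is an illustration only: the Route fields and the
  localpref values (300 / 200 / 100) are assumptions, and it compares
  MED across all routes, whereas real routers usually compare MED only
  between routes from the same neighboring AS.

```python
from dataclasses import dataclass


@dataclass
class Route:
    as_path: tuple       # sequence of AS numbers
    local_pref: int      # e.g. customer=300, peer=200, provider=100 (assumed values)
    med: int
    via_ebgp: bool       # True if learned over an eBGP session
    igp_cost: int        # IGP cost to the next hop border router
    router_id: int       # final tie breaker


def preference_key(route):
    """Smaller key = more preferred; tuple order follows the list above."""
    return (
        -route.local_pref,             # highest localpref first
        len(route.as_path),            # then shortest AS path
        route.med,                     # then lowest MED
        0 if route.via_ebgp else 1,    # then eBGP over iBGP
        route.igp_cost,                # then closest next hop
        route.router_id,               # then smallest router ID
    )


def best_route(routes):
    return min(routes, key=preference_key)


customer = Route((65010, 65020, 65030), 300, 0, True, 10, 5)
peer = Route((65040,), 200, 0, True, 10, 1)
print(best_route([peer, customer]) is customer)  # True
```

  The customer route wins even though its AS path is longer, because
  localpref is compared before AS path length.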

* Demo: BGP routeviews project. The routeviews project is a publicly
  available BGP routing table repository. The BGP router at routeviews
  sets up BGP sessions with several BGP routers in many different
  ASes, and gets a BGP routing table dump from them. The snapshot of
  the BGP routing table at this router is periodically stored
  online. Looking through these tables gives us a view of real BGP
  routing tables at several vantage points in the Internet.

* Further reading:

- "On Inferring Autonomous System Relationships in the Internet", Lixin Gao.

- Lecture notes on BGP by Prof. Hari Balakrishnan.