BGP: Traffic Engineering and Demo
=================================

* Recap of BGP from last class, and continue discussion on BGP: Who
  needs to run BGP? Does a stub organization have to run BGP? In the
  common case, a small organization that buys internet service from an
  ISP just has a static route pointing to the ISP, and gets a small
  chunk of the ISP's IP address space. The ISP takes care of
  advertising a (usually) larger prefix that covers the IP space of
  several customers in one aggregated entry.

* However, many organizations try to have more than one ISP for
  redundancy. This is called multihoming. In this case, the
  organization usually prefers to get its own IP address block, and
  announce it over BGP to several ISPs. Goals of multihoming include
  balancing load across ISPs so that the traffic rate on each link
  stays within its contract, providing redundancy as primary / backup
  (use the second ISP only if the first is down), and using specific
  ISPs for specific kinds of traffic (for example, some ISPs or
  backbone networks charge less for traffic between academic
  institutions).

* What happens if you use IP prefixes from one ISP's address space,
  while being multihomed?  Suppose your first ISP gives you a prefix
  A/24, and the ISP itself announces only a larger A/20 prefix that
  covers it. You then announce your A/24 via your second ISP as well.
  Because routers forward on the longest matching prefix, your inbound
  traffic will always arrive via the second ISP, which is the only one
  announcing the more specific A/24.
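
The longest prefix match effect above can be sketched in a few lines of
Python. This is only an illustration: the concrete prefixes
(10.1.0.0/20 standing in for A/20, 10.1.5.0/24 for A/24) and the names
`ROUTES` and `longest_prefix_match` are assumptions made for the
example, not part of the lecture.

```python
import ipaddress

# Hypothetical routing table as seen by some remote router:
# (prefix, ISP the announcement was learned through).
ROUTES = [
    (ipaddress.ip_network("10.1.0.0/20"), "ISP1"),  # the ISP's aggregate ("A/20")
    (ipaddress.ip_network("10.1.5.0/24"), "ISP2"),  # the customer's block ("A/24")
]

def longest_prefix_match(dst, routes):
    """Return the (prefix, via) entry with the longest prefix covering dst, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, via) for net, via in routes if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda entry: entry[0].prefixlen)
```

A lookup for a host inside the /24, such as
`longest_prefix_match("10.1.5.9", ROUTES)`, selects the ISP2 entry,
while an address elsewhere in the /20 (say 10.1.6.9) still goes via
ISP1.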

* How do you do load balancing with multihoming? Announce some
  prefixes via one ISP and rest via the other. If a prefix is
  announced via two ISPs, you can load balance on outgoing traffic
  (you can set your forwarding tables to send half the traffic on one
  link, and rest on the other). However, load balancing for inbound
  traffic is hard. That is, you have no control over which link the
  traffic destined to your hosts arrives on. It depends on how other
  routers compute their best paths.
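
One common way to split outgoing traffic across two links is to hash
each flow onto a link, so that packets of one connection stay on one
path. The sketch below assumes that approach; the function name
`pick_uplink`, the link names, and the choice of SHA-256 as the hash
are illustrative assumptions, not something the notes prescribe.

```python
import hashlib

UPLINKS = ["ISP1", "ISP2"]  # hypothetical names for the two outgoing links

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto="tcp", uplinks=UPLINKS):
    """Map a flow to one uplink; the same flow always maps to the same link."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 4 bytes of the digest as an integer and reduce it to a link index.
    return uplinks[int.from_bytes(digest[:4], "big") % len(uplinks)]
```

Over many flows the split is roughly even in flow count, though not
necessarily in bytes, since individual flows differ in size. Nothing
comparable is available for the inbound direction, which is the point
made above.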

* How do you configure primary/backup links when you have
  multihoming? Suppose you have two ISPs, P and S, to serve as your
  primary and secondary ISPs. That is, you want traffic to come via S
  only if your link via P is down. If you announce the same prefix via
  P and S, you have no control on which link traffic arrives on. So
  you can do some tricks using BGP. For example, you may announce an
  aggregated prefix (say, one /24) via S, and more specific prefixes
  (say, two /25) via P. So, by default, all traffic will come via the
  P link. Only when the /25 prefixes announced via P are withdrawn,
  due to failure of the link to P, will traffic come via the /24
  through S. Another trick is AS path padding (commonly called AS path
  prepending). You can repeat your origin AS number several times on
  the path announced via S (e.g., AS1 AS2 AS3 AS3 AS3), so that it
  appears as a longer AS path and is less likely to be picked by
  default.
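
Both tricks can be illustrated with a toy route selection, sketched
below. It is a deliberate simplification: real BGP best-path selection
considers local preference and several other tie-breakers before AS
path length, so padding influences remote choices but does not dictate
them. The prefixes, the private AS numbers 65001 to 65003, and the
helper names `route` and `best_route` are all assumptions made for the
example.

```python
import ipaddress

def route(prefix, via, as_path):
    """Build a toy route record (hypothetical structure for this sketch)."""
    return {"prefix": ipaddress.ip_network(prefix), "via": via, "as_path": as_path}

def best_route(dst, routes):
    """Longest prefix wins; among equal-length prefixes, the shorter AS path wins."""
    addr = ipaddress.ip_address(dst)
    candidates = [r for r in routes if addr in r["prefix"]]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r["prefix"].prefixlen, -len(r["as_path"])))

# Trick 1: two /25s announced via P, the covering /24 announced via S.
specifics = [
    route("203.0.113.0/25", "P", [65001, 65003]),
    route("203.0.113.128/25", "P", [65001, 65003]),
    route("203.0.113.0/24", "S", [65002, 65003]),
]
# When the link via P fails, the /25s are withdrawn and only the /24 remains.
after_failure = [r for r in specifics if r["via"] != "P"]

# Trick 2: the same /24 via both ISPs, with the origin AS repeated on the S path.
prepended = [
    route("203.0.113.0/24", "P", [65001, 65003]),
    route("203.0.113.0/24", "S", [65002, 65003, 65003, 65003]),
]
```

Calling `best_route("203.0.113.200", specifics)` picks the route via P,
the same lookup against `after_failure` falls back to S, and with
`prepended` the shorter path via P is chosen.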

* Aggregating routing table entries. About half the prefixes in
  Internet routing tables are more specific versions of larger
  prefixes. So people argue that a lot of these entries can be
  aggregated to save valuable space, especially in core routing tables
  that tend to be very large. However, note that about half these
  cases of a larger and smaller prefix both appearing are due to
  multihoming. That is, the larger prefix and the smaller prefix have
  different AS paths, indicating that some kind of traffic engineering
  is being done. In such cases, if you aggregate the prefixes, then
  the desired traffic engineering cannot be achieved. BGP has a field
  to indicate if the route can be aggregated or not, which should be
  respected by other BGP routers.
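
To make the distinction concrete, the sketch below separates more
specific prefixes that share the AS path of their covering prefix
(candidates for aggregation) from those with a different AS path
(likely traffic engineering). The rule "same AS path as the closest
covering prefix means droppable" is a simplifying assumption for
illustration, and the prefixes, AS numbers, and function name are
hypothetical.

```python
import ipaddress

def classify_more_specifics(routes):
    """Split covered prefixes into (droppable, engineered) lists.

    routes: iterable of (prefix string, AS path) pairs.
    A covered prefix is 'droppable' if its closest covering prefix has the
    same AS path, and 'engineered' if the AS paths differ.
    """
    table = [(ipaddress.ip_network(p), tuple(path)) for p, path in routes]
    droppable, engineered = [], []
    for net, path in table:
        covering = [(c, cpath) for c, cpath in table if c != net and net.subnet_of(c)]
        if not covering:
            continue  # not a more specific of anything in the table
        # Compare against the closest (longest) covering prefix.
        _, parent_path = max(covering, key=lambda entry: entry[0].prefixlen)
        (droppable if path == parent_path else engineered).append(str(net))
    return droppable, engineered

ROUTES = [
    ("198.51.100.0/24", [65010, 65020]),
    ("198.51.100.0/25", [65010, 65020]),   # same path as its /24
    ("203.0.113.0/24", [65002, 65003]),
    ("203.0.113.0/25", [65001, 65003]),    # different path: multihoming trick
    ("203.0.113.128/25", [65001, 65003]),
]
```

Here `classify_more_specifics(ROUTES)` reports 198.51.100.0/25 as
droppable and both 203.0.113.x/25 prefixes as engineered. Folding the
latter into the /24 would send their traffic along the wrong path.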

* BGP routing convergence. When BGP routing information changes (e.g.,
  a new prefix or path has come up, or an existing one has gone down),
  BGP updates (announcements or withdrawals) are sent between BGP
  routers. What is the time taken from the time the change happens to
  the time that all routers know of and act according to this new
  information? This time is called the convergence time of a BGP
  update. Empirical studies have shown that this time is several tens
  of seconds. For example, researchers have inserted artificial path
  changes and observed how long it takes for these changes to get
  reflected in publicly available routing tables. This delay is around
  40-80 seconds for new routes coming up, and 80-200 seconds for old
  routes going down. It has also been observed that connectivity to
  the prefix is bad before convergence.

* What explains this high convergence time and issues with BGP? BGP
  undergoes what is called "path exploration" when a routing change
  happens. Since no router has complete knowledge of the topology, a
  few paths have to be tried before the correct routing tables are
  computed. While the AS path lets BGP avoid the routing loops and
  count to infinity problems of distance vector protocols, it does not
  have the quick convergence property of link state protocols either.

* Example: consider 3 ASes 1, 2, 3, all connected to each other and to
  a router R. Ignore the policy aspects of BGP for now. Initially, all
  ASes use their direct path to R, but also learn of the other 1-hop
  paths to R. Now suppose 1 discovers first that its link to R is
  down. Then it sends a withdrawal for its direct route, starts to use
  the route 2->R, and announces this to everyone. Now, this route will
  become invalid when 2 discovers its link to R is down and withdraws
  the route. But in the meantime, 3 discovers its link to R is down,
  and decides to use the route 3->1->2->R learned from 1. So
  messages from everyone about their direct routes (and all other 1
  hop / 2 hop / 3 hop routes) going down have to reach everyone
  else. Only then will the routing table entries converge to the new
  state of no routes to R.
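
The example can be played out with a small path vector simulation,
sketched below under simplifying assumptions that differ slightly from
the narrative: all three ASes lose their direct link to R at the same
instant, updates are delivered one at a time in FIFO order, there is no
policy and no MRAI timer, and each AS simply prefers the shortest
loop-free path it has heard. The function name and data structures are
hypothetical.

```python
from collections import deque

def simulate_withdrawal(nodes=(1, 2, 3)):
    """Full-mesh ASes that all lose their direct link to R at once.

    Returns (final best paths, number of best-path changes, updates sent).
    A path is a tuple of AS numbers ending at the AS that claims a direct
    link to R; None means "no route".
    """
    # Converged starting state: every AS has heard each neighbor's direct route.
    rib_in = {n: {m: (m,) for m in nodes if m != n} for n in nodes}
    best = {n: (n,) for n in nodes}
    queue = deque()  # in-flight updates: (sender, receiver, path or None)
    changes = 0
    messages = 0

    def reselect(n):
        """Pick the shortest path learned from neighbors; announce it if it changed."""
        nonlocal changes, messages
        candidates = [(n,) + p for p in rib_in[n].values() if p is not None]
        new_best = min(candidates, key=lambda p: (len(p), p)) if candidates else None
        if new_best != best[n]:
            best[n] = new_best
            changes += 1
            for m in rib_in[n]:
                queue.append((n, m, new_best))  # None acts as a withdrawal
                messages += 1

    for n in nodes:  # the direct links fail: everyone falls back to a neighbor
        reselect(n)
    while queue:
        sender, receiver, path = queue.popleft()
        if path is not None and receiver in path:
            path = None  # AS path loop detection: reject paths through ourselves
        rib_in[receiver][sender] = path
        reselect(receiver)
    return best, changes, messages
```

Running `best, changes, messages = simulate_withdrawal()` ends with
every AS having no route to R, but only after each AS has switched
through several stale alternatives. The number of best-path changes is
well above one per AS, which is the path exploration described above.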

* Note that the amount of path exploration grows as the connectivity
  of the AS graph increases. If the AS graph were a tree, there would be
  no exploration, as everyone would have only one path between any two
  points. However, once some extra edges are added, path exploration
  starts to happen.
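
A quick way to see this is to count the loop-free paths between two
nodes in a tree and in a fully connected graph. The sketch below does
so with a depth-first search; the two four-node graphs and the
function name are made up for illustration.

```python
def count_simple_paths(graph, src, dst, visited=None):
    """Count loop-free paths from src to dst in an undirected adjacency-list graph."""
    visited = (visited or set()) | {src}
    if src == dst:
        return 1
    return sum(
        count_simple_paths(graph, nxt, dst, visited)
        for nxt in graph[src]
        if nxt not in visited
    )

# A four-node tree: C - A - B - D.
TREE = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
# A four-node full mesh.
MESH = {n: [m for m in "ABCD" if m != n] for n in "ABCD"}

print(count_simple_paths(TREE, "C", "D"))  # 1
print(count_simple_paths(MESH, "A", "D"))  # 5
```

In the tree there is exactly one path to fall back on, so nothing can
be explored. In the mesh there are five, and each stale alternative may
be tried before the routers agree that the destination is gone.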

* In addition to the path exploration described above, routers also
  have a timer called the minimum route advertisement interval
  (MRAI). After a router sends an update for a prefix, it waits for
  the MRAI before sending the next update. This bunches together
  updates related to a single network event instead of flooding the
  network with information. The MRAI adds to the convergence delay,
  because nodes have to wait one MRAI (30 seconds is a common
  default) between successive steps in the path exploration.
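
The effect of the timer can be sketched as a small scheduling function.
It assumes a single prefix and a single neighbor, and that any changes
arriving while the timer is running are folded into one update sent
when it expires; the function name `send_times` is hypothetical.

```python
def send_times(change_times, mrai=30.0):
    """Return the times at which updates are actually sent.

    change_times: times (in seconds) at which the router's best path changes.
    The first change is announced immediately; later ones wait until at
    least `mrai` seconds after the previous update, and changes that occur
    while an update is already pending are coalesced into it.
    """
    sent = []
    for t in sorted(change_times):
        if sent and sent[-1] >= t:
            continue  # an update at or after t is already scheduled; it carries this change
        due = t if not sent else max(t, sent[-1] + mrai)
        sent.append(due)
    return sent

print(send_times([0, 1, 2, 3]))     # [0, 30.0]
print(send_times([0, 5, 40, 100]))  # [0, 30.0, 60.0, 100]
```

Four changes within three seconds produce only two updates, but the
second one leaves 30 seconds after the first. If path exploration needs
k successive rounds of updates, this per-step wait contributes roughly
k times the MRAI to the convergence time.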

* Not all updates seen in BGP are due to genuine network topology
  changes either. Sometimes a router gets overloaded and stops
  responding to BGP keepalive messages temporarily. The other BGP peer
  assumes the link is down and sends a withdrawal, followed by an
  announcement soon after. If this happens repeatedly, the route
  alternates between up and down states. This is called "route
  flapping". BGP also has a mechanism to identify such unstable
  prefixes. Prefixes that change often are assigned a penalty, and
  updates to such prefixes are suppressed for a certain duration once
  the penalty crosses a threshold. This is called route flap
  dampening.
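
A minimal sketch of the penalty mechanism follows. It assumes the usual
design in which the penalty decays exponentially over time and
suppression ends once the penalty falls below a lower "reuse"
threshold; those details, the class name `FlapDamper`, and all numeric
values are illustrative assumptions rather than figures from the
lecture.

```python
class FlapDamper:
    """Per-prefix flap penalty with exponential decay (times in seconds)."""

    def __init__(self, penalty_per_flap=1000.0, suppress_at=2000.0,
                 reuse_at=750.0, half_life=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_at = suppress_at  # start suppressing at or above this penalty
        self.reuse_at = reuse_at        # stop suppressing below this penalty
        self.half_life = half_life      # time for the penalty to halve
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Assumes calls arrive with non-decreasing timestamps.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now

    def flap(self, now):
        """Record one flap (a withdrawal or re-announcement) at time `now`."""
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_at:
            self.suppressed = True

    def is_suppressed(self, now):
        """Report whether updates for this prefix are currently suppressed."""
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_at:
            self.suppressed = False
        return self.suppressed
```

With these example values, three flaps ten seconds apart push the
penalty past the suppress threshold, and the prefix becomes usable
again only after the penalty has decayed for a while without further
flaps.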

* BGP changes also disrupt network connectivity. During path
  exploration, packets may temporarily be forwarded along incorrect
  paths. Various pathologies can occur. For example, you can have
  transient forwarding loops where packets go around in loops and get
  discarded. Various measurements have observed the correlation
  between temporary bad connectivity to a destination, and the BGP
  updates for that destination.
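
A transient forwarding loop is easy to reproduce on paper: if, during
path exploration, AS 1 believes the way to R is via AS 2 while AS 2
believes it is via AS 1, packets bounce between them. The sketch below
follows next hops and reports what happens to a packet; the next-hop
tables and the function name `trace` are hypothetical.

```python
def trace(next_hop, src, dst):
    """Follow per-node next hops from src toward dst.

    Returns (path, outcome), where outcome is "delivered", "loop", or "no route".
    """
    path = [src]
    seen = {src}
    node = src
    while node != dst:
        nxt = next_hop.get(node)
        if nxt is None:
            return path, "no route"
        path.append(nxt)
        if nxt in seen:
            return path, "loop"  # a real packet would circle until its TTL expires
        seen.add(nxt)
        node = nxt
    return path, "delivered"

# Before the failure: everyone forwards directly to R.
CONVERGED = {1: "R", 2: "R", 3: "R"}
# Mid-exploration snapshot: 1 points at 2, 2 points at 1, 3 points at 1.
TRANSIENT = {1: 2, 2: 1, 3: 1}

print(trace(CONVERGED, 3, "R"))  # ([3, 'R'], 'delivered')
print(trace(TRANSIENT, 3, "R"))  # ([3, 1, 2, 1], 'loop')
```

The loop disappears once the withdrawals have propagated, at which
point the lookup reports "no route" instead.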