Datacenter Networking
====================

Outline
- Datacenter topologies: fat trees
- Network load balancing / traffic engineering
- Application / service load balancing
================================================

* Data Centers (DCs) are one of the hot topics in the field of computer systems. What are DCs? Large companies like Google store lots of data and run various computations (e.g., computing search indices) over that data. Even small and medium sized enterprises need well-engineered DCs. The applications may be hosted on physical or virtual machines. So, one of the networking problems in DC design is: how to interconnect the various machines (i.e., how to place switches / routers) such that the performance of the applications is not bottlenecked by the network. This problem is particularly interesting for applications that transfer lots of data and are therefore network-bound.

* Typically, 20-40 servers are placed in a rack, all connected to a Top-of-Rack (ToR) switch. Several ToR switches connect to an End-of-Row (EoR) switch. The switches are further connected by as many levels as needed in a tree. At the top of the tree are core switches; below that, the layers are named aggregation, access, edge, etc., depending on how many levels exist. The topology is a mix of L2 and L3, depending on the number of servers and the scalability requirements.

* What is the issue with such an organization? Switches higher up in the tree need to carry more bandwidth and have to be more powerful / expensive. So, simple tree-based topologies are hard to scale. Two servers that are far apart and must communicate through the top layers of switches are unlikely to send/receive at their line rate, due to bottlenecks at the higher layer switches.

* An alternative to such tree-based topologies is the "fat tree" topology. Fat trees connect switches in such a way that (a) any host can communicate with any other host at full bandwidth, and (b) switches do not have to become more powerful / expensive as you go higher up in the tree. That is, all switches at all levels can be the same cheaply available commodity switches. Fat trees achieve this by providing lots of extra switches at the higher layers, and lots of interconnections and paths between switches.

* A k-ary fat tree can interconnect k^3/4 hosts using switches of k ports each. The fat tree is arranged into k pods, each containing k switches. The switches in a pod are arranged into two rows of k/2 switches each. Each of the k/2 lower switches connects to k/2 hosts, and its remaining k/2 ports connect to each of the k/2 upper layer switches in the pod. The remaining k/2 ports at each upper layer switch connect to k/2 different core switches. (A small sketch that computes these counts appears after this list.)
  - How many lower layer switches in total? k pods * k/2 per pod = k^2/2.
  - How many upper layer switches in total? Also k * k/2 = k^2/2.
  - How many core switches? Each upper layer switch needs k/2 connections to the core, so the total number of core-facing ports is k * k/2 * k/2. Since each core switch provides k ports, we need (k/2)^2 = k^2/4 core switches of k ports each.
  - Note that all switches at all levels are uniform. "Fat trees, skinny switches"!

* [Draw figure of a normal tree and a fat tree, explain the differences. Convince yourself that there are enough independent paths in the fat tree from any source to any destination to enable communication at line rate. The two references on fat trees explain this concept well pictorially.]
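* To make the arithmetic above concrete, here is a minimal Python sketch (the function name and the example value of k are mine, not from the references) that computes the component counts of a k-ary fat tree:

    def fat_tree_sizes(k):
        """Component counts for a k-ary fat tree (k must be even).

        Each pod has k/2 lower (edge) switches and k/2 upper (aggregation)
        switches; each lower switch attaches k/2 hosts; (k/2)^2 core
        switches sit on top.
        """
        assert k % 2 == 0, "k-ary fat trees are defined for even k"
        half = k // 2
        return {
            "pods": k,
            "lower_switches": k * half,    # k pods * k/2 per pod
            "upper_switches": k * half,    # k pods * k/2 per pod
            "core_switches": half * half,  # (k/2)^2
            "hosts": k ** 3 // 4,          # k pods * k/2 switches * k/2 hosts
        }

    # Example: with 48-port commodity switches, a fat tree supports 27,648 hosts.
    print(fat_tree_sizes(48))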
* Bisection bandwidth: if you bisect the network into two halves, the bandwidth between the two halves is called the bisection bandwidth. The minimum bisection bandwidth over all possible bisections is an interesting quantity: it determines the network bottleneck bandwidth when several hosts are communicating with each other. The bisection bandwidth of a fat tree does not shrink as you go up the tree (as it does for a normal tree), so hosts can communicate at line rate without any network bottlenecks.

* Now that we have a network that provides multiple paths between the leaves of the fat tree, how can we let the hosts use it? Layer 2 or layer 3 forwarding usually follows one best path to each destination, even if other paths are available. OSPF ECMP (Equal Cost Multi Path) can take advantage of multiple paths to a destination and split traffic among them. For ECMP to work, the leaves should look like separate destinations in the forwarding table, that is, we need to maintain state for k/2 * k/2 prefixes.

* Another idea to utilize multiple paths: from every leaf, tunnel traffic to a random core switch via one of several paths. Once the traffic is diffused among the core switches, there is only one path from a core switch to a given destination, and packets take that path (the core switch removes the outer IP header and forwards the packet). This is similar to the idea of "Valiant Load Balancing". How do we achieve this diffusion of traffic from leaf up to core? Several ideas exist. For example, use IP anycast: all core switches announce the same anycast address, and OSPF ECMP picks one of the several paths to the core at random. See the VL2 paper reference for a detailed explanation of this idea.

* Balancing load across the multiple paths that exist in the network is the problem of network load balancing, or traffic engineering. Traditionally, traffic engineering is done offline: estimate traffic matrices (from every leaf to every other leaf), compute optimal paths offline, and pin flows to paths using MPLS. However, with fast-changing traffic patterns, offline traffic engineering may not work.

* For online traffic engineering, simple techniques like ECMP work fine for the most part, except when there is large asymmetry in the traffic patterns, link capacities, etc. For example, when a large flow starts on one of the paths, splitting traffic equally over all paths may no longer be the best idea.

* Load balancing / traffic engineering can be done in several ways (see the CONGA reference for a detailed explanation of this classification tree):
  - centralized (a central entity fixes paths) vs. distributed (each switch works independently);
  - at the network layer (i.e., redirecting TCP flows to different paths) or at the transport layer (using multipath TCP);
  - using no knowledge of congestion (ECMP, oblivious to other flows), only local knowledge of congestion (of the next link), or global knowledge (of the entire path);
  - at the granularity of packets (leads to TCP reordering), flows (large flows may skew bandwidth), or flowlets (small bursts of a flow).

* The CONGA paper proposes logic that runs at the leaf switches of the fat tree. The leaf switches communicate with each other using a special header that stores congestion information, which all switches along the path update. This way, the leaf switches learn of congestion along the path and adjust accordingly. So, the paper proposes distributed, globally congestion-aware traffic engineering / load balancing. (A toy sketch of hash-based path selection and flowlet switching follows this list.)
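* The following toy Python sketch illustrates the two ideas above. ECMP-style hashing pins a flow to one of several equal-cost uplinks; flowlet switching (in the spirit of, but far simpler than, CONGA) re-picks the least congested uplink only when a flow has paused long enough that moving it cannot cause reordering. All names, the gap threshold, and the congestion-feedback interface are illustrative assumptions, not the papers' actual mechanisms.

    import hashlib, time

    def ecmp_pick(flow_tuple, uplinks):
        """Hash the 5-tuple onto one equal-cost uplink, so every packet of a
        flow takes the same path while different flows spread out."""
        key = "|".join(map(str, flow_tuple)).encode()
        h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        return uplinks[h % len(uplinks)]

    class FlowletBalancer:
        """Re-route a flow only at flowlet boundaries (idle gap > gap_s),
        choosing the uplink with the lowest reported congestion."""
        def __init__(self, uplinks, gap_s=0.0005):
            self.uplinks = uplinks
            self.gap_s = gap_s
            self.congestion = {u: 0.0 for u in uplinks}  # filled by feedback
            self.last_pkt = {}                           # flow -> timestamp
            self.path = {}                               # flow -> uplink

        def report_congestion(self, uplink, level):
            self.congestion[uplink] = level              # e.g., from path feedback

        def route(self, flow, now=None):
            now = time.monotonic() if now is None else now
            idle = now - self.last_pkt.get(flow, float("-inf"))
            if flow not in self.path or idle > self.gap_s:
                # New flowlet: safe to move to the least congested uplink.
                self.path[flow] = min(self.uplinks, key=self.congestion.get)
            self.last_pkt[flow] = now
            return self.path[flow]

    uplinks = ["core-1", "core-2", "core-3", "core-4"]
    flow = ("10.0.1.5", "10.0.7.9", 43512, 80, "TCP")
    print(ecmp_pick(flow, uplinks))      # congestion-oblivious choice
    lb = FlowletBalancer(uplinks)
    lb.report_congestion("core-1", 0.9)  # core-1 is busy
    print(lb.route(flow))                # first flowlet avoids core-1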
* Next, we come to the layer 2 vs. layer 3 question: what are the pros and cons of interconnecting servers at layer 2 vs. layer 3? Servers in DCs are often hosted on VMs, and operators may want to move VMs around for various reasons, so layer 3 IP addresses would have to change during migration. For this reason, layer 2 might be preferable, provided we deal with its scalability issues. The VL2 reference provides another novel idea: use two different IP addresses, one location-specific and the other application-specific, with an agent at the switches rewriting between the two, so that the end servers of a particular service always think they are in the same subnet irrespective of where they are located. A level of indirection in addressing solves this problem. We will see a better solution to this problem in later lectures when we study network virtualization.

* Now we move on to a different problem in datacenters: application-layer load balancing. Consider a web server that has advertised a certain public IP. In reality, we have not one but several servers in a server farm that appear as one server. All of these servers share one Virtual IP (VIP), and each replica has its own Direct IP (DIP). An application-layer load balancer (LB) intercepts traffic to the VIP and rewrites the destination to be one of the DIPs. Similarly, outgoing traffic from a DIP should be source-NATted back to the VIP. The LB should try to split load equally amongst all DIPs, and also ensure that a given flow gets mapped to the same DIP every time.

* Usually LBs are built in hardware because they need to process lots of traffic. However, hardware LBs are expensive. Instead, the "Ananta" reference proposes a software LB built from several software multiplexers (SMuxes), which can scale out to any number. Traffic reaches one of the SMuxes using ECMP. For each packet, the SMux uses a hash of the flow tuple to map it to a DIP, ensuring that all packets of a flow go to the same DIP. The state needed by the SMuxes (e.g., VIP-to-DIP mappings, outgoing NAT mappings) is computed by a central entity. (A toy sketch of this mapping appears at the end of these notes.)

* Understand the difference between vertical scaling (getting a more powerful load balancer) and horizontal scaling (adding more replicas to increase capacity). Also understand how central state management makes horizontal scaling using software on commodity hardware possible.

* Further reading:
  - "A Scalable, Commodity Data Center Network Architecture", Al-Fares et al. Proposes the idea of using fat tree topologies in datacenter networks.
  - "VL2: A Scalable and Flexible Data Center Network", Greenberg et al. Refines the ideas of fat tree DC networks, and incorporates ECMP / Valiant Load Balancing, virtual layer-2 addressing, etc.
  - "CONGA: Distributed Congestion-Aware Load Balancing for Datacenters", Alizadeh et al. Describes fine-grained congestion-aware load balancing switches for data centers.
  - "Ananta: Cloud Scale Load Balancing", Patel et al. Describes a solution for scalable application-layer load balancing.
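* Sketch of the Ananta-style VIP-to-DIP mapping referred to above (Python; the class, addresses, and hash choice are illustrative assumptions, not Ananta's actual implementation). Each SMux hashes the flow tuple so that every packet of a flow consistently reaches the same replica; the VIP-to-DIP table itself would be pushed down by the central controller.

    import hashlib

    class SoftwareMux:
        """Toy Ananta-style SMux: spread traffic for a VIP over its DIPs by
        hashing the flow tuple, so all packets of a flow hit the same DIP."""
        def __init__(self, vip_to_dips):
            # Mapping pushed by a central controller, e.g.
            # {"20.0.0.1": ["10.1.0.2", "10.1.0.3", "10.1.0.4"]}
            self.vip_to_dips = vip_to_dips

        def pick_dip(self, vip, flow_tuple):
            dips = self.vip_to_dips[vip]
            key = "|".join(map(str, flow_tuple)).encode()
            h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
            return dips[h % len(dips)]

    mux = SoftwareMux({"20.0.0.1": ["10.1.0.2", "10.1.0.3", "10.1.0.4"]})
    flow = ("198.51.100.7", "20.0.0.1", 52311, 443, "TCP")
    print(mux.pick_dip("20.0.0.1", flow))  # a given flow always maps to the same DIP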