Data center networking
========================

* Data Centers (DCs) are one of the hot topics in the field of computer systems. What are DCs? Large companies like Google / Apple / Amazon / Microsoft store lots of data and run various computations (e.g., computing search indices) over that data, or host public clouds. Even small/medium sized enterprises need well-engineered DCs. The applications may be hosted on physical or virtual machines. So, one of the networking problems in DC design is: how to interconnect the various machines (i.e., how to place switches / routers) such that the performance of the applications is not bottlenecked by the network. This problem is particularly interesting for applications that transfer lots of data and are therefore most likely to be limited by the network.

* Typically, 20-40 servers are placed in a rack, all connected to a Top-of-Rack (ToR) switch. Several ToR switches connect to an End-of-Row (EoR) switch. The switches are further connected in a tree with as many levels as needed. At the top of the tree are core switches. Below that, the layers are named aggregation, access, edge, etc., depending on how many levels exist. The topology is a mix of L2 and L3, depending on the number of servers and the scalability requirements.

* What are the pros and cons of interconnecting servers at layer 2 vs. layer 3? Servers in DCs are often hosted on virtual machines (VMs). The hypervisor or virtual machine manager acts as a software switch to connect the VMs to the physical network. DC operators may want to move VMs around for various reasons (e.g., failure of the underlying physical machine), and layer 3 IP addresses have to change during such a migration. For this reason, layer 2 might be preferable, provided we deal with its scalability issues. More generally, there is a need for systems that let operators specify "virtual topologies" while the network takes care of mapping them onto physical network nodes. This problem, called "Network Virtualization", is a recent topic of research.

* In network virtualization, we want to create a virtual network, with its own nodes and links, without worrying about what the physical network looks like. What is the motivation? Currently, when you create a network of VMs in a datacenter, the IP address of a VM is restricted to the IP subnet of the physical machine. The IP addresses of multiple users must be coordinated, and a VM cannot move to any physical machine outside its subnet. All these restrictions make it hard to work with VMs optimally. The key idea in network virtualization is tunnelling: create virtual networks in a separate IP space, as if no other networks exist. Then, when a packet has to move from one virtual network element to another over the physical network, simply tunnel it to the next physical node. So, the virtual network is built as an overlay over the physical network (a minimal encapsulation sketch follows this bullet).
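* To make the tunnelling idea concrete, below is a minimal Python sketch of overlay encapsulation between two hypervisors. It borrows the VXLAN idea of a UDP tunnel carrying a virtual network identifier (VNI), but the header layout, port, and addresses here are simplified assumptions for illustration, not the actual VXLAN format.

    # Minimal sketch of overlay tunnelling between two hypervisors. The overlay
    # header here is just a 4-byte virtual network ID (VNI); real encapsulations
    # such as VXLAN carry more fields. Addresses and ports below are hypothetical.
    import socket
    import struct

    TUNNEL_PORT = 4789   # assumed UDP port for the overlay endpoint

    def encapsulate(vni, inner_frame):
        """Prepend the overlay header (the VNI) to the VM's frame."""
        return struct.pack("!I", vni) + inner_frame

    def decapsulate(packet):
        """Recover the VNI and the inner frame at the receiving hypervisor."""
        (vni,) = struct.unpack("!I", packet[:4])
        return vni, packet[4:]

    def send_over_tunnel(inner_frame, vni, remote_hypervisor_ip):
        # The physical network only sees an ordinary UDP packet between the two
        # hypervisors; the virtual network's addressing lives inside the payload.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.sendto(encapsulate(vni, inner_frame),
                     (remote_hypervisor_ip, TUNNEL_PORT))

    if __name__ == "__main__":
        # A frame from a VM on virtual network 42 is tunnelled to the hypervisor
        # hosting the destination VM, regardless of either host's physical subnet.
        send_over_tunnel(b"<inner frame of the VM>", vni=42,
                         remote_hypervisor_ip="10.0.0.2")   # hypothetical address

  Because the outer packet is addressed hypervisor-to-hypervisor, the VM's own (virtual) addresses never constrain where it can be placed or migrated.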
* What are some of the other issues with such DCs? One is network bottlenecks. Switches higher up in the tree need to carry more bandwidth, and have to be more powerful / expensive. So, simple tree-based topologies are hard to scale. It is unlikely that two servers that are far apart, and whose traffic must go through the top layers of switches, will get to send/receive at their line rate, due to bottlenecks at the higher-layer switches.

* An alternative to such tree-based topologies is the "fat tree" topology. Fat trees connect switches in such a way that (a) any host can communicate with any other host at full bandwidth, and (b) switches do not have to become more powerful / expensive as you go higher up the tree. That is, all switches at all levels can be the same cheaply available commodity switches. Fat trees achieve this by providing lots of extra switches at the higher layers, and lots of interconnections and paths between switches.

* [Draw figure of a normal tree and a fat tree, and explain the differences. Convince yourself that there are enough independent paths in the fat tree from any source to any destination to enable communication at line rate.]

* Bisection bandwidth - if you bisect the network into two halves, the bandwidth between the two halves is called the bisection bandwidth. The minimum value of the bisection bandwidth over all possible bisections is an interesting quantity: it determines the network bottleneck bandwidth when several hosts are communicating with each other. The bisection bandwidth of a fat tree does not reduce as you go up the tree (as it does for a normal tree), so hosts can communicate at line rate without any network bottlenecks (see the sizing sketch below).
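* As a rough illustration of why fat trees scale, here is a back-of-the-envelope sizing sketch for the standard k-ary fat tree built from identical k-port switches; the link speed and the value of k are illustrative assumptions.

    # Sizing of a k-ary fat tree built from identical k-port commodity switches:
    # k pods, each with k/2 edge and k/2 aggregation switches, (k/2)^2 core
    # switches, and k^3/4 hosts in total. With enough redundant paths, the
    # bisection bandwidth grows with the number of hosts instead of shrinking
    # toward the root as in a plain tree.

    def fat_tree_sizing(k, link_gbps=10.0):
        assert k % 2 == 0, "k-ary fat trees use an even port count"
        hosts = k ** 3 // 4                  # (k pods) x (k/2 edge switches) x (k/2 hosts each)
        edge = aggregation = k * (k // 2)    # per pod: k/2 edge and k/2 aggregation switches
        core = (k // 2) ** 2
        # Full bisection: every host in one half can talk to a host in the other
        # half at line rate, i.e., (hosts / 2) x link rate across the cut.
        bisection_gbps = hosts / 2 * link_gbps
        return {"hosts": hosts, "edge": edge, "aggregation": aggregation,
                "core": core, "bisection_gbps": bisection_gbps}

    if __name__ == "__main__":
        # 48-port switches already support ~27k hosts at full bisection bandwidth.
        print(fat_tree_sizing(48))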
* Another problem in data centers is traffic engineering, or network load balancing: balancing load between several paths when multiple paths exist in the network. Layer 2 or layer 3 forwarding usually follows one best path for each destination, even if other paths are available and free. Traditionally, traffic engineering is done offline: estimate traffic matrices (between every pair of nodes), compute optimal paths offline, and pin flows to paths using MPLS. However, with fast-changing traffic patterns, offline traffic engineering may not work. For online traffic engineering, OSPF ECMP (Equal Cost Multi Path) can take advantage of multiple paths to a destination and split traffic among them (a small hashing sketch appears at the end of these notes). ECMP works fine for the most part, except when there is large asymmetry in the traffic patterns / link capacities etc. For example, when a large flow starts on one of the paths, continuing to split traffic equally over all paths may not be the best idea.

* Load balancing can also be centralized (done by one central node) or distributed (more complicated). Transport-layer load balancing (e.g., multipath TCP) is also possible.

* Now we move on to a different problem in datacenters: application-layer load balancing. Consider a web server that has advertised a certain public IP. In reality, we don't have one server but several servers in a server farm that appear as one. All of these servers share one Virtual IP (VIP), and each replica has its own Direct IP (DIP). An application-layer load balancer (LB) intercepts traffic to the VIP and rewrites the destination to be one of the DIPs. Similarly, outgoing traffic from a DIP should be source-NATted back to the VIP. The LB should try to split load equally amongst all DIPs, and also ensure that a given flow gets mapped to the same DIP every time (a flow-to-DIP mapping sketch appears at the end of these notes).

* Usually LBs are built in hardware because they need to process lots of traffic. However, hardware LBs are expensive. There is some recent research on building horizontally scalable software load balancers.

* Understand the difference between vertical scaling (getting a more powerful load balancer) and horizontal scaling (adding more replicas to increase capacity). The key to horizontal scaling is to make state management simple.

* This is an example of another recent trend in networking: Network Function Virtualization (NFV). Common network functions (load balancing, firewalls, etc.) that are usually implemented in hardware are now moving into software.
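* The ECMP splitting mentioned above is typically done by hashing each packet's flow identifier, so that a flow sticks to one path and its packets are not reordered. Below is a minimal Python sketch of that idea; real switches use simpler hardware hash functions, and the next-hop names here are hypothetical.

    # ECMP-style next-hop selection: hash the 5-tuple and pick one of the
    # equal-cost next hops. All packets of a flow hash to the same path (no
    # reordering); different flows usually spread across paths.
    import hashlib

    def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
        five_tuple = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
        digest = hashlib.md5(five_tuple).digest()
        index = int.from_bytes(digest[:4], "big") % len(next_hops)
        return next_hops[index]

    paths = ["core-1", "core-2", "core-3", "core-4"]   # hypothetical next hops
    print(ecmp_next_hop("10.1.1.2", "10.2.3.4", "TCP", 51000, 80, paths))
    # Limitation: two large flows can hash onto the same path while other paths
    # stay idle, which is exactly the asymmetry problem noted above.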
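* Similarly, the VIP-to-DIP decision in the application-layer load balancer can be sketched as a hash of the flow identifier, which gives both rough load spreading and flow affinity. The VIP and DIP addresses below are hypothetical; a real LB would combine this with per-flow state or consistent hashing so that adding or removing DIPs does not remap existing flows.

    # Core decision of the load balancer: map each flow arriving at the VIP to a
    # DIP such that (a) load is spread roughly evenly and (b) every packet of the
    # same flow goes to the same DIP. Addresses are hypothetical.
    import hashlib

    VIP = "20.0.0.1"
    DIPS = ["10.0.1.11", "10.0.1.12", "10.0.1.13", "10.0.1.14"]

    def choose_dip(src_ip, src_port, dst_port, proto="TCP"):
        flow_key = f"{src_ip},{src_port},{VIP},{dst_port},{proto}".encode()
        digest = int.from_bytes(hashlib.sha1(flow_key).digest()[:8], "big")
        return DIPS[digest % len(DIPS)]

    # Inbound packets to the VIP are rewritten to choose_dip(...); return traffic
    # from that DIP is source-NATted back to the VIP.
    print(choose_dip("198.51.100.7", 52344, 443))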