Linux IP networking
=========================

*** Outline

- TCP/IP processing in the Linux Kernel
- Netfilter hooks and iptables
- tun devices and raw sockets


* In this lecture, we will understand how IP networking is implemented
  in a typical end host. Most hosts implement IP in the kernel, so we
  will study IP networking in the Linux kernel (other OSes should be
  similar conceptually).

* First, assume that the routing tables are populated in the kernel
  (How is this done? We will discuss this later.) What happens when a
  packet is transmitted and received?

* Packet transmission: Application writes packet to kernel. Data
  written into the send buffer. The transport protocol (say, TCP)
  takes the data, forms a segment, adds header, called IP transmit
  function (UDP does something similar). IP adds header, looks up
  destination address for a route, finds out the interface / link to
  send it on, and placed packet in the output queue of the device. All
  this is done when you do a "write" into a socket. From here on, the
  device driver takes on. The kernel schedules the device driver to
  run at a different time. The driver adds link layer headers and
  hands over to the network hardware.

* Packet reception: When packet arrives on the physical medium, the
  device driver stores the packet in a backlog queue in the
  kernel. The kernel scheduler schedules the kernel code to handle the
  packet at a later point. When this code is invoked, the IP layer
  checks for errors and such. If the packet clear the checks, IP
  checks if the destination is the local host, or another host. If the
  packet is destined to the local host, the packet is handed off to
  TCP for its processing (process TCP data and place into receive
  buffer, update TCP state if it is a TCP ACK, etc.). If the packet is
  not destined to the local host, the IP module looks up destination
  address, updates IP headers (like TTL), and places packet in the
  output queue of the corresponding interface.

* Now, how are the routing tables populated? Linux has three types of
  routing tables. One is the neighbor table, which includes
  information on all destinations that are directly reachable (e.g.,
  nodes on the same LAN). This table has the MAC-layer addresses of
  neighbors as well, to populate the link-layer headers. This table is
  filled in by the "ARP" protocol that we will study about when we
  study the link layer. Next, we have the FIB (Forwarding Information
  Base or forwarding table), which has the next hop information for
  all destination prefixes. The most frequently used destinations from
  the FIB and placed in the "routing cache". During route lookup, the
  routing cache is checked first. If the destination is not found
  here, the FIB is consulted.

* The kernel only does forwarding. The routing protocols themselves
  can be implemented as userspace programs that can modify the
  forwarding tables in the kernel based on the messages they send and
  receive. The Linux kernel has a simple "routed" program that
  implements a simple intradomain routing protocols. Another popular
  software the "Quagga" software suite that implements several intra
  and interdomain routing protocols.

* Note that your Linux desktop doesn't forward packets by default (and
  hence doesn't act a router by default). You need to set a sysctl
  variable to enable IP forwarding and make it act as a router. Once
  you do this configuration, it can serve as a decent router for low
  rate traffic. In the next lecture, we will see how real high speed
  routers are built.

* Newer networking hardware (Network Interface Cards or NICs) provide
  some advanced functionality. One such feature is TCP segmentation
  Offload (TSO). With a TSO-enabled NIC, the kernel needn't do TCP
  segmentation. The kernel can write up to 64KB segments into the
  device driver queue. The NIC will perform TCP segmentation and
  adding headers to the individual segments in hardware for better
  performance.

* Linux also has an optional traffic control and queueing discipline,
  between the IP processing and device driver queue. By enabling and
  configuring these modules, you can implement various scheduling
  policies and traffic shapers (e.g., priority queues, token bucket
  filters etc.). You can lookup the "tc" command in Linux for more
  details.

* What if you want to modify the IP packet processing in the kernel?
  For example, you may want to implement a NAT or firewall
  functionality. One way is to modify the kernel, but that is a hard
  thing to do. Instead, newer kernels provide you with "hooks" at
  several points of the packet processing via a framework called
  "Netfilter". There are 5 netfilter hooks defined (meaning, there are
  5 places in the kernel where you can intercept packets easily):

(1) Prerouting: after packets enter the machine and pass some sanity
    checks, before any routing decision is made. After this hook, the
    destination is looked up to determine if the packet is for the
    local host or external destination.

(2) Input: this hook gives you access to incoming packets that are
    destined for the local host.

(3) Forward: this hook gives you access to incoming packets that are
    destined to another interface and are forwarded by this machine.

(4) Output: this hook gives you access to packets generated by the
    local machine before any routing decision is made.

(5) Postrouting: this hook gives you access to all packets that are
    leaving the machine (both generated locally, as well as forwarded
    from other hosts).

* To do simple packet processing using the netfilter framework, you
  can write a kernel module that intercepts the packet at one or more
  of these hooks. All packets that pass through this hook will pass
  through your kernel module code, and you can do many things with the
  packet, like accept it and pass it on to the next module, drop it,
  rewrite some headers, pass to some user space process etc. You can
  write a simple NAT or firewall as a kernel module using netfilter
  hooks.

* What if you don't want to touch the kernel at all, and want to
  remain in userspace? You can use the "iptables" framework to write a
  simple NAT or firewall using simple commands from
  userspace. "iptables" is a user space program built on top of
  netfilter hooks. You can write a set of rules/commands in iptables
  using the commandline, and these will be implemented via netfilter
  hooks for you.

* You can do three main types of things with iptables, each of which
  has a dedicated table where you can store your rules: 

(1) The "filter" table is used to store rules related to
    filtering. This is used when you want to drop packets, e.g., to
    implement a firewall. This is the default table.

(2) The "nat" table is used when you want to alter the source and
    destination IP addresses or ports to implement NAT functionality.

(3) The "mangle" table is used for any other type of packet alteration
    (besides filter and NAT).

* You can use iptables for one of the three functions above (filter /
  nat / mangle). You can write several rules in each of these three
  tables. These rules will be organized into "chains", based on which
  netfilter hook you want to invoke them. For example, the filter
  table has 3 chains: input, output, and forward. This means that you
  can write a rule to inspect and drop packets at the input, output,
  and forward netfilter hooks. Similarly, the NAT table has 3 chains:
  prerouting, postrouting, and output. 

* Note that you cannot have arbitrary combination of table and chain
  in iptables. For example, can you write a rule that tries to use NAT
  filter in the input chain? No, it doens't make sense for you to
  change source and destination IPs anymore after you have decided
  that the packet going into your application space. So not all tables
  (filter / NAT ) can be used at all Netfilter hooks (or iptables
  chains)

* So, once you select a table (filter / nat / mangle) based on what
  you want to do, and once you select a chain (based on which
  netfilter hook you want to operate at), you can add a rule that
  matches on one or more packet header fields and does some action
  (like drop, rewrite destination address, etc.). So, the iptables
  command will specify the table, chain, and rule (which includes
  pattern to match packet, and action to take). Note that all rules
  under a particular table and chain are executed in a particular
  order (you can specify the order, by default it is the order in
  which they were added.)

* You can choose a table with "-t" command (iptables -t nat ..)

* You can add your rule to one of the chains with the -A command
(iptables -t nat -A POSTROUTING ...)

* You can specify a pattern to match a subset of packets based on
  protocol (tcp/udp), source or destination IP addresses or port
  numbers, whether packet is from a new TCP connection or from
  established connection, etc.

* Some possible actions (you can specify actions with -j flag):

- Accept or Drop the packet

- Source NAT (SNAT): change the source address for NAT operation to
  some static value.

- Masquerade: like SNAT, except that source IP will be picked
  dynamically based on which interface the packet is leaving. No need
  to specify source IP.

- Destination NAT: change destination address to divert packets to
  some other machine.

- Redirect to a local machine

* For example, consider the following simple rule:

  iptables -t filter -A OUTPUT -p tcp --dport 80 -j DROP

  This rules says use the filter table, add a rule to the OUTPUT
  chain. The rule itself matches on TCP packets on destination port
  80, and drops all such packets. This is a simple (though
  meaningless) firewall rule, which when executed will prohibit HTTP
  access from your machine.

* Another example:

  iptables -t nat -A OUTPUT -p tcp --dport 80 -j DNAT --to-destination 10.129.5.191:8080

  This rule says that all outgoing HTTP packets (going to port 80 at
  some destination) must be redirected to a different destination
  (that, for example, has a proxy server running on it). Note that
  this has to be done at the OUTPUT hook and not at POSTROUTING
  hook. [Why? Because it doesn't make sense to change destination
  address at POSTROUTING after a routing decision has been made based
  on the destination address. Destination address should be changed
  before routing, so that packet can be routed using new destination
  address.]

* You can list all iptables rules using "iptables -L". You can clear
  all rules using "iptables -F". Please try out a few examples to get
  a better feel.

* Another interesting concept in the Linux kernel: tun/tap devices. A
  tun device is a simple virtual device. When you send traffic to a
  regular device, it goes out the physical medium. In contrast, when
  you divert traffic to a tun device (say, by setting up a route
  command), then the traffic will undergo regular IP processing as
  usual, and will will be handed over to a userspace program that
  attaches to the tun device. When the user space program attached to
  the tun device does a read on the socket connected to the tun
  device, it will receive IP datagrams with TCP/IP headers in tact,
  much like how it goes over the physical interface. Now, if you want
  to perform tunnelling, you can once again write this IP datagram
  obtains from tun device into a socket connected to a real physical
  device. The packet will then undergo another round on encapsulation
  with IP headers. So, read from tun device and writing to regular
  socket connected to physical device will lead to IP-in-IP
  encapsulation.

* What about writing to a tun device? Any traffic written into the tun
  device by the userspace program will go through regular IP
  processing like traffic that came from a real device. So, any write
  done on a socket connected to the tun device must write IP datagrams
  with IP headers and all (not just regular application layer
  messages). Suppose a userspace program reads data from a regular
  network device (after IP processing), and writes to tun device. Then
  IP headers will be removed twice, i.e., decapsulation of IP-in-IP
  packets.

* A "tap" device does the same for Layer 2 processing. That is, when a
  userspace program reads from tap device, it gets Layer 2 frames with
  L2 headers. Similarly, when it writes to a tap device, it must write
  with L2 headers.

* The tun device is used to implement tunnelling concepts. For
  example, suppose you have two offices using a private address space,
  and they both want to connect via the public internet. You can use a
  VPN-like application that is based on tun devices. At one of the
  locations, you can divert all traffic destined to private address
  space into a tun device. The user space program can encapsulate it
  (IP-in-IP tunneling) to the other end point, where another program
  will inject into tun device again.

* Another advanced concept in the Linux kernel: RAW sockets. When you
  open a RAW socket and write data to it, all TCP/IP kernel processing
  will be bypassed on the data. That is, you will need to write
  complete packets with all the headers and everything correctly. The
  packets will directly be put into the device queue. Similar, when
  you read from raw sockets, you will get the packets intact with all
  headers. Can you think of one application of raw sockets that you
  have all used? [tcpdump uses raw sockets to read packets with all
  headers intact from the device.] Many security applications (e.g.,
  generate badly formed TCP/IP packets to test your security solution)
  also have uses for raw sockets.

* See the references for some interesting links. 

- "How To Set Up a Firewall Using IPTables on Ubuntu 14.04"
https://www.digitalocean.com/community/tutorials/how-to-set-up-a-firewall-using-iptables-on-ubuntu-14-04 

- "Linux IP Networking"
http://www.cs.unh.edu/cnrg/people/gherrin/linux-net.html 

- "Network Address Translation"
http://www.karlrupp.net/en/computer/nat_tutorial

- "Tun/Tap interface tutorial"
http://backreference.org/2010/03/26/tuntap-interface-tutorial/