Application layer: SMTP, P2P, web services
==========================================

* Another example of a client-server application: email, which uses
  SMTP (Simple Mail Transfer Protocol). Two entities: user agents and
  mail servers. User agents (email clients like Thunderbird or Outlook)
  interface between users and mail servers. Each mail server serves a
  group of users.

* Example. Suppose A (userA@sender.com) wants to send email to B
  (userB@rx.com).
  - User agent of A sends the mail to A's mail server, say,
    mail.sender.com. (How? We will see later.)
  - Mail server of sender.com opens an SMTP connection with the mail
    server of rx.com. SMTP runs on TCP. How does it know the IP and
    port on which to open the TCP connection? It resolves the domain
    name to the target mail server's IP address using DNS; see below.
    The port number is standard for SMTP (port 25).
  - Once mail is delivered, B uses his user agent to retrieve the mail
    (how?)

* Why split functionality between user agent and mail server? Why
  can't A and B run SMTP directly between their machines? Because user
  machines are not always on, and the sender may need to retry
  delivery. So always-on mail servers manage the mailboxes of many
  users.

* SMTP has simple commands like HELO, MAIL FROM, RCPT TO, DATA, QUIT
  etc. to transfer the message. SMTP uses persistent TCP connections,
  so multiple mails can be sent over the same connection.

* Differences between HTTP and SMTP? HTTP is pull, while SMTP is push.
  HTTP has a separate response for each object, but all attachments and
  objects are sent as one mail in SMTP.

* SMTP is between mail servers. What about user agent to mail server?
  At the sender side, we can use SMTP again. That is, A's user agent
  can be the SMTP client and A's mail server the SMTP server. Note that
  A's mail server acts as SMTP server to A, and as SMTP client when
  sending to another mail server.

* Can we use SMTP at the receiver side? No, because it is a push
  protocol. For the receiver, we need a pull protocol, where the user
  agent periodically checks with the mail server whether any new mail
  has arrived. These protocols are called Mail Access Protocols.
  E.g., POP3, IMAP etc.

* Even HTTP can be used between user agents and mail servers - this is
  webmail. However, mail server to mail server communication is always
  SMTP.

-------------

* What are web services? Any service you can get using the web or the
  Internet. Browsing news is one common service people use. Maps is
  another. However, the term web service usually refers to
  machine-to-machine communication that happens over the Internet
  (beyond the human use of the web for just browsing etc). For example,
  you can view your location on a map, and provide/view real time
  traffic. Here, you are not only consuming the map information, but
  you are also populating the database at the mapping service with
  speed and traffic data from your side. The common term for what you
  can do with web services is CRUD (create, read, update, delete) any
  piece of data over the Internet. Summary: web services refer to the
  generic way of exchanging information over the Internet (usually
  excluding the easy-to-understand case of a human using the Internet).

* The traditional way to implement web services is via RPC (remote
  procedure call). You have a client that calls a certain procedure on
  a server with certain parameters. The client and server need to agree
  on the data format, APIs etc. Some web service standards are built
  along the lines of this RPC model. For example, SOAP-based web
  services (look it up if you want to know more). A SOAP web service
  client is tightly coupled with the server, and they both agree on
  data formats, function calls etc.

* Newer and simpler ways to develop web services exist today. One such
  example is REST (representational state transfer). REST uses the HTTP
  protocol (though it can run on anything that supports
  viewing/updating URIs). It uses the 4 HTTP request verbs (GET, PUT,
  POST, DELETE) for reading, updating, creating, deleting respectively.
  For example, you can use GET to get information from a database
  server, or use PUT to update information at the server. The URL/URI
  contains information on what you want to get/put. REST is stateless
  and simple, while RPC is more generic/powerful but complicated to
  use.

---------------

* So far, we have seen client-server application architectures. What
  is the limitation? Scalability / cost. Suppose a server needs to
  distribute a large file to a large number of clients. The server
  needs to be very powerful and have a high bandwidth link. Instead,
  the server can send the file to a few clients, and the clients can
  help distribute it to other clients. This is the idea behind
  peer-to-peer (P2P) architectures.

* Most popular P2P application: BitTorrent. How does BitTorrent work?
  The collection of all peers that participate in distributing a file
  is called a "torrent" or a "swarm". A file is divided into chunks
  (say 256 KB) and peers download chunks from one another. Peers that
  have a chunk also upload the chunk to other peers.

* How to locate peers in a torrent/swarm? Each torrent has a
  centralized node called the "tracker" (information about the swarm
  and tracker can be obtained by other means, say a webpage). Every
  peer informs the tracker when it joins, so the tracker knows which
  peers exist. When a new peer joins, the tracker picks a random subset
  of peers and introduces them to the new peer (i.e., provides IP
  addresses of other peers). The new peer then establishes TCP
  connections to these peers.

* Peers exchange information on who has which chunks of the file. Then
  a peer requests the chunks that it does not have from other peers.
  Peers use a "rarest first" policy. That is, they first download the
  chunk that has the least presence amongst peers. This way, rare
  chunks get replicated quickly.

* In addition to downloads, peers also have to help other peers by
  uploading. BitTorrent uses a tit-for-tat policy, where a peer uploads
  most to those peers who are supplying it data at the highest rate.
  It recomputes this set of peers every few seconds, and these peers
  are said to be "unchoked". Typically, there are 4 unchoked peers. A
  peer may also choose to optimistically unchoke a random peer. A lot
  of research has been done on the incentives of BitTorrent, whether it
  can be gamed etc.

* But the tracker is centralized and can become a bottleneck / point
  of failure. How to deal with it? We could build a distributed tracker
  that maps a file to a node/nodes that is/are responsible for it. We
  need this information to be distributed across several nodes, so that
  no one node is the bottleneck. The generalized version of such a
  distributed application is called a DHT (Distributed Hash Table).
  DHTs store (key,value) pairs. For example, in BitTorrent, the key is
  the torrent identifier (say, movie name) and the value is the list of
  IP addresses of peers. These key-value pairs are not stored in a
  central database but distributed over several nodes. A DHT keeps
  working even as nodes join/leave.

* Let's design a simple DHT. What is the key step in a DHT? We should
  be able to map keys to nodes. Once we know which nodes are
  responsible for which set of keys, then any operation like creating /
  searching / updating / deleting a key-value pair can happen by simply
  going to the node responsible for the key, and requesting the
  specific operation. So the key step is: given a key, identify the
  node.

* How do DHTs solve this problem? The basic recipe of all DHTs is as
  follows. Pick a name space (say 160-bit numbers), and map both keys
  and nodeIDs into this space. For example, key = hash(movie_name), and
  nodeID = hash(node IP address). Now, assign every key-value pair to
  its closest (or immediately succeeding) node in the 160-bit space.
  This is called "consistent hashing". For example, consider a 4-bit
  name space, and let the keys to be assigned be 2, 7, 10. Let the
  nodes have IDs 3, 6, 9. Then key 2 will be assigned to node 3, 7->9,
  and 10->3 (wrapping around the end of the name space).
  So every time you want to search / update a key, go to the node with
  the ID immediately after the key and ask for the key-value pair.

* But how do you know which node ID is close to which key? One option:
  every node maintains a list of all nodeIDs and their IP addresses.
  Or, each node learns about only a small subset of "neighbors" (e.g.,
  nodes with IDs just before and after its own), and passes the query
  on towards the responsible node. All DHTs have this basic tradeoff -
  the more neighbors you have, the faster you will be able to
  search/update a key-value pair, but the more state you must maintain.
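
* A minimal Python sketch of the key-to-node mapping above, using the
  4-bit example from these notes (keys 2, 7, 10; nodes 3, 6, 9). It is
  an illustration under stated assumptions, not how any particular DHT
  is implemented. The names hash_id, successor and ring_walk are
  hypothetical; SHA-1 is just one convenient 160-bit hash; and a key
  equal to a nodeID is assumed to belong to that node. successor()
  models a node that knows all nodeIDs; ring_walk() models nodes that
  know only their immediate successor and pass the query along.

```python
import bisect
import hashlib

BITS = 4            # toy name space from the notes; real DHTs use e.g. 160 bits
RING = 1 << BITS    # number of IDs in the name space


def hash_id(name, bits=160):
    """Hypothetical helper: map a string to an ID in a `bits`-bit space (bits <= 160)."""
    digest = hashlib.sha1(name.encode()).digest()   # 20 bytes = 160 bits
    return int.from_bytes(digest, "big") >> (160 - bits)


def successor(key, node_ids):
    """Full-knowledge lookup: first nodeID >= key, wrapping around the ring."""
    ids = sorted(node_ids)
    i = bisect.bisect_left(ids, key % RING)
    return ids[i % len(ids)]


def in_interval(x, a, b):
    """True if x lies in the ring interval (a, b], allowing wrap-around."""
    if a < b:
        return a < x <= b
    return x > a or x <= b      # interval wraps past 0 (or covers the whole ring)


def ring_walk(key, start, node_ids):
    """Successor-only lookup: each node knows just the next node on the ring.

    Returns (responsible node, number of hops the query travelled).
    """
    ids = sorted(node_ids)
    nxt = {n: ids[(i + 1) % len(ids)] for i, n in enumerate(ids)}
    node, hops = start, 0
    # Pass the query along until the key falls between a node and its successor.
    while not in_interval(key, node, nxt[node]):
        node, hops = nxt[node], hops + 1
    return nxt[node], hops + 1


nodes = [3, 6, 9]
print({k: successor(k, nodes) for k in (2, 7, 10)})   # {2: 3, 7: 9, 10: 3}
print(ring_walk(7, 3, nodes))                         # (9, 2)
print(hash_id("movie_name", BITS))                    # some ID in 0..15
```

  With a full node list, the lookup is a single step; with
  successor-only neighbors, the query can take a number of hops
  proportional to the number of nodes. This is the neighbors-versus-
  speed tradeoff mentioned above.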