Application Layer: Introduction to the Web
===========================================

*** Outline ***
- Application layer architectures: client-server vs P2P
- Socket interface: TCP vs UDP semantics
- Application types: elastic vs realtime
- WWW and HTTP
- Persistent vs non-persistent connections
- HTTP message formats, headers
- Caching, cookies
- FTP, SMTP

* The application layer consists of all the useful things we do with computer networks. Examples: the web, email, audio/video streaming over the Internet, voice/video calls over the Internet, file downloads using P2P applications, etc. The next several lectures will study these applications.

* Applications are typically structured as client-server applications. When you browse the web, your browser is the "client" and the web server is the "server". The server is always on, waiting for clients; clients approach the server when needed. Servers are maintained in data centers or server farms.

* In contrast, in P2P applications all peers are equal, and anyone can come and go at any time. Sometimes we have hybrid architectures, where a centralized entity/server facilitates P2P interaction. E.g., in BitTorrent, a "tracker" informs you of the IP addresses of peers, after which the file download happens in P2P fashion.

* For now, let's focus on client-server architectures. A popular example is the World Wide Web. In this lecture, we'll see how the web works. Two pieces: a browser/client and a web server.

* Network applications are typically user-space processes. The transport layer and below are usually implemented in the operating system kernel. The kernel exposes the "socket" API to user-space programs. Application developers use the socket interface to send and receive "messages" (e.g., the client sends the server a message requesting a web page). Think of sockets as post boxes, and application-layer messages as mail that you put into and collect from post boxes. The transport layer handles how the message is delivered, much like the postal system handles how mail is delivered.

* What services does the socket/transport layer provide? TCP provides connection-oriented, reliable, in-order delivery of a byte stream, useful for transferring files/text. UDP just delivers messages, without any reliability promises. UDP is preferred by some real-time applications, as retransmissions can add to delay. Security add-ons exist (like "Secure Sockets Layer" or SSL) with which secure applications (e.g., HTTPS) can be built. However, current transport protocols on the Internet provide no bandwidth or delay guarantees; bandwidth and delay are "best effort".

* If several sockets are open on a machine, how do we know which socket a message is destined for? (Note: the IP address takes care of delivering the message to a particular machine, but doesn't help beyond that.) Answer: port numbers, 16-bit identifiers that distinguish the open sockets on a machine. More on this later.

* Classifying applications. Elastic applications can adjust to any available bandwidth (e.g., file transfer), in contrast to bandwidth-sensitive applications (e.g., watching live HD video). Some applications can tolerate delay (e.g., email), others cannot (real-time voice/video chat, multiplayer games). Some applications are interactive (e.g., the web), others are not (e.g., video download). Some applications need reliability (e.g., file transfer), while some don't (e.g., audio calls). We will see several example applications from all these classes, and how the design of the application layer changes. The choice of transport protocol also depends on the application type, as the sketch below illustrates.
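
* A minimal sketch of the socket interface in Python, showing the TCP vs UDP semantics described above. The server name and ports are hypothetical placeholders, not from the notes:

```python
import socket

# TCP socket: connection-oriented, reliable, in-order byte stream.
# ("server.example.com", 5000) is a hypothetical server, for illustration only.
tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_sock.connect(("server.example.com", 5000))   # destination = (host/IP, 16-bit port)
tcp_sock.sendall(b"hello over a byte stream")    # bytes arrive reliably and in order
reply = tcp_sock.recv(4096)                      # read whatever the server sends back
tcp_sock.close()

# UDP socket: message-oriented, no connection setup, no reliability promises.
udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_sock.sendto(b"hello as a datagram", ("server.example.com", 5001))  # fire and forget
udp_sock.close()
```

Note how the destination is a (host, port) pair: the host/IP address picks the machine, and the port number picks the socket on that machine.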
* In the rest of the lecture, we will study the most popular application on the Internet: the World Wide Web, developed by Tim Berners-Lee in the early 1990s, the "killer app" for the Internet. The web consists of "web pages" connected by links, so you can browse a lot of information easily. File transfer, text download, and email existed before the web; the web's innovation was organizing information using hyperlinks, so that it is easy to use.

* A web page is an HTML file that contains embedded images, videos, JavaScript, etc. Web pages are hosted on web servers (e.g., Apache). Web clients or browsers request web pages by specifying a "URL" or clicking on links in other web pages. A URL has two parts: the domain name of the server (which is resolved to an IP address), and the path of the file in the server's working directory. The URL can also optionally embed some data fields to fill in forms, additional parameters that the server can look at, etc.

* The communication protocol used by web clients and servers is the Hyper Text Transfer Protocol (HTTP). Example: the user types a URL or clicks on a link. The browser resolves the domain name to an IP address, opens a TCP connection, and sends an HTTP GET "request" message to the server (specifying the page to get). The server sends an HTTP "response" saying OK, and then transfers the web page. The browser renders the HTML to display the page. This request/response message exchange is called the HTTP protocol. By adhering to this protocol, any compliant browser can communicate with any web server.

* Typically, an HTML page has multiple "embedded objects"; the browser needs to "GET" all of those as well. Browsers do all of this automatically. When there are multiple objects to GET, do browsers fetch them serially or in parallel? The answer is both. Typically, a browser opens multiple TCP connections (say, 4) and sends GET requests on these connections in parallel. Once a request finishes, it sends the next one on the same connection. Why this way? TCP connection setup has an initial cost, so we want to reuse the same connection for the next few requests ("persistent" TCP connections). Also, multiple parallel/concurrent TCP connections lead to better utilization of bandwidth.

* Walk through an example of how a web page is downloaded with/without persistent connections, with/without parallel connections.

* HTTP request format: starts with a request line, typically a GET (requesting a web page) or POST (posting some information) request, mentioning a URL and the HTTP version. This request line is followed by a variable number of "HTTP headers" specifying various things (e.g., whether persistent connections are preferred or not, browser type, preferred language, etc.). Then there is an empty line followed by an optional message body.

* An HTTP response has a response code (OK, error, etc.), followed by HTTP headers (metadata about the response such as length, last-modified, etc.), followed by the actual data requested. Common HTTP response codes: 2xx indicates success (e.g., 200 OK), 3xx indicates redirection (301 Moved Permanently), 4xx indicates a client error (400 Bad Request, 404 Not Found), and 5xx indicates a server error (503 Service Unavailable). The sketch below shows these message formats on the wire.
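
* A sketch of issuing an HTTP GET by hand over a TCP socket, to make the request and response formats concrete. The host and path are illustrative placeholders:

```python
import socket

request = (
    "GET /index.html HTTP/1.1\r\n"   # request line: method, path, HTTP version
    "Host: www.example.com\r\n"      # headers follow, one per line
    "Connection: close\r\n"          # ask the server not to keep the connection open
    "\r\n"                           # empty line ends the headers; GET has no body
)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("www.example.com", 80))
sock.sendall(request.encode("ascii"))

response = b""
while True:                          # read until the server closes the connection
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()

# The response begins with a status line (e.g., "HTTP/1.1 200 OK"), then headers,
# an empty line, and finally the requested data (the HTML body).
print(response.split(b"\r\n\r\n", 1)[0].decode("ascii", errors="replace"))
```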
* Let's take a minute to ponder the difference between the protocol used by the application (HTTP, which consists of requests and responses) and the actual data transmitted by the protocol (HTML-based web pages with various types of embedded objects). HTTP does not specify what the data should be; it just specifies the protocol for communication. Generating the application data itself is a complicated matter (e.g., how do you encode audio and video files, etc.). The HTTP protocol itself is oblivious to this part (e.g., you can transfer and view a plain text file in your browser). That is, the HTTP application-layer protocol is a small part of the "network application" ecosystem consisting of web servers, web clients, HTML-based and other types of application data, etc.

* Note that HTTP is stateless. Every request from a client is independent and self-contained. The server does not have to maintain any state, and can forget about the client after sending a response.

* If HTTP is stateless, how does Amazon or Google recognize you even when you visit their web pages after a break? Answer: using a mechanism called HTTP cookies. A cookie is an identifier sent along with HTTP requests. The first time you visit a web page, the server creates a unique ID (cookie) for you and returns it as part of the HTTP response. Every subsequent HTTP request to that site will carry the cookie, so that the server can identify you. A local cookie database exists in your browser; a bigger database exists at the server.

* HTTP caching: if several people from the same organization request the same web page, why should we transfer it over and over again? We can have an intermediate server (usually called a proxy server) that "caches" the page and serves it on subsequent requests. The proxy server parses the HTTP request and checks its cache. If the page is found in the cache, it serves it from the cache; else it requests it from the origin server. Note that a proxy server "splits" the HTTP and TCP connections into two: client-proxy and proxy-server. (Is splitting necessary?) The proxy can do other tasks as well, like content filtering, etc.

* Caching also happens in browsers. The browser checks its cache before issuing a request.

* Some HTTP headers help in implementing caching and cache validation. Most responses have a "Last-Modified" header that specifies when the page was last modified at the server. The client can do a conditional GET request, where it specifies the "last-modified" time of the cached page it has. If the page hasn't been modified at the server, the server sends a "304 Not Modified"; else it sends the latest page. The server can expire pages after a while using the "Expires" header. Content freshness can also be checked using the "ETag" header (which is a type of checksum of the page). The client sends back the ETag it has, and the server resends the page only if it has been modified. A sketch of a conditional GET follows.
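
* A sketch of a conditional GET, as a browser or proxy cache might issue it. The URL and the cached validator values (the Last-Modified date and ETag) are made-up placeholders:

```python
import http.client

conn = http.client.HTTPConnection("www.example.com")
conn.request("GET", "/index.html", headers={
    # Validators remembered from an earlier response for this page:
    "If-Modified-Since": "Mon, 01 Jan 2024 00:00:00 GMT",
    "If-None-Match": '"abc123"',     # the ETag of the cached copy
})
resp = conn.getresponse()

if resp.status == 304:               # 304 Not Modified: the cached copy is still valid
    print("serve the cached copy")
else:                                # otherwise the server sends the latest page
    body = resp.read()
    print(resp.status, resp.reason, len(body), "bytes")
conn.close()
```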
* FTP, the precursor to HTTP, is conceptually similar. You connect to an FTP server and can "get" or "put" files, after which the data transfer happens. There are some differences: (1) FTP sends control and data over separate TCP connections. (2) FTP keeps state over a session, e.g., the user's current working directory. HTTP subsumes a lot of the simple FTP functionality.

* Another example of a client-server application: email, using SMTP (Simple Mail Transfer Protocol). Two entities: user agents and mail servers. User agents (email clients like Thunderbird or Outlook) interface between users and mail servers. Each mail server handles mail for a group of users.

* Example. Suppose A (userA@sender.com) wants to send email to B (userB@rx.com).
- The user agent of A sends the mail to A's mail server, say, mail.sender.com. (How? We will see later.)
- The mail server of sender.com opens an SMTP connection with the mail server of rx.com. SMTP runs on TCP. How does it know the IP address and port on which to open the TCP connection? It resolves the domain name to the target mail server's IP address using DNS (see below); the port number is standard for SMTP.
- Once the mail is delivered, B uses his user agent to retrieve the mail. (How?)

* Why split functionality between user agent and mail server? Why can't A and B run SMTP between their machines? Because end-user machines aren't always on, delivery may need retries, etc. So mail servers manage the mailboxes of many users.

* SMTP has simple commands like HELO, MAIL FROM, RCPT TO, DATA, QUIT, etc. to transfer the message. SMTP uses persistent TCP connections, so multiple mails can be sent over the same connection. (A sketch of this command sequence follows below.)

* Differences between HTTP and SMTP? HTTP is a pull protocol, whereas SMTP is a push protocol. HTTP has a separate response for each object, whereas in SMTP all attachments and objects are sent as one mail.

* SMTP is used between mail servers. What about between the user agent and the mail server? At the sender side, we can use SMTP again. That is, A's user agent can be the SMTP client and A's mail server the SMTP server. Note that A's mail server acts as an SMTP server towards A, and as an SMTP client when sending to another mail server.

* Can we use SMTP at the receiver side? No, because it is a push protocol. For the receiver, we need a pull protocol, where the user periodically checks with his mail server whether he has any mail. These protocols are called mail access protocols, e.g., POP3, IMAP, etc.

* Even HTTP can be used between user agents and mail servers - this is webmail. However, mail server to mail server communication is always SMTP.
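
* A sketch of the SMTP command sequence above, using Python's smtplib, whose low-level methods mirror the protocol commands. The server host name (mail.rx.com) extends the made-up example above; an actual run needs a reachable SMTP server on the standard port 25:

```python
import smtplib

smtp = smtplib.SMTP("mail.rx.com", 25)          # sender.com's mail server connects to rx.com's server
smtp.helo("mail.sender.com")                    # HELO: identify the sending mail server
smtp.mail("userA@sender.com")                   # MAIL FROM: envelope sender
smtp.rcpt("userB@rx.com")                       # RCPT TO: envelope recipient
smtp.data(b"Subject: hello\r\n\r\nHi B!\r\n")   # DATA: message headers + body
smtp.quit()                                     # QUIT: end the session, close the connection
```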