Tuesday, August 21, 2007

Current Web Architecture

Introduction

This section of the Internet Tool Survey describes the current architecture of the World Wide Web (WWW). The NCSA Glossary is a useful starting point for Web terms. Another is the ILC glossary of Internet Terms.
The following sections describe

  • the basic two-tier architecture of the web in which static web pages (documents) are transferred from information servers to browser clients world-wide,

  • extensions that permit three-tiered architectures where content pages can be constructed dynamically and where programs as well as data can be transferred,

  • other information transfer protocols, and
  • related standards.



Basic Web Architecture

The basic web architecture is two-tiered and characterized by a web client that displays information content and a web server that transfers information to the client. This architecture depends on three key standards: HTML for encoding document content, URLs for naming remote information objects in a global namespace, and HTTP for staging the transfer.

  • HyperText Markup Language (HTML) - the common representation language for hypertext documents on the Web. HTML was first publicly released as HTML 0.0 in 1990, became Internet draft HTML 1.0 in 1993, and reached HTML 2.0 in 1994. The September 22, 1995 draft of the HTML 2.0 specification has been approved as a standard by the IETF Application Area HTML Working Group. HTML 3.0 and Netscape HTML are competing successors to HTML 2.0. Proposed features in HTML 3.0 include forms, style sheets, mathematical markup, and text flow around figures. For more detailed information, see the HTML Reference Manual.

    HTML is an application of the Standard Generalized Markup Language (SGML, ISO 8879), an international standard approved in 1986, which specifies a formal meta-language for defining document markup systems (more here and here). An SGML Document Type Definition (DTD) specifies valid tag names and element attributes. An HTML document consists of content delimited by hierarchically nested start and end tags. Tag names are case-insensitive, and a start tag may carry element attributes, which may be required, optional, or empty. In addition, documents can be linked to other documents or to points within themselves by establishing source and target anchor points. Many HTML documents are authored by hand or produced by converters that export HTML from word processors, but several WYSIWYG editors now support HTML styles -- see the listing at W3C and the Internet Tools Survey section on Authoring HTML.

    HTML files are viewed using a WWW client browser, the primary user interface to the Web. HTML allows for embedding of images, sounds, video streams, form fields, and simple text formatting. References to other objects, called hyperlinks, are embedded using URLs (see below). When the user selects a hyperlink, the browser takes an action based on the URL's type, e.g., retrieving a file, connecting to another Web site and displaying an HTML file stored there, or launching an application such as an e-mail or newsgroup reader.

  • Universal Resource Identifier (URI) - an IETF addressing scheme for objects in the WWW ("if it's out there, we can point at it"). There are two types of URIs: Uniform Resource Names (URNs) and Uniform Resource Locators (URLs). The current IETF URI spec is here and the URL spec is here. URLs are location dependent and contain four distinct parts: the protocol type, the machine name, the directory path, and the file name. There are several kinds of URLs: file URLs, FTP URLs, Gopher URLs, News URLs, and HTTP URLs. A URL may be relative to a directory, or may point to an offset within a document (a fragment after the # character). Arguments to CGI programs (see below) may be embedded in URLs after the ? character.

  • HyperText Transfer Protocol (HTTP) - an application-level network protocol for the WWW. Tim Berners-Lee, father of the Web, describes it as a "generic stateless object-oriented protocol." Stateless means that neither the client nor the server stores information about the state of the other side of an ongoing connection. Statelessness helps servers scale, but it is not necessarily efficient: HTTP sets up a new connection for each request, which is undesirable for applications that require sessions or transactions.


    • In HTTP, commands (request methods) can be associated with particular types of network objects (files, documents, network services). A typical HTTP interaction consists of


      • establishing a TCP/IP connection to a WWW server,

      • sending a request to the server (containing a method to be applied to a specific network object identified by the object's identifier, and the HTTP protocol version, followed by information encoded in a header style),

      • returning a response from the server to the client (consisting of three parts: a status line, a response header, and response data), and

      • closing the connection.


    • HTTP supports dynamic data representation through client-server negotiation. The requesting client specifies it can accept certain MIME content types (more on this below) and the server responds with one of these. All WWW clients can handle text/plain and text/html.

    • HTTP/1.0 Internet Draft 05 (the seventh release of HTTP/1.0) is targeted as an Internet Informational RFC. The next immediate version of HTTP is HTTP/1.1 Internet Draft 01.
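
The URL parts and the request/response exchange described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete client: it opens no network connection, the function names (split_url, build_request, parse_response) are made up for this example, and www.example.org is a placeholder host. It assumes an HTTP/1.0-style message in which the header lines end with a blank line.

```python
import posixpath
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into protocol type, machine name, directory path, file name."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return parts.scheme, parts.netloc, directory, filename


def build_request(method, path, accept=("text/html", "text/plain")):
    """Build a request: method, object identifier, protocol version, headers."""
    lines = [
        f"{method} {path} HTTP/1.0",
        # The Accept header lists the MIME types the client can handle.
        "Accept: " + ", ".join(accept),
        "",  # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw):
    """Split a raw response into status line, header dict, and response data."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


if __name__ == "__main__":
    print(split_url("http://www.example.org/docs/survey/index.html"))
    print(build_request("GET", "/docs/survey/index.html"))
```

A real client would send the bytes from build_request over a TCP connection to the server and pass whatever comes back to parse_response before closing the connection.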


Web Architecture Extensibility

This basic web architecture is evolving quickly to serve a wider variety of needs beyond static document access and browsing. The Common Gateway Interface (CGI) extends the architecture to three tiers by adding a back-end server that provides services to the Web server on behalf of the Web client, permitting dynamic composition of web pages. Helpers/plug-ins and Java/JavaScript provide other interesting Web architecture extensions.

  • Common Gateway Interface (CGI) - CGI is a standard for interfacing external programs with Web servers (see Figure 1). The server hands client requests encoded in URLs to the appropriate registered CGI program, which executes and returns results encoded as MIME messages back to the server. CGI's openness avoids the need to extend HTTP. The most common CGI applications handle input from HTML <FORM> and <ISINDEX> elements.

    • CGI programs are executable programs that run on the Web server. They can be written in any scripting language (interpreted) or programming language (compiled first) that can be executed on the Web server, including C, C++, Fortran, Perl, Tcl, Unix shells, Visual Basic, AppleScript, and others. Security precautions typically require that CGI programs be run from a specified directory (e.g., /cgi-bin) under control of the webmaster (Web system administrator); that is, they must be registered with the system.
    • Arguments to CGI programs are encoded in the URL by the client; the server passes them to the CGI program through environment variables (data submitted with the POST method arrives on the program's standard input instead). The CGI program typically returns HTML pages that it constructs on the fly.
    • Some problems with CGI are:
      • the CGI interface requires the server to start a separate program for each request, and
      • the CGI interface does not provide a way to share data and communications resources, so a program that must access an external resource has to open and close that resource each time it runs. This makes it difficult to construct transactional interactions using CGI.

    • The current version is CGI/1.1. W3C and others are experimenting with next-generation object-oriented APIs based on OMG IDL; Netscape provides the Netscape Server API (NSAPI), and Process Software and Microsoft provide the Internet Server API (ISAPI).


  • Helpers/Plug-ins - When a client browser retrieves a file, it launches an installed helper application or plug-in to process the file based on the file's MIME type (see below). For example, it may launch a PostScript or Acrobat reader, or an MPEG or QuickTime player. A helper application runs external to the browser, while a plug-in runs within the browser. For information on how to create new Netscape Navigator plug-ins, see The Plug-in Developer's Guide.

  1. Common Client Interface (CCI) - this interface allows a third-party application to remotely control the Web browser client. Netscape Client APIs 2.0 (NCAPIs) depend on platform-specific native methods of interprocess communication (IPC). Netscape plans to support DDE and OLE2 for Windows clients, X properties for UNIX clients, and Apple Events for Macintosh clients.

  2. Extensions to HTTP. W3C and IETF Application Area HTTP Working Group are working together on current and future versions of HTTP. The HTTP-NG project is assessing two implementation approaches to HTTP "replacements":


    • Spero's approach - allows many requests per connection. The requests can be asynchronous and the server can respond in any order, allowing several transfers in parallel. A "session layer" divides the connection into numerous channels. Control messages (GET requests, meta information) are returned in a control channel; each object is returned in its own channel.

    • W3C approach - Jim Gettys at W3C is using Xerox ILU (a CORBA variant) to implement an ILU transport similar to Spero's session protocol. The advantages of this approach are openness with respect to pluggable transport protocols, support for multiple language environments, and a step towards viewing the Web as a "web of objects." Related to this approach, Netscape recently announced future support for the OMG Internet Inter-ORB Protocol (IIOP) standard on both client and server. This will provide a uniform and language-neutral object interchange format, making it easier to construct distributed object applications.


  3. Java/JavaScript - Java is a cross-platform programming language for the WWW from Sun Microsystems, modeled after C++. Java programs embedded in HTML documents are called applets and are specified using <APPLET> tags. The HTML for an applet contains a code attribute that specifies the URL of the compiled applet file. Applets are compiled to a platform-independent bytecode which can be safely downloaded and executed by the Java interpreter embedded in the Web browser. Browsers that support Java are said to be Java-enabled. If performance is critical, a Java applet can be compiled to native machine language on the fly. Such a compiler is known as a Just-In-Time (JIT) compiler.

    JavaScript is a scripting language designed for creating dynamic, interactive Web applications that link together objects and resources on both clients and servers. A client JavaScript can recognize and respond to user events such as mouse clicks, form input, and page navigation, and query the state or alter the performance of an applet or plug-in. A server JavaScript script can exhibit behavior similar to common gateway interface (CGI) programs. JavaScript scripts are embedded in HTML documents using <SCRIPT> tags. Similar to Java applets, JavaScript scripts are directly interpreted within the client's browser and are therefore platform-independent. For a comparison of Java and JavaScript, see here.

    The Java Language Specification can be found here, a Java tutorial here, the Java Virtual Machine (interpreter) here, the Java Developer's Kit (JDK) here, and Java FAQs here. A comprehensive Java page of resources can be found at JPL.

    The JavaScript Language Specification can be found here, a JavaScript tutorial here, and the JavaScript FAQs here.

  4. Security - The IETF Security Area Web Transaction Security (WTS) Working Group is working on security services for the WWW. As chartered, it has produced Internet-Drafts of Requirements for Web Transaction Security, a Secure HyperText Transfer Protocol specification, and Security Extensions for HTML.
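
The CGI item above describes a program that receives arguments from the URL and returns a MIME-typed HTML page built on the fly. A minimal sketch of such a program in Python follows. It assumes the server passes the text after the ? character in the QUERY_STRING environment variable, as CGI/1.1 does for GET requests; the form field "name" and the function render_page are hypothetical and exist only for this illustration.

```python
import html
import os
from urllib.parse import parse_qs


def render_page(query_string):
    """Return a full CGI response: MIME header, blank line, then HTML."""
    fields = parse_qs(query_string)
    # "name" is a hypothetical form field chosen for this example.
    name = fields.get("name", ["world"])[0]
    # Escape user input before placing it in the generated page.
    body = f"<HTML><BODY><P>Hello, {html.escape(name)}!</P></BODY></HTML>"
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    # When run by a Web server, QUERY_STRING holds the URL arguments.
    print(render_page(os.environ.get("QUERY_STRING", "")), end="")
```

Because the server starts this program afresh for every request, anything it needs (files, database connections) must be opened and closed inside each run, which is the resource-sharing limitation noted in the CGI item.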


Other Transfer Protocols


The Web also relies on, or coexists with, other protocols for transferring and representing information, including:



  • Transmission Control Protocol/Internet Protocol (TCP/IP) - the fundamental protocol that provides for the reliable delivery of streams of data from one host to another. An introduction to TCP/IP is here.

  • File Transfer Protocol (FTP) - a common method of moving files between two Internet sites. It is based on TCP/IP.

  • Gopher - a distributed document search and retrieval protocol (IETF RFC 1436) for obtaining files or information from hierarchical menus in the Gopher information-retrieval system.

  • Internet Inter-ORB Protocol (IIOP) - an inter-ORB protocol for communication between objects and applications. It is based on the Common Object Request Broker Architecture (CORBA) specification.

  • Multipurpose Internet Mail Extensions (MIME) - the standard for multimedia email and a building block of HTTP. A content-type header sent ahead of the data identifies the type of file the server has sent, e.g., binary, audio, video, movie, formatted word-processor documents, graphics, spreadsheets, etc. The MIME extensions to the mail message format used with SMTP allow it to carry multiple types of data. When multimedia files are sent by mail using the MIME standard, they are encoded as text that is not human-readable. The Web browser maintains a list of pairs of MIME types and helper applications for handling each type.

  • Network News Transfer Protocol (NNTP) - the protocol used to connect to Usenet discussion groups.

  • Secure Sockets Layer (SSL) - a security protocol developed by Netscape for sending and receiving encrypted information. It is based on encryption technology developed by RSA, Inc.

  • Simple Mail Transfer Protocol (SMTP) - a protocol for transferring electronic mail from one host to another.

  • Simple Network Management Protocol (SNMP) - a protocol that allows a network administrator to monitor network devices over the network.

  • Z39.50 - a protocol that governs the formats and procedures by which two computers interact when one searches and retrieves records from databases held by the other. It can be used to search several databases of the same type, and is session-oriented and stateful.
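
The MIME item above mentions that a browser keeps a list pairing MIME types with helper applications. The Python sketch below shows one way such a lookup could work. The helper names in the table are placeholders rather than real program names or browser defaults, and the fallback of saving unknown types to disk is an assumption made for this example.

```python
# Hypothetical helper table; the program names are placeholders.
HELPERS = {
    "application/postscript": "postscript-viewer",
    "application/pdf": "acrobat-reader",
    "video/mpeg": "mpeg-player",
    "video/quicktime": "quicktime-player",
}

# Types this imaginary browser displays itself, without a helper.
INLINE_TYPES = {"text/html", "text/plain", "image/gif", "image/jpeg"}


def media_type(content_type_header):
    """Drop parameters such as charset and normalize to lower case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def choose_handler(content_type_header):
    """Pick what should process a response with the given Content-Type."""
    mtype = media_type(content_type_header)
    if mtype in INLINE_TYPES:
        return "browser"
    # Unknown types fall back to saving the file (an assumed policy).
    return HELPERS.get(mtype, "save-to-disk")


if __name__ == "__main__":
    for header in ("text/html; charset=ISO-8859-1", "video/mpeg"):
        print(header, "->", choose_handler(header))
```

Keeping the table as plain data mirrors the way users could add or change helper applications without modifying the browser itself.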


Other Open Standards


The Web also builds on additional open standards:



  • GIF, JPEG, and XBM image formats.

  • Virtual Reality Modeling Language (VRML) - a proposed standard language for describing multi-participant interactive simulations within the WWW.

  • HyperMedia Management Protocol (HMMP) - a protocol to access and manipulate components of the Hypermedia Management Schema (HMMS), a data representation formalism (schema) for representing managed objects. HMMS and HMMP are major components of the Web-Based Enterprise Management standards effort to integrate existing standards, such as SNMP/UDP, HTML/HTTP and DMI/RPC into a browser-managed architecture.

  • Real Time Streaming Protocol (RTSP) - a recently proposed communication protocol for control and delivery of video and audio in real-time.

  • Proxy and SOCKS firewall protocols.

  • S-HTTP security protocol.


A more complete list of standards can be found at Netscape and the World Wide Web Consortium. A complete list of Internet Engineering Task Force (IETF) standard RFCs can be found here.


This research is sponsored by the Defense Advanced Research Projects Agency and managed by the U.S. Army Research Laboratory under contract DAAL01-95-C-0112. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied of the Defense Advanced Research Projects Agency, U.S. Army Research Laboratory, or the United States Government.
