1 Networking

  • Protocols
  • Bottom layers: physical network, IP, TCP
  • Top layers: FTP, telnet, http
  • Type of connection: connection-oriented vs. packet-oriented

    1.1 The http protocol

    You can interact with a web server manually by typing telnet www.soc.napier.ac.uk 80, then typing GET / HTTP/1.0 and pressing Enter twice. In general, the format of the first line of an HTTP request is

    METHOD space Request-URI space HTTP-Version \r\n

    The request header ends with an empty line, which is why Enter is pressed twice. Methods are GET, HEAD, OPTIONS, POST, DELETE, etc. The Request-URI is usually the path and name of the document (e.g. "/" or "/somedir/index.html"). In this text-based form of the protocol, the HTTP-Version is either HTTP/1.0 or HTTP/1.1.
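
The request-line format can be illustrated with a small Perl sketch that splits such a line into its three parts. The helper name parse_request_line is hypothetical and not part of the course material; the sketch assumes a well-formed HTTP/1.x request line and does no further validation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the course material): split the first
# line of an HTTP/1.x request into method, Request-URI and version number.
sub parse_request_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                      # drop the line terminator
    my ($method, $uri, $version) =
        $line =~ m{\A([A-Z]+) (\S+) HTTP/(\d+\.\d+)\z}
        or return;                             # empty list on malformed input
    return ($method, $uri, $version);
}

my ($method, $uri, $version) =
    parse_request_line("GET /somedir/index.html HTTP/1.0\r\n");
print "method=$method uri=$uri version=$version\n";
```

A malformed line yields an empty list, so the caller can test the result before using it.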

    The HTTP request is normally generated by the user's browser. With "GET", form data is appended to the Request-URI as a query string after a "?"; the web server passes it to a CGI script in the QUERY_STRING environment variable. A typical request with "GET" is

    GET /cgi-bin/somefile.cgi?name=Mary&comments=Hello HTTP/1.0
    CONNECTION: Keep-Alive
    USER-AGENT: Mozilla/4.0 (compatible; MSIE 5.22; Mac_PowerPC)
    PRAGMA: no-cache
    HOST: www.dcs.napier.ac.uk
    ACCEPT: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    

    A typical request with "POST" is

    POST /cgi-bin/somefile.cgi HTTP/1.0
    REFERER: http://www.dcs.napier.ac.uk/cgi-bin/someform.html
    CONNECTION: Keep-Alive
    USER-AGENT: Mozilla/4.0 (compatible; MSIE 5.22; Mac_PowerPC)
    HOST: www.dcs.napier.ac.uk
    ACCEPT: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
    CONTENT-TYPE: application/x-www-form-urlencoded
    CONTENT-LENGTH: 24
    
    name=Mary&comments=Hello
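
The CONTENT-LENGTH header gives the number of bytes in the request body, and an empty line separates the header from the body. The following Perl sketch assembles a minimal POST request as a string to show how the pieces fit together. The helper name build_post_request is hypothetical, only a few headers are included, and nothing is sent over the network.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble a minimal HTTP/1.0 POST request as a string.
# Host, path and body are passed in; nothing is sent over the network.
sub build_post_request {
    my ($host, $path, $body) = @_;
    return join "\r\n",
        "POST $path HTTP/1.0",
        "Host: $host",
        "Content-Type: application/x-www-form-urlencoded",
        "Content-Length: " . length($body),   # equals the byte count for ASCII bodies
        "",                                    # empty line ends the header
        $body;
}

my $request = build_post_request("www.dcs.napier.ac.uk",
                                 "/cgi-bin/somefile.cgi",
                                 "name=Mary&comments=Hello");
print $request, "\n";
```

Computing the length from the body, rather than typing it by hand, keeps the header and the body consistent.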
    

    A typical answer from a webserver starts like this:

    HTTP/1.1 200 OK
    Date: Thu, 18 Nov 2004 15:41:04 GMT
    Server: Apache/2.0.40 (Red Hat Linux)
    Accept-Ranges: bytes
    X-Powered-By: Open SoCks 1.2.0000
    Last-Modified: Thu, 18 Nov 2004 15:41:04 GMT
    Connection: close
    Content-Type: text/html; charset=ISO-8859-1
    
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 
    "http://www.w3.org/TR/html4/loose.dtd">
    
    <html>
    

    These are different possible first lines (302 indicates a redirection):

    HTTP/1.1 200 OK
    HTTP/1.1 404 Not Found
    HTTP/1.1 302 Found
    HTTP/1.0 500 internal error
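
The first digit of the status code indicates the class of the response (2 for success, 3 for redirection, 4 for client errors, 5 for server errors). As a rough sketch, the following Perl code extracts the code from a status line and names its class. The helper name classify_status_line and the class labels are chosen here for illustration.

```perl
use strict;
use warnings;

# Class labels keyed by the first digit of the status code
# (labels chosen for illustration).
my %class_name = (
    1 => "informational",
    2 => "success",
    3 => "redirection",
    4 => "client error",
    5 => "server error",
);

# Hypothetical helper: return the status code and its class,
# or an empty list if the line is not an HTTP status line.
sub classify_status_line {
    my ($line) = @_;
    my ($code) = $line =~ m{\AHTTP/\d+\.\d+ (\d{3})\b} or return;
    my $class = $class_name{ substr($code, 0, 1) } // "unknown";
    return ($code, $class);
}

for my $line ("HTTP/1.1 200 OK", "HTTP/1.1 404 Not Found", "HTTP/1.1 302 Found") {
    my ($code, $class) = classify_status_line($line);
    print "$code: $class\n";
}
```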
    

    1.2 Exercises

    1) Find out which options are accepted by www.soc.napier.ac.uk (Hint: use * as the Request-URI.)

    1.3 Connecting to the http port with Perl

    There are different ways to download pages from the web with Perl. The IO::Socket module provides a low-level interface that shows most of the HTTP interaction openly. The LWP modules (the libwww-perl library) provide a higher-level, more user-friendly interface. It would be possible to use LWP to implement a basic command-line browser. Documentation about LWP is installed with the library and can be read with perldoc LWP.

    1.4 An example using IO::Socket

    In this script "$remote" is a variable holding a bidirectional socket handle (similar to a filehandle, which is usually unidirectional).

    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use IO::Socket;

    my $current_host = "www.napier.ac.uk";
    my $current_doc  = "/";

    # Open a TCP connection to the web server's http port.
    my $remote = IO::Socket::INET->new(
        Proto    => "tcp",
        PeerAddr => $current_host,
        PeerPort => "http(80)",
    ) or die "cannot connect to http daemon on $current_host: $@";
    $remote->autoflush(1);

    # Send the request line and the Host header; the empty line ends the header.
    print $remote "GET $current_doc HTTP/1.0\r\n";
    print $remote "Host: $current_host\r\n\r\n";

    # Print the response (header and body) line by line.
    while (my $line = <$remote>) {
        print $line;
    }
    close $remote;
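
The output of such a script consists of the status line, the header fields, an empty line, and then the document itself. The following sketch separates these parts in a response that has already been read into a string. The helper name split_response and the sample response are made up for illustration, and the sketch assumes each header field fits on one line.

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw HTTP response into the status line,
# a hash of header fields, and the body. Header names are lower-cased
# because they are case-insensitive.
sub split_response {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;   # first empty line
    my ($status_line, @fields) = split /\r?\n/, $head;
    my %header;
    for my $field (@fields) {
        my ($name, $value) = $field =~ /\A([^:]+):\s*(.*)\z/ or next;
        $header{ lc $name } = $value;
    }
    return ($status_line, \%header, $body // "");
}

# Made-up sample response for illustration.
my $raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n"
        . "Connection: close\r\n\r\n<html></html>\n";
my ($status, $header, $body) = split_response($raw);
print "status: $status\n";
print "type:   $header->{'content-type'}\n";
print "body:   $body";
```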

    1.5 Exercises

    2) Try the script using pages you know. Notice the text that is printed before the "<html>" tag (use "more" so that it doesn't scroll off the screen). What happens if no file exists at the requested URL?

    3) Print the output to a file and then view the file through your browser.

    1.6 Examples using LWP

    The simplest version for retrieving a document is

    use LWP::Simple;
    $doc = get 'http://www.napier.ac.uk/';
    print $doc;

    The user-agent version of LWP allows you to control the HTTP request header information and to use cookies, HTTPS and password-protected pages.

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua  = LWP::UserAgent->new;
    my $req = HTTP::Request->new(GET => 'http://www.napier.ac.uk/');
    $req->header('Accept' => 'text/html');
    my $res = $ua->request($req);
    if ($res->is_success) {
        print $res->content;
    } else {
        print "Error: " . $res->status_line . "\n";
    }

    1.7 Exercises

    4) Modify the LWP user agent example above so that it sends a request to one of your webpages, which contains a form. Hints: use "POST" instead of "GET". The $req->content_type('...') should be set according to the information under 1.1 above. The $req->content('...') should contain the query string.

    2 Connecting to other CGI sites

    To connect to another CGI script on the web you can submit the information that the script requires via the URL and then display the result through your CGI script. For example, type
    http://search.yahoo.com/bin/search?p=WORD
    into the URL field of your browser and replace WORD by a topic you want to retrieve.
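
When the topic comes from user input, characters such as spaces or "&" have to be percent-encoded before they are placed in the query string. The following is a minimal sketch of this step; the helper name url_encode is hypothetical, and the sketch assumes the input is plain ASCII text. The URI::Escape module from CPAN offers a ready-made uri_escape function for the same purpose.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode every character except the
# "unreserved" ones (letters, digits, "-", "_", ".", "~").
# Assumes ASCII input.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

my $topic = "perl & cgi";
my $url   = "http://search.yahoo.com/bin/search?p=" . url_encode($topic);
print "$url\n";
```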

    2.1 Exercises

    5) Incorporate this into the script from exercise 4). Let the user input a topic. Display the result without any JavaScript, images, advertisements, etc. that might be on the page.

    Note: it may be illegal to incorporate other CGI scripts into your script without asking the owner of the original script for permission. At a minimum you would have to inform users about the underlying script.

    3 Optional: Connecting to the ftp port

    In the example below, you need to fill in the address of an anonymous ftp server and your email address before you use the script. This script is only an example of how to use IO::Socket. For serious implementations, LWP or Net::FTP should be used.

    #!/usr/local/bin/perl -w
    use strict;
    use IO::Socket;

    my $current_host  = "ftp.somewhere.ac.uk";
    my $email_address = "user\@napier.ac.uk";

    # Open a TCP connection to the ftp control port.
    my $remote = IO::Socket::INET->new(
        Proto    => "tcp",
        PeerAddr => $current_host,
        PeerPort => "ftp(21)",
    ) or die "cannot connect to ftp daemon on $current_host: $@";
    $remote->autoflush(1);

    # FTP commands are terminated by CRLF.
    print $remote "USER anonymous\r\n";
    print $remote "PASS $email_address\r\n";
    print $remote "QUIT\r\n";

    # Print the server's replies.
    while (<$remote>) {
        print;
    }
    close $remote;

    4 HTML Parsing and Web Crawlers

    See the example of "searching a webpage" in week 7, number 3.