Exercises for Monday, October 4, 2004

Last class we discussed the client-server model, protocols, and the
world-wide web.  Your browser (Explorer, Firefox, Mozilla, Safari) is
an HTTP client that interacts with HTTP servers such as the Apache
web server that serves up web pages for the computer science
department.  There are other, simpler HTTP clients such as 'curl'
(for client URL) that supply more limited functionality:

% curl "http://www.cs.brown.edu/~tld/talk/home.html"

Most browsers do a good deal more than simply request and fetch web
pages; a browser is said to 'parse' HTML documents and then 'render'
them for display in a browser window.  In this exercise, we'll write
an HTML parser and simple text-based renderer.  There's a very handy
text-based browser called 'lynx' that we'll use as a model for our
code.  This first invocation causes 'lynx' to take over a terminal
window running a shell in much the same way as 'info' does when
invoked on the name of a shell command.

% lynx "http://www.cs.brown.edu/~tld/talk/home.html"

The next invocation causes 'lynx' to fetch a specified page, parse the
HTML file, remove the HTML tags and then 'dump' the result to the
standard output.  You can redirect the output to a file if you want to
save the result.  The -nolist option directs 'lynx' not to list the
hypertext links at the end of the listing as it normally would do.

% lynx -dump -nolist "http://www.cs.brown.edu/~tld/talk/home.html"

Parsing involves revealing the structure of a text.  When you parse a
sentence, something that you may have done in grade school, you
identify the subject, verb, direct object, prepositional phrases,
etc. and how they are related to one another.  In parsing an HTML
document we'll focus on identifying the HTML tags.  Technically this
is referred to as 'lexical analysis' but it is generally considered
part of the parsing process.  Its primary purpose for us is to
highlight the tags and thereby simplify subsequent processing.

We're going to use a 'sed' program to implement our parser.  A 'sed'
program is basically a list of 'sed' commands in a file.  Recall the
syntax of the 'sed' substitution command.

% echo "Damien has TA hours tonight." | sed 's/Damien/Vanessia/g'
Vanessia has TA hours tonight.

The usage 's/PATTERN/REPLACEMENT/g' is one of the most commonly used
'sed' commands; the 's' refers to substitution, the PATTERN is a
regular expression with much the same syntax as in 'grep', and the 'g'
is the 'global' option which tells 'sed' to make the substitution for
every occurrence of the pattern in a line of text.
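
To see what the 'g' option changes, here is a small sketch that runs
the same substitution with and without it.  The input string and the
variable names 'first_only' and 'every' are made up for illustration.

```shell
# Without 'g', only the first match on each line is replaced.
first_only=$(echo "TA TA TA" | sed 's/TA/RA/')

# With 'g', every match on the line is replaced.
every=$(echo "TA TA TA" | sed 's/TA/RA/g')

echo "$first_only"   # RA TA TA
echo "$every"        # RA RA RA
```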

We can run a batch of 'sed' commands at the same time by putting them
in a file and instructing 'sed' to load the file using the -f option;
'sed' then applies the commands, in the order in which they appear in
the file, to each line of the standard input.

% cat change.sed
s/Damien/Vanessia/g
s/he/she/g
s/his/her/g
% echo "Damien left town; he will miss his class." | sed -f change.sed
Vanessia left town; she will miss her class.
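
Because the commands are applied in order, a later command sees the
output of an earlier one.  The sketch below illustrates this with a
hypothetical file 'order.sed' written to a temporary directory; the
file name and the sample sentence are assumptions made only for this
example.

```shell
# 'order.sed' is a hypothetical file used only for this illustration.
dir=$(mktemp -d)
cat > "$dir/order.sed" <<'EOF'
s/cat/dog/g
s/dog/bird/g
EOF

# The first command turns "cat" into "dog"; the second command then
# rewrites every "dog", including the one the first command produced.
result=$(echo "cat and dog" | sed -f "$dir/order.sed")
echo "$result"   # bird and bird

rm -r "$dir"
```

Keeping this behavior in mind should help when one substitution in a
'sed' program produces text that a later substitution also matches.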

1. Modify the file 'change.sed' so that it correctly handles both the
   above example and the following one:

   % echo "Damien left town. He will miss his class." | sed -f change.sed
   Vanessia left town. She will miss her class.


2. Create a 'sed' program called 'viewpoint.sed' that produces the 
   following behavior.  This one is a little tricky. 

   % echo "you make me feel like I am tall." | sed -f viewpoint.sed
   I make you feel like you are tall.


In parsing an HTML file, we want to convert each HTML tag to a
canonical identifier called a 'token'.  Our parser should ignore the
case of tags (HTML 4 tag names are case-insensitive, though XHTML
requires lower case) and it will ignore most of the options and
formatting directives that are embedded in tags.  Here is a simple
command that converts the tag that signals the beginning of an HTML
document; the 'g' option isn't really needed here as we don't expect
multiple occurrences of the 'html' tag, but we will need it most of
the time and it doesn't hurt.  The replacement string adds a space
before and after the token that we use to mark the occurrence of the
'html' tag.

s/<html>/ @BEGIN_HTML@ /g

We'll need another command to handle an uppercase tag:

s/<HTML>/ @BEGIN_HTML@ /g
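
As a quick check, the two commands above can be tried on a few sample
lines.  This is only a sketch: the three input lines are invented, and
the commands are passed with the -e option rather than loaded from a
file.

```shell
# Apply both tokenizing commands to three made-up input lines.
tokens=$(printf '<html>first\n<HTML>second\n<Html>third\n' |
  sed -e 's/<html>/ @BEGIN_HTML@ /g' -e 's/<HTML>/ @BEGIN_HTML@ /g')

# The first two tags become tokens; the mixed-case tag on the third
# line matches neither pattern and is passed through unchanged.
echo "$tokens"
```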


3. How would you handle both lowercase and uppercase tags with a
   single 'sed' command?


4. That was pretty easy, but how would you manage a 'table' HTML tag 
   which takes options as in: 

   '<table align="center">'

   In this case we need a little more complicated regular expression.
   Suppose we want our parser to convert this tag into two tokens:

   ' @BEGIN_TABLE@ @CENTER@ '  

   Similarly, '<table align="left">' would be converted into 

   ' @BEGIN_TABLE@ @LEFT@ '  

   What would the 'sed' command look like?


5. Suppose we want to completely ignore the content of an anchor tag so that

   '<a href="http://www.cs.brown.edu/~tld/my_talk/home.html#name">text</a>'

   would be converted to 

   ' @BEGIN_ANCHOR@ text @END_ANCHOR@ '

   Write the sequence of 'sed' commands to implement this transformation.


6. What would happen if the '<' of a tag is on one line and the '>' is on
   another line?  How might you avoid this problem?


7. Iteration using the C-shell scripting language tends to be word
   based whereas iteration involving Unix pipes tends to be either
   line based, e.g., 'awk', 'grep', 'sed' and 'sort', or character
   based, e.g., 'tr'.

   With the exception of a 'verbatim' mode (implemented using the
   'pre' tag), HTML pretty much ignores white space.  It recognizes
   separate words but doesn't distinguish among space, tab and
   linefeed characters or strings of such characters.  All formatting
   is specified by using appropriate HTML tags.  For example, the 'p'
   tag is used to format paragraphs; <p> is used to signal the
   beginning of a paragraph and </p> is used to signal the end of a
   paragraph.  Given that spaces, tabs and linefeeds are irrelevant
   except to distinguish separate words, without losing information 
   we can convert an HTML document into a list of words using the
   backtick operator and then iterate through the list of words.

   Write a script 'format' that takes as its only argument the name of
   a file consisting of words with HTML 'p' tags and formats the
   output with an indentation of two spaces and a maximum line width
   of 72 characters.  The text alignment should be ragged right and
   there should be a blank line separating paragraphs.

   Here's the basic skeleton for your script:

   #!/bin/csh
   set max_width = 72
   set line_width = 0
   foreach word ( `cat $argv` )
     if ( "$word" == '<p>' ) then
       ...
       >> print statements <<
       ...
     else ... >> control logic << ...
       ...
       >> print statements <<
       ...
     endif
   end

   You'll need additional 'if', 'then' and 'else' keywords to
   implement the control logic.  Recall that 'printf' is more general
   than 'echo' in that it doesn't automatically add a linefeed, but
   you can always add linefeeds by explicitly inserting them into the
   format string as in 'printf "\nText\n\n"'.

   You'll want to count the length of each word to stay within the
   maximum line width.  The wordcount command 'wc' counts words (-w),
   lines (-l), characters (-m), and bytes (-c).  What do the following
   incantations accomplish: 'ls *.jpg | wc -l', 'cat file.txt | wc -w' 
   and 'grep "<p>" file.html | wc -l'?  The following line will
   come in handy in implementing your 'format' script:

   @ width += `echo $word | wc -m`

   By the way, you can give this short command a name by defining
   an alias.  In the definition of an alias, '\!^' refers to the first
   argument to the alias.

   alias length "echo \!^ | wc -m"

   Now you can refer to the alias as you would any other command and
   the name of the alias, if chosen judiciously, gives the reader a
   better idea of what's going on.

   @ width += `length $word`

   Note that in implementing word length in this way we compute the
   length of a word as the number of characters plus one.  (Why is
   this so and how might we avoid it?)  For this application, however,
   this is exactly what we want. (Why?)

   Appendix A below contains an example of 'format' input and output.
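
The word-by-word bookkeeping described in exercise 7 can be tried out
in isolation before writing 'format'.  The sketch below is written in
POSIX sh rather than csh, so the loop syntax differs from the skeleton
above; the variable 'sample' and its contents are made up, and the
script only measures words without doing any formatting.

```shell
# 'sample' is a made-up input with mixed spaces and a linefeed.
sample='<p> The csh is
a command   language </p>'

width=0
set -f                       # turn off filename expansion while splitting
for word in $sample; do      # unquoted expansion splits on spaces, tabs, linefeeds
  # Same idea as 'echo $word | wc -m': the count is the word's length plus one.
  len=$(( $(printf '%s\n' "$word" | wc -m) ))
  width=$(( width + len ))
  echo "$word $len"
done
set +f

echo "total: $width"         # total: 39
```

Note that the runs of spaces and the linefeed in 'sample' make no
difference to the list of words, which is the property the exercise
relies on.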


8. A case statement can come in handy when your flow-of-control logic
   has to deal with several alternatives.  Use 'info csh' to read
   about the syntax for using 'switch' together with 'case' and then
   reimplement 'format' using a case statement.  The basic skeleton
   for your script should look something like the following:

   #!/bin/csh
   set width = 0
   set max_width = 72
   foreach word ( `cat $argv` )
     switch ( $word )
       case '<p>':
         ...
         breaksw
       case '...':
         ...
         breaksw
       default: 
         ...
     endsw
   end

9. Now put it all together.  Write a 'sed' program that handles
   'html', 'head', 'title', 'body', 'h1' (heading), 'h2', 'h3', 'p'
   and 'a' (anchor) tags and ignores the rest. Then write a renderer
   that can format appropriate output for these tags.  Your renderer
   should work like 'format' except that it uses the tokens generated
   by your parser instead of raw HTML tags.

   If you're ambitious, I'll show you how to get input from the user
   in a shell script, and then using 'curl' you can actually write a
   simple browser.  A suitably expanded version of this exercise would
   serve as a midterm project.


Appendix A: An example of 'format' input and output

% cat paragraphs.html
<p>
The csh is a command language interpreter incorporating 
a history mechanism, job control facilities, interactive file name 
and user name completion, and a C-like syntax. 
It is used both as an interactive
login shell and a shell script command processor.
</p>
<p>
The break flag forces a halt to option processing, causing 
any further shell arguments to be treated as 
non-option arguments.  
The remaining arguments will not be interpreted as shell 
options.
This may be used to pass options to a shell script without 
confusion or possible subterfuge.
</p>
<p>
The shell repeatedly performs the following sequence of actions: 
a line of command input is read and broken into words.
This sequence of words is placed on the command history list 
and parsed.
Finally each command in the current line is executed.
</p>

% format paragraphs.html
 
  The csh is a command language interpreter incorporating a history 
  mechanism, job control facilities, interactive file name and user 
  name completion, and a C-like syntax. It is used both as an 
  interactive login shell and a shell script command processor. 

  The break flag forces a halt to option processing, causing any 
  further shell arguments to be treated as non-option arguments. The 
  remaining arguments will not be interpreted as shell options. This 
  may be used to pass options to a shell script without confusion or 
  possible subterfuge. 

  The shell repeatedly performs the following sequence of actions: a 
  line of command input is read and broken into words. This sequence of 
  words is placed on the command history list and parsed. Finally each 
  command in the current line is executed.