Exercises for Monday, October 4, 2004

Last class we discussed the client-server model, protocols, and the world-wide web. Your browser (Explorer, FireFox, Mozilla, Safari) is an HTTP client, which enables it to interact with HTTP servers such as the Apache web server that serves up web pages for the computer science department. There are other, simpler HTTP clients such as 'curl' (for copy URL) that supply more limited functionality:

% curl "http://www.cs.brown.edu/~tld/talk/home.html"

Most browsers do a good deal more than simply request and fetch web pages; a browser is said to 'parse' HTML documents and then 'render' them for display in a browser window. In this exercise, we'll write an HTML parser and a simple text-based renderer. There's a very handy text-based browser called 'lynx' that we'll use as a model for our code. This first invocation causes 'lynx' to take over a terminal window running a shell in much the same way as 'info' does when invoked on the name of a shell command.

% lynx "http://www.cs.brown.edu/~tld/talk/home.html"

The next invocation causes 'lynx' to fetch the specified page, parse the HTML file, remove the HTML tags and then 'dump' the result to the standard output. You can redirect the output to a file if you want to save the result. The -nolist option directs 'lynx' not to list the hypertext links at the end of the listing as it normally would.

% lynx -dump -nolist "http://www.cs.brown.edu/~tld/talk/home.html"

Parsing involves revealing the structure of a text. When you parse a sentence, something that you may have done in grade school, you identify the subject, verb, direct object, prepositional phrases, etc. and how they are related to one another. In parsing an HTML document we'll focus on identifying the HTML tags. Technically this is referred to as 'lexical analysis' but it is generally considered part of the parsing process. Its primary purpose for us is to highlight the tags and thereby simplify subsequent processing.
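To get a feel for what identifying the tags means, here is a small sketch that uses 'grep' rather than the 'sed' parser developed in these exercises. The file name 'fragment.html' and its contents are made up for illustration, and the -o option (print only the matching text) is an extension found in GNU and BSD 'grep', so it may not be available on every system.

```shell
# A made-up one-line HTML fragment (not taken from the course pages).
echo '<h1>Title</h1><p>Some text</p>' > fragment.html

# Print every substring that starts with '<' and runs to the next '>',
# one match per line. The -o option is a GNU/BSD grep extension.
grep -o '<[^>]*>' fragment.html
```

On this fragment the command lists the four tags '<h1>', '</h1>', '<p>' and '</p>', one per line, and leaves out the text between them. A real parser has to keep that text as well, which is why the exercises mark the tags in place instead of extracting them.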
We're going to use a 'sed' program to implement our parser. A 'sed' program is basically a list of 'sed' commands in a file. Recall the syntax of the 'sed' substitution command.

% echo "Damien has TA hours tonight." | sed 's/Damien/Vanessia/g'
Vanessia has TA hours tonight.

The usage 's/PATTERN/REPLACEMENT/g' is one of the most commonly used 'sed' commands; the 's' refers to substitution, the PATTERN is a regular expression with much the same syntax as in 'grep', and the 'g' is the 'global' option, which makes 'sed' perform the substitution for every occurrence of the pattern in a line of text. We can run a batch of 'sed' commands at the same time by putting them in a file and instructing 'sed' to load the file using the -f option; 'sed' then applies the commands, in the order in which they appear in the file, to each line of the standard input.

% cat change.sed
s/Damien/Vanessia/g
s/he/she/g
s/his/her/g
% echo "Damien left town; he will miss his class." | sed -f change.sed
Vanessia left town; she will miss her class.

1. Modify the file 'change.sed' so that it correctly handles both the above example and the following one:

% echo "Damien left town. He will miss his class." | sed -f change.sed
Vanessia left town. She will miss her class.

2. Create a 'sed' program called 'viewpoint.sed' that produces the following behavior. This one is a little tricky.

% echo "you make me feel like I am tall." | sed -f viewpoint.sed
I make you feel like you are tall.

In parsing an HTML file, we want to convert each HTML tag to a canonical identifier called a 'token'. Our parser should ignore the case of tags (HTML 4.0 treats tag names as case-insensitive, though the stricter XHTML syntax requires lower case) and it will ignore most of the options and formatting directives that are embedded in tags.
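Because 'sed' applies the commands of a program in the order in which they appear in the file, a later command sees the output of an earlier one. The following sketch illustrates this with a hypothetical program file 'order.sed' and a made-up sentence; neither comes from the exercises.

```shell
# order.sed is a hypothetical file name used only for this illustration.
cat > order.sed << 'EOF'
s/cat/dog/g
s/dog/bird/g
EOF

# The first command turns "cat" into "dog"; the second then rewrites
# every "dog", including the one the first command just produced.
echo "my cat chased a dog" | sed -f order.sed
```

The result is 'my bird chased a bird', not 'my dog chased a bird'. It is worth keeping this in mind whenever two substitutions in the same program can feed into each other.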
Here is a simple command that converts the tag that signals the beginning of an HTML document; the 'g' option isn't really needed here as we don't expect multiple occurrences of the 'html' tag, but we will need it most of the time and it doesn't hurt. The replacement string adds a space before and after the token that we use to mark the occurrence of the 'html' tag.

s/<html>/ @BEGIN_HTML@ /g

We'll need another command to handle an uppercase tag:

s/<HTML>/ @BEGIN_HTML@ /g

3. How would you handle both lowercase and uppercase tags with a single 'sed' command?

4. That was pretty easy, but how would you manage a 'table' HTML tag which takes options, as in '<table align="center">'? In this case we need a slightly more complicated regular expression. Suppose we want our parser to convert this tag into two tokens: ' @BEGIN_TABLE@ @CENTER@ '. Similarly, '<table align="left">' would be converted into ' @BEGIN_TABLE@ @LEFT@ '. What would the 'sed' command look like?

5. Suppose we want to completely ignore the content of an anchor tag so that '<a href="...">text</a>' would be converted to ' @BEGIN_ANCHOR@ text @END_ANCHOR@ '. Write the sequence of 'sed' commands to implement this transformation.

6. What would happen if the '<' of a tag is on one line and the '>' is on another line? How might you avoid this problem?

7. Iteration using the C-shell scripting language tends to be word based whereas iteration involving Unix pipes tends to be either line based, e.g., 'awk', 'grep', 'sed' and 'sort', or character based, e.g., 'tr'. With the exception of a 'verbatim' mode (implemented using the 'pre' tag), HTML pretty much ignores white space. It recognizes separate words but doesn't distinguish among space, tab and linefeed characters or strings of such characters. All formatting is specified by using appropriate HTML tags. For example, the 'p' tag is used to format paragraphs; '<p>'
is used to signal the beginning of a paragraph and '</p>' is used to signal the end of a paragraph. Given that spaces, tabs and linefeeds are irrelevant except to distinguish separate words, we can convert an HTML document into a list of words using the backtick operator, without losing information, and then iterate through the list of words. Write a script 'format' that takes as its only argument the name of a file consisting of words and HTML 'p' tags and formats the output with an indentation of two spaces and a maximum line width of 72 characters. The text alignment should be ragged right and there should be a blank line separating paragraphs. Here's the basic skeleton for your script:

#!/bin/csh
set max_width = 72
set width = 0
foreach word ( `cat $argv` )
  if ( "$word" == '<p>' ) then
    ... >> print statements << ...
  else
    ... >> control logic << ...
    ... >> print statements << ...
  endif
end

You'll need additional 'if', 'then' and 'else' keywords to implement the control logic. Recall that 'printf' is more general than 'echo' in that it doesn't automatically add a linefeed, but you can always add linefeeds by explicitly inserting them into the format string as in 'printf "\nText\n\n"'. You'll want to count the length of each word to stay within the maximum line width. The word count command 'wc' counts words (-w), lines (-l), characters (-m), and bytes (-c). What do the following incantations accomplish: 'ls *.jpg | wc -l', 'cat file.txt | wc -w' and 'grep "<p>" file.html | wc -l'?

The following command will come in handy in implementing your 'format' script:

@ width += `echo $word | wc -m`

By the way, you can give this short command a name by defining an alias. In defining an alias, '\!^' refers to the first argument to the alias.

alias length "echo \!^ | wc -m"

Now you can refer to the alias as you would any other command, and the name of the alias, if chosen judiciously, gives the reader a better idea of what's going on.

@ width += `length $word`

Note that in implementing word length in this way we compute the length of a word as the number of characters plus one. (Why is this so and how might we avoid it?) For this application, however, this is exactly what we want. (Why?) Appendix A below contains an example of 'format' input and output.

8. A case statement can come in handy when your flow-of-control logic has to deal with several alternatives. Use 'info csh' to read about the syntax for using 'switch' together with 'case' and then reimplement 'format' using a case statement. The basic skeleton for your script should look something like the following:

#!/bin/csh
set width = 0
set max_width = 72
foreach word ( `cat $argv` )
  switch ( $word )
    case '<p>':
      ...
      breaksw
    case '...':
      ...
      breaksw
    default:
      ...
  endsw
end

9. Now put it all together. Write a 'sed' program that handles 'html', 'head', 'title', 'body', 'h1' (heading), 'h2', 'h3', 'p' and 'a' (anchor) tags and ignores the rest. Then write a renderer that can format appropriate output for these tags. Your renderer should work like 'format' except that it uses the tokens generated by your parser instead of raw HTML tags. If you're ambitious, I'll show you how to get input from the user in a shell script, and then using 'curl' you can actually write a simple browser. A suitably expanded version of this exercise would serve as a midterm project.

Appendix A: An example of 'format' input and output

% cat paragraphs.html
<p>
The csh is a command language interpreter incorporating a history mechanism, job control facilities, interactive file name and user name completion, and a C-like syntax. It is used both as an interactive login shell and a shell script command processor.
</p>
<p>
The break flag forces a halt to option processing, causing any further shell arguments to be treated as non-option arguments. The remaining arguments will not be interpreted as shell options. This may be used to pass options to a shell script without confusion or possible subterfuge.
</p>
<p>
The shell repeatedly performs the following sequence of actions: a line of command input is read and broken into words. This sequence of words is placed on the command history list and parsed. Finally each command in the current line is executed.
</p>
% format paragraphs.html

  The csh is a command language interpreter incorporating a history
  mechanism, job control facilities, interactive file name and user name
  completion, and a C-like syntax. It is used both as an interactive
  login shell and a shell script command processor.

  The break flag forces a halt to option processing, causing any further
  shell arguments to be treated as non-option arguments. The remaining
  arguments will not be interpreted as shell options. This may be used
  to pass options to a shell script without confusion or possible
  subterfuge.

  The shell repeatedly performs the following sequence of actions: a
  line of command input is read and broken into words. This sequence of
  words is placed on the command history list and parsed. Finally each
  command in the current line is executed.
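
Two ingredients of the 'format' script from exercise 7 can be tried at the prompt before you write any control logic: splitting a file into words with backticks, and measuring a word with 'wc -m'. The sketch below uses a made-up file 'tiny.html' (much smaller than 'paragraphs.html') and only commands that appear in the exercises, plus 'printf' to create the file.

```shell
# tiny.html is a made-up input file, much smaller than paragraphs.html.
printf '<p>\nOne   two\tthree\n</p>\n' > tiny.html

# Backticks split the file into words, so runs of spaces, tabs and
# linefeeds all collapse into single word boundaries.
echo `cat tiny.html`

# Character count for a single word as reported by 'wc -m'.
# Some versions of 'wc' pad the number with leading spaces.
echo three | wc -m
```

The first command prints '<p> One two three </p>' on a single line, which shows that the original spacing and line breaks carry no information once the file has been split into words. The second reports 6 for the five-letter word 'three', the "plus one" mentioned in exercise 7.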