Exercises for Monday, October 4, 2004

Last class we discussed the client-server model, protocols, and the world-wide web. Your browser (Explorer, Firefox, Mozilla, Safari) is an HTTP client, which enables it to interact with HTTP servers such as the Apache web server that serves up web pages for the computer science department. There are other, simpler HTTP clients such as 'curl' (for copy URL) that supply more limited functionality:

    % curl "http://www.cs.brown.edu/~tld/talk/home.html"

Most browsers do a good deal more than simply request and fetch web pages; a browser is said to 'parse' HTML documents and then 'render' them for display in a browser window. In this exercise, we'll write an HTML parser and a simple text-based renderer. There's a very handy text-based browser called 'lynx' that we'll use as a model for our code. This first invocation causes 'lynx' to take over a terminal window running a shell in much the same way as 'info' does when invoked on the name of a shell command.

    % lynx "http://www.cs.brown.edu/~tld/talk/home.html"

The next invocation causes 'lynx' to fetch a specified page, parse the HTML file, remove the HTML tags and then 'dump' the result to the standard output. You can redirect the output to a file if you want to save the result. The -nolist option directs 'lynx' not to list the hypertext links at the end of the listing as it normally would.

    % lynx -dump -nolist "http://www.cs.brown.edu/~tld/talk/home.html"

Parsing involves revealing the structure of a text. When you parse a sentence, something that you may have done in grade school, you identify the subject, verb, direct object, prepositional phrases, etc., and how they are related to one another. In parsing an HTML document we'll focus on identifying the HTML tags. Technically this is referred to as 'lexical analysis', but it is generally considered a part of the parsing process. Its primary purpose for us is to highlight the tags and thereby simplify subsequent processing.
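To make 'highlighting the tags' a little more concrete, here is a minimal sketch. The sample line and the '[TAG]' marker are made up for illustration and are not part of the parser we'll build; the pattern simply treats anything between '<' and the next '>' as a tag.

```shell
# Hypothetical sample input; it is not taken from the page fetched above.
sample='<p>Hello <b>world</b></p>'

# Replace every tag (a '<', any run of characters other than '>', then '>')
# with a visible marker so that the tag positions stand out.
marked=$(printf '%s\n' "$sample" | sed 's/<[^>]*>/[TAG]/g')

printf '%s\n' "$marked"
# prints: [TAG]Hello [TAG]world[TAG][TAG]
```

Note that this throws away which tag was seen; a real lexical analysis would keep that information.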
We're going to use a 'sed' program to implement our parser. A 'sed' program is basically a list of 'sed' commands in a file. Recall the syntax of the 'sed' substitution command.

    % echo "Damien has TA hours tonight." | sed 's/Damien/Vanessia/g'
    Vanessia has TA hours tonight.

The usage 's/PATTERN/REPLACEMENT/g' is one of the most commonly used 'sed' commands; the 's' refers to substitution, the PATTERN is a regular expression with much the same syntax as 'grep', and the 'g' is the 'global' option, which makes 'sed' perform the substitution for every occurrence of the pattern in a line of text. We can run a batch of 'sed' commands at the same time by putting them in a file and instructing 'sed' to load the file using the -f option; 'sed' then applies the commands, in the order in which they appear in the file, to each line of the standard input.

    % cat change.sed
    s/Damien/Vanessia/g
    s/he/she/g
    s/his/her/g
    % echo "Damien left town; he will miss his class." | sed -f change.sed
    Vanessia left town; she will miss her class.

1. Modify the file 'change.sed' so that it correctly handles both the above example and the following one:

    % echo "Damien left town. He will miss his class." | sed -f change.sed
    Vanessia left town. She will miss her class.

2. Create a 'sed' program called 'viewpoint.sed' that produces the following behavior. This one is a little tricky.

    % echo "you make me feel like I am tall." | sed -f viewpoint.sed
    I make you feel like you are tall.

In parsing an HTML file, we want to convert each HTML tag to a canonical identifier called a 'token'. Our parser should ignore the case of tags (tag names are case-insensitive in HTML 4, although XHTML requires lower case) and it will ignore most of the options and formatting directives that are embedded in tags.
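Before we start writing commands for tags, here is a quick sketch of what the 'g' option described above actually changes. The sample sentence is made up for illustration.

```shell
# Hypothetical sample line containing the pattern 'he' twice.
line='he said he would call'

# Without 'g', only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/he/she/')

# With 'g', every match on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/he/she/g')

printf '%s\n%s\n' "$first_only" "$every_match"
# prints:
# she said he would call
# she said she would call
```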
Here is a simple command that converts the tag that signals the beginning of an HTML document; the 'g' option isn't really needed here, as we don't expect multiple occurrences of the 'html' tag, but we will need it most of the time and it doesn't hurt. The replacement string adds a space before and after the token that we use to mark the occurrence of the 'html' tag.

    s/<html>/ @BEGIN_HTML@ /g

We'll need another command to handle an uppercase tag:

    s/<HTML>/ @BEGIN_HTML@ /g

3. How would you handle both lowercase and uppercase tags with a single 'sed' command?

4. That was pretty easy, but how would you manage a 'table' HTML tag which takes options as in: '