Monday, September 20, 2004 Supplementary tips and questions for 'Ciphers and Secrets' http://www.cs.brown.edu/~tld/talk/home-Z-H-4.html#node_sec_2.2.2 Shell scripts with lists of commands provide one way of stringing together a sequence of computations. The different computations / commands are connected together using files and variables - the only memory that shell scripts have. Variables and files are the means whereby different computations communicate with one another. In this exercise, we'll explore an alternative method for computations to pass information back and forth namely Unix pipes which you were first introduced to in Chapter 1. As demonstrated in Chapter 2, pipes allow you to do things that you might accomplish using intermediate files, but without the clutter and overhead of saving, reading and deleting files. We'll start by looking at some commands that work particularly well with pipes. 1. Here are some commands that will come in handy in this exercise. # 'translate' all uppercase characters to lowercase characters % cat text.txt | tr "A-Z" "a-z" What does 'tr "0-9" "a-z"' do? % echo "3.14" | tr "0-9" "a-z" What about 'tr "a-z" "0-9"' do? % cat text.txt | tr "a-z" "0-9" # delete all occurrences of the numeric characters 0-9 % echo "Pi is approximately 3.14" | tr -d "0-9" What does 'tr -dc "a-z \n"' do? % cat text.txt | tr -dc "a-z \n" # 'squeeze' consecutive spaces into a single space using -s % echo "compresses long sequences of spaces" | tr -s " " What does 'tr -s " " "\n"' do? % cat text.txt | tr -s " " "\n" # eliminate repeated lines in a sorted list % echo "a b b c c c z" | tr " " "\n" | tr " " "\n" | uniq What does 'uniq -c' do? % echo "a b b c c c z" | tr " " "\n" | tr " " "\n" | uniq -c # 'explode' a file of lowercase words into characters % cat text.txt | sed 's/\([a-z ]\)/\1@/g' | tr -s "@" "\n" Explain how this script works - \(RE\) 'saves' match in \1 - Why not use sed 's/\([a-z ]\)/\1\n/g'? The -s squeezes the whitespace. # 'implode' a file - basically the inverse of 'explode' # cat text.txt | tr "\n" " " | sed 's/\([A-Za-z]\) /\1/g' | tr -s " " Explain how this script works. # 'sort' in reverse numerical order using the first field in a # file, then resolve any ties by sorting on the second field % cat text.txt | sort -k1,1nr -k2 This is the sort of special-purpose application of 'sort' that you have to figure out every time you need to use it. What does the following one-line shell script do? % paste abc.txt 123.txt where abc.txt and 123.txt are text files. Can you guess what the 'cut' command does? % paste abc.txt 123.txt | cut -f 2 OK. Now we're going to apply what we learned to the task of creating ciphers or coded messages. I want you to think in terms of sending text messages through pipes, encrypting a plain text message using a secret code at one end of the pipe, sending it through the pipe, and then decrypting it back into plain text at the other end. We're going to experiment with 'cracking' a particular class of coding schemes called 'substitution codes.' Samuel Pepys was a high-level government bureaucrat in the mid 17th Century (he was more or less equivalent to the US Secretary of the Navy). He kept a diary for ten years beginning in 1660. While often cited for its accounts of Pepys extramarital exploits, it provides a unique glimpse of 17th Century life in London. In his diary, Pepys wrote a detailed account of the great fire of London and many other important historical events that would be otherwise lost to us. He also gossiped about anyone and everyone and revealed his most intimate thoughts. It's no wonder that he took pains to render his diaries secret. 2. Here's what I did to create the mystery file. % set code = `cat alpha.txt | explode | shuffle | implode` % set file = `ls pepys/*.txt | shuffle | head -1` % cat $file | \ tr "A-Z" "a-z" | \ tr -dc "a-z \n" | \ tr "a-z" "$code" > secret.tmp This is equivalent to running the 'secret' command as follows: % secret pepys In the following steps, I create a bunch of temporary files of the form *.hst (histograms) and *.tmp. You can delete the lot of them using the 'cleanup' command. You might want to scan the 'README' file in this directory to learn a little more about the commands (shell scripts) used in this exercise. 3. Create one big file from all the diary entries in the ./pepys/ directory. % cat ./pepys/*.txt > pepys.tmp 4. Create a character histogram of the combined entries in ./pepys/. % charhist pepys.tmp > pepys.chars.hst 5. Create a character histogram of the mystery text. % charhist secret.tmp > secret.chars.hst 6. Put the histograms side-by-side for making comparisons. % paste pepys.chars.hst secret.chars.hst > compare.chars.tmp 7. Figure out which character was substituted for the space and then use 'tr' to put the spaces back in the right places. % cat secret.tmp | tr "#" "#" > secret.space.tmp This isn't quite right! Figure out what's wrong and fix it. Think about using uppercase letters to keep track of the substitutions you've made so far. 8. If you correctly identified the space character, you have a chance to exploit statistical regularities in word lengths. % wordhist pepys.tmp > pepys.words.hst % wordhist secret.space.tmp > secret.words.hst 9. As before, compare the two histograms side-by-side. % paste pepys.words.hst secret.words.hst > compare.words.tmp *. From here on out, you're on your own, but here are some suggestions. Use a combination of word and character histograms to try to guess character substitutions. Make selected substitutions using 'tr' to experiment with various decryption alternatives. There's a list of all the words in the 2nd Edition of the Webster Dictionary in /usr/share/dict/. You can use 'egrep' to find words that match a specified pattern (called a regular expression). For example, the invocation % egrep "\" /usr/share/dict/words thrail thrall thrill thripel thronal matches all words in the dictionary that begin with 'thr' followed by a vowel, followed by at least one but not more than two letters, followed by an 'l'. It's tricky making individual substitutions; if you replace each occurrence of 's' with 'g' then you'll confuse an inserted 'g' with a 'g' that was already present in the encrypted text. Figure out a way of handling substitutions that won't confuse things. First person who decodes the file mystery.txt gets a perfect grade on their choice of take-home quizzes. Expanding on this exercise will be one of the projects you can select for your midterm project. There is a lot you can do to build better code-cracking tools. Simple substitution codes are particularly vulnerable to statistical techniques. Think of some of the many regularities in English words that you might exploit. Look for common sequences of words like 'qu', 'ing', 'ed', and 'ie', common words like 'the', 'a', 'I', and 'you'. Sometimes you want to easily test a hypothesis, e.g., do the letters 'jy' map to the letters 'ed'. You might have several such hypotheses. How would you choose between them? How would you write a script to automate simple hypothesis generation and evaluation? You could rank substitutions by the number of words that they reveal in the text. How could you turn this general rule of thumb into a procedure for evaluating substitutions?