Monday, September 20, 2004

Supplementary tips and questions for 'Ciphers and Secrets' 

http://www.cs.brown.edu/~tld/talk/home-Z-H-4.html#node_sec_2.2.2

Shell scripts with lists of commands provide one way of stringing
together a sequence of computations.  The different computations /
commands are connected together using files and variables - the only
memory that shell scripts have.  Variables and files are the means
whereby different computations communicate with one another.  In this
exercise, we'll explore an alternative method for computations to pass
information back and forth namely Unix pipes which you were first
introduced to in Chapter 1.  As demonstrated in Chapter 2, pipes allow
you to do things that you might accomplish using intermediate files,
but without the clutter and overhead of saving, reading and deleting
files.  We'll start by looking at some commands that work particularly
well with pipes.


1. Here are some commands that will come in handy in this exercise.

   # 'translate' all uppercase characters to lowercase characters
   % cat text.txt | tr "A-Z" "a-z" 

   What does 'tr "0-9" "a-z"' do? 
   % echo "3.14" | tr "0-9" "a-z"

   What about 'tr "a-z" "0-9"' do? 
   % cat text.txt | tr "a-z" "0-9" 

   # delete all occurrences of the numeric characters 0-9
   % echo "Pi is approximately 3.14" | tr -d "0-9" 

   What does 'tr -dc "a-z \n"' do? 
   % cat text.txt | tr -dc "a-z \n"
 
   # 'squeeze' consecutive spaces into a single space using -s
   % echo "compresses long   sequences  of  spaces" | tr -s " " 

   What does 'tr -s " " "\n"' do?
   % cat text.txt | tr -s " " "\n"

   # eliminate repeated lines in a sorted list
   % echo "a b b c c c z" | tr " " "\n" | tr " " "\n" | uniq 

   What does 'uniq -c' do?
   % echo "a b b c c c z" | tr " " "\n" | tr " " "\n" | uniq -c 

   # 'explode' a file of lowercase words into characters
   % cat text.txt | sed 's/\([a-z ]\)/\1@/g' | tr -s "@" "\n"

   Explain how this script works - \(RE\) 'saves' match in \1 - Why
   not use sed 's/\([a-z ]\)/\1\n/g'?  The -s squeezes the whitespace.

   # 'implode' a file - basically the inverse of 'explode'
   # cat text.txt | tr "\n" " " | sed 's/\([A-Za-z]\) /\1/g' | tr -s " "

   Explain how this script works. 

   # 'sort' in reverse numerical order using the first field in a
   # file, then resolve any ties by sorting on the second field
   % cat text.txt | sort -k1,1nr -k2

   This is the sort of special-purpose application of 'sort' 
   that you have to figure out every time you need to use it.

   What does the following one-line shell script do? 

   % paste abc.txt 123.txt

   where abc.txt and 123.txt are text files.  Can you guess
   what the 'cut' command does? 

   % paste abc.txt 123.txt | cut -f 2

OK. Now we're going to apply what we learned to the task of creating
ciphers or coded messages.  I want you to think in terms of sending
text messages through pipes, encrypting a plain text message using a
secret code at one end of the pipe, sending it through the pipe, and
then decrypting it back into plain text at the other end.  We're going
to experiment with 'cracking' a particular class of coding schemes
called 'substitution codes.'

Samuel Pepys was a high-level government bureaucrat in the mid 17th
Century (he was more or less equivalent to the US Secretary of the
Navy).  He kept a diary for ten years beginning in 1660.  While often
cited for its accounts of Pepys extramarital exploits, it provides a
unique glimpse of 17th Century life in London.  In his diary, Pepys
wrote a detailed account of the great fire of London and many other
important historical events that would be otherwise lost to us. He
also gossiped about anyone and everyone and revealed his most intimate
thoughts.  It's no wonder that he took pains to render his diaries
secret.

2. Here's what I did to create the mystery file. 

   % set code = `cat alpha.txt | explode | shuffle | implode`
   % set file = `ls pepys/*.txt | shuffle | head -1`
   % cat $file | \
     tr "A-Z" "a-z" | \
     tr -dc "a-z \n" | \
     tr "a-z" "$code" > secret.tmp

   This is equivalent to running the 'secret' command as follows:

   % secret pepys

   In the following steps, I create a bunch of temporary files of the
   form *.hst (histograms) and *.tmp.  You can delete the lot of them
   using the 'cleanup' command.

   You might want to scan the 'README' file in this directory to learn 
   a little more about the commands (shell scripts) used in this exercise.

3. Create one big file from all the diary entries in the ./pepys/ directory.

   % cat ./pepys/*.txt > pepys.tmp


4. Create a character histogram of the combined entries in ./pepys/.

   % charhist pepys.tmp > pepys.chars.hst
 

5. Create a character histogram of the mystery text.

   % charhist secret.tmp > secret.chars.hst


6. Put the histograms side-by-side for making comparisons.

   % paste pepys.chars.hst secret.chars.hst > compare.chars.tmp


7. Figure out which character was substituted for the space and
   then use 'tr' to put the spaces back in the right places. 

   % cat secret.tmp | tr "#" "#" > secret.space.tmp

   This isn't quite right!  Figure out what's wrong and fix it.
   Think about using uppercase letters to keep track of the
   substitutions you've made so far.


8. If you correctly identified the space character, you have a 
   chance to exploit statistical regularities in word lengths. 

   % wordhist pepys.tmp > pepys.words.hst 

   % wordhist secret.space.tmp > secret.words.hst


9. As before, compare the two histograms side-by-side.

   % paste pepys.words.hst secret.words.hst > compare.words.tmp


*. From here on out, you're on your own, but here are some suggestions.

   Use a combination of word and character histograms to try to guess
   character substitutions.  Make selected substitutions using 'tr' to
   experiment with various decryption alternatives. 

   There's a list of all the words in the 2nd Edition of the Webster
   Dictionary in /usr/share/dict/. You can use 'egrep' to find words
   that match a specified pattern (called a regular expression).  For
   example, the invocation

   % egrep "\<thr[aeiou][a-z]{1,2}l\>" /usr/share/dict/words
   thrail
   thrall
   thrill
   thripel
   thronal

   matches all words in the dictionary that begin with 'thr' followed
   by a vowel, followed by at least one but not more than two letters,
   followed by an 'l'. 

   It's tricky making individual substitutions; if you replace each
   occurrence of 's' with 'g' then you'll confuse an inserted 'g' with
   a 'g' that was already present in the encrypted text.  Figure out a
   way of handling substitutions that won't confuse things.

   First person who decodes the file mystery.txt gets a perfect grade 
   on their choice of take-home quizzes.


Expanding on this exercise will be one of the projects you can select
for your midterm project. There is a lot you can do to build better
code-cracking tools.  Simple substitution codes are particularly
vulnerable to statistical techniques.  Think of some of the many
regularities in English words that you might exploit.  Look for common
sequences of words like 'qu', 'ing', 'ed', and 'ie', common words like
'the', 'a', 'I', and 'you'.  Sometimes you want to easily test a
hypothesis, e.g., do the letters 'jy' map to the letters 'ed'.  You
might have several such hypotheses.  How would you choose between
them?  How would you write a script to automate simple hypothesis
generation and evaluation?  You could rank substitutions by the number
of words that they reveal in the text.  How could you turn this
general rule of thumb into a procedure for evaluating substitutions?