| Authors: | David McClosky, Micha Elsner |
|---|---|
| Emails: | {dmcc+hogwash,melsner+hogwash}@cs.brown.edu |
| Date: | 10.24.2006 |
| Finished: | Not yet |
| Description: | Introduction to running parallel programs with Hogwash via the simpler command-line and more complex Python interfaces |
Hogwash provides access to clustering systems so that you can run jobs in parallel. It abstracts the details of the various clustering systems available, letting you use them all at once. In addition, it provides a powerful bookkeeping system which spans clustering systems, so you can track which of your jobs are complete and what their results were, regardless of how they were run.
Other major features include:
- Output redirection -- the output from each job is saved to a separate file
- Highly scriptable in Python -- Hogwash is written in Python, allowing you to manipulate all Hogwash objects programmatically
- Since some features (output redirection, dependencies, putting jobs on hold) are implemented in Hogwash itself, you can make use of them even if the underlying cluster system does not support them.
See this page for information about how to set up your environment to work with Hogwash.
Make a batch file, i.e. a list of commands:
    command1
    command2
    command3
    ...
    commandN
Each command must be independent: it cannot rely on any other job running concurrently. A command can require that it be run after certain other jobs, but we will not discuss dependencies in this document. Essentially, you must allow for any possible subset of commands (i.e. jobs) to be run concurrently (modulo dependencies). Let's make a simple, if somewhat silly, batch file (example.batch) for demonstration purposes:
    echo "Hi Mom!"
    echo "This is my second Hogwash command."
    echo "Pipe example" | rev

    # exit code examples
    true
    false
Note how each command is on a separate line and that you may safely use empty lines to space out your commands. Finally, lines beginning with # are considered comments and not counted as commands.
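The rules above (one command per line, empty lines ignored, `#` lines treated as comments) can be illustrated with a minimal Python sketch. This is not Hogwash's own parser: `parse_batch` and `EXAMPLE_BATCH` are hypothetical names, the position of the blank line in the example is assumed, and stripping surrounding whitespace before checking for `#` is also an assumption.

```python
def parse_batch(text):
    """Return the commands in a batch file, one per non-empty, non-comment line."""
    commands = []
    for line in text.splitlines():
        stripped = line.strip()  # assumption: surrounding whitespace is ignored
        # Skip empty lines and comment lines (those beginning with '#').
        if not stripped or stripped.startswith("#"):
            continue
        commands.append(stripped)
    return commands


# The example batch file from this document (blank-line placement assumed).
EXAMPLE_BATCH = '''\
echo "Hi Mom!"
echo "This is my second Hogwash command."
echo "Pipe example" | rev

# exit code examples
true
false
'''

if __name__ == "__main__":
    # Print each command next to its sequential, zero-based index.
    for index, command in enumerate(parse_batch(EXAMPLE_BATCH)):
        print(index, command)
```

Under these assumptions the example file yields five commands, since the comment and the empty line are not counted.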
To run a batch file, our first step is to create a Session. A Session keeps track of the state of a set of jobs. Sessions are implemented on disk as directories and must have a filesystem path. All work with a session will involve the Session's filesystem path (sometimes called name or sessiondir).
We create a Session with a program called hogmigrate, so named because it helps people migrate from existing clustering systems that take batch files (and not because it has anything to do with process migration). By default, hogmigrate chooses the session name by taking the batch file name and prepending hog- to it:
    shell> hogmigrate example.batch
    Migrate: There's already a 'example.batch', so I'm calling your session 'hog-example.batch'.
    Hogwash: Session directory: /home/dmcc/rev/Hogwash/doc/hog-example.batch
    Hogwash: Generating Quahog batch file for all jobs
    Hogwash: Generating Quahog runner file.
If you wish to choose a different session name, use the -n parameter:
    shell> hogmigrate example.batch -n hog-example
    Hogwash: Session directory: /home/dmcc/rev/Hogwash/doc/hog-example
    Hogwash: Generating Quahog batch file for all jobs
    Hogwash: Generating Quahog runner file.
We are now ready to enter the session with Sty, the Hogwash shell. Sty is designed to handle most of the tasks you would want to perform on a Session. Sty takes the session name as an argument:
    shell> sty hog-example
    Welcome to Sty [main branch], the Hogwash shell. Type 'help' for help.
    Hogwash: Session directory: /home/dmcc/rev/Hogwash/doc/hog-example
    sty>
To see a list of all Sty commands, type help. You will see something like this (note that the actual list of commands may change as Sty develops):
    sty> help
    Documented commands (type help <topic>):
    ========================================
    args              find                quit         status
    averagejoblength  grep                rerun        thread
    check             help                results      timeperjobhistogram
    clear             hold                rungrid      viewlogs
    clearerror        jobcompletiongraph  runlocally   viewquahoglog
    dep               jobstats            runquahog    viewstatuslog
    errorsummary      kill                sanitycheck  watch
    examine           makebatchfile       session
    exit              quahoglog           shelve

    Miscellaneous help topics:
    ==========================
    joblist
The most common command is status, which displays a list of all jobs in the current session and their state:
    sty> status
    Job report:
    Queued: 0-4 [5]
We identify jobs by sequential index numbers. This report shows that all five jobs in the example session are queued and ready to be run. Each index number corresponds to a command's position in the original batch file, counting from zero (comments and empty lines are not counted). To see the command associated with a specific job number, use args:
    sty> args 1
    Job: 1
    Arguments: ['echo "This is my second Hogwash command."']
args by itself displays all jobs:
    sty> args
    0 ['echo "Hi Mom!"']
    1 ['echo "This is my second Hogwash command."']
    2 ['echo "Pipe example" | rev']
    3 ['true']
    4 ['false']
Now we are ready to run some jobs. Sty provides three ways to run jobs (so far): on the local machine, remotely via quahog, or on the grid. Local operation is the most reliable:
    sty> runlocally
    Tue Oct 24 13:50:04 2006: Running job 4 locally (anquetil)
    Tue Oct 24 13:50:04 2006: Job 4 finished with error.
    Tue Oct 24 13:50:04 2006: Running job 3 locally (anquetil)
    Tue Oct 24 13:50:04 2006: Job 3 finished successfully.
    Tue Oct 24 13:50:04 2006: Running job 2 locally (anquetil)
    Tue Oct 24 13:50:04 2006: Job 2 finished successfully.
    Tue Oct 24 13:50:04 2006: Running job 1 locally (anquetil)
    Tue Oct 24 13:50:04 2006: Job 1 finished successfully.
    Tue Oct 24 13:50:04 2006: Running job 0 locally (anquetil)
    Tue Oct 24 13:50:04 2006: Job 0 finished successfully.
    sty>
Once a set of jobs has been run, their status changes from queued to finished (or error, if they encounter some problem):
    sty> status
    Job report:
    Finished: 0-3 [4]
    Error: 4 [1]
    sty>
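The status report summarizes each group of jobs as a range of indices followed by a count in brackets, such as `0-3 [4]`. A small Python sketch of how such a summary could be produced is shown here. It is only a guess at the format based on the output in this document: `format_joblist` is a hypothetical name, and the comma used to separate several ranges and the text for an empty group are assumptions, since the example output never shows those cases.

```python
def format_joblist(indices):
    """Collapse job indices into ranges plus a count, e.g. [0, 1, 2, 3] -> '0-3 [4]'."""
    ordered = sorted(set(indices))
    if not ordered:
        return "none [0]"  # assumption: how an empty group is shown is unknown
    ranges = []
    start = prev = ordered[0]
    for index in ordered[1:]:
        if index == prev + 1:
            # Still inside a run of consecutive indices.
            prev = index
            continue
        ranges.append((start, prev))
        start = prev = index
    ranges.append((start, prev))
    # A run of length one is printed as a single number.
    parts = [str(a) if a == b else f"{a}-{b}" for a, b in ranges]
    # Assumption: several ranges are joined with commas.
    return f"{','.join(parts)} [{len(ordered)}]"


if __name__ == "__main__":
    print("Finished:", format_joblist([0, 1, 2, 3]))
    print("Error:", format_joblist([4]))
```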
Four of our jobs finished correctly; the fifth had a problem, which we can investigate using results:
    sty> results 4
    Job: 4
    Arguments: ['false']
    Logfile is nonempty.
    Exception: Hogwash.Errors.BadExitCode: Command 'false' died with: Exited with code 1 (exit status 256)
    Ran for 0s on anquetil:21417 [local]
    sty>
As you can see, the failure is caused by the nonzero return value of the process. This is the only way for a job from hogmigrate to fail (and even this behavior can be disabled when hogmigrate is invoked -- see hogmigrate -h). results also shows the return values of successful jobs, and other information about them:
    sty> results 0
    Job: 0
    Arguments: ['echo "Hi Mom!"']
    Logfile is nonempty.
    Result: 0
    Ran for 0s on anquetil:21423 [local]
    sty>
This report does not show what you might consider the most useful result of the job -- the output it sent to standard out and standard error. You can see this output using viewlogs:
    sty> viewlogs 0
viewlogs pages the output through less(1).
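To make the idea of per-job output redirection concrete, here is a minimal Python sketch that runs one command and sends its standard output and standard error to a file of its own. It is not how Hogwash implements logging: the function name `run_with_log` and the `job-N.log` file naming are hypothetical, and the current Python interpreter is used as a portable stand-in for `echo`.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_log(index, argv, log_dir):
    """Run one job, sending stdout and stderr to that job's own log file."""
    # Hypothetical naming scheme; Hogwash's real log layout may differ.
    log_path = Path(log_dir) / f"job-{index}.log"
    with open(log_path, "wb") as log:
        # stderr is merged into stdout so both end up in the same file.
        completed = subprocess.run(argv, stdout=log, stderr=subprocess.STDOUT)
    return completed.returncode, log_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        argv = [sys.executable, "-c", "print('Hi Mom!')"]
        code, log_path = run_with_log(0, argv, log_dir)
        print("exit code:", code)
        print("log contents:", log_path.read_text().strip())
```

Keeping one file per job means the output of concurrently running jobs never interleaves, which is what makes it possible to page through a single job's output afterwards.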
Once a job leaves the queue, it may not be run again (by any method). This ensures that we do not waste effort running any job multiple times. Of course, you may wish to rerun a job for various reasons. The clear command returns a job to queued status so that it can be run again. clear also wipes out all log files and results of the previous run.
To clear more than one job at once, you can use clear with a joblist. In fact, most Sty commands can process joblists. Sty provides help joblist for detailed information. The examples in this document use either a single job index (such as 4) or the keyword all.
Let's clear the jobs in our example session so we can run them again:
    sty> clear all
    Clearing data and logs for 0-4 [5]
    sty> status
    Job report:
    Queued: 0-4 [5]
    sty>
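The job lifecycle described in this section (queued, then finished or error, and back to queued only via clear) can be modeled as a tiny state machine. The Python sketch below is a toy model for illustration, not Hogwash's actual class: the `Job` class, its `record` method, and the state names as strings are all hypothetical.

```python
QUEUED, FINISHED, ERROR = "queued", "finished", "error"


class Job:
    """Toy model of one job's lifecycle; not Hogwash's actual class."""

    def __init__(self, index):
        self.index = index
        self.state = QUEUED
        self.result = None

    def record(self, exit_code):
        """Record the outcome of a run; only queued jobs may be run."""
        if self.state != QUEUED:
            raise RuntimeError(f"job {self.index} is {self.state}; clear it first")
        self.result = exit_code
        # A nonzero exit code marks the job as failed.
        self.state = FINISHED if exit_code == 0 else ERROR

    def clear(self):
        """Discard the previous result and return the job to the queue."""
        self.result = None
        self.state = QUEUED


if __name__ == "__main__":
    job = Job(4)
    job.record(1)      # the 'false' command exits with code 1
    print(job.state)
    job.clear()
    print(job.state)
```

Refusing to run a job that has already left the queue is what prevents the same work from being done twice when several run methods are used on one session.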
To run jobs via the department's quahog distribution system, we can use runquahog:
    sty> clear all
    Clearing data and logs for 0-4 [5]
    sty> runquahog
    Sty: Performing sanity check...
    Sty: Checking for unfinished jobs...
    Hogwash: Running quahog: 'qrun -x -w -c -q -g -b /home/dmcc/rev/Hogwash/doc/hog-example/quahogbatch 1>> /home/dmcc/rev/Hogwash/doc/hog-example/quahoglog 2>> /home/dmcc/rev/Hogwash/doc/hog-example/quahoglog'
    qrun exited with exit code: Exited with code 2 (exit status 512)
    (Some of these codes are actually okay and I have yet to make a list of all the safe ones.)
    sty> status
    Job report:
    Finished: 0-3 [4]
    Error: 4 [1]
    sty>
Sty appears to print an error message related to the exit code from qrun, but in fact qrun's exit status can be safely ignored. Our status report shows all jobs finished as they should have.
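The messages in this document pair an exit code with a larger "exit status" (code 1 with status 256, code 2 with status 512). This matches the conventional encoding of a raw wait status on Linux and most Unix systems, where the exit code is stored in the second-lowest byte. The Python sketch below decodes such a value by hand; `decode_wait_status` is a hypothetical helper, and the assumption that Hogwash reports a raw wait status is inferred from the numbers rather than stated by the source.

```python
def decode_wait_status(status):
    """Split a raw Unix wait status into a (kind, value) pair."""
    if status & 0xFF == 0x7F:
        # Low byte 0x7F conventionally means the process was stopped.
        return ("stopped", (status >> 8) & 0xFF)
    signal_number = status & 0x7F
    if signal_number == 0:
        # Normal exit: the exit code lives in bits 8-15.
        return ("exited", (status >> 8) & 0xFF)
    # Otherwise the process was terminated by a signal.
    return ("signaled", signal_number)


if __name__ == "__main__":
    for status in (0, 256, 512):
        print(status, decode_wait_status(status))
```

On Unix, Python's standard library offers `os.WIFEXITED` and `os.WEXITSTATUS` for the same purpose; the manual version is shown only to make the arithmetic visible.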
We could also have run these jobs on the grid, using rungrid. Like runlocally and runquahog, rungrid takes a joblist. It also takes some grid-specific options -- see help rungrid.