Discourse coherence is a set of structural relationships between the sentences of a text which make later content intelligible and predictable based on the context established by earlier content. Coherent texts are "smooth"; incoherent ones are "choppy". Models of coherence have been applied to multidocument summarization, essay grading and text generation.
The toolkit is a set of C++ libraries and programs for creating and evaluating coherence models. The code is copyright under the Brown license. Please don't steal it or we'll be forced to send out the attack shark.
| Main Authors: | Micha Elsner (melsner@cs.brown.edu) Eugene Charniak (ec@cs.brown.edu) |
|---|---|
| Contributors: | Joseph Austerweil Joseph Browne Mark Johnson |
Unzip the files to a directory.
Edit the file include/common.h : change the definition of DATA_PATH so that it points to the data directory. (No trailing slash!)
Check that you can make things:
% make everything
Check that you can run things. We have no automatic unit tests yet, so see the Using the Toolkit section below and try to run some of the examples.
Depending on your version of the compiler and general system setup, you may need to alter common.h to point to the correct versions of unordered_map and unordered_set. (For instance you may need #include <ext/hash_map> and __gnu_cxx::hash_map. In this case you'll also need to define the hashing function for strings yourself.)
You'll also need the correct path to the GNU scientific library. If you want to use the conditional word-to-word model, you will need a newer version with its included optimization routines.
Some older code uses the TAO optimizer. If you want to use this, install and set up TAO and then edit the Makefile (set TAO to 1). However, the current version has no live code using TAO.
Some (not currently active) code uses the Wordnet stemmer. If you want to use this, install Wordnet and set your $WNHOME environment variable to point to it. Then edit the Makefile (set WORDNET to 1).
The conditional discourse-new model uses Hal Daume III's megam classifier. To use this model, you must download megam, and then change the path in DiscSyntaxConditional.h to point to it.
The Makefile has the debug flag set. If all you're doing is using the tools, it's safe to turn on optimization instead (by setting OPTFLAGS to -O3).
To make any program mentioned below:
% make program
To run an executable:
% bin/program
The "important" output from a program goes to STDOUT; debugging information goes to STDERR.
The toolkit operates on files containing Penn Treebank trees. (The passive-sentence detector expects auxiliary verbs to be marked as AUX. However this function is not currently used.)
Each tree must be printed on a single line. Separate documents are distinguished in one of three ways:
You can configure the system to read pretty-printed trees as well, by setting Trees::singleLineReader to 0 in src/common.cc. If you do this, you may no longer separate documents by a blank line.
TestSent shows documents without syntactic annotation:
% bin/TestSent data/wsj_0204.mrg
data/wsj_0204.mrg
the nation 's largest pension fund , which oversees $ 80 billion for college employees ,
plans to offer two new investment options to its 1.2 million participants . (data/wsj_0204.mrg-0)
*REST OF DOCUMENT*
TestGrid shows the entity grid for a document, which is useful for seeing how the syntactic analysis works:
% bin/TestGrid data/wsj_0204.mrg
data/wsj_0204.mrg
Done
$ O - - - - - - - - - - - -
NATION X - - - - - - - - - - - -
FUND S S - - X S S - X S S - X
*REST OF THE GRID*
Each model type has an associated flag. The flag for the naive entity grid is -n; for the full flag list, see What models are included?.
Train constructs and trains a model, then writes it to standard out:
% bin/Train -n wsj_0204.mrg > 204model Making naive egrid model with 2 history items, max salience 4, smoothing 1, generativity 1. data/wsj_0204.mrg ... Estimating parameters.
The IBM Model family requires lots of training data, so it has a specialized training program, TrainIBM, that does not keep all the documents in memory:
% bin/TrainIBM -w data/wsj_0204.mrg > 204ibmmodel
Making word to word model, smooth 0 bias 2 root 1.
data/wsj_0204.mrg ...
Estimating parameters.
EM: 0
0
Likelihood: -408.277
*MORE EM ITERATIONS*
Done.
Model files are plain text, though they may not be particularly intelligible. For instance, the 204model file created above looks like this:
NAIVE
2 4 1 1
1
0
0 0.25
1 0.25
2 0.25
3 0.25
>>
*MORE PARAMETERS*
The best way to see what's really going on is to use Print; exactly what this output looks like depends on the model type:
% bin/Print -n 204model | head -n 20
Model is: 204model
Loading naive entity grid.
[S S 2]:
S: 0.25
O: 0.25
X: 0.25
-: 0.25
The testing programs are capable of loading multiple models and computing a joint score. To do this, just use multiple flags:
% bin/DiscriminateRand -n 204model -w 204ibmmodel data/wsj_0204.mrg Model is: 204model Loading naive entity grid. Model is: 204ibmmodel Loading word-to-word IBM model. data/wsj_0204.mrg ... *MORE OUTPUT*
The binary discrimination task is the simplest and fastest evaluation. It tests the model's ability to distinguish between a human-authored document in its original order, and a random permutation of that document. This task was first described by [BarzilayLapata05].
To run this test, use DiscriminateRand; this program reads any number of documents and performs the test on each one, using 20 random permutations. (You can change the number by editing the code.) The program prints the raw number of wins, ties and tests to STDOUT:
% bin/DiscriminateRand -n 204model -w 204ibmmodel data/wsj_0204.mrg Model is: 204model Loading naive entity grid. Model is: 204ibmmodel Loading word-to-word IBM model. data/wsj_0204.mrg ... Gold score: -647.805 Score: -918.065 WIN 1 tests, 1 wins, 0 ties. Mean score: -782.935. Total score: -1565.87. Mean gold score: -647.805. Total gold score: -647.805. Average margin: 270.26. Accuracy: 1 F: 1.
You can also do the discrimination test between arbitrary files by using Discriminate, which reads a single gold document, followed by a set of test files:
% bin/Discriminate -n 204model data/wsj_0204.mrg data/wsj_0204.mrg Model is: 204model Loading naive entity grid. Reading the gold file data/wsj_0204.mrg ... Gold score: -254.654 data/wsj_0204.mrg... Score: -254.654 TIE 1 tests, 0 wins, 1 ties. Mean score: -254.654. Total score: -509.309. Average margin: 0.
The insertion test finds the optimal place to insert each sentence into the document, given the correct ordering of the other sentences. It is quadratic in document length, which is still usually pretty fast. It is typically more difficult than discrimination. The program reports two scores: perfect insertions and the positional score described in [ElsnerCharniak07]. The task was described simultaneously in [ChenEtAl07] and [ElsnerCharniak07].
To run this test, use Insert, which prints the raw number of perfect insertions and the mean positional score (averaged by document) to STDOUT:
% bin/Insert -n 204model data/wsj_0204.mrg Model is: 204model Loading naive entity grid. Reading the gold file data/wsj_0204.mrg ... *THE DOCUMENT* the nation 's largest pension fund , which oversees $ 80 billion for college employees , plans to offer two new investment options to its 1.2 million participants . (data/wsj_0204.mrg-0) 0: -254.654: 1 1: -258.503: 0.833333 2: -261.975: 0.666667 3: -262.353: 0.5 4: -261.864: 0.333333 5: -261.167: 0.166667 6: -262.806: 0 7: -266.386: -0.166667 8: -265.153: -0.333333 9: -263.027: -0.5 10: -263.213: -0.666667 11: -264.057: -0.833333 12: -259.863: -1 Removed: 0 Inserted: 0 Score: 1 *12 MORE INSERTIONS* Document mean: 0.746684 Document perfect: 11 of 13 Mean: 0.746684 Perfect: 11 of 13 Perfect (by line): 0.846154 Perfect (by doc): 0.846154 Mean (by line): 0.746684
The trace output shows the line being inserted, and then, for each possible position, the score assigned by the model and the value of the positional metric corresponding to that position.
The ordering test finds the ordering of all the sentences which is maximally coherent according to the model. This search problem is exponential in the document length, and a variety of search algorithms have been proposed; we use simulated annealing. The search schedule is the one used on the AIRPLANE corpus in [ElsnerEtAl07]. You may want to change it for longer documents, in which case you should edit the code.
The code reports Kendall's Tau as an evaluation metric. (The BLEU metric is not provided... sorry!) Both the task and the tau metric were first used in [Lapata03].
Run this test by using Order, which prints the average tau score to STDOUT:
% bin/Order -n 204model data/wsj_0204.mrg Model is: 204model Loading naive entity grid. Reading the gold file data/wsj_0204.mrg ... Gold score: -254.654 Starting search... 0: 1: -276.317: -0.384615: 5 8 6 12 4 10 9 11 7 2 3 0 1 New optimum: -274.21: -0.205128: 5 2 6 12 4 10 9 11 7 8 3 0 1 *LOTS OF ANNEALING* Gold score: -254.654 Solver objective value: -250.338 Solution: 0 1 7 8 9 6 3 4 5 10 2 12 11 the nation 's largest pension fund , which oversees $ 80 billion for college employees , plans to offer two new investment options to its 1.2 million participants . (data/wsj_0204.mrg-0) *THE REST OF THE DOCUMENT, IN THE ORDER FOUND* Document tau: 0.384615 Tau: 0.384615
The trace output shows the iteration number, the temperature, the current log likelihood, the tau value and finally the permutation. When a new optimum is found, the trace shows the likelihood, tau and permutation.
We run our tests in parallel, on our in-house computing cluster. Unfortunately, the software we use for that is unlikely to work anywhere else. You may want to write your own scripts, though.
The toolkit includes a variety of generative coherence models. All these models take various constructor arguments; the defaults are set in Train.cc.
The naive entity grid (flag -n) is the generative entity grid described by [LapataBarzilay05]. For each sentence, it generates the syntactic role (or non-appearance) of each entity in the document, given its syntactic role in the previous sentences. (For a look at this representation, see TestGrid.) It also conditions on how many times the entity occurs in the entire document ("salience"). The model distinguishes different "entities" using the same-head heuristic (two NPs are part of the same entity if they have the same head noun).
The model takes several configuration arguments: The number of previous symbols to condition on (pretty self-explanatory). The maximum salience number (if an entity occurs more than this many times, it is counted as occurring this many times). An additive smoothing constant (added to all counts). Finally a "generativity" setting:
The relaxed entity grid (flag -r) is the variant of the naive grid described in [ElsnerEtAl07], with some minor improvements over the published model (it now generates the number of entities in each sentence and the syntactic roles of all new (never previously mentioned) entities). It assigns a previously-mentioned entity to each remaining role, conditioned on the syntactic roles of all previously-mentioned entities and their saliences. (See the paper for details.)
As mentioned in the corrected version of the paper, it does not work quite as well as the naive model. (The published version of the paper claimed that it works better; this was due to a software bug, plus the idiosyncracies of the AIRPLANE corpus.) Just to be clear: this is a documentary release only, and we recommend that you use the naive entity grid instead.
The underlying entity grid representation is the same as in the naive grid, and so are the constructor arguments, except for the generativity setting.
The consistent entity grid (flag -c) is the same as the relaxed grid, expect that instead of the heuristic parameter estimation described in [ElsnerEtAl07], it uses an estimator we suspect is consistent. This sounds cool, but in actuality the heuristic estimates tend to work very slightly better. A previous version used yet another estimator (and required TAO); the current version is more efficient, and requires the GSL linear algebra routines (see Installing the Toolkit). But there is no real point to using this model in the first place.
The discourse-new model (flag -d) is described in [ElsnerCharniak07]. It models the syntax of first mentions and subsequent mentions of entities. It generates modifiers according to a Markov chain, which can be configured:
The conditional discourse-new model (flag -dc) is described in [ElsnerCharniak08]. It uses a maximum-entropy classifier (learned via Hal Daume's megam system) to classify NPs as new or old based on syntax. It is about as effective as the generative discourse-new model, but simpler to explain.
Since it relies on invoking an external program, the model dumps its data in two external files: one holding the features and the other holding the learned model. To do a new experiment, you should first change the names of these external files-- creating a new model using Train will otherwise overwrite the previous model's data files.
The pronoun model (flag -p) runs the generative pronoun coreference model of [GeEtAl98] and uses the probability of the document as its coherence score. It is described in [ElsnerCharniak08]. The software for this model comes from a separate project, so the model doesn't have an integrated trainer. Instead you must train it using geD/trainPrns. The constructor argument is the name of the resulting file. (The provided file data/probs.txt is the result of training on the hand-annotated WSJ articles in geD/DATA.)
The IBM model (flag -w) uses a modified IBM model 1 alignment system. It was first used for coherence modeling by [SoricutMarcu06]. Our version aligns every noun and verb in each sentence with the nouns and verbs of the previous sentence (plus a null word).
The model smooths the emission distribution from the null word by adding 1 to every count; this ensures that all documents have positive probability. It also has three construction arguments that control additional smoothing:
The conditional word-to-word model (flag -cw) learns associations between pairs of words using logistic regression. It can be considered a variant of [Lapata03], or of the IBM model (which it relies on for a lot of shared code).
Since the model requires negative training instances, it has its own trainer, TrainConditional, which uses a single random permutation of each training document as a negative example.
It takes three construction arguments:
Models are combined using the joint model, which just multiplies all the likelihoods together. Obviously this is mathematically valid only for models which generate different things (for instance, an entity grid and pronouns). There is no flag for the joint model; it is constructed automatically when you specify multiple flags on the command line (see Combining Models). Currently there is no mixture model implementation as in [SoricutMarcu06].
The code uses a fair amount of inheritance, and it's often worthwhile looking at the ancestor classes before trying to figure out how a derived class works. The base class for all models is CoherenceModel. The typical life of a CoherenceModel:
If you want to evaluate the model on many different permutations of a document, you will use permProbability. This is the main method for both training and testing the model; train and logProbability both call it internally. Models are allowed to cache information about a document that doesn't vary with order (for instance, the number of occurrences of a word). Therefore, you must call initCaches before invoking permProbability. When you are done using permProbability, call clearCaches.
| [BarzilayLapata05] | Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: an entity-based approach. In ACL'05. |
| [ChenEtAl07] | Erdong Chen, Benjamin Snyder and Regina Barzilay. 2007. Incremental text structuring with online hierarchical ranking. EMNLP. |
| [ElsnerEtAl07] | (1, 2, 3) Micha Elsner, Joseph Austerweil and Eugene Charniak. 2007. A unified local and global model for discourse coherence. NAACL'07. |
| [ElsnerCharniak07] | (1, 2, 3) Micha Elsner and Eugene Charniak. A generative discourse-new model for text coherence. Tech report CS-07-04, Brown University. |
| [ElsnerCharniak08] | (1, 2) Micha Elsner and Eugene Charniak. Coreference-inspired coherence modeling. ACL'08. |
| [GeEtAl98] | Niyu Ge, John Hale and Eugene Charniak. 1998. A statistical approach to anaphora resolution. 6th Workshop on Very Large Corpora. |
| [Lapata03] | (1, 2) Mirella Lapata. 2003. Probabilistic text structuring: Experiments with sentence ordering. In ACL'03. |
| [LapataBarzilay05] | Mirella Lapata and Regina Barzilay. 2005. Automatic evaluation of text coherence: Models and representations. IJCAI pgs 1085-1090. |
| [SoricutMarcu06] | (1, 2) Radu Soricut and Daniel Marcu. 2006. Discourse generation using utility-trained coherence models. ACL'06. |