Brown Coherence Toolkit (v0.2)

Contents

What is the Coherence Toolkit?

Discourse coherence is a set of structural relationships between the sentences of a text which make later content intelligible and predictable based on the context established by earlier content. Coherent texts are "smooth"; incoherent ones are "choppy". Models of coherence have been applied to multidocument summarization, essay grading and text generation.

The toolkit is a set of C++ libraries and programs for creating and evaluating coherence models. The code is copyright under the Brown license. Please don't steal it or we'll be forced to send out the attack shark.

Main Authors:

Micha Elsner (melsner@cs.brown.edu)

Eugene Charniak (ec@cs.brown.edu)

Contributors:

Joseph Austerweil

Joseph Browne

Mark Johnson

Installing the Toolkit

Unzip the files to a directory.

Edit the file include/common.h : change the definition of DATA_PATH so that it points to the data directory. (No trailing slash!)

Check that you can make things:

% make everything

Check that you can run things. We have no automatic unit tests yet, so see the Using the Toolkit section below and try to run some of the examples.

Depending on your version of the compiler and general system setup, you may need to alter common.h to point to the correct versions of unordered_map and unordered_set. (For instance you may need #include <ext/hash_map> and __gnu_cxx::hash_map. In this case you'll also need to define the hashing function for strings yourself.)

You'll also need the correct path to the GNU scientific library. If you want to use the conditional word-to-word model, you will need a newer version with its included optimization routines.

Some older code uses the TAO optimizer. If you want to use this, install and set up TAO and then edit the Makefile (set TAO to 1). However, the current version has no live code using TAO.

Some (not currently active) code uses the Wordnet stemmer. If you want to use this, install Wordnet and set your $WNHOME environment variable to point to it. Then edit the Makefile (set WORDNET to 1).

The conditional discourse-new model uses Hal Daume III's megam classifier. To use this model, you must download megam, and then change the path in DiscSyntaxConditional.h to point to it.

The Makefile has the debug flag set. If all you're doing is using the tools, it's safe to turn on optimization instead (by setting OPTFLAGS to -O3).

Using the Toolkit

To make any program mentioned below:

% make program

To run an executable:

% bin/program

The "important" output from a program goes to STDOUT; debugging information goes to STDERR.

Document Formats

The toolkit operates on files containing Penn Treebank trees. (The passive-sentence detector expects auxiliary verbs to be marked as AUX. However this function is not currently used.)

Each tree must be printed on a single line. Separate documents are distinguished in one of three ways:

  • One document per file.
  • Documents separated by a single blank line.
  • Documents separated by the tree (S1 (S (NN end-of-article)))

You can configure the system to read pretty-printed trees as well, by setting Trees::singleLineReader to 0 in src/common.cc. If you do this, you may no longer separate documents by a blank line.

Looking at Data

TestSent shows documents without syntactic annotation:

% bin/TestSent data/wsj_0204.mrg
data/wsj_0204.mrg

the nation 's largest pension fund , which oversees $ 80 billion for college employees ,
        plans to offer two new investment options to its 1.2 million participants .  (data/wsj_0204.mrg-0)
*REST OF DOCUMENT*

TestGrid shows the entity grid for a document, which is useful for seeing how the syntactic analysis works:

% bin/TestGrid data/wsj_0204.mrg
data/wsj_0204.mrg

Done
           $ O - - - - - - - - - - - -
      NATION X - - - - - - - - - - - -
        FUND S S - - X S S - X S S - X
        *REST OF THE GRID*

Making Models

Each model type has an associated flag. The flag for the naive entity grid is -n; for the full flag list, see What models are included?.

Train constructs and trains a model, then writes it to standard out:

% bin/Train -n wsj_0204.mrg > 204model
Making naive egrid model with 2 history items,  max salience 4,  smoothing 1,  generativity 1.
data/wsj_0204.mrg ...
Estimating parameters.

The IBM Model family requires lots of training data, so it has a specialized training program, TrainIBM, that does not keep all the documents in memory:

% bin/TrainIBM -w data/wsj_0204.mrg > 204ibmmodel
Making word to word model, smooth 0 bias 2 root 1.
data/wsj_0204.mrg ...
Estimating parameters.
EM: 0
    0
Likelihood: -408.277
*MORE EM ITERATIONS*
Done.

Examining a Model

Model files are plain text, though they may not be particularly intelligible. For instance, the 204model file created above looks like this:

NAIVE
2       4       1       1
1
0
        0       0.25
        1       0.25
        2       0.25
        3       0.25

>>
*MORE PARAMETERS*

The best way to see what's really going on is to use Print; exactly what this output looks like depends on the model type:

% bin/Print -n 204model | head -n 20
Model is: 204model
Loading naive entity grid.
[S S 2]:
        S:      0.25
        O:      0.25
        X:      0.25
        -:      0.25

Combining Models

The testing programs are capable of loading multiple models and computing a joint score. To do this, just use multiple flags:

% bin/DiscriminateRand -n 204model -w 204ibmmodel data/wsj_0204.mrg
Model is: 204model
Loading naive entity grid.
Model is: 204ibmmodel
Loading word-to-word IBM model.
data/wsj_0204.mrg ...
*MORE OUTPUT*

Testing Models

The binary discrimination task is the simplest and fastest evaluation. It tests the model's ability to distinguish between a human-authored document in its original order, and a random permutation of that document. This task was first described by [BarzilayLapata05].

To run this test, use DiscriminateRand; this program reads any number of documents and performs the test on each one, using 20 random permutations. (You can change the number by editing the code.) The program prints the raw number of wins, ties and tests to STDOUT:

% bin/DiscriminateRand -n 204model -w 204ibmmodel data/wsj_0204.mrg
Model is: 204model
Loading naive entity grid.
Model is: 204ibmmodel
Loading word-to-word IBM model.
data/wsj_0204.mrg ...
Gold score: -647.805
Score: -918.065 WIN
1 tests, 1 wins, 0 ties.
Mean score: -782.935.
Total score: -1565.87.
Mean gold score: -647.805.
Total gold score: -647.805.
Average margin: 270.26.
Accuracy: 1 F: 1.

You can also do the discrimination test between arbitrary files by using Discriminate, which reads a single gold document, followed by a set of test files:

% bin/Discriminate -n 204model data/wsj_0204.mrg data/wsj_0204.mrg
Model is: 204model
Loading naive entity grid.
Reading the gold file data/wsj_0204.mrg ...
Gold score: -254.654
data/wsj_0204.mrg...
Score: -254.654 TIE
1 tests, 0 wins, 1 ties.
Mean score: -254.654.
Total score: -509.309.
Average margin: 0.

The insertion test finds the optimal place to insert each sentence into the document, given the correct ordering of the other sentences. It is quadratic in document length, which is still usually pretty fast. It is typically more difficult than discrimination. The program reports two scores: perfect insertions and the positional score described in [ElsnerCharniak07]. The task was described simultaneously in [ChenEtAl07] and [ElsnerCharniak07].

To run this test, use Insert, which prints the raw number of perfect insertions and the mean positional score (averaged by document) to STDOUT:

% bin/Insert -n 204model data/wsj_0204.mrg
Model is: 204model
Loading naive entity grid.
Reading the gold file data/wsj_0204.mrg ...
*THE DOCUMENT*
the nation 's largest pension fund , which oversees $ 80 billion for
college employees , plans to offer two new investment options to its 1.2 million participants .  (data/wsj_0204.mrg-0)
0:      -254.654:       1
1:      -258.503:       0.833333
2:      -261.975:       0.666667
3:      -262.353:       0.5
4:      -261.864:       0.333333
5:      -261.167:       0.166667
6:      -262.806:       0
7:      -266.386:       -0.166667
8:      -265.153:       -0.333333
9:      -263.027:       -0.5
10:     -263.213:       -0.666667
11:     -264.057:       -0.833333
12:     -259.863:       -1
Removed: 0      Inserted: 0
Score: 1
*12 MORE INSERTIONS*
Document mean: 0.746684
Document perfect: 11 of 13

Mean: 0.746684
Perfect: 11 of 13
Perfect (by line): 0.846154
Perfect (by doc): 0.846154
Mean (by line): 0.746684

The trace output shows the line being inserted, and then, for each possible position, the score assigned by the model and the value of the positional metric corresponding to that position.

The ordering test finds the ordering of all the sentences which is maximally coherent according to the model. This search problem is exponential in the document length, and a variety of search algorithms have been proposed; we use simulated annealing. The search schedule is the one used on the AIRPLANE corpus in [ElsnerEtAl07]. You may want to change it for longer documents, in which case you should edit the code.

The code reports Kendall's Tau as an evaluation metric. (The BLEU metric is not provided... sorry!) Both the task and the tau metric were first used in [Lapata03].

Run this test by using Order, which prints the average tau score to STDOUT:

% bin/Order -n 204model data/wsj_0204.mrg
Model is: 204model
Loading naive entity grid.
Reading the gold file data/wsj_0204.mrg ...
Gold score: -254.654
Starting search...
0: 1: -276.317: -0.384615: 5 8 6 12 4 10 9 11 7 2 3 0 1
New optimum: -274.21: -0.205128:  5 2 6 12 4 10 9 11 7 8 3 0 1
*LOTS OF ANNEALING*
Gold score: -254.654
Solver objective value: -250.338
Solution: 0 1 7 8 9 6 3 4 5 10 2 12 11
the nation 's largest pension fund , which oversees $ 80 billion for
college employees , plans to offer two new investment options to its 1.2 million participants .  (data/wsj_0204.mrg-0)
*THE REST OF THE DOCUMENT, IN THE ORDER FOUND*
Document tau: 0.384615

Tau: 0.384615

The trace output shows the iteration number, the temperature, the current log likelihood, the tau value and finally the permutation. When a new optimum is found, the trace shows the likelihood, tau and permutation.

We run our tests in parallel, on our in-house computing cluster. Unfortunately, the software we use for that is unlikely to work anywhere else. You may want to write your own scripts, though.

What models are included?

The toolkit includes a variety of generative coherence models. All these models take various constructor arguments; the defaults are set in Train.cc.

Naive Entity Grid

The naive entity grid (flag -n) is the generative entity grid described by [LapataBarzilay05]. For each sentence, it generates the syntactic role (or non-appearance) of each entity in the document, given its syntactic role in the previous sentences. (For a look at this representation, see TestGrid.) It also conditions on how many times the entity occurs in the entire document ("salience"). The model distinguishes different "entities" using the same-head heuristic (two NPs are part of the same entity if they have the same head noun).

The model takes several configuration arguments: The number of previous symbols to condition on (pretty self-explanatory). The maximum salience number (if an entity occurs more than this many times, it is counted as occurring this many times). An additive smoothing constant (added to all counts). Finally a "generativity" setting:

  • NOT_GEN : produces an "end-of-document" symbol for each entity.
  • SLIGHTLY_GEN : no "end-of-document" symbol.
  • MORE_GEN : if an item has salience number n, and has already occurred n times, it has probability 1.0 of not occurring anymore.
  • REALLY_GEN : replaces the salience number (number of occurrences total) with the number of previous occurrences.

Relaxed Entity Grid

The relaxed entity grid (flag -r) is the variant of the naive grid described in [ElsnerEtAl07], with some minor improvements over the published model (it now generates the number of entities in each sentence and the syntactic roles of all new (never previously mentioned) entities). It assigns a previously-mentioned entity to each remaining role, conditioned on the syntactic roles of all previously-mentioned entities and their saliences. (See the paper for details.)

As mentioned in the corrected version of the paper, it does not work quite as well as the naive model. (The published version of the paper claimed that it works better; this was due to a software bug, plus the idiosyncracies of the AIRPLANE corpus.) Just to be clear: this is a documentary release only, and we recommend that you use the naive entity grid instead.

The underlying entity grid representation is the same as in the naive grid, and so are the constructor arguments, except for the generativity setting.

  • NOT_GEN : assigns probability (because of the smoothing) to an entity occurring when its immediate history contains as many symbols as its total occurrences (eg, [S S 2]).
  • SLIGHTLY_GEN : same as NOT_GEN.
  • MORE_GEN : probability of an entity occurring after [S S 2] is 0.0.
  • REALLY_GEN : replaces the salience number (number of occurrences total) with the number of previous occurrences.

The consistent entity grid (flag -c) is the same as the relaxed grid, expect that instead of the heuristic parameter estimation described in [ElsnerEtAl07], it uses an estimator we suspect is consistent. This sounds cool, but in actuality the heuristic estimates tend to work very slightly better. A previous version used yet another estimator (and required TAO); the current version is more efficient, and requires the GSL linear algebra routines (see Installing the Toolkit). But there is no real point to using this model in the first place.

Discourse-new Model

The discourse-new model (flag -d) is described in [ElsnerCharniak07]. It models the syntax of first mentions and subsequent mentions of entities. It generates modifiers according to a Markov chain, which can be configured:

  • context : controls how many previous modifiers to condition on.
  • adjacent : if set to true, modifiers are produced based on distance from the phrasal head, instead of the actual previous modifiers.
  • smooth : constant-factor discounting.
  • byRef : if true, expects coreference annotations on the trees eg (NP#-1 (DT The) (NNP Federal) (NNP Aviation) (NNP Administration)) . If you want to use this on the MUC corpus, contact me and I'll send you the conversion scripts we used to produce this annotation. If false, uses the same-head heuristic. (Regardless of the value of this parameter during training, the training program sets false once training is complete. If you want to test the model on coreference-annotated data, comment this out.)

Conditional Discourse-new Model

The conditional discourse-new model (flag -dc) is described in [ElsnerCharniak08]. It uses a maximum-entropy classifier (learned via Hal Daume's megam system) to classify NPs as new or old based on syntax. It is about as effective as the generative discourse-new model, but simpler to explain.

Since it relies on invoking an external program, the model dumps its data in two external files: one holding the features and the other holding the learned model. To do a new experiment, you should first change the names of these external files-- creating a new model using Train will otherwise overwrite the previous model's data files.

  • byRef : if true, expects coreference annotations on the trees eg (NP#-1 (DT The) (NNP Federal) (NNP Aviation) (NNP Administration)) . If you want to use this on the MUC corpus, contact me and I'll send you the conversion scripts we used to produce this annotation. If false, uses the same-head heuristic. (Regardless of the value of this parameter during training, the training program sets false once training is complete. If you want to test the model on coreference-annotated data, comment this out.)
  • newHead : if true, adds a feature indicating whether the head of the NP has been seen before. This is useful only if you want to test the model's discourse-new detection on documents in their original order-- for disordered documents, it is merely misleading.
  • stem : the prefix used for the model's internal data files, which are named stem-model and stem-features.

Pronoun Model

The pronoun model (flag -p) runs the generative pronoun coreference model of [GeEtAl98] and uses the probability of the document as its coherence score. It is described in [ElsnerCharniak08]. The software for this model comes from a separate project, so the model doesn't have an integrated trainer. Instead you must train it using geD/trainPrns. The constructor argument is the name of the resulting file. (The provided file data/probs.txt is the result of training on the hand-annotated WSJ articles in geD/DATA.)

IBM Model

The IBM model (flag -w) uses a modified IBM model 1 alignment system. It was first used for coherence modeling by [SoricutMarcu06]. Our version aligns every noun and verb in each sentence with the nouns and verbs of the previous sentence (plus a null word).

The model smooths the emission distribution from the null word by adding 1 to every count; this ensures that all documents have positive probability. It also has three construction arguments that control additional smoothing:

  • smooth : constant factor discounting for all emission probabilities.
  • bias : aligns words to the null word with probability (1 + bias) / (|s| + bias) where |s| is the number of nouns and verbs in the previous sentence.
  • root : the probability is raised to the power of root. This has no effect when the model is used alone, but can help if you are combining it with something else.

Conditional Word-To-Word Model

The conditional word-to-word model (flag -cw) learns associations between pairs of words using logistic regression. It can be considered a variant of [Lapata03], or of the IBM model (which it relies on for a lot of shared code).

Since the model requires negative training instances, it has its own trainer, TrainConditional, which uses a single random permutation of each training document as a negative example.

It takes three construction arguments:

  • prevS : number of previous sentences to use as context
  • maxDepth : ignored
  • lambda : regularization constant

Joint Model

Models are combined using the joint model, which just multiplies all the likelihoods together. Obviously this is mathematically valid only for models which generate different things (for instance, an entity grid and pronouns). There is no flag for the joint model; it is constructed automatically when you specify multiple flags on the command line (see Combining Models). Currently there is no mixture model implementation as in [SoricutMarcu06].

Programming with the Toolkit

The code uses a fair amount of inheritance, and it's often worthwhile looking at the ancestor classes before trying to figure out how a derived class works. The base class for all models is CoherenceModel. The typical life of a CoherenceModel:

If you want to evaluate the model on many different permutations of a document, you will use permProbability. This is the main method for both training and testing the model; train and logProbability both call it internally. Models are allowed to cache information about a document that doesn't vary with order (for instance, the number of occurrences of a word). Therefore, you must call initCaches before invoking permProbability. When you are done using permProbability, call clearCaches.

Changes

version 0.2

  • DiscriminateRand uses more permutations
  • WordToWordIBM has extensive interface changes
  • added conditional discourse-new model
  • added conditional word-to-word model
  • estimator used in SelectionDist rewritten for increased efficiency (write me for details if needed); now uses GSL linear algebra package instead of TAO
  • makefile rewritten to produce 32 or 64-bit software automatically (depending on build environment)

References

[BarzilayLapata05]Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: an entity-based approach. In ACL'05.
[ChenEtAl07]Erdong Chen, Benjamin Snyder and Regina Barzilay. 2007. Incremental text structuring with online hierarchical ranking. EMNLP.
[ElsnerEtAl07](1, 2, 3) Micha Elsner, Joseph Austerweil and Eugene Charniak. 2007. A unified local and global model for discourse coherence. NAACL'07.
[ElsnerCharniak07](1, 2, 3) Micha Elsner and Eugene Charniak. A generative discourse-new model for text coherence. Tech report CS-07-04, Brown University.
[ElsnerCharniak08](1, 2) Micha Elsner and Eugene Charniak. Coreference-inspired coherence modeling. ACL'08.
[GeEtAl98]Niyu Ge, John Hale and Eugene Charniak. 1998. A statistical approach to anaphora resolution. 6th Workshop on Very Large Corpora.
[Lapata03](1, 2) Mirella Lapata. 2003. Probabilistic text structuring: Experiments with sentence ordering. In ACL'03.
[LapataBarzilay05]Mirella Lapata and Regina Barzilay. 2005. Automatic evaluation of text coherence: Models and representations. IJCAI pgs 1085-1090.
[SoricutMarcu06](1, 2) Radu Soricut and Daniel Marcu. 2006. Discourse generation using utility-trained coherence models. ACL'06.