Tech Report CS-98-06
Getting Useful Gender Statistics from English Text
John Hale and Eugene Charniak
May 1998
Abstract:
Gender, understood as a lexical feature, is important for anaphora resolution because it narrows the set of possible referents for a pronoun. This work describes an automatic method for obtaining reliable guesses about the gender of entities mentioned in a corpus of free text. Running a simple but unreliable anaphora algorithm repeatedly over a large corpus makes it possible to compile the probable genders of referenced entities and give them a salience ranking. These statistics are an inexpensive way to add gender-feature information to a statistical anaphora resolution algorithm.
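The idea of compiling gender statistics from many unreliable pronoun resolutions can be illustrated with a minimal sketch. It is not the report's algorithm: the "most recent candidate in the same sentence" resolver, the add-one smoothed score used for ranking, and the names collect_gender_counts and rank_guesses are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Pronoun-to-gender table (a small illustrative subset).
PRONOUN_GENDER = {
    "he": "masculine", "him": "masculine", "his": "masculine",
    "she": "feminine", "her": "feminine", "hers": "feminine",
    "it": "neuter", "its": "neuter",
}


def collect_gender_counts(sentences, candidates):
    """Tally pronoun genders per entity using a deliberately naive resolver.

    sentences:  list of token lists
    candidates: set of lowercased words treated as entity mentions
    Assumption: each pronoun refers to the most recent candidate
    mentioned earlier in the same sentence.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        last_entity = None
        for token in tokens:
            word = token.lower()
            if word in candidates:
                last_entity = word
            elif word in PRONOUN_GENDER and last_entity is not None:
                counts[last_entity][PRONOUN_GENDER[word]] += 1
    return counts


def rank_guesses(counts):
    """Return (entity, guessed_gender, score) tuples, best score first.

    Assumption: the score is the add-one smoothed share of the majority
    gender over three genders, so entities with more consistent evidence
    rank higher. This stands in for the report's salience ranking.
    """
    ranked = []
    for entity, tally in counts.items():
        gender, hits = tally.most_common(1)[0]
        total = sum(tally.values())
        score = (hits + 1) / (total + 3)
        ranked.append((entity, gender, score))
    # Sort by descending score, breaking ties alphabetically by entity.
    ranked.sort(key=lambda item: (-item[2], item[0]))
    return ranked


if __name__ == "__main__":
    corpus = [
        ["Mary", "said", "she", "would", "come"],
        ["Mary", "lost", "her", "keys"],
        ["John", "said", "he", "was", "late"],
        ["Mary", "thinks", "he", "is", "wrong"],  # resolver errs here
        ["Mary", "knows", "she", "is", "right"],
    ]
    for entity, gender, score in rank_guesses(
        collect_gender_counts(corpus, {"mary", "john"})
    ):
        print(f"{entity}: {gender} ({score:.2f})")
```

In the toy corpus the resolver wrongly attaches one "he" to "Mary", yet the aggregate count still favors the feminine guess, which is the sense in which repeated unreliable resolutions can yield a usable statistic.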
(complete text in pdf or gzipped postscript)