Abstract
Many problems in information retrieval and
information filtering involve data that can be represented in the form of
a sparse matrix with binary values or frequency counts. This includes
document-term frequencies, user ratings on a set of items, and
adjacency matrices encoding the hyperlink graph or citation structure
in document repositories. There are a number of
generic questions that typically occur in this context. Most
prominently, one would like to overcome the sparseness problem, i.e.,
reliably estimate probabilities for unobserved or rare events. In
addition, deriving low-dimensional data representations and
identifying latent factors are often of considerable interest,
both as a preprocessing step for downstream tasks and for
visualization. This talk will introduce and discuss methods for matrix
decomposition and dimension reduction that address these questions.
Several example applications from information
retrieval will be used to illustrate the usefulness of this class of
methods and the effectiveness of decomposition
techniques. These applications will include (i) estimating document-specific
language models in ad hoc retrieval, (ii) deriving topic-centered
document representations for document categorization, (iii)
decomposing user preferences for collaborative filtering, and (iv)
learning stochastic models for hyperlink and paper citation graphs.
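
As a rough illustration of the kind of matrix decomposition discussed above, the following minimal sketch factorizes a small document-term count matrix and keeps only the leading factors. The abstract does not name a specific method, so truncated SVD is assumed here purely as one generic choice; the toy matrix `counts`, the helper `truncated_svd`, and the choice `rank = 2` are all hypothetical.

```python
import numpy as np

# Hypothetical toy document-term count matrix
# (rows: documents, columns: terms). Most entries are zero.
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
    [1, 0, 0, 1],
], dtype=float)


def truncated_svd(matrix, rank):
    """Return the leading `rank` singular triplets of `matrix`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank, :]


rank = 2  # assumed number of latent factors
u_k, s_k, vt_k = truncated_svd(counts, rank)

# Low-dimensional document representations: one row per document,
# one column per latent factor.
doc_embeddings = u_k * s_k

# Rank-2 approximation of the original matrix. Entries that were
# zero in `counts` generally become non-zero here, which is the
# smoothing effect that helps with sparseness.
low_rank = doc_embeddings @ vt_k

print(doc_embeddings.shape)
print(np.round(low_rank, 2))
```

The rows of `doc_embeddings` can be used as compact features for categorization or plotted directly for visualization. Note that plain SVD can produce negative entries, so its output is not a probability estimate; a probabilistic decomposition would be needed for that purpose.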