20th February 2006 — Hanna Wallach, LDA

Hanna Wallach is visiting to discuss some of her recent work:

Topic Modeling: Beyond Bag-of-Words

Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the “bag-of-words” assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables, by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm.
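To make the combination concrete, here is a minimal toy sketch of the generative process in a bigram topic model: unlike a unigram topic model, the distribution over the next word is conditioned on both the token's latent topic and the previous word. The vocabulary, sizes, and flat Dirichlet priors below are illustrative assumptions; the model in the talk uses hierarchical Dirichlet priors and infers hyperparameters with Gibbs EM, neither of which is implemented here.

```python
import random

random.seed(0)

# Toy vocabulary and model sizes (hypothetical, for illustration only).
vocab = ["the", "model", "topic", "word", "data"]
n_topics = 2
V = len(vocab)

def sample_categorical(probs):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    r = random.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def random_dirichlet(k, alpha=1.0):
    """Sample a k-dimensional probability vector from a symmetric Dirichlet."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(draws)
    return [d / s for d in draws]

# theta: the document's topic mixture.
# phi[z][prev]: distribution over the NEXT word given topic z and previous
# word prev -- this joint conditioning is what distinguishes the bigram topic
# model from a bag-of-words (unigram) topic model, where phi depends on z alone.
theta = random_dirichlet(n_topics)
phi = [[random_dirichlet(V) for _ in range(V)] for _ in range(n_topics)]

def generate(n_words, start=0):
    """Generate tokens: draw a topic per token, then a word given (topic, prev)."""
    words, prev = [], start
    for _ in range(n_words):
        z = sample_categorical(theta)
        w = sample_categorical(phi[z][prev])
        words.append(vocab[w])
        prev = w
    return words

doc = generate(10)
print(doc)
```

Setting every `phi[z][prev]` equal for all `prev` would recover a unigram topic model, which is one way to see the bigram version as a strict generalization.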

On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than topics discovered using unigram statistics, potentially making them more meaningful.