GATSBY COMPUTATIONAL NEUROSCIENCE UNIT

A likelihood-based framework for the analysis of discussion threads

Vicenc Gomez

SNN Adaptive Intelligence, Radboud University

On-line discussion threads are conversational cascades of posted messages, found in blogs, news aggregators and bulletin board systems. Unlike other types of information cascades, such as massively circulated chain letters, cascades on Twitter, or the diffusion of pages on Facebook, discussion threads involve a more elaborate interaction between users. At the same time, since threaded discussions correspond directly to the information flow in a social system, understanding their governing mechanisms and patterns plays a fundamental role in contexts such as the spread of technological innovations, the diffusion of news and opinion, and viral marketing.

I will talk about a likelihood-based framework we recently proposed to analyze the structure and evolution of on-line discussion threads. We compare several parametric generative models that combine three basic features: popularity, novelty, and a tendency to reply to the thread originator (an initial popularity bias). We show that a model combining these three ingredients captures many of the statistical properties of threads (widths, degrees, depths), as well as their evolution, with a surprising level of accuracy on four popular and diverse websites: Slashdot, Barrapunto, Meneame (Digg) and Wikipedia.
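To make the three ingredients concrete, here is a minimal sketch of how such a generative model could grow a thread one comment at a time. The abstract does not state the functional form, so the additive weight used below (alpha times the number of replies, plus beta for the originator, plus tau raised to the age of the message) is an assumption for illustration only. The names attractiveness, grow_thread, alpha, beta and tau are hypothetical.

```python
import random

def attractiveness(degrees, t, alpha, beta, tau):
    """Unnormalised weight of each existing node as a reply target at step t.

    Assumed form (not taken from the abstract):
      popularity: alpha * number of replies the node already has
      root bias:  beta, added only for the thread originator (node 0)
      novelty:    tau ** (t - k), with 0 < tau < 1, so recent nodes weigh more
    """
    weights = []
    for k, deg in enumerate(degrees):
        popularity = alpha * deg
        root_bias = beta if k == 0 else 0.0
        novelty = tau ** (t - k)
        weights.append(popularity + root_bias + novelty)
    return weights

def grow_thread(n_comments, alpha, beta, tau, seed=0):
    """Return a parent list: parents[t] is the node that comment t replies to."""
    rng = random.Random(seed)
    parents = [None]   # node 0 is the thread originator and has no parent
    degrees = [0]      # number of replies received by each node so far
    for t in range(1, n_comments + 1):
        weights = attractiveness(degrees, t, alpha, beta, tau)
        # Sample the reply target with probability proportional to its weight.
        parent = rng.choices(range(t), weights=weights, k=1)[0]
        parents.append(parent)
        degrees[parent] += 1
        degrees.append(0)
    return parents

if __name__ == "__main__":
    print(grow_thread(10, alpha=0.5, beta=2.0, tau=0.8))
```

In this sketch a large beta produces wide, shallow threads in which most comments answer the originator, while a large alpha concentrates replies on comments that are already popular.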

Our approach to estimating the model parameters is based on the likelihood of the entire evolution of every thread in a given dataset. This prevents overfitting to particular quantities such as degree distributions, and yields the model that best fits the data globally. We show empirically that the estimation procedure is unbiased and does not require very large datasets. Further, the parameter estimates have descriptive power, since they allow us to characterize the habits and communication patterns of a given dataset in terms of the aforementioned features.
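The idea of scoring the whole evolution of a thread can be sketched as follows: each reply contributes the log-probability of the parent it actually chose, given the thread as it stood at that moment, and these terms are summed over all replies and all threads. The weight formula is the same assumed additive form (popularity, originator bias, novelty) and is not taken from the abstract. The grid search stands in for whatever optimizer is actually used, and thread_log_likelihood and fit_by_grid are hypothetical names.

```python
import math
from itertools import product

def thread_log_likelihood(parents, alpha, beta, tau):
    """Log-likelihood of one thread, given as a parent list with parents[0] = None."""
    degrees = [0]  # replies received so far by each existing node
    ll = 0.0
    for t in range(1, len(parents)):
        # Assumed attractiveness of every node that exists when comment t arrives.
        weights = [
            alpha * degrees[k] + (beta if k == 0 else 0.0) + tau ** (t - k)
            for k in range(t)
        ]
        # Probability of the parent that was actually chosen at this step.
        ll += math.log(weights[parents[t]] / sum(weights))
        degrees[parents[t]] += 1
        degrees.append(0)
    return ll

def fit_by_grid(threads, alphas, betas, taus):
    """Return (best total log-likelihood, (alpha, beta, tau)) over a parameter grid."""
    best = None
    for a, b, tau in product(alphas, betas, taus):
        total = sum(thread_log_likelihood(p, a, b, tau) for p in threads)
        if best is None or total > best[0]:
            best = (total, (a, b, tau))
    return best

if __name__ == "__main__":
    star = [None, 0, 0, 0, 0]    # every comment replies to the originator
    chain = [None, 0, 1, 2, 3]   # every comment replies to the previous one
    print(fit_by_grid([star, chain], [0.0, 1.0], [0.0, 1.0, 5.0], [0.2, 0.5, 0.9]))
```

Because every reply event enters the objective, no single summary statistic such as the degree distribution is singled out for fitting, which is the property the paragraph above describes.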