The growing popularity of social media has had an interesting side effect for language researchers: services such as Twitter have led people to hold instant-messenger-style conversations in a public medium, where anyone can observe them. This creates a unique opportunity to collect, study, and model large-scale conversation data. We present a method for mining conversations from Twitter's public feed. The resulting conversation corpus, which will be made publicly available, contains more than 1.3 million conversations, 75 thousand of which have more than 5 turns, providing a rich resource for the study of both Twitter and internet chat. Furthermore, we present several methods that attempt to model the flow of conversation by discovering latent classes over Tweets. We show that a repurposed content model (Barzilay and Lee 2004) can discover meaningful dialogue acts, such as "question" and "comment", which indicate not only the role a Tweet plays in its conversation, but also the sorts of Tweets that are likely to follow. We improve and extend this model with a Bayesian sampling-based approach, which allows us to model a conversation's topic and to introduce sparse priors during learning.