I have been waiting for the KDD conference to come to California, and I was ecstatic to see it held in San Diego this year. AdMeld did an awesome job displaying KDD ads on the sites that I visit, sometimes multiple times per page. That’s good targeting!
Mining and Learning on Graphs Workshop 2011
I had originally planned to attend the 2-day workshop Mining and Learning with Graphs (MLG2011) but I forgot that it started on Saturday and I arrived on Sunday. I attended part of MLG2011 but it was difficult to pay attention considering it was my first time waking up at 7am in a long time. The first talk I arrived for was Networks Spill the Beans by Lada Adamic from the University of Michigan. Adamic’s presented work involved inferring properties of content (the “what”) using network structure alone (using only the “who”: who shares with whom). One example she presented involved questions and answers on a Java programming language forum. The research problem was to determine things such as who is most likely to answer a Java beginner’s question: a guru, or a slightly more experienced user? Another research question asked what dynamic interactions tell us about information flow. For this example, Adamic used data from the virtual world SecondLife. Certain landmarks (such as a bench) can be bookmarked by users and certain gestures (like a kiss) can be studied. This made my ears rise. SecondLife is a treasure trove of cool data. Is there a way to access it? It looks there might be a way to access some of it including monetary valuation, market purchases, and several APIs for different aspects of SecondLife. I will have to look into that later though. Adamic concluded with a discussion of Twitter as a social network, but I was starting to fall asleep from my hectic and early morning. The gist of her talk, and many other talks in this field, was to combine semantic variables (NLP) with topological variables (SNA) to predict som other semantic variables. This talk was very digestible, and very interesting (despite my lack of sleep), but featured some of the worst visualizations I have ever seen (area plots representing correlations across multiple levels of an ordinal variable), but that was minor. Of course, Nathan might disagree ;).
Social network analysis, and network analysis in general, is a field that I really want to sink my teeth into. The difficulty I have is that the discussion of this field seems to involve so much vernacular that is specific to the field that everything seems so much more difficult than it really is.
At this point I took off to lunch. Just across from the Hyatt (a beautiful hotel by the way) is Seaport Village, a beautiful waterfront park containing nice landscaping, shops, restaurants, all with the ocean in the background. There is no beach there — the village backs right up to the water. Across the bay is some type of military complex and Coronado Island. I had a $7 hot dog, followed by a chocolate-covered strawberry and a peanut butter cup from the nearby candy store. It was such a nice day I walked around for a while, grabbed a strawberry shake and then headed back for the next session… the one I had been waiting for!
Afternoon Tutorial: Probabilistic Topic Models
David Blei, Princeton
My dissertation topic is related to Latent Dirichlet allocation (well, topic modeling in general), so I was definitely interested to hear what the father of LDA had to say. Since this was a 3 hour tutorial, I was expecting that Blei would start with the unigram model, and then discuss Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Indexing (pLSI) building up to LDA. Instead, Blei started with LDA and for good reason! In this post, I will not summarize the mechanics of Latent Dirichlet Allocation as that is another post entirely. For some introduction, see here. LDA and its extensions can be used to model the evolution of topics over time, to model the connections among topics, and to predict links among objects in a network. Topic modeling is a case study in machine learning rather than a field in itself; topic modeling draws on several different concepts including Bayesian statistics, time series analysis, hierarchical models, Markov chain monte carlo (MCMC), Bayesian non-parametric statistics and sparsity. In LDA, a document is represented as a mixture of topics (some hypothetical quantity that captures content clustering), and a topic is a distribution over words in a vocabulary.
Again, this is a high-level description of what was discussed. A full mechanical analysis would require dozens of pages. LDA is just a probabilistic model. As such, there are established ways for estimating the parameters of the model as well as the topic assignments. Some of these include mean field variational methods, expectation propagation, Gibbs sampling, collapsed Gibbs sampling, collapsed variational Bayes and online variational Bayes. Each of these estimation methods has its own advantages and disadvantages. Blei showed the LDA and pLSI have a lot in common. Unlike LDA, pLSI uses maximum likelihood estimations (and the EM algorithm) for parameter estimation; pLSI tends to overfit badly. The hyperparameter α adds regularization to the ϴ parameter in the LDA model. [Sorry to refer to these random parameters, but it is difficult to describe without them. See the links mentioned earlier for an overview of LDA.]
Preprocessing. A lot of preprocessing must be performed before computing a topic model. First, we should remove stopwords, which are words that provide absolutely no clues to the content of the text. If we leave stopwords in the corpus when computing the model, we may end up with meaningless topics that are described with only stopwords, due to their high probability. Second, Blei mentioned that stemming is a good idea, but modern stemming algorithms tend to be too aggressive. If resources allow, I think it would be useful to have humans manually strip words to their root words. Multiword phrases such as “black hole” are also an issue. With sufficient resources, one could ask human labelers to identify these phrases and recode them as a single word by replacing the space between words with an underscore. Hanna Wallach (U. Mass) has a paper that describes how to identify multiwork phrases by using n-grams. Blei has a similar paper that discusses an algorithm called TurboTopics. He also mentioned that a standard statistical hypothesis test such as chi-squared, permutation tests, or a nested hypothesis test would also be sufficient, though inefficient. I have not thought of how this would work however. Finally, remove rare words because they can lead to local optima in the likelihood surface probably yielding inefficient computation.
Some hairy details. One of the parameters that makes LDA useful is α. α is a hyperparameter in the LDA model that determines the sparsity of draws from the underlying Dirichlet distribution. α is typically a small number; Blei mentioned that 0.01 is a good a priori value for α. As α gets larger, the distribution of topics tends towards the uniform (each topic equally likely) distribution and as α approaches 0, we get sparser draws, meaning more peaked topic probabilities. Setting α to be ridiculously small (i.e. 0.001) may yield a single topic dominating the model. α can be chosen, or we can fit α to the data using cross-validation or some other method. He also discussed the parameter η.
Open source software. We quickly (flash of an eye) went through a list of some open-source LDA implementations:
- LDA-C (variational EM), Blei.
- HDP (hierarchical Dirichlet processes), Blei
- LDA (R package, collapsed Gibbs), Jonathan Chang, Data Scientist, Facebook.
- Lingpipe, alias-i
- Mallet (collapsed Gibbs), UMass
- Gensim (online and batch LDA), Radim Řehůřek
To my delight, Blei seemed to favor the R package (although Gensim is a nice Python implementation). The R package not only contains LDA, but several other models including RTMs, MMSB and sLDA which will be discussed later. It is supposedly fast as well. The output from the R package can be visualized using the Topic Model Visualizer by Allison Chaney.
The beauty of LDA is that it can be embedded in many more complicated models. Some applications of these extensions include word sense, graphs and hierarchies. Before delving into specifics, there are a couple of changes to the LDA model that motivate the next topics.
- The probability of observing word w given a set of topics β and a set of topic labels z is given by P(w|β,z) which is multinomial. The distribution of P(w|ß,z) can be changed depending on what we are modeling. For example, for count data, P(w|β,z) can be Poisson. This drastically changes the model, however. In LDA, P(w|ß,z) is multinomial which is convenient because it is the conjugate prior of the Dirichlet distribution.
- The characteristic LDA posterior distribution can be used in more creative ways…
Correlated Topic Model. In LDA, all topics are considered independent of each other, and this is usually unrealistic. CTM allows the topics to be correlated. For example, a paper classified as about calculus is more likely to also be classified as about physics, than it is to be classified as about sewing. Blei mentioned that CTM allows for better prediction, likely because it is more realistic. CTM is also more robust to overfitting. The main distinction from LDA is that ϴ follows the logistic normal distribution instead of the Dirichlet distribution.
Dynamic Topic Model. DTM models how each individual topic changes over time. One example Blei showed involved a topic that could be labeled “technology”. In the late 1700s, this topic contained the words “coal”, “steel” (I am making it up from memory…probably badly…bear with me) and in 2011 contained the words “silicon” and “solar”. The main distinction from LDA is two-fold: assuming the topic at time t is normally distributed with the topic at time t-1 as the mean and some variance. That is,
instead of multinomial.
A limitation of DTM is that it does not handle the death of a topic gracefully.
Supervised LDA. In sLDA, we associate each document with an external variable. For example, a document may be a Yelp review containing text. The external variable associated with the Yelp review may be the number of stars in the associated rating. We can use sLDA to use the topics estimated by LDA as regressors to predict this external variable Y. Various types of regression can be performed from standard linear regression to the generalized linear model (GLM).The Yelp example would likely use an ordered logit model for Y.
Relational Topic Models. RTM applies sLDA to every pair of documents in a corpus and attempts to use content to predict connectedness in a graph. For example, given the content on my Facebook profile, one could use sLDA to predict what kind of reaction I would have to an ad (i.e. click or no click) and this could be used for targeted ad serving, or any other type of recommendation engine. Think collaborative filtering! RTM is also good for certains types of data that have spatial/geographic dependencies.
Ideal Point Topic Models were barely touched upon since we were running short on time (although we voted to extend the session by 30 mins and Blei happily obliged). They seem particularly useful in political science for predicting roll call votes.
Bayesian Non-Parametric Models are a hot topic but are too complicated to describe here. In LDA, the number of topics is determined a priori and remains fixed throughout the model. In real life, topics can be “born” and can “die” off and we may not know a priori how many topics to use. One can model the latter situation as a Chinese Restaurant Process where each table is associated with a topic. Furthermore, a Chinese Restaurant Franchise can be used for modeling hierarchies (hLDA). In CRF, there is a corpus level restaurant where each table is a parameter and a topic (called plates). Then, each document has its own Chinese restaurant where each table is associated with a customer in the corpus level Chinese restaurant. Blei recommended a book by Hjort.
Algorithms. The last few minutes were dedicated to discussing inference algorithms for LDA, particularly Gibbs sampling and variational Bayes. Gibbs sampling is very simple to implement, though Blei stated that it does not work for DTM or CTM because the assumptions of conjugacy (multinomial/Dirichlet) are violated. Variational Bayes is more difficult to implement, but handles non-conjugacy in CTM and DTM much better.
The plenary sessions consisted of several thank-yous and awards. The committee provided some humor which gave some humility to the long process of writing and submitting papers. They went over paper acceptance statistics and read some of the funnier comments that reviewers gave, one of which was something like “It is clear that the author did not read this paper before submitting it.” I don’t know how many times I have said that in various situations. The committee handed out awards for best paper and best dissertation. This year’s KDD Cup competition was a contest similar to the Netflix challenge, but involved music recommendation. The winner was the National Taiwan University, for the fourth straight year in a row I am told. The innovation award went to a researcher dear to my heart, Ross Quinlan, who developed the C4.5 decision tree modeling software.