October 22, 2008

Predicting polls with Lexicon

With Facebook Lexicon we’ve been able to aggregate lots of public and semi-public conversations taking place between lots of different types of people in the US.  Several gigabytes of raw text go through the Lexicon system every day.  It’s a lot of stuff to churn through, and we couldn’t do it without our trusty Hadoop cluster.

Back around February I started to get interested in sentiment analysis.  There’s been a lot of cool work around this problem since Pang & Lee’s seminal 2002 paper.  When I started to dig, it quickly became apparent that our linguistic challenges would be around misspellings (“luvvv”), punctuation/tokenization (“saw your status!!!! awesome”) and word-sense ambiguity (“that movie is the shit”).  Unfortunately, not many of the techniques that grew out of the academic research seemed applicable to the Facebook Wall corpus.  Irony and co-reference aren’t necessarily big problems when you are dealing with short messages between friends.  In fact, I would love to see more academic work on language in social media…check out YouTube comments for a good publicly available data source.

We developed a corpus of 5000 tagged posts labeled positive, negative or neutral about certain objects.  We then started generating synonyms for sentiment words by comparing every word to every other word in a single day of data, ranking pairs by the similarity of their immediately neighboring words.  So in the corpus (“I love you”, “I hate you”), “love”:“hate” would score a 1 on the (“I”, “you”) dimension (I got this idea from Dan Yarlett).  The computation is enormous: it took 12 hours on our 80-node cluster and produced 10 terabytes of intermediate map data.  I later tried comparing all words and all bigrams, which killed the cluster completely.  But when it was all said and done, we had a nice map of seed words (e.g. “ugly”) to candidate synonyms:

ugly:fugly      0.000688094695339
ugly:skanky     0.000370198809783
ugly:obese      0.000362199539019
ugly:beautifull 0.000310669352933
ugly:sexi       0.00030694876188

etc.  As you can see, there are a lot of antonyms in there too.  The solution was to cut off the list of candidate words for each seed word, dump them in a database, and go through and tag the words manually.  It took a few hours but we generated a really good dictionary to work with.  Had we done things “the right way” and set about extracting the sentiment terms from a labeled corpus, we would have needed a ton more hand-labeled data.  That costs time and money; far better to bootstrap and do things quick, dirty and wrong.
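The neighbor-word comparison described above can be sketched in a few lines of Python.  This is a toy, single-machine version and not the actual Hadoop job: the `context_similarity` name, the `<s>`/`</s>` boundary markers and the raw shared-context count are my assumptions for illustration (the real scores were clearly normalized somehow, and that normalization isn’t shown here).

```python
from collections import defaultdict
from itertools import combinations

def context_similarity(sentences):
    """Score word pairs by how many (left, right) neighbor contexts they share.

    Hypothetical helper for illustration; whitespace tokenization only.
    """
    contexts = defaultdict(set)  # word -> set of (left neighbor, right neighbor)
    for sentence in sentences:
        # Pad with boundary markers so the first and last words get a context too.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
            contexts[word].add((left, right))

    scores = {}
    # Every word against every other word: quadratic in vocabulary size.
    for a, b in combinations(sorted(contexts), 2):
        shared = contexts[a] & contexts[b]
        if shared:
            scores[(a, b)] = len(shared)
    return scores

scores = context_similarity(["I love you", "I hate you"])
print(scores)  # {('hate', 'love'): 1}
```

The all-pairs loop is what makes this so expensive at scale, and the toy output shows why antonyms come along for the ride: “love” and “hate” sit in exactly the same slot.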

The next step was to try out a bunch of models on our 5000 posts.  I tried different approaches, ranging from counting the total number of sentiment words in the post to looking for proximate words to the topic we were interested in.  Certainly no fancy SVMs or what-have-you.  Well, guess what.  We got > 80% precision with some extremely simple tokenization schemes, negation heuristics, and feature selection (throwing out words which were giving us a lot of false positives).  Sure, our recall sucked, but who cares…we have tons of data!  Want greater accuracy?  Just suck in more posts!
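To give a feel for how simple this kind of model can be, here is a minimal sketch of a dictionary-based scorer with a one-word negation heuristic.  The word lists, the `tokenize` and `classify` names, and the letter-squeezing rule are all made up for illustration; they are not the dictionary or the exact heuristics we used.

```python
import re

# Hypothetical hand-tagged dictionary; the real one is much larger.
POSITIVE = {"love", "luv", "awesome", "beautiful"}
NEGATIVE = {"hate", "ugly", "fugly"}
NEGATORS = {"not", "no", "never", "dont", "isnt"}

def tokenize(text):
    """Lowercase, drop apostrophes, squeeze runs of 3+ letters, keep letters only."""
    text = text.lower().replace("'", "").replace("’", "")
    text = re.sub(r"([a-z])\1{2,}", r"\1", text)  # "luvvv" -> "luv"
    return re.findall(r"[a-z]+", text)            # punctuation like "!!!!" is dropped

def classify(text):
    """Sum sentiment-word polarities, flipping a word preceded by a negator."""
    tokens = tokenize(text)
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("luvvv it, saw your status!!!! awesome"))  # positive
print(classify("I don't love this"))                      # negative
```

Throwing out a word that causes too many false positives is then just a matter of deleting it from the dictionary, which trades recall for precision.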

So, what good is this low-recall, overfit, completely hacky model of sentiment?  It turns out it predicts and responds to all kinds of interesting things in the world.  I thought it would be interesting to look at how it predicts election polls, for instance.  One would expect that the sentiment expressed in chatter with friends would be a good leading indicator of poll results.  We compared the daily national tracking poll and favorability poll from Rasmussen (Nate Silver from fivethirtyeight.com said that Rasmussen was the poll he’d want with him on a desert island) with the 14-day rolling averages of Lexicon Sentiment.  The correlation was pretty good when you lined them up…but Lexicon was even more predictive a few days before the poll came out, while totally unpredictive a few days later.  That supported the hypothesis that polls respond to sentiment, and not vice versa.

Then we compared how well the poll predicted its own future values with how well Lexicon predicted them.  The poll was more correlated with itself within a few days - not surprising given that there is an actual overlap in the data within a 3-day rolling window - but Lexicon did better further back in time.  We saw a .41 correlation between Lexicon and the Rasmussen tracking poll a week later, compared to a .32 correlation between Rasmussen and itself a week later.  Kinda cool!
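For anyone who wants to try the same comparison on the CSV, the lead/lag analysis boils down to a rolling mean plus a Pearson correlation on shifted series.  Here is a rough sketch, assuming two equal-length daily series aligned by date; the function names and the toy numbers are mine, not the real data.

```python
from math import sqrt

def rolling_mean(values, window):
    """Trailing mean over `window` points; the first window-1 days are dropped."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def lagged_correlation(predictor, target, lag):
    """Correlate predictor[t] with target[t + lag]; lag > 0 means predictor leads."""
    if lag >= 0:
        xs, ys = predictor[:len(predictor) - lag], target[lag:]
    else:
        xs, ys = predictor[-lag:], target[:len(target) + lag]
    n = min(len(xs), len(ys))
    return pearson(xs[:n], ys[:n])

# Toy data: the "poll" is just the "sentiment" series delayed by two days.
sentiment = [1, 3, 2, 5, 4, 6, 8, 7, 9, 10]
poll = [0, 0] + sentiment[:-2]
for lag in (0, 2, 7):
    print(lag, round(lagged_correlation(sentiment, poll, lag), 2))
```

Running `lagged_correlation(poll, poll, 7)` against `lagged_correlation(sentiment, poll, 7)` on real series is the poll-versus-itself comparison described above.  Keep in mind that rolling averages smooth both series, which tends to inflate correlations at short lags.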

Favorability was also really interesting.  For John McCain, Lexicon was really predictive of favorability ratings, even though his support is underrepresented amongst Facebook users (The awesome and underused Facebook Polls product showed him with 25-30% support a few weeks ago).  Seven days before the poll, Rasmussen had a correlation of .41 while Lexicon was at .61.  Interestingly, Lexicon had little predictive power over Obama’s favorability.  Perhaps the plugged-in Obama campaign workers were most active on Facebook during times when the race was tightening?  Interesting to think about.

Lexicon vs. Polls

It’s worth noting that these Sentiment scores are aggregated across all of Facebook, which has a strong bias: likely voters, international users, and underrepresented demographics are unaccounted for.  These are problems that will need to be ironed out in order to develop more accurate models.  Should you trust Lexicon Sentiment scores when gauging the direction of the Presidential election?  They are certainly different from polls, in that they sample a much larger population, measure favorability and sentiment implicitly, and don’t suffer from cognitive dissonance.  I hope those of you who are curious will play around with the data and share your findings!

You can download a CSV of the current Lexicon Sentiment data here.

The opinions expressed on this site are mine and do not necessarily represent those of my employer, Facebook. You won’t find any confidential company information here, and while you’re welcome to get in touch with me, I’m afraid I can’t put you in contact with my employer.