October 31, 2008

Primary sources

For most people trying to sell text analysis to marketers, “social media” usually means two things: blogs and Twitter.  Why those two, out of all the “social” text on the internet?  Let’s go through all the possibilities.

- MySpace:

advantages: target audience, public profiles

disadvantages: mostly spam, false metadata (“101 years old”), many profiles not crawlable

- Usenet:

advantages: easily obtainable from your local news server, decent metadata

disadvantages: not 1994 any more

- Discussion Boards / Forums:

advantages: lots of diverse topics, relatively popular, target audience segments

disadvantages: need to crawl the net, sparse metadata, messy

- Review Sites:

advantages: the whole site is dedicated to user opinion!

disadvantages: limited to cars and digital cameras

- Facebook:

advantages: target audience, tons of metadata

disadvantages: not public, can’t scrape

- Rest of the Web:

advantages: heaps of data

disadvantages: trying to make sense of anything is impossible

Any sane person trying to start a text analytics company would of course opt for blogs and Twitter. Why not? Blogs are nice and wordy, easily indexed, have timestamps on the posts, and there’s a graph structure to examine if you get bored. Twitter is a hot name, has an API, and also has some sort of graph structure. It’s not too hard to organize this text data, stick it in a data warehouse or S3/EC2, and start throwing SVDs and CRFs at that bitch.
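For the curious, here is a rough sketch of what "throwing an SVD" at a pile of posts might look like. Everything in it is an assumption for illustration: the four toy posts are made up, the helper names `term_doc_matrix` and `top_singular_triplet` are hypothetical, and it uses plain power iteration to pull out only the largest singular value and its vectors rather than a full SVD. A real pipeline would almost certainly call a library routine instead.

```python
import math
import re
from collections import Counter

# Toy posts invented for this sketch; not real data.
POSTS = [
    "love my new camera great photos",
    "camera photos look great",
    "car engine broke again hate this car",
    "hate the car dealer engine trouble",
]


def term_doc_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in doc j."""
    counts = [Counter(re.findall(r"[a-z]+", d.lower())) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for c in counts] for t in vocab]


def mat_vec(a, v):
    """Compute A v for a list-of-rows matrix."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def mat_t_vec(a, u):
    """Compute A^T u without building the transpose."""
    cols = len(a[0])
    return [sum(a[i][j] * u[i] for i in range(len(a))) for j in range(cols)]


def top_singular_triplet(a, iters=200):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    cols = len(a[0])
    v = [1.0 / math.sqrt(cols)] * cols        # unit-length starting guess
    for _ in range(iters):
        w = mat_t_vec(a, mat_vec(a, v))       # one multiplication by A^T A
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = mat_vec(a, v)
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]               # term weights for the top "topic"
    return sigma, u, v


if __name__ == "__main__":
    vocab, a = term_doc_matrix(POSTS)
    sigma, u, v = top_singular_triplet(a)
    ranked = sorted(zip(vocab, u), key=lambda p: -p[1])
    print("top singular value:", round(sigma, 3))
    print("heaviest terms:", [t for t, _ in ranked[:3]])
```

The term weights in `u` and the document weights in `v` give one crude "topic": the posts and words that co-occur most strongly. That is the easy part; the catch is whether the posts you fed in represent anyone worth listening to.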

The only problem is that very few people use Twitter, and almost no one blogs. For example, the last estimate I can find _anywhere_ on the web is this article from Business Week in April 2007, which pegs the number at 495,000 English-language blog posts per day, on a slow decline from six months earlier. I have no idea what it is now, but the lack of statistics from people who would benefit from growth in blogging (e.g. Technorati) tells me that blogging can’t be in a good way. Plus, a lot of blog posts are spam, and those can be hard to detect.

Twitter is in better shape, but here we have a problem with coverage. The people who use Twitter are a heavily self-selected group, limited mostly to the social media experts themselves. If I mined sentiment from Twitter messages, I could conclude almost nothing about how my brand or product was perceived by normal people. (This may change, of course, if Twitter goes mainstream.)

—to be continued—


The opinions expressed on this site are mine and do not necessarily represent those of my employer, Facebook. You won’t find any confidential company information here, and while you’re welcome to get in touch with me, I’m afraid I can’t put you in contact with my employer.