Primary sources
For most people trying to sell text analysis to marketers, “social media” means two things: blogs and Twitter. Why those two, out of all the “social” text on the internet? Let’s go through all the possibilities.
- MySpace:
advantages: target audience, public profiles
disadvantages: mostly spam, false metadata (“101 years old”), many profiles not crawlable
- Usenet:
advantages: easily obtainable from your local news server, decent metadata
disadvantages: not 1994 any more
- Discussion Boards / Forums:
advantages: lots of diverse topics, relatively popular, target audience segments
disadvantages: need to crawl the net, sparse metadata, messy
- Review Sites:
advantages: the whole site is dedicated to user opinion!
disadvantages: limited to cars and digital cameras
- Facebook:
advantages: target audience, tons of metadata
disadvantages: not public, can’t scrape
- Rest of the Web:
advantages: heaps of data
disadvantages: trying to make sense of anything is impossible
Any sane person trying to start a text analytics company would of course opt for blogs and Twitter. Why not? Blogs are nice and wordy, easily indexed, have timestamps on the posts, and there’s a graph structure to examine if you get bored. Twitter is a hot name, has an API, and also has some sort of graph structure. It’s not too hard to organize this text data, stick it in a data warehouse or S3/EC2, and start throwing SVDs and CRFs at that bitch.
The only problem is that very few people use Twitter, and almost no one blogs. For example, the last estimate I can find _anywhere_ on the web is this article from Business Week in April 2007, which pegs the volume at 495,000 English blog posts per day, on a slow decline from six months earlier. I have no idea what it is now, but the lack of statistics from people who would benefit from growth in blogging (e.g. Technorati) tells me that blogging can’t be in a good way. Plus, a lot of blog posts are spam, and those can be hard to detect.
Twitter is in better shape, but here we have a problem with coverage. The group of people who use Twitter is very self-selecting, limited mostly to the social media experts themselves. If I mined sentiment from Twitter messages, I could conclude almost nothing about how my brand or product was perceived by normal people. (This may change, of course, if Twitter goes mainstream.)
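To put the coverage problem in numbers, here is a rough sketch in Python. Every figure in it is made up purely for illustration (the group sizes, the share of each group that likes the product, and the helper name `positive_share` are all my assumptions, not measurements). It only shows how far an estimate drawn from a small self-selected group can sit from the value for the whole market.

```python
# Hypothetical numbers, chosen only to illustrate the coverage problem.
def positive_share(groups):
    """Weighted share of people holding a positive opinion.

    groups: list of (group_size, fraction_positive) pairs.
    """
    total = sum(size for size, _ in groups)
    positive = sum(size * frac for size, frac in groups)
    return positive / total

# Made-up population: a small group of Twitter users and everyone else.
twitter_users = (1_000, 0.80)    # assumed: 1,000 people, 80% like the product
everyone_else = (99_000, 0.30)   # assumed: 99,000 people, 30% like the product

# What you would measure by mining Twitter alone.
twitter_only = positive_share([twitter_users])

# What you actually wanted to know about the whole market.
whole_market = positive_share([twitter_users, everyone_else])

print(f"Twitter-only estimate: {twitter_only:.1%}")
print(f"Whole-market value:    {whole_market:.1%}")
```

With these invented inputs the Twitter-only figure lands far above the whole-market one, and no amount of clever modeling on the Twitter text closes that gap, because the people who never tweet simply aren’t in the data.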
—to be continued—