May 2, 2010

Review: Malcolm Gladwell’s “Outliers”

Malcolm Gladwell’s Outliers is a collection of anecdotes about successful human careers, ranging in domain from computer science to hockey to the law.  In these episodes he reveals hidden environmental precursors to eventual success or failure.  An example is that of hockey players in the elite Canadian junior leagues, who are far more likely to have been born in the first few months of the year than would be expected from a random selection from the population.  Gladwell suggests that the responsibility lies with the birth date cut-off of January 1st that determines team eligibility in the Canadian children’s leagues; older players in each cohort have had more time to develop physically than their autumn-born classmates, and their size gives them an advantage at try-outs.  This sets them on the road to athletic success from an early age with better competition, more ice time, and better coaching at each level.  By the time they hit 20 or 21 years of age, the gap in talent between a January-born and a December-born player of similar natural skill might be the difference between the Detroit Red Wings and the Pensacola Ice Flyers.

Similarly, Bill Gates was a brilliant person who was fortunate to attend one of the few schools in the country with mainframe computer access in 1968; the Mothers Club had put up the funding, and a connected parent at the school was able to get him access to UW machines when the funds ran out. By developing these skills at an early age, Bill Gates had a head start on the coming computer revolution, and was able to experiment and innovate faster than the IBM career-men of an older generation.  But without the stroke of luck provided to him by the Mothers Club in his environment, the book implies, the world might never have been graced with DOS.

The most coherent point I got from Outliers is “nurture matters.” Success occurs when smart and talented people take advantage of the opportunities given to them.  That point seems inarguable.  I do have an issue with Gladwell’s singular focus on the “freaky” attributes of the environments that nurtured the success of his characters.  I think it leads to an unhealthy understanding of causality and the nature of outliers.  A probabilist would say that all environmental factors predict success, but they do so with varying degrees of predictive power – computer access in school, for instance, might be a better predictor of success than eating Wheaties for breakfast – but so might be birth order, intelligent friends, and inspirational teachers.  By focusing the book on the Strange but True, he slights such factors that have greater predictive power.  This is important, since Outliers is overtly prescriptive…the jacket cover claims the book is “a blueprint for making the most of human potential.”  He sees something unexpected in P(born in January | hockey player), but this leads to incorrect value judgments about P(hockey player | born in January).  For example, P(hockey player | born in January) < P(hockey player | having a hockey coach for a dad), and P(hockey player | born in January) « P(hockey player | being really big and really fast).  The probability of being a Jeff Beukeboom is greater if you are 6’5” and 230 pounds than if you are born in March.
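To make the difference between the two conditionals concrete, here is a toy Bayes' rule calculation.  Every number in it is invented for illustration (none comes from the book or from real hockey data), and the names are mine; it is only a sketch of why a surprising P(born in January | hockey player) says little about P(hockey player | born in January).

```python
# Toy illustration of P(A | B) versus P(B | A) using Bayes' rule.
# All inputs are hypothetical placeholders, not real statistics.

def posterior(p_evidence_given_h, p_h, p_evidence):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    return p_evidence_given_h * p_h / p_evidence

p_january_given_elite = 0.14  # hypothetical: share of elite players born in January
p_january = 1 / 12            # assumption: births spread evenly across months
p_elite = 1e-4                # hypothetical base rate of becoming an elite player

p_elite_given_january = posterior(p_january_given_elite, p_elite, p_january)
lift = p_elite_given_january / p_elite

print(f"P(elite | January) = {p_elite_given_january:.6f}")
print(f"Lift over the base rate = {lift:.2f}x")
```

Under these made-up inputs, a January birthday multiplies the odds by well under 2x, and the resulting probability is still a tiny fraction of a percent.  The overrepresentation is real, but the base rate does most of the talking.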

The human eye/brain instinctively saccades to outliers, and its first question is usually “what are they doing there?” – but this comes from a misunderstanding of outliers and what to do with them.  There are two plausible explanations for the presence of outliers.  One is that your data is noisy; you were unable to capture the model accurately due to crappy measurement, so the outliers should be ignored.  The other explanation is that the outliers were generated from the same model that produced the rest of your data; God flipped The Big Nickel and it landed not on heads or tails but on its side…a freak occurrence, but surely one that has some probability, however infinitesimal.  And when She flips it a hundred billion times (the estimated number of humans who have ever lived), there are going to be some Bill Gateses and Barack Obamas and Bobby Orrs out there on the fringes of the distribution.  Certain aspects of one’s nature and environment can be strong predictors of the variables people commonly associate with success (power, wealth, fame), but it’s not really fruitful to focus on any one feature more than any other, unless you can say unequivocally “birth date relative to January 1st explains more variance in Canadian hockey players’ success than presence of favorable alleles responsible for building fast-twitch muscle fibers.”  Gladwell says of the hypothetical January-born hockey phenom, “he didn’t start out an outlier.  He started out just a little bit better.”  But the thought begs continuation; he started out just a little bit better than the tiny fraction of Canadian boys good enough to be at the tryout in the first place!  And what caused those boys to be there in the first place?  A whole mess of factors that go unmentioned and unexplored…but their potential contribution is no less important.
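A back-of-the-envelope sketch of the Big Nickel argument: if “success” were a single standard normal score (a big simplifying assumption of mine, not a claim about how success actually works), how many extreme scorers should we expect among a hundred billion draws from that one ordinary model?  The function name is hypothetical.

```python
import math

def expected_beyond(k_sigma, n_draws):
    """Expected number of draws above k_sigma standard deviations,
    assuming every draw comes from the same standard normal model."""
    upper_tail = 0.5 * math.erfc(k_sigma / math.sqrt(2))  # P(Z > k_sigma)
    return upper_tail * n_draws

HUMANS_EVER = 100e9  # the "hundred billion" estimate quoted above

for k in (4, 5, 6):
    print(f"beyond {k} sigma: about {expected_beyond(k, HUMANS_EVER):,.0f} people")
```

Even six standard deviations out, the same boring model still leaves on the order of a hundred people in the tail.  No special explanation is needed for their existence, only a large enough number of flips.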

Ed Ricketts ruminated on the pitfalls of such “teleological thinking” seventy years ago on a survey of intertidal invertebrates in Mexico, catalogued in Steinbeck’s The Log from the Sea of Cortez.  Take a box of matches, Ricketts says.  Why is the longest match longer than its fellows?  Because, the teleological thinker responds, the machine that created it was depressed slightly longer than usual when it was created, slicing it off at a point further than usual from the end of the match.  Why did that happen?  Because the air in the factory was slightly more humid that instant, affecting the spring mechanism in the machine.  Why was it more humid that instant?  Because a butterfly flapped its wings in Africa….and back it goes, to “causes” that individually are weaker and weaker predictors of match length…but collectively, they completely determined the outcome.

The neo-Rickettsian psycho-probabilist’s response to the question is that the longest match is the longest because it is the longest.  There is some distribution of match length out there, and you are staring agape at the outliers because the human eye/brain is naturally attracted to them.  No one gives a second thought to an average-length match, but it comes from the same mother generating-model as long ones, fat ones, short ones, blunt ones, double-headed freak ones… and there’s really nothing left to say, except to calculate the probability of generating a given length of a match, either globally or given specific information about air humidity, temperature, the age and consistency of the oil in the machine, the make and model of the butterfly…

The joint probability was determined, The Big Nickel was flipped, and one peculiarly long match was deposited in a box.  Q.E.D.

If you subscribe to this approach, then you are left with a bitter taste in your mouth as Gladwell leads people to focus on the “gotcha,” the factoid, instead of teaching people to think holistically about probability and causality.  People are the way they are because of the gajillion factors that led them there, in their DNA and in their environment.  Some flips of the coin are more important than others, but focusing too much on a single flip misses the forest for the trees.  Bill Gates is Bill Gates, and we should write his history and marvel at his achievements.  To create more like him, we should tackle the challenge of calculating marginal probabilities of all factors that predict the success metrics we collectively agree are worthy of pursuit, and focus our energies on the best predictors we are able to affect.  Unfortunately, we can only do this if people are taught how to approach these problems critically, and Malcolm Gladwell does his readers no favor in this respect.  This ( http://www.ted.com/talks/arthur_benjamin_s_formula_for_changing_math_education.html ) is where I would start.

April 2, 2010

The CDC commits a data visualization atrocity.  The content of the image suggests that 1 out of every 13 people has HIV.  This is more like 1 out of every 300 in reality, according to the CDC’s own website ( http://www.cdc.gov/hiv/topics/surveillance/basic.htm#hivest )

May 11, 2009

Can Lexicon predict unemployment trends?

Haven’t dug too deep into this dataset but I thought these were interesting.  Looks like Lexicon begins to “overshoot” in late January…perhaps because the phrase “laid off” refers not only to new layoffs, but also layoffs that happened in the previous months.

I’m sure some hedge fund can use this data to make some quick chalupes before the weekly Bureau of Labor Statistics announcements.  As these highly predictive real-time signals from Facebook, Google, Twitter, etc. become better instrumented and trusted, the slower government and corporate statistics become far less important…and markets can avoid the sort of volatility that comes after unemployment and earnings announcements.

March 3, 2009

Gallup’s Mood-Tracker

Check out the Gallup Daily US Mood Tracker:

http://www.gallup.com/poll/106915/Gallup-Daily-US-Mood.aspx

This chart comes from the Gallup-Healthways Well-Being Index, from a daily poll of Americans (claiming 98% coverage).  The survey contains questions about health, diet, well-being, stress, and economic indicators (“Although it’s not very likely that you did, could you tell me if you happened to purchase or lease a motor vehicle yesterday, such as a car, truck, or SUV?”)   More info here: http://www.well-beingindex.com/.  The mood tracking data is generated from the following question:

Q. Did you experience the following feelings during A LOT OF THE DAY yesterday? How about?

* Enjoyment
* Physical Pain
* Worry
* Sadness
* Stress
* Anger
* Happiness

There are two data series in the mood tracker, one for the percentage of respondents who experienced “a lot of happiness/enjoyment without a lot of stress/worry” (no indication if the / represents an AND or an OR) and another for the respondents who experienced stress/worry but not happiness/enjoyment.  The two scores have a correlation of -0.82.  There is a visible drop in the happiness index starting in early September with the stock market crash that is sustained until the present.  The drop is statistically significant: the average index score falls from 49.2 in January-August to 46.0 in September-February.

hell yea!

But……how does data from a random sample of American households reflecting on their mood of the previous day compare to that from a heavily opt-in, self-selected group of HappyFactor users reporting their happiness moment by moment?

Here’s the intraweek comparison – with the Gallup data normalized to the HappyFactor mean.
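For the curious, one plausible way to do that kind of normalization is to rescale the Gallup series so its mean matches the HappyFactor mean.  This is a sketch of that reading only; the function name and every value below are made-up placeholders, not the real Gallup or HappyFactor numbers.

```python
from statistics import mean

def normalize_to_mean(series, target_mean):
    """Rescale a series multiplicatively so that its mean equals target_mean."""
    scale = target_mean / mean(series)
    return [value * scale for value in series]

# Hypothetical Monday-through-Sunday averages on two different scales.
gallup_by_weekday = [44.0, 45.0, 45.5, 46.0, 48.0, 54.0, 55.0]
happyfactor_by_weekday = [6.1, 6.0, 6.1, 6.2, 6.3, 6.5, 6.6]

gallup_rescaled = normalize_to_mean(gallup_by_weekday, mean(happyfactor_by_weekday))

for day, g, h in zip("MTWTFSS", gallup_rescaled, happyfactor_by_weekday):
    print(f"{day}: Gallup (rescaled) {g:.2f} vs HappyFactor {h:.2f}")
```

Rescaling preserves the relative size of the weekend bump in each series, which is the thing being compared; shifting by a constant instead would preserve absolute differences, so the choice matters for how “flat” either line looks.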



What’s going on here? The HappyFactor intraweek data is pretty flat in comparison to Gallup.  HappyFactor shows significant happiness bumps over the weekend, but nowhere near as large as Gallup.

First, maybe HF coverage bias…if HF users enjoy their work more than average Americans, or are spending their weekends drooling on the couch while the rest of the country engages in Bacchic hedonism…..yes, certainly possible.  Other possibilities?

Second, we know that our memory of yesterday is blurred by the passage of time and influenced by our mood today.  Suppose someone sitting at home at 8 p.m. on a Monday night picks up the phone and considers how she felt on Sunday.  After a crappy day at work, she might report that Sunday was peachy indeed… even though her mood on Sunday was “actually” close to her mood on Monday.

Third, the internal stereotypes for what we are supposed to feel might also affect our memory of past moments differently than our real-time experience of those moments.  By responding to a question about a certain day of the week (Yesterday…what day was that?  Friday? Monday?) there could be activated representations (bear with me) of related concepts (I know).  Case of the Mondays, TGIF.  Those activations probably aren’t going to happen in a response to HappyFactor’s “how happy are you right now?”

Also, there’s always the issue of social desirability bias when you’re talking to other humans.  Especially if she has the dulcet vocal cords of a trained sociologist…..perhaps with “extensive” public-sector experience at the Census Bureau….

…and skin like Neve Campbell….

Spicy.



This graph warrants a post by itself but it’s also interesting in the context of this discussion.  It’s average HappyFactor by hour of the day.  Besides the outlier at 8 AM (commute?  Caffeine rush?) there are 3 pretty distinct clusters of happiness: morning to noon, noon to 5, and 5 to bedtime.

Yeah, the error bars are big, but isn’t this cool?  Basically we’re miserable in the morning and stoked in the evening after work is over.  This also reminds me of human cortisol levels, which are highest in the morning and drop throughout the day.  I am grateful to work for Facebook, where I can skip out on some of those mood-raping morning hours!

If Gallup is always polling in the evening, they are actually sampling a much happier population than HappyFactor is.  I’m not sure how that would muck with the data but it would be intriguing to dial people in the morning and see what it looked like.

Hey, maybe you could predict the stock market….

Comments (View)
February 17, 2009

Theme extraction wrong

Following up on the New York Times rant (I only knock it because I love it), here’s a look at Time.  To boost pageviews on Time.com, they elect to insert internal links right within the content of the page.  To find a relevant link to show, they use some sort of theme extraction algorithm on the paragraph and search for articles that also contain that theme.

The article is a satiric list of why elderly people like Facebook: Why Facebook is for Old Fogies.  There were a total of 3 paragraphs out of 11 that had links within the content.  Here’s the first one:

1. Facebook is about finding people you’ve lost track of. And, son, we’ve lost track of more people than you’ve ever met. Remember who you went to prom with junior year? See, we don’t. We’ve gone through multiple schools, jobs and marriages. Each one of those came with a complete cast of characters, most of whom we have forgotten existed. But Facebook never forgets. (See the best social networking applications.)

The extracted theme is Facebook, and the target page is an article about social networking.  Seems relevant enough.

3. We never get drunk at parties and get photographed holding beer bottles in suggestive positions. We wish we still did that. But we don’t. (See pictures of Denver, Beer Country.)

The extracted theme is “beer” and the target is a slideshow of microbreweries in Denver.  Generously, the connection is tenuous.  The algorithm failed to take into account the negation “never…get photographed holding beer bottles…” and so the added content looks random.

6. We’re old enough that pictures from grade school or summer camp look nothing like us. These days, the only way to identify us is with Facebook tags. (See pictures of a diverse group of American teens.)

The extracted theme is “school” or “kids,” I suppose.  Somehow that connects it to a series of photos of random people talking about themselves.  The problem here is that the tags are too broad.  You could tag almost any content with “living people” or “earth” or “published in 2009” and make these coarse connections, but at the risk of confusing users.

On the other hand, the slideshow was oddly engaging….some natural human voyeurism I suppose, descended from monkeys peering through the brush to discover who was grooming whom.

The lessons here are:

  1. Don’t classify it unless you’re sure.  Theme extraction ain’t easy.
  2. Be smart about simple things like negation.
  3. If you need to, show something broadly interesting…preferably photos, which are consumed quickly and generate tons of page views (== $).
  4. Content and advertising are often the same.
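
To show what “being smart about negation” could look like at its crudest, here is a sketch of a keyword matcher that refuses to extract a theme once a negation cue has appeared earlier in the same sentence.  I have no idea what Time actually runs; the cue list, the function name, and the keyword set are all hypothetical, and real negation scoping is a lot harder than this.

```python
import re

# Hypothetical, deliberately tiny list of negation cues.
NEGATION_CUES = {"never", "not", "no", "don't", "without"}

def extract_themes(paragraph, keywords):
    """Return keywords found in the paragraph, skipping any keyword that
    follows a negation cue within the same sentence."""
    themes = set()
    text = paragraph.lower().replace("’", "'")  # normalize curly apostrophes
    for sentence in re.split(r"[.!?]+", text):
        negated = False
        for token in re.findall(r"[a-z']+", sentence):
            if token in NEGATION_CUES:
                negated = True  # crude scope: the rest of the sentence
            elif token in keywords and not negated:
                themes.add(token)
    return themes

keywords = {"facebook", "beer"}
item_1 = "Facebook is about finding people you’ve lost track of."
item_3 = ("We never get drunk at parties and get photographed holding "
          "beer bottles in suggestive positions. We wish we still did that.")

print(extract_themes(item_1, keywords))
print(extract_themes(item_3, keywords))
```

With this rule the first paragraph still yields “facebook”, while the beer paragraph yields nothing, so the fallback in lesson 3 would kick in instead of a Denver microbrewery slideshow.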
January 26, 2009

Do NYTimes.com readers actually read the news?

nytimes top 10 list

1. pop health article about coffee. NOT NEWS
2. sexy pop psychology article. NOT NEWS
3. republican-bashing op-ed. NOT NEWS
4. article about profanity on signs (“butt hole road”, “crapstone, england.”) NOT NEWS
5. article about nationalization of banks. NEWS
6. empty personal finance editorial. “participants in 401(k)’s are in greater danger than ever of coming up short in retirement.” NOT NEWS
7. human interest article about skateboarding. NOT NEWS
8. article on lobbying efforts by pro-arts groups. KINDA NEWS
9. warm & fuzzy op-ed about obama. NOT NEWS
10. review of trends in tech sector. NEWS

So that’s 2.5/10 in the most-emailed list. These high-CTR articles generate the most page views => ad inventory => cash for the Times. Print newspapers allow for honest, unbiased reporting of facts because you only have to “sell out” the front page to sell papers on the newsstand (you don’t need to sell out anything for subscriptions, which are locked in.) On the internet they get to fight over pageviews with everyone else. What to do? Surface the popular content box prominently on every page, creating a massive positive feedback loop that highlights partisan op-eds, human interest fluff, pop culture reviews, personal health and finance, and mild pornography.

Jon Stewart had a great page in “America” where he considered the newsworthiness of various events:

2,000 Massacred Congolese = 500 Drowned Bangladeshis = 45 Fire-bombed Iraqis = 12 Car-bombed Europeans = 1 Snipered American.

< One fabulous no-knead bread recipe.

Expect the denominator to drop even further as the NYT’s debt problems hit the fan.

January 3, 2009

The power of aggregate data.  With many individuals reporting their own happiness in their own context, we can infer larger trends about the mood of the world, day by day (and eventually hour by hour, minute by minute, etc.)

Check out the post holiday blues drop on December 26.

Happy Factor: Global Happiness

shit.

November 21, 2008

Text analytics for democracy

The Obama transition team recently put up a web site, change.gov, that features several contact points for Americans to provide feedback directly to the future administration.  One is called “Share Your Story” and another is “Share Your Vision”:

Share with us your concerns and hopes – the policies you want to see carried out in the next four years.

There are also individual forms for each agenda item listed on the site.  These are called “By the People, For the People.”  For example, on the economy:

Tell us how the economy has affected you, what you’d like to see an Obama-Biden administration do, or where you’d like the country to go.

The other day, as a responsible citizen, I submitted a short idea about national service.  In a conversation several days later with Matt Schwieger, we came upon the idea of using text mining techniques to sort through all the direct feedback given by citizens to change.gov…and eventually to the administration.  There are not nearly enough underpaid public servants to go through all of these letters individually; but in aggregate, it’s a wonderfully rich source of data for an elected official seeking to understand the concerns of his or her constituents!

Just think about the possibilities:

  • Sentiment analysis on individual Congressional bills, spending programs, Supreme Court nominees
  • Geo-IP maps to cluster the concerns of the populace by state and county…and tailoring speeches accordingly
  • Co-occurrence analysis combined with sentiment analysis to find out which specific parts of a bill or program are well-received, and which are unpopular
  • Providing White House pollsters with a complementary stream of data with a much larger sample size

Assuming there was appropriate spam and duplicate detection, this is a way to have your voice heard, and make our democracy a little more direct.  Send a positive note about something the government is doing that you like, and that bumps the needle up a little bit.  Cuss them out, and that moves the needle back down.  Transparent (well, except for the specifics of the models) and dead easy to do.
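
Here is roughly what the “needle” could look like in its most naive form: a word-list sentiment score summed per agenda item.  The word lists, function names, and sample submissions are all invented for illustration; nothing here reflects how change.gov actually handles its mail, and a real system would need proper tokenization, spam filtering, and a far better model.

```python
from collections import defaultdict

# Hypothetical toy lexicons; a real system would use a trained model.
POSITIVE = {"like", "great", "support", "thanks", "good"}
NEGATIVE = {"hate", "terrible", "oppose", "awful", "bad"}

def score(message):
    """Positive-word count minus negative-word count.
    Splits on whitespace only, so punctuation is not handled."""
    words = message.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def needle_by_topic(submissions):
    """Sum message scores per topic from (topic, message) pairs."""
    totals = defaultdict(int)
    for topic, message in submissions:
        totals[topic] += score(message)
    return dict(totals)

submissions = [
    ("economy", "I support the stimulus plan"),
    ("economy", "this bailout is terrible"),
    ("national service", "great idea and I like it"),
]

needle = needle_by_topic(submissions)
print(needle)
```

One kind note and one angry note on the economy cancel out, while the national service needle moves up by two.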

Obama won’t (and shouldn’t) make decisions based solely on public opinion, of course; but since his administration is already collecting the data, doesn’t it make sense to organize it intelligently and use it as an input to his decisions as president?

October 31, 2008

Primary sources

For most people trying to sell text analysis to marketers, “social media” usually means two things: blogs and Twitter.  Why those two, out of all the “social” text on the internet?  Let’s go through all the possibilities.

- MySpace:

advantages: target audience, public profiles

disadvantages: mostly spam, false metadata (“101 years old”), many profiles not crawlable

- Usenet:

advantages: easily obtainable from your local news server, decent metadata

disadvantages: not 1994 any more

- Discussion Boards / Forums:

advantages: lots of diverse topics, relatively popular, target audience segments

disadvantages: need to crawl the net, sparse metadata, messy

- Review Sites:

advantages: the whole site is dedicated to user opinion!

disadvantages: limited to cars and digital cameras

- Facebook:

advantages: target audience, tons of metadata

disadvantages: not public, can’t scrape

- Rest of the Web:

advantages: heaps of data

disadvantages: trying to make sense of anything is impossible

Any sane person trying to start a text analytics company would of course opt for blogs and Twitter. Why not?  Blogs are nice and wordy, easily indexed, have timestamps on the posts, there’s a graph structure to examine if you get bored….and Twitter is a hot name, has an API, also has some sort of graph structure….it’s not too hard to organize this text data, stick it in a data warehouse or S3/EC2, and start throwing SVDs and CRFs at that bitch……

The only problem is that very few people use Twitter, and almost no one blogs.  For example, the last estimate I can find _anywhere_ on the web is this article from Business Week in April 2007, which pegs the number at 495,000 English posts per day, on a slow decline from six months earlier.  I have no idea what it is now but the lack of statistics from people who would benefit from growth in blogging (i.e. Technorati) tells me that blogging can’t be in a good way.  Plus a lot of blog posts are spam and those can be hard to detect.

Twitter is in better shape, but here we have a problem with coverage.  The group of people who use Twitter is very self-selected, limited mostly to the social media experts themselves.  If I mined sentiment from Twitter messages I could really conclude nothing about how my brand or product was perceived by normal people.  (This may change, of course, if Twitter goes mainstream.)

—to be continued—



The opinions expressed on this site are mine and do not necessarily represent those of my employer, Facebook. You won’t find any confidential company information here, and while you’re welcome to get in touch with me, I’m afraid I can’t put you in contact with my employer.