My issues with Big Data: Sentiment

Big Data seems to be at the peak of its hype cycle these days and I have some issues with it. In the “My issues with Big Data” series I will explore a couple of these. First up: Sentiment.

Sentiment analysis concerns itself with discovering customers’ feelings about something we care about, such as a brand. One of the selling points of Big Data has been that this analysis can be done by machines on massive amounts of data.

Apart from the fact that I suspect it's far more cost efficient to simply do a good old survey on how the brand / marketing campaign / product is perceived, I have some very practical concerns about the feasibility of the whole concept. Being a simple guy, I think the best way to illustrate this is with a practical example. Let us try to manually “mine” customer sentiment about a well known brand: Coca Cola. Our Big Data source will be Twitter.

Doing a search for “Coca Cola” yields, at the time of this writing, the following first eleven results:

[Image: cocacola]

The only way I can think of to discover sentiment in these tweets is to look for positively and negatively charged words / phrases and do a count. As far as I can tell these are the tweets with words that can be interpreted positively:

  • Jump as in “jumping as a move done in happiness” in Coca-Cola’s Thailand sales jump 24%
  • Amazing and 🙂 in Amazing Coke wall clock 🙂
  • Crush as in “being in love” and 🙂 in You have a crush? — Nope, I don’t have a crush but I have coca-cola 🙂
  • Brilliant in This is about as brilliant as “New Coke” was years ago. Coca-Cola Debuts “Life” Brand 
  • Highlights as in “The highlights of the evening were…” in Coca-Cola debuts “Life” brand, highlights deadlines for regular coke
  • Cool in A cool Coca Cola delivery truck in Knoxville, 1909
  • Honest in But you know why @Honest isn’t coming for @HonestTea? B/c Honest Tea is owned by Coca Cola and they know they’d lose

In other words: seven of eleven tweets contain words that have a positive ring to them. The first thing that comes to mind when seeing this is: is this good or bad? I have no idea. Maybe if we create some kind of ratio between posts with positive words and posts with negative words, we will get a feeling for whether or not the public feels good about Coca Cola. So let's count the negative ones:

  • Drunk as “Intoxicated” in 12% of all the Coca-Cola in America is drunk at breakfast
  • Crush as in “I will crush you” in You have a crush? — Nope, I don’t have a crush but I have coca-cola 🙂
  • Lose in But you know why @Honest isn’t coming for @HonestTea? B/c Honest Tea is owned by Coca Cola and they know they’d lose

Three negative tweets, right? Wait a minute. Two of those posts are also in the positive list! The first because “crush” can be interpreted both positively and negatively, and the second because the tweet contains both a positive and a negative word. We need to refine our algorithm to deal with this.

The solution is quite simple. For each tweet we keep a score of positive and negative words. Ambiguous words can be removed because they would add to both the positive and the negative score. Tweets with ties need to be removed as they are neutral. The effect on our sample is that both the “You have a crush..” and the “But you know why @Honest..” tweets have to be removed from the count.

The end result is that, of the eleven tweets, two have to be taken out due to the above ambiguity and three need to be removed because they contain neither positive nor negative words. So our ratio would be 5 positive / (5 positive + 1 negative) = 83% of tweets are favorable towards the Coca Cola brand. Right?
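For the curious, the refined count described above could be sketched in Python roughly as follows. This is only an illustration of my manual procedure, not a real sentiment tool. The word lists are simply the words picked out in the two lists above, and the function names (classify, favorable_ratio) are made up for this example. Two assumptions are worth flagging: each distinct word is counted once per tweet, and the smiley is ignored. Only the eight tweets quoted in this post are included; the three without any charged words would be dropped as neutral anyway.

```python
import re

# Word lists taken from the manual count in this post (illustrative only).
POSITIVE = {"jump", "amazing", "brilliant", "highlights", "cool", "honest"}
NEGATIVE = {"drunk", "lose"}
AMBIGUOUS = {"crush"}  # can be read either way, so it is ignored

# The eight tweets quoted in the post, with emoticons left out.
TWEETS = [
    "12% of all the Coca-Cola in America is drunk at breakfast",
    "Coca-Cola's Thailand sales jump 24%",
    "Amazing Coke wall clock",
    "You have a crush? - Nope, I don't have a crush but I have coca-cola",
    'This is about as brilliant as "New Coke" was years ago. '
    'Coca-Cola Debuts "Life" Brand',
    'Coca-Cola debuts "Life" brand, highlights deadlines for regular coke',
    "A cool Coca Cola delivery truck in Knoxville, 1909",
    "But you know why @Honest isn't coming for @HonestTea? B/c Honest Tea "
    "is owned by Coca Cola and they know they'd lose",
]


def classify(tweet):
    """Label a tweet by comparing its distinct positive and negative words."""
    # Assumption: lowercase, letters only, each distinct word counted once.
    words = set(re.findall(r"[a-z]+", tweet.lower())) - AMBIGUOUS
    positive = len(words & POSITIVE)
    negative = len(words & NEGATIVE)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"  # ties and tweets with no charged words


def favorable_ratio(tweets):
    """Share of positive tweets among the non-neutral ones."""
    labels = [classify(t) for t in tweets]
    positive = labels.count("positive")
    negative = labels.count("negative")
    if positive + negative == 0:
        return None  # nothing left to base a ratio on
    return positive / (positive + negative)


if __name__ == "__main__":
    for tweet in TWEETS:
        print(classify(tweet), "|", tweet)
    print(f"favorable: {favorable_ratio(TWEETS):.0%}")
```

Run as a script, this labels five tweets positive, one negative and two neutral, which gives the same 83% as the count by hand.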

Of course not. Let's stop thinking like a machine now and look at the tweets with our human cognitive sense:

  • 12% of all the Coca-Cola in America is drunk at breakfast: Obviously this has nothing to do with being drunk but rather a depressing health statistic.
  • Coca-Cola’s Thailand sales jump 24%: This is not a sentiment, it's a positive financial news flash.
  • Amazing Coke wall clock :): Does this have something to do with liking the Coca Cola brand or liking the clock? Probably the latter.
  • You have a crush? — Nope, I don’t have a crush but I have coca-cola :): This might actually be positive (but remember it was removed due to ambiguity)
  • This is about as brilliant as “New Coke” was years ago. Coca-Cola Debuts “Life” Brand: At first I thought this would be a perfect sentiment tweet. An unambiguous positive term tightly linked to the Coca Cola brand. However I did not know anything about “New Coke” so I did a quick search. Uh oh. The author of the tweet seems to be ironic. Good luck interpreting that correctly, machine learning algorithm!
  • Coca-Cola debuts “Life” brand, highlights deadlines for regular coke: “Highlights” is not used as we thought. It's used in the sense of “emphasizes”, a neutral term, not a positive one.
  •  A cool Coca Cola delivery truck in Knoxville, 1909: Same problem as with the clock. Is the tweet positive about Coca Cola or about the physical truck? Probably the latter.
  • But you know why @Honest isn’t coming for @HonestTea? B/c Honest Tea is owned by Coca Cola and they know they’d lose: I am not sure what to think of this. I do not know who or what either @Honest or @HonestTea are/is. I doubt a machine would know better.

While my “algorithm” and output in this example are quite simplistic it still illustrates my point: Sentiment analysis is very tricky. As far as I can tell this analysis has invalidated every single tweet from my (admittedly very limited) sample. Add to this the tweets that did not contain any words indicating sentiment and you have a pretty bleak picture of what automated sentiment analysis can do.

Disagree? Feel free to comment!

Some additional reading on sentiment analysis:

  • Here is a research paper detailing a more sophisticated algorithm than the one I use above to illustrate the challenges of sentiment analysis. The findings seem encouraging, but I am still not convinced that this is commercially viable.
  • Here are instructions on how to use Google’s infrastructure and APIs for sentiment analysis.
  • Here is a piece in The Guardian that looks at this a little more broadly.

7 thoughts on “My issues with Big Data: Sentiment”

  1. Great blog, first time here and I sure will be back often.

    Having worked with databases for a few years and taken an interest in data mining, I will just babble out my first thoughts.

    We should be glad that we are no longer in the 90s where the popular slang used to say “It’s bad!” as a compliment 🙂

    Maybe before jumping straight into the conclusion of positive and negative sentiments, one really should first identify the type of the message in the tweet

    – Sentimental tweet (One that truly can give an indication of positive or negative sentiment)
    – Fact (like the one about sales jump 24%, which could be either positive or negative, depending on the sales year before or sales by competitor)
    – Questions (It’s definitely a separate type of tweet, but I don’t know what can be done with them)
    – Retweets (could count as sentimental tweets, but not everyone retweets because they agree with the sentiment)

    Of the four types above, only the first one really gives some meaningful insight into the sentiment. I would say it is a good start, and from there the system can start trying to identify keywords and maybe even some ironies, given enough understanding of the language and culture.

    Language processing is a deep field, and even deeper as language changes over time. It will surely be a continuous process for the system to get better at identifying sentiments.

    • Thanks for your comment, Erik. I would suspect that it's just as hard to classify a tweet as one carrying sentiment as it is to figure out what that sentiment actually is. The beautiful thing about language is that it has so many nuances and layers. For example: “The house is on fire!” can mean widely different things based on context. It can be a sentiment about a party or it can be a distress call about an actual fire in an actual house.

  2. Two points:

    First, sentiment analysis is useful in many applications besides marketing. It’s a very simplistic (sometimes excruciatingly so) reduction of textual data, but people have found it gives useful results in a number of contexts. I did some work with applying sentiment analysis in composition pedagogy a couple of years ago, for example.

    Second, if the only sentiment-analysis algorithm you can imagine is a trivial bag-of-words + counting approach with a fixed set of sentiment-bearing terms and binary sentiment (“The only way I can think of to discover sentiment in these tweets is to look for positively and negatively charged words / phrases and do a count”), then you don’t know anything about sentiment analysis. (Or, apparently, about natural language processing, or its cognate fields. Here’s a tip: you’re not the first person to note that natural languages are ambiguous.) Perhaps doing a smidge of research before dismissing an entire field would be a good idea.

    Sigh. Kids these days. Where’s my shaking cane?

    • Michael,

      Thanks for your candid comments. I hope these points clear things up:

      1. I am talking about sentiment analysis in the specific context of the Big Data hype. The material I have seen on this is heavily marketing related. I do not doubt that sentiment analysis can be done in other settings, perhaps with success, but this is beside the point.
      2. I am not dismissing the entire field of sentiment analysis. I am questioning its ability to deliver on the promises made by vendors (again in the context of Big Data). I have seen very few true success stories (apart from the Romney / Obama analysis and a couple of stock analysis examples).
      3. Yes, my sample, algorithm and output are all simplistic. I used them to illustrate the complexities you are no doubt familiar with when it comes to text analysis. I would be thrilled if you could illustrate how a different algorithm would give better results from the sample.
      4. While I have a background in data mining, I am in no way an expert on text analysis. What I DO know is that the “trivial-bag-of-words + counting approach” is pretty common in commercial sentiment analysis. This is what I am talking about.
      5. The argument that I am uninformed, therefore sentiment analysis must be real is logically flawed.

  3. Great Post! I think the problem with sentiment analysis is that any algorithm that tries to capture the feeling behind words is based on questionable assumptions. It’s a question of semantics (e.g.: Crush as in “being in love” or Crush as in “I will crush you”).

    Moreover, different cultural mindsets might generate misleading interpretations: some cultures are more direct and fact based while others are more context based and less keen to express feelings directly.

    Other algorithms (see googlesearch) try to find correlations between content (text) and context such as network presence, relationships and behaviors. Therefore, if I tweet Crush and then I buy a Coke, that more likely means Crush as in “being in love”.

    Anyhow, assumptions are needed in any case, when you do sentiment analysis with a traditional marketing research or with other sentiment tools. The problem with sentiments is that the questions are not defined clearly before and thus there are more questionable inferences.

    So, I think that sentiments are very tricky, in BI and in real life as well and a well grounded cautiousness is needed.

    • Thanks for the comment. In my analysis I only did a very simple example to illustrate how difficult this is. You give some great examples of issues that make it even worse. Off the top of my head I would add: multilingualism and sociological dialects. I especially liked your last comment, which made me realize that even for humans it can be hard to analyze sentiment in statements.

      • “Even for humans it can be hard to analyze sentiment in statements”. I would say that is what makes life so interesting, with the assumption that humans don’t behave like machines, of course! Can you imagine if analytics could predict anything?
