Is the data warehouse going the way of the Dodo?

The Dodo went extinct because it could not adapt. Maybe data warehouses are heading down the same route?

For the decade or so I’ve been working with data warehouses they have basically all looked the same: at the core a relational database modelled for fast load and retrieval of data batches, and an ETL tool that does all the data integration heavy lifting. This model is quite mature and there have been few surprises over the last couple of years. It does have its challenges though. Relational theory and technology were originally designed for quite different workloads than those found in data warehousing. The technology may have changed to accommodate other kinds of data profiles, but the fundamentals around transactions and referential integrity remain a bottleneck. The ETL paradigm has also been under fire for a number of years. In a typical DW project the majority of the work goes into shuffling data around, applying transformations and in some cases business logic to the data. This often leads to a disconnect between the reality in the data warehouse and the real world in the line of business. Data duplication is also an issue, and it’s hard to argue against the notion that data proliferation carries a burden both in terms of governance and data quality and in pure cost.

On the storage side much has happened over the last few years. Alternatives to relational storage are rapidly maturing, and a flurry of new ideas is coming out of the NoSQL “movement”. Take append-only databases, for instance. They share many of the same characteristics as data warehouses but are built from the ground up for that kind of data storage. These databases also scale very nicely: need more space? Just add a node.
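To make the append-only idea a bit more concrete, here is a minimal, hypothetical sketch in Python: writes are only ever appended to the end of a log, and the current state is derived by reading the log forward. Real append-only stores of course add indexing, compaction and distribution on top of this.

```python
import json

class AppendOnlyStore:
    """Toy append-only store: records are never updated in place."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Every write is a new line at the end of the file; history is preserved.
        with open(self.path, "a") as log:
            log.write(json.dumps(record) + "\n")

    def current_state(self):
        # The latest value per key wins; older versions remain in the log.
        state = {}
        with open(self.path) as log:
            for line in log:
                record = json.loads(line)
                state[record["key"]] = record["value"]
        return state

store = AppendOnlyStore("sales_facts.log")
store.append({"key": "order-1", "value": {"amount": 100}})
store.append({"key": "order-1", "value": {"amount": 120}})  # a correction, not an update
print(store.current_state())  # {'order-1': {'amount': 120}}
```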

There are also new ideas about how to integrate data that depart radically from the established ETL paradigm. Data virtualization is one of them. It basically ditches the whole ETL / data warehouse concept and replaces it with modelling and real-time access to the sources, with some caching thrown in. In effect it brings the promise of top-down, rapid data integration. Sounds like a dream? At least Forrester does not think so.
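As a rough illustration of the idea (not of any particular product), a virtual view can be thought of as a query facade that fetches from the underlying sources on demand and caches the result. The source names and functions below are made up for the example.

```python
import time

class VirtualCustomerView:
    """Hypothetical data virtualization facade: no copies, just on-demand access."""

    def __init__(self, crm_source, erp_source, cache_ttl_seconds=300):
        self.crm_source = crm_source      # e.g. a function calling the CRM API
        self.erp_source = erp_source      # e.g. a function querying the ERP database
        self.cache_ttl_seconds = cache_ttl_seconds
        self._cache = {}

    def customer(self, customer_id):
        cached = self._cache.get(customer_id)
        if cached and time.time() - cached["at"] < self.cache_ttl_seconds:
            return cached["row"]          # serve from cache instead of hitting the sources
        # Join the two sources at query time instead of copying them into a warehouse.
        row = {**self.crm_source(customer_id), **self.erp_source(customer_id)}
        self._cache[customer_id] = {"row": row, "at": time.time()}
        return row

view = VirtualCustomerView(
    crm_source=lambda cid: {"id": cid, "name": "ACME Corp"},
    erp_source=lambda cid: {"id": cid, "open_orders": 3},
)
print(view.customer(42))  # {'id': 42, 'name': 'ACME Corp', 'open_orders': 3}
```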

These are only two examples of new ways to think about old problems. There are many more. I, for one, have a lot of reading to do!

My issues with Big Data: Distraction

While I was reading The Economist on the plane from Switzerland, every other ad was somehow Big Data related. After a while it would not have surprised me if Lexus had advertised a car powered by Big Data. This made me think about the enormous amount of effort and resources that goes into this concept. With Cloud now a household name, Big Data is perceived as the next thing to drive growth. Venture capital is funding Big Data start-ups while existing companies are re-branding or extending their product lines into the Big Data space. With money and time being scarce resources, they have to be reallocated from somewhere else.

Looking at this from an information management perspective, there are still big unsolved challenges and untapped opportunities that deserve all the attention they can get. Here are some examples:

  • Data quality is an issue in almost every organization. Big Data will only make this worse as attention shifts to integrating vast amounts of noisy data with data of already poor quality.
  • Organizations are still inward-looking in their reporting and analysis. External data sources to benchmark and enrich internal information are underutilized. Big Data may be one such source, but there are far more mature and less costly alternatives from market research providers, governmental agencies and so forth. Some are calling this Wide Data.
  • Companies are not effectively utilizing their existing data, let alone Big Data. Every consulting company worth their salt has some kind of BI or Information Management strategy offering. The logical conclusion is that there must be a big market for helping companies become more mature in this space.
  • Big Data is a solution looking for a problem. A lot of effort is going into finding this problem both among providers and customers. Good to know there are other avenues to follow.

On the positive side, Big Data is associated with Business Intelligence and related fields. Some of the effort put into it will surely trickle down into better offerings for the good old “small data” solutions. I just hope we do not get too distracted from the main purpose of our field: helping customers make better decisions.

Disagree? Feel free to discuss!

My issues with Big Data: Sentiment

Big Data seems to be at the peak of its hype cycle these days and I have some issues with it. In the “My issues with Big Data” series I will explore a couple of these. First up: Sentiment.

Sentiment analysis concerns itself with discovering customers’ feelings about something we care about, such as a brand. One of the selling points of Big Data has been that this analysis can be done by machines on massive amounts of data.

Apart from the fact that I suspect it’s far more cost-efficient to simply do a good old survey on how the brand / marketing campaign / product is perceived, I have some very practical concerns about the feasibility of the whole concept. Being a simple guy, I think the best way to illustrate this is with a practical example. Let us try to manually “mine” customer sentiment about a well-known brand: Coca Cola. Our Big Data source will be Twitter.

Doing a search for “Coca Cola” yields, at the time of this writing, the following first eleven results:

The only way I can think of to discover sentiment in these tweets is to look for positively and negatively charged words / phrases and do a count. As far as I can tell, these are the tweets with words that can be interpreted positively:

  • Jump as in “jumping as a move done in happiness” in Coca-Cola’s Thailand sales jump 24%
  • Amazing and 🙂 in Amazing Coke wall clock 🙂
  • Crush as in “being in love” and 🙂 in You have a crush? — Nope, I don’t have a crush but I have coca-cola 🙂
  • Brilliant in This is about as brilliant as “New Coke” was years ago. Coca-Cola Debuts “Life” Brand 
  • Highlights as in “The highlights of the evening were…” in Coca-Cola debuts “Life” brand, highlights deadlines for regular coke
  • Cool in A cool Coca Cola delivery truck in Knoxville, 1909
  • Honest in But you know why @Honest isn’t coming for @HonestTea? B/c Honest Tea is owned by Coca Cola and they know they’d lose

In other words: seven of the eleven tweets contain words that have a positive ring to them. The first thing that comes to mind when seeing this is: is this good or bad? I have no idea. Maybe if we create some kind of ratio between posts with positive words and posts with negative words we will get a feeling for whether or not the public feels good about Coca Cola. So let’s count the negative ones:

  • Drunk as “Intoxicated” in 12% of all the Coca-Cola in America is drunk at breakfast
  • Crush as in “I will crush you” in You have a crush? — Nope, I don’t have a crush but I have coca-cola 🙂
  • Lose in But you know why @Honest isn’t coming for @HonestTea? B/c Honest Tea is owned by Coca Cola and they know they’d lose

Three negative tweets, right? Wait a minute. Two of those posts are also in the positive list! The first because crush can be interpreted both positively and negatively, and the second because the tweet contains both a positive and a negative word. We need to refine our algorithm to deal with this. The solution is quite simple. For each tweet we keep a score of positive and negative words. Ambiguous words can be removed because they would add to both the positive and the negative score. Tweets with ties need to be removed as they are neutral. The effect on our sample is that both the “You have a crush..” and “But you know why @Honest..” tweets have to be removed from the count. The end result is that of the eleven tweets, two are taken out due to the above ambiguity and three are removed because they contain neither positive nor negative words. So our ratio would be 5 positive / (5 positive + 1 negative) ≈ 83% of tweets being favorable towards the Coca Cola brand. Right?
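For the programmatically inclined, here is a minimal sketch in Python of the counting scheme just described. The word lists are abbreviated and hypothetical, and the tokenization is deliberately crude; the point is only to show how mechanical the “algorithm” is.

```python
POSITIVE_WORDS = {"jump", "amazing", "brilliant", "highlights", "cool", "honest"}
NEGATIVE_WORDS = {"drunk", "lose"}
AMBIGUOUS_WORDS = {"crush"}  # readable both ways, so ignored entirely

def classify(tweet):
    """Return 'positive', 'negative' or None (tied, or no charged words at all)."""
    # Very crude tokenization; a real system needs proper tokenizing, stemming,
    # negation handling and a much larger lexicon.
    words = [w.strip(".,!?%:") for w in tweet.lower().split()]
    usable = [w for w in words if w not in AMBIGUOUS_WORDS]
    positive = sum(1 for w in usable if w in POSITIVE_WORDS)
    negative = sum(1 for w in usable if w in NEGATIVE_WORDS)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return None  # tie or no sentiment-bearing words: dropped from the ratio

def favourability(tweets):
    """Share of sentiment-bearing tweets that score positive, e.g. 5 / (5 + 1)."""
    labels = [classify(t) for t in tweets]
    positive, negative = labels.count("positive"), labels.count("negative")
    return positive / (positive + negative) if positive + negative else 0.0
```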

Of course not. Let’s stop thinking like a machine now and look at the tweets with our human cognitive sense:

  • 12% of all the Coca-Cola in America is drunk at breakfast: Obviously this has nothing to do with being drunk but is rather a depressing health statistic.
  • Coca-Cola’s Thailand sales jump 24%: This is not a sentiment, it’s a positive financial news flash.
  • Amazing Coke wall clock :): Does this have something to do with liking the Coca Cola brand or liking the clock? Probably the latter.
  • You have a crush? — Nope, I don’t have a crush but I have coca-cola :): This might actually be positive (but remember it was removed due to ambiguity)
  • This is about as brilliant as “New Coke” was years ago. Coca-Cola Debuts “Life” Brand: At first I thought this would be a perfect sentiment tweet. An unambiguous positive term tightly linked to the Coca Cola brand. However, I did not know anything about “New Coke”, so I did a quick search. Uh oh. The author of the tweet is being ironic. Good luck interpreting that correctly, machine learning algorithm!
  • Coca-Cola debuts “Life” brand, highlights deadlines for regular coke: “Highlights” is not used as we thought. It’s used in the sense of “emphasizes”, a neutral term, not a positive one.
  • A cool Coca Cola delivery truck in Knoxville, 1909: Same problem as with the clock. Is the tweet positive about Coca Cola or about the physical truck? Probably the latter.
  • But you know why @Honest isn’t coming for @HonestTea? B/c Honest Tea is owned by Coca Cola and they know they’d lose: I am not sure what to think of this. I do not know who or what either @Honest or @HonestTea are/is. I doubt a machine would know better.

While my “algorithm” and output in this example are quite simplistic, they still illustrate my point: sentiment analysis is very tricky. As far as I can tell, this analysis has invalidated every single tweet in my (admittedly very limited) sample. Add to this the tweets that did not contain any words indicating sentiment, and you have a pretty bleak picture of what automated sentiment analysis can do.

Disagree? Feel free to comment!

Some additional reading on sentiment analysis:

  • Here is a research paper detailing a more sophisticated algorithm than the one I used to illustrate the challenges of sentiment analysis. The findings seem encouraging but I am still not convinced of its commercial viability.
  • Here are instructions on how to use Google’s infrastructure and APIs for sentiment analysis.
  • Here is a piece in The Guardian that looks at this a little more broadly.

Are maps the new gauges?

Over the past couple of years most data visualization vendors have been adding spatial / mapping functionality to their product suites. The first iterations were cumbersome to use, with special geographic data types that needed to be projected onto custom maps. Today it is much, much simpler, with capabilities to automatically map geography-related attributes (such as state and zip code). This lets existing data sets be plotted onto maps without the need for spatial references such as longitude/latitude or complex vector shapes. When doing this for the first time it is almost magical. You select a measure, specify some geographical attributes and presto: bars appear on the map in the right places. For us data enthusiasts this leads to a mapping frenzy where we take every data set in our repository and project it onto maps in more and more intricate ways. This is exactly what happened when I first started playing around with gauges (speedometers, thermometers etc.) and other “fancy” visualizations when they became available oh so many years ago. Today I roll my eyes at that kind of wasted “artistry”: so many pixels, so little information. So after having cooled down from my initial childish joy over a new way to display data, I started thinking about its value.
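To give a feel for how little ceremony is needed these days, here is a hypothetical sketch using the plotly library in Python. The data and column names are made up, and most BI and visualization tools offer something comparably simple.

```python
import pandas as pd
import plotly.express as px

# Made-up sales figures keyed only by state abbreviation; no latitude/longitude needed.
sales = pd.DataFrame({
    "state": ["CA", "TX", "NY", "WA", "TN"],
    "revenue": [1200, 950, 870, 430, 310],
})

# The library resolves the state codes to shapes and colours them by the measure.
fig = px.choropleth(
    sales,
    locations="state",
    locationmode="USA-states",
    color="revenue",
    scope="usa",
)
fig.show()
```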

When it comes to data visualizations I always ask myself: does this add value to the data compared to displaying it in a simple table? With gauges it’s pretty easy to answer that one: no. With maps? A little more difficult. The thing is: maps encode information that is useful in itself and is universally understood. Information such as location, distance and area is easily grasped by basically anyone looking at a map. Plotting data points onto a map can add value by leveraging this. Here are some examples:

  • Highlight clusters through color coding.
  • Give a sense of the density of some occurrence.
  • Show the distance between occurrences of something.

However, the data itself must be of a kind where this information is not readily apparent. For instance, a map of the US with states color coded by the percentage they contribute to total sales (who has not seen this?) does not add any value compared to a table. The map is not adding any context to the data; it is basically there for show, much like the good old gauges. My point is that the data needs to be geographically relevant: what we show has to relate to the information inherently present in geographic encoding. The volume of data also has to be big enough that these relationships are not obvious, or significant work would otherwise be needed to categorize the data before it made sense. A good example of this is the “Chicago Crime Data” sample data set provided with the public preview of GeoFlow for Excel (scroll down a bit on the page). Here we see how the map adds a lot of understanding to a data set that is geographically relevant. Deducing the insights we get from the clustering in the map would be impossible by simply scrolling through the data set. If we were to present this in tabular form, a lot of upfront work would be needed to create the kind of clusters and spatial information the map gives us for free.

So in short: are maps the new gauges? I would say not really. There is true value to be gained by projecting data points onto a map. But as always, the right tool should be used for the job at hand.