There are some pretty awesome BI jobs

I started out my career as a programmer. Why did I become a programmer? To make games, of course! This is something everyone knows: every programmer has a secret (or not so secret) craving to make a game. Now you can get the best of both worlds! Check out these job openings from Riot, makers of League of Legends:

BUSINESS INTELLIGENCE

Reminds me of a Kimball seminar I attended where they talked about how they analyzed game patterns in Call of Duty.

 

Is the data warehouse going the way of the Dodo?

The Dodo went extinct because it could not adapt. Maybe data warehouses are going down the same route?

For the decade or so I’ve been working with data warehouses, they have basically all looked the same: at the core a relational database modelled for fast load and retrieval of data batches, and an ETL tool that does all the data integration heavy lifting. This model is quite mature and there have been few surprises over the last couple of years. It does have its challenges though. Relational theory and technology were originally designed for quite different workloads than those found in data warehousing. The technology might have changed to accommodate other types of data profiles, but the fundamentals around transactions and relational integrity remain a bottleneck. The ETL paradigm has also been under fire for a number of years. In a typical DW project a majority of the work goes into shuffling data around, applying transformations and in some cases business logic to the data. This often leads to a disconnect between the reality in the data warehouse and the real world in the line of business. Data duplication is also an issue, and it’s hard to argue against the notion that data proliferation carries with it a burden both in terms of governance and data quality and in pure costs.

On the storage side much has happened over the last few years. Alternatives to relational storage are rapidly maturing and a flurry of new ideas is coming out of the NoSQL “movement”. Take append-only databases, for instance. They share many of the same characteristics as data warehouses but are built from the ground up for that kind of data storage. Additionally, these databases scale very nicely. Need more space? Just add a node.

There are also new thoughts on how to integrate data that depart radically from the established ETL paradigm. Data virtualization is one of these. It basically ditches the whole ETL / data warehouse concept and replaces it with modelling and real-time access to sources, with some caching thrown in. In effect it brings the promise of top-down, rapid data integration. Sounds like a dream? At least Forrester does not think so.

These are only two examples of new ways to think about old problems. There are many more. I, for one, have a lot of reading to do!

My issues with Big Data: Distraction

While I was reading The Economist on the plane from Switzerland, every other ad was somehow Big Data related. After a while it would not have surprised me if Lexus advertised a car powered by Big Data. This made me think about the enormous amount of effort and resources that goes into this concept. With Cloud being a household name, Big Data is perceived as the next thing to drive growth. Venture capital is funding Big Data start-ups while existing companies are re-branding or extending their product lines into the Big Data space. With money and time being scarce resources, they have to be reallocated from somewhere else.


 

Looking at this from an information management perspective, there are still big unsolved challenges and untapped opportunities that deserve all the attention they can get. Here are some examples:

  • Data quality is an issue in almost every organization. Big Data will only make this worse when attention is shifted to integrating vast amounts of noisy data with data of already poor quality.
  • Organizations are still inward-looking in their reporting and analysis. External data sources to benchmark and enrich internal information are underutilized. Big Data may be one such source, but there are far more mature and less costly alternatives from market research providers, governmental agencies and so forth. Some are calling this Wide Data.
  • Companies are not effectively utilizing their existing data, let alone Big Data. Every consulting company worth their salt has some kind of BI or Information Management strategy offering. The logical conclusion is that there must be a big market for helping companies become more mature in this space.
  • Big Data is a solution looking for a problem. A lot of effort is going into finding this problem both among providers and customers. Good to know there are other avenues to follow.

On the positive side, Big Data is associated with Business Intelligence and related fields, and some of the effort put into it will surely trickle down into better offerings for the good old “small data” solutions. I just hope we do not get too distracted from the main purpose of our field: helping customers make better decisions.

Disagree? Feel free to discuss!

My issues with Big Data: Sentiment

Big Data seems to be at the peak of its hype cycle these days and I have some issues with it. In the “My issues with Big Data” series I will explore a couple of these. First up: Sentiment.

Sentiment analysis concerns itself with discovering customers’ feelings about something we care about, such as a brand. One of the selling points of Big Data has been that this analysis can be done by machines on massive amounts of data.

Apart from the fact that I suspect it’s far more cost-efficient to simply do a good old survey on how the brand / marketing campaign / product is perceived, I have some very practical concerns about the feasibility of the whole concept. Being a simple guy, I think the best way to illustrate this is with a practical example. Let us try to manually “mine” customer sentiment about a well-known brand: Coca Cola. Our Big Data source will be Twitter.

Doing a search for “Coca Cola” yields, at the time of this writing, the following first eleven results:

The only way I can think of to discover sentiment in these tweets is to look for positively and negatively charged words / phrases and do a count. As far as I can tell these are the tweets with words that can be interpreted positively:

  • Jump as in “jumping as a move done in happiness” in Coca-Cola’s Thailand sales jump 24%
  • Amazing and 🙂 in Amazing Coke wall clock 🙂
  • Crush as in “being in love” and 🙂 in You have a crush? — Nope, I don’t have a crush but I have coca-cola 🙂
  • Brilliant in This is about as brilliant as “New Coke” was years ago. Coca-Cola Debuts “Life” Brand 
  • Highlights as in “The highlights of the evening were…” in Coca-Cola debuts “Life” brand, highlights deadlines for regular coke
  • Cool in A cool Coca Cola delivery truck in Knoxville, 1909
  • Honest in But you know why @Honest isn’t coming for @HonestTea? B/c Honest Tea is owned by Coca Cola and they know they’d lose

In other words: seven of eleven tweets contain words that have a positive ring to them. The first thing that comes to mind when seeing this is: is this good or bad? I have no idea. Maybe if we create some kind of ratio between posts with positive words versus negative words we will get a feeling for whether or not the public feels good about Coca Cola. So let’s count the negative ones:

  • Drunk as “Intoxicated” in 12% of all the Coca-Cola in America is drunk at breakfast
  • Crush as in “I will crush you” in You have a crush? — Nope, I don’t have a crush but I have coca-cola 🙂
  • Lose in But you know why @Honest isn’t coming for @HonestTea? B/c Honest Tea is owned by Coca Cola and they know they’d lose

Three negative tweets, right? Wait a minute. Two of those posts are also in the positive list! The first one because crush can be interpreted both positively and negatively, and the second one because the tweet contains both a positive and a negative word. We need to refine our algorithm to deal with this. The solution is quite simple. For each tweet we keep a score of positive and negative words. Ambiguous words can be removed because they would add to both the positive and negative scores. Tweets with ties need to be removed as they are neutral. The effect on our sample is that both the “You have a crush..” and “But you know why @Honest..” tweets have to be removed from the count. The end result is that of the eleven tweets, two have to be taken out due to the above ambiguity and three have to be removed because they contain neither positive nor negative words. So our ratio would be 5 positive / (5 positive + 1 negative) = 83% of tweets favorable towards the Coca Cola brand. Right?
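To make the mechanical nature of this scoring concrete, here is a minimal T-SQL sketch of the word-count approach described above. Everything in it is a toy assumption: the table names (#Tweets, #SentimentWords), the word list and the polarity scores are made up for illustration, not taken from any real sentiment engine.

-- Toy word list with a polarity score: +1 for positive, -1 for negative (all names are hypothetical)
CREATE TABLE #SentimentWords (Word NVARCHAR(50), Polarity INT);
INSERT INTO #SentimentWords VALUES (N'amazing', 1), (N'brilliant', 1), (N'cool', 1), (N'lose', -1), (N'drunk', -1);

CREATE TABLE #Tweets (TweetId INT, TweetText NVARCHAR(280));
INSERT INTO #Tweets VALUES
    (1, N'Amazing Coke wall clock :)'),
    (2, N'12% of all the Coca-Cola in America is drunk at breakfast');

-- Count positive and negative hits per tweet; ties would be treated as neutral and dropped
SELECT  t.TweetId,
        SUM(CASE WHEN w.Polarity = 1  THEN 1 ELSE 0 END) AS PositiveHits,
        SUM(CASE WHEN w.Polarity = -1 THEN 1 ELSE 0 END) AS NegativeHits
FROM    #Tweets t
JOIN    #SentimentWords w
  ON    t.TweetText LIKE N'%' + w.Word + N'%'
GROUP BY t.TweetId;

As the rest of the post shows, the problem is not writing this query; it is that the scores it produces say very little about what the tweets actually mean.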

Of course not. Let’s stop thinking like a machine now and look at the tweets with our human cognitive sense:

  • 12% of all the Coca-Cola in America is drunk at breakfast: Obviously this has nothing to do with being drunk but is rather a depressing health statistic.
  • Coca-Cola’s Thailand sales jump 24%: This is not a sentiment, its a positive financial news flash.
  • Amazing Coke wall clock :): Does this have something to do with liking the Coca Cola brand or liking the clock? Probably the latter.
  • You have a crush? — Nope, I don’t have a crush but I have coca-cola :): This might actually be positive (but remember it was removed due to ambiguity)
  • This is about as brilliant as “New Coke” was years ago. Coca-Cola Debuts “Life” Brand: At first I thought this would be a perfect sentiment tweet. An unambiguous positive term tightly linked to the Coca Cola brand. However I did not know anything about “New Coke”, so I did a quick search. Uh oh. The author of the tweet is apparently being ironic. Good luck interpreting that correctly, machine learning algorithm!
  • Coca-Cola debuts “Life” brand, highlights deadlines for regular coke: “Highlights” is not used as we thought. It’s used as “emphasizes”, a neutral term, not a positive one.
  •  A cool Coca Cola delivery truck in Knoxville, 1909: Same problem as with the clock. Is the tweet positive about Coca Cola or about the physical truck? Probably the latter.
  • But you know why @Honest isn’t coming for @HonestTea? B/c Honest Tea is owned by Coca Cola and they know they’d lose: I am not sure what to think of this. I do not know who or what either @Honest or @HonestTea are/is. I doubt a machine would know better.

While my “algorithm” and output in this example are quite simplistic, they still illustrate my point: sentiment analysis is very tricky. As far as I can tell this analysis has invalidated every single tweet from my (admittedly very limited) sample. Add to this the tweets that did not contain any words indicating sentiment and you have a pretty bleak picture of what automated sentiment analysis can do.

Disagree? Feel free to comment!

Some additional reading on sentiment analysis:

  • Here is a research paper detailing a more sophisticated algorithm than the one I used to illustrate the challenges of sentiment analysis. The findings seem encouraging but I am still not convinced of the commercial viability.
  • Here are instructions on how to use Google’s infrastructure and APIs for sentiment analysis.
  • Here is a piece in The Guardian that looks at this a little more broadly.

Blog update

I have started updating the design and structure of the blog so expect things to change around over the next couple of weeks.

I am also working on a couple of larger posts / series on the topics of BI strategy, BI trends and big data. Stay tuned.

Finally, ITCentralStation created a profile in my name and has “syndicated” some of my posts from this blog. While it is a bit annoying to have my content in more than one place, some interesting comments have been made regarding my Microsoft BI tips series. I will address these in a coming blog update.

My 30 tips for building a Microsoft Business Intelligence solution, Part VI: Tips 26-30

This is the last part in my series of things I wished I knew about before starting a Microsoft BI project. I’ll be taking my summer vacation now, so the blog will be quiet for the next month. After the break I will revise a couple of the tips based on feedback, so stay tuned.

#26: Decide how to source your data in Analysis Services and stick with it.

Ideally you will source your data from a correctly modeled star schema. Even then you may need to massage the source data before feeding it into SSAS. There are two ways of accomplishing this: through views in the database, or through data source views (dimensional) or queries (tabular). Unless you are unable to create views in your database (for example when running against a production system) I would strongly suggest using them. This gives you a clean separation of logic and abstraction between the SSAS solution and the data source, and it means that clients connecting to the data warehouse directly will see the same data model as the SSAS solution. Migrating between model types (like dimensional and tabular) also becomes much simpler. In my solutions I never connect to tables directly: I always bind to views for everything and never implement any logic in the DSV or via queries.
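As a minimal sketch of this pattern, assuming a hypothetical DimCustomer table, the abstraction layer is just a view that both the SSAS solution and direct data warehouse clients bind to:

-- Hypothetical view the SSAS data source view / tabular model binds to instead of the table
CREATE VIEW dbo.v_DimCustomer
AS
SELECT  CustomerKey,
        CustomerName,
        -- light massaging lives here, not in the DSV or in tabular queries
        UPPER(CountryCode) AS CountryCode,
        ISNULL(CustomerSegment, N'Unknown') AS CustomerSegment
FROM    dbo.DimCustomer;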

#27: Have some way of defining “current” time periods in your SSAS solution

Most SSAS solutions have a time dimension with dates, months, years, etc. In many ways it’s the most important dimension in your solution, as it will be included in most reports / analyses as well as form the basis for a lot of calculations (see previous tips). Having a notion of what the current period is in your time dimension will greatly improve the usability of your solution: reports will automatically be populated with the latest data without any user interaction. It can also simplify ad-hoc analysis by setting the default members to the most current date / month / year, so that when users do not put these on one of the axes the query defaults to the most recent time period. There are a number of ways of implementing this, including calculated members and named sets (for dimensional) and calculations for tabular, and the internet abounds with sample solutions. Some of them are fully automated (using VBA time functions) and some require someone to manually set the current period. I prefer the latter if possible, to avoid reports showing incorrect data if something went wrong in the ETL.
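One relational-side sketch of the manual approach: keep a flag in the date dimension that the ETL (or an administrator) sets after a verified load, and drive the SSAS named set, calculated member or DAX calculation off that flag. The table and column names below are assumptions, not a prescribed design.

-- Hypothetical IsCurrentMonth flag, set manually or by the last step of a successful ETL run
UPDATE dbo.DimDate
SET    IsCurrentMonth = CASE WHEN CalendarYear = 2014 AND CalendarMonth = 6
                             THEN 1 ELSE 0 END;

-- The "current period" set or default member in SSAS can then simply filter on IsCurrentMonth = 1.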

#28: Create a testable solution

This is a really big topic so I will emphasize what I have found most important. A BI solution has a lot of moving parts. You have your various source systems, your ETL pipeline, logic in the database, logic in your SSAS solution and finally logic in your reporting solution. Errors happen in all of these layers, but your Integration Services solution is probably the most vulnerable part. Not only do technical errors occur; far more costly are logic errors where your numbers don’t match what is expected. Luckily there are a lot of things you can do to help identify when these errors occur. As mentioned in tips #6 and #7 you should use a framework. You should also design your solution to be unit testable. This boils down to creating lots of small packages that can be run in isolation rather than large complex ones.

Most importantly, you should create validation queries that compare the data you load in your ETL with data in the source systems. How these queries are crafted varies from system to system, but a good starting point would be comparisons of row counts, sums of measures (facts) and numbers of unique values. The way I do it is to create the test before building anything. So if I am to load customers that have changed since X, I first create the test query for the source system (row counts, distinct values etc.), then the query for the data warehouse together with a comparison query, and finally I start building the actual integration. Ideally you will package this into a SSIS solution that logs the results into a table. This way you can utilize your validation logic both while developing the solution and once it’s deployed. If you are running SQL Server 2012 you might want to look into the data tap features of SSIS that let you inspect data flowing through your pipeline from the outside.
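A minimal sketch of what such a validation comparison can look like, with hypothetical source and warehouse tables; in a real solution an SSIS package would run this and log the result to a table:

-- Compare row counts, measure sums and distinct keys on both sides of the load (all names assumed)
DECLARE @LoadFrom DATETIME = '2014-06-01', @CurrentBatchId INT = 42;

SELECT  'Source' AS Side,
        COUNT(*) AS RowCnt,
        SUM(Amount) AS TotalAmount,
        COUNT(DISTINCT CustomerId) AS DistinctCustomers
FROM    src.SalesOrders
WHERE   ModifiedDate >= @LoadFrom
UNION ALL
SELECT  'Warehouse',
        COUNT(*),
        SUM(SalesAmount),
        COUNT(DISTINCT CustomerKey)
FROM    dbo.FactSales
WHERE   LoadBatchId = @CurrentBatchId;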

#29: Avoid the source if you are scaling for a large number of users

Building a BI solution to scale is another very large topic. If you have lots of data you need to scale your ETL, database and SSAS subsystems. But if you have lots of users (thousands) your bottleneck will probably be SSAS. Concurrently handling tens to hundreds of queries with acceptable performance is just not feasible. The most effective thing is to avoid hitting SSAS as much as possible. I usually take a two-pronged approach. Firstly, I implement as much as possible as standard (“canned”) reports that can be cached. Reporting Services really shines in these scenarios. It allows for flexible caching schemes that in most circumstances eliminate all trips to the data source. This will usually cover around 70-80% of requirements. Secondly, I deploy an ad-hoc cube specifically designed and tuned for exploratory reporting and analysis. I talked about this in tip #17. In addition you need to consider your underlying infrastructure. Both SSRS and SSAS can be scaled up and out. For really large systems you will need to do both, even with the best of caching schemes.

#30: Stick with your naming standards

There are a lot of objects that need to be named in a solution, from the more technical objects such as database tables and SSIS packages to objects exposed to users such as SSAS dimensions and measures. The most important thing with naming conventions is not what they are, but that they are implemented. As I talked about in tip #24, changing a name can have far-reaching consequences. This is not just a matter of things breaking if you change them; consider all of the support functionality in the platform, such as logging, that utilizes object names. Having meaningful, consistent names will make it a heck of a lot easier to get value out of this. So at the start of the project I would advise having a “naming meeting” where you agree upon how you will name your objects. Should dimension tables be prefixed with Dim or Dim_? Should dimension names be plural (Customers) or singular (Customer), etc.?

My 30 tips for building a Microsoft Business Intelligence solution, Part V: Tips 21-25

I might just get all 30 done before summer vacation!

#21: Avoid using discretization buckets for your dimension attributes

Discretization buckets let you group numerical attributes into ranges. Say you have a customer dimension including the age of the customer: you can use this feature to group customers into age clusters such as 0-5, 6-10 and so on. While you can tweak how the algorithm creates groups and even provide naming templates for the groups, you still have relatively limited control over them. Worst case scenario: a grouping that is referenced in a report is removed or changed by the algorithm. A better way of grouping these attributes is to do it yourself, either in the data source view or in a view in the database (there will be a separate tip on this). This way you have complete control over the distribution of values into groups and the naming of the groups.
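A minimal sketch of doing the grouping yourself in a database view (the table and column names are assumptions):

-- Hypothetical age bands kept under your own control instead of SSAS discretization
CREATE VIEW dbo.v_DimCustomer_AgeBands
AS
SELECT  CustomerKey,
        Age,
        CASE
            WHEN Age BETWEEN 0  AND 5  THEN N'0-5'
            WHEN Age BETWEEN 6  AND 10 THEN N'6-10'
            WHEN Age BETWEEN 11 AND 20 THEN N'11-20'
            ELSE N'21+'
        END AS AgeGroup
FROM    dbo.DimCustomer;

Because the buckets are plain expressions in the view, they will never be renamed or regrouped behind your back, and reports referencing them keep working.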

#22: Do not build a SSAS solution directly on top of your source system

SSAS has a couple of features that enable it to source data directly from a normalized data model typically found in business applications such as ERP systems. For instance you can “fake” a star schema through queries in the data source view. You can also utilize proactive caching to eliminate any ETL to populate your cube with data. This all sounds very tempting, but unfortunately I have never seen it work in reality. Unless you are working with a very small source system with impeccable data quality and few simultaneous users, you should avoid the temptation for all the usual reasons: proactive caching will stress your source system, data quality will most likely be an issue, integrating new data sources will be nearly impossible, etc. There is a reason BI projects spend 70-80% of their time modelling and integrating data.

#23: Deploy SSAS cubes with the deployment tool

If you are working with multiple environments (dev/test/prod), do not use the deployment functionality of Visual Studio to deploy to another environment. This will overwrite partitions and roles that may differ between environments. Use the deployment wizard instead.

#24: Remember that your SSAS cubes are a single point of failure

Keep in mind that most client tools do not cope well with changes to SSAS data models. Any renames or removals you do in the model will most likely cause clients that reference those entities to fail. Make sure you test all your reports against the changed model before deploying it to production. Also, if you allow ad-hoc access to your SSAS solution be aware that users may have created reports that you do not know about. Query logging may help you a little here (it gives you an indication of which attribute hierarchies are in use). The best way to avoid all of this is to thoughtfully design your cube and the naming of your SSAS objects so that there is no need to change or remove anything in the first place.
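If you have the query log enabled (the QueryLogConnectionString and QueryLogTableName server properties), a rough usage picture can be pulled with something like the sketch below. OlapQueryLog is the default table name; treat the column names as my recollection and verify them against your own instance.

-- Who is querying which objects, according to the SSAS query log
SELECT  MSOLAP_User,
        MSOLAP_ObjectPath,
        COUNT(*)       AS QueryCount,
        MAX(StartTime) AS LastQueried
FROM    dbo.OlapQueryLog
GROUP BY MSOLAP_User, MSOLAP_ObjectPath
ORDER BY QueryCount DESC;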

#25: Avoid “real time”

“Real time” means different things to different people. Some interpret it as “simultaneous to an event occurring” while others have more leeway and various levels of tolerance for delays. I prefer the term “latency”: how old can the data in the BI solution get before it needs to be refreshed? The lowest latency I have ever implemented is two hours. That is hours, not minutes. I know this does not sound very impressive, but that is honestly the best I have been able to do at a reasonable cost. When doing “real time” you need to consider a lot of factors: partitioning, changes to dimensions, ROLAP vs MOLAP / DirectQuery vs xVelocity, source system access, how to administer it, etc. These things add up quickly to a point where the value simply does not justify the cost.

SQL Server 2014 product guide available

From a BI perspective I do not see much new stuff except a lot of emphasis on Hadoop (“Big Data”) integration. One interesting thing I noted was that they actually mention PerformancePoint, which they have not talked about in a long time. I had my money on the service being killed or merged into something else like PowerView. Guess I was wrong, in the short term at least. And perhaps columnstore indexes + tabular in directquery mode is something to explore?

Get it here: http://www.microsoft.com/en-us/download/details.aspx?id=39269

My 30 tips for building a Microsoft Business Intelligence solution, Part IV: Tips 16-20

A note about the SSAS tips: Most tips are valid for both dimensional and tabular models. I try to note where they are not. 

#16: Implement reporting dimensions in your SSAS solution

Reporting dimensions are constructs you use to make the data model more flexible for reporting purposes. They usually also simplify the management and implementation of common calculation scenarios. Here are two examples:

  • A common request from users is to be able to select which measure to display for a given report in Excel through a normal filter. This is not possible with normal measures / calculations. The solution is to create a measure dimension with one member for each measure. Expose a single measure in your measure group (I frequently use “Value”) and assign the correct measure to it in your MDX script / DAX calculation based on the member selected in the measure dimension. The most frequently used measure should be the default member for this dimension. By doing this you not only give the users what they want, but you also simplify a lot of calculation logic, such as in the next example.
  • Almost all data models require various date-related calculations such as year to date, same period last year, etc. It is not uncommon to have more than thirty such calculations. To manage this effectively, create a separate date calculation dimension with one member for each calculation and do your time-based calculations based on what is selected in it. If you implemented the construct in the previous example, this can be done generically for all measures in your measure dimension (a relational sketch of both utility dimensions follows after this list). Here is an example of how to do it in tabular. For dimensional, use the time intelligence wizard to get you started.
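A relational sketch of the two utility dimensions described above, with assumed names; note that these tables only supply the members, while the actual reassignment of the generic “Value” measure happens in the MDX script or DAX calculations:

-- Hypothetical utility dimension tables backing the measure and date calculation dimensions
CREATE TABLE dbo.DimMeasure (
    MeasureKey  INT NOT NULL PRIMARY KEY,
    MeasureName NVARCHAR(50) NOT NULL   -- e.g. 'Sales Amount', 'Order Count'
);

CREATE TABLE dbo.DimDateCalculation (
    DateCalcKey  INT NOT NULL PRIMARY KEY,
    DateCalcName NVARCHAR(50) NOT NULL  -- e.g. 'Actual', 'Year To Date', 'Same Period Last Year'
);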

#17: Consider creating separate ad-hoc and reporting cubes

Analysis Services data models can become very complex. Fifteen to twenty dimensions connected to five to ten fact tables is not uncommon. Additionally, various analysis and reporting constructs (such as a time calculation dimension) can make a model difficult for end users to understand. There are a couple of features that help reduce this complexity, such as perspectives, role security and default members (at least for dimensional), but often the complexity is so ingrained in the model that it is difficult to simplify just by hiding measures / attributes / dimensions from users. This is especially true if you use the reporting dimensions I talked about in tip #16. You also need to consider the performance aspect of exposing a large, complex model to end user ad-hoc queries. This can very quickly go very wrong. So my advice is to consider creating a separate model for end users to query directly. This model may reduce complexity in a variety of ways:

  • Coarser grain (Ex: Monthly numbers not daily).
  • Less data (Ex: Only last two years, not since the beginning of time).
  • Fewer dimensions and facts.
  • Be targeted at a specific business process (use perspectives if this is the only thing you need).
  • Simpler or omitted reporting dimensions.

Ideally your ad-hoc model should run on its own hardware. Obviously this will add both investment and operational costs to your project but will be well worth it when the alternative is an unresponsive model.

#18: Learn .NET

A surprisingly high number of BI consultants I have met over the years do not know how to write code. I am not talking about HTML or SQL here but “real” code in a programming language. While we mostly use graphical interfaces when we build BI solutions, the underlying logic is still based on programming principles. If you don’t get these, you will be far less productive with the graphical toolset. More importantly, .NET is widely used in Microsoft-based solutions as “glue” or to extend the functionality of the core products. This is especially true for SSIS projects, where you quite frequently have to implement logic in scripts written in C# or VB.NET, but it also applies to most components in the MS BI stack. They all have rich APIs that can be used for extending their functionality and integrating them into solutions.

#19: Design your solution to utilize Data Quality Services

I have yet to encounter an organization where data quality has not been an issue. Even if you have a single data source you will probably run into problems with data quality. Data quality is a complex subject. It’s expensive to monitor and expensive to fix, so you might as well be proactive from the get-go. Data Quality Services is available in the BI and Enterprise editions of SQL Server. It allows you to define rules for data quality and monitor your data for conformance to these rules. It even comes with SSIS components so you can integrate it with your overall ETL process. You should include this in the design stage of your ETL solution, because implementing it in hindsight will be quite costly as it directly affects the data flow of your solution.

#20: Avoid SSAS unknown members

Aside from the slight overhead they cause when processing, having unknown members means that your underlying data model has issues. Fix them in the underlying data, not in the SSAS model.
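A minimal sketch of the kind of check that surfaces the underlying issue before SSAS turns it into an unknown member (table and column names assumed):

-- Fact rows whose product reference has no matching dimension row
SELECT  f.*
FROM    dbo.FactSales f
LEFT JOIN dbo.DimProduct p
       ON f.ProductKey = p.ProductKey
WHERE   p.ProductKey IS NULL;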

My 30 tips for building a Microsoft Business Intelligence solution, Part III: Tips 11-15

#11: Manage your own surrogate keys.

In SQL Server it is common to use an INT or BIGINT column set as IDENTITY to create unique, synthetic keys. The number is a sequence and a new value is generated when we execute an insert. There are some issues with this. Quite often we need this value in our Integration Services solution to do logging and efficient loads of the data warehouse (there will be a separate tip on this). This means that sometimes we need the value before an insert and sometimes after. You can obtain the last value generated by calling SCOPE_IDENTITY(), but this requires an extra trip to the server for every row flowing through your pipeline, and obtaining the value before an insert happens is not possible in a safe way. A better option is to generate the keys yourself through a script component. Google for “ssis surrogate key” and you will find a lot of examples.
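The script component itself is .NET code, but the idea is simply to seed a counter with the current maximum key and hand out incrementing values per row. A hedged sketch of the relational side of that seeding (table and column names assumed):

-- Seed value handed to the SSIS script component before the dimension load starts
DECLARE @NextCustomerKey INT;

SELECT  @NextCustomerKey = ISNULL(MAX(CustomerKey), 0) + 1
FROM    dbo.DimCustomer;

SELECT  @NextCustomerKey AS NextCustomerKey;
-- The script component then assigns NextCustomerKey, NextCustomerKey + 1, ... to incoming rows.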

#12: Excel should be your default front-end tool.

I know this is a little bit controversial. Some say Excel lacks the power of a “real” BI tool. Others say it writes inefficient queries. But hear me out. Firstly, if you look at where Microsoft is making investments in the BI stack, Excel is right up there at the top. Contrast that with what they are doing with PerformancePoint and Reporting Services and it’s pretty clear that Excel is the most future-proof of the lot. Microsoft has added a lot of BI features over the last couple of releases and continues to expand it through new add-ins such as Data Explorer and GeoFlow. Additionally, the integration with SharePoint gets tighter and tighter. The Excel web client of SharePoint 2013 is pretty much on par with the fat Excel client when it comes to BI functionality. This means that you can push new features out to users who have not yet upgraded to the newer versions of Excel. When it comes to the efficiency with which Excel queries SSAS, a lot has become better. But being a general analysis tool it will never be able to optimize its queries as you would if you wrote them specifically for a report.

Please note that I am saying “default”, not “best”. Of course there are better, purebred Business Intelligence front-ends out there. Some of them even have superior integration with SSAS. But it’s hard to beat the cost-value ratio of Excel if you are already running a Microsoft shop. If you add in the fact that many managers and knowledge workers already do a lot of work in Excel and know the tool well, the equation becomes even more attractive.

#13: Hug an infrastructure expert who knows BI workloads.

Like most IT solutions, Microsoft BI solutions are only as good as the hardware and server configurations they run on. Getting this right is very difficult and requires deep knowledge of operating systems, networks, physical hardware, security and the software that is going to run on these foundations. To make matters worse, BI solutions have workloads that often differ fundamentally from line-of-business applications in the way they access system resources and services. If you work with a person who knows both of these aspects, you should give him or her a hug every day, because they are a rare breed. Typically BI consultants know a lot about the characteristics of BI workloads but nothing about how to configure hardware and software to support them. Infrastructure consultants, on the other hand, know a lot about hardware and software but nothing about the specific ways BI solutions access them. Here are three examples:

Integration Services is mainly memory constrained. It is very efficient at processing data as a stream as long as there is enough memory for it. The instant it runs out of memory and starts swapping to disk you will see a dramatic decrease in performance. So if you are doing heavy ETL, co-locating this with other memory-hungry services on the same infrastructure is probably a bad idea.

The second example is the way data is loaded and accessed in data warehouses. Unlike business systems that often do random data access (“Open the customer card for Henry James”), data warehouses are sequential. Batches of transactions are loaded into the warehouse and data is retrieved by reports / Analysis Services models in batches. This has a significant impact on how you should balance the hardware and configuration of your SQL Server database engine and differs fundamentally from how you handle workloads from business applications.

The last example may sound extreme but is something I have encountered multiple times. When businesses outsource their infrastructure to a third party they give up some control and knowledge in exchange for an ability to “focus on their core business”. This is a good philosophy with real value. Unfortunately, if no one on the requesting side of this partnership knows what to ask for when ordering infrastructure for your BI project, what you get can be pretty far off from what you need. Recently a client of mine made such a request for a SQL Server based data warehouse server. The hosting partner followed their SLA protocol and supplied a high availability configuration with a mandatory full recovery model for all databases. You can imagine the exploding need for disk space for the transaction logs when loading batches of 20 million rows each night.

As these examples illustrate, it is critical for a successful BI implementation to have people with infrastructure competency on your BI team who also understand how BI solutions differ from “traditional” business solutions and can apply the right infrastructure configurations.
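As a footnote to that last example, the setting in question is the database recovery model; a quick way to check and change it, with a hypothetical database name:

-- See which recovery model each database runs under
SELECT name, recovery_model_desc
FROM   sys.databases;

-- A data warehouse that is reloaded in nightly batches rarely needs FULL recovery
ALTER DATABASE [EnterpriseDW] SET RECOVERY SIMPLE;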

#14: Use Team Foundation Server for your BI projects too.

A couple of years ago putting Microsoft BI projects under source control was a painful experience where the benefits drowned in a myriad of technical issues.  This has improved a lot. Most BI artifacts now integrate well with TFS and BI teams can greatly benefit from all the functionality provided by the product such as source control, issue tracking and reporting. Especially for larger projects with multiple developers working against the same solution TFS is the way to go in order to be able to work effectively in parallel. As an added benefit you will sleep better at night knowing that you can roll back that dodgy check-in you performed a couple of hours ago. With that said there are still issues with the TFS integration. SSAS data source views are a constant worry as are server and database roles. But all of this (including workarounds) is pretty well documented online.

#15: Enforce your attribute relationships.

This is mostly relevant for SSAS dimensional, but you should also keep it in mind when working with tabular. Attribute relationships define how the attributes of a dimension relate to each other (roll up into each other). For example, products roll up into product subgroups, which in turn roll up into product groups. This is a consequence of the denormalization process many data warehouse models go through, where complex relationships are flattened out into wide dimension tables. These relationships should be defined in SSAS to boost general performance. The magic best-practice analyzer built into Data Tools makes sure you remember this with its blue squiggly lines. Usually it takes some trial and error before you get it right, but in the end you are able to process your dimension without those duplicate attribute key errors. If you still don’t know what I am talking about, look it up online, such as here.

So far so good. Problems start arising when these attribute relationships are not enforced in your data source, typically a data warehouse. Continuing with the example from earlier: over time you might get the same product subgroup referencing different product groups (“parents”). This is not allowed and will cause processing of the dimension to fail in SSAS (those pesky duplicate key errors). To handle this a bit more gracefully than simply leaving your cube(s) in an unprocessed state (with the angry phone calls this brings with it), you should enforce the relationship at the ETL level, in Integration Services. When loading a dimension you should reject / handle cases where these relationships are violated and notify someone that this happened. The process should make sure that the integrity of the model is maintained by assigning “violators” to a special member of the parent attribute that marks them as “suspect”. In this way your cubes can still be processed while highlighting data that needs attention.
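A minimal sketch of the check the ETL can run before cube processing (names assumed): any subgroup referencing more than one parent group is exactly the row that would later cause a duplicate attribute key error.

-- Product subgroups that violate the subgroup -> group attribute relationship
SELECT  ProductSubgroup,
        COUNT(DISTINCT ProductGroup) AS ParentCount
FROM    dbo.DimProduct
GROUP BY ProductSubgroup
HAVING  COUNT(DISTINCT ProductGroup) > 1;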

My 30 tips for building a Microsoft Business Intelligence solution, Part II: Tips 6-10

# 6: Use a framework for your Integration Services solution(s) because data is evil

I know how it is. You may have started your ETL project using the SQL Server import / export wizard, or you may have done a point integration of a couple of tables through Data Tools. You might even have built an entire solution from the ground up and been pretty sure you thought of everything. You most likely have not. Data is a tricky thing. So tricky, in fact, that over the years I have built up an almost paranoid distrust of it. The only sure thing I can say is that it will change (both intentionally and unintentionally) over time and your meticulously crafted solution will fail. Best case, it simply stops working. Worst case, the errors never cause a technical failure but silently perform faulty insert / update / delete operations against your data warehouse for months. This is not discovered until you have a very angry business manager on the line who has been doing erroneous reporting up the corporate chain all along. And this is the more likely scenario. A good framework should have functionality for recording data lineage (what has changed) and the ability to gracefully handle technical errors. It won’t prevent these kinds of errors from happening, but it will help you recover from them a lot faster. For inspiration, read The Data Warehouse ETL Toolkit.
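To make “recording data lineage” a little more concrete, here is a minimal sketch of the kind of audit table such a framework writes to for every package run. All names and columns are assumptions; The Data Warehouse ETL Toolkit describes much richer variants.

-- Hypothetical audit table every SSIS package logs its run and row counts to
CREATE TABLE dbo.EtlAudit (
    AuditKey      INT IDENTITY(1,1) PRIMARY KEY,
    PackageName   NVARCHAR(200) NOT NULL,
    BatchId       INT           NOT NULL,
    StartTime     DATETIME      NOT NULL DEFAULT (GETDATE()),
    EndTime       DATETIME      NULL,
    RowsExtracted INT           NULL,
    RowsInserted  INT           NULL,
    RowsUpdated   INT           NULL,
    RowsRejected  INT           NULL,
    Succeeded     BIT           NULL
);
-- Fact and dimension rows carry the AuditKey so every row can be traced back to the load that produced it.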

#7: Use a framework for your Integration Services solution(s) to maintain control and boost productivity

Integration Services is a powerful ETL tool that can handle almost any data integration challenge you throw at it. To achieve this it has to be very flexible, and like many of Microsoft’s products it’s very developer-oriented. The issue with this is that there are as many ways of solving a problem as there are Business Intelligence consultants on a project. By implementing an SSIS framework (and sticking with it!) you ensure that the solution handles similar problems in similar ways. So when the lead developer gets hit by that bus, you can put another consultant on the project who only needs to be trained on the framework to be productive. A framework will also boost productivity. The up-front effort of coding it, setting it up and forcing your team to use it is dwarfed by the benefits of templates, code reuse and shared functionality. Again, read The Data Warehouse ETL Toolkit for inspiration.

#8: Test and retest your calculations.

Get into the habit of testing your MDX and DAX calculations as soon as possible, ideally as soon as you finish a calculation, scope statement, etc. Both MDX and DAX get complicated really fast, and unless you are a Chris Webb you will lose track pretty quickly of dependencies and why numbers turn out as they do. Test your statements in isolation and the solution as a whole and verify that everything works correctly. These things can also have a severe performance impact, so remember to clear the Analysis Services cache and do before-and-after testing (even if you have a cache warmer). Note that clearing the cache means different things for tabular and dimensional, as outlined here.

#9: Partition your data and align it from the ground up.

Note that you need the Enterprise edition of SQL Server for most of this. If you have large data sets you should design your solution from the ground up to utilize partitioning. You will see dramatic performance benefits from aligning your partitions all the way from your SSIS process to your Analysis Services cubes / tabular models. Alignment means that if you partition your relational fact table by month and year, you should do the same for your Analysis Services measure group / tabular table. Your SSIS solution should also be partition-aware to maximize its throughput by exploiting your partitioning scheme.
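A minimal sketch of the relational side of such an alignment, partitioning a fact table by month; boundary values and names are illustrative only, and the SSAS measure group partitions and SSIS load would be cut along the same monthly boundaries:

-- Monthly partition function, scheme and fact table
CREATE PARTITION FUNCTION pfMonth (DATE)
AS RANGE RIGHT FOR VALUES ('2014-01-01', '2014-02-01', '2014-03-01');

CREATE PARTITION SCHEME psMonth
AS PARTITION pfMonth ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactSales (
    DateKey     DATE           NOT NULL,
    CustomerKey INT            NOT NULL,
    SalesAmount DECIMAL(18, 2) NOT NULL
) ON psMonth (DateKey);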

#10: Avoid using the built-in Excel provider in Integration Services.

I feel a bit sorry for the Excel provider. It knows that people seeing it will think “Obviously I can integrate Excel data with my SSIS solution; it’s an MS product and MS knows that much of our data is in Excel”. The problem is that Excel files are inherently unstructured, so for all but the simplest Excel workbooks the provider will struggle to figure out what data to read. Work around this by either exporting your Excel data to flat files or looking at some third-party providers.

My 30 tips for building a Microsoft Business Intelligence solution, Part I: Tips 1-5

Having worked with Microsoft BI for more than a decade now, here are the top 30 things I wish I had known before starting development of a solution. These are not general BI project recommendations such as “listen to the business” or “build incrementally” but specific lessons I have learned (more often than not the hard way) designing and implementing Microsoft based Business Intelligence solutions. So here are the first five:

#1: Have at least one SharePoint expert on the team.

The vast majority of front-end BI tools from Microsoft are integrated with SharePoint. In fact, some of them only exist in SharePoint (for instance PerformancePoint). This means that if you want to deliver Business Intelligence with a Microsoft solution, you will probably deliver a lot of it through SharePoint. And make no mistake: SharePoint is very complex. You have farms, site collections, lists,  services, applications, security… the list goes on and on. To make matters worse you may have to integrate your solution with an already existing SharePoint portal. There is a reason there are professional SharePoint consultants around, so use them.

#2: Do not get too excited about Visio integration with Analysis Services.

Yes, you can query and visualize Analysis Services data in Visio. You may have seen the supply chain demo from Microsoft, which looks really flashy. You might think of a hundred cool visualizations you could do. Before you spend any time on this or start designing your solution to utilize it, try out the feature. While it’s a great feature, it requires a lot of work to implement (at least for anything more than trivial). Also, it (currently) only supports some quite specific reporting scenarios (think decomposition trees).

#3: Carefully consider when to use Reporting Services.

Reporting Services is a great report authoring environment. It allows you to design and publish pixel-perfect reports with lots of interactivity. It also provides valuable services such as caching, subscriptions and alerts. This comes at a cost though: the effort needed to create SSRS reports is quite high and requires a specialized skill set. This is no end user tool. There are also issues with certain data providers (especially Analysis Services). But if you need any combination of multiple report formats, high scalability (caching, scale-out), subscriptions or alerts, you should seriously consider Reporting Services.

#4: Use Nvarchar / unicode strings throughout the solution.

Unless you live in the US (and are pretty damn sure you will never have “international data”), use Unicode. Granted, varchars are more efficient, but you do not want to deal with collations / code pages. Ever. Remember this is not only an issue with the database engine but also with other services such as Integration Services.
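A tiny illustration of the habit (table name assumed): declare string columns as NVARCHAR and keep the N prefix on literals, otherwise the value passes through the database’s code page on its way in.

CREATE TABLE dbo.CustomerNames (
    CustomerKey  INT           NOT NULL,
    CustomerName NVARCHAR(100) NOT NULL   -- Unicode, survives any source collation
);

INSERT INTO dbo.CustomerNames (CustomerKey, CustomerName)
VALUES (1, N'Crème Brûlée AS');           -- the N prefix keeps the characters intact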

#5: Check if it exists on codeplex.

Do not build anything before you have checked codeplex. Chances are someone has already done the same thing or something similar that can be tweaked. If you are skeptical of including “foreign” code in your solution (like me), use the codeplex code as a cheat sheet and build your own based on it. There is a lot of stuff there, including SSAS stored procedures, SSIS components and frameworks and much more.

Bring back the default member in SSAS tabular

Even though tabular models are a lot less complex than dimensional ones, we still need to simplify the model for the end user for ad-hoc reporting and analysis. One of the more helpful tools we had for doing this was the default member: by referencing the most commonly used member of an attribute hierarchy, the user did not have to select it unless they explicitly wanted to see something else. Please bring it back!

Are maps the new gauges?

Over the past couple of years most  data visualization vendors have been adding spatial / mapping related functionality to their product suites. The first iterations were cumbersome to use with special geographic data types that needed to be projected onto custom maps. Today it is much, much simpler with capabilities to automatically map geography related attributes (such as state and zip code). This lets existing data sets be plotted onto maps without the need for spatial references such as longitude/latitude or complex vector shapes. When doing this for the first time it is almost magical. You select a measure, specify some geographical attributes and presto: Bars appear on the map in the right places. For us data enthusiasts this leads to a mapping frenzy where we take every data set in our repository and project it onto maps in more and more intricate ways. This was the exact same thing that happened when I first started playing around with gauges  (speedometers, thermometers  etc.) and other “fancy” visualizations when they became available oh so many years ago. Today I roll my eyes at that kind of wasted “artistry”:  So many pixels, so little information. So after having cooled down from my initial childish joy over a new way to display data I started thinking about its value.

When it comes to data visualizations I always ask myself: does this add value to the data compared to displaying it in a simple table? With gauges it’s pretty easy to answer that one: no. With maps? A little more difficult. The thing is: maps encode information that is useful in itself and is universally understood. Information such as location, distance and area is easily grasped by basically anyone looking at a map. Plotting data points onto a map can add value by leveraging this. Here are some examples:

  • Highlight clusters through color coding.
  • Give a scale of density of some occurrence.
  • Show the distance between occurrences of something.

However, the data itself must be of a kind where this information is not readily apparent. For instance, a map of the US with states color coded by the percentage they contribute to total sales (who has not seen this?) does not add any value compared to a table. The map is not adding any context to the data; it is basically there for show, much like the good old gauges. My point is that the data needs to be geographically relevant. What we show has to relate to the information inherently present in geographic encoding. The volume of data also has to be big enough that these relationships are not obvious, or significant work would need to be done to categorize them in order for them to make sense. A good example of this is the “Chicago Crime Data” sample data set provided with the public preview of GeoFlow for Excel (scroll down a bit on the page). Here we see how the map adds a lot of understanding to a data set that is geographically relevant. Deducing the insights we get from simply looking at the clustering in the map would be impossible by scrolling through the data set. If we were to present this in tabular form we would have a very hard time conveying the spatial information a map gives us; a lot of upfront work would be needed to create the kind of clusters and spatial information the map provides.

So in short: Are maps the new gauges? I would say not really. There is true value to be exploited by projecting data points onto a map. But as always, the right  tool should be used for the job at hand.

SQL Server Analysis Services StressTester Beta 1.1.3

Fixes numerous bugs including:

  • Thread related issues.
  • Wrong timing of queries.
  • Only one instance of the server / client is allowed to run at a time on the same machine.
  • Query counts not updating correctly.
  • Network code made a little more efficient by sending query results from the client to the server in batches of five.
  • Overall memory consumption lowered significantly in the Server.

SQL Server Analysis Services StressTester Beta 1.1.2

Update: There is an issue with how the test progress is displayed in both the server and client. This is due to the new multithreading support which wreaks havoc with my variables and / or events. This does not affect the execution of the test itself, just the progress report.

Note that I have changed the version scheme to align with the codeplex scheme.

This release fixes some nasty bugs and adds a couple of features:

  • Fixed a bug where the server would crash if clients could not connect to the target.
  • Added tool-tips to most controls.
  • Added a log window to the server that displays server and client activity.
  • Added the option to run multiple threads on clients.
  • Expanded the delay feature so that client threads pause a random number of milliseconds between a low and high value before issuing the next query.
  • Made the server a lot less resource (CPU) intensive.

Note that existing tests (.test) will break with this release. 

SQL Server Analysis Services Stress Tester Alpha 2.1 release

Some minor stuff:

  • Query editor now accepts newline
  • Added option to clear SSAS cache before executing test

If you have saved tests in the previous version these will break unless you add the following to the xml under the <test> root:
<test>
….
<clearcache>1</clearcache>
….
</test>

Yes, I am still planning to do some documenting 😉

stresstester.codeplex.com