My 30 tips for building a Microsoft Business Intelligence solution, Part VI: Tips 26-30

This is the last part in my series of things I wished I knew about before starting a Microsoft BI project. I’ll be taking my summer vacation now so the blog will be quiet the next month. After the break I will revise a couple of the tips based on feedback so stay tuned. 

#26: Decide how to source your data in Analysis Services and stick with it.

Ideally you will source your data from a correctly modeled star schema. Even then you may need to massage the source data before feeding it into SSAS. There are two ways of accomplishing this: Through views in the database or through data source views (dimensional) or queries (tabular). Unless you are unable to create views in your database (running on a prod system etc) I would strongly suggest using them. This will give you a clean separation of logic and abstraction between the SSAS solution and the data source. This means that clients connecting to the data warehouse directly will see the same data model as the SSAS solution. Also migrating between different front-ends (like dimensional and tabular) will become much simpler. In my solutions I never connect to tables directly I always bind to views for everything and never implement any logic in the DSV or via queries.

#27: Have some way of defining “current” time periods in your SSAS solution

Most SSAS solutions have a time dimension with dates, months, years, etc. In many ways its the most important dimension in your solution as it will be included in most reports / analyses as well as form the basis for a lot of calculations (see previous tips). Having a notion of what is the current period in your time dimension will greatly improve the usability of your solution: Reports will automatically be populated with the latest data without any user interaction. It can also simplify  ad-hoc analysis by setting the default members to the most current date / month / year so that when users do not put these on one of the axes it will default to the most recent time period. There are a number of ways of implementing this including calculated members and named sets (for dimensional) and calculations for Tabular and the internet is abundant with sample solutions. Some of them are fully automated (using VBA time functions) and some require someone to manually set the current period. I prefer to use the latter if possible to avoid reports showing incorrect data if something went wrong in the ETL.

#28: Create a testable solution

This is a really big topic so I will emphasize what I have found most important. A BI solution has a lot of moving parts. You have your various source systems, your ETL pipeline, logic in the database, logic in your SSAS solution and finally logic in your reporting solution. Errors happen in all of these layers but your integration services solution is probably the most vulnerable part. Not only do technically errors occur, but far more costly are logic errors where your numbers don’t match what is expected. Luckily there are a lot of things you can do to help identify when these errors occur. As mentioned in tips #6 and #7 you should use a framework. You should also design your solution to be unit testable. This boils down to creating lots of small packages that can be run in isolation rather than large complex ones. Most importantly you should create validation queries that compares the data you load in your ETL with data in the source systems. How these queries are crafted varies from system to system but a good starting point would be comparisons of row counts, sums of measures (facts) and number of unique values. The way I do it is that I create the test before building anything. So if I am to load customers that have changed since X, I first create the test query for the source system (row counts, distinct values etc.) then the query for the data warehouse together with a comparison query and finally I start building the actual integration. Ideally you will package this into a SSIS solution that logs the results into a table. This way you can utilize your validation logic both while developing the solution but also once its deployed. If you are running SQL Server 2012 you might want to look into the data tap features of SSIS that lets you inspect data flowing through your pipeline from the outside.

#29: Avoid the source if you are scaling for a large number of users

Building a BI solution to scale is another very large topic. If you have lots of data you need to scale your ETL, Database and SSAS subsystems. But if you have lots of users (thousands) your bottleneck will probably be SSAS. Concurrently handling tens to hundreds of queries with acceptable performance  is just not feasible. The most effective thing is to avoid this as much as possible. I usually take a two pronged approach. Firstly I implement as much as possible as standard (“canned”) reports that can be cached. Reporting Services really shines in these scenarios. It allows for flexible caching schemes that in most circumstances eliminates all trips to the data source. This will usually cover around 70-80% of requirements. Secondly I deploy an ad-hoc cube specifically designed and tuned for exploratory reporting and analysis. I talked about this in tip #17. In addition you need to consider your underlying infrastructure. Both SSRS and SSAS can be scaled up and out. For really large systems you will need to do both, even with the best of caching schemes.

#30: Stick with your naming standards

There are a lot objects that need to be named in a solution. From the more technical objects such as database tables and SSIS packages to objects exposed to users such as SSAS dimensions and measures. The most important thing with naming conventions is not what they are, but that they are implemented. As I talked about in tip #24 changing a name can have far reaching consequences. This is not just a matter of things breaking if you change them but consider all of the support functionality in the platform such as logging that utilize object  names.  Having meaningful, consistent names will make it a heck of a lot easier to get value out of this.  So at the start of the project I would advise to have a “naming meeting” where you agree upon how you will name your objects. Should dimension tables be prefixed with Dim or Dim_? Should Dimension names be plural (CustomerS) or singular (Customer), etc.

My 30 tips for building a Microsoft Business Intelligence solution, Part V: Tips 21-25

I might just get all 30 done before summer vacation!

#21: Avoid using discretization buckets for your dimension attributes

Discretization buckets lets you group numerical attributes into ranges. Say you have a customer dimension including the age of the customer you can use this feature to group them into age clusters such as 0-5, 6-10 and so on. While you can tweak how the algorithm creates groups and even provide naming templates for the groups you still have relatively limited control over them. Worst case scenario: A grouping is removed / changed by the algorithm which is referenced in a report. A better way of grouping these attributes is by doing it yourself either in the data source view or a view in the database (there will be a separate tip on this). This way you have complete control over the distribution of values into groups and the naming of the groups.

#22: Do not build a SSAS solution directly on top of your source system

SSAS has a couple of features that enable it to source data directly from a normalized data model typically found in business applications such as ERP systems. For instance you can “fake” a star schema through queries in the data source view. You can also utilize proactive caching to eliminate any ETL to populate your cube with data. This all sounds very tempting but unfortunatly I have never seen this work in reality. Unless you are working with a very small source system with impeccable data quality and few simultanous users you should avoid the temptation for all the usual reasons: Proactive caching will stress your source system, data quality will most likely be an issue, integrating new data sources will be nearly impossible,etc. There is a reason BI projects spend 70-80% of their time working with modelling and integrating data.

#23: Deploy SSAS cubes with the deployment tool

If you are working with multiple environments (dev/test/prod) do not use the deployment functionality of visual studio to deploy to another environment. This will overwrite partitions and roles that may be different between the environments. Use the deployment wizard.

#24: Remember that your SSAS cubes are a single point of failure

Keep in mind that most client tools do not cope well with changes to SSAS data models. Any renames or removals you do in the model will most likely cause clients that reference those entities to fail. Make sure you test all your reports against the changed model before deploying it to production. Also, if you allow ad-hoc access to your SSAS solution be aware that users may have created reports that you do not know about. Query logging may help you a little here (it gives you an indication of which attribute hierarchies are in use). The best way to avoid all of this is to thoughtfully design your cube and the naming of your SSAS objects so that there is no need to change or remove anything in the first place.

#25: Avoid “real time”

“Real time” means different things to different people. Some interpret  it as “simultaneous to an event occurring” while others have more leeway and have various levels of tolerance for delays.  I prefer the term “latency”: How old can the data in the BI solution get before it needs to be refreshed?. The lowest latency I have ever implemented is two hours. That is hours not minutes. I know this does not sound very impressive but that is honestly the best I have been able to do at a reasonable cost. When doing “real time” you need to consider a lot of factors: Partitioning, changes to dimensions, ROLAP vs MOLAP / direct query vs xVelocity, source system access, how to administer it, etc., etc. These things add up quickly to a point where the value simply does not justify the cost.