Data Quality – The Silent Killer…
In this post, I am going to talk about how Data Quality can kill an Omniture (or other Web Analytics) implementation. I will share some of the problems I have seen and show some ways that you can help improve Data Quality…
Sound Familiar?
So you have been managing an Omniture implementation for a while. You have your KPI's lined up. You have been sharing dashboards and reports with people throughout your company. People are starting to realize that they should talk to you before making website business decisions. Suddenly, you find yourself in the executive suite to answer some key website questions. Then, just as you are wrapping up your web traffic overview, an executive starts to calculate some numbers on a notepad and determines that the increase you show in Paid Search traffic doesn't look right given other data she has seen from the SEM team. She also questions the rise in traffic data for EMEA, knowing that her VP in the region told you traffic has been down over the last few months. Suddenly, you are in a web analytics death spiral. In a split second, you have to decide: do you defend your Omniture data and risk your reputation, or do you back-pedal, saying you will re-check the web analytics data, and live to fight another day?
Hopefully this hasn't happened to you, but it has happened to most of us who have been around the web analytics space long enough. Unfortunately, you only get so many chances to be wrong about the data you are presenting, and even if your data is right, if you aren't confident enough to stand by it, it might as well be wrong.
Minimizing Data Quality Risk
So how do you avoid this situation? The first step is to realize that there is no way to be sure that all of your web analytics data is correct. 100% Data Quality is not only unattainable, but also not worth the time and effort it would take to achieve. Therefore, I follow a philosophy of risk minimization, in which I use various techniques to address the key things that cause data quality issues. The following are some of the ways to do this:
Ensure all Pages are Tagged
This is easier said than done. As we all know, IT is usually the team that deploys JavaScript tags, and they often have more important things to do than to guarantee that every website page has the correct JavaScript tag. Fostering a good relationship with IT helps, but at the end of the day, new website pages are created all the time, and tags will be missing.
Use Technology
As you can imagine, where there is a need, there are technology vendors. The main vendors that I have worked with or heard the most about are WASP and ObservePoint. Not completely coincidentally, ObservePoint was founded by John Pestana, one of the co-founders of Omniture. In a great blog post, John Pestana talked about getting rid of asterisks in web analytics reports. I am sure there are many other vendors out there offering similar products, but the gist of the technology is that it can spider your website and let you know which pages are missing JavaScript tags so you can eliminate any obvious omissions.
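If you want a quick homemade check before (or alongside) one of these tools, even a simple script can catch obvious gaps. Below is a minimal sketch in Python, not any vendor's product: the URLs, the "s_code.js" file name, and the "s_account" check are all assumptions that you would adapt to your own implementation.

```python
# Hypothetical sketch: fetch a short list of pages and flag any that appear to be
# missing the SiteCatalyst JavaScript include. The URL list, the "s_code.js" file
# name, and the "s_account" check are assumptions -- adjust to your implementation.
import requests

PAGES_TO_CHECK = [
    "https://www.example.com/",
    "https://www.example.com/login",
    "https://www.example.com/search-results",
]

def has_sitecatalyst_tag(html: str) -> bool:
    # Treat a page as tagged if it references the s_code file or declares
    # an s_account report suite variable.
    return "s_code.js" in html or "s_account" in html

for url in PAGES_TO_CHECK:
    try:
        response = requests.get(url, timeout=10)
        if not has_sitecatalyst_tag(response.text):
            print(f"MISSING TAG: {url}")
    except requests.RequestException as exc:
        print(f"Could not check {url}: {exc}")
```

A check like this obviously won't catch pages behind logins or content rendered dynamically, which is where the commercial spiders earn their keep.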
Blood, Sweat & Tears
Unfortunately, the main way that I have minimized web analytics data loss is by downloading data and looking for anomalies. I normally do this by taking advantage of the Omniture SiteCatalyst Excel Client, downloading key data blocks by day or week, and then using formulas to compare yesterday to the same day last week, or last week to the week prior. Once you have the data in Excel, you can do any type of statistical analysis you want to see if anything looks "fishy." One thing I like to do is use Excel conditional formatting to spot data issues.
The following is a screenshot example of using Excel to spot potential data issues. In this example, I am comparing each day's Page Views to the same day one week prior, and if there is a change of more than 20%, I highlight it in red:
Uh-oh… It looks like our daily data quality report indicates that we may have lost a tag on Friday for the Login page and something suspicious took place related to the Search Results page the same day. Obviously, the downsides of this approach are that it is extremely manual and that it is in arrears. As you know, once you miss a time slot of data in SiteCatalyst, there is no easy way to get that data back. While this approach can minimize the data loss to a day, it won’t help you get the Login Page data back in the example above.
Therefore, the way I employ this approach is to focus on the top items within each variable. This means I focus on the pages with the most Page Views, the Form IDs with the most Form Completions, the Orders for the most popular products, etc. With the Excel Client, you can download multiple data blocks at once and then use conditional formatting to easily spot the issues. Done intelligently, Data Quality checks covering 80% of your data can be completed in a few hours each day. By doing this, you can feel more confident when your VP questions your data, knowing that if something were significantly off, you would have known about it ahead of time.
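If you prefer scripting to Excel, the same week-over-week comparison can be automated. The sketch below is hypothetical: it assumes you have exported daily Page Views by page to a CSV file (the file name and column names are made up) and flags any day where Page Views swing by more than 20% versus the same day one week prior, mirroring the conditional-formatting rule above.

```python
# Hypothetical sketch of the same check done in code rather than Excel.
# Assumes a CSV export with one row per page per day and the columns:
# date, page, page_views (names are assumptions).
import pandas as pd

df = pd.read_csv("daily_page_views.csv", parse_dates=["date"])
df = df.sort_values(["page", "date"])

# Compare each day to the same day one week prior for every page
# (assumes daily rows with no gaps, so shifting 7 rows = 7 days).
df["prior_week"] = df.groupby("page")["page_views"].shift(7)
df["pct_change"] = (df["page_views"] - df["prior_week"]) / df["prior_week"]

# Flag swings of more than 20% in either direction, mirroring the Excel rule.
flagged = df[df["pct_change"].abs() > 0.20].dropna(subset=["prior_week"])
print(flagged[["date", "page", "page_views", "prior_week", "pct_change"]])
```

Something like this could run each morning so the daily review becomes a quick scan of flagged rows rather than a manual hunt.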
Special Cases
I have found that there are a few other situations that commonly lead to missing or bad data, so I want to quickly bring them to your attention so you can apply some additional effort to ensure they are tagged correctly:
- "Lightbox" pages, where a new HTML page is often not loaded. These are often created as a window within a window, and developers frequently forget to put SiteCatalyst code within them.
- Flash/AJAX pages, where the page changes dynamically or where an entire site/page is developed in Flash. Be extra careful around these as they are often missing tracking code (especially when done by an outside agency!).
- Dynamically generated content, such as a page that shows historical stock price data after a user enters a ticker symbol. Oftentimes, these dynamic pages are tagged as a single page, but they might be better tracked as unique pages from a web analytics viewpoint.
SiteCatalyst Alerts
If you have read my previous blog post on Alerts, you may have figured out that you can use Alerts to help with Data Quality as well. Alerts can be used to look for changes in key metrics by Month, Week, or Day (or Hour in some cases). These alerts are handy for being notified when data is off by more than x%. However, I have found that if you want to look at more granular data (as in the example above), the current Alert functionality can be a bit limiting. You can set alerts for specific sProp and eVar values, but not as easily as you can by using Excel. Therefore, I would use Alerts as an early warning system and employ the previously mentioned techniques as your main defense against missing data.
Classification Data
Finally, when thinking about data quality/completeness, don't forget about SAINT Classifications. If you have key reports that rely on SAINT Classifications, then even if the source data is collected perfectly, missing classifications will make your reports incorrect and, in the eyes of your users, indistinguishable from poor data quality. You will know if you are missing SAINT Classification data if your classified reports have a large "None" row. So how do you ensure your SAINT Classification data is complete? What I do is create Excel data blocks for each Classification and isolate the "None" row for key metrics.
In the screenshot below, you can see that I have created a data block that looks for "Unspecified" Site Locale Full Names (for some reason, the Excel Client uses "Unspecified" instead of "None"). In this scenario, I store a 2-digit website country identifier in an eVar and use a SAINT Classification to provide a full name. I filter on "Unspecified" for the Visits, Form Views, and Form Completes metrics.
After running it, you will see a succinct report that looks like this:
In this case, there are no Form View or Form Complete Success Events missing a Full Site Locale SAINT Classification, but there are some Visits missing the classification. You can then easily go into SiteCatalyst or Discover, open the Full Locale Name report and break it down by its source to find out what values are left to be classified.
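If you want to script this check as well, the idea translates directly. The sketch below is hypothetical: it assumes you have exported a classified report to a CSV file (the file name and column names are made up) and simply computes what share of Visits is sitting in the unclassified row.

```python
# Hypothetical sketch: compute the share of a metric that is unclassified.
# Assumes a CSV export with the columns: locale_full_name, visits
# (names are assumptions).
import pandas as pd

report = pd.read_csv("site_locale_classification.csv")

# SiteCatalyst shows unclassified rows as "None"; the Excel Client uses "Unspecified".
unclassified_mask = report["locale_full_name"].isin(["None", "Unspecified"])

unclassified_visits = report.loc[unclassified_mask, "visits"].sum()
total_visits = report["visits"].sum()

pct_unclassified = unclassified_visits / total_visits if total_visits else 0
print(f"Unclassified Visits: {pct_unclassified:.1%}")
```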
Finally, if you want to earn "extra credit," you can do this for all of your SAINT Classifications in one Excel workbook and make a summary screen like the one below, which pulls the unclassified percentages into one view so you can see how you are doing overall. What is cool about this is that you can use the "Refresh All" feature of the Excel Client to check all of your Classifications while you get coffee, and when you get back, you have a fully updated view of your SAINT Classifications. In the screenshot below, I have shaded items in black that are OK if they aren't fully classified, items in green that are acceptable, and items in red that require attention:
Final Thoughts
As you can see, Data Quality is a HUGE topic, so it is hard to cover it all in one post, but hopefully some of the pointers here will get you thinking about how you can improve in this area. One last thing I will mention is that, like most things related to web analytics, tools are good, but qualified people are better! Therefore, I think that any serious web analytics team should have a resource who has Data Quality as one of their primary performance objectives. Without this, Data Quality tends to fall by the wayside. Try to do whatever you can to convince your management that having a full- or part-time person devoted to Data Quality will pay hefty dividends in the future…