All Web Analytics Tools Are the Same (when it comes to data capture)
I started to write a post on using web analytics tools — Google Analytics, specifically, but with a nod to Webtrends as well — to track traffic to custom tabs and interactive elements on Facebook pages. But, as I started thinking through that content, I realized that I needed to back up and make sure I had a good, clean explanation of a key aspect of the mechanics of page tag-based web analytics tools. I poked around on the interweb a bit and found some quick explanations that were accurate, but that really weren’t as detailed as I was hoping to find.
Regardless of whether you’re trying to track Facebook or not, it’s worth having a good, solid understanding of these underlying mechanics:
- If you’re a web analyst, understanding this is like understanding gravity if you’re a human being — there are some immutable laws of the internet, and knowing how those laws drive the data you are seeing will open up new possibilities for capturing activity on your site
- If you’re a developer, then this will be a quick read, but understanding it will make you the hero to both your web analysts and (assuming they’re not glory hogs) the people they support with their analysis, because you will be able to suggest some clever ways to capture useful information
By the end of this post, you should understand both the title and why the URLs I listed below are what make it so:
- Google Analytics = http://www.google-analytics.com/__utm.gif
- Webtrends = http://statse.webtrendslive.com/<ID>/dcs.gif
- Sitecatalyst = https://<custom domain>/b/ss/<account name>/1/<code version>/<random ID>
- Coremetrics = http://<custom domain>/cm or http://<custom domain>/eluminate
I’ve been deep under the hood with both Google Analytics and Webtrends for this, but the same principles apply to all tools (because they’re all bounded by the Physics of the Internet). I’m going to talk about Google Analytics the most in-depth, because it has the largest market share (measured by number of sites tagged with it), and I’ll try to call out key differences when appropriate.
Let’s start with a simple picture of how all of these tools work. When a visitor comes to a page on your site, the following sequence of events happens:
Steps 2 and 3 are really the crux of the biscuit, but we need to make sure we’re all clear on the first step, too, before getting to the fun there.
- Site = www.gilliganondata.com
- Page title = The Fun of Facebook Measurement
- Page URL = /index.php/2010/01/11/the-fun-of-facebook-measurement/
- Browser language = en-us
Converting that info into a single string is pretty straightforward. Let’s start by pretending we’re going to put it into a single row in a pipe-delimited file. It would look like this:
Site (hostname) = www.gilliganondata.com | Page name = The Fun of Facebook Measurement | Page URL = /index.php/2010/01/11/the-fun-of-facebook-measurement/ | Browser language = en-us
Now, rather than using the pretty, readable names for each of the four characteristics of the page view, let’s use some variable names (these are the Google Analytics variable names, but the documentation for any web analytics tool will provide their specific variable names for these same things):
- Site (hostname) –> utmhn
- Page title –> utmdt
- Page URL –> utmp
- Browser language –> utmul
So, now our string looks like:
utmhn = www.gilliganondata.com | utmdt = The Fun of Facebook Measurement | utmp = /index.php/2010/01/11/the-fun-of-facebook-measurement/ | utmul = en-us
We used pipes to separate out the different variables, but there’s nothing really wrong with using something different, is there? Let’s go with using “&” instead and eliminate the spaces around equal signs and the delimiters. The single string now looks like this:
utmhn=www.gilliganondata.com&utmdt=The Fun of Facebook Measurement&utmp=/index.php/2010/01/11/the-fun-of-facebook-measurement/&utmul=en-us
Now, we’ve still got some “special” characters that aren’t going to play nice in the Step 3 — namely spaces and “/”s, so let’s replace those characters with the appropriate URL encoding (%20 for the spaces and %2F for the “/”s):
It looks a little messy, but it’s a single, portable string that has the exact information that was listed in the four bullets that started this section. While it might be painful to reverse-engineer this string into a more reader-friendly format by hand, it’s a snap to do programmatically (which is exactly what web analytics tools do…as we’ll discuss in step 4) or in Excel.
- It just generally makes the process more robust because it reaffirms to the server exactly what each value means at the point the server receives the information; the internet is messy, so hiccups can happen
- Most “advanced” features when it comes to capturing web analytics data rely on tacking on additional parameters to the master string — by including both the key and the value for every parameter, that fanciness doesn’t have to worry about the order the parameters are passed in, AND it means the custom parameters get viewed/processed exactly the same way that the basic parameters do
- The “key-value pairs separated by the & sign” are standard on the internet. Go to any online retail site and poke around, and you will see them in the URL. It’s kind of a standard way to transmit a series of variables onto the back end of a web page or image request, and that’s really all that’s going to happen in step 3
We’ve got our string, so now let’s do something with it!
Somehow, we need to pass that string back to the web analytics server. We do that by making an image call. In the case of Google Analytics that image request is always, always, always exactly the same, no matter the site using Google Analytics:
Just like we covered in the “online retail site” URL structure discussion at the end of the last section, we’re going to tack some parameters on the end of the __utm.gif request. The standard way to take a base URL and tack on parameters is to add a “?” followed by one or more key-value pairs that are separated by an “&” sign. Lucky for us, the “&” sign is what we used when we were building our string in the last section! So:
Wow, that looks messy, but it just looks messy — it’s actually quite clean! In reality, there are way more than five parameters tacked onto the image request. As a matter of fact, the request above would really look more like this:
You can get a complete list of the Google Analytics tracking variables from Google (if you’re really into this, check out the utmcc value — that actually is a single parameter that includes multiple sub-parameters, which are separated by “%3D” — a URL-encoded semicolon — instead of an “&”; these are the user cookie values, which you can find towards the end of the long string above if you look for it). You can inspect the specific calls using any number of tools. I like to use the Firebug plugin for Firefox, but Fiddler is another free tool, and Charles is the standard tool used at my company. And, there’s always WASP to provide the “clean” view of the parameters (I use WASP heavily…unless I’m trying to reverse-engineer the specific calls being made for some reason).
- The request for the image is what matters — while the 1×1 image will get delivered back, by the time www.google-analytics.com actually sends out the image, the page view has already been counted. As a matter of fact, if there was no __utm.gif image, the traffic would still get counted simply by virtue of the fact that the Google Analytics server received the image request. As it happens, some other little user experience hiccups can happen if there’s no actual image, but the existence of the file matters ‘nary at all from a data capture perspective!
- Yes, you can actually just request the image directly from your browser. Go ahead — here’s the URL as a hyperlink: http://www.google-analytics.com/__utm.gif (yeah, it’s something of a letdown, but now you can say you’ve done it)
And one more point: While we’ve been talking about “image requests” here, it doesn’t have to be an image request per se. In the case of Google Analytics, it is. In the case of Webtrends, it is, too (the image is called dcs.gif). In the case of other web analytics packages, it’s not necessarily an image request, but it is a request to the web analytics server. What matters is understanding that there are a bunch of key-value pairs tacked on after a “?” in the request, and that’s where all of the fun information about the visit to the page gets recorded and passed.
4 – Web analytics tool reads the string and puts the information into a database
So, the web analytics server has been getting bombarded with the requests from Step 3. Can you see how straightforward it is for software to take those requests and split them back out into their component parts? That’s the easy part. Where the tools really differentiate themselves is how exactly they store all of that data — the design of their database and then how that data is made available for queries and reports by analysts.
Back in the day (and I assume it’s still an option), Webtrends would make the raw log files available to their customers as an add-on service. That was handy — once we understood the basics of this post and the Webtrends query parameters, we were able to sift through for some juicy nuggets to supplement our “traditional” web analytics (these were in the days before Webtrends had their “warehouse” solution, which would have made the same information available).
5 – Web analyst queries the database for insights
Like step 4, this is an area where web analytics tools really differentiate themselves. In the case of Google Analytics, there is the web-based tool and the API. In the case of paid, enterprise-class tools, there are similar tools plus true data warehouse environments that allow much more granular detail, as well as two-way integration with other systems.
Why Understanding This Matters
As I read back through this post before publishing it, I was struck by how far into the tactical mechanics of web analytics it is. The overwhelming majority of web analytics blog posts focus on step 5 and beyond — how to use the data to be an analysis ninja rather than a report monkey. Understanding the mechanics described here is a foundational step that will support all of that analysis work. I was incredibly fortunate, early in my web analytics career, to have an opportunity to run the migration from a log-based web analytics package to a tag-based solution. I was triply fortunate that I worked on that migration with two brilliant and patient IT folk: Ernest Mueller as the web admin (who regularly shares his knowledge these days as a contributor to http://www.webadminblog.com/) supporting the effort, and Ryan Rutan, the developer supporting the effort — he was hacking the Webtrends page tag before the consultant who we had on-site to help implement it had finished his first day. Ernest drew countless whiteboard diagrams to explain to me “how the internet works” (those “immutable laws” I mentioned early in this post), while Ryan repeated himself again and again until I understood this whole “image request with parameters” paradigm.
If you’re a web analyst, seek out these types of people in IT. A hearty collaboration of cross-discipline skills can yield powerful results and be a lot of fun. I had similar collaborations when I worked at Bulldog Solutions, and the last two weeks saw the same thing happening at my current gig at Resource Interactive. Those are pretty energizing experiences that leave me scratching my head as to why so many companies wind up with an adversarial relationship between “the business” and “IT.” But THAT is a topic for a whoooollllle other post that I may never write…