Capturing Web Traffic Data — Two Methods That Suck
We’re working with a client who is simultaneously deploying Eloqua for marketing automation and switching from Urchin 5 to Google Analytics. All three of these tools provide some level of web traffic data. And, right out of the chute, Eloqua was reporting 40% less traffic than Urchin 5. That raised questions…as the deployment of concurrent web analytics tools always does! Having put myself through this wringer several times, and having seen it crop up as a recurring theme on the webanalytics Yahoo! group, it seemed worth sharing some material I put together a couple of years ago on the subject.
First off, it is largely a waste of time to try to completely reconcile data from two different web analytics tools. This post really isn’t about that. Mark Twain, Lee Segall, or perhaps someone else coined the saying, “A man with one watch knows what time it is; a man with two watches is never quite sure.” The same is true for web analytics. Thanks to different data capture methods, different data processing algorithms, different data storage schemas, and different definitions, no two tools running concurrently will ever report the same results. The good news, though, is that most tools will show very similar trends. WebTrends preaches, “in web analytics, it’s the trends that matter — that’s why it’s part of our name!” But, even in the broader web analytics community, this is widely accepted. Avinash Kaushik had a great post titled Data Quality Sucks, Let’s Just Get Over It way back in 2006, but it still applies. Read more there!
This post, rather, is more about the basics of “log files” versus “page tagging,” which are the two dominant methods of capturing web data. Page tagging has been much more in vogue of late, but it has its drawbacks. In the case of our client, their Urchin 5 implementation is log-based, while Google Analytics and Eloqua are tag-based. And, not surprisingly, Google Analytics and Eloqua are providing traffic data that is fairly similar. But, even when two tools use the same basic data capture method, there is no guarantee that they will present identical results.
The following diagram tells the basic story of how the two methods differ:
“But, wait!” you exclaim! “How come both of these have ‘log file processed’ in them? I thought one method was log file-based and the other was not!” <sigh> As it turns out, both methods are, in the end, parsing log files. With page tag solutions, the log file being parsed/processed is the page tag server’s log file. In theory, your main web server(s) could be the page tag server…but then the tool would be stuck having to sift through a lot more clutter to get to the page tag-generated requests.
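To make that a bit more concrete, here is a rough sketch of what “processing the page tag server’s log file” boils down to. The beacon path (/collect.gif), the parameter names, and the log line itself are all invented for the example:

```typescript
// Hypothetical tag-server log line (invented for illustration): the only kind
// of line a page tag tool actually cares about.
const logLine =
  '203.0.113.7 - - [12/May/2008:10:15:32 -0500] "GET /collect.gif?page=%2Fproducts%2F&res=1280x800&ref=http%3A%2F%2Fwww.google.com%2Fsearch HTTP/1.1" 200 43';

// "Processing" that log is largely a matter of pulling the visit data back out
// of each beacon request's query string.
const match = logLine.match(/GET\s+\/collect\.gif\?(\S+)/);
if (match) {
  const params = new URLSearchParams(match[1]);
  console.log(params.get("page")); // "/products/"
  console.log(params.get("res"));  // "1280x800"
  console.log(params.get("ref"));  // "http://www.google.com/search"
}
```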
I’m getting ahead of myself, but go ahead and file that little bit of information as a handy cocktail party conversation…um…killer (unless the cocktail party is a Web Analytics Wednesday event — it’s all about your target audience, isn’t it?).
In a log file-based solution — the left diagram above — the “hit” is recorded as soon as the user’s browser manages to get a request for a page to your web server. It doesn’t matter if the page is successfully delivered and rendered on the user’s machine. This is good and bad, as we’ll cover in a bit.
In a page tag-based solution — the right diagram above — the “hit” is recorded much, much later in the process. The user’s browser requests the page, the page gets downloaded to the browser, the browser renders the page and, as part of that rendering, executes a bit of Javascript. The Javascript usually picks up some additional information beyond the basic stuff that is recorded in a standard web request (such as screen resolution, maybe some meta tag values from the page, and so on). It then tacks all of that supplemental information onto the end of an image request to the page tag server. The page tag server log file, then, only has those image requests, but it has some really rich information included in them.
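As a rough illustration of the mechanism (and emphatically not any vendor’s actual tag code), a minimal page tag might look something like the sketch below. The tag server hostname, the /collect.gif beacon path, and the parameter names are the same hypothetical ones used in the earlier log line example:

```typescript
// Minimal page-tag sketch (illustrative only, not any vendor's actual tag code).
// It gathers a few data points the web server would never see on its own, then
// tacks them onto an image request to a hypothetical tag server.
(function sendPageTag(): void {
  const data: Record<string, string> = {
    page: document.location.pathname,        // which page fired the tag
    ref: document.referrer,                  // where the visitor came from
    res: `${screen.width}x${screen.height}`, // screen resolution: invisible to a plain log file
    title: document.title,                   // could just as easily be meta tag values
  };

  // Build the query string that will end up in the tag server's log file.
  const query = Object.entries(data)
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");

  // Requesting a 1x1 image is the classic way to ship the data across; that
  // request (and its query string) is the "hit" the tag server logs.
  new Image().src = `https://tags.example.com/collect.gif?${query}`;
})();
```

None of the screen resolution, the page title, or any meta tag values would ever show up in your own web server’s raw log, which is exactly the extra richness that page tags buy you.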
Got all that? Well, there are obvious pros and cons to both approaches.
Log File-Based Tools Pros and Cons
The good things about a log file-based approach:
- They (more) accurately reflect the actual load on your web servers — your IT department probably cares about this a lot more than your Marketing department does
- They capture data very early in the process — as soon as you could possibly know someone is trying to view a page, they record it
But, it’s not all sweetness and light. There are some cons to log files that are nontrivial:
- They miss hits to cached pages (by browser, by proxy) — this can make for some rather nonsensical clickstreams
- They are limited to data captured in the Web server log file — this can be a fairly severe limitation if, for instance, you have rich meta data in the content of your pages and you want to use that meta data to group your content for analysis
- They capture a lot of useless data — I just went to the Microsoft home page, and watched 65 discrete requests hit their web servers to render the page (images, stylesheets, Javascript include files, etc.); this is fairly typical, and means you wind up pre-processing the log file to strip out all of the crud that you don’t really care about
- It is difficult for them to filter out spiders/bots — there is a “long tail” of spiders crawling the web, so this is not simply a matter of knocking out Google’s bot, Yahoo’s bot, and Baidu’s bot; there is an unmanageable, constantly changing list of known bots and spiders…and many bots mask themselves with browser-like user agents, which makes them extremely difficult to detect (this was actually the far-and-away biggest culprit with the client who spawned this post); see the sketch after this list
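To put the pre-processing and bot-filtering points in concrete terms, here is a rough sketch of the kind of clean-up a log-based tool (or you, before feeding it the file) has to do before counting anything. It assumes a Node environment; the log path, the asset extensions, and the deliberately tiny bot list are all placeholders, and bots with spoofed user agents sail right through it:

```typescript
import * as fs from "fs";
import * as readline from "readline";

// Illustrative only: the asset extensions and the (tiny) bot list are
// assumptions for the sketch. Real bot lists run to thousands of entries and
// change constantly, and bots that spoof browser user agents slip through.
const ASSET_PATTERN = /\.(gif|jpe?g|png|css|js|ico)(\?|\s|$)/i;
const KNOWN_BOTS = /googlebot|slurp|baiduspider|bingbot|crawler|spider/i;

async function countPageViews(logPath: string): Promise<number> {
  const lines = readline.createInterface({
    input: fs.createReadStream(logPath),
    crlfDelay: Infinity,
  });

  let pageViews = 0;
  for await (const line of lines) {
    // Skip the dozens of image/stylesheet/script requests behind every page view.
    if (ASSET_PATTERN.test(line)) continue;
    // Skip the bots we happen to recognize by user agent.
    if (KNOWN_BOTS.test(line)) continue;
    pageViews++;
  }
  return pageViews;
}

// Hypothetical log location.
countPageViews("/var/log/apache2/access.log").then((n) =>
  console.log(`Page views after filtering: ${n}`)
);
```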
Page Tag-Based Tools Pros and Cons
Alas! Although page tags address the bigger negatives of log file-based solutions, they have their own downsides. But, let’s start with the positives:
- Because they are Javascript-based, they are able to capture lots of juicy supplemental data about the visitor and the content
- Most (not all, mind you) spiders/bots do not execute Javascript, so they are automatically omitted from the data
- The Javascript “forces” the page tag to fire…even on cached pages
There are some downsides, though:
- They require the page tag Javascript to be deployed on every page you want tracked — even if you have a centrally managed footer that gets deployed to all pages…chances are there are still some important corner case pages where this is not the case; and, even if it is not a problem now, it could become one in the future; we had a pretty robust system that was undermined when the design of a key landing page was completely overhauled…and the page tag was nuked in the process
- They do not record a hit to the page until the page has been at least partly delivered to the client — if you have visitors that bounce off of your site very quickly, you may never see that they hit the site at all
- If Javascript is disabled by the client, then you have to put in some sort of clunky workaround (typically a plain image request wrapped in a noscript tag) to capture the traffic…and what you capture will not be nearly as rich as what you capture for visitors who have Javascript enabled
So, What’s the Answer?
The obvious answer may seem to be to employ both approaches in a hybrid system. And that is the obvious answer only if you or your management is so aggressively compulsive that you are willing to devote major time and resources to trying to pull it off (and, very likely, to fail and create confusion in the process).
Let’s toss the obvious answer out then, shall we?
The answer is simpler, actually:
- Understand the pros/cons of both approaches
- Be clear on what your objectives are — what do you care about?
- Determine which approach will more effectively help you meet your objectives and go with that
Now, if you are a Marketer, there’s a pretty good chance that you’ll wind up settling on a page tag-based solution. If that’s the case, then it might still make sense to figure out where your log files are and to do a little snooping around in them. I’ve found log files to be very handy when the page tags throw some sort of anomaly. If you can narrow down the anomaly, the log file can be a good way to get to the bottom of what is going on. Page tags…with log files to supplement. Does that sound like a tasty recipe or what?
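For what it’s worth, that snooping rarely needs to be fancy. A throwaway script along these lines (the log path and page path are placeholders), which dumps every raw request for the page in question, is often enough to tell whether the tag stopped firing, a bot went on a rampage, or the traffic never reached the server in the first place:

```typescript
import * as fs from "fs";
import * as readline from "readline";

// Throwaway investigation sketch: dump every raw request for one page so it
// can be eyeballed against what the page tag tool is reporting.
async function dumpRequestsFor(logPath: string, pagePath: string): Promise<void> {
  const lines = readline.createInterface({
    input: fs.createReadStream(logPath),
    crlfDelay: Infinity,
  });

  for await (const line of lines) {
    if (line.includes(`GET ${pagePath}`)) {
      // The full log line keeps the timestamp, referrer, and user agent,
      // which is usually enough to spot a runaway bot or a redirect gone wrong.
      console.log(line);
    }
  }
}

// Hypothetical log location and landing page path.
dumpRequestsFor("/var/log/apache2/access.log", "/landing-page/").catch(console.error);
```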