A Primer on Cookies in Web Analytics
Some of you may have noticed that I don’t blog as much as some of my colleagues (not to mention any names, but this one, this one, or this one). The main reason is that I’m a total nerd (just ask my wife), but in a way that is different from most analytics professionals. I don’t spend all day in the data – I spend all day writing code. And it’s often hard to translate code into entertaining blog posts, especially for the folks who tend to spend a lot of time reading what my partners have to say.
What is a cookie?
From the most reliable and trusted source on the Internet, a cookie is defined as:
a small piece of data sent from a website and stored in a user’s browser while the user is browsing that website
I disagree with a few key points in that definition:
- A cookie can be sent from a website (i.e. a web server), but it can also be set or manipulated in the browser once the page has loaded in the browser and communication has more-or-less stopped with the server.
- Cookies are stored in a user’s browser – but the duration of this storage can (and usually does) last beyond the time the user browses that website.
So let’s start by redefining what a cookie is – or at least my opinion of what a cookie is: A cookie is a small piece of data about a website, as well as the user browsing that website. It is stored in the user’s browser for an amount of time determined by the code used to set the cookie, and can be set and modified either on the server or in the browser. There are several important attributes to a cookie:
- A name: This is a friendly name for the cookie, usually reflecting what data it is storing (like a username, or a preferred content language).
- A value: This is the actual data to be stored, and is always a string – but it could contain a numeric string, or JSON data converted to a string, or any type of data that can easily be converted from a string to some other format.
- An expiration date/time: This is a readable UTC date string (not a timestamp). If this date is omitted, the cookie expires when the browser closes (i.e. a session cookie). If the date is included, the cookie is persistent.
- A domain: This is the website in which the cookie has scope (meaning where it can be read and written). For example, on my blog I can set a cookie that has scope of webanalyticsdemystified.com or josh.webanalyticsdemystified.com. I cannot set a cookie that has scope of adam.webanalyticsdemystified.com, or for any other website.
- A path: Once the domain has been specified, I can also choose to restrict it to a certain path or directory on my website – like maybe only anything in /products. The path is rarely specified.
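To make those attributes concrete, here is a minimal sketch of how they combine into the single string a server or script uses to set a cookie. The function name `build_set_cookie` and the cookie names and values (`user_id`, `abc123`, `lang`) are made up for illustration; they are not the actual cookies on my blog.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Optional


def build_set_cookie(name: str, value: str,
                     expires: Optional[datetime] = None,
                     domain: Optional[str] = None,
                     path: Optional[str] = None) -> str:
    """Assemble a cookie string from the attributes described above."""
    parts = [f"{name}={value}"]
    if expires is not None:
        # The expiration is a readable UTC date string, not a numeric timestamp.
        utc_expires = expires.astimezone(timezone.utc)
        parts.append("expires=" + format_datetime(utc_expires, usegmt=True))
    if domain:
        parts.append(f"domain={domain}")
    if path:
        parts.append(f"path={path}")
    return "; ".join(parts)


# A persistent cookie: it has an expiration date, so it outlives the browser session.
persistent = build_set_cookie(
    "user_id", "abc123",
    expires=datetime(2014, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
    domain="josh.webanalyticsdemystified.com",
    path="/",
)

# A session cookie: no expiration, so it goes away when the browser closes.
session = build_set_cookie("lang", "en")

print(persistent)
print(session)
```

Leaving out the expiration is the only difference between the session cookie and the persistent one; everything else about the two strings follows the same pattern.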
A day in the life of a cookie
You may use a tool like Firebug to inspect your cookies, and be familiar with seeing cookies represented like this:
However, you may be less familiar with what a cookie actually “looks” like. But it’s just a string that is saved as a text file somewhere in your browser’s application data (you can find these text files if you search for where your browser’s cache is located on your computer). For the same example above, the text string looks something like this:
So let’s take a deep dive into this particular cookie and how it ends up being available for your analytics tools to use. This cookie gives me a unique user ID on my blog. It is set to expire on the last day of 2014, and has scope anywhere on my blog domain. But what led to this cookie showing up in the list of cookies in my browser?
This issue of cookie scope is a particularly tricky topic, so let’s take a closer look. Because I set my cookie with the scope of josh.webanalyticsdemystified.com, it will be included any time one of my blog posts is served. But if I set it with the scope of webanalyticsdemystified.com, it will be included for any of my partners’ posts as well – because we all share that parent domain. I cannot set my cookie with any other scope – I can use my subdomain or the parent domain – but other subdomains or other domains (like google.com) are not allowed. And my code will never be allowed to read cookies on those other domains, either. The same is true of the “path” component of a cookie as well – so you can see how path and domain combine to limit which cookies a site can read, and which ones it cannot read. They essentially act as a bouncer, controlling which parts of this big party called the web my site has access to.
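The bouncer’s rules can be sketched in a few lines. This is a simplified version of the domain and path matching browsers perform, not any browser’s actual code; the function names are my own, and the sketch ignores one real-world rule (browsers also refuse cookies scoped to an entire public suffix such as “com”).

```python
def domain_matches(host: str, cookie_domain: str) -> bool:
    """True if a cookie scoped to cookie_domain is in scope for host."""
    host = host.lower()
    cookie_domain = cookie_domain.lower().lstrip(".")
    # In scope if the host is the cookie domain itself or one of its subdomains.
    return host == cookie_domain or host.endswith("." + cookie_domain)


def path_matches(request_path: str, cookie_path: str) -> bool:
    """True if request_path falls inside the directory named by cookie_path."""
    if request_path == cookie_path:
        return True
    if request_path.startswith(cookie_path):
        # "/products" covers "/products/shoes" but not "/productsx".
        return cookie_path.endswith("/") or request_path[len(cookie_path)] == "/"
    return False


def cookie_is_sent(host: str, request_path: str,
                   cookie_domain: str, cookie_path: str = "/") -> bool:
    """Domain and path must both match for the cookie to be included."""
    return domain_matches(host, cookie_domain) and path_matches(request_path, cookie_path)


my_blog = "josh.webanalyticsdemystified.com"
partner_blog = "adam.webanalyticsdemystified.com"

# A page can only set cookies for domains that match its own host.
print(domain_matches(my_blog, "webanalyticsdemystified.com"))   # True
print(domain_matches(my_blog, partner_blog))                    # False
print(domain_matches(my_blog, "google.com"))                    # False

# A cookie scoped to the shared parent domain is sent for partner posts too.
print(cookie_is_sent(partner_blog, "/", "webanalyticsdemystified.com"))  # True
print(cookie_is_sent(partner_blog, "/", my_blog))                        # False
```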
First-party vs. Third-party
However, the average page normally includes many assets – like images or scripts – that are hosted on a different domain than my site. For example, my blog loads the jQuery library from Google’s CDN, ajax.googleapis.com, and our Google Analytics tracking requests are sent back to www.google-analytics.com. Any cookies my browser has for either of those domains (note: if you look, you will not see any – this is just an example) will be sent back to each of those servers, because they have scope there – but when the requests come back to my browser, the page waiting to receive the assets has a different domain – so it can neither read from nor write to any of those cookies. They are what we call third-party.
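As a rough illustration of that distinction, the sketch below labels a request as first-party or third-party by comparing the site of the page with the site of the asset being requested. The helper `site_of` is a deliberate simplification of my own: it treats the last two labels of a hostname as the site, whereas real browsers consult the Public Suffix List (so a domain like example.co.uk would be handled incorrectly here).

```python
def site_of(host: str) -> str:
    """Assumed simplification: the site is the last two labels of the host."""
    return ".".join(host.lower().split(".")[-2:])


def is_third_party(page_host: str, request_host: str) -> bool:
    """A request is third-party when it goes to a different site than the page."""
    return site_of(page_host) != site_of(request_host)


page = "josh.webanalyticsdemystified.com"

for asset_host in ("josh.webanalyticsdemystified.com",
                   "ajax.googleapis.com",
                   "www.google-analytics.com"):
    label = "third-party" if is_third_party(page, asset_host) else "first-party"
    print(f"{asset_host}: {label}")
```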
Third-party cookies have a bad reputation, because some companies use them maliciously, and also because it can be annoying to have cookies you never asked for cluttering up your browser and passing data back to places you never knew about. But in most cases, cookies – even third-party cookies – contribute to a better user experience. However, because maintaining privacy is important to everyone, most modern browsers offer users the ability to choose whether third-party cookies should be allowed or not. Most browsers allow third-party cookies by default, and most users never modify those settings. If third-party cookies are disabled, the third-party server will still try to set a cookie, but the browser will immediately reject or delete it. And even when third-party cookies are allowed, they can never be read by the first-party page in the browser – nor can the third-party site read any first-party cookies, either. For example, when this page loaded, a cookie was set by disqus.com, the tool we use for comments and discussions. My blog cannot read or modify that disqus.com cookie – and disqus.com can’t read any webanalyticsdemystified.com cookies, either. Developers reading this post are now shaking their heads, as there are obviously exceptions to this – but most readers probably aren’t interested in cross-site scripting and other hacking techniques that are the exceptions to this rule.
Some browsers like Safari even allow you to select an option that says “Only accept cookies from sites I visit,” which blocks some, but not all, third-party cookies. For example, when I worked at salesforce.com, www.salesforce.com was the entry point to all our websites, but we had other websites (like www.force.com) that we linked to on that website. Because www.force.com included images from www.salesforce.com, those cookies with www.salesforce.com scope would not be deleted, even though the domain didn’t match that of the page in the browser – because I had previously been to that site. The browser used my history as an implicit acceptance of those third-party cookies. However, even though the cookies would not be deleted, they couldn’t be used by www.force.com, either.
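The three browser settings described in this section can be modeled as a small decision function. This is only a sketch of the behavior as I have described it, not how Safari or any other browser is actually implemented; the `Policy` names, the `browser_keeps_cookie` function, and the two-label `site_of` shortcut are all assumptions made for the example.

```python
from enum import Enum
from typing import AbstractSet


class Policy(Enum):
    ALLOW_ALL = "allow all cookies"
    BLOCK_THIRD_PARTY = "block all third-party cookies"
    ONLY_VISITED = "only accept cookies from sites I visit"


def site_of(host: str) -> str:
    """Assumed simplification: the site is the last two labels of the host."""
    return ".".join(host.lower().split(".")[-2:])


def browser_keeps_cookie(page_host: str, cookie_host: str, policy: Policy,
                         visited_sites: AbstractSet[str] = frozenset()) -> bool:
    """Decide whether a cookie from cookie_host survives while viewing page_host."""
    if site_of(page_host) == site_of(cookie_host):
        return True  # first-party cookies are kept under every policy
    if policy is Policy.ALLOW_ALL:
        return True
    if policy is Policy.ONLY_VISITED:
        # Browsing history acts as implicit acceptance of the third-party cookie.
        return site_of(cookie_host) in visited_sites
    return False  # BLOCK_THIRD_PARTY


history = {"salesforce.com"}

# www.force.com loads an image from www.salesforce.com, which I visited earlier.
print(browser_keeps_cookie("www.force.com", "www.salesforce.com",
                           Policy.ONLY_VISITED, history))        # True
# Same request, but with no history for salesforce.com.
print(browser_keeps_cookie("www.force.com", "www.salesforce.com",
                           Policy.ONLY_VISITED))                 # False
```

Note that a cookie being kept is not the same as the page being able to read it: scripts on www.force.com still cannot see a cookie scoped to www.salesforce.com.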
I often have clients ask me how they can use, on one of their sites, a cookie they set on another of their sites. Unfortunately, even though they “own” both sites, and the same developers manage both sites, there is no way of doing this without taking the risk that those cookies will be deleted by users who block third-party cookies. There are some “hacks” to accomplish this, but the benefits rarely outweigh the risks you’re taking.