Avoiding PII in Google Analytics
Disclaimer: I am not a lawyer, and this blog post does not constitute legal advice. I recommend seeking advice from legal counsel to confirm the appropriate policies and steps for your organization.
With the launch of Google’s Universal Analytics a few years ago, companies were suddenly able to do more with GA than had been possible previously. For example, they could upload demographic data, or track website or app behavior by a true customer ID. Previously, Google Analytics had been intended to track completely anonymous website behavior.
However, one thing remains strict: Google’s policy against storing Personally Identifiable Information (PII). Google Analytics’ Terms of Service clearly state “You will not and will not assist or permit any third party to, pass information to Google that Google could use or recognize as personally identifiable information.”
Unfortunately, few companies seem to realize the potential consequences of breaching this. In short: If you are found with PII in your Google Analytics data, Google reserves the right to wipe all of your data from the timeframe in which the PII was present. (If this is years’ worth of data, so be it!) I have, in fact, worked with a client to whom this happened, and have spotted many sites that are collecting PII and may not even be aware of it.
[Case in point: I am a wine club member at a Napa winery (which happens to be a GA user). This winery often sends me promotional emails. Upon clicking through an email, I noticed they were appending my email address (clear PII!) to the URL. I quickly contacted them and let them know that’s a no-no, and was rewarded with a magnum of a delicious wine for my troubles! It turned out it was their email vendor who was doing this. In truth, this makes me more nervous, since this vendor is likely doing the same thing with all their clients!]
Want to know more? Here are a few things worth noting:
Google defines PII quite broadly. The current TOS does not actually contain a definition of PII; however, previous versions of the TOS included a (non-comprehensive) list of examples like “name, email address or billing information.” In discussions with senior executives on the GA team, I have been told that Google actually considers ZIP Code to be PII. This is because, in a few small rural areas in the United States, a ZIP Code can literally identify a single house. So keep in mind that Google will likely have a pretty strict view of what constitutes “PII.” If you think there’s a chance that something is PII, assume it is and err on the safe side.
It doesn’t matter if it’s ‘your fault’. In the case of my client, whose data was wiped due to PII, it was not actually them sending the data to Google Analytics! A third party was sending traffic to their site, with query string parameters containing PII attached. (Grrrrr!) Query string parameters are the most common culprit for PII “slipping” in, which could include Source/Medium/Campaign query string parameters, or other, non-GA-specific query string parameters. Unfortunately, this can happen without any wrongdoing on your part, since you can’t control what parameters others append.
Now, technically, the TOS say, “You … will not assist or permit any third party to…” so a client would technically not be in breach of TOS if they were unaware of the third party’s actions. However, Google may still need to “deal” with the issue (aka, remove the PII) and thus, you can end up in the same position, facing data deletion. I argue it’s worth being vigilant about third party PII infiltration, to avoid suffering the consequences!
The wipe will be at the Property level. If PII is found, the data wipe is at the Property level. This means that all Views under that Property would be affected – even if an individual View didn’t actually contain the PII! For example: You have www.siteA.com and www.siteB.com, but you track them both in the same Property. If Site A is found to have PII, while Site B doesn’t, Site B will be affected too, since the entire property will be wiped.
Filters don’t help. Let’s say you have noticed PII in your data: perhaps an email address in a query string. You think, “Oh that sucks. I’ll just add a filter to strip the email address and presto, problem fixed!” Not so fast… Google’s issue is with the fact that you ever sent it to their servers in the first place. The fact that you have filtered it out of the visible data doesn’t remove it from their servers, and thus, doesn’t fix the problem.
So what can you do?
1. Work with your legal team. Your legal counsel may already have rules in place for what your company does (and doesn’t do) with PII. It’s good to discuss the risks and safeguards you are using with respect to Google Analytics, and seek their advice wherever necessary.
2. Train your analysts, developers and marketers. To prevent intentionally passing PII, you’ll want to be sure that your marketers know what they can (and cannot!) track in GA for their marketing campaigns. On top of that, your analysts and developers should also be well-versed in acceptable tracking, and be on the lookout for PII, to raise a red flag before it goes live.
3. Use a tag manager to prevent the PII breach. Ideally, no PII would ever make it into your implementation. However, mistakes do happen, and third parties can “slip” PII into your GA data without you even knowing it. While view filters aren’t much help, a tag management system can save the day, by preventing the data from ever being sent to Google in the first place.
You have several options for how to implement this.
First, you’ll want your tag manager rule(s) to look for common examples where PII could be passed. For example, looking for email addresses, digits the length of credit card numbers, words like “name” (to catch first name, last name, business name etc. being passed), ZIP codes, addresses, etc.
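As a rough sketch of what such rules might check, the Python below scans a URL's path and query string for the patterns mentioned above. In a real tag manager this logic would more likely live in a custom JavaScript variable or rule; Python is used here only to illustrate the checks. The function name, the regular expressions and the list of suspicious parameter names are all assumptions made for illustration, not a definitive or complete PII detector.

```python
import re
from typing import List
from urllib.parse import urlsplit, parse_qsl, unquote

# Illustrative patterns only; real traffic will need tuning.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"(?<!\d)\d{13,19}(?!\d)")   # digit runs of credit card length
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")            # US ZIP or ZIP+4
SUSPECT_KEY_RE = re.compile(r"name|email|address|zip|postal|phone", re.IGNORECASE)


def find_suspected_pii(url: str) -> List[str]:
    """Return a list of reasons why this URL might contain PII (empty if none)."""
    reasons = []
    parts = urlsplit(url)

    # Emails sometimes appear directly in the path, possibly percent-encoded.
    if EMAIL_RE.search(unquote(parts.path)):
        reasons.append("email address in path")

    # parse_qsl decodes percent-encoding, so jane%40example.com becomes jane@example.com.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if SUSPECT_KEY_RE.search(key):
            reasons.append(f"suspicious parameter name: {key}")
        if EMAIL_RE.search(value):
            reasons.append(f"email address in parameter: {key}")
        # Strip spaces and hyphens so 4111-1111-1111-1111 is seen as 16 digits.
        if CARD_RE.search(re.sub(r"[ -]", "", value)):
            reasons.append(f"card-length digits in parameter: {key}")
        if ZIP_RE.fullmatch(value):
            reasons.append(f"ZIP-like value in parameter: {key}")
    return reasons


if __name__ == "__main__":
    print(find_suspected_pii("https://www.example.com/welcome?email=jane%40example.com"))
```

Rules this broad will produce false alarms (any five-digit ID looks like a ZIP code, and "hostname" contains "name"), so expect to refine them once you see what they catch.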
Since query string parameters (including utm_ parameters) are the most common culprits, you would definitely want to set up rules around Page, Source, Medium and Campaign dimensions, but you may want to be more diligent and consider other dimensions as well.
Next, you need to decide what to do if PII is found. There are three main options:
- Use your tag manager to rewrite the data. (For example, to replace the query string email=michele@analyticsdemystified.com with email=REMOVED). However, correctly rewriting the data requires knowing exactly what format it will come to you in. Since we are also trying to avoid inadvertent PII slipping in, it’s unlikely you’ll know the exact format it could appear in. There’s a risk your rewrite could be unsuccessful, and not actually fix the issue.
- Prevent Google Analytics firing. This solves the problem of PII in GA, but at the cost of losing that data, and possibly not being aware that it ever happened. (After all, if GA doesn’t track it, how would you know?) It would be preferable to…
- Use your tag manager to send hits with suspected PII to a different Property ID. This keeps the PII from corrupting your main data set, and allows you to easily set alerts for whenever that Property receives traffic. Since any wipe would be at the Property level, it is safest to isolate even suspected PII from your main data until you can evaluate it. If it turns out to be a false alarm, you may need to refine your tag manager rules. If, however, it is actually PII, you can then figure out where it is coming from, and ideally stop it at the source. (Keep in mind, there is no way to move “false alarm” data back to your main data set, but at least this keeps the bulk of your data safe from deletion!)
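To make the first and third options more concrete, here is a minimal Python sketch of the decision logic. It only looks for email-shaped values, which illustrates the weakness of rewriting noted above: anything in an unexpected format passes through untouched. The property IDs are placeholders, and the function names are hypothetical; in practice this logic would be expressed in your tag manager's own rules or custom JavaScript rather than in Python.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, unquote

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Placeholder property IDs, not real ones.
MAIN_PROPERTY_ID = "UA-XXXXXXX-1"
QUARANTINE_PROPERTY_ID = "UA-XXXXXXX-2"


def redact_emails(url: str) -> str:
    """Option 1: rewrite any query string value that looks like an email address."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    cleaned = [(key, "REMOVED" if EMAIL_RE.search(value) else value)
               for key, value in pairs]
    return urlunsplit(parts._replace(query=urlencode(cleaned)))


def choose_property_id(url: str) -> str:
    """Option 3: send hits with suspected PII to a separate property."""
    parts = urlsplit(url)
    candidates = [unquote(parts.path)]
    candidates += [value for _, value in parse_qsl(parts.query, keep_blank_values=True)]
    if any(EMAIL_RE.search(text) for text in candidates):
        return QUARANTINE_PROPERTY_ID
    return MAIN_PROPERTY_ID


if __name__ == "__main__":
    url = ("https://www.example.com/offer"
           "?email=michele%40analyticsdemystified.com&utm_source=newsletter")
    print(redact_emails(url))        # https://www.example.com/offer?email=REMOVED&utm_source=newsletter
    print(choose_property_id(url))   # UA-XXXXXXX-2
```

A fuller version would reuse the same detection rules for both decisions, so that whatever counts as "suspected PII" is defined in one place.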
4. Set alerts to look for PII. You will want to set alerts for your “PII Property”, but I would also recommend having the same alerts in place for your main Property, just in case something slips through the cracks.
An example alert could search Page Name for characters matching the email address format:
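One way to sanity-check a candidate pattern before putting it into an alert is to try it against a few sample page values. The Python sketch below does that. The pattern and the sample pages are illustrative assumptions, and you should confirm that the regular expression syntax accepted by your alert matches what is tested here. Note that it also looks for %40, the URL-encoded form of the @ sign.

```python
import re

# Candidate pattern: an email-like string, with the @ either literal or URL-encoded.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+(@|%40)[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def page_triggers_alert(page: str) -> bool:
    """Return True if the page value contains something shaped like an email address."""
    return re.search(EMAIL_PATTERN, page) is not None


if __name__ == "__main__":
    samples = [
        "/welcome?email=jane@example.com",
        "/welcome?email=jane%40example.com",
        "/products/red-wine?utm_medium=email",
    ]
    for page in samples:
        print(page, page_triggers_alert(page))
```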
5. On top of your automated alerts, consider doing some manual checks from time to time. Unfortunately, once the PII is in your GA data set, there is no way to remove it. However, it is far better to catch it earlier. That way, if you did face a potential data wipe, at least the wipe would be for a shorter timeframe.
The above are just a few suggestions on how to deal with PII, to comply with Google Analytics TOS. However, there may be some other creative ideas folks have. Please feel free to add them in the comments!