# The Trouble (My Troubles) with Statistics

**Tim Wilson** on **July 3, 2017**

Okay. I admit it. That’s a linkbait-y title. In my defense, though, the only audience that would be successfully baited by it, I think, are digital analysts, statisticians, and data scientists. And, that’s who I’m targeting, albeit for different reasons:

- **Digital analysts** — if you’re reading this, then, hopefully, it may help you get over an initial hump on the topic, one that I’ve been struggling mightily to clear myself.
- **Statisticians and data scientists** — if you’re reading this, then, hopefully, it will help you understand why you often run into blank stares when trying to explain a t-test to a digital analyst.

If you are comfortably bridging both worlds, then you are a rare bird, and I beg you to weigh in in the comments as to whether what I describe rings true.

## The Premise

I took a college-level class in statistics in 2001 and another one in 2010. Neither class was particularly difficult. They both covered similar ground. And, yet, I wasn’t able to apply a lick of content from either one to my work as a web/digital analyst.

Since early last year, as I’ve been learning R, I’ve also been trying to “become more data science-y,” and that’s involved taking another run at the world of statistics. That. Has. Been. HARD!

From many, many discussions with others in the field — on both the digital analytics side of things and the more data science and statistics side of things — I think I’ve started to identify why and where it’s easy to get tripped up. This post is an enumeration of those items!

*As an aside, my eldest child, when applying for college, was told that the fact that he “didn’t take any math” his junior year in high school might raise a small red flag in the admissions department of the engineering school he’d applied to. He’d taken statistics that year (because the differential equations class he’d intended to take had fallen through). THAT was the first time I learned that, in most circles, statistics is not considered “math.” See how little I knew?!*

## Terminology: Dimensions and Metrics? Meet Variables!

Historically, web analysts have lived in a world of dimensions. We combine multiple dimensions (channel + device type, for instance) and then put one or more metrics against those dimensions (visits, page views, orders, revenue, etc.).

Statistical methods, on the other hand, work with “variables.” What is a variable? I’m not being facetious. It turns out it can be a bit of a mind-bender if you come at it from a web analytics perspective:

- Is device type a variable?
- Or, is the number of *visits by device type* a variable?
- OR, is the number of *visits from mobile devices* a variable?

The answer… is “Yes.” Depending on what question you are asking and what statistical method is being applied, defining what your variable(s) are, well, *varies*. Statisticians think of variables as having different types of *scales*: nominal, ordinal, interval, or ratio. And, in a related way, they think of data as being either “metric data” or “nonmetric data.” There’s a good write-up on the different types — with a digital analytics slant — in this post on dartistics.com.

It may seem like semantic navel-gazing, but it really isn’t: different statistical methods work with specific types of variables, so data has to be transformed appropriately before statistical operations are performed. Some day, I’ll write that magical post that provides a perfect link between these two fundamentally different lenses through which we think about our data… but today is not that day.
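To make the framings from the bullet list above concrete, here is a minimal Python sketch (the six sessions are invented for illustration): the same data can be treated as one nominal variable per session, as a ratio-scale count per device type, or as a set of yes/no indicator variables — the “dummy coding” many statistical methods expect.

```python
# A minimal sketch of turning one nominal "dimension" into statistical
# variables. The session rows below are made-up illustrative data.
sessions = ["mobile", "desktop", "mobile", "tablet", "desktop", "mobile"]

# Framing 1: device type itself is a nominal variable (one value per session).
# Framing 2: visits *by* device type is a ratio-scale variable (a count >= 0).
visits_by_device = {}
for device in sessions:
    visits_by_device[device] = visits_by_device.get(device, 0) + 1

# Framing 3: one yes/no (indicator) variable per device type -- the form
# many statistical methods expect ("dummy coding").
device_levels = sorted(set(sessions))
indicators = [{level: int(s == level) for level in device_levels} for s in sessions]

print(visits_by_device)   # aggregated counts per device
print(indicators[0])      # the first session as indicator variables
```

Which framing is “the variable” depends entirely on the question being asked — which is exactly the mind-bender described above.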

## Atomic Data vs. Aggregated Counts

In R, when using `ggplot` to create a bar chart from underlying data that looks similar to how data would look in Excel, I have to include the parameter `stat="identity"`. As it turns out, that is a symptom of the next mental jump required to move from the world of digital analytics to the world of statistics.

To illustrate, let’s think about how we view traffic by channel:

- In web analytics, we think: “this is how many (a count) visitors to the site came from each of referring sites, paid search, organic search, etc.”
- In statistics, *typically*, the framing would be: “here is a list (row) for each visitor to the site, and each visitor is identified as visiting from referring sites, paid search, organic search, etc.” (or, possibly, “each visitor is flagged as yes/no for each of: referring sites, paid search, organic search, etc.”… but that’s back to the discussion of “variables” covered above).

So, in my bar chart example above, R *defaults* to thinking that it’s making a bar chart out of a sea of data, aggregating a bunch of atomic observations into a summarized set of bars. The `stat="identity"` argument has to be included to tell R, “No, no. Not this time. I’ve already counted up the totals for you. I’m telling you the height of each bar with the data I’m sending you!”
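The same distinction exists outside R. As a language-neutral sketch in Python (with invented data): `ggplot`’s default behavior (`stat="count"`) is to count atomic rows itself, while `stat="identity"` says the totals are already computed.

```python
from collections import Counter

# Atomic observations: one row per visit (illustrative, made-up data).
visits = ["organic", "paid", "organic", "referral", "organic", "paid"]

# What ggplot does by default (stat="count"): aggregate the atomic rows itself.
counted = Counter(visits)

# What stat="identity" communicates: "the heights are already computed --
# just draw them." Here, the totals arrive pre-aggregated.
pre_aggregated = {"organic": 3, "paid": 2, "referral": 1}

# Both routes end at the same bar heights; the difference is who does the counting.
assert dict(counted) == pre_aggregated
```

The bar chart looks identical either way; the question is whether the tool or the analyst performed the aggregation.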

When researching statistical methods, this comes up time and time again: statistical techniques often expect a data set to be a collection of *atomic observations*. Web analysts typically work with *aggregated counts*. Two things to call out on this front:

- There *are* statistical methods (a cross-tabulation with a chi-square test for independence is one good example) that work with aggregated counts. I realize that. But, there are many more that actually expect greater fidelity in the data.
- Both Adobe Analytics (via data feeds and, to a clunkier extent, Data Warehouse) and Google Analytics (via the GA360 integration with Google BigQuery) offer much more atomic-level data than the data they provided historically through their primary interfaces; this is one reason data scientists are starting to dig into digital analytics data more!
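As a sketch of the aggregated-counts case (using `scipy` and entirely made-up numbers): a chi-square test for independence works directly on a cross-tabulation of counts — say, channel by device type — with no atomic, one-row-per-visit data required.

```python
from scipy.stats import chi2_contingency

# A made-up cross-tabulation of *aggregated* visit counts:
# rows = channel (organic, paid), columns = device (desktop, mobile).
observed = [
    [420, 380],   # organic: desktop, mobile
    [150, 250],   # paid:    desktop, mobile
]

# The chi-square test for independence consumes these counts directly.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
```

A small p-value here would suggest channel and device type are not independent — all from a table a web analyst could pull out of any standard reporting interface.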

The big “Aha!” for me in this area is that we often want to introduce pseudo-granularity into our data. For instance, if we look at orders by channel for the last quarter, we may have 8-10 rows of data. But, if we pull orders *by day* for the last quarter, we have a much larger set of data. And, by introducing granularity, we can start looking at the *variability* of orders *within* each channel. That is useful! When performing a 1-way ANOVA, for instance, we need to compare the variability *within* channels to the variability *across* channels to draw conclusions about where the “real” differences are.
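A minimal sketch of that 1-way ANOVA, with invented daily order counts for three channels (the pseudo-granularity from pulling orders *by day* rather than one quarterly total per channel):

```python
from scipy.stats import f_oneway

# Made-up daily order counts for three channels over two weeks -- the
# "pseudo-granularity" gained by pulling orders *by day*.
organic  = [31, 28, 35, 30, 33, 29, 32, 34, 27, 30, 31, 33, 28, 32]
paid     = [22, 25, 20, 24, 23, 26, 21, 25, 22, 24, 23, 20, 26, 22]
referral = [30, 27, 33, 29, 31, 28, 32, 30, 26, 31, 29, 33, 27, 30]

# One-way ANOVA compares within-channel variability to between-channel
# variability to judge whether the channel means really differ.
f_stat, p_value = f_oneway(organic, paid, referral)

print(f"F = {f_stat:.1f}, p = {p_value:.6f}")
```

With only one aggregated total per channel, there is no within-channel variability at all, and the test simply cannot be run — which is the whole point of the granularity discussion above.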

This actually starts to get a bit messy. We can’t just add dimensions to our data willy-nilly to artificially introduce granularity. That can be dangerous! But, in the absence of truly atomic data, some degree of added dimensionality is required to apply some types of statistical methods. *<sigh>*

## Samples vs. Populations

The first definition for “statistics” I get from Google (emphasis added) is:

“the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of **inferring proportions in a whole from those in a representative sample**.”

Web analysts often work with “the whole” — unless we consider historical data to be the sample and the “whole” to include future web traffic. But, if we view the world that way — by using time to determine our “sample” — then we’re not exactly getting a random (independent) sample!

We’ve also been conditioned to believe that *sampling is bad!* For years, Adobe/Omniture was able to beat up on Google Analytics because of GA’s “sampled data” conditions. And, Google has made any number of changes and product offerings (GA Premium -> GA 360) to allow their customers to avoid sampling. So, Google, too, has conditioned us to treat the word “sampled” as having a negative connotation.

To be clear: GA’s sampling *is an issue*. But, it turns out that working with “the entire population” with statistics *can be an issue, too*. If you’ve ever heard of the dangers of “overfitting the model,” or if you’ve heard, “if you have enough traffic, you’ll always find statistical significance,” then you’re at least vaguely aware of this!

So, on the one hand, we tend to drool over how *much* data we have (thank you, digital!). But, as web analysts, we’re conditioned to think, “always use all the data!” Statisticians, when presented with a sufficiently large data set, like to pull a *sample* of that data, build a model, and then *test* the model with another sample of the data. As far as I know, neither Adobe nor Google has an “Export a sample of the data” option available natively. And, frankly, I have yet to come across a data scientist working with digital analytics data who is doing this, either. But, several people have acknowledged this is something that *should* be done in some cases.
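In the absence of a native export option, the split is easy to do after the fact. A stdlib-only sketch (visitor IDs are made up) of pulling a random training sample and holding out the rest for testing:

```python
import random

# Made-up visitor IDs standing in for an exported, atomic-level data set.
random.seed(42)  # fixed seed so the split is reproducible
visitors = [f"visitor_{i:04d}" for i in range(1000)]

# Shuffle a copy, then split 80/20 into train and test sets.
shuffled = visitors[:]
random.shuffle(shuffled)
split = int(len(shuffled) * 0.8)
train, test = shuffled[:split], shuffled[split:]

# Build the model on `train`; evaluate it on `test`, which the model never saw.
print(len(train), len(test))
```

Tools like scikit-learn wrap this same idea in a one-liner (`train_test_split`), but the principle is just a shuffled partition.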

I think this is going to have to get addressed at some point. Maybe it already has been, and I just haven’t crossed paths with the folks who have done it!

## Decision Under Uncertainty

I’ve saved the messiest (I think) for last. Everything on my list to this point has been, to some extent, mechanical. We should be able to just “figure it out” — make a few cheat sheets, draw a few diagrams, reach a conclusion, and be done with it.

But, this one… is different. This is an issue of fundamental understanding — a fundamental perspective on both data and the role of the analyst.

Several statistically-savvy analysts I have chatted with have said something along the lines of, “You know, really, to ‘get’ statistics, you have to start with probability theory.” One published illustration of this stance can be found in The Cartoon Guide to Statistics, which devotes an early chapter to the subject. It actually goes all the way back to the 1600s and an exchange between Blaise Pascal and Pierre de Fermat and proceeds to walk through a dice-throwing example of probability theory. Alas! This is where the book lost me (although I still have it and may give it another go).

Possibly related — although quite different — is something that Matt Gershoff of Conductrics and I have chatted about on multiple occasions across multiple continents. Matt posits that, really, one of the biggest challenges he sees traditional digital analysts facing when they try to dive into a more statistically-oriented mindset is understanding the scope (and limits!) of their role. As he put it to me once in a series of direct messages, it really boils down to:

- It’s about decision-making under uncertainty
- It’s about assessing how much uncertainty is reduced with additional data
- It must consider, “What is the value in that reduction of uncertainty?”
- And it must consider, “Is that value greater than the cost of the data/time/opportunity costs?”

The list *looks* pretty simple, but I think there is a deeper mindset/mentality-shift that it points to. And, it gets to a related challenge: even if the digital analyst views her role through this lens, do her stakeholders think this way? Methinks…almost certainly not! So, it opens up a whole new world of communication/education/relationship-management between the analyst and stakeholders!
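One way to make the list above concrete is a back-of-the-envelope expected-value-of-information calculation. Every number below is invented purely for illustration — this is a sketch of the reasoning pattern, not anyone’s actual method:

```python
# Sketch of "is the value of reduced uncertainty worth the cost of the
# data?" -- all numbers are invented for illustration.

# Decision: launch variant B or keep variant A.
p_b_better = 0.6           # current belief that B beats A
gain_if_b_better = 50_000  # payoff of launching B when it truly is better
loss_if_b_worse = -30_000  # payoff of launching B when it is actually worse

# Acting now on the current belief: launch, or do nothing, whichever is better.
ev_launch_now = p_b_better * gain_if_b_better + (1 - p_b_better) * loss_if_b_worse
ev_now = max(ev_launch_now, 0)

# With *perfect* information, we would launch only when B really is better.
ev_with_perfect_info = p_b_better * gain_if_b_better

# The most that any additional data (e.g. a longer test) could be worth:
evpi = ev_with_perfect_info - ev_now
cost_of_test = 5_000  # analyst time, delayed launch, traffic "spent" testing

print(f"EV now: {ev_now:.0f}, EVPI: {evpi:.0f}, worth testing: {evpi > cost_of_test}")
```

If the most the extra data could possibly be worth (the EVPI) is less than what collecting it costs, the “correct” analytical answer is to stop analyzing and decide — which is exactly the mindset shift the list describes.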

For this area, I’ll just leave it at, “There are some deeper fundamentals that are either critical to understand *or* something that can be kicked down the road a bit.” I don’t know which it is!

## What Do You Think?

It’s taken me over a year to slowly recognize that this list exists. Hopefully, whether you’re a digital analyst dipping your toe more deeply into statistics or a data scientist who is wondering why you garner blank stares from your digital analytics colleagues, there is a point or two in this post that made you think, “Ohhhhh! Yeah. THAT’s where the confusion is.”

If you’ve been trying to bridge this divide in some way yourself, I’d love to hear what of this post resonates, what doesn’t, and, perhaps, what’s missing!

**Donal Phipps** (July 5th, 2017)

Enjoyed this post, Tim. Being a simple, pragmatic sort, the biggest gap for me right now is the establishment & sharing of repeatable, practical applications of stats to digital analytics. Exactly the kind of thing you’re looking to establish with dartistics.com.

I spend far too long hunting down the right applications for methods. A shared ‘cookbook’ of approaches, similar to programming patterns, would, I believe, get stats into the mainstream of digital analytics and allow crowdsourcing of resolutions to challenges in data structure, types, and the like.

**Alexandros Papageorgiou** (July 5th, 2017)

Enjoyed reading the post, thank you for sharing your thoughts.

I think as web analysts we need to be able to judge when to use the aggregate quantities (which do have the advantage of being easier to get and communicate) but also when it makes more sense to look at user- or session-level data instead for additional insight. Just like you say, most statistical methods require individual observations, and that’s also the case for most machine learning models. With aggregate data, you leave out all the underlying variability, which is a key ingredient in how most of these methods work.

Regarding sampling, it’s true that statisticians use it and refer to it all the time, but I think this is more a question of necessity than choice. Think polls, or quality control systems, where examining the whole population would either be impossible or too expensive. In this regard, we are lucky that we work in a digital world where “the whole” is available. That said, working with a sample can make sense as well: for example, if you want to train an ML model to predict user conversion on your site, using a year’s worth of server log data would likely be overkill, so why not sample instead?

Another thing to consider when using stats in a web analytics context is that many of the traditional stat tests are considered invalid when the observations depend on time, due to things such as autocorrelation and periodicity….enter time series analysis!

**Tim Wilson** (July 5th, 2017, author’s reply)

Thanks for chiming in!

Time-series! Time-series! I did sort of leave that out, didn’t I? As web analysts, we just accept that all of our data has a time component, and then we get confused (scared) when we learn that that introduces a whole other level of complexity. Not that we’re not used to doing comparisons to the same period last year and other means of trying to simply “account for time.” But, yeah — my mind was blown when I first got exposed to and understood time-series decomposition… and I don’t think that’s something most web analysts think about.

On the “sample vs. population” point, I think I’ve always thought the way you describe it — we’re inherently in “better” shape with digital data because we have more of a population. But, isn’t that overly simplistic? Specifically, when it comes to trying to build a model, isn’t there a case of having too much data? Isn’t that what leads to overfitting, which can actually wind up being both unnecessarily complicated (an artifact) and less useful than a model that is trained on one sample and then tested on another?

**Tim Wilson** (July 5th, 2017, author’s reply)

That’s where I lean as well: “Let’s compile recipes that have very clearly delineated applications and boundaries… and worry about truly grokking the fundamental concepts after we’ve exhausted what we can do there.” The kicker is that there are several folks whose opinions I deeply respect who take the opposite view!

**Alexandros Papageorgiou** (July 6th, 2017)

I am not a statistician, but, just like you, I have taken courses and done my own research trying to find some answers. Anyway, here’s my take on the sampling question.

First thing to ask ourselves: what’s the objective?

If it’s about making inferences (which I think of as trying to find the key factors contributing to an outcome and interpreting the whole process), then I wouldn’t worry about training, testing, and overfitting. If my data are not of massive scale, I would use them as they are. But I am conscious that my model only serves to explain the current state of things, and I am not going to pretend that I know what will happen tomorrow.

This leads to the other objective, which is about prediction. Now what I care about is getting the prediction right. I don’t know and cannot answer how and why the model works, but as long as it does, I am happy. This is why I need to prove that this is the case using unseen test data, which also helps me fine-tune my model.

Would I sample in that case first and then split into test and train? Probably yes. I believe that, as the sample size increases, after some point there will only be minimal improvements in the quality of the model, while the time to train the model increases at a high rate… so time to think about that opportunity cost!

**Alexandros Papageorgiou** (July 6th, 2017)

Thanks for surfacing those interesting topics in your blog posts and also for the dartistics work.

**Tim Wilson** (July 7th, 2017, author’s reply)

I like this framing! It has me thinking there are a couple of other issues (or sub-issues) that “we” need to figure out how to address:

- What is data of “massive scale?” I think I wind up with “data that is NOT” and “data that IS,” but I don’t have a clear idea as to where the line between the two is. AND… when I need to care. 🙂

- The “what is the objective” point is a great one, but does that not get to a confusion that analysts need to be clear on first, but that stakeholders often are not clear on themselves? My sense is that marketers (and analysts — myself included) often treat “explaining the current state” as being roughly the same as “prediction” (“If I understand the drivers of the current state, then I can expect that, if I influence those drivers, I can predict the future state.”). I think this is actually another lens through which to view my final point in the original post — that we (analysts and marketers) don’t have clarity around what we’re doing with the data and what we can reasonably expect from it.

Thanks for your thoughtful comments. They help!

**Stefan Poninghaus** (July 12th, 2017)

Great to see this conversation happening. I think that the major challenge in digital analytics is the lack of formalized statistical approaches to solving problems, so it’s great to see this post as well as DARTISTICS and the effort of all involved. I personally have had formal data science training (a master’s), but my experience is within digital analytics. I did my master’s thesis on atomic data from Google Analytics to develop predictive models (with out-of-sample testing) for understanding user behaviour.

Regarding your comment about exporting samples, I find that it is most effective to export all the data (in an atomic format) and create train, test, and validation samples with the analysis tool (e.g. Python’s `sklearn.model_selection.train_test_split`).

Google Analytics’ problem is that it does not simply sample; it samples with imputation. For example, a transaction report that is based on a 25% sample will return 25% of the transaction IDs, each with 4 transactions, so that the aggregate sums stay near the same. This can make Google Analytics sampling extremely misleading; however, the choice makes sense in a KPI frame of mind.

In my experience, I have found a major misalignment in the paradigm of the data. Digital analytics is extremely KPI-driven and does not have a great deal of descriptive data (beyond pageviews, time on page, session duration, etc.). In contrast, statistical methods primarily use descriptive data to find relationships against a given outcome. One of the most famous datasets is Fisher’s Iris dataset, which contains 4 measurements of a flower and its species.

The problem with this misalignment is that it can lead to difficulty in interpreting results and taking action. If an add-to-cart has a stronger effect on conversion than the homepage, what is the outcome? Get more people to add to cart? We knew that already. At this point in time, I think this is our biggest challenge to overcome.