Should Digital Analysts Become More Data Scientific-y?
The question asked in this post started out as a simpler (and bolder) question: “Is data science the future of digital analytics?” It’s a question I’ve been asking a lot, it seems, and we even devoted an episode of the Digital Analytics Power Hour podcast to the subject. It turns out, it’s a controversial question to ask, and the immediate answers I’ve gotten to the question can be put into three buckets:
- “No! There is a ton of valuable stuff that digital analysts can and should do that is not at all related to data science.”
- “Yes! Anyone who calls themselves an analyst who isn’t using Python, R, SPSS, or SAS is a fraud. Our industry is, basically, a sham!”
- “‘Data science’ is just a buzzword. I don’t accept the fundamental premise of your question.”
I’m now at the point where I think the right answer is…all three.
What Is Data Science?
It turns out that “data science” is no more well-defined than “big data.” The Wikipedia entry seems like a good illustration of this, as the overview on the page opens with:
Data science employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, operations research, information science, and computer science, including signal processing, probability models, machine learning, statistical learning, data mining, database, data engineering, pattern recognition and learning, visualization, predictive analytics, uncertainty modeling, data warehousing, data compression, computer programming, artificial intelligence, and high performance computing.
Given that definition, I’ll insert my tongue deeply into my cheek and propose this alternative:
Data science is a field that is both broad and deep and is currently whatever you want it to be, as long as it involves doing complicated things with numbers or text.
In other words, the broad and squishy definition of the term itself means it’s dangerous to proclaim with certainty whether the discipline is or is not the future of anything, including digital analytics.
But Data Science Is Still a Useful Lens
One way to think about digital analytics is as a field with activity that falls across a spectrum of complexity and sophistication:
I get that “Segmentation” is a gross oversimplification of “everything in the middle,” but it’s not a bad proxy. There are many, many analyses we do that, in the end, boil down to isolating some particular group of customers, visitors, or visits and then digging into their behavior, right? So, let’s just go with it as a simplistic representation of the range of work that analysts do.
Traditionally, web analysts have operated on the left and middle of the spectrum:
We may not love the “Basic Metrics” work, but there is value in knowing how much traffic came to the site, what the conversion rate was, and what the top entry pages are. And, in the Early Days of Web Analytics, the web analysts were the ones who held the keys to that information. We had to get it into an email or a report of some sort to get that information out to the business.
Over the past, say, five years, though, business users and marketers have become much more digital-data savvy, the web analytics platforms have become more accessible, and digital analysts have increasingly built automated (and, often, interactive) reports and dashboards that land in marketers’ inboxes. The result? Business users have become increasingly self-service on the basics:
So, what does that mean for the digital analyst? Well, it gives us two options:
- Just do more of the stuff in “the middle” — this is a viable option. There is plenty of work to be done and value to be provided there. But, there is also a risk that the spectrum of work that the analyst does will continue to shrink as the self-service abilities of the marketers (combined with the increasing functionality of the analytics platforms) grow.
- Start to expand/shift towards “data science” — as I’ve already acknowledged, there are definitional challenges with this premise, but let’s go ahead and round out the visual to illustrate this option:
So…You ARE Saying We Need to Become Data Scientists?
No. Well…not really. I’m claiming that there are areas of what many people would consider data science where digital analysts should consider expanding their skills. Specifically:
- Programming with Data — this is Python or R (or SPSS or SAS). We’re used to the “programming” side of analytics being on the data capture front — the tag management / jQuery / JavaScript / DOM side of things. Programming with data, though, means using text-based scripting and APIs to: 1) efficiently get richer data out of various systems (including web analytics systems), 2) combine that data with data from other systems when warranted, and 3) perform more powerful manipulations and analyses on that data. And…being more equipped to reuse, repurpose, and extend that work on future analytical efforts.
- Statistics — moving beyond “% change” to variance, standard deviation, correlation, t tests (one-tailed and two-tailed), one-way ANOVA, factorial ANOVA, repeated-measures ANOVA (which, BTW, I understand to be a potentially powerful tool for pre-/post- analyses), regression, and so on. Yes, the analytics and optimization platforms employ these techniques and try to do the heavy lifting for us, but that’s always seemed a little scary to me. It’s like the destined-to-fail analyst who, 2-3 years into their role, still doesn’t understand the basics of how a page tag captures and records data. Those analysts are permanently limited in their ability to analyze the data, and my sense is that the same can be said for analysts who rattle off the confidence level provided by Adobe Target without an intuitive understanding of what that means from a statistical perspective.
- (Interactive and Responsive) Data Visualization — programming (scripting) with data provides rich capabilities for visualizations to react to the data they are fed. A platform like R can take in raw (hit-level or user-level) data and determine how many “levels” a specific “factor” (dimension) has. If the data has a factor with four levels, that’s four values for a dimension of a visualization. If that factor gets refreshed and suddenly has 20 levels, then the same visualization — certainly much richer than anything available in Excel — can simply “react” and re-display with that updated data. I’m still struggling to articulate this aspect of data science and how it’s different from what many digital analysts do today, but I’m working on it.
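To make the statistics point a bit more concrete, here’s a minimal sketch (in Python, since the post says either language works) of what “moving beyond % change” looks like: a two-sample t-test computed from scratch, with made-up revenue-per-visit numbers. In practice, you’d reach for `scipy.stats.ttest_ind` or R’s `t.test()` — doing it by hand once is just a good way to build the intuition I’m arguing for.

```python
# A hand-rolled two-sample t-test (illustrative numbers only).
from statistics import mean, stdev

# Hypothetical revenue-per-visit samples for two traffic segments
paid = [5.1, 6.2, 4.8, 5.9, 5.4, 6.0, 4.9, 5.7, 5.3, 6.1]
nonpaid = [4.6, 5.0, 4.4, 5.2, 4.8, 4.9, 4.5, 5.1, 4.7, 5.0]

n1, n2 = len(paid), len(nonpaid)
m1, m2 = mean(paid), mean(nonpaid)
s1, s2 = stdev(paid), stdev(nonpaid)

# Pooled standard deviation (assumes roughly equal variances)
sp = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5

# The t statistic: the difference in means, scaled by its standard error
t = (m1 - m2) / (sp * (1 / n1 + 1 / n2) ** 0.5)

print(f"paid mean = {m1:.2f}, non-paid mean = {m2:.2f}, t = {t:.2f}")
# A |t| well above ~2 (at this sample size) suggests the difference is
# unlikely to be random noise; a t table turns it into a p-value.
```

That last scaling step — dividing by the standard error — is exactly what a raw “% change” comparison skips, and it’s why two segments can differ by 15% and still not be meaningfully different.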
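And here’s my best attempt so far at illustrating the “visualization reacts to the data” idea, again as a Python sketch with stdlib only (the field names and data are made up): instead of hard-coding the values of a dimension, derive them from the data itself.

```python
# Derive a factor's "levels" from the data rather than hard-coding them.
# The rows below are a stand-in for hit-level data pulled from an API.
from collections import Counter

hits = [
    {"channel": "organic", "pageviews": 3},
    {"channel": "paid", "pageviews": 5},
    {"channel": "organic", "pageviews": 2},
    {"channel": "email", "pageviews": 4},
    {"channel": "paid", "pageviews": 1},
]

# Discover the levels of the "channel" factor from the data itself --
# if tomorrow's pull has 20 channels, nothing below needs to change.
levels = sorted({row["channel"] for row in hits})

totals = Counter()
for row in hits:
    totals[row["channel"]] += row["pageviews"]

# A crude text "chart" that re-draws itself per level; a real version
# would hand `levels` and `totals` to ggplot2/Shiny or matplotlib.
for level in levels:
    print(f"{level:>8} | {'#' * totals[level]} ({totals[level]})")
```

Contrast that with an Excel chart, where adding a 21st channel to the source data usually means manually extending ranges and legend entries.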
So…You’re Saying I Need to Learn Python or R?
Yes. Either one. Or both. Your choice.
How’s That R Stuff Working Out for You in Practice?
I’ve now been actively working to build out my R skills since December 2015. The effort goes in fits and starts (and evenings and weekends), and it’s definitely a two-steps-forward-and-one-step-back process. But, it has definitely delivered value to my clients, even when they’re not explicitly aware that it has. Some examples:
- Dynamic Segment Generation and Querying — I worked on an analysis project for a Google Analytics Premium client where we had a long list of hypotheses regarding site behavior, and each hypothesis, essentially, required a new segment of traffic. The wrinkle was that we also wanted to look at each of those segments by device category (mobile/tablet/desktop) and by high-level traffic source (paid traffic vs. non-paid traffic). By building dynamic segment fragments that I could programmatically swap in and out with each other, I used R to cycle through and do a sequence of data pulls for each hypothesis (six queries per hypothesis: mobile/paid, tablet/paid, desktop/paid, mobile/non-paid, etc.). Ultimately, I just had R build out a big, flat data table that I brought into Excel to pivot and visualize…because I wasn’t yet at the point of trying to visualize in R.
- Interactive Traffic Exploration Tool — I actually wrote about that one, including posting a live demo. This wasn’t a client deliverable, but was a direct outgrowth of the work above.
- Interactive Venn Diagrams — I built a little Venn Diagram that I can use when speaking to show an on-the-fly visualization. That demo is available, too, including the code used to build it. I also pivoted that demo to, instead, pull web analytics data to visually illustrate the overlap of visitors to two different areas of a web site. Live demo? Of course!
- “Same Data” from 20 Views — this was also a Google Analytics project — or, string of projects, really — for a client that has 20+ brand sites, and each brand has its own Google Analytics property. All brands feed into a couple of “rollup” properties, too, but there have been a succession of projects where the rollup views haven’t had the data that we wanted to look at by site for all sites. I have a list of the Google Analytics view IDs for all of those sites, so I’ve now had many cases where I’ve simply adjusted the specifics of what data I need for each site and then kicked off the script.
- Adobe Analytics Documentation Template Builder — this is a script that was inspired by an example script that Randy Zwitch built to pull the configuration information out of Adobe Analytics and get it into a spreadsheet (using the RSitecatalyst package that Randy and Jowanza Joseph built). I wanted to extend that example to: 1) clean up the output a bit, and 2) actually bring in data for the report suite IDs so that I could easily scan through and determine not only which variables were enabled, but which ones had data and what that data looked like. I had an assist from Adam Greco as to what makes the most sense on the output there, and I’m confident the code is horrendously inefficient. But, it’s worked across three completely different clients, and it’s heavily commented and available for download (and mockery) on GitHub.
- Adobe Analytics Anomaly Detection…with Twitter Horsepower — okay…so this one isn’t quite built to where I want it…yet. But, it’s getting there! And, it is (will be!), I think, a good illustration of how programming with data can give you a big leg up on your analysis. Imagine a three-person pyramid, and I’m standing on top with a tool that will look for anomalies in my events, as well as in eVar/event combinations I specify (e.g., “Campaigns / Orders”), to find odd blips that could signal either an implementation issue (the initial use case) or some expected or unexpected change in key data. This…was what I think a lot of people expected from Adobe’s built-in Anomaly Detection when it rolled out a few years ago…but that requires specifying a subset of metrics of interest. Conceptually, though, I’m standing on top of a human pyramid and doing something similar. So, who am I standing on? Well, one foot is on the shoulder of RSitecatalyst (so, really, Randy and Jowanza), because I need that package to readily get the data that I want to use out of Adobe Analytics. My other foot is standing on…Twitter. The Twitter team built and published an R Anomaly Detection package that takes a series of time-series inputs and then identifies anomalies in that data (and returns them in a plot with those anomalies highlighted). That’s a lot of power! (I know…I’m cheating… I don’t have the publishable demo of this working yet.)
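The “dynamic segment fragments” trick from the first example above is easier to see in code than in prose. Here’s a minimal Python sketch of the idea (the hypothesis names and the filter syntax are made up for illustration — it is not real Google Analytics API syntax): build every hypothesis × device × traffic-source combination programmatically instead of hand-writing dozens of reports.

```python
# Generate every segment combination from reusable fragments.
from itertools import product

hypotheses = {
    "engaged-readers": "pagePath=~/blog/",
    "cart-abandoners": "pagePath=~/cart/",
}
devices = ["mobile", "tablet", "desktop"]
sources = {"paid": "medium==cpc", "non-paid": "medium!=cpc"}

queries = []
for (name, fragment), device, (src_name, src_fragment) in product(
    hypotheses.items(), devices, sources.items()
):
    queries.append({
        "label": f"{name}/{device}/{src_name}",
        "filters": f"{fragment};deviceCategory=={device};{src_fragment}",
    })

# 2 hypotheses x 3 devices x 2 sources = 12 queries, each of which would
# be handed to the API client and the results stacked into one flat table.
print(len(queries), "queries, e.g.:", queries[0]["label"])
```

Adding a new hypothesis is one new dictionary entry, and the whole grid of pulls regenerates itself — which is exactly the reuse-and-extend payoff I mentioned under “Programming with Data.”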
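For the anomaly detection example, Twitter’s R package uses a fairly sophisticated method (Seasonal Hybrid ESD), but the core intuition can be sketched in a few lines of stdlib Python: flag any day where a metric lands outside a band of ±3 standard deviations around a trailing window. The daily numbers here are made up, and this simplistic version — unlike the Twitter package — would get fooled by seasonality.

```python
# Simplified anomaly flagging: compare each day to a trailing window.
from statistics import mean, stdev

daily_orders = [100, 104, 98, 102, 99, 103, 101, 97, 100, 250, 102, 99]

WINDOW = 7  # trailing days used to establish the "normal" band
anomalies = []
for i in range(WINDOW, len(daily_orders)):
    window = daily_orders[i - WINDOW:i]
    m, s = mean(window), stdev(window)
    # Flag the day if it sits more than 3 standard deviations from
    # the trailing mean -- very unlikely under "normal" variation.
    if abs(daily_orders[i] - m) > 3 * s:
        anomalies.append((i, daily_orders[i]))

print("anomalous days:", anomalies)
```

Wrap that in a loop over every event and eVar/event combination pulled via the API, and you have the human-pyramid idea: the packages do the data retrieval and the statistics, and the analyst just points them at the metrics.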
What Are Other People Doing with R?
The thing about everything that I listed above is that…I’m still producing pretty lousy code. Most of what I do in 100 lines of code, someone who knows their way around R could often do in 10 lines. On the one hand, that’s not the end of the world — if the code works, I’m just making it a little slower and a bit harder to maintain. It is generally still much faster than doing the analysis through other means, and my computer has yet to complain about me feeding it inefficient code to run.
One of the reasons I suspect my code is inefficient is because more and more R-savvy analysts are posting their work online. For instance:
- Ryan Praskievicz from EY wrote up how to get started using R with Google Search Console.
- Becky West from LunaMetrics wrote about getting started with R and Google Analytics, including several simple and practical use cases.
- Jowanza Joseph posted on analyticsplaybook.org about creating interactive Google Analytics dashboards with R.
- Mark Edmondson from IIH Nordic has posted many, many sets of instructions and code for using R with web analytics.
- Andrew Duffle posted code to do cluster analysis…with web analytics data.
The list goes on and on, but you get the idea. And, of course, everything I listed above is for R, but there are similar examples for Python. Ultimately, I’d love to see a centralized resource for these (which analyticsplaybook.org may ultimately become), but it’s still relatively early days.
And I had no idea this post would get this long (but I’m not sure I should be surprised, either). What do you think? Are you convinced?