So, R We Ready for R with Adobe Analytics?
A couple of weeks ago, I published a tutorial on getting from “you haven’t touched R at all” to “a very basic line chart with Google Analytics data.” Since then, I’ve continued to explore the platform — trying like crazy to not get distracted by things like the free Watson Analytics tool (although I did get a little distracted by that; I threw part of the data set referenced in this post…which I pulled using R… at it!). All told, I’ve logged another 16 hours in various cracks between day job and family commitments on this little journey. Almost all of that time was spent not with Google Analytics data, but, instead, with Adobe Analytics data.
The end result? Well (for now) it is the pretty “meh” pseudo-heatmap shown below. It’s the beginning of something I think is pretty slick…but, for the moment, it has a slickness factor akin to 60 grit sandpaper.
What is it? It’s an anomaly detector — initially intended just to monitor for potential dropped tags:
- 12 weeks of daily data
- ~80 events and then total pageviews for three different segments
- A comparison of the total for each metric on the most recent day against the median absolute deviation (MAD) for that event/pageviews for the same day of the week over the previous 12 weeks
- “White” means the event has no data. Green means that day of the week looks “normal.” Red means that day of the week looks like an outlier that is below the typical total for that day. Yellow means that day of the week looks like an outlier that is above the typical total for that day (yellow because it’s an anomaly…but unlikely to be a dropped tag). A rough sketch of the check behind each cell follows this list.
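To make that comparison concrete, here is a minimal sketch of the kind of check each cell in the heatmap represents. It is not my actual script: the function name, inputs, and the 3× threshold are all just illustrative.

```r
# Flag whether the most recent day's total for a metric is an outlier, based on
# the median absolute deviation (MAD) of the same day of week over prior weeks.
flag_anomaly <- function(history, latest, threshold = 3) {
  # history: the metric's totals for the same weekday over the previous 12 weeks
  # latest:  the most recent day's total for that metric
  med     <- median(history, na.rm = TRUE)
  mad_val <- mad(history, na.rm = TRUE)  # base R mad() is scaled to be comparable to sd()
  if (mad_val == 0) return("normal")     # flat metric: nothing sensible to compare against
  deviation <- (latest - med) / mad_val
  if (deviation < -threshold) {
    "low outlier"    # red: well below the typical total (possible dropped tag)
  } else if (deviation > threshold) {
    "high outlier"   # yellow: well above the typical total
  } else {
    "normal"         # green
  }
}

# Example: a metric that usually runs ~1,000 on this weekday but came in at 400 today
flag_anomaly(c(980, 1010, 1040, 995, 1005, 990, 1020, 1000, 985, 1015, 1030, 1002), 400)
#> [1] "low outlier"
```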
On my most generous of days, I’d give this a “C” when it comes to the visualization. But, it’s really a first draft, and a final grade won’t be assessed until the end of the semester!
It’s been an interesting exercise so far. For starters, this is a less attractive, clunkier version of something I’ve already built in Excel with Report Builder. There is one structural difference between this version and the Excel version: the Excel version used the standard deviation for the metric (for the same day of week over the previous 12 weeks) to detect outliers, because MAD calculations in Excel require array formulas that completely bogged down the spreadsheet. I’m not really scholastically equipped to judge…but, from the research I’ve done, I think MAD is the better approach (which is why I decided to tackle this with R in the first place — I knew R could handle MAD calculations with ease).
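For a quick, made-up illustration of why MAD holds up better than the standard deviation when the history itself contains a wild day:

```r
# Twelve weeks of a fairly stable daily metric, with one anomalous spike in the history
x <- c(1000, 1020, 990, 1010, 1005, 995, 1015, 1000, 985, 1025, 1010, 5000)

sd(x)   # ~1153 -- the single spike inflates the standard deviation enormously
mad(x)  # ~14.8 -- the MAD barely budges, so genuine outliers still stand out
```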
What have I learned along the way (so far)? Well:
- RSiteCatalyst is a kind of awesome package. Given the niche that Randy Zwitch developed it to serve, and given that I don’t think he’s actively maintaining it…I had data pulling into R within 30 minutes of installing the package. [Update 04-Feb-2016: See Randy’s comment below; he is actively maintaining the package!]
- Adobe has fewer security hoops to jump through to get data than Google does. I just had to make sure I had API access and then grab my “secret” from Adobe, and I was off and running with the example scripts (a minimal version of that pattern follows this list).
- ggplot2 is. A. Bear! This is the R package that is the de facto standard for visualizations. The visualization above was the second real visualization I’ve tried with R (the first was with Google Analytics data and some horizontal bar charts), and I have yet to grok it even in part (much less grok it in full). From doing some research, the jury’s still out (for me) on whether, once I’ve fully internalized themes and aes() and coord_flip() and hundreds of other minutiae, I’ll feel like that package (and various supplemental packages) really gives me the visualization control that I’d want (a stripped-down sketch of the heatmap’s core follows this list). Stay tuned.
- Besides ggplot2, R, in general, has a lot of nuances that can be very, very confusing (“Why isn’t this working? Because it’s a factor? And it shouldn’t be a factor? Wha…? Oh. And what do you mean that number I’m looking at isn’t actually a number?!”). These nuances clearly make intuitive sense to a lot of people…so I’m continuing to work on my intuition (an example of the factor trap is below the list).
- Like any programming environment (and even like Excel…as I’ve learned from opening many spreadsheets created by others), there are grotesque and inefficient ways to do things in R. The plot above took me 199 lines of code (including blanks and extensive commenting). That’s really not that much, but I think I should be able to cut it down by 50% easily. If I gave it to someone who really knows R, they would likely cut it in half again. If it works as is, though, why would I want to do that? Well…
- …because this approach has the promise of being super-reusable and extensible. To refresh the above, I click one button. The biggest lag is that I have to make 6 queries of Adobe (there’s a limit of 30 metrics per query). It’s set up such that I have a simple .csv where I list all of the events I want to include, and the script just grabs that and runs with it (see the sketch after this list). That’s powerful when it comes to reusability. IF the visualization gets improved. And IF it’s truly reusable because the code is concise.
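To put the RSiteCatalyst and “fewer security hoops” bullets in concrete terms, the basic pull pattern looks roughly like this (the report suite ID, metric IDs, and dates are placeholders, so check the argument names against the package documentation rather than trusting my memory):

```r
library(RSiteCatalyst)

# Authenticate with the Web Services user name and shared secret from the
# Adobe Analytics admin console
SCAuth("user:company", "shared_secret")

# Pull daily totals for a few metrics from one report suite
daily_totals <- QueueOvertime(
  reportsuite.id   = "myreportsuite",
  date.from        = "2015-11-09",
  date.to          = "2016-01-31",
  metric           = c("pageviews", "event1", "event2"),
  date.granularity = "day"
)

head(daily_totals)
```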
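On the ggplot2 bullet: the core of a tile-style heatmap like the one above is actually a small amount of code; it’s the ordering, labeling, and theming around it that eats the hours. A stripped-down sketch, with made-up data frame and column names:

```r
library(ggplot2)

# plot_data is assumed to have one row per metric/day combination:
#   metric (character), date (Date), status ("no data", "normal", "low", or "high")
ggplot(plot_data, aes(x = date, y = metric, fill = status)) +
  geom_tile(colour = "grey90") +
  scale_fill_manual(values = c("no data" = "white",
                               "normal"  = "green",
                               "low"     = "red",
                               "high"    = "yellow")) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 6))
```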
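And, as one example of the kind of nuance I mean, here is the factor trap that caught me more than once: a column that looks numeric but arrived as a factor doesn’t convert the way you’d expect.

```r
# A column that "looks" numeric but was read in as a factor
x <- factor(c("1500", "2300", "1800"))

as.numeric(x)                # 1 3 2 -- the underlying factor level codes, not the values!
as.numeric(as.character(x))  # 1500 2300 1800 -- convert to character first, then to numeric
```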
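Finally, the reusable part boils down to a pattern like the one below. The file name and column name are hypothetical, but the idea is to keep the metric list in a .csv and split it into query-sized chunks, since each Adobe query is capped at 30 metrics:

```r
library(RSiteCatalyst)

# Read the list of events/metrics to monitor from a simple one-column .csv
metric_list <- read.csv("metrics_to_monitor.csv", stringsAsFactors = FALSE)$metric_id

# Split the list into chunks of at most 30 metrics per API query
chunks <- split(metric_list, ceiling(seq_along(metric_list) / 30))

# One QueueOvertime() call per chunk; the resulting data frames then get combined
# and reshaped into the metric-by-day structure the heatmap needs
results <- lapply(chunks, function(m) {
  QueueOvertime(reportsuite.id   = "myreportsuite",
                date.from        = "2015-11-09",
                date.to          = "2016-01-31",
                metric           = m,
                date.granularity = "day")
})
```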
Clearly, I’m not done. My next steps:
- Clean up the underlying code
- Add some smarts that allow the user to adjust the sensitivity of the outlier detection
- Improve the visualization — at a minimum, remove all of the rows that have either no data or no outliers; beyond that, the red/green/yellow paradigm doesn’t really work, and I’d love to be able to drop sparklines into the anomaly-flagged cells to show the actual data trend for that day.
- Web-enable the experience using Shiny (click through on that link…then click the Get Inspired link. Does it inspire you? After playing with a visualization, check out how insanely brief the server.R and ui.R files are — that’s all the code there is for those examples!). A skeletal sketch of what that could look like follows this list.
- Start hitting some of the educational resources to revisit the fundamentals of the platform. I’ve been muddling through with extensive trial and error and Googling, but it’s time to bolster my core knowledge.
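As for the Shiny item above, a skeletal single-file app really is about this much code. The build_heatmap() function here is a hypothetical wrapper around the existing script, and the sensitivity slider ties back to the second bullet:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Anomaly Detector"),
  sliderInput("threshold", "Outlier sensitivity (MADs from the median):",
              min = 1, max = 6, value = 3, step = 0.5),
  plotOutput("heatmap")
)

server <- function(input, output) {
  output$heatmap <- renderPlot({
    # build_heatmap() is a hypothetical function wrapping the existing script,
    # re-run with the user-selected sensitivity threshold
    build_heatmap(threshold = input$threshold)
  })
}

shinyApp(ui = ui, server = server)
```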
And, partly inspired by my toying with Watson Analytics, as well as the discussion we had with Jim Sterne on the latest episode of the Digital Analytics Power Hour, I’ve got some ideas for other things to try that really wouldn’t be doable in Excel or Tableau or even Domo. Stay tuned. Maybe I’ll have an update in another couple of weeks.