Exploring Site Search (with the help of R)
Last week, I wrote up 10 Arbitrary Takeaways from Superweek 2017 in Hungary. There was a specific non-arbitrary takeaway that wasn’t included in that list, but which I was pretty excited to try out. The last session before dinner on Wednesday evening of the conference was the “Golden Punchcard” competition, in which attendees were invited to share something they’d done, and the audience voted on the winner. The two finalists were Caleb Whitmore and Doug Hall, both of whom shared some cleverness centered around Google Tag Manager. This post isn’t about either of those entries!
Rather, one of the entrants who went fairly deep in the competition was Sébastien Brodeur, who showed off some work he’d done with R’s text-mining capabilities to analyze site search terms. He went on to post the details of the approach, the rationale, and the code itself.
The main idea behind the approach is that, with any sort of user-entered text, there is a lot of variation in exactly what gets typed. So, looking at the standard Search Terms report in Google Analytics (or at whatever reports are set up in Adobe or any other tool for site search) can be frustrating and, worse, somewhat misleading. What Sébastien did was use R to break out each individual word in the search terms and convert each one to its “stem.” That way, different variations of the same word could be collapsed into a single entry. From that, he made a word cloud.
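As a quick illustration of what stemming buys you, here’s a minimal sketch using the SnowballC package (one common stemmer in R; the example terms are mine, and I’m assuming English-language searches):

```r
library(SnowballC)   # provides wordStem(), a Snowball/Porter stemmer

# Different user-entered variations of the same underlying words...
terms <- c("analytics", "analytic", "Analytics", "demystified")

# ...collapse to shared stems (after normalizing case)
wordStem(tolower(terms), language = "english")
#> [1] "analyt"    "analyt"    "analyt"    "demystifi"
```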
I’ve now taken Sébastien’s code and extended it in a few ways (this is why open source is awesome!), including layering in an approach that I saw Nancy Koons talk about years ago, but which is still both clever and handy.
Something You Can Try without Coding
IF you are using Google Analytics, and IF you have site search configured for your site, then you can try out these approaches in ~2 minutes. The first/main thing I wanted to do with Sébastien’s code was web-enable it using Shiny. And, I’ve done that here. If you go to the site, you’ll see something that looks like this:
If you click Login with Google, you will be prompted to log in with your Google Account, at which point you can select an Account, Property, and View to use with the tool (none of this is being stored anywhere; it’s “temporal,” as fancy-talkers would say).
The Basics: Just a Google Analytics Report
Now, this gets a lot more fun with sites that have high traffic volumes, a lot of content, and a lot of searches going on. Trust me, I’ve got several of those sites as clients! But, I’m going to have to use a lamer data set for this post. I bet you can figure out what it is if you look closely!
For starters, we can just check the Raw Google Analytics Results tab. We’ll come back to this in a bit, but this is just a good way to see, essentially, the Search Terms report from within the Google Analytics interface:
<yawn>
This isn’t all that interesting, but it illustrates one of the issues with the standard report: the search terms are case-sensitive, so “web analytics demystified” is not the same as “Web Analytics Demystified.” This issue actually crops up in many different ways if you scroll through the results. But, for now, let’s just file away that this should match the Search Terms report exactly, should you choose to do the comparison.
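If you’d rather pull the same report yourself in R, a sketch along these lines should do it with a recent version of the googleAnalyticsR package (the view ID and date range are placeholders, and the lower-casing step at the end is my own addition, not necessarily what the app does internally):

```r
library(googleAnalyticsR)

ga_auth()  # interactive OAuth -- a browser window will pop up

view_id <- 123456789  # hypothetical View ID; swap in your own

search_terms <- google_analytics(
  view_id,
  date_range = c("2017-01-01", "2017-01-31"),
  metrics    = "searchUniques",
  dimensions = "searchKeyword",
  max        = -1   # fetch all rows, not just the first batch
)

# GA search terms are case-sensitive, so normalize before doing anything else
search_terms$searchKeyword <- tolower(search_terms$searchKeyword)
```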
Let’s Stem and Visualize!
The meat of Sébastien’s approach was to split out each individual word in the search terms, get its “stem,” and then make a word cloud. That’s what gets displayed on the Word Cloud tab:
You can quickly see that words like “analytics” get stemmed to “analyt,” and “demystified” becomes “demystifi.” These aren’t necessarily “real” words, but that’s okay: collapsing the different variations into a single stem is exactly what makes the counts useful.
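For the curious, the core of that tab boils down to something like the following. This is a condensed sketch of the idea rather than the app’s exact code, it counts each phrase once rather than weighting by search volume, and it assumes the `search_terms` data frame pulled earlier:

```r
library(SnowballC)
library(wordcloud)

# Split each search phrase into its individual words
words <- unlist(strsplit(search_terms$searchKeyword, "\\s+"))

# Stem each word so that variations collapse together
stems <- wordStem(words, language = "english")

# Count how often each stem occurs
stem_freq <- sort(table(stems), decreasing = TRUE)

# And visualize the result as a word cloud
wordcloud(names(stem_freq), as.numeric(stem_freq), min.freq = 3)
```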
Word Clouds Suck When an Uninteresting Word Dominates
It’s all well and good that, apparently, visitors to this mystery site (<wink-wink>) did a decent amount of searching for “web analytics demystified,” but that’s not particularly interesting to me. Unfortunately, those terms dominate the word cloud. So, I added a feature where I can selectively remove specific words from the word cloud and the frequency table (which is just a table view of what ultimately shows up in the word cloud):
As I enter the terms I’m not interested in, the word cloud regenerates with them removed:
Slick, right?
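Under the hood, that exclusion is nothing fancy. A sketch of the idea, with `exclude` standing in for whatever gets typed into the box (stemmed so it matches the cloud):

```r
# Stems the user has flagged as uninteresting
exclude <- wordStem(c("web", "analytics", "demystified"), language = "english")

# Drop them from the frequency table before regenerating the cloud
filtered_freq <- stem_freq[!(names(stem_freq) %in% exclude)]

wordcloud(names(filtered_freq), as.numeric(filtered_freq), min.freq = 3)
```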
My Old Eyes Can’t See the Teensy Words!
The site also allows adjusting the cutoff for how many times a particular term has to appear before it gets included in the word cloud. That’s just a simple slider control, shown here after I moved it from the default setting of “3” to a new setting of “8”:
That, then, changes the word cloud to remove some of the lower-volume terms:
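In code terms, the slider maps directly to the `min.freq` argument of `wordcloud()` (in the Shiny app, the value would come from an `input$` binding; the hard-coded 8 here is just for illustration):

```r
# Only show stems that appear at least 8 times
wordcloud(names(filtered_freq), as.numeric(filtered_freq), min.freq = 8)
```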
Now, we’re starting to get a sense of what terms are being used most often. If we want, we can hop over to the Raw Google Analytics Results tab and filter for one of the terms to see all of the raw searches that included it:
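That filtering, too, is just a pattern match against the raw data; a sketch (the stem of interest here is hypothetical):

```r
# All raw search phrases containing a stem of interest
term <- "demystifi"
raw_matches <- search_terms[grepl(term, search_terms$searchKeyword,
                                  ignore.case = TRUE), ]
head(raw_matches)
```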
What QUESTIONS Do Visitors Have?
Shifting gears quite drastically: as I was putting this whole thing together, I remembered an approach that I saw Nancy Koons present at Adobe Summit some years back, which she has since blogged about, as well as posted as an “analytics recipe” locked away in the DAA’s member area. With a little bit of clunky regex, I was able to add the Questions in Search tab, which filters the raw results down to search phrases that include the words who, what, why, where, or how. Those searches are way out on the long tail, but they are truly the “voice of the customer” and can yield some interesting results:
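The regex itself is the clunky-but-serviceable part. Something along these lines does the job (this is my reconstruction of the idea, not necessarily the exact pattern in the app):

```r
# Keep only search phrases that contain one of the question words
question_pattern <- "\\b(who|what|why|where|how)\\b"

questions <- search_terms[grepl(question_pattern,
                                search_terms$searchKeyword,
                                ignore.case = TRUE), ]
```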
Where Else Can This Go?
As you might have deduced, this started out as a quick “make it web-based and broadly usable” exercise, and it pretty quickly spiraled on me as I started adding features and capabilities that, as an analyst, helped me refine my investigation and look at the data from different angles. What struck me was how quickly I could add “new features” once I had the base code in place (and, since I was lifting the meat of the base code from Sébastien, that initial push only took me about an hour).
The code itself is posted on GitHub for anyone who wants to log a bug, grab it and use it for their own purposes, or extend it more broadly for the community at large. I doubt I’m finished with it myself; just as I’ve done with other projects, I suspect I’ll be porting it over to work with Adobe Analytics data. That won’t lend itself to a “log in and try it with your data” solution, though, since which eVar(s) and events are appropriate varies by implementation. But, there will be the added ability to go beyond search volume and dig into search participation for things like cart additions, orders, and revenue. And, perhaps, even mining the stemmed terms for high-volume / low-outcome results!
As always, I’d love to hear what you think. How would you extend this approach if you had the time?
Or, what about if you had the skills to do this work with R yourself? This post wasn’t really written as a sales pitch, but, if you’re intrigued by the possibilities and interested in diving more deeply into R yourself, check out the 3-day training on R and statistics for the digital analyst that will be held in Columbus, Ohio, in June.