Using Excel to Count Text Occurrences
[UPDATE 1/19/2015: A couple of comments have pointed out that COUNTIF would address this scenario in a single formula. That’s a great point…and one that had not occurred to me. The overall scenario here is pretty straightforward, so there are likely other equally efficient (or more efficient) ways to address the task. I’m leaving the post as is, because I think it’s a useful exercise on how nested Excel formulas can be used to parse text. And I’m also curious what other solutions might get proposed in the comments.]
I had this come up a couple of weeks ago with a client, and I realized it was something I’d done dozens of times…but had never written down the “how” on doing. So, here we go. This is a post about one very specific application of Excel, but it is also implicitly a post about how, with an intermediate level of knowledge of Excel, with a little bit of creativity, and a strong aversion to manually parsing/copying/pasting anything, a spreadsheet can accomplish a lot! And very quickly!
The Use Case
The use case where I’ve used this approach most often is with social media exports — most often, with Twitter. In the most recent situation, my client had an export of all tweets that used a specific conference hashtag. Her organization was trying to introduce a secondary (relevant) topic to the conversation around the conference, and they had a separate hashtag. So, she was looking to identify, from the 16,000 tweets at the event, what percent of them also included the hashtag that her organization was interested in? That’s a simple and reasonable ask, and, if the tweet volume is reasonable (let’s say less than 500,000), easy enough to do in under 2 minutes in Excel.
The Example
Obviously, I’m not going to use my client’s data here. But, it turns out that my own tweets are a reasonable proxy. I tweet sporadically, but I know that a decent chunk of my tweets use the “#measure” hashtag. So, how many of my tweets use that hashtag? Thanks to http://analytics.twitter.com, it’s easy enough for me to get an export of my tweets. I just exported the default, which was 1,356 tweets going back to early October 2013. Opening the .csv in Excel, it looks like this:
Simple enough. I just want to go through and add a flag to identify every row where column C contains the word “#measure.”
Step 1: Make It a Table and Add a Column for the Flag
This step isn’t strictly necessary, but Excel tables make soooooo many things more easy, that I’m including it here. If you’re, like, “What are you talking about? Isn’t data in rows and columns in a spreadsheet a ‘table’ already?” well… stop reading this post and go read this one. It’s two clicks to make the data into a table, so do that…and add a column where we’re going to put our flag:
Simple enough. I just want to go through and add a flag to identify every row where column C contains the word “#measure.”
Step 2: Use FIND() to Look for ‘#measure’
This is the core formula. All we need to do is add the FIND() formula to the rows in the first column to search column D (“[@[Tweet text]]”) for occurrences of “#measure:”
Once we add that formula to cell A2, it will autofill for all rows in the table and the table will now look like this:
That’s kind of ugly, isn’t it? But we now know that rows 7, 8, 12, and 19 all included the word “#measure,” because the FIND() formula tells us where in the cell the word started. All of the other rows didn’t include the word “#measure,” so they returned a #VALUE error.
The bulk of the work is done…but we’re not quite there yet, because we don’t yet have a pure “flag.”
Step 3: Use ISERROR() to Make a Flag
We can nest our original FIND() formula inside an ISERROR() formula. If we do that, then all of the #VALUE values will instead show as “TRUE,” and all of the situations where the FIND() formula returns an actual number will show as “FALSE.”
The resulting table:
<Whew> Isn’t that cleaner? Now, every value is either “TRUE” or “FALSE,” so we now have a true “flag.” But, this flag is a little confusing, because it’s “FALSE” whenever the tweet contains the hashtag “#measure.” That may be fine if we can just keep that straight and jump straight to step 5, but why not make it a bit more intuitive with one additional update to the formula?
Step 4: Use IF() to Flip the Flag
Since our ISERROR() is going to return a TRUE/FALSE response, we can nest the whole formula in an IF() statement to make those flags into a Yes/No flag that makes more intuitive sense:
The IF returns “Yes” instead of “FALSE” and returns “No” instead of “TRUE.” Not necessary for this exercise, but I went ahead and added a little conditional formatting to highlight the rows that include ‘#measure’ (based on whether the Column A value is “Yes”):
Step 5: A Case-Sensitivity Precaution
In this example, all of the tweets are my own, and I always use an all-lowercase “#measure.” But “FIND” is case-sensitive, so, what if I had used “#Measure” a few times? Or “#MEASURE?” Those would be mis-flagged using the above approach. So, it’s worth one more tweak to the formula to force the entire tweet to be all-lowercase before running the FIND() formula on it:
Note how the LOWER() addition is inside the FIND() function. Since Excel uses parentheses like plain old math does, the innermost parentheses (functions) will get evaluated first, and the first thing we want to do is make the tweet text all lowercase.
Step 6: Summarize with a Pivot Table
There are lots of ways this data could be summarized. You could just sort the table descending by the first column and see what row the last “Yes” occurs on. You could have used “1” and “0” rather than “Yes” and “No” in the formula and then just summed column A.
But, I’m never one to miss an opportunity to apply a pivot table. In a handful of clicks, we get our summary:
Voila! 14% of the tweets in the data set included the string “#measure” (regardless of case usage).
In Reality: Six Steps Were Three
When I most recently did this for a client, it wasn’t really six steps. It was three: 1) create the table, 2) plug in a formula, 3) generate a pivot table. But, I realize that just throwing out =IF(ISERROR(FIND(“#measure”,”LOWER([@[Tweet text]]))),”No”,”Yes”) can be a little intimidating. I do regularly iterate “from the inside out” when building formulas. The result can look messy, but not as messy as manually inspecting tweets!
Now…chime in with the other 10 ways this exercise could have been approached entirely differently!