# Using Excel to Count Text Occurrences

**Tim Wilson**on

**January 19, 2015**

*[UPDATE 1/19/2015: A couple of comments have pointed out that COUNTIF would address this scenario in a single formula. That’s a great point…and one that had not occurred to me. The overall scenario here is pretty straightforward, so there are likely other equally efficient (or more efficient) ways to address the task. I’m leaving the post as is, because I think it’s a useful exercise on how nested Excel formulas can be used to parse text. And I’m also curious what other solutions might get proposed in the comments.]*

I had this come up a couple of weeks ago with a client, and I realized it was something I’d done dozens of times…but had never written down the “how” on doing. So, here we go. This is a post about one very specific application of Excel, but it is also implicitly a post about how, with an intermediate level of knowledge of Excel, with a little bit of creativity, and a strong aversion to manually parsing/copying/pasting *anything*, a spreadsheet can accomplish a lot! And very quickly!

### The Use Case

The use case where I’ve used this approach most often is with social media exports — most often, with Twitter. In the most recent situation, my client had an export of all tweets that used a specific conference hashtag. Her organization was trying to introduce a secondary (relevant) topic to the conversation around the conference, and they had a separate hashtag. So, she was looking to identify, from the 16,000 tweets at the event, what percent of them also included the hashtag that her organization was interested in? That’s a simple and reasonable ask, and, if the tweet volume is reasonable (let’s say less than 500,000), easy enough to do in under 2 minutes in Excel.

### The Example

Obviously, I’m not going to use my client’s data here. But, it turns out that my own tweets are a reasonable proxy. I tweet sporadically, but I know that a decent chunk of my tweets use the “#measure” hashtag. So, *how many* of my tweets use that hashtag? Thanks to http://analytics.twitter.com, it’s easy enough for me to get an export of my tweets. I just exported the default, which was 1,356 tweets going back to early October 2013. Opening the .csv in Excel, it looks like this:

Simple enough. I just want to go through and add a flag to identify every row where column C contains the word “#measure.”

### Step 1: Make It a Table and Add a Column for the Flag

This step isn’t strictly necessary, but Excel tables make *soooooo* many things more easy, that I’m including it here. If you’re, like, “What are you talking about? Isn’t data in rows and columns in a spreadsheet a ‘table’ already?” well… stop reading this post and go read this one. It’s two clicks to make the data into a table, so do that…and add a column where we’re going to put our flag:

Simple enough. I just want to go through and add a flag to identify every row where column C contains the word “#measure.”

### Step 2: Use FIND() to Look for ‘#measure’

This is the core formula. All we need to do is add the FIND() formula to the rows in the first column to search column D (“[@[Tweet text]]”) for occurrences of “#measure:”

Once we add that formula to cell A2, it will autofill for all rows in the table and the table will now look like this:

That’s kind of ugly, isn’t it? But we now know that rows 7, 8, 12, and 19 all included the word “#measure,” because the FIND() formula tells us where in the cell the word started. All of the other rows didn’t include the word “#measure,” so they returned a #VALUE error.

The bulk of the work is done…but we’re not quite there yet, because we don’t yet have a pure “flag.”

### Step 3: Use ISERROR() to Make a Flag

We can nest our original FIND() formula inside an ISERROR() formula. If we do that, then all of the #VALUE values will instead show as “TRUE,” and all of the situations where the FIND() formula returns an actual number will show as “FALSE.”

The resulting table:

<Whew> Isn’t that cleaner? Now, every value is either “TRUE” or “FALSE,” so we now have a true “flag.” But, this flag is a little confusing, because it’s “FALSE” whenever the tweet *contains* the hashtag “#measure.” That may be fine if we can just keep that straight and jump straight to step 5, but why not make it a bit more intuitive with one additional update to the formula?

### Step 4: Use IF() to Flip the Flag

Since our ISERROR() is going to return a TRUE/FALSE response, we can nest the whole formula in an IF() statement to make those flags into a Yes/No flag that makes more intuitive sense:

The IF returns “Yes” instead of “FALSE” and returns “No” instead of “TRUE.” Not necessary for this exercise, but I went ahead and added a little conditional formatting to highlight the rows that include ‘#measure’ (based on whether the Column A value is “Yes”):

### Step 5: A Case-Sensitivity Precaution

In this example, all of the tweets are my own, and I always use an all-lowercase “#measure.” But “FIND” is case-sensitive, so, what if I had used “#Measure” a few times? Or “#MEASURE?” Those would be mis-flagged using the above approach. So, it’s worth one more tweak to the formula to force the entire tweet to be all-lowercase before running the FIND() formula on it:

Note how the LOWER() addition is *inside* the FIND() function. Since Excel uses parentheses like plain old math does, the innermost parentheses (functions) will get evaluated first, and the first thing we want to do is make the tweet text all lowercase.

### Step 6: Summarize with a Pivot Table

There are lots of ways this data could be summarized. You could just sort the table descending by the first column and see what row the last “Yes” occurs on. You could have used “1” and “0” rather than “Yes” and “No” in the formula and then just summed column A.

But, I’m never one to miss an opportunity to apply a pivot table. In a handful of clicks, we get our summary:

Voila! 14% of the tweets in the data set included the string “#measure” (regardless of case usage).

### In Reality: Six Steps Were Three

When I most recently did this for a client, it wasn’t really six steps. It was three: 1) create the table, 2) plug in a formula, 3) generate a pivot table. But, I realize that just throwing out **=IF(ISERROR(FIND(“#measure”,”LOWER([@[Tweet text]]))),”No”,”Yes”)** can be a little intimidating. I do regularly iterate “from the inside out” when building formulas. The result can look messy, but not as messy as manually inspecting tweets!

Now…chime in with the other 10 ways this exercise could have been approached entirely differently!

GilesJanuary 19th, 2015Now…chime in with the other 10 ways this exercise could have been approached entirely differently!

–> How about =COUNTIF([range],”*#measure*”)

Andy BattenJanuary 19th, 2015Ditto; I find COUNTIF/IFS with wildcard match to be easier since it won’t produce an error, thus you can eliminate a formula from the example above.

Tim WilsonJanuary 19th, 2015Author's ReplyI knew there was a good chance I’d learn something from the comments on this post! That’s definitely a faster solution for the counting. And, it’s not case-sensitive, which is good.

It’s a little harder to spot-check the formula, although, I suppose, it would just take initially applying the formula to the first 20 rows and manually checking that you get the same count.

I often want to grab the actual tweets themselves and manually inspect them…but I could that with a simple filter of the table for a “Text contains #measure.”

I’m having trouble making a strong case for the “flag” approach. I’m going to add an update…but then leave the base post and see what other approaches crop up in the comments.

Thanks!

randyzwitchJanuary 28th, 2015Can we use Python?

tweets = [“#Measure”, “#measure”, “randy”, “is”, “awesome”, “measure”, “#MeaSure”, “Tim”]

>>> occurrences = [x for x in tweets if x.lower() in “#measure”]

[‘#Measure’, ‘#measure’, ‘measure’, ‘#MeaSure’]

>>> len(occurrences)

4

Or if it has to be a single line:

>>> len([x for x in tweets if x.lower() in “#measure”])

4

Tim WilsonJanuary 28th, 2015Author's ReplyAs soon as you tell me that I’m more likely to be able to fully grok Python than I was able to comprehend R, Randy, I’ll track down a Coursera course and have at it. I aced the R course…but still wasn’t able to actually apply it.

randyzwitchJanuary 28th, 2015Python is easier than this nonsense:

> tweets sum(unlist(sapply(tweets, function(x) grep(“#measure”, x, ignore.case = TRUE))))

[1] 3

stephensuttonFebruary 15th, 2015Consider IFERROR as an alternative to ISERROR.

Tim WilsonFebruary 15th, 2015Author's ReplyIf I used IFERROR, it would actually return the position of the string when the string existed, and I was looking for more of a binary flag. Right?

atteboyskieMarch 5th, 2015Following your approach, this would be a little simpler:

=IFERROR(IF(SEARCH(“#measure”, E4, 1)>0, “Yes”, ),”No”)

The “SEARCH()” function works the same way as “FIND()” but isn’t case sensitive, so you could drop the “LOWER()” bit.

Or easier still, you could just filter the data (ctrl+shift+l), click the arrow at the top of the column with the tweets and search for “#measure” (again, not case sensitive). Then you could just select all of the cells in the column and look at the count at the bottom (or enter “Yes” in a column out to the side and drag it down to tag all of the cells that contain the value).

Tim WilsonMarch 5th, 2015Author's ReplyWell…those are *both* better approaches, I think. And…er…I think 95% of my “FIND()” usage going forward is going to become SEARCH() usage.

Thanks!