Text analysis you can do right now to open up insights in your data

By Daniel Durling (Senior Data Scientist) 

This quarter I have been working on (amongst other things) a sentiment analysis project here at synvert TCM (formerly Crimson Macaw), so I have been thinking a lot about text analysis. I thought it might be useful to share with you all how to start analysing the text data you already have in your systems, building up from the first things to think about before moving on to more advanced techniques (in Part 2). 

We use many different computer languages here at synvert TCM. In my previous blog posts (here & here) we used SQL, and my colleagues have shown you Python code (e.g. Eddy’s post here). In this post I have chosen to use R. You can run all of the examples here on your own system and then tweak them to access your own text data. If you would like some help extracting or accessing the text data you have stored in your systems, please get in touch. 

My hope is that from these examples you will be able to start thinking about text as data; just as valuable as the numeric data you have, but almost certainly underutilised. 

“But I do not have text data” 

This is a common retort from customers when I bring up text analysis. In my experience it usually means the text fields in their databases are not being actively used for analysis. Many businesses (perhaps yours) request feedback from customers after a purchase, ask for additional information when reserving a table, or give their telephone operatives a free text field for extra details – somewhere for any and all information which doesn’t fit anywhere else. This is the kind of data I am talking about. 

For the longest time these free text fields have remained unanalysed. I will introduce some of the text analysis workflows you might be interested in, and hopefully some of them will be useful to you. 

You will be able to run the code I have provided, but I also hope you will take the concepts I am talking about and apply them to your own data.

Accessing some data  

In order to make this code useable I will be using copyright-free eBooks available as part of Project Gutenberg. This was inspired by one of my Data Science heroes, Julia Silge. Check out her website here (and buy her a coffee here). 

So let us get some data: 

library(tidyverse)
library(tidytext)
library(gutenbergr)

the_dunwich_horror <- gutenberg_download(50133,
                                         meta_fields = c("title", "author"),
                                         verbose = FALSE)

Having downloaded the data, let’s look at the first 15 rows of text: 

the_dunwich_horror %>%
  select(text, title) %>%
  head(15)

 

Word count 

The first thing I would do when looking at some free text is to see what the most common words are. This can be useful for getting an understanding of what is being recorded. You might also wish to group entries together in some way, be that by user ID, by location, or another way. We will look at ways to create groups (in our example we will create groupings based on chapters). 

the_dunwich_horror_clean <- the_dunwich_horror %>%
  mutate(linenumber = row_number(),
         # a new chapter starts on a line containing only a number
         chapter = cumsum(str_detect(text,
                                     regex("^[:digit:]+$",
                                           ignore_case = TRUE))))

We can now count the number of distinct words we have in our data. First, we can look at a simple count of all the words: 

the_dunwich_horror_clean %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

Perhaps you saw this coming, but in most unfiltered free text fields the most common words will be “stop words” – words which connect other words together but do not contain very much information on their own. So, we will filter them out and have another look.

the_dunwich_horror_clean %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words, by = "word") %>% 
  count(word, sort = TRUE)

Then we might want to group the count by chapter so we can see the top words in each chapter.

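One way to do this is to add the chapter column we created above to the count() call and then keep the most common words in each chapter with slice_max() – a minimal sketch (keeping five words per chapter is just an illustrative choice):

the_dunwich_horror_clean %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(chapter, word, sort = TRUE) %>%   # count words within each chapter
  group_by(chapter) %>%
  slice_max(order_by = n, n = 5) %>%      # keep the five most common words per chapter
  ungroup()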

Visualisations of counts – functional and expressive

We can also create visualisations of our counts, first as a traditional bar chart:

the_dunwich_horror_clean %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words, by = "word") %>% 
  count(word, sort = TRUE) %>% 
  slice_max(order_by = n, n = 10) %>% 
  ggplot(aes(n, reorder(word, n))) +
  geom_col() +
  labs(y = NULL,
       x = "number of occurances") +
  theme_bw()

And then as a word cloud:

library(wordcloud)     # provides the wordcloud() function
library(RColorBrewer)  # provides brewer.pal() for the colour palette

set.seed(1234) # for reproducibility

tokenised_dunwich <- the_dunwich_horror_clean %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

wordcloud(words = tokenised_dunwich$word,
          freq = tokenised_dunwich$n,
          scale = c(3.5, 0.25),
          min.freq = 7,
          max.words = 200,
          random.order = FALSE,
          rot.per = 0,
          colors = brewer.pal(8, "Dark2"))

TF-IDF score

The tf-idf (term frequency – inverse document frequency) score is a statistic used to highlight terms that are used within a given document but not used often within a wider collection (or corpus) of documents.

This score can be useful for understanding which words stand out in a given document / piece of text when compared to other documents / pieces of text.
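
As a rough sketch of the arithmetic involved (this mirrors what tidytext’s bind_tf_idf() does, using the natural logarithm; the tiny two-document word counts below are made up purely for illustration):

library(tidyverse)

# two made-up "documents" with word counts
toy_counts <- tribble(
  ~document, ~word,       ~n,
  "doc_a",   "the",       10,
  "doc_a",   "wizard",     4,
  "doc_b",   "the",       12,
  "doc_b",   "windmill",   3
)

n_documents <- n_distinct(toy_counts$document)

toy_counts %>%
  group_by(document) %>%
  mutate(tf = n / sum(n)) %>%               # term frequency: share of the words in that document
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(n_documents / n())) %>%  # ln(total documents / documents containing the word)
  ungroup() %>%
  mutate(tf_idf = tf * idf)                 # "the" scores 0; "wizard" and "windmill" score highest

Notice that a word appearing in every document gets an idf (and therefore a tf-idf) of zero, which is why we will not need to remove stop words for this calculation.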

In order to test this, we need more than one document to compare, so let’s bring in Don Quixote as well:

don_quixote <- gutenberg_download(996,
                                  meta_fields = c("title", "author"),
                                  verbose = FALSE)

don_quixote_clean <- don_quixote %>%
  mutate(linenumber = row_number(),
         # a new chapter starts on a line such as "CHAPTER I."
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]{1,}\\.$",
                                           ignore_case = TRUE))))

Now we have two documents we wish to compare. We can combine them into a single data.frame, count the words and then calculate the TF-IDF score. Notice that here we are not removing the stop words. This is important: we are trying to find the words which are unique to each document, and in order to do that we need to know which words are common to both documents – including the stop words.

We can then visualise the top 15 words from both books and compare:

the_dunwich_horror_clean %>% 
  bind_rows(don_quixote_clean) %>% 
  unnest_tokens(word, text) %>% 
  count(word, title, sort = TRUE) %>% 
  bind_tf_idf(word, title, n) %>% 
  arrange(desc(tf_idf)) %>% 
  group_by(title) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free") +
  labs(x = "tf-idf", y = NULL) +
  theme_bw()

These scores pull out the words that are used often in one document but not used in the other.

A quick note: you may see that some closely related words (for example singular and plural forms, or different endings of the same word) appear as distinct words. There are a few ways we can address this. We might deal with it at the analysis stage (for instance through a process of stemming), or we can go back to the data and edit it there. Alternatively, we could create an intermediate stage where we clean the data without editing the original data. These are all business decisions with trade-offs.
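
We will cover stemming properly in Part 2, but as a quick illustration of the idea, here is a minimal sketch using the SnowballC package (an assumption here – any stemming package would do):

library(SnowballC)  # assumed to be installed; provides wordStem()

# different endings of the same word reduce to a common "stem"
wordStem(c("look", "looked", "looking"))

All three words reduce to the stem “look”, so they would be counted together.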

Conclusion

There we have it: hopefully you can now see how to start using your free text data fields to get more information out of them. In the next part we will develop these ideas further: we will look at counting word pairs rather than just single words, at stemming (where we look at the “root” of a word, which is helpful for combining words with the same beginning, such as look, looked and looking) and, most relevant to the project I have been working on, at sentiment analysis.

Fancy chatting about text analysis? Please reach out here.