What is a word cloud?
Oh, yes, that's when I take a big bite of alphabet soup and then spit it out
because it's way too hot, right? No... no, not really. No. Not at all,
actually. A word cloud is a visual representation of text data in which
unique words are arranged in a shape or clustered together and sized by their
number of occurrences or importance. The viewer of a word cloud should quickly
get a sense of how many unique words exist in the text data, pull out the
important words, and see what categories may exist among the unique words.
Word clouds are a form of data visualization. They have been used since the
1990s as a means of site navigation for keyword metadata (tag clouds) to help
people find the pages they're searching for on a website. Today, word clouds
are used for any form of text data, and there are many websites that provide
word cloud services. What I found after looking through most of the word cloud
websites is that the output is typically delivered in static form. In my
words, "I have all this cool text data, but all I get from it is a lousy
static picture?!?" Maybe that works for most people, but I want more from my
text data.
Let's say I got my hands on some pretty cool text, or rather, "the most
freaking amazing text," by my wife's standards. That's right, I have the full
text of "Harry Potter and the Chamber of Secrets," the second novel in the
seven-novel Harry Potter series by the wonderful J.K. Rowling, in a text
file. And let's also say I wanted to turn that text data into a word cloud to
help me see what unique words are used and which of those unique words are
used the most throughout the story. I could easily go to one of the
top-searched word cloud sites and create one for free, if only for this
purpose.
Heck, I could even go to another top-searched word cloud site and create one
that removes commonly used English words and punctuation marks, e.g., and, is,
so, him, should, ?, !, (, ), etc. These are called "stop words." Some word
cloud websites provide the option to choose which words or characters to
suppress so that the word "the" isn't taking up half the word cloud like in
the image above. Now, in the image below, it can be seen that the most used
word throughout the book is clearly "Harry," with "Ron" and "Hermione"
trailing behind.
But these are only static images and my options for what I can do with the
word cloud are limited. What if I had more questions that I needed my word
cloud to address? What if I needed my word cloud to be interactive? What if I
wanted to host my word cloud online so that others could interact with it?
To solve these problems, I turned to Tableau.
But first I had to get a dataset, which was essentially an entire book in the
form of a text file, shaped and ready to go in Tableau. For that, I busted out
my wizarding wand and used some Notepad++ magic along with a keen eye as I
parsed out the data. What I wanted my data to ultimately look like was a
two-column table. One column would contain every word used in the book, one
word per row, and the other column would contain each word's sequence number,
i.e., its position in the running order of the book. I also made all words
lowercase so they'd be easier to read in the word cloud.
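For anyone who would rather script this step than do it by hand, here is a
minimal Python sketch of the same shaping. It is not how the dataset in this
post was built (that was Notepad++ and patience). The function name
build_word_table and the tokenizing rule are assumptions for illustration: a
word is a run of letters with optional straight apostrophes inside it, and
every other non-space character becomes its own one-character row.

```python
import re

def build_word_table(text):
    """Return (word, sequence) rows, one per token, in reading order."""
    # Assumption: letters (plus internal straight apostrophes) form a word;
    # any other non-space character, such as "." or "!", is its own token.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*|[^\sa-z]", text.lower())
    # Sequence numbers start at 1 and follow the order of the text.
    return [(word, seq) for seq, word in enumerate(tokens, start=1)]

if __name__ == "__main__":
    sample = "Harry looked at Ron. Ron didn't look back!"
    print("word\tsequence")
    for word, seq in build_word_table(sample):
        print(word, seq, sep="\t")
```

The tab-separated output can be saved as a text file and used as a data
source. Curly apostrophes are not handled by this rule and would be split
into separate rows.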
Once I had the data prepped and saved as a text file, I plugged the data into
Tableau and built my first word cloud. As can be seen below, this word cloud
displays all unique words or characters contained in the file, except for any
unique word or character occurring fewer than 10 times. I figured any word
occurring fewer than 10 times may either be a corrupt word or one that just
isn't very interesting or relevant to the story. Leaving those out also helps
dashboard performance. Each word is sized and color-categorized according to
its number of occurrences throughout the book. I've provided a way to filter
the unique words displayed in the word cloud by a minimum and maximum number
of occurrences. I've also enabled a word search function where one or more
words can be shown while all non-matching words are filtered out of the
visualization.
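Outside of Tableau, the same counting and filtering logic can be sketched in a
few lines of Python. This is only an illustration of the idea, not what
Tableau does internally; the function names and the toy word list are made up
for the example, and the default minimum of 10 mirrors the cutoff described
above.

```python
from collections import Counter

def filter_by_occurrences(words, minimum=10, maximum=None):
    """Count each unique word and keep those within [minimum, maximum]."""
    counts = Counter(words)
    return {
        word: n
        for word, n in counts.items()
        if n >= minimum and (maximum is None or n <= maximum)
    }

def search(counts, *terms):
    """Keep only the words that exactly match one of the search terms."""
    wanted = {term.lower() for term in terms}
    return {word: n for word, n in counts.items() if word in wanted}

if __name__ == "__main__":
    # Toy data, not real counts from the book.
    words = ["harry"] * 12 + ["ron"] * 10 + ["owl"] * 3
    kept = filter_by_occurrences(words)
    print(kept)
    print(search(kept, "Ron"))
```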
If I go back to my data connection and left join my own custom-made Stop
Words data set to it, I can easily filter out any of the commonly used words I
do not want to see in my word cloud. Here, we can see that "the," which occurs
4,092 times throughout the book, is filtered out, along with other commonly
used stop words like [and, in, a, his, of, said, he, to, etc.], while the
newly reigning word featherweight champion of the book now clearly becomes
"harry." This word cloud has the same ability to filter, search, and easily
see the word occurrence groups.
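The join-then-filter idea can be sketched in plain Python as well. A left join
keeps every row from the book table and simply records whether a match was
found in the stop word list; the filter then drops the matched rows. The
function names, the is_stop flag, and the tiny stop word list are assumptions
for illustration, not the data set used in this post.

```python
def left_join_stop_words(rows, stop_words):
    """Left join: keep every (word, sequence) row, flagging stop word matches."""
    stop = {w.lower() for w in stop_words}
    return [(word, seq, word in stop) for word, seq in rows]

def drop_stop_words(joined):
    """Keep only the rows that found no match in the stop word list."""
    return [(word, seq) for word, seq, is_stop in joined if not is_stop]

if __name__ == "__main__":
    # Toy rows in the two-column shape: (word, sequence).
    rows = [("the", 1), ("boy", 2), ("and", 3), ("the", 4), ("owl", 5)]
    joined = left_join_stop_words(rows, ["the", "and"])
    print(drop_stop_words(joined))
```

Keeping the flag on every row, rather than deleting stop words up front, means
the stop words can be switched back on later without rebuilding the data.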
OK, great, I built an interactive word cloud. What about all those other
questions I have? Building a word cloud dashboard lets me take advantage of
both the unique words and the word sequencing that I built into my text data,
through additional visualizations and dashboard actions. This is helpful
because a word cloud is not typically the best data visualization to use if I
want real insights from my data, so the dashboard below adds three views to
the word cloud:
- A word bubble (Word Bubble! at bottom left) easily helps me get a sense of
the quantity of each unique word in a given word occurrence group.
- A vertical bar chart (Words Used per Chapter at bottom middle) helps break
down each chapter by the unique words contained within it.
- A horizontal bar chart (Most Used Words at bottom right) quickly shows words
having the most to least occurrences.
And when these three additional views accompany the word cloud, I can now
apply dashboard actions to click on different data points and answer questions
like:
- How often, per chapter, does the word Harry occur compared to all other
words in the chapter? In fact, Harry appears throughout the book more than the
total count of unique, relevant words in most of the chapters.
- What chapters feature Ron, Hermione, or even the adorable little Dobby? If I
click on Dobby, I'll see he is featured in chapter 2, chapter 10, and of
course chapter 18 when (spoiler alert) "Master has given a sock... Master gave
it to Dobby... Got a sock. Master threw it, and Dobby caught it, and Dobby -
Dobby is freeeeeeeee."
- What chapters feature certain creatures or events? Click any data point and
take a look at the "Words Used per Chapter" chart to find out which chapters
heavily feature certain words. I can even type a word in the "search for a
particular word" filter to more clearly see which chapters feature words like
"spider" (chapter 15) or which chapters are filled with "quidditch" action
(chapters 7 and 10).
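The per-chapter questions above boil down to a grouped count. Here is a rough
Python sketch of that lookup, assuming each row also carries a chapter number
alongside the word. That chapter column, the function name word_by_chapter,
and the toy rows are all assumptions for illustration; they are not the data
or the Tableau calculations behind the dashboard.

```python
from collections import Counter

def word_by_chapter(rows, word):
    """Map chapter -> (occurrences of word, total rows in that chapter)."""
    target = word.lower()
    # Total number of rows (words) in each chapter.
    totals = Counter(chapter for chapter, _ in rows)
    # Occurrences of the searched word in each chapter.
    hits = Counter(chapter for chapter, w in rows if w == target)
    # Only chapters where the word appears are returned, in chapter order.
    return {chapter: (hits[chapter], totals[chapter]) for chapter in sorted(hits)}

if __name__ == "__main__":
    # Toy rows in the assumed shape: (chapter, lowercase word).
    rows = [
        (1, "harry"), (1, "owl"),
        (2, "dobby"), (2, "harry"), (2, "dobby"),
        (3, "harry"),
    ]
    print(word_by_chapter(rows, "Dobby"))
    print(word_by_chapter(rows, "harry"))
```

Returning the chapter total next to the hit count makes it easy to compare one
word against everything else in the same chapter.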
Very magical stuff, indeed! If you have a great example of a word cloud
dashboard that you'd like to share or if you have any questions or feedback,
feel free to comment below.