Harry Potter and the Tableau School of Word Clouds and Data Vizardry

What is a word cloud?

Oh, yes, that's when I take a big bite of alphabet soup and then spit it out because it's way too hot, right? No... no, not really. No. Not at all, actually. A word cloud is a visual representation of text data in which unique words are arranged in a shape or clustered together and sized by their number of occurrences or importance. A viewer should be able to quickly gauge how many unique words exist in the text, pull out the important ones, and see what categories may exist among them.

Word clouds are a form of data visualization. They have been used since the 1990s as a means of site navigation built on keyword metadata (tag clouds), helping people find the pages they're searching for on a website. Today, word clouds are used with any form of text data, and many websites provide word cloud services. What I found after looking through most of those websites is that the output is typically delivered in static form. In my words, "I have all this cool text data, but all I get from it is a lousy static picture?!?" Maybe that works for most people, but I want more from my text data.

Let's say I got my hands on some pretty cool text, or rather, "the most freaking amazing text," by my wife's standards. That's right, I have "Harry Potter and the Chamber of Secrets," the 2nd novel in the 7-novel Harry Potter series by the wonderful J.K. Rowling, in a text file. And let's also say I wanted to turn that text into a word cloud to help me see which unique words are used and which of them are used the most throughout the story. I could easily go to one of the top-searched word cloud sites and create one for free, if only for this purpose.

Heck, I could even go to another top-searched word cloud site and create one that removes the most commonly used words and punctuation marks in the English language, e.g., and, is, so, him, should, ?, !, (, ), etc. These are called "stop words." Some word cloud websites let me choose which words or characters to suppress so that the word "the" isn't taking up half the word cloud like in the image above. Now, in the image below, it can be seen that the most used word throughout the book is clearly "Harry," with "Ron" and "Hermione" trailing behind.

But these are only static images and my options for what I can do with the word cloud are limited. What if I had more questions that I needed my word cloud to address? What if I needed my word cloud to be interactive? What if I wanted to host my word cloud online so that others could interact with it?

To solve these problems, I turned to Tableau.

But first I had to get a dataset, which was essentially an entire book in the form of a text file, shaped and ready to go in Tableau. For that, I busted out my wizarding wand and used some Notepad++ magic along with a keen eye as I parsed out the data. What I wanted my data to ultimately look like was a 2-column table. One column would contain every word used in the book, one word per row, and the other column would contain the sequence number of that word, meaning its position in reading order. I also made all words lowercase so they'd be easier to read in the word cloud.
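For anyone who would rather script this step than do it by hand, here is a minimal Python sketch of the same shaping idea. It is not what I used (that was Notepad++), and the function name `to_word_table` and the sample sentence are made up for illustration. It also simplifies things by dropping punctuation, whereas my file kept punctuation characters as rows.

```python
import re

def to_word_table(text):
    """Return (sequence, word) rows, one per word, in reading order.

    Hypothetical helper: lowercases the text and keeps only runs of
    letters (with inner apostrophes), so punctuation is dropped.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    # Sequence numbers start at 1 and follow the order of the text.
    return [(seq, word) for seq, word in enumerate(words, start=1)]

# Made-up sample text, not a quote from the book.
sample = "Dobby is free. Harry's sock freed Dobby!"
rows = to_word_table(sample)

for seq, word in rows:
    print(f"{seq}\t{word}")
```

Writing those rows out as a tab-delimited text file would give a 2-column table of the kind described above.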

Once I had the data prepped and saved as a text file, I plugged the data into Tableau and built my first word cloud. As can be seen below, this word cloud displays all unique words or characters contained in the file, except for any unique word or character occurring fewer than 10 times. I figured any word occurring fewer than 10 times may either be a corrupt word or one that just isn't very interesting or relevant to the story. Leaving those out also helps dashboard performance. Each word is sized and color-categorized according to its number of occurrences throughout the book. I've provided a way to filter the unique words displayed in the word cloud by a minimum and maximum number of occurrences. I've also enabled a word search function where one or more words can be shown while all non-matching words are filtered out of the visualization.
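The filtering logic behind that word cloud can be sketched outside Tableau too. The Python below is only an illustration under my own assumptions: `filter_words` and its parameters are hypothetical names, the word list is made up, and Tableau does all of this through its own filters rather than code.

```python
from collections import Counter

def filter_words(words, min_count=10, max_count=None, search=None):
    """Count each word and keep those inside the occurrence range.

    search is an optional set of words; when given, every
    non-matching word is dropped, like the word search filter.
    """
    kept = {}
    for word, n in Counter(words).items():
        if n < min_count:
            continue  # too rare: likely noise or not very interesting
        if max_count is not None and n > max_count:
            continue  # above the chosen maximum
        if search and word not in search:
            continue  # not one of the searched words
        kept[word] = n
    return kept

# Made-up counts, chosen only to exercise the three filters.
words = ["harry"] * 12 + ["ron"] * 10 + ["sock"] * 3

default_view = filter_words(words)
capped_view = filter_words(words, max_count=10)
search_view = filter_words(words, search={"harry"})
print(default_view, capped_view, search_view)
```

The counts that survive the filter are what would drive the size and color of each word.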


If I go back to my data connection and left join my own custom-made Stop Words data set to it, I can easily filter out any of the commonly used words I do not want to see in my word cloud. Here, we can see that words like "the," which occurs 4,092 times throughout the book, are filtered out, along with other commonly used stop words like [and, in, a, his, of, said, he, to, etc.], while the newly reigning word featherweight champion of the book now clearly becomes "harry." This word cloud has the same ability to filter, search, and easily see the word occurrence groups.
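The left join trick works because every word row survives the join, and only the rows that found a partner in the Stop Words table get a non-empty match. Keeping the rows with no match removes the stop words. Here is a small Python sketch of that idea, with a made-up stop word list and made-up rows standing in for my real data sets:

```python
# Hypothetical stand-ins for the two tables.
stop_words = {"the", "and", "in", "a", "his", "of", "said", "he", "to"}
rows = [(1, "the"), (2, "boy"), (3, "said"), (4, "harry"),
        (5, "and"), (6, "harry"), (7, "smiled")]

# Left join: keep every word row and attach the matching stop word,
# or None when the word has no partner in the stop word table.
joined = [(seq, word, word if word in stop_words else None)
          for seq, word in rows]

# Filter: keep only the rows where the join found nothing.
kept = [(seq, word) for seq, word, match in joined if match is None]

print(kept)
```

The original sequence numbers are preserved, so the remaining words still know where they sit in the book.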


Ok, great, I built an interactive word cloud. What about all those other questions I have? Building a word cloud dashboard lets me take advantage of both the unique words and the word sequencing I built into my text data, through additional visualizations and dashboard actions. That helps, because a word cloud is not typically the best data visualization to use if I want real insights from my data, so the dashboard below adds 3 views alongside the word cloud:

  1. A word bubble (Word Bubble! at bottom left) easily helps me get a sense of the quantity of each unique word in a given word occurrence group.
  2. A vertical bar chart (Words Used per Chapter at bottom middle) helps break down each chapter by the unique words contained within it.
  3. A horizontal bar chart (Most Used Words at bottom right) quickly shows words having the most to least occurrences.
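The two bar charts boil down to simple aggregations, sketched below in Python. This assumes each word row also carries a chapter number, which is a column I did not describe above, and the rows here are made-up toy data rather than real counts from the book. The helper `chapters_featuring` is a hypothetical name.

```python
from collections import Counter, defaultdict

# Toy (chapter, word) rows, invented for illustration only.
rows = [(2, "dobby"), (2, "harry"), (2, "dobby"),
        (7, "quidditch"), (7, "harry"),
        (10, "quidditch"), (10, "dobby"), (10, "harry"),
        (15, "spider"), (15, "harry")]

# "Words Used per Chapter": one word counter per chapter.
per_chapter = defaultdict(Counter)
for chapter, word in rows:
    per_chapter[chapter][word] += 1

# "Most Used Words": overall counts, sorted from most to least.
most_used = Counter(word for _, word in rows).most_common()

def chapters_featuring(word):
    """Chapters where the word occurs at least once, in order."""
    return sorted(ch for ch, counts in per_chapter.items() if counts[word] > 0)

print(most_used)
print(chapters_featuring("quidditch"))
```

Clicking a word in the dashboard plays roughly the role of calling `chapters_featuring` with that word.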



And when these 3 additional views accompany the word cloud, I can now apply dashboard actions to click on different data points and answer questions like:

How often, per chapter, does the word Harry occur compared to all other words in the chapter? In fact, across the whole book Harry occurs more times than there are unique, relevant words in most of the chapters.



What chapters feature Ron, Hermione, or even the adorable little Dobby? If I click on Dobby, I'll see he is featured in chapter 2, chapter 10, and of course chapter 18 when (spoiler alert) "Master has given a sock... Master gave it to Dobby... Got a sock. Master threw it, and Dobby caught it, and Dobby - Dobby is freeeeeeeee."



What chapters feature certain creatures or events? Click any data point and take a look at the "Words Used per Chapter" chart to find out which chapters heavily feature certain words. I can even type a word in the "search for a particular word" filter to more clearly see which chapters feature words like "spider" (chapter 15) ...



...or which chapters are filled with "quidditch" action (chapters 7 and 10).



Very magical stuff, indeed! If you have a great example of a word cloud dashboard that you'd like to share or if you have any questions or feedback, feel free to comment below.