Interpreting Texts: Digital Humanists Cannot Survive on Data Alone

Over the past two or three years I have used Voyant Tools for a variety of purposes. Voyant is one of the many “good enough” out-of-the-box solutions that cover 95% of people’s needs 95% of the time. This is both a blessing and a curse. Because the technological barrier to entry is low, many people are able to start using Voyant quickly. However, climbing the steeper learning curve of generating the same visualizations yourself, by coding the functions in R, Python, or Java, is a worthwhile experience in and of itself. Unlike a tool user, a programmer is ultimately not constrained by the limitations of any one solution like Voyant Tools: because they understand how the data is generated and manipulated, they can simply extend their work if they want to present or analyze it in a different way. This is difficult or impossible to do using Voyant alone.

That said, for quick-and-dirty use Voyant is excellent and user-friendly. It is often the first place I go for quick textual analysis before writing R or Java scripts to extend that functionality and further process my texts. For example, I recently gave a class presentation on Chimamanda Ngozi Adichie’s Purple Hibiscus for an honors course exploring the intersection of philosophy and literature. I initially wanted to look at how characters’ influence within the novel changed over time. This is difficult to do with one homogeneous text, so I split the novel into 17 sections based on chapter divisions and uploaded the sections to Voyant. As I suspected, there was a sharp spike in occurrences of Aunty Ifeoma about a third of the way through the novel, while occurrences of Papa decreased throughout the book, mirroring Kambili’s transformation from a child into a young adult and the creation of her own identity. However, when looking at the word frequency charts, I noticed something I had never even thought of: the prevalence of food terms within the text. While no one food was extremely frequent, various terms appeared throughout the novel at regular intervals. The more I thought about food in Purple Hibiscus, the more I realized it could be read not only as a cultural feature, but also as a socioeconomic indicator, a plot device, an important religious symbol, and an agent of control and change.
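The per-section counting behind the character trends above can also be reproduced outside Voyant. The following is only a minimal sketch in Java, not Voyant's implementation: the class name `SectionTrend`, the method `countPerSection`, and the sample strings are all hypothetical, and it assumes the sections have already been loaded as plain strings in reading order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: count how often a name or phrase occurs in each section of a text.
final class SectionTrend {

    // Returns one count per section, matching case-insensitively on word boundaries.
    static int[] countPerSection(List<String> sections, String phrase) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(phrase) + "\\b", Pattern.CASE_INSENSITIVE);
        int[] counts = new int[sections.size()];
        for (int i = 0; i < sections.size(); i++) {
            Matcher matcher = pattern.matcher(sections.get(i));
            while (matcher.find()) {
                counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder strings standing in for the real sections (assumption, not novel text).
        List<String> sections = List.of(
                "Papa spoke. Papa left.",
                "Aunty Ifeoma laughed.",
                "Aunty Ifeoma and Papa talked.");
        System.out.println(Arrays.toString(countPerSection(sections, "Aunty Ifeoma")));
    }
}
```

Raw counts like these are sensitive to section length, so dividing each count by the number of words in its section would be a reasonable next step before comparing sections.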

Despite Voyant’s role in leading me to this insight, the tool itself was insufficient for presenting my analysis. Voyant supports custom stoplists, so users can strip out common terms or proper names they consider unimportant to their analysis, but it does not support whitelists. I wanted to generate a frequency table of only certain terms, specifically food terms. I skimmed the novel quickly and created a list of approximately 100 different food terms, then wrote a short Java program to read the text, count each term’s frequency in each chapter, and print the term that many times, essentially generating a text composed only of food terms at their frequency of appearance in the novel. I returned to Voyant to visualize this text through a word cloud and a frequency chart that showed how certain foods spiked in different chapters, usually because of festivals or other temporal events. Without my prior knowledge of Purple Hibiscus, I might have been intrigued by these patterns, but I would have had no basis for explaining why they were relevant or significant. A researcher’s interpretation of the data is just as important as the data itself.
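The whitelist step described above can be sketched roughly as follows. This is an illustrative reconstruction rather than the original program: the class `WhitelistFilter`, its method names, the three-term food list, and the sample sentence are all assumptions, and it handles only single-word, lowercase terms (a multi-word term such as a dish name would need phrase matching instead).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: keep only whitelisted terms, repeated at their observed frequency.
final class WhitelistFilter {

    // Count how often each whitelisted single-word term occurs in one chapter.
    static Map<String, Integer> countTerms(String chapterText, Set<String> whitelist) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher words = Pattern.compile("[\\p{L}']+")
                .matcher(chapterText.toLowerCase(Locale.ROOT));
        while (words.find()) {
            String word = words.group();
            if (whitelist.contains(word)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Emit each term once per occurrence, producing a text made only of whitelisted terms.
    static String expand(Map<String, Integer> counts) {
        List<String> out = new ArrayList<>();
        counts.forEach((term, n) -> {
            for (int i = 0; i < n; i++) {
                out.add(term);
            }
        });
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        // Illustrative whitelist and text; the real list had roughly 100 food terms.
        Set<String> foods = Set.of("rice", "yam", "tea");
        String chapter = "We ate rice and yam. Then tea, then more rice.";
        System.out.println(expand(countTerms(chapter, foods))); // prints "rice rice yam tea"
    }
}
```

Running this once per chapter and saving each result as its own file gives a set of food-only texts that can be uploaded to Voyant in the same way as the original chapter sections.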

I have attempted to use the command-line version of MALLET once or twice for different projects, but I had not realized there was also a GUI-based Topic Modeling Tool that runs on the MALLET code. I am less familiar with LDA topic modeling than with word frequency analysis, but I hope to delve deeper into this area of distant or computer-assisted reading. I plan not only to learn how to use NLP tools, but also to understand how the tools actually work to create the groups of topics and find statistically significant relationships between words. While I understand the appeal of simply using tools like Voyant or MALLET, I also recognize the inherent pitfalls of trusting those tools too much. It is ridiculously easy to misinterpret or skew results through artificial manipulation, such as inappropriate stoplists or requesting too many or too few topics. That is why the “humanities” side of digital humanities is still relevant and absolutely necessary: creating charts and visualizations is wonderful, but the veracity and usefulness of that data are only as trustworthy as the researcher performing the interpretation. Without interpretation, data is meaningless. Only through sustained, insightful interpretation, grounded in prior knowledge of the questions at hand, can data be transformed into information.

