Cleaning Data and Fighting File Formats

I found Google Refine / OpenRefine to be an incredibly useful tool. Because I’ve previously run Java VMs and have some technical experience with database tools, I didn’t have any issues getting the program installed and running. At first I wasn’t impressed with OpenRefine; it seemed like Excel on steroids with some built-in macros for cleaning data. However, once I progressed further in the Programming Historian tutorial, I recognized its potential significance. Removing blank rows and duplicate entries easily is a valuable feature, but nothing that Excel or other programs cannot do. The truly powerful tools are splitting multi-value cells into new rows while keeping them associated with a parent record, faceting and clustering records to reconcile inconsistent category values, and using regular expressions to transform and manipulate the data. The robust change history and undo function are also quite useful. Correcting the kinds of errors shown in the tutorial by hand is possible for small data sets, but once you are dealing with hundreds of records, manual quality assurance becomes nearly impossible.
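
To make two of those operations concrete, here is a minimal Python sketch of what splitting multi-value cells and clustering do conceptually. It is not OpenRefine’s actual code: the function names, the `id` and `category` fields, and the `|` separator are all assumptions for illustration, and the `fingerprint` function only approximates the idea behind key-collision clustering (normalize each value to a key, then group values that share a key).

```python
import re
from collections import defaultdict


def split_multivalue(records, field, sep="|"):
    """Return one row per value in `field`, each keeping its parent record's id."""
    rows = []
    for rec in records:
        for value in rec[field].split(sep):
            value = value.strip()
            if value:  # skip empty pieces such as a trailing separator
                rows.append({"id": rec["id"], field: value})
    return rows


def fingerprint(value):
    """Normalize a value: lowercase, drop punctuation, sort and dedupe tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.strip().lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Group distinct spellings that share a fingerprint; keep only real clusters."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return {key: sorted(vs) for key, vs in groups.items() if len(vs) > 1}


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the tutorial.
    records = [
        {"id": 1, "category": "Weaving | Textiles"},
        {"id": 2, "category": "textiles|  weaving "},
    ]
    rows = split_multivalue(records, "category")
    print(rows)
    print(cluster(row["category"] for row in rows))
```

The point of the sketch is the design rather than the details: every split row still carries its parent `id`, and clustering only suggests candidate merges that a person should review before accepting.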

When dealing with historical data, historians and humanists rarely have the luxury of pulling it from a predesigned database with strictly enforced business logic that ensures valid data for record queries or calculations. The data is often raw and piecemeal; many records may have missing fields or be inconsistently formatted. When working on the Laboring Class Poets Online project, I helped decide which fields would be part of the poet record data type. I also helped create a controlled vocabulary to describe some of those fields, such as “Occupation” and “Industry.” Before creating any database, you must spend a significant amount of time planning how to handle problems such as different use cases or importing “outlier” type data; changing and retrofitting a finished database is far more difficult than designing it correctly in the first place. Without a controlled vocabulary in the LC project, for example, searches for “weaver” poets wouldn’t be useful if some poets had the occupation of “weaver” and others were labeled as “textiles,” unless the database designer considered those two completely separate occupations. Our data also had to be (and still needs to be!) cleaned to ensure that fields without an enforced controlled vocabulary are usable.
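
As a rough illustration of how a controlled vocabulary can be enforced at import time, here is a minimal Python sketch. The vocabulary entries, the field names, and the function names are hypothetical; they are not the actual Laboring Class Poets Online vocabulary or schema. Known variants map to one canonical term, and anything unrecognized is flagged for a human to review rather than guessed at.

```python
# Hypothetical controlled vocabulary: variant spelling -> canonical term.
CONTROLLED_OCCUPATIONS = {
    "weaver": "weaver",
    "handloom weaver": "weaver",
    "shoemaker": "shoemaker",
    "cobbler": "shoemaker",
}


def normalize_occupation(raw):
    """Return the canonical term for `raw`, or None if it is not in the vocabulary."""
    return CONTROLLED_OCCUPATIONS.get(raw.strip().lower())


def audit(records):
    """Split records into normalized ones and ones needing manual review."""
    clean, flagged = [], []
    for rec in records:
        term = normalize_occupation(rec.get("occupation", ""))
        if term is None:
            flagged.append(rec)  # leave the original value untouched
        else:
            clean.append({**rec, "occupation": term})
    return clean, flagged


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    poets = [
        {"name": "Poet A", "occupation": "Weaver"},
        {"name": "Poet B", "occupation": "textiles"},
        {"name": "Poet C", "occupation": " cobbler "},
    ]
    clean, flagged = audit(poets)
    print("clean:", clean)
    print("needs review:", flagged)
```

In this sketch “textiles” is deliberately not mapped to “weaver”: whether an industry label implies an occupation is a scholarly judgment, so the record is flagged instead of silently merged.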

Changing topics completely, in our assigned article “Scarcity or Abundance? Preserving the Past in a Digital Era”, Roy Rosenzweig discusses the difficulty of digital archiving, given how ephemeral digital sources really are. While they can easily be copied perfectly, digital files can also disappear without a trace because of their lack of physicality. I have heard of this problem before. In Blown to Bits (an excellent book on how technology is changing our culture; it’s the core text for part of the computer science senior capstone), Abelson et al. discuss how proprietary file formats could potentially cause the loss of data. They also discuss how the BBC created a modern version of the Domesday Book in 1986 to commemorate the volume’s 900th anniversary. But “by 2001, the modern Domesday Book was unreadable. The computers and disk readers it required were obsolete and no longer manufactured. In 15 years, the memory even of how the information was formatted on the disk had been forgotten” (107). Even Google’s VP Vint Cerf recently warned of a “digital dark age” because of file formats lacking backwards compatibility. Though there have recently been leaps forward, this continues to be an issue; we should look to self-describing, simpler, non-proprietary file formats to ensure future compatibility. HTML webpages should be readable for decades, and well-formed, valid XML could potentially be usable for even longer because it is simple to read and self-describing. In contrast, locked-down formats or DRM-ridden media files may not be usable in just a few years. Choosing how to save your files is just as important as where to save them when considering long-term usage.