Georeferencing Images: An Exercise in (Relative) Frustration

Geographic information systems (GIS) are used across dozens of industries, including shipping, public utilities, property development, the military, telecommunications, and agriculture. GIS technologies are also used by humanists and academic researchers to explore spatial relationships and to visualize and find patterns that cannot be seen in the raw data alone. One of the most useful GIS technologies, especially for historians, is georeferencing. Georeferencing older maps allows scholars to “modernize” historical documents and make them more usable for contemporary readers, as well as increase their accessibility by hosting them online.

Georeferencing maps requires a GIS program such as Google Earth, Google Maps Engine Lite, ArcGIS, or QGIS, or a service such as MapTiler that is designed specifically for georeferencing. Through the georeferencing process, spatial analysts “layer” different maps on top of each other, ultimately associating locations on those maps with real-world coordinates. The GIS user designates control points on both the historical map (usually a static image file of some kind) and a modern basemap of the same area. This basemap can be a static image as well, but is more often a dynamic (and usually proprietary) map delivered through Google Earth, Google Maps, OpenStreetMap, US Topographic Maps, National Geographic Maps, or other services. These maps are “live” online and nearly infinitely scalable, drawing from databases housing petabytes of information and imagery. The control points tie street intersections, buildings, or geographic features on the historical map to their real-world counterparts using a specific coordinate reference system (CRS). A coordinate reference system defines how its associated map is projected; each CRS uses a different projection that displays the Earth’s surface differently (e.g., Mercator, Gall-Peters, Winkel-Tripel). Designating a CRS allows the software to link each pair of points and fit the historical map to the basemap. While accurate historical maps may fit nearly perfectly and require only a few control points, others require many points and may be significantly warped as the map’s scale is distorted to minimize its inaccuracies and fit it to the basemap.
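
To make that workflow concrete, here is a minimal sketch of the same process using GDAL’s Python bindings. The file names, control point coordinates, and EPSG codes below are placeholder assumptions for illustration, not values from my Paisley map.

```python
# Minimal georeferencing sketch with GDAL's Python bindings.
# File names, control-point coordinates, and EPSG codes are placeholders.
from osgeo import gdal

# Each ground control point (GCP) ties a pixel/line position on the scanned
# map to a real-world coordinate (here expressed in WGS84 longitude/latitude).
gcps = [
    gdal.GCP(-4.4234, 55.8456, 0, 1200, 850),   # lon, lat, elevation, pixel, line
    gdal.GCP(-4.4101, 55.8473, 0, 3050, 790),
    gdal.GCP(-4.4198, 55.8391, 0, 1350, 2400),
    gdal.GCP(-4.4079, 55.8402, 0, 2980, 2310),
]

# Step 1: attach the control points and their CRS to the scanned image.
gdal.Translate("map_with_gcps.tif", "historical_map.jpg",
               outputSRS="EPSG:4326", GCPs=gcps)

# Step 2: warp ("rubber-sheet") the image into the basemap's projection,
# here Web Mercator (EPSG:3857), which most online basemaps use.
gdal.Warp("map_georeferenced.tif", "map_with_gcps.tif",
          dstSRS="EPSG:3857", resampleAlg="bilinear")
```

With more control points, gdal.Warp can also fit a thin plate spline (tps=True), which produces the kind of local warping described above.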

I decided to georeference a map of Paisley, the hometown of Robert Tannahill, my primary research subject. I know this area of Scotland relatively well because of its proximity to Glasgow, where I studied abroad during fall 2013 and to which I returned on a research grant during summer 2014. I was able to visit Paisley on several occasions. Tannahill died in 1810, so I wanted to find a map created as close to his death as possible. John Wood’s 1828 map fit the bill perfectly. Because the resolution of the map available for public download on the National Library of Scotland website was too low, and because I may eventually use it on a public website, I purchased access to the image from the NLS. This gave me a massive 20MB .JPG file – perfect for use with a GIS program.

I tried several software solutions, all of which have their own faults and weaknesses. I first tried perhaps the simplest, Google Maps Engine Lite. While it allows users to create points and import CSV data, it has no utility for georeferencing an image (raster) file. I then used MapTiler; its interface was the simplest and most usable for georeferencing of any product I tested. However, there was no way to remove watermarks or incorporate multiple map layers without purchasing the product, and I was somewhat unimpressed with MapTiler’s warping abilities. In exasperation, I turned to a more powerful solution: QGIS.

QGIS is the open-source alternative to the main industry-standard GIS program, Esri’s ArcGIS. QGIS has most of ArcGIS’ functionality and even performs faster than ArcGIS on many benchmarks. While I haven’t had enough of an opportunity to test the entire program, I did try to use QGIS to georeference my map of Paisley. I was able to select a basemap relatively easily. I started with Google Maps, but QGIS would not let me zoom in past 1:25,000. I switched to Bing Street Maps, which provided more detailed imagery, though the load times were still noticeable. I was able to create control points, though the interface was somewhat clunky (clicking back and forth between two fullscreen windows). However, when I attempted to actually perform the georeferencing transformation and overlay the images, QGIS failed entirely, despite my attempts to change numerous settings and configurations. I would like to install QGIS on another machine and attempt the process again, but have not yet had the luxury of doing so.

In desperation I turned to a free trial of Esri’s ArcGIS. The installation process for ArcGIS was abysmal; it required multiple massive install files and authentications because of how tightly Esri controls the software. The programs are unhelpfully titled; I had to install ArcGIS Desktop Pro, ArcMap, ArcView, and many other arcs. Once I did manage to activate the trial and begin working in ArcMap, I found it easy to add basemap data (OpenStreetMap) and my raster image. The georeferencing process was still slightly awkward; you essentially place the raster image over the basemap, which requires you to zoom to that location, and then create reference points in a single window rather than a split screen. If you are working with a shapefile basemap, this would probably work very well, but both my basemap and my raster image are opaque; when the raster image is overlaid on the basemap it obscures the basemap entirely. I was able to change the opacity of the raster image so I could see both at once, but this was less user-friendly than either MapTiler or QGIS in my opinion. However, the georeferencing accuracy in ArcGIS was far higher than in either of those programs; I could see the raster image being transformed and warped with each control point I added. ArcMap even provided a table of the control points with residuals – essentially estimates of the error of each point. By deleting control points with high residuals and adding more points, I was quickly able to georeference my image with high accuracy.
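
ArcMap computes these residuals for you, but the underlying idea is plain least squares. As a rough illustration (my own sketch with made-up coordinates, not Esri’s implementation), an affine transformation can be fit to the control points and each point’s residual measured against it:

```python
# Rough illustration of control-point residuals: fit an affine transform
# (pixel -> map coordinates) by least squares, then measure each point's error.
# The coordinates are invented for demonstration; ArcMap's math may differ.
import numpy as np

pixel = np.array([[1200, 850], [3050, 790], [1350, 2400],
                  [2980, 2310], [2100, 1600]], dtype=float)
world = np.array([[-4.4234, 55.8456], [-4.4101, 55.8473], [-4.4198, 55.8391],
                  [-4.4079, 55.8402], [-4.4150, 55.8430]], dtype=float)

# Affine model: [x_world, y_world] = [col, row, 1] @ coeffs
design = np.hstack([pixel, np.ones((len(pixel), 1))])
coeffs, *_ = np.linalg.lstsq(design, world, rcond=None)

predicted = design @ coeffs
residuals = np.linalg.norm(predicted - world, axis=1)  # per-point error
rmse = np.sqrt(np.mean(residuals ** 2))

for i, r in enumerate(residuals):
    print(f"control point {i}: residual = {r:.6f}")
print(f"total RMSE = {rmse:.6f}")
# Points with large residuals are the ones worth deleting or re-placing.
```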

This is where the fun ended. Initially, ArcMap seemed promising because I had previously used ArcGIS Online to create and embed location-frequency maps. However, Esri’s desktop suite and ArcGIS Online apparently share very few similarities. While some layer types can be added to both, raster files, even when georeferenced, are not one of them. I was unable to publish my map to ArcGIS Online, though I tried Esri’s recommended methods of sharing it as a Map Service and as a Map Package. These uploaded to the site but couldn’t be used by ArcGIS Online – their only use is to be downloaded by other ArcGIS Desktop users. Apparently, if you have a full-scale ArcGIS Server installation you can host maps like the one I created, but that is beyond what I was willing to do for this project. Frustrated, I exported my map as a static image (below) and quit ArcMap.

[Image: Paisley map georeferenced in ArcMap and exported as a static image]

I then returned to MapTiler and added many more control points than in my first iteration. While the transformation was still weak compared to ArcGIS’, it provided far more hosting options than Esri – and even included a slick transparency slider. I was able to remake the map, store the completed files on my Google Drive, and embed the map in my website in under an hour. While the finished result is imperfect, it is, at least, a finished and digitally hosted map! I hope to eventually learn more fully featured GIS editors such as QGIS, but I cannot fault MapTiler for what it is – easy to use, single-purpose, and fast.

*Update* MapTiler is somewhat difficult to extend. However, after a LOT of tinkering, I was able to use the Google Maps output (not the standard index.html page) and add a custom layer of markers to the map with some JavaScript. I also added an infoWindow that displays images when you click on the markers. Unfortunately this means I’ve lost the wonderful opacity slider, but I’ll be working to add that back in the future as well. Here’s the result – make sure to allow the “unsafe scripts” (my JavaScript) to see it.

Links: DH Posts I Find Useful

This will be a list that I will try to update regularly of DH projects, posts, and tools that I use or like.

Text Mining and Analytics

Projects

Tools

Interpreting Texts: Digital Humanists Cannot Survive on Data Alone

Over the past two or three years I have used Voyant Tools for a variety of purposes. Voyant is one of the many “good enough” out-of-the-box solutions that covers 95% of people’s needs 95% of the time. This is both a blessing and a curse. Because of the low technological entry barrier, many people can start using Voyant quickly. However, climbing the steeper learning curve of generating the same visualizations yourself by coding the functions in R, Python, or Java is a worthwhile experience in and of itself. Unlike a tool user, a programmer is ultimately unconstrained by the boundaries and limitations of any one solution like Voyant Tools; because they understand how the data is generated and manipulated, they can simply extend their work if they want to present or analyze it in a different way. This is difficult or impossible to do using Voyant alone.

That said, for quick and dirty use Voyant is excellent and user-friendly. It is often the first place I go for quick textual analysis before writing R or Java scripts later to extend that functionality and further process my texts. For example, I recently gave a class presentation on Chimamanda Adichie’s Purple Hibiscus for an honors course exploring the intersection of literature with philosophy. I initially wanted to look at how characters’ influence within the novel changed over time. This is difficult to do with one homogeneous text, so I split the novel into 17 sections based on chapter divisions and uploaded the sections to Voyant. As I suspected, there was a sharp spike in occurrences of Aunty Ifeoma about a third of the way through the novel, while occurrences of Papa decreased throughout the book, mirroring Kambili’s transformation from a child into a young adult and the creation of her own identity. However, when looking at the word frequency charts, I noticed something I had never even thought of: the prevalence of food terms within the text. While no single food was extremely frequent, various terms appeared throughout the novel at regular intervals. The more I thought about food in Purple Hibiscus, the more I realized it could be read not only as a cultural feature, but as a socioeconomic indicator, a plot device, an important religious symbol, and an agent of control and change.
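
I did the splitting and uploading by hand, but the preparation step is easy to script. Here is a minimal sketch of how a plain-text copy of a novel could be split into sections and character-name frequencies tallied per section; the file name, chapter delimiter, and name list are assumptions for illustration.

```python
# Sketch of the pre-Voyant preparation step: split a plain-text novel into
# sections and track how often selected character names appear in each one.
# The file name, section delimiter, and name list are placeholders.
import re
from collections import Counter

with open("novel.txt", encoding="utf-8") as f:
    text = f.read()

# Assume chapters are separated by a line of three or more asterisks.
sections = re.split(r"\n\*{3,}\n", text)

names = ["papa", "ifeoma", "kambili", "jaja"]
for i, section in enumerate(sections, start=1):
    words = re.findall(r"[a-z']+", section.lower())
    counts = Counter(words)
    row = ", ".join(f"{name}: {counts[name]}" for name in names)
    print(f"section {i:2d} -> {row}")
```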

Despite Voyant’s role in leading me to this insight, the tool itself was insufficient for presenting my analysis. While Voyant supports custom stoplists so users can strip out common terms or proper names they consider unimportant to their analysis, it does not support whitelists. I wanted to generate a frequency table of only certain terms, specifically food terms. I skimmed the novel quickly and created a list of approximately 100 different food terms, then wrote a Java program to read the text, count each term’s frequency in each chapter, and print the term that many times – essentially generating a text composed only of food terms at their frequency of appearance in the novel. I returned to Voyant to visualize this text through a word cloud and a frequency chart that showed how certain foods spiked in different chapters, usually because of festivals or other temporal events. Without my prior knowledge of Purple Hibiscus, I might have been intrigued by these patterns, but I would have had no basis to explain why they were relevant or significant. A researcher’s interpretation of the data is just as important as the data itself.
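
My original version was a small Java program; the sketch below shows the same whitelist idea in Python. The whitelist here is only a handful of placeholder terms (the real list had around 100), and the file names and chapter delimiter are assumptions as well.

```python
# Sketch of the whitelist approach (originally written in Java): keep only
# whitelisted food terms, count them per chapter, and write out a synthetic
# text in which each term appears as many times as it did in the novel.
# The whitelist, file names, and chapter delimiter are illustrative placeholders.
import re
from collections import Counter

FOOD_TERMS = {"rice", "soup", "yam", "cashew", "fufu", "garri", "orange", "chin-chin"}

with open("novel.txt", encoding="utf-8") as f:
    chapters = re.split(r"\n\*{3,}\n", f.read())

with open("food_terms_only.txt", "w", encoding="utf-8") as out:
    for i, chapter in enumerate(chapters, start=1):
        words = re.findall(r"[a-z'-]+", chapter.lower())
        counts = Counter(w for w in words if w in FOOD_TERMS)
        # Repeat each term at its observed frequency so a Voyant frequency
        # chart of the synthetic text mirrors the original distribution.
        synthetic = " ".join(term for term, n in counts.items() for _ in range(n))
        out.write(f"CHAPTER {i}\n{synthetic}\n\n")
```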

I have attempted to use the command-line version of MALLET once or twice for different projects, but never realized there was a GUI-based Topic Modeling Tool that runs on the MALLET code as well. I am less familiar with LDA topic modeling than with word frequency analysis, but I hope to delve deeper into this area of distant or computer-assisted reading. I plan not only to learn how to use NLP tools, but also to understand how the tools actually work to create groups of topics and find statistically significant relationships between words. While I understand the appeal of simply using tools like Voyant or MALLET, I also recognize the inherent pitfalls of trusting those tools too much. It is ridiculously easy to misinterpret or skew results through artificial manipulation, such as inappropriate stoplists or requesting too many or too few topics. That is why the “humanities” side of digital humanities is still relevant and absolutely necessary: creating charts and visualizations is wonderful, but the veracity and usefulness of that data depend entirely on the researcher performing the interpretation. Without interpretation, data is meaningless. Only through sustained, insightful interpretation – and prior knowledge of the questions at hand – can data be transformed into information.
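
MALLET and the Topic Modeling Tool are built around LDA. As a rough analogue of what happens under the hood (using scikit-learn rather than MALLET itself, with placeholder documents and arbitrary settings), a minimal sketch looks like this; the stoplist choice and the number of topics are exactly the knobs that can skew results if set carelessly.

```python
# Rough LDA analogue of what MALLET / the Topic Modeling Tool do, using
# scikit-learn rather than MALLET itself. The documents, the stoplist, and
# n_components=5 are arbitrary choices for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "the harvest was late this year and prices at the market rose",
    "a new steamship line now carries mail across the channel weekly",
    "the choir rehearsed hymns in the old church hall on friday evening",
    # ... in practice, hundreds of newspaper issues or book chapters
]

vectorizer = CountVectorizer(stop_words="english")   # the stoplist choice matters
doc_term = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=5, random_state=0)  # topic count matters too
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:8]]
    print(f"topic {t}: {' '.join(top)}")
```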

Cleaning Data and Fighting File Formats

I found Google Refine / OpenRefine to be an incredibly useful tool. Because I’ve previously run Java VMs and have some technical experience with database tools, I didn’t have any issues getting the program installed and running. At first I wasn’t impressed with OpenRefine – it seemed like Excel on steroids with some built-in macros for cleaning data. However, once I progressed further in the Programming Historian tutorial, I recognized its potential. Removing blank rows and duplicate entries easily is a valuable feature, but nothing that Excel or other programs cannot do. Splitting multi-value cells into new rows while keeping them associated with a parent record, faceting and clustering records to better group categories, and using regular expressions to transform and manipulate the data are all truly powerful features. The robust change history and undo function are also quite useful. While the kinds of errors shown in the tutorial can be corrected by hand in small data sets, once you are dealing with a database of hundreds of records it’s nearly impossible to perform that kind of quality assurance manually.
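
For comparison, here is roughly what a few of those operations look like when scripted in pandas. This is my own analogue, not how OpenRefine works internally, and the column names and records are invented.

```python
# Rough pandas analogue of a few OpenRefine operations (not OpenRefine's
# actual engine). The column names and records are invented examples.
import pandas as pd

df = pd.DataFrame({
    "poet":       ["Robert Tannahill", "Robert Tannahill", "Janet Hamilton", None],
    "occupation": ["weaver; songwriter", "weaver; songwriter", "Weaver ", None],
})

# Remove blank rows and exact duplicates (Excel can do this much).
df = df.dropna(how="all").drop_duplicates()

# Split a multi-value cell into separate rows, keeping each row tied to its poet.
df = df.assign(occupation=df["occupation"].str.split(";")).explode("occupation")

# String/regex-style transform: trim whitespace and normalize case, akin to a
# GREL expression or a clustering merge in OpenRefine.
df["occupation"] = df["occupation"].str.strip().str.lower()

print(df)
```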

When dealing with historical data, historians and humanists rarely have the luxury of pulling it from a predesigned database with strictly enforced business logic that ensures valid data for record queries or calculations. The data is often raw and piecemeal; many records may have missing fields or be inconsistently formatted. When working on the Laboring Class Poets Online project, I helped decide which fields would be part of the poet record data type. I also helped create a controlled vocabulary to describe some of those fields, such as “Occupation” and “Industry.” Before creating any database, you must spend a significant amount of time planning how to handle problems such as different use cases or importing “outlier” data; changing and retrofitting a finished database is far more difficult than designing it correctly in the first place. Without a controlled vocabulary in the LC project, for example, searches for “weaver” poets wouldn’t be useful if some poets had the occupation “weaver” and others were labeled “textiles,” because the database would treat those as two completely separate occupations. Our data also had to be cleaned (and still needs to be!) to ensure that fields without an enforced controlled vocabulary are usable.
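
As a toy illustration of the controlled vocabulary idea (the terms and aliases below are invented, not the project’s actual vocabulary), normalization and validation can be sketched like this:

```python
# Toy illustration of enforcing a controlled vocabulary for an "Occupation"
# field. The vocabulary and aliases are invented, not the actual terms used
# in the Laboring Class Poets Online project.
CONTROLLED_OCCUPATIONS = {"weaver", "miner", "shoemaker", "domestic servant"}

# Map free-text variants found in raw records onto controlled terms.
ALIASES = {
    "hand-loom weaver": "weaver",
    "textiles": "weaver",
    "collier": "miner",
}

def normalize_occupation(raw: str) -> str:
    """Return the controlled term for a raw occupation string, or flag it for review."""
    term = raw.strip().lower()
    term = ALIASES.get(term, term)
    if term not in CONTROLLED_OCCUPATIONS:
        raise ValueError(f"'{raw}' is not in the controlled vocabulary; needs review")
    return term

print(normalize_occupation("Textiles"))          # -> weaver
print(normalize_occupation("Hand-loom Weaver"))  # -> weaver
```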

Changing topics completely: in our assigned article “Scarcity or Abundance? Preserving the Past in a Digital Era,” Roy Rosenzweig discusses the challenges of digital archiving, given how ephemeral digital sources really are. While they can easily be copied perfectly, digital files can also disappear without a trace because of their lack of physicality. I have heard of this problem before. In Blown to Bits (an excellent book on how technology is changing our culture – it’s the core text for part of the computer science senior capstone), Abelson et al. discuss how proprietary file formats could potentially cause the loss of data. They also describe how the BBC created a modern version of the Domesday Book in 1986 to commemorate the volume’s 900th anniversary. But “by 2001, the modern Domesday Book was unreadable. The computers and disk readers it required were obsolete and no longer manufactured. In 15 years, the memory even of how the information was formatted on the disk had been forgotten” (107). Even Google’s VP Vint Cerf recently warned of a “digital dark age” caused by file formats lacking backwards compatibility. Though there have recently been leaps forward, this continues to be an issue; we should look to self-describing, simpler, non-proprietary file formats to ensure future compatibility. HTML webpages should be readable for decades, and properly formed, valid XML could potentially be usable for even longer because of how simple it is to read and because it is self-describing. In contrast, locked-down formats or DRM-ridden media files may not be usable in just a few years. Choosing how to save your files is just as important as where to save them when considering long-term usage.

Popular != Valuable

I am a strong believer in the digital humanities as a practical discipline. Theory is fine, but at some point praxis is necessary to exercise and apply those models. This allows digital humanists to actively create new knowledge, or restructure existing knowledge and present it in a new way. Thankfully, there is no shortage of application-based “building” projects in DH!

The best way to learn something is to practice doing it or teaching someone else to do it. This is why websites such as codecademy.com are useful and powerful: they allow students to dynamically and immediately apply what they are learning. However, studying excellent examples of completed work allows a broader view that focuses less on the minutiae and more on possible results.

All of the digital archives we examined in class had different aims and addressed problems in novel ways. Some of the archives tried to answer questions that I’m not sure were ever really asked. On its “About” page, the Digital Public Library of America claims to be “a portal . . . a platform . . . and an advocate for strong public opinion.” The DPLA is one of the largest, most well-funded sites I surveyed. It is well designed with a modern UI, and even boasts an API to help other sites integrate its vast databases into their own work. That is an excellent example of the collaborative spirit that drives DH scholarship. However, I can’t ever see myself turning to the DPLA for research purposes; because the site’s offerings are so diverse, it fails to shine in any one area. While the DPLA might be a great site for sharing information (it boasts a Historical Cats app that grabs a random exhibit or item and tweets it for you), its lack of focus makes it less useful to me than sites that are less user-friendly but more specialized.

[Image] Example of a topic modeling chart plotting fugitive slave and for hire ads against time and frequency on “Mining the Dispatch”

My favorite archive was Mining the Dispatch, a product of the University of Richmond’s Digital Scholarship Lab. While the website isn’t flashy, it is clean and focuses on answering specific questions through the application of topic modeling to the text of the Richmond Dispatch. By tracing the popularity of different topics from 1861 to 1865 in a clearly defined corpus of texts, users can see how trends matched real-world events. The site doesn’t present much new information; we could have looked at military enrollment logs to confirm that most men joining the Confederacy as soldiers did so in early 1861. However, it provides substantiating evidence and allows us to see the newspaper in a new way. This is just as insightful as creating new work, and is far more useful to its specific audience than the DPLA site is to a much larger group of people. While the DPLA site may be more popular, that doesn’t make it more valuable; as Melissa Terras points out, the most frequently accessed items of many internet archives are usually those that go viral via social media rather than those with the most cultural significance or those presented in the most innovative ways. This lesson can obviously be applied to entire archives as well as specific webpages.