Workflows for Paywalled Texts and Open Data Ideals

An overview

The world of text mining—particularly low-barrier-to-entry topic modeling with MALLET and work with AntConc or NLTK—opens up a whole variety of analytical options for scholars interested in pursuing distant reading. In turn, distant reading projects are often based on open-data collections, a phrase that conjures up visions of herds of information roaming free across the digital plains, waiting to be corralled by avid scholars. It’s easy to do full-text analysis on these open, unrestricted corpora: just download the full corpus, and off you go.

The reality of the data landscape for digital humanists is much more complicated. Here and there, comparatively small collections of hand-transcribed sources pop their heads up, but many of the largest, most accurate collections of digitized and transcribed texts are fenced-in sources like ProQuest, Early English Books Online, or (selfishly) Brepols’ database of medieval critical editions. The prevalence of copyrighted digitized texts, many with very restrictive copyright and usage guidelines that limit reproduction, can also limit applications of text mining, hGIS and network-theoretical approaches.

A demonstration

Let’s say you’ve just put a bunch of text from Napoleon III’s Life of Julius Caesar into MALLET and identified some interesting vocabulary, including references to “Rome”, “constitution” and “construction”. How do you get from that vocabulary to a citation?

The end result of this process gives you output that looks something like this:

Want to search for something custom? This assumes a wild-card search, so “rom” will match “Rome”, “Romans”, “romanization”, etc.
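As a rough sketch of what such a wildcard search might look like under the hood, here’s a minimal Python version that prefix-matches a stem against a one-word-per-row CSV export. The file name, column names and citation format are hypothetical stand-ins for whatever your own word table uses:

```python
import csv

def find_citations(words_csv, stem):
    """Prefix-match a stem (e.g. 'rom') against the cleaned tokens
    and return one citation string per matching word instance."""
    citations = []
    with open(words_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # 'rom' matches 'rome', 'romans', 'romanization', etc.
            if row["Word_Clean"].startswith(stem):
                citations.append(
                    f"{row['Word_Source']} "
                    f"{row['Word_Cite1']}.{row['Word_Cite2']}.{row['Word_Cite3']}: "
                    f"{row['Word_Orig']}"
                )
    return citations
```

The same logic maps directly onto an SQL `LIKE 'rom%'` query or an Excel filter for scholars using the simpler version.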

These links demonstrate the more complicated version using SQL and scripting, but the process is also documented for scholars working mostly with Excel and basic text-editing skills.

Why we need it

Why I needed it

As a historian, I gravitate toward big questions that require lots of sources and the integration of two discrete skillsets developed over 10 years in industry and 10 years in the academy. Simply put, I’m a better historian with the digital than without it.

My current question is about medieval conflict resolution undertaken in informal settings, and I’m approaching the question by looking at how textual authority is constructed and then used to bolster real-world authority. For an example of this, have a look at a recent conference paper, “Between Miracles and Memory: Min(d)ing the gap in construction of authority in early medieval episcopal saint’s lives and deeds of bishops” and the list of citations I generated from paywalled data using the process detailed here.

Even more narrowly, I want to understand how divine agency works as it moves from textual account to real-world conflict resolution. How does divine, saintly or otherworldly intervention help the subjects of these biographies, and their successors, as they remember, replicate, reinforce and restructure their own agency as they seek to resolve conflict in the real world? How do these patterns change over time? By role? By geographic context?

To answer these questions, I’m looking at medieval biography–saints’ lives, deeds of bishops, biographies of kings–to understand how informal conflict resolution worked outside the boundaries of formal legal or sanctioned military conflict.

That all adds up to a giant text-mining project. Because the boundaries of text mining are fairly well established, the parameters are simple: I need a corpus of medieval biographies sorted by time period, genre, geography, and author, and then prepped for topic modeling, corpus linguistics and a little semantic analysis based on part-of-speech tagging. On the face of it, that doesn’t seem all that difficult.

Until you consider that most transcribed critical editions of medieval sources are paywalled. And all of them are in Latin.

Why you might need it

With copyrighted, paywalled corpora,1 built-in full-text download and off-the-shelf analysis are often not an option. As such, in-text citations become an absolutely vital part of the digital analysis. However, readily available topic-modeling tools like MALLET strip the citation data scholars, digital and analog alike, need to participate in a scholarly debate.

This “how did they make that” project describes the basics of a workflow that bridges the gap between open-data ideals and paywalled sources. It helps scholars working with restricted text by providing a way to maintain intact word-by-word citation information in a reasonably simple format (though there are more complex versions of this process out there). This process preserves citations in a way that accommodates the copyright restrictions of providers of paywalled data while still providing the results in broadly reproducible form.

The data-management process starts with data scraping, clean-up and import approaches that provide individual scholars with private corpora that maintain intact word-by-word citation information. The resulting database can then be used for text mining in analytical tools while still maintaining a tie between distant-reading analysis and the original citations for the germane word (or words) of interest.2

Who is this for?

If you’re using any paywalled data for a digital history project, you’ll need to provide word-by-word citations.

It’s also helpful if you have bad OCR that needs some manual cleanup (for instance, sorting words alphabetically to spot errors) before you put the text back together.

Finally, it’s just good practice to keep your citations in place, so even scholars using open data might benefit from a similar process.

Workflow

At a high level, the workflow is basic, as is the data description.

  • Familiarize yourself with the data: Figure out what sources you need. Look at each source individually and see how those sources fit into the data structure and will work with your citation needs.
  • Scrape and chunkify the data: Get the text out of its paywalled jail via import or copy/paste. Confirm the readability/accuracy of the text and the consistency of its divisions for citation purposes.
  • Tokenize, import and clean individual words: Divide the text into individual tokens (on word boundaries). Maintain word order, grouping chunks of text on a source-by-source basis using the citation structure you set up for each source in the cleaning step.
  • Export and analyze: Export the text in chunks for analysis in a text mining tool (the GUI version of MALLET and AntConc work well in tandem for practitioners with expertise in Excel but not in R or Python). Track the words or phrases indicated by these distant-reading tools and use Excel, SQL or a front-end Web search tool to produce a list of citations for a particular inquiry.

Familiarize yourself with the data

At its most basic, a single table called “Word” contains the fields necessary to preserve citations. Each word will be its own row or record. This flat file format can live in Excel (slow but workable) or be imported into an SQL platform.

  • Word_ID: an auto-incrementing numeric ID to help differentiate one instance of a word from another
  • Word_Source: a source name
  • Word_Orig: the word as it appears in the original text
  • Word_Clean: a cleaned value (lowercase, normalized spelling) of the word so you can maintain textual consistency without sacrificing unusual orthography
  • Word_Punct: punctuation that follows the word so you can put the text back together
  • Word_Cite1, Word_Cite2, Word_Cite3: a set of fields that lets you sort the words back into their original order. This example uses three fields, usually containing numeric values, that let you track something like book/volume number, chapter, and word order within that chapter for each word.
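For readers who prefer code to prose, the table above can be sketched as a simple Python record. This is just an illustration of the schema, not part of the workflow itself; in practice each record is a row in Excel or an SQL table:

```python
from dataclasses import dataclass

@dataclass
class WordRecord:
    word_id: int   # Word_ID: auto-incrementing numeric ID
    source: str    # Word_Source: a source name
    orig: str      # Word_Orig: the word as it appears in the text
    clean: str     # Word_Clean: lowercased, normalized spelling
    punct: str     # Word_Punct: trailing punctuation, if any
    cite1: int     # e.g. book/volume number
    cite2: int     # e.g. chapter
    cite3: int     # word order within that chapter
```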

Consider the interaction between on- and off-line citation hunting: if your discipline requires page numbers in a novel but will accept book/chapter notation, it’s much easier to use book/chapter notation, which generally travels between editions of a text, rather than page number, which is edition specific.

Make sure there are headers or footers dividing the text of each page into your chosen divisions and assign these divisions to Cite1, Cite2 and Cite3. Consistency within each source is necessary, but it is possible to mix notational styles from source to source (page number vs book->chapter->line) within a single corpus because the sorting process described here always uses multiple fields to sort on. Just don’t mix notational styles within a source or getting all the tokenized words back into order will be difficult.
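To illustrate why sorting on multiple fields always recovers the original order, here is a small Python sketch that re-sorts shuffled word records by their three citation fields and rebuilds the running text; the dictionary keys are hypothetical shorthand for the Word table fields:

```python
def reconstruct(records):
    """Sort word records by their citation fields and rejoin the
    original word forms plus trailing punctuation into running text."""
    ordered = sorted(records, key=lambda r: (r["Cite1"], r["Cite2"], r["Cite3"]))
    return " ".join(r["Orig"] + r["Punct"] for r in ordered)
```

Because the sort key is the full (Cite1, Cite2, Cite3) tuple, one source can use page numbers while another uses book/chapter/line without the two interfering.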

Scrape and chunkify the data

Automated web scraping (a good series here) is less time-consuming and makes it easier to maintain the header tags that will pass citation information to the database during the cleaning process.

In practical terms, there are often limitations for paywalled data that mean wget or other automated scraping methods for data gathering give way to manual copy and paste. In these cases, it’s fairly easy to maintain discrete citation information.

Any OCR cleaning that works at a high level–full-text search and replace–works best here.

As with any data-management process, acquisition and cleaning are the most time-consuming parts of the project. Ultimately, however, there are two considerations for web scraping:

  • maintaining boundaries between individual texts in the corpus. Keeping separate Excel files for each text is advisable for the simpler version of the project.
  • maintaining boundaries within each of those texts for citation purposes.

If you’ve identified book-chapter-page number division in a source, make sure the headers or footers are consistent enough within a single source that you can search for those headers/footers. Using these consistent data-structure divisions, it’s fairly easy to see chunks in the text.
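If the headers really are consistent, a regular expression can do the chunking for you. The sketch below assumes hypothetical Latin-style chapter headers of the form “LIBER I, CAPUT 4”; the pattern would need to be adapted to whatever headers or footers your own source uses:

```python
import re

# Hypothetical header pattern marking book/chapter divisions.
HEADER = re.compile(r"^LIBER\s+([IVXLC]+),\s+CAPUT\s+(\d+)\s*$", re.MULTILINE)

def split_into_chunks(text):
    """Return a list of (book, chapter, chunk_text) tuples, one per
    division found between consecutive headers."""
    chunks = []
    matches = list(HEADER.finditer(text))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append((m.group(1), m.group(2), text[start:end].strip()))
    return chunks
```

The captured book and chapter values become Cite1 and Cite2 for every word in that chunk.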

Tokenize and import

Once an individual source is clean enough–that is, it’s a reasonably accurate text file with occasional clear markers for citation divisions–we import. The import process should do three things:

  • Tokenize the text on spaces and punctuation marks, keeping punctuation marks with the word that precedes them, to create a record for each word in the text.
  • Create a lower-case version of each word to go in that word’s database record.
  • For each new page or chapter, restart the Word_Order count at 1, incrementing by 1 for each subsequent token, which goes in the Word_Order field and provides the final sorting information for recreating the text.
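The three import steps above can be sketched in a few lines of Python. This version keeps punctuation with the word that precedes it and restarts the order count at 1 for each chunk; here the per-chunk order lives in Word_Cite3, following the Word table described earlier, and the field names are assumptions you would adapt to your own setup:

```python
import re

def tokenize_chunk(chunk_text, source, cite1, cite2):
    """Split a chunk on whitespace, peel trailing punctuation into its
    own field, and number the tokens starting from 1 per chunk."""
    records = []
    for order, token in enumerate(chunk_text.split(), start=1):
        # Separate the word from any trailing punctuation marks.
        m = re.match(r"^(.*?)([.,;:!?]*)$", token)
        word, punct = m.group(1), m.group(2)
        records.append({
            "Word_Source": source,
            "Word_Orig": word,
            "Word_Clean": word.lower(),   # lower-case version for consistency
            "Word_Punct": punct,          # so the text can be reassembled
            "Word_Cite1": cite1,
            "Word_Cite2": cite2,
            "Word_Cite3": order,          # restarts at 1 for each chunk
        })
    return records
```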

If you have a good text editor and some basic skill in regular expressions, it’s fairly easy to combine chunking, tokenizing, and importing.

Export and analyze3

Export the Word_Orig column or the Word_Clean column, depending on your needs, into a text file so that each token is on its own line, and voilà, you have tokenized text for use in MALLET, AntConc, Voyant, etc.
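Assuming the Word table lives in a CSV export, the one-token-per-line export might look like the following sketch; the file and column names are placeholders for your own:

```python
import csv

def export_tokens(words_csv, out_txt, column="Word_Clean"):
    """Write one token per line, in file order, for use in MALLET,
    AntConc, Voyant or similar text-mining tools."""
    with open(words_csv, newline="", encoding="utf-8") as f_in, \
         open(out_txt, "w", encoding="utf-8") as f_out:
        for row in csv.DictReader(f_in):
            f_out.write(row[column] + "\n")
```

Swapping `column` to `"Word_Orig"` preserves the original orthography instead of the cleaned forms.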

Once you have some analytical results from the text mining and can track the words or phrases that are useful, the original document becomes a searchable database that corroborates the text mining. It’s also possible to sort results alphabetically, find entries for the word or words you’re interested in and then copy and paste the list of citations for the appearance of that word.

Technical competencies

There are two versions.

The simplest version requires:

  • Excel
  • A text editor
  • A basic understanding of regular expressions in a text editor

Scholars with more technical skill can parlay the theory behind this into a more complex version that involves

  • a scripting language
  • a relational database
  • front-end Web searchability.

This version also provides better support for scholars hoping to apply text mining and natural-language processing analysis to undersupported languages.

Funding and other resources

None and none. The initial forays into data like this on a smaller scale can be done on a laptop with Excel. NB: It does help to have institutional affiliation in order to get access to these very expensive paywalled sources.

At the interim stage–the dangerous stage–a laptop equipped with a Bitnami stack (mine is PHP for a quick and dirty web search) provides access to MySQL, a scripting language and basic file-write capabilities.

When I reached the million-token threshold, I did need additional processing power, which is provided via supercomputing access at my current institution. However, that need stems from a combination of requirements. I’m dealing both with paywalled data and with data in badly supported languages, so I have natural-language-processing information in several additional tables that document relationships between words in the database for network analysis of grammar. These networks of words get very demanding on the processing side.


Footnotes
  1. Or corpora in less well supported languages. The process described here, plus one additional step, makes it easier to tackle text mining in languages without the support that English and other western-Roman-character-set languages have in spades. Occasional footnotes provide a basic explanation, but the entire process is also documented explicitly for scholars working in less-well-supported languages (paywalled or not) at http://www.kalanicraig.com/workflow/workflow-for-unsupported-languages-addendum/.

  2. For scholars working with undersupported languages, an additional field, “root” provides the ability to reconstruct a lemmatized corpus created from the paywalled data for use in MALLET, AntConc or a linguistic network analysis, again with the original citations left intact.

  3. The lemmatization step for unsupported languages happens between tokenization and export.