The basic process
The harder version assumes several things:
- You’ve read the basic process and taken it for a test drive.
- You’ve looked at the simpler version and grokked its logic.
- You have some basic competence in SQL and experience using a scripting language to get data into and out of your SQL platform.
If those three things are taken care of, you’ll need a little extra info.
The “Word” table
If you’ve read the simpler version, you’ve seen the basics. Three extra fields give you a little more control:
- Word_Type: If you’re running named-entity recognition or working with any kind of semantic markup that differentiates between types of word entries, this gives you room to do that.
- Word_Lemma and Word_SpeechPart: If you’re working with an unsupported language and need roots or lemmas for text mining, these attach natural-language-processing information to each token in the corpus.
Additionally, you can use these fields to completely recreate the text by printing Word_Word and Word_Punctuation, separated by spaces and sorted by (in order) Word_SourceID, Word_Cite1, Word_Cite2, Word_Cite3, Word_Order.
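A minimal sketch of that reconstruction, using SQLite from Python; the table layout follows the field names above, and the sample rows are invented:

```python
import sqlite3

# Assumed Word-table layout (field names from the article); sample data is invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Word (
        Word_SourceID INTEGER,
        Word_Cite1 INTEGER, Word_Cite2 INTEGER, Word_Cite3 INTEGER,
        Word_Order INTEGER,
        Word_Word TEXT, Word_Punctuation TEXT
    )""")
conn.executemany(
    "INSERT INTO Word VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(1, 1, 1, 1, 1, "Call", ""),
     (1, 1, 1, 1, 2, "me", ""),
     (1, 1, 1, 1, 3, "Ishmael", ".")],
)

# Each word plus its trailing punctuation, space-separated, in citation order.
rows = conn.execute("""
    SELECT Word_Word || Word_Punctuation FROM Word
    ORDER BY Word_SourceID, Word_Cite1, Word_Cite2, Word_Cite3, Word_Order
""").fetchall()
text = " ".join(r[0] for r in rows)
print(text)  # Call me Ishmael.
```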
The “Source” table
Tracking sources extends your ability to search and sort the data across sources, which is very difficult to do in the simpler Excel-based version.
Source refers to some discrete boundary in your text: a novel, a play, a saint’s life. Consider how you might cite a text in a print publication, and use that as the smallest division. To make this data description more clear, we’ll use novels as an example of individual sources and page numbers as our example of a citational record.
At minimum, Source contains a numeric ID for each source (Source_ID) and a title (Source_Title). Variations allow for more effective subcorpus analysis. Examples here include by period (Source_Period, Source_Century), geography (Source_Geography), genre or other taxonomy (Source_Type), or Author (Source_Author). Source_Notes allows for additional documentation (a full citation, a URL, etc.).
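As SQLite DDL run from Python, a Source table along those lines might look like this; the column types and sample row are assumptions, so adapt them to your SQL platform:

```python
import sqlite3

# Hypothetical Source-table schema; column names follow the article,
# types and the sample row are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Source (
        Source_ID INTEGER PRIMARY KEY,
        Source_Title TEXT NOT NULL,
        Source_Author TEXT,
        Source_Period TEXT,
        Source_Century INTEGER,
        Source_Geography TEXT,
        Source_Type TEXT,
        Source_Notes TEXT
    )""")
conn.execute(
    "INSERT INTO Source (Source_Title, Source_Author, Source_Century, Source_Type) "
    "VALUES (?, ?, ?, ?)",
    ("Moby-Dick", "Herman Melville", 19, "novel"),
)
row = conn.execute("SELECT Source_ID, Source_Title FROM Source").fetchone()
```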
Differences in cleaning, tokenization, and import
The biggest difference between a scripted SQL version and an Excel version is that these three steps can be combined.
- Use a script to split the file on Book/Chapter/Section or other citation structure.
- Loop through the citation structure at its smallest level, splitting each section into tokens on word divisions (\s).
- If you’re working with Roman numerals, you can also convert them to Arabic numerals so they sort in proper numeric order; I’ve done that here. Depending on the use case, it may be helpful to allow varchar input into the three Cite fields but CAST those fields AS UNSIGNED in any SELECT queries so they sort numerically.
- Import each token into the database, restarting the Word_Order loop at 1 with each change in the most granular of the Cite fields (e.g. if you’re using Cite3, start Word_Order over at 1 each time Cite3 changes; if you’re only chunking by Cite1 and Cite2, start Word_Order over when Cite2 changes).
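The steps above can be sketched in Python. Everything here is illustrative: the function names, the punctuation set, and the chapters dict are hypothetical, and a real script would INSERT into your SQL platform rather than append to a list.

```python
import re

# Roman-to-Arabic conversion so Cite values sort in proper numeric order.
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_arabic(s):
    total = 0
    s = s.upper()
    for i, ch in enumerate(s):
        v = ROMAN[ch]
        # Subtract when a smaller numeral precedes a larger one (IV, IX, ...).
        total += -v if i + 1 < len(s) and ROMAN[s[i + 1]] > v else v
    return total

def tokenize(section_text):
    # Split on word divisions (\s), peeling trailing punctuation off each token.
    for raw in re.split(r"\s+", section_text.strip()):
        m = re.match(r"(.*?)([.,;:!?]*)$", raw)
        yield m.group(1), m.group(2)

# Hypothetical input: Roman-numbered chapters as the only Cite level.
chapters = {"I": "Call me Ishmael.", "II": "I stuffed a shirt or two"}

rows = []
for cite1_roman, section in chapters.items():
    cite1 = roman_to_arabic(cite1_roman)
    # Word_Order restarts at 1 with each change in the most granular Cite field.
    for order, (word, punct) in enumerate(tokenize(section), start=1):
        rows.append((1, cite1, order, word, punct))

# rows[0] == (1, 1, 1, "Call", "")
```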
An output script with a few options will give you better control over the output to text mining software.
- Which texts do you want? All of them? Just those in a particular genre or time period?
- Which field do you want? Word_Word, the original, or Word_Clean, with carefully cleaned, lower-case data?
- Do you want tokenized line-by-line output or paragraph style output with punctuation?
- Do you want files chunked by author? By text? By one or more of the Word_Cite fields?
- Do you want lemmas so you can control some of the values that go into a topic modeler or a corpus-linguistics tool?
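A hypothetical output function wiring a few of those options together; table and field names follow the article, while the option names and sample data are invented:

```python
import sqlite3

def export_tokens(conn, field="Word_Clean", source_type=None, line_by_line=True):
    """Return {Source_ID: text} for the chosen subcorpus, field, and layout."""
    sql = (f"SELECT Word_SourceID, {field} "
           "FROM Word JOIN Source ON Word_SourceID = Source_ID")
    params = []
    if source_type is not None:  # e.g. restrict to one genre
        sql += " WHERE Source_Type = ?"
        params.append(source_type)
    sql += (" ORDER BY Word_SourceID, Word_Cite1, Word_Cite2,"
            " Word_Cite3, Word_Order")
    chunks = {}
    for source_id, token in conn.execute(sql, params):
        chunks.setdefault(source_id, []).append(token)
    sep = "\n" if line_by_line else " "
    return {sid: sep.join(toks) for sid, toks in chunks.items()}

# Minimal demo schema and data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Source (Source_ID INTEGER PRIMARY KEY,
                         Source_Title TEXT, Source_Type TEXT);
    CREATE TABLE Word (Word_SourceID INTEGER,
                       Word_Cite1 INTEGER, Word_Cite2 INTEGER,
                       Word_Cite3 INTEGER, Word_Order INTEGER,
                       Word_Word TEXT, Word_Clean TEXT);
    INSERT INTO Source VALUES (1, 'Moby-Dick', 'novel');
    INSERT INTO Word VALUES (1, 1, 1, 1, 1, 'Call', 'call'),
                            (1, 1, 1, 1, 2, 'me', 'me');
""")
out = export_tokens(conn, field="Word_Clean", source_type="novel")
# out[1] == "call\nme"
```

Each keyword argument maps onto one of the questions above: field picks original versus cleaned tokens, source_type carves out a subcorpus, and line_by_line toggles tokenized versus paragraph-style output.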