Workflows for Text Mining in Under-supported Languages (An Addendum)

What does “under-supported” mean and how do we deal with it?

Imagine for a moment that this isn’t Napoleon III’s biography of Caesar translated into English.

Imagine instead that it’s a medieval Latin text (selfishly) or a text in pretty much any language other than English. You’re likely venturing into under-supported territory when it comes to text mining. That doesn’t mean zero support, but it does mean running into little things like:

most training corpora for Latin are based on classical Latin, not medieval Latin
topic modelers and corpus-linguistics/concordance tools may not lemmatize words properly, so “ecclesia” and “ecclesiae” might count as two separate words in a topic model, when clearly they’re the same word.

The Word table has two fields, Word_Lemma and Word_SpeechPart, that can accommodate these issues and output lemmatized text for use in topic modeling and corpus linguistics. It doesn’t solve all of the problems for text mining in under-supported languages, but it does provide some interesting options for distant reading that are otherwise unavailable.
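To make the “ecclesia”/“ecclesiae” problem concrete, here is a minimal Python sketch. The token list is invented for illustration; it stands in for a handful of Word_Word and Word_Lemma values and is not drawn from any real text.

```python
from collections import Counter

# Hypothetical (surface form, lemma) pairs, as they might sit in
# Word_Word and Word_Lemma for a few tokens of a text.
tokens = [
    ("ecclesia", "ecclesia"),
    ("ecclesiae", "ecclesia"),
    ("ecclesiam", "ecclesia"),
    ("rex", "rex"),
    ("regis", "rex"),
]

# Counting surface forms treats every declined form as its own word.
surface_counts = Counter(form for form, _ in tokens)

# Counting lemmas folds the declined forms back into one entry.
lemma_counts = Counter(lemma for _, lemma in tokens)

print(surface_counts)
print(lemma_counts)
```

Counted by surface form, every word here looks like it appears once; counted by lemma, “ecclesia” appears three times and “rex” twice, which is the picture a topic model needs.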

How these get used

Getting to the “use” stage with well-processed text in an under-supported language requires some careful choices in the cleaning stage.

Will you lemmatize or stem? Lemmatization, or the assignment of a meaningful root word to each token, is more useful than stemming, which just strips the word down to some number of characters regardless of potential meaning, but it has its disadvantages. In Latin, for instance, both processes run into ambiguities. The verb form venit might be about a sale (veneo) or the more prosaic venio (to come). The same goes for locum or locus, or vir (virum, virus). However, since the computer will make the same mistake each time, there’s value in treating the texts this way: it gives us more direct access to recognizable root words rather than random strings that may not account for conjugation and declension irregularities.

Understanding how the Word_Lemma column can reconstruct the text is also important. The primary value is in using only root words, but putting those root words back in the original order in which their declined or conjugated counterparts appear in the text. That lets us look at word co-occurrence, or how often two words appear in close proximity to each other, as a way of exploring high-level semantic patterns across a large number of texts, without having to worry about matching a specific Latin word form in text mining programs that are largely trained on modern languages, and then mostly on English.
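As a rough illustration of co-occurrence on a lemmatized text, the Python sketch below counts how often two lemmas fall within a few tokens of each other. The function name, the window size, and the sample lemma list are all assumptions made for the example; dedicated text mining tools offer their own, more refined measures.

```python
from collections import Counter

def cooccurrences(lemmas, window=3):
    """Count unordered lemma pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for i, left in enumerate(lemmas):
        # Look only forward, so each pair of positions is counted once.
        for right in lemmas[i + 1 : i + 1 + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts

# Hypothetical lemma sequence, in original text order.
lemmas = ["rex", "ecclesia", "dono", "terra", "ecclesia", "rex"]
pairs = cooccurrences(lemmas, window=2)

for pair, n in pairs.most_common():
    print(pair, n)
```

Because the lemmas stay in their original order, the window still reflects real proximity in the text, even though every word has been reduced to its root.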

The basic process

This assumes that you’re working with the advanced SQL/scripting version of the workflow and that you have fields for lemma and part of speech in your Word table. If so, you’ll need:

Excel
A part-of-speech tagger (I used TreeTagger initially and then TnT later on for more granular tagging)

Export

Export just the Word_ID and Word_Word fields for a single text, with each record on a separate line. You now have a tokenized text that can be processed for word root and part of speech.
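If you would rather script the export than run it by hand, something like the following Python sketch would do it. It assumes a SQLite database and a Text_ID column that identifies which text a word belongs to; both are assumptions for illustration, so substitute your own database driver and whatever field ties words to texts in your schema.

```python
import sqlite3

def export_tokens(conn, text_id):
    """Return one 'Word_ID<TAB>Word_Word' line per token, in Word_ID order."""
    rows = conn.execute(
        "SELECT Word_ID, Word_Word FROM Word WHERE Text_ID = ? ORDER BY Word_ID",
        (text_id,),  # Text_ID is a hypothetical column name
    )
    return [f"{word_id}\t{word}" for word_id, word in rows]

# Tiny in-memory stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Word (Word_ID INTEGER PRIMARY KEY, Text_ID INTEGER, Word_Word TEXT)"
)
conn.executemany(
    "INSERT INTO Word VALUES (?, ?, ?)",
    [(1, 1, "Gallia"), (2, 1, "est"), (3, 1, "omnis"), (4, 2, "arma")],
)

lines = export_tokens(conn, 1)
print("\n".join(lines))
```

The Word_Word column alone is what goes to the tagger; keeping Word_ID beside it is what makes the later re-import possible.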

Tag POS

Most part-of-speech taggers output one line for each line from the input file, so pasting your Word_ID and Word_Word fields into Excel means you have a handy reference to paste TreeTagger’s output into.
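The same line-for-line pairing can be sketched in Python. This assumes the tagger output has three tab-separated columns per line (token, part of speech, lemma), which is how TreeTagger is commonly run but depends on the options you use; the tags and Word_ID values below are invented placeholders, not a real tagset.

```python
def align_tags(word_ids, tagger_lines):
    """Pair each Word_ID with the token, POS and lemma on the tagger line at the same position."""
    # The whole approach depends on one output line per input line.
    if len(word_ids) != len(tagger_lines):
        raise ValueError("tagger output and export have different line counts")
    aligned = []
    for word_id, line in zip(word_ids, tagger_lines):
        token, pos, lemma = line.rstrip("\n").split("\t")
        aligned.append((word_id, token, pos, lemma))
    return aligned

# Hypothetical Word_IDs and tagger output lines.
word_ids = [101, 102, 103]
tagger_lines = [
    "ecclesiae\tNOUN\tecclesia",
    "dedit\tVERB\tdo",
    "terram\tNOUN\tterra",
]

aligned = align_tags(word_ids, tagger_lines)
for row in aligned:
    print(row)
```

The line-count check is worth keeping: if the tagger splits or merges a token, every row after that point would otherwise be attached to the wrong Word_ID.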

Re-import

The Word_ID column makes it possible to tie TreeTagger’s lemma to the original record for that word in the database, and all you have to do is =CONCATENATE() an SQL update for each line with TreeTagger’s lemma and POS as new information for each Word_ID.
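For anyone who prefers a script to a spreadsheet formula, here is a minimal Python sketch of the same idea: one UPDATE statement per word. The helper name and sample rows are hypothetical, and the quoting shown (doubling single quotes) is the standard SQL convention but should be checked against your own database.

```python
def update_statement(word_id, lemma, pos):
    """Build one UPDATE for the Word table, doubling single quotes in the values."""
    def quote(value):
        return "'" + value.replace("'", "''") + "'"
    return (
        f"UPDATE Word SET Word_Lemma = {quote(lemma)}, "
        f"Word_SpeechPart = {quote(pos)} WHERE Word_ID = {int(word_id)};"
    )

# Hypothetical (Word_ID, lemma, POS) rows taken from the tagger output.
rows = [(101, "ecclesia", "NOUN"), (102, "do", "VERB")]
statements = [update_statement(*row) for row in rows]

print("\n".join(statements))
```

This mirrors the =CONCATENATE() approach by producing SQL text you can review before running. If the script talks to the database directly, parameterized queries are the safer route than building strings.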

Export Lemma

You now have a set of data that can be exported to any text analysis tool. Export Word_Lemma into a tokenized file and you can use it in MALLET, AntConc or any other text mining tool.
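A scripted version of the lemma export might look like the Python sketch below. It again assumes SQLite and a hypothetical Text_ID column, and it assumes that Word_ID order matches the order of words in the text. Falling back to the original word when no lemma was recorded is one possible choice, not the only one.

```python
import sqlite3

def lemma_text(conn, text_id):
    """Rebuild a text from its lemmas, in original word order."""
    rows = conn.execute(
        "SELECT Word_Word, Word_Lemma FROM Word WHERE Text_ID = ? ORDER BY Word_ID",
        (text_id,),  # Text_ID is a hypothetical column name
    )
    # Fall back to the surface form when no lemma was recorded.
    return " ".join(lemma if lemma else word for word, lemma in rows)

# Tiny in-memory stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Word (Word_ID INTEGER PRIMARY KEY, Text_ID INTEGER, "
    "Word_Word TEXT, Word_Lemma TEXT)"
)
conn.executemany(
    "INSERT INTO Word VALUES (?, ?, ?, ?)",
    [
        (1, 1, "ecclesiae", "ecclesia"),
        (2, 1, "dedit", "do"),
        (3, 1, "Karolus", None),
    ],
)

text = lemma_text(conn, 1)
print(text)
```

The resulting string can be written to a plain text file, one file per text, which is a form most text mining tools will accept.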

These analytical tools will, in turn, help you identify the lemmas you want to flag. You can then search for those lemmas in your database and pull up full-text printouts of the original text.
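As a sketch of that last step, the Python below looks up a lemma and prints each hit with the original, unlemmatized words around it. It assumes SQLite and that Word_ID values run consecutively through a text, so that neighbouring IDs are neighbouring words; if your IDs don’t work that way, the context query would need a different approach.

```python
import sqlite3

def lemma_in_context(conn, lemma, span=2):
    """For each hit on `lemma`, return the original words from `span` tokens before to `span` after."""
    hits = conn.execute(
        "SELECT Word_ID FROM Word WHERE Word_Lemma = ? ORDER BY Word_ID", (lemma,)
    ).fetchall()
    contexts = []
    for (word_id,) in hits:
        # Assumes consecutive Word_IDs are consecutive words in the text.
        words = conn.execute(
            "SELECT Word_Word FROM Word WHERE Word_ID BETWEEN ? AND ? ORDER BY Word_ID",
            (word_id - span, word_id + span),
        )
        contexts.append(" ".join(w for (w,) in words))
    return contexts

# Tiny in-memory stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Word (Word_ID INTEGER PRIMARY KEY, Word_Word TEXT, Word_Lemma TEXT)"
)
conn.executemany(
    "INSERT INTO Word VALUES (?, ?, ?)",
    [
        (1, "rex", "rex"),
        (2, "dedit", "do"),
        (3, "ecclesiae", "ecclesia"),
        (4, "terram", "terra"),
        (5, "suam", "suus"),
    ],
)

contexts = lemma_in_context(conn, "ecclesia", span=1)
print(contexts)
```

Searching on the lemma but printing Word_Word is the point: the search finds every form of the word, and what you read back is the text as it was written.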