This is the first of a series of posts related to my ongoing work on mining citations to primary sources (classical texts) from modern secondary sources in Classics (e.g. journal articles, commentaries, etc.). Originally, I just wanted to rant about the frustrations of processing OCRed texts but then I’ve decided to turn my rant into something more interesting (hopefully) for the readers…
Let’s first explain how I got here. A substantial part of my PhD project is devoted to answering the question “how can we build an expert system able to extract citations to ancient texts from modern journal articles?”. The fun bit, however, comes a bit later and precisely when I start reflecting on which changes may take place in the way classical texts are studied once scholars have at their disposal a tool like the one I’m trying to develop.
So far I have talked about classical texts and journal articles but let’s be more precise. Being the time of a PhD by definition limited (and in some countries more than in others) I had to identify a couple of text corpora for processing.
I have decided that such corpora also had to be:
- meaningful to classical scholars
- large enough to represent a challenge as for the processing and data analysis
- different from each other so to show that the approach I came up with is, to some extent, general or at least generalisable.
The Clean and the Dirty
The two corpora that I’ve been looking at are respectively the bibliographic reviews of the L’Année Philologique (APh) (the Clean) and the journal articles contained in the JSTOR archive (the Dirty, as in dirty OCR, that is). Among the things they have in common what matters most to me is that they both contain loads of citations to primary sources, that is ancient texts, such Homer’s Iliad or Thucydides’ Histories, just to name two obvious ones.
But the two corpora differ in may ways. There are differences related to the type of texts: APh contains analytical, as opposed to critical, summaries of publications related to Classics and the study of the ancient world, whereas JSTOR contains the full text (OCR) of journals, a subset of which relates to Classics. This implies also a difference in document length: compared to an JSTOR article, an APh summary is typically of about the same length of an article abstract. Moreover, the two corpora are very much different as for data quality: cleanly transcribed texts pulled out from a DB on the one hand and OCRed full text, dumped into text files without any structural markup whatsoever on the other hand.
What this means in practice is that, for example, running heads and page numbers become part of the text to be processed, although it would be desirable to be able to simply filter them out. But this is not critical. Definitely more critical is what happens to the footnotes: there is no linking between a footnote and the place in the text where this occurs. This is critical for the work I’m doing because, for example, is very common to make a claim, footnote it and then populate the footnote with citations of primary sources that back that claim up. This means that, assuming that my system will make it to extract and identify a citation occurring within a footnote, such citation will still be de-contextualised unless we find a way to related it to the footnoted text (but this goes beyond the scope of my work, at least for the time being).
But there’s an even more fundamental yet quite thorny problem to solve…
A Thorny Problem: Sentence Segmentation
My problem was that the first corpus I have processed was the APh, the one with small-sized, clean chunks of text. One problem that I did not have when working with that corpus was what is usually called sentence segmentation, sentence splitting or identification of sentence boundaries, that is identifying the sentences of which a given text is made of.
Do I really need to split my texts into sentences? Well, yes. The sentence is important because is the fine-grained unity of context where a given citation occurs, as opposed to the “global” context of the document containing that sentence or of the corpus containing the document. And the same text passage can be cited is so many different contexts–it can be cited in relation to a specific use of the language (grammar or syntax, in relation to its content, etc.)–that we really need to be able to capture it.
Why did my system fail to split JSTOR articles into sentences? One step before: what did I use to perform this step? The software I was using was the Punkt Sentence Tokenizer contained in the Python-based NLTK framework. Now, one of the typical causes of troubles for this tool is the presence of abbreviations. And in my case, as I said already, there are loads of them. Basically what happens is that whenever a string like “Thuc. 1. 100. 3″ occurs, the algorithm wrongly inserts a new sentence after “Thuc.”, “1.” and “100.”. As you can imagine, given that my main goal is precisely to capture those citations, this seriously undermines the accuracy of my system as it happens at the very beginning of the data processing pipeline.
What to do about this? What I did so far–just to see how big is the improvement in the accuracy of the citation extractor once the sentence boundaries are fixed, and it is remarkable!–was manually to correct the result of the algorithm. But given the scale of the corpus this is no feasible and/or scalable approach: there are approximately 70,000 papers, with hundreds of sentences each, that are waiting to be processed.
Another possible solution is to compare other available libraries (e.g. LingPipe) to see if any of them performs particularly well on texts that are full of citations. The last approach that I can think of is trying to train the Punkt Sentence Tokenizer with a bunch of manually corrected JSTOR documents and see how big is the improvement: of course what I will need in my case is to see a sensible improvement with a relatively small training set.
Oh no, wait, there is another option: resort to the magic power of regular expressions to correct the most predictable mistakes of the sentence tokenizer and see if this does the trick.