XML is great, really. Most of (*)ML formats are great, included the old fashioned SGML. However, have you ever tried to perform some NLP tasks on an XML-encoded text or, even worse, to do some automatic tagging on an existing (*)ML document? It’s all but easy and straightforward. And this seems to prove the “inadequacy of embedded markup for cultural heritage texts” as D. Schmidt has persuasively argued not long ago.
But it’s lot of fun though and finding a technical solution is doable. This post is to share problems, ideas and solutions about this technical aspect of doing NLP on (*)ML-encoded texts and will be in two parts.
A little while ago I was given an SGML file (~12MB) to process. My idea was to try out on it a Named Entity Recogniser that I have been working on, which extracts standard references to ancient Classical (Greek and Latin) texts. My recogniser is written in Python and accepts as input a file encoded in the IOB format (a format used for the CoNLL-2003 shared task on language-independent named entity recognition). In the IOB format instances are separated by blank lines. Each instance is then tokenised and the resulting tokens are written one per line. Each line contains a number of space-separated column: in the example above the first contains the token itself whereas the second contains a label (category) assigned to the token. *-CRF indicates that a given token is part of a given Named Entity, in this case CRF is used to indicate the presence of a Canonical ReFerence.
This is what an example instance looks like:
this O is O a O canonical O reference: O Hom. B-CRF Il. I-CRF 1, I-CRF Hom. B-CRF Il. I-CRF 1, I-CRF 477; I-CRF 24, I-CRF 788; I-CRF
This format is used both to store the training sets and as output of the recogniser. In other words, the recogniser takes as input an IOB-encoded file where each token is initially assigned the label O (Other) and outputs the same file but with the new labels properly assigned.
Now, the main problem I was faced with is how to tag in the original SGML file those tokens that my recogniser had identified as being part of a named entity. In order to be able to do so, one needs to keep track of the token position within the XML file.
To sum up, these are the steps that I wanted to be able to perform:
- parse the XML and keep only the text content of some elements;
- tokenise the text extracted from the XML (while keeping a reference to the token position within the file): the result will be a list of instances (the text content of given elements) where each instance is a list of tokens;
- the list of instances is then processed by the Named Entity Recogniser which assigns each token one of the following labels [ O | B-CRF | I-CRF ];
- the original XML is then re-processed: the subsequent tokens that were previously labelled as B-CRF or I-CRF are to be included within a new XML element;
- the resulting new XML file (i.e. the original document plus the automatically tagged information) is written to the memory.
[To be continued...]