XML and NLP: like Oil and Water? [pt. 1]

XML is great, really. Most (*)ML formats are great, including the old-fashioned SGML. However, have you ever tried to perform some NLP tasks on an XML-encoded text or, even worse, to do some automatic tagging on an existing (*)ML document? It is anything but easy and straightforward. And this seems to prove the “inadequacy of embedded markup for cultural heritage texts”, as D. Schmidt has persuasively argued not long ago.

But it’s a lot of fun, and finding a technical solution is doable. This post, which will come in two parts, shares problems, ideas and solutions related to doing NLP on (*)ML-encoded texts.

Materials

A little while ago I was given an SGML file (~12MB) to process. My idea was to try out on it a Named Entity Recogniser that I have been working on, which extracts standard references to ancient Classical (Greek and Latin) texts. My recogniser is written in Python and accepts as input a file encoded in the IOB format (a format used for the CoNLL-2003 shared task on language-independent named entity recognition). In the IOB format instances are separated by blank lines. Each instance is tokenised and the resulting tokens are written one per line. Each line contains a number of space-separated columns: in the example below the first contains the token itself, whereas the second contains a label (category) assigned to the token. A *-CRF label indicates that a given token is part of a Named Entity, where CRF stands for Canonical ReFerence: B-CRF marks the first token of an entity and I-CRF the tokens that continue it.

This is what an example instance looks like:

this	O
is	O
a	O
canonical	O
reference:	O
Hom.	B-CRF
Il.	I-CRF
1,	I-CRF
477;	I-CRF
24,	I-CRF
788;	I-CRF

This format is used both to store the training sets and as output of the recogniser. In other words, the recogniser takes as input an IOB-encoded file where each token is initially assigned the label O (Other) and outputs the same file but with the new labels properly assigned.
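As a rough illustration of the format described above, here is a minimal sketch of how such a file could be read and written in Python. The function names (read_iob, write_iob, unlabelled) are hypothetical and are not taken from my recogniser; the sketch also assumes that the token is the first column and the label the last one, and it writes a tab between them.

```python
def read_iob(text):
    """Parse IOB text into a list of instances.

    Each instance is a list of (token, label) pairs; instances are
    separated by blank lines. Assumes token = first column, label = last.
    """
    instances, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current instance (if any).
            if current:
                instances.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        instances.append(current)
    return instances


def write_iob(instances):
    """Serialise instances back to IOB text (tab-separated, an assumption)."""
    return "\n\n".join(
        "\n".join(f"{token}\t{label}" for token, label in instance)
        for instance in instances
    ) + "\n"


def unlabelled(instances):
    """Reset every label to O, as in the file that is fed to the recogniser."""
    return [[(token, "O") for token, _ in instance] for instance in instances]


sample = "this\tO\nis\tO\n\nHom.\tB-CRF\nIl.\tI-CRF\n"
instances = read_iob(sample)
blank_input = write_iob(unlabelled(instances))
```

Reading the sample and writing it back should give the same text, while blank_input is what the recogniser would receive before assigning any label.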

Now, the main problem I was faced with was how to tag in the original SGML file those tokens that my recogniser had identified as being part of a named entity. In order to be able to do so, one needs to keep track of each token's position within the XML file.

To sum up, these are the steps that I wanted to be able to perform:

parse the XML and keep only the text content of some elements;
tokenise the text extracted from the XML (while keeping a reference to the token position within the file): the result will be a list of instances (the text content of given elements) where each instance is a list of tokens;
the list of instances is then processed by the Named Entity Recogniser which assigns each token one of the following labels [ O | B-CRF | I-CRF ];
the original XML is then re-processed: each run of consecutive tokens that were previously labelled as B-CRF or I-CRF is wrapped in a new XML element;
the resulting new XML file (i.e. the original document plus the automatically tagged information) is written to the memory.
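The steps above can be sketched roughly as follows. This is only an illustration under several assumptions, not the solution I ended up with: it works on well-formed XML rather than SGML, it assumes that the selected elements (here a hypothetical p) contain plain text only with no child elements, it wraps entities in a hypothetical ref element, and the labels are hard-coded where the recogniser's output would normally go.

```python
import re
import xml.etree.ElementTree as ET


def tokenise(text):
    """Split on whitespace, keeping (token, start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def entity_spans(tokens, labels):
    """Merge runs of B-CRF/I-CRF tokens into (start, end) character spans."""
    spans, current = [], None
    for (_, start, end), label in zip(tokens, labels):
        if label == "B-CRF" or (label == "I-CRF" and current is None):
            # A new entity starts here; close the previous one if still open.
            if current is not None:
                spans.append(tuple(current))
            current = [start, end]
        elif label == "I-CRF":
            current[1] = end  # extend the open entity
        else:
            if current is not None:
                spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


def wrap_spans(element, spans, tag="ref"):
    """Wrap each character span of element.text in a new child element.

    Assumes the element holds plain text only (no child elements).
    """
    text = element.text or ""
    element.text = text[:spans[0][0]] if spans else text
    for i, (start, end) in enumerate(spans):
        child = ET.SubElement(element, tag)
        child.text = text[start:end]
        next_start = spans[i + 1][0] if i + 1 < len(spans) else len(text)
        child.tail = text[end:next_start]


doc = ET.fromstring("<text><p>see Hom. Il. 1, 477; for details</p></text>")
for p in doc.iter("p"):
    tokens = tokenise(p.text)
    # Stand-in for the recogniser output: one label per token.
    labels = ["O", "B-CRF", "I-CRF", "I-CRF", "I-CRF", "O", "O"]
    wrap_spans(p, entity_spans(tokens, labels))
result = ET.tostring(doc, encoding="unicode")
```

Keeping character offsets for every token is what makes the last step possible: the labels come back per token, and the offsets say where in the element's text the new tags have to go. Elements with mixed content would need the text and tail of every child to be handled as well, which this sketch does not attempt.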

[To be continued…]

A Simple Script to Import Unstructured Bibliographies into Zotero

After receiving yet another bibliography in an unstructured format (.doc), I finally made up my mind to write a simple script that lets me import it into Zotero, saving me quite a lot of manual editing.

Basically, this script chains calls to separate software components (ParsCit, bibutils, Saxon) into a single pipeline.

The source code is hosted on GitHub and is likely to be quite buggy (in particular, the XSLT transformation from ParsCit’s XML into MODS has not been thoroughly tested yet). So feel free to fork the repository and improve the code where needed.

In more detail, the script:

takes as input a plain text bibliography with one entry per line;
parses the input using a ParsCit engine;
outputs an intermediate MODS encoding of the bibliography;
finally transforms the intermediate MODS into a BibTeX file;
your bibliography is now ready to be imported into Zotero!
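To give an idea of how the stages fit together, here is a hedged sketch of such a pipeline in Python. It is not the script from the repository. Every executable name, flag and file name below (citeExtract.pl with -m extract_citations, saxon.jar, parscit2mods.xsl, xml2bib) is an assumption about a typical installation and should be checked against the versions actually installed; build_steps and run_pipeline are hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_steps(bibliography, workdir):
    """Return (argv, stdout_path) pairs for the three stages.

    stdout_path is the file that captures the command's standard output,
    or None when the tool writes its own output file.
    All commands and flags are assumptions, not verified invocations.
    """
    workdir = Path(workdir)
    parscit_xml = workdir / "parscit.xml"
    mods_xml = workdir / "mods.xml"
    bibtex = workdir / "bibliography.bib"
    return [
        # 1. ParsCit: one reference per line -> ParsCit XML (assumed on stdout)
        (["citeExtract.pl", "-m", "extract_citations", str(bibliography)],
         parscit_xml),
        # 2. Saxon: ParsCit XML -> MODS through an XSLT stylesheet
        (["java", "-jar", "saxon.jar", f"-s:{parscit_xml}",
          "-xsl:parscit2mods.xsl", f"-o:{mods_xml}"], None),
        # 3. bibutils: MODS -> BibTeX (assumed on stdout)
        (["xml2bib", str(mods_xml)], bibtex),
    ]


def run_pipeline(bibliography, workdir):
    """Run the stages in order, stopping at the first failing command."""
    for argv, stdout_path in build_steps(bibliography, workdir):
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w", encoding="utf-8") as out:
                subprocess.run(argv, check=True, stdout=out)


steps = build_steps("refs.txt", "out")
```

Separating the command list from its execution makes it easy to print the commands and try them by hand when one of the stages fails.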

A big CAVEAT about the accuracy of the BibTeX output: since the parsing of the plain text input is done automatically by ParsCit, some bibliographic fields might turn out to be incorrect and thus some manual editing may be needed.

The result won’t be perfect, but at least I don’t have to input everything manually from scratch.

(Very Asynchronous) Highlights from the “III incontro di Filologia Digitale” (Verona 3-5 marzo 2010)

The third edition of the “Incontro di Filologia Digitale” was held in Verona on 3-5 March 2010: a three-day meeting with more than 15 presentations in total, organized by Adele Cipolla, Paola Cotticelli and Roberto Rosselli del Turco.

The asynchronous highlights from the conference presented here were selected according to my personal interests. For a complete overview please refer to the program and the full list of presentations.

Several presentations were related to epigraphy: Anelli, Muscariello and Sarullo talked about “The Digital Edition of Epigraphic Texts as Research Tool: the ILA Project”; Farina presented an “Electronic Analysis and Organization of the Syro-Turkic Inscriptions of China and Central Asia” and finally …

Barbera (handout not available) and Tomatis presented the progress of the Corpus Taurinense project, a corpus of texts written in 13th-century Italian. After Barbera’s brilliant introduction to the corpus, Tomatis focussed on the problem of disambiguating POS tagging.