Category: NLP

PKP 2011 Hackfest

Today was the kick-off of the hackfest at the PKP 2011 conference. Not many people turned up, but I had the chance to spend some quality (coding) time with PKP developers and to have a sort of personal code sprint on a side project, namely developing a plugin to integrate a Named Entity Recognition (NER) web service into an OJS installation (see here and there for more theoretical background).

At the end of the day what I got done was:

  • set up a local instance of OJS (version 2.3.6) using MAMP;
  • give the OJS Voyeur plugin a quick try, which unfortunately for me works only with versions <= 2.2.x;
  • create the bare bones of the plugin, whose code is up here (for my personal record rather than for others’ use, at least at this early stage);
  • write a PHP class to query a web service (that I’m developing) to extract citations of ancient works from (plain) texts;
  • come up with two possible scenarios for further implementation of the plugin, hopefully to happen sooner than next year’s PKP hackfest 😉
The idea of this post, in fact, is to comment a little on these two possible scenarios.

1. Client-side centric

The first scenario is rather heavy on the client side. The code is packaged as an OJS plugin and what it does is essentially the following:

  1. after an article is loaded for viewing, a JavaScript file (grab.js) gets all the <p> elements of the HTML article and sends them over Ajax to a PHP page (proxy.php);
  2. a PHP class acts as a proxy (or client) for a third-party NER web service;
  3. the data received via the Ajax call are passed on to the web service via XML-RPC (see the sketch after this list);
  4. the response is returned by the web service in JSON or XML format…
  5. … and is then processed again by the JS script, ideally using a compiled template based on jQuery’s templating capabilities. Finally, the extracted citations are displayed as a summary box alongside the article.
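To make steps 2-4 a bit more concrete, here is a minimal sketch of the XML-RPC round trip. It is in Python rather than PHP purely for brevity, and both the endpoint URL and the method name are made up, since the web service itself is still under development:

import xmlrpc.client

# hypothetical endpoint of the citation-extraction (NER) web service
SERVICE_URL = "http://example.org/ner-service/xmlrpc"

def extract_citations(paragraphs):
    """Send a list of plain-text paragraphs to the NER service and
    return its response (the method name 'extract' is an assumption)."""
    proxy = xmlrpc.client.ServerProxy(SERVICE_URL)
    return proxy.extract(paragraphs)

# e.g. extract_citations(["A paragraph quoting some ancient work..."])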

2. Server-side centric

In the second scenario that I envisaged, instead, most of the processing happens on the server side (a rough sketch of the flow follows the list):

  1. before being displayed, the article is processed to extract <p> elements;
  2. the main plugin class (plugin.php) takes care of sending the input to and receiving a response from the NER service;
  3. the response is then run through a template (template.tpl), exploiting OJS’s templating functionality;
  4. the formatted summary box is injected into the HTML, which is now ready to be displayed to the user.
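The sketch below shows the overall server-side flow, again in Python for illustration only (the real thing would live in plugin.php and would render the box through a Smarty template rather than by string concatenation) and assuming the hypothetical service simply returns a list of citation strings:

import re
import xmlrpc.client

SERVICE_URL = "http://example.org/ner-service/xmlrpc"  # hypothetical endpoint

def add_citation_box(article_html):
    # 1. extract the text of the <p> elements (a crude regex stands in
    #    for whatever article parsing OJS already provides)
    paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", article_html, re.S)
    # 2. send the input to the NER service and get the response back
    #    (the method name 'extract' is, again, an assumption)
    citations = xmlrpc.client.ServerProxy(SERVICE_URL).extract(paragraphs)
    # 3. render the summary box (template.tpl would do this in OJS)
    box = '<div class="citations"><ul>%s</ul></div>' % "".join(
        "<li>%s</li>" % c for c in citations)
    # 4. inject it into the HTML before it is served to the user
    return article_html.replace("</body>", box + "</body>", 1)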

All in all, I think I came up with (1) mainly because my PHP is rather rusty at the moment ;). Therefore, although I’m quite reluctant to admit it, I might decide to go for (2). However, a good argument in favour of the former is the case where the user can decide, for each paper, whether or not to enable this feature.

XML and NLP: like Oil and Water? [pt. 2]

This is the second part of a series (see pt. 1) about XML and NLP, and specifically about how it’s not really handy to go back and forth between the former and the latter. Let me sum up what I was trying to do with my XML (SGML, actually) file: I wanted to 1) process the content of some elements using a Named Entity Recognition tool and then 2) be able to reincorporate the extracted named entities as XML markup into the starting file. It sounds trivial, doesn’t it?

But why is this actually not that straightforward, and worth writing about? Essentially because to do so we need to go from a hierarchical format (XML) to a kind of flat one (BIO or IOB). During this transition all the information about the XML tree is irremediably lost, unless we do something to avoid it. And once it’s lost, there is no way to inject it back into the XML.
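Just to make the “flat” part tangible, in BIO/IOB format the content is reduced to one token (and tag) per line, e.g. something like this for the “œuvre par Achille” fragment that appears further down in this post (the tag labels are purely illustrative). All that survives is the token sequence: which element each token came from, and at which position, is gone.

œuvre    O
par      O
Achille  B-AUTHOR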

I am aware of the existence of SAX, of course. However, SAX is not that suitable for my purpose, since it allows me to keep track of the position in the file being parsed only in terms of line and column number (see this and that). [I have to admit that at this point I did not look into other existing XML parsers.] Instead, I wanted to access, for each node or text element, its start and end position in the original file. The solution I found is probably not the easiest one, but at the same time it’s quite interesting (I mean, in terms of the additional knowledge I acquired while solving this technical problem). The solution was to use ANTLR (ANother Tool for Language Recognition).

An explanation of what ANTLR is and how it works is beyond the scope of this post. But to put it simply: ANTLR is a tool (and grammar language) for generating parsers for domain-specific languages (see also here). It is typically used to write parsers for programming languages and is based on a few core concepts: grammar, lexer, parser and Abstract Syntax Tree (AST). Therefore, it is possible to write an ad-hoc XML/SGML parser using this language. To be honest, the learning curve is pretty steep, but one of the most rewarding things about ANTLR is that the same grammar can be compiled into different target languages (Python and Java, in my case) with just a few (and not substantial) changes, with great benefits in terms of code reusability.

The parser I came up with (source code here) is based on some other code developed by the ANTLR community. Essentially, I did some hacking on the original to allow for tokenising the text elements on the fly while parsing the XML. During parsing, the text elements in the XML are tokenised on space characters and split into tokens whose start and end positions are kept.

My ANTLR XML/SGML parser performs a few further normalisations on the fly, in order to produce output that is ready to be consumed by a Named Entity Recogniser:

  1. resolving SGML entities into Unicode;
  2. transcoding BetaCode Greek into Unicode;
  3. tokenising text on the non-breaking space (&nbsp;) in addition to normal spaces: this task in particular, although it may seem trivial, implies recalculating the position of each new token within the input file, and it required a bit more thinking through.
The result of running the parser over an SGML file is a list of tokens. I decided to serialise the output as JSON for the time being, and a snippet of the result looks pretty much like this:
[{"start": 2768, "end": 2778, "utext": "\u0153uvre", "otext": "&oelig;uvre"},
{"start": 2780, "end": 2782, "utext": "par", "otext": "par"},
{"start": 2784, "end": 2790, "utext": "Achille", "otext": "Achille"}]

start and end indicate (not surprisingly) the byte positions of the token within the file, whereas otext and utext contain, respectively, the original text and the text after the resolution of character entities.
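To give an idea of the offset bookkeeping involved (especially for point 3 of the list above), here is a simplified Python sketch of how records of this shape could be produced. It is not the ANTLR grammar itself; html.unescape merely stands in for the SGML entity and Beta Code resolution performed by the real parser, and positions are computed on the string rather than on raw bytes:

import html
import re

# a separator is a run of whitespace characters and/or &nbsp; entities
SEPARATOR = re.compile(r"(?:\s|&nbsp;)+")

def tokenize_with_offsets(text, base=0):
    """Split `text` into tokens, recording for each one its (inclusive)
    start/end position in the original input, its original text and its
    entity-resolved text. `base` is the offset of `text` within the file."""
    tokens, pos = [], 0
    for sep in SEPARATOR.finditer(text):
        if sep.start() > pos:
            otext = text[pos:sep.start()]
            tokens.append({"start": base + pos, "end": base + sep.start() - 1,
                           "otext": otext, "utext": html.unescape(otext)})
        pos = sep.end()
    if pos < len(text):
        otext = text[pos:]
        tokens.append({"start": base + pos, "end": base + len(text) - 1,
                       "otext": otext, "utext": html.unescape(otext)})
    return tokens

# tokenize_with_offsets("&oelig;uvre par Achille", base=2768) returns
# the three records shown in the JSON snippet above.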

To sum up, the main benefit of this approach is that, once named entities have been automatically identified within the text of an XML/SGML file (e.g. “Achille” in the example above), we can transform this newly acquired NE annotation into XML markup and pipe it back into the original file.
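As a rough sketch of that last step (in Python again; the element name below is just a placeholder, and strictly speaking one would operate on the raw bytes, since start and end are byte positions), the markup can be spliced in by working backwards through the list of entities, so that earlier offsets are not shifted by the insertions:

def inject_entities(sgml, entities, tag="entity"):
    """Wrap each recognised entity in an XML element, using the (inclusive)
    start/end offsets recorded at tokenisation time."""
    out = sgml
    # work backwards, so earlier offsets remain valid after each insertion
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        out = (out[:ent["start"]] + "<%s>" % tag
               + out[ent["start"]:ent["end"] + 1] + "</%s>" % tag
               + out[ent["end"] + 1:])
    return out

# e.g. inject_entities(original_file_content, [{"start": 2784, "end": 2790}])
# wraps "Achille" in <entity>…</entity>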