Archive | PhD

XML and NLP: like Oil and Water? [pt. 2]

8 Jul

This is the second part of a series (see pt. 1) about XML and NLP, and specifically about how awkward it is to go back and forth from the former to the latter. Let me sum up what I was trying to do with my XML (actually SGML) file: I wanted to 1) process the content of some elements using a Named Entity Recognition tool and then 2) reincorporate the extracted named entities as XML markup into the starting file. It sounds trivial, doesn’t it?

But why is this not that straightforward, and why is it worth writing about? Essentially because to do so we need to go from a hierarchical format (XML) to a flat one (BIO or IOB, where each token is labelled as Beginning, Inside or Outside an entity). During this transition all the information about the XML tree is irretrievably lost unless we do something to avoid it. And once it’s lost there is no way to inject the results back into the XML.
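To make the flattening concrete, here is a minimal sketch of what a BIO labelling looks like. The token list, the entity span and the `PERSON` label are made up for the illustration, and the `(first, last, type)` layout of the entities is an assumption, not the format of any specific NER tool.

```python
def to_bio(tokens, entities):
    """Label each token as B-<type>, I-<type> or O.

    `entities` is a list of (first_token_index, last_token_index, type)
    tuples, with both indices inclusive (a hypothetical layout chosen
    for this illustration).
    """
    labels = ["O"] * len(tokens)
    for first, last, etype in entities:
        labels[first] = "B-" + etype
        for i in range(first + 1, last + 1):
            labels[i] = "I-" + etype
    return list(zip(tokens, labels))


# Illustrative input: four tokens, the last two forming one entity.
tokens = ["œuvre", "par", "Achille", "Tatius"]
bio = to_bio(tokens, [(2, 3, "PERSON")])
for token, label in bio:
    print(token, label)
```

Note that nothing in the resulting token/label pairs says which XML element a token came from, or where it sat in the file: that is exactly the information that has to be carried along separately.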

I am aware of the existence of SAX, of course. However, SAX is not that suitable for my purpose, since it lets me keep track of the position in the file being parsed only in terms of line and column number (see this and that). [I have to admit that at this point I did not look into other existing XML parsers.] What I wanted instead was to access, for each node or text element, its start and end position in the original file. The solution I found is probably not the easiest one, but it is quite interesting (I mean, in terms of the additional knowledge I acquired while solving this technical problem): I used ANTLR (ANother Tool for Language Recognition).

An explanation of what ANTLR is and how it works is outside the scope of this post. But to put it simply: ANTLR is a tool for generating parsers for domain-specific languages (see also here). It is typically used to write parsers for programming languages and it is based on a few core concepts: grammar, lexer, parser and Abstract Syntax Tree (AST). It is therefore possible to write an ad-hoc XML/SGML parser with it. To be honest, the learning curve is pretty steep, but one of the most rewarding things about ANTLR is that the same grammar can be compiled into different target languages (Python and Java in my case) with just a few minor changes, which is a great benefit in terms of code reusability.

The parser I came up with (source code here) is based on some other code that was developed by the ANTLR community. Essentially, I did some hacking on the original to allow for tokenising the text elements on the fly while parsing the XML. During parsing, the text elements in the XML are split on space characters into tokens, and the start and end position of each token is kept.

My ANTLR XML/SGML parser also performs a few other normalisations on the fly, in order to produce output that is ready to be consumed by a Named Entity Recogniser:

  1. resolving SGML entities into Unicode;
  2. transcoding BetaCode Greek into Unicode;
  3. tokenising text by using the non-breaking space (&nbsp;) in addition to normal spaces: this task in particular, although it may seem trivial, implies recalculating the position of the new token in the input file and it required a bit more thinking through.

The result of running the parser over an SGML file is a list of tokens. I decided to serialise the output as JSON, for the time being, and a snippet of the result looks pretty much like this:
[{"start": 2768, "end": 2778, "utext": "\u0153uvre", "otext": "&oelig;uvre"},
{"start": 2780, "end": 2782, "utext": "par", "otext": "par"},
{"start": 2784, "end": 2790, "utext": "Achille", "otext": "Achille"}]

Start and end indicate (not surprisingly) the byte position of the token within the file, whereas otext and utext contain respectively the original text and the text after the resolution of character entities.
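The real work is done inside the ANTLR grammar, but the idea of offset-preserving tokenisation can be sketched in a few lines of Python. This is only an approximation under stated assumptions: `tokenise`, its `base` parameter and the sample string are hypothetical, `html.unescape` stands in for proper SGML entity resolution, BetaCode transcoding is left out, and offsets are counted in characters (which coincide with bytes only for single-byte input).

```python
import html
import re

# Separators: any whitespace (including a literal U+00A0) or the
# "&nbsp;" entity as it appears in the raw, unresolved text.
SEPARATOR = re.compile(r"(?:\s|&nbsp;)+")


def make_token(raw, start, stop, base):
    """Build one token record; `stop` is exclusive, "end" is inclusive."""
    otext = raw[start:stop]
    return {
        "start": base + start,
        "end": base + stop - 1,
        "utext": html.unescape(otext),  # text after entity resolution
        "otext": otext,                 # text exactly as in the file
    }


def tokenise(raw, base=0):
    """Split `raw` on separators, keeping each token's original offsets.

    `base` is the offset of `raw` within the whole file (hypothetical).
    """
    tokens = []
    pos = 0
    for sep in SEPARATOR.finditer(raw):
        if sep.start() > pos:
            tokens.append(make_token(raw, pos, sep.start(), base))
        pos = sep.end()
    if pos < len(raw):
        tokens.append(make_token(raw, pos, len(raw), base))
    return tokens


sample = "&oelig;uvre par&nbsp;Achille"
tokens = tokenise(sample, base=2768)
for t in tokens:
    print(t)
```

Because the offsets are taken from the raw text before any entity is resolved, slicing the original file with `start` and `end` always gives back `otext`, however different `utext` may look.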

To sum up, the main benefit of this approach is that, once named entities have been automatically identified within the text of an XML/SGML file (e.g. “Achille” in the example above), we can transform this newly acquired NE annotation into XML markup and pipe it back into the original file.
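As a rough sketch of that last step, the function below wraps character ranges of the original text in tags. Everything here is an assumption made for illustration: the `inject` helper, the `persName` tag name and the sample string are hypothetical, offsets are inclusive, and the spans are assumed neither to overlap each other nor to cross existing element boundaries.

```python
def inject(raw, spans, base=0):
    """Wrap character ranges of `raw` in XML tags.

    `spans` is a list of (start, end, tag) tuples with inclusive
    offsets counted from `base` (hypothetical layout).
    """
    out = raw
    # Work from the end of the text backwards, so that inserting a tag
    # never shifts the offsets of the spans still to be processed.
    for start, end, tag in sorted(spans, reverse=True):
        s = start - base
        e = end - base + 1
        out = out[:s] + "<" + tag + ">" + out[s:e] + "</" + tag + ">" + out[e:]
    return out


raw = "<p>&oelig;uvre par Achille</p>"
marked = inject(raw, [(19, 25, "persName")])
print(marked)
```

The existing markup and the unresolved entity are left untouched, since the text is never re-serialised: only new tags are spliced in at known positions.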

“The World of Thucydides” at CAA 2011

10 Apr

I’m at Heathrow airport waiting to board a flight to Beijing (via Amsterdam), where I’ll be attending the CAA 2011 conference. To get into the conference mood I thought it might be a good idea to post the abstract of the paper that my colleague Agnes Thomas (CoDArchLab, University of Cologne) and I are going to give in a session entitled Digging with words: e-text and e-archaeology. [This version is slightly longer than the one that we submitted and that has been accepted.]

The World of Thucydides: from Texts to Artifacts and back

The work presented in this paper is related to the Hellespont project, an NEH-DFG funded project aimed at bringing together the digital collections of Perseus and Arachne [1]. In this paper we present ongoing work aimed at devising a Virtual Research Environment (VRE) that allows scholars to access both archaeological and textual information [2].

An environment integrating these two heterogeneous kinds of information will be highly valuable for both archaeologists and philologists. Indeed, the former will have easier access to literary sources of the historical period an artifact belongs to, whereas the latter will have at hand iconographic or archaeological evidence related to a given text. We therefore explore the idea of a VRE combining archaeological and philological data with another kind of textual information, namely secondary sources and in particular journal articles. To develop new ways of opening up and combining those different kinds of sources, the project will focus on the so-called Pentecontaetia of the Greek historian Thucydides (Th. 1,89-1,118).

As of now, we do not yet have an automatic tool capable of capturing the passages of Thucydides’ Pentecontaetia that are of importance to our knowledge of Athens and Greece during the Classical period. For the identification of such “links” we rely entirely on the irreplaceable, accurate manual work of scholars. For this reason some preliminary work has been done by A. Thomas to manually identify, within the whole text of Thucydides’ Pentecontaetia, entities representing categories in the archaeological and philological evidence (e.g. built spaces, topography, individual persons, populations). What can be done to some extent by an automatic tool, however, is extracting and parsing both canonical and modern bibliographic references, which express the citation network between ancient texts (i.e. primary sources) and modern publications about them (i.e. secondary sources).

As our corpus of secondary sources we are using the journal articles held in JSTOR and recently made available to researchers via the Data for Research API [3]. Although JSTOR classifies such articles into the separate categories of archaeology and philology, they are likely to contain references to common named entities that make them overlap to some extent. As an example of what we are aiming at, in Th. I 89 the author refers to the rebuilding of the Athenian city walls, after the Persian War at the beginning of the 5th century BC, as a result of the politics of the Athenian Themistocles. Within our VRE, the corresponding archaeological and philological metadata [4,5] will be presented to the user along with JSTOR articles from both archaeological and philological journals related to the contents of this text passage.

From a technical point of view, we are applying Named Entity Recognition techniques to JSTOR data accessed via the DfR API. References to primary sources, usually called “canonical references”, and bibliographic references to other modern publications are to be extracted and parsed from JSTOR articles and will be used to reconstruct the above-mentioned citation networks [6,7]. Semantically, the CIDOC-CRM will provide us with a suitable conceptual model to express the semantics of complex annotations about texts, archaeological findings, physical entities and abstract concepts that scholars might want to create using such a VRE.

References

[1] The Hellespont Project, <http://www.dainst.org/index_04b6084e91a114c63430001c3253dc21_en.html>.

[2] Judith Wusteman, “Virtual Research Environments: What Is the Librarian’s Role?,” Journal of Librarianship and Information Science 40, no. 2 (n.d.): 67-70.

[3] John Burns et al., “JSTOR – Data for Research,” in Research and Advanced Technology for Digital Libraries, ed. Maristella Agosti et al., vol. 5714, Lecture Notes in Computer Science (Springer Berlin / Heidelberg, 2009), 416-419, http://dx.doi.org/10.1007/978-3-642-04346-8_48.

[4] Themistokleische Mauer, http://arachne.uni-koeln.de/item/topographie/8002430

[5] http://www.perseus.tufts.edu/hopper/text?doc=Thuc.+1.89&fromdoc=Perseus:text:1999.01.01999

[6] Matteo Romanello, Federico Boschetti, and Gregory Crane, “Citations in the Digital Library of Classics: Extracting Canonical References by Using Conditional Random Fields,” in Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries (Suntec City, Singapore: Association for Computational Linguistics, 2009), 80–87, http://portal.acm.org/ft_gateway.cfm?id=1699763&type=pdf.

[7] Isaac Councill, C. Lee Giles, and Min-Yen Kan, “ParsCit: an Open-source CRF Reference String Parsing Package,” in Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08) (Marrakech, Morocco: European Language Resources Association (ELRA), 2008), http://www.comp.nus.edu.sg/~kanmy/papers/lrec08b.pdf.

Feet on the ground, DB on the cloud

2 Mar

This quick post is just to say how much the UK NGS saved my day today, and probably a lot more than that.

For my research project I’m digging into the JSTOR archive via the Data for Research API. I soon realised how much scalability matters when trying to process all the data in JSTOR related to scholarly papers in Classics: there are ~60k of them.

The workflow I decided to go for basically consists of retrieving the data from JSTOR, making them persistent via Django (with a MySQL database backend) and then processing them iteratively. The automatic annotations of those data that I’ll be producing (mainly Named Entity Recognition) are to be stored in the same Django DB.

After running the first batch to load my data into my Django application, the situation was as follows: 7k documents processed and a DB size of ~600MB. By the end of the data loading process the DB will have grown to approximately 6GB (just the data, without any annotation). And it’s at this stage that the cloud (or the grid) comes in handy.
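As a quick sanity check on that order of magnitude, here is a naive linear extrapolation from the first batch. It assumes that the first 7k documents are representative of the whole ~60k, which they need not be, and the variable names are of course made up.

```python
docs_loaded = 7_000      # documents in the first batch
size_loaded_mb = 600     # approximate DB size after the first batch
docs_total = 60_000      # approximate number of Classics papers

mb_per_doc = size_loaded_mb / docs_loaded
estimate_gb = mb_per_doc * docs_total / 1000

print(f"{mb_per_doc * 1000:.0f} KB per document, "
      f"about {estimate_gb:.1f} GB in total")
```

This lands a little above 5GB, the same ballpark as the estimate above, and in any case well beyond what I would like to keep on my laptop.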

I run my process locally, but the remote DB is somewhere on the NGS grid (in my case it’s on the Manchester node). This is of course a great relief to me and my machine in terms of disk space, speed in accessing the DB and system load. Whenever I need to, I can dump the DB and install it locally, in case I need to access it without an internet connection. Not to mention the fact that the batch process that loads the data could itself be run from the grid. Finally, to give public access to the data I’m using the same Django application, which pulls the data from the remote MySQL DB.

Having free access to the national grid as a UK researcher is absolutely essential, even for someone like me who does not work in one of the fields known to benefit most from grid infrastructure. Even if digital, I’m nevertheless still a humanist.
