<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Computers for the Classics</title>
	<atom:link href="http://c4tc.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://c4tc.wordpress.com</link>
	<description>... a PhD Research Blog</description>
	<lastBuildDate>Tue, 06 Dec 2011 00:08:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='c4tc.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Computers for the Classics</title>
		<link>http://c4tc.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://c4tc.wordpress.com/osd.xml" title="Computers for the Classics" />
	<atom:link rel='hub' href='http://c4tc.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Linked Open Data for the Ancient World at CAA 2012</title>
		<link>http://c4tc.wordpress.com/2011/12/06/linked-open-data-for-the-ancient-world-at-caa-2012/</link>
		<comments>http://c4tc.wordpress.com/2011/12/06/linked-open-data-for-the-ancient-world-at-caa-2012/#comments</comments>
		<pubDate>Tue, 06 Dec 2011 00:08:03 +0000</pubDate>
		<dc:creator>Matteo Romanello</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[caa2012]]></category>
		<category><![CDATA[cfp]]></category>

		<guid isPermaLink="false">http://c4tc.wordpress.com/?p=253</guid>
		<description><![CDATA[This year the Computer Applications and Quantitative Methods in Archaeology (CAA) conference will be held in Southampton (26-30 March 2012). I will be chairing, together with Dr. Felix Schäfer (Deutsches Archäologisches Institut, Berlin) and Prof. Dr. Reinhard Förtsch (CoDArchLab, University of Cologne), a session on Linked Open Data for the Ancient World.  This session aims to explore the opportunities, challenges and [...]]]></description>
			<content:encoded><![CDATA[<p>This year the <em>Computer Applications and Quantitative Methods in Archaeology</em> (CAA) conference will be held in Southampton (26-30 March 2012). I will be chairing, together with Dr. Felix Schäfer (Deutsches Archäologisches Institut, Berlin) and Prof. Dr. Reinhard Förtsch (CoDArchLab, University of Cologne), a session on <strong><em>Linked Open Data for the Ancient World</em></strong>.</p>
<p>This session aims to explore the opportunities, challenges and methodological consequences related to the Linked Open Data approach for the study of the ancient world. We welcome multi-disciplinary submissions dealing with the following or related aspects of Linked Open Data: URIs for Cultural Heritage objects, methodological consequences of LOD, projects publishing data as LOD, relevant tools and live applications based on LOD, digital libraries and their content in relation to ancient world objects, other approaches to making data interoperable and interlinked.</p>
<p>The deadline for submission has been extended to Dec 7 (11:59pm GMT). <a href="http://www.southampton.ac.uk/caa2012/submissions/index.html">Here</a> you can find more details about the conference and read the call for papers, and <a href="https://www.ocs.soton.ac.uk/index.php/CAA/2012/schedConf/cfp">there</a> you can submit your abstract.</p>
<h2>Linked Open Data for the Ancient World (abstract)</h2>
<p><em>[session code: Data1]</em></p>
<p>The study of the Ancient World is by nature a rich soil for the adoption and exploitation of the Linked Open Data (LOD) approach. Indeed, its long tradition, the diversity of materials and resources, and the high level of disciplinary specialisation lead to a situation where silos of knowledge, even when available online and under open access licenses, are isolated from each other. This situation is also reflected in the segmentation that the study of the Ancient World has reached, with the inevitable tendency to favour one single perspective at the expense of others. By contrast, the LOD approach allows us to integrate heterogeneous sources of information by means of links and persistent identifiers while preserving the disciplinary specificity of data.</p>
<p>The recent adoption of the LOD principles by projects such as Pelagios [1], SPQR [2] and the British Museum [3], in line with the CIDOC-CRM’s Linked Open Data Recommendation for Museums [4], is an important step towards a future of interoperable data in archaeology and classics. There is a variety of ways in which different resources are related to each other: an inscribed stone, for instance, will be linked to the edition of the text, to the building and location it belonged to, to different photographs of the object, to a record in the museum catalog and to related literature. Having those different pieces of information interconnected would allow us to overcome, to some degree, the fragmented view of antiquity mentioned above by rendering a more holistic image of the past.</p>
<p>In this session we shall discuss the advantages and disadvantages of LOD for the study of the Ancient World, look at available data, existing tools and live applications (beyond the status of being testbeds) and ask which steps should be taken to overcome existing obstacles and increase the amount of LOD. Furthermore, we welcome reflections on the opportunities, challenges and methodological consequences for the disciplines involved. In continuity with past sessions of the conference on related topics, this session addresses issues including but not limited to:</p>
<ul>
<li>URIs for Cultural Heritage objects</li>
<li>methodological reflections on consequences of LOD</li>
<li>experiences of projects publishing their data as LOD</li>
<li>discussion of relevant tools and live applications based on LOD</li>
<li>digital libraries and their content in relation to Ancient World objects</li>
<li>other approaches to making data interoperable and interlinked</li>
</ul>
<h3>References</h3>
<p>[1] <a href="http://pelagios-project.blogspot.com/">http://pelagios-project.blogspot.com/</a></p>
<p>[2] <a href="http://spqr.cerch.kcl.ac.uk/">http://spqr.cerch.kcl.ac.uk/</a></p>
<p>[3] <a href="http://collection.britishmuseum.org/About">http://collection.britishmuseum.org/About</a></p>
<p>[4] <a href="http://www.cidoc-crm.org/URIs_and_Linked_Open_Data.html">http://www.cidoc-crm.org/URIs_and_Linked_Open_Data.html</a></p>
]]></content:encoded>
			<wfw:commentRss>http://c4tc.wordpress.com/2011/12/06/linked-open-data-for-the-ancient-world-at-caa-2012/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/da8d899f4efb4a6dcd1c98a380b49e4b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mromanello</media:title>
		</media:content>
	</item>
		<item>
		<title>PKP 2011 Hackfest</title>
		<link>http://c4tc.wordpress.com/2011/09/27/pkp2011-hackfest/</link>
		<comments>http://c4tc.wordpress.com/2011/09/27/pkp2011-hackfest/#comments</comments>
		<pubDate>Tue, 27 Sep 2011 21:14:27 +0000</pubDate>
		<dc:creator>Matteo Romanello</dc:creator>
				<category><![CDATA[coding]]></category>
		<category><![CDATA[hacking]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[hackfest]]></category>
		<category><![CDATA[ojs]]></category>
		<category><![CDATA[pkp2011]]></category>
		<category><![CDATA[plugin]]></category>

		<guid isPermaLink="false">http://c4tc.wordpress.com/?p=227</guid>
		<description><![CDATA[Today there was the kick-off of the hackfest at the PKP 2011 conference. Not many people turned up, but I had the chance to spend some quality (coding) time with PKP developers and to have a sort of personal code sprint  on a side project, that is developing a plugin to integrate a Named Entity [...]]]></description>
			<content:encoded><![CDATA[<p>Today there was the kick-off of the hackfest at the PKP 2011 conference. Not many people turned up, but I had the chance to spend some quality (coding) time with PKP developers and to have a sort of personal code sprint on a side project, namely developing a plugin to integrate a Named Entity Recognition (NER) web service into an OJS installation (see <a href="http://c4tc.wordpress.com/2011/09/22/idea-for-ojs-plugin/">here</a> and <a href="http://leo.cilea.it/index.php/jlis/article/view/4603">there</a> for a more theoretical background).</p>
<p>At the end of the day what I got done was:</p>
<ul>
<li>set up a local instance of OJS (version 2.3.6) using <a href="http://www.mamp.info/">MAMP</a>;</li>
<li>give a quick try to the <a href="https://github.com/mcds/Voyeur-OJS-Plugin">OJS Voyeur plugin</a>, which unfortunately for me works only with versions &lt;=2.2.x;</li>
<li>create the bare bones of the plugin, whose code is up <a href="https://github.com/mromanello/ojs-crex-plugin">here</a> (for my personal record rather than for others&#8217; use, at least at this early stage);</li>
<li>write a PHP class to query a web service (that I&#8217;m developing) to extract citations of ancient works from (plain) texts;</li>
<li>come up with two possible scenarios for further implementation of the plugin, to happen possibly earlier than next year&#8217;s PKP hackfest <img src='http://s1.wp.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </li>
</ul>
<div>The idea of this post, indeed, is to comment a little on these two possible scenarios.</div>
<h2>1. Client-side centric</h2>
<p>The first scenario looks rather heavy on the client-side. The plugin is packaged as an OJS plugin and what it does is essentially as follows:</p>
<ol>
<li>after an article is loaded for viewing, a JavaScript file (grab.js) collects all the &lt;p&gt; elements of the HTML article and sends them via Ajax to a PHP page (proxy.php);</li>
<li>a PHP class acts as a proxy (or client) for a third-party NER web service;</li>
<li>the data received via the Ajax call are passed on to the web service via XML-RPC;</li>
<li>the response is returned by the web service in JSON or XML format&#8230;</li>
<li>&#8230; and then processed again by the JavaScript, ideally using a compiled template based on <a href="http://api.jquery.com/jQuery.template/">jQuery&#8217;s template capability</a>. Finally, the extracted citations are displayed as a summary box alongside the article.</li>
</ol>
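<p>As a rough illustration of steps 2 to 4, here is a minimal sketch of the proxy logic, written in Python rather than PHP purely for brevity. The method name <code>extract_citations</code>, the service URL and the shape of the JSON reply are assumptions made for illustration, not the actual interface of the web service; <code>fake_extract</code> is an offline stand-in so that the example runs without a server.</p>

```python
import json
import xmlrpc.client


def make_remote_extractor(service_url, method_name="extract_citations"):
    """Return a callable that forwards one paragraph to an XML-RPC service.

    Both the URL and the method name are hypothetical placeholders.
    No connection is opened until the returned callable is invoked.
    """
    proxy = xmlrpc.client.ServerProxy(service_url)
    return getattr(proxy, method_name)


def proxy_request(paragraphs, extract):
    """Run every paragraph through `extract` and return a JSON string."""
    results = []
    for index, text in enumerate(paragraphs):
        results.append({"paragraph": index, "citations": extract(text)})
    return json.dumps(results)


def fake_extract(text):
    """Offline stand-in for the NER service, for demonstration only."""
    return ["[Dem.] 37.58-60"] if "[Dem.]" in text else []


if __name__ == "__main__":
    sample = ["See [Dem.] 37.58-60.", "No citation here."]
    print(proxy_request(sample, fake_extract))
```

<p>Swapping <code>fake_extract</code> for <code>make_remote_extractor("http://localhost:8000/")</code> would send each paragraph to a real XML-RPC endpoint, assuming one is listening at that (hypothetical) address.</p>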
<p><a href="http://c4tc.files.wordpress.com/2011/09/dataflow2.jpg"><img class="aligncenter size-medium wp-image-230" title="dataflow2" src="http://c4tc.files.wordpress.com/2011/09/dataflow2.jpg?w=300&#038;h=194" alt="" width="300" height="194" /></a></p>
<h2>2. Server-side centric</h2>
<p>Instead, in the second scenario that I envisaged most of the processing happens on the server-side.</p>
<ol>
<li><strong>before</strong> being displayed, the article is processed to extract &lt;p&gt; elements;</li>
<li>the main plugin class (plugin.php) takes care of sending the input to and receiving a response from the NER service;</li>
<li>the response is then run through a template (template.tpl) by exploiting OJS&#8217;s templating functionality;</li>
<li>the formatted summary box is injected into the HTML which is <strong>now</strong> ready to be displayed to the user.</li>
</ol>
<p><a href="http://c4tc.files.wordpress.com/2011/09/dataflow1.jpg"><img class="aligncenter size-medium wp-image-229" title="dataflow1" src="http://c4tc.files.wordpress.com/2011/09/dataflow1.jpg?w=300&#038;h=175" alt="" width="300" height="175" /></a></p>
<p>All in all, I think that I came up with (1) mainly because my PHP is rather rusty at the moment <img src='http://s1.wp.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> . Therefore, although I&#8217;m quite reluctant to admit it, I might decide to go for (2). However, a good reason to opt for the former is the case where users can decide, for each paper, whether or not to enable this feature.</p>
]]></content:encoded>
			<wfw:commentRss>http://c4tc.wordpress.com/2011/09/27/pkp2011-hackfest/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/da8d899f4efb4a6dcd1c98a380b49e4b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mromanello</media:title>
		</media:content>

		<media:content url="http://c4tc.files.wordpress.com/2011/09/dataflow2.jpg?w=300" medium="image">
			<media:title type="html">dataflow2</media:title>
		</media:content>

		<media:content url="http://c4tc.files.wordpress.com/2011/09/dataflow1.jpg?w=300" medium="image">
			<media:title type="html">dataflow1</media:title>
		</media:content>
	</item>
		<item>
		<title>Idea for an OJS plugin</title>
		<link>http://c4tc.wordpress.com/2011/09/22/idea-for-ojs-plugin/</link>
		<comments>http://c4tc.wordpress.com/2011/09/22/idea-for-ojs-plugin/#comments</comments>
		<pubDate>Thu, 22 Sep 2011 09:11:11 +0000</pubDate>
		<dc:creator>Matteo Romanello</dc:creator>
				<category><![CDATA[coding]]></category>
		<category><![CDATA[hacking]]></category>
		<category><![CDATA[hackfest]]></category>
		<category><![CDATA[ojs]]></category>
		<category><![CDATA[pkp]]></category>
		<category><![CDATA[plugin]]></category>

		<guid isPermaLink="false">http://c4tc.wordpress.com/?p=210</guid>
		<description><![CDATA[I have been meaning for quite a while to find some time to code a plugin for the Open Journal Systems (OJS) platform. Unfortunately it hasn&#8217;t happened yet. However, the good news is that the chance somehow came to me, since this year&#8217;s PKP conference will be held in a few days in Berlin, that [...]]]></description>
			<content:encoded><![CDATA[<p>I have been meaning for quite a while to find some time to code a plugin for the Open Journal Systems (OJS) platform. Unfortunately it hasn&#8217;t happened yet. However, the good news is that the chance somehow came to me, since this year&#8217;s PKP conference will be held in a few<del> weeks</del> days <a href="http://pkp.sfu.ca/ocs/pkp/index.php/pkp2011/pkp2011">in Berlin</a>, which is where I recently moved and now live.  And at the same time as the PKP conference there will be a <a href="http://pkp.sfu.ca/ocs/pkp/index.php/pkp2011/index/pages/view/hackfest">PKP hackfest</a> where I hope to have the chance to push forward my idea for an OJS plugin and eventually get some coding done.</p>
<p>The idea is quite simple, but my knowledge of OJS is not (yet) good enough to give me a clear idea of how to implement it. The plugin should enable the detection and markup of certain bits and pieces (read &#8220;named entities&#8221;) of articles from an OJS installation. Although my application of the plugin (originally) targets a specific type of named entity, namely citations of ancient texts, found mainly in Classics journals, it would be possible to generalise the idea for wider application. Indeed, the plugin could be thought of as applicable to any named entities contained in journal articles, provided that a web service for them is available.</p>
<p>As an example, suppose we have an existing installation of OJS in which an article contains the following paragraph (actually taken from a real-world article that appeared in <a href="http://grbs.library.duke.edu/article/view/551/631"><em>Greek, Roman, and Byzantine Studies</em></a>):</p>
<blockquote><p>Thus, in the paragraphê speeches <strong>([Dem.] 37.58–60, 38.21–22)</strong>, a binding settlement is sometimes described as a “boundary marker” (horos); in an inheritance dispute (40.39), the binding decision is a telos or peras.</p></blockquote>
<p>The text in bold contains two references to Demosthenes&#8217; works: (1) a reference to sections 58–60 of the speech <em>Against Pantaenetus</em> and (2) another to sections 21–22 of <em>Against Nausimachus and Xenopeithes</em>. The plugin would parse each paragraph and then produce a result somewhat similar to <a href="http://www.stoa.org/projects/demos/article_libanius?page=33&amp;greekEncoding=">this</a>, where the cited texts are displayed alongside the article text. All in all, the idea is not very different from <a href="http://pkp.sfu.ca/ojs/docs/userguide/2.3.3/sectionEditorReferences.html">OJS&#8217;s citation markup assistant</a>, although it can be generalised to cover other kinds of named entities (people, organisations, etc.).</p>
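<p>To make the parsing step concrete, here is a deliberately naive sketch in Python that pulls the two references out of the sentence above with a regular expression. It is not how the citation extraction web service works; the pattern only covers the &#8220;author abbreviation plus speech.section ranges&#8221; shape seen in this one example, and the function name is made up for illustration.</p>

```python
import re

# One scope is "speech.section", optionally followed by a range end,
# e.g. "37.58–60" (en dash) or "37.58-60" (hyphen).
SCOPE = r"\d+\.\d+(?:[–-]\d+)?"

# An author abbreviation such as "Dem." or "[Dem.]", then one or more
# comma-separated scopes.
CITATION = re.compile(
    r"(\[?[A-Z][a-z]+\.\]?)\s+(" + SCOPE + r"(?:,\s*" + SCOPE + r")*)"
)


def find_citations(text):
    """Return a list of (author, [scopes]) pairs found in `text`."""
    found = []
    for match in CITATION.finditer(text):
        scopes = re.findall(SCOPE, match.group(2))
        found.append((match.group(1), scopes))
    return found


if __name__ == "__main__":
    sentence = "in the paragraphê speeches ([Dem.] 37.58–60, 38.21–22), a binding settlement"
    for author, scopes in find_citations(sentence):
        print(author, scopes)
```

<p>A bare reference such as (40.39), which relies on the reader remembering the author from context, is not picked up by this pattern; handling cases like that is part of what makes a dedicated service worthwhile.</p>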
<p>Some aspects that I believe are important for the implementation of such a plugin are:</p>
<ul>
<li><strong>client/server</strong> architecture: the plugin should act as a client with respect to the Named Entity Recognition web service; I already have a working prototype of a web service (based on XML-RPC) that performs the extraction of citations as described above;</li>
<li>the <strong>markup</strong> of the extracted named entities should be customisable, ideally based on a template rewrite system, and should allow one to output RDFa or microformatted markup;</li>
<li>being able to <strong>review, correct and therefore store the output</strong> of the automatic extraction would be a plus (possibly including interaction with authority lists to which the named entities can be linked).</li>
</ul>
<p>So, this is the idea in a nutshell. I&#8217;m looking forward to discussing it with interested OJSers next week in Berlin and I hope there will be a follow-up post with some updates on the hackfest&#8217;s outcome.</p>
]]></content:encoded>
			<wfw:commentRss>http://c4tc.wordpress.com/2011/09/22/idea-for-ojs-plugin/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/da8d899f4efb4a6dcd1c98a380b49e4b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mromanello</media:title>
		</media:content>
	</item>
		<item>
		<title>XML and NLP: like Oil and Water? [pt. 2]</title>
		<link>http://c4tc.wordpress.com/2011/07/08/xml-and-nlp-pt2/</link>
		<comments>http://c4tc.wordpress.com/2011/07/08/xml-and-nlp-pt2/#comments</comments>
		<pubDate>Fri, 08 Jul 2011 10:00:59 +0000</pubDate>
		<dc:creator>Matteo Romanello</dc:creator>
				<category><![CDATA[coding]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[PhD]]></category>
		<category><![CDATA[antlr]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[sgml]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://c4tc.wordpress.com/?p=157</guid>
		<description><![CDATA[This is the second part of a series (see pt. 1) about XML and NLP, and specifically about how it&#8217;s not really handy to go back and forth from the former to the latter. Let me sum up what I was trying to do with my XML (SGML actually) file: I wanted to 1) process [...]]]></description>
			<content:encoded><![CDATA[<p>This is the second part of a series (<a href="http://c4tc.wordpress.com/2011/01/21/xml-and-nlp-pt1/">see pt. 1</a>) about XML and NLP, and specifically about how it&#8217;s not really handy to go back and forth between the two. Let me sum up what I was trying to do with my XML (SGML actually) file: I wanted to 1) process the content of some elements using a Named Entity Recognition tool and then 2) be able to reincorporate the extracted named entities as XML markup into the starting file. It sounds trivial, doesn&#8217;t it?</p>
<p>But why is this actually not that straightforward, and worth writing about? Essentially because to do so we need to go from a hierarchical format (XML) to a flat one (BIO or IOB). During this transition all the information about the XML tree is irretrievably lost <strong>unless</strong> we do something to avoid it. And once it&#8217;s lost there is no way to inject the annotations back into the XML.</p>
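<p>A tiny sketch in Python of what the flat format looks like may help here. Each token gets a B- (begin), I- (inside) or O (outside) tag; the label <code>CIT</code> and the function name are hypothetical. Note that nothing in the output says which XML element a token came from, which is exactly the information that gets lost.</p>

```python
def to_iob(tokens, entities):
    """Pair each token with an IOB tag.

    `entities` maps (first_token, last_token) index pairs, inclusive,
    to an entity label. Tokens outside every entity are tagged "O".
    """
    tags = ["O"] * len(tokens)
    for (first, last), label in entities.items():
        tags[first] = "B-" + label
        for i in range(first + 1, last + 1):
            tags[i] = "I-" + label
    return list(zip(tokens, tags))


if __name__ == "__main__":
    tokens = ["in", "[Dem.]", "37.58-60", "we", "read"]
    # One entity spanning the second and third tokens.
    for token, tag in to_iob(tokens, {(1, 2): "CIT"}):
        print(token, tag)
```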
<p>I am aware of the existence of SAX, of course. However, SAX is not that suitable for my purpose, since it lets me keep track of the position in the file being parsed only in terms of line and column number (see <a href="http://stackoverflow.com/questions/3507350/java-xml-parsing-and-original-byte-offsets">this</a> and <a href="http://www.java-tips.org/java-se-tips/org.xml.sax/using-xml-locator-to-indicate-current-parser-pos.html">that</a>). [I have to admit that at this point I did not look into other existing XML parsers.] Instead, I wanted to access, for each node or text element, its <strong>start</strong> and <strong>end position</strong> in the original file. The solution I found is probably not the easiest one, but it is quite interesting (I mean, in terms of the additional knowledge I acquired while solving this technical problem). The solution was to use <a href="http://www.antlr.org/">ANTLR (ANother Tool for Language Recognition)</a>.</p>
<p>An explanation of what ANTLR is and how it works is outside the scope of this post. But to put it simply: ANTLR is a tool for generating parsers for domain-specific languages (see also <a href="http://en.wikipedia.org/wiki/ANTLR">here</a>). It is typically used to write parsers for programming languages and is based on a few core concepts: grammar, lexer, parser and <a href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax Tree (AST)</a>. It is therefore possible to write an ad hoc XML/SGML parser with it. To be honest, the learning curve is pretty steep, but one of the most rewarding things about ANTLR is that the same grammar can be compiled into different target languages (Python and Java in my case) with just a few (and not substantial) changes, with great benefits in terms of code reusability.</p>
<p>The parser I came up with (<a href="https://github.com/mromanello/Antlr_XML_parser">source code here</a>) is based on some <a href="http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML">other code</a> that was developed by the ANTLR community. Essentially, I did some hacking on the original to allow the text elements to be tokenised on the fly while parsing the XML. During the parsing process, the text elements in the XML are split on space characters into tokens whose start and end positions are kept.</p>
<p>My ANTLR XML/SGML parser also performs a few other normalisations on the fly in order to produce output that is ready to be consumed by a Named Entity Recogniser:</p>
<ol>
<li>resolving SGML entities into Unicode;</li>
<li>transcoding BetaCode Greek into Unicode;</li>
<li>tokenising text by using the non-breaking space (&amp;nbsp;) in addition to normal spaces: this task in particular, although it may seem trivial, implies recalculating the position of the new token in the input file and required a bit more thinking through.</li>
</ol>
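<p>The offset bookkeeping can be sketched in a few lines of Python. This is not the ANTLR grammar itself, just an illustration under simplifying assumptions: entities are taken to be already resolved, so the non-breaking space appears as the single character U+00A0 (which sidesteps the position recalculation mentioned in the last point), and offsets are counted in characters, inclusive at both ends. The function and field names are chosen for illustration.</p>

```python
import re

# A token is a maximal run of characters that are neither whitespace nor
# the non-breaking space. Python's \s already matches U+00A0 in str
# patterns; it is listed explicitly only to make the intent visible.
TOKEN = re.compile(r"[^\s\u00a0]+")


def tokenise(text, base_offset=0):
    """Split `text` into tokens with inclusive start/end offsets.

    `base_offset` is the position of `text` within the whole file.
    """
    tokens = []
    for match in TOKEN.finditer(text):
        tokens.append({
            "start": base_offset + match.start(),
            "end": base_offset + match.end() - 1,  # inclusive end
            "otext": match.group(),
        })
    return tokens


if __name__ == "__main__":
    for token in tokenise("par\u00a0Achille", base_offset=2780):
        print(token)
```

<p>With a base offset of 2780, the two tokens come out as 2780 to 2782 and 2784 to 2790, regardless of whether the separator is a normal or a non-breaking space.</p>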
<div>The result of running the parser over an SGML file is a list of tokens. I decided to serialise the output as JSON, for the time being, and a snippet of the result looks pretty much like this:</div>
<div><pre class="brush: plain;">
[{&quot;start&quot;: 2768, &quot;end&quot;: 2778, &quot;utext&quot;: &quot;\u0153uvre&quot;, &quot;otext&quot;: &quot;&amp;oelig;uvre&quot;},
{&quot;start&quot;: 2780, &quot;end&quot;: 2782, &quot;utext&quot;: &quot;par&quot;, &quot;otext&quot;: &quot;par&quot;},
{&quot;start&quot;: 2784, &quot;end&quot;: 2790, &quot;utext&quot;: &quot;Achille&quot;, &quot;otext&quot;: &quot;Achille&quot;}]
</pre>
<p><strong>Start</strong> and <strong>end</strong> indicate (not surprisingly) the byte position of the token within the file, whereas <strong>otext </strong>and <strong>utext </strong>contain respectively the original text and the text after the resolution of character entities.</p>
<p>To sum up, the main benefit of this approach is that, once named entities have been automatically identified within the text of an XML/SGML file (e.g. &#8220;Achille&#8221; in the example above), we can transform this newly acquired NE annotation into XML markup and pipe it back into the original file.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://c4tc.wordpress.com/2011/07/08/xml-and-nlp-pt2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/da8d899f4efb4a6dcd1c98a380b49e4b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mromanello</media:title>
		</media:content>
	</item>
		<item>
		<title>&#8220;The World of Thucydides&#8221; at CAA 2011</title>
		<link>http://c4tc.wordpress.com/2011/04/10/caa2011/</link>
		<comments>http://c4tc.wordpress.com/2011/04/10/caa2011/#comments</comments>
		<pubDate>Sun, 10 Apr 2011 15:38:42 +0000</pubDate>
		<dc:creator>Matteo Romanello</dc:creator>
				<category><![CDATA[conferences]]></category>
		<category><![CDATA[Hellespont]]></category>
		<category><![CDATA[PhD]]></category>
		<category><![CDATA[caa2011]]></category>

		<guid isPermaLink="false">http://c4tc.wordpress.com/?p=173</guid>
		<description><![CDATA[I&#8217;m at Heathrow airport waiting to board a flight to Beijing (via Amsterdam), where I&#8217;ll be attending the CAA 2011 conference. To get into the conference mood I thought it might be a good idea to post the abstract of the paper that my colleague Agnes Thomas (CoDArchLab, University of Cologne) and I are [...]]]></description>
			<content:encoded><![CDATA[<div>
<div>
<p id="internal-source-marker_0.9913602876476943"><em>I&#8217;m at Heathrow airport waiting to board on a flight to Beijing (via Amsterdam) where I&#8217;ll be attending the <a href="http://caa2011.org/">CAA 2011 conference</a></em><em>. </em><em>To get into the conference mood I though it may be a good idea to post the abstract of the paper that myself and my colleague Agnes Thomas (CoDArchLab, University of Cologne) are going to give <em>within a session entitled </em><strong>Digging with words: e-text and e-archaeology. </strong>[This version is slightly longer than the one that we submitted and has been accepted.]</em></p>
<h2>The World of Thucydides: from Texts to Artifacts and back</h2>
</div>
<p>The work presented in this paper is part of the Hellespont project, an NEH-DFG funded project that aims to join the digital collections of Perseus and Arachne [1]. We present ongoing work on a Virtual Research Environment (VRE) that allows scholars to access both archaeological and textual information [2].</p>
<p>An environment integrating these two heterogeneous kinds of information will be highly valuable for both archaeologists and philologists. The former will have easier access to literary sources from the historical period an artifact belongs to, whereas the latter will have at hand iconographic or archaeological evidence related to a given text. We therefore explore the idea of a VRE combining archaeological and philological data with another kind of textual information, namely secondary sources and in particular journal articles. To develop new ways of opening up and combining these different kinds of sources, the project will focus on the so-called Pentecontaetia of the Greek historian Thucydides (Th. 1,89-1,118).</p>
<p>As of now, we do not yet have an automatic tool capable of capturing the passages of Thucydides’ Pentecontaetia that matter for our knowledge of Athens and Greece during the Classical period. For the identification of such “links” we rely entirely on the irreplaceable, accurate manual work of scholars. For this reason A. Thomas has done some preliminary work to manually identify, within the whole text of Thucydides’ Pentecontaetia, entities representing categories in the archaeological and philological evidence (e.g. built spaces, topography, individual persons, populations). What can be done automatically, at least to some extent, is extracting and parsing both canonical and modern bibliographic references, which express the citation network between ancient texts (i.e. primary sources) and modern publications about them (i.e. secondary sources).</p>
<p>As our corpus of secondary sources we use the journal articles held in JSTOR, recently made available to researchers via the Data for Research API [3]. Although JSTOR classifies these articles into the separate categories of archaeology and philology, they are likely to contain references to common named entities, so that the two groups overlap to some extent. As an example of what we are aiming at, in Th. I 89 the author refers to the rebuilding of the Athenian city walls &#8211; after the Persian War at the beginning of the 5th century BC &#8211; as a result of the politics of the Athenian Themistocles. Within our VRE, the corresponding archaeological and philological metadata [4,5] will be presented to the user along with JSTOR articles from both archaeological and philological journals related to the contents of this text passage.</p>
<p>From a technical point of view, we are applying Named Entity Recognition techniques to JSTOR data accessed via the DfR API. References to primary sources, usually called “canonical references”, and bibliographic references to other modern publications are extracted and parsed from JSTOR articles and will be used to reconstruct the above-mentioned citation networks [6,7]. On the semantic side, the CIDOC-CRM provides a suitable conceptual model to express the semantics of complex annotations about texts, archaeological findings, physical entities and abstract concepts that scholars might want to create using such a VRE.</p>
<h3>References</h3>
<div>
<p id="internal-source-marker_0.9913602876476943">[1] The Hellespont Project, <a href="http://www.dainst.org/index_04b6084e91a114c63430001c3253dc21_en.html">http://www.dainst.org/index_04b6084e91a114c63430001c3253dc21_en.html</a>.</p>
<div>[2] Judith Wusteman, “Virtual Research Environments: What Is the Librarian&#8217;s Role?,” Journal of Librarianship and Information Science 40, no. 2 (n.d.): 67-70.</div>
<div>[3] John Burns et al., “JSTOR &#8211; Data for Research,” in Research and Advanced Technology for Digital Libraries, ed. Maristella Agosti et al., vol. 5714, Lecture Notes in Computer Science (Springer Berlin / Heidelberg, 2009), 416-419 <a href="http://dx.doi.org/10.1007/978-3-642-04346-8_48">http://dx.doi.org/10.1007/978-3-642-04346-8_48</a>.</div>
<p>[4] Themistokleische Mauer, <a href="http://arachne.uni-koeln.de/item/topographie/8002430">http://arachne.uni-koeln.de/item/topographie/8002430</a></p>
<p>[5] <a href="http://www.perseus.tufts.edu/hopper/text?doc=Thuc.+1.89&amp;fromdoc=Perseus:text:1999.01.01999">http://www.perseus.tufts.edu/hopper/text?doc=Thuc.+1.89&amp;fromdoc=Perseus:text:1999.01.01999</a></p>
</div>
<div>[6] Matteo Romanello, Federico Boschetti, and Gregory Crane, “Citations in the Digital Library of Classics: Extracting Canonical References by Using Conditional Random Fields,” in Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries (Suntec City, Singapore: Association for Computational Linguistics, 2009), 80–87,<a href="http://portal.acm.org/ft_gateway.cfm?id=1699763&amp;type=pdf"> http://portal.acm.org/ft_gateway.cfm?id=1699763&amp;type=pdf</a>.</div>
<p></p>
<div>[7] Isaac Councill, C. Lee Giles, and Min-Yen Kan, “ParsCit: an Open-source CRF Reference String Parsing Package,” in Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC&#8217;08) (Marrakech, Morocco: European Language Resources Association (ELRA), 2008), <a href="http://www.comp.nus.edu.sg/~kanmy/papers/lrec08b.pdf">http://www.comp.nus.edu.sg/~kanmy/papers/lrec08b.pdf</a>.</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://c4tc.wordpress.com/2011/04/10/caa2011/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/da8d899f4efb4a6dcd1c98a380b49e4b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mromanello</media:title>
		</media:content>
	</item>
		<item>
		<title>Feet on the ground, DB on the cloud</title>
		<link>http://c4tc.wordpress.com/2011/03/02/jstor-dfr-api/</link>
		<comments>http://c4tc.wordpress.com/2011/03/02/jstor-dfr-api/#comments</comments>
		<pubDate>Wed, 02 Mar 2011 17:27:04 +0000</pubDate>
		<dc:creator>Matteo Romanello</dc:creator>
				<category><![CDATA[coding]]></category>
		<category><![CDATA[Hellespont]]></category>
		<category><![CDATA[PhD]]></category>
		<category><![CDATA[dfr api]]></category>
		<category><![CDATA[jstor]]></category>

		<guid isPermaLink="false">http://c4tc.wordpress.com/?p=153</guid>
		<description><![CDATA[This quick post is just to say how much the UK NGS saved my day today, and probably a lot more than that. For my research project I&#8217;m digging into the JSTOR archive via the Data for Research API. I soon realised how much scalability matters when trying to process all the data [...]]]></description>
			<content:encoded><![CDATA[<p>This quick post is just to say how much the <a href="http://www.ngs.ac.uk/">UK NGS</a> saved my day today, and probably a lot more than that.</p>
<p>For my research project I&#8217;m digging into the JSTOR archive via the <a href="http://about.jstor.org/node/19881">Data for Research API</a>. I soon realised how much scalability matters when trying to process all the JSTOR data related to scholarly papers in Classics: there are ~60k of them.</p>
<p>The workflow I decided to go for basically consists of retrieving the data from JSTOR, persisting them via Django (with a MySQL database backend) and then processing the data iteratively. The automatic annotations on those data (mainly Named Entity Recognition) that I&#8217;ll be producing will be stored in the same Django DB.</p>
<p>After running the first batch to load my data into my Django application the situation was as follows: 7k documents processed and a DB size of ~600MB. By the end of the data loading process the DB will have grown to approximately 6GB (just the data, without any annotation). And it&#8217;s at this stage that the cloud (or the grid) comes in handy.</p>
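<p>As a rough sanity check on these numbers, here is a minimal sketch of a linear extrapolation from the first batch to the whole corpus. It assumes that the DB size grows proportionally to the number of documents, which is only an approximation (indexes and document length vary); the function name is made up for illustration.</p>

```python
def extrapolate_db_size_mb(sample_docs, sample_size_mb, total_docs):
    """Linearly scale the size observed for a sample batch to the full corpus."""
    return sample_size_mb / sample_docs * total_docs


# Figures from the first batch: 7k documents took ~600MB; the corpus has ~60k.
estimate_mb = extrapolate_db_size_mb(7_000, 600, 60_000)
print(f"~{estimate_mb / 1000:.1f} GB")  # prints "~5.1 GB"
```

<p>The result is in the same ballpark as the estimate given above, which is all that matters when deciding where the DB should live.</p>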
<p>I run my process locally but the remote DB is somewhere on the NGS grid (in my case it&#8217;s on the Manchester node). This is a great relief to me, and of course to my machine, in terms of disk space, speed in accessing the DB and system load. Whenever I need to, I can dump the DB and install it locally in case I have to access it without an internet connection. Not to mention the fact that the batch process that loads the data could also be run on the grid. Finally, to give public access to the data I&#8217;m using the same Django application, which pulls the data out of the remote MySQL DB.</p>
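<p>For readers who have not pointed Django at a remote database before, this is roughly what the relevant fragment of a <code>settings.py</code> looks like. It is a sketch only: the host name, database name and environment variable names below are placeholders and not the actual NGS configuration.</p>

```python
# settings.py fragment (sketch): every value below is a placeholder.
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",   # Django's built-in MySQL backend
        "NAME": "jstor_dfr",                     # hypothetical database name
        "USER": os.environ.get("DB_USER", ""),   # credentials kept out of the code
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": "mysql.example.org",             # hypothetical remote host
        "PORT": "3306",                          # default MySQL port
    }
}
```

<p>Switching between the remote DB and a local dump then comes down to changing <code>HOST</code> (an empty string means localhost in Django).</p>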
<p>Having free access to the national grid as a UK researcher is absolutely essential, even for someone &#8211; like me &#8211; who does not work in one of those fields that are known to benefit most from grid infrastructure. Even if digital, I&#8217;m still a humanist.</p>
]]></content:encoded>
			<wfw:commentRss>http://c4tc.wordpress.com/2011/03/02/jstor-dfr-api/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/da8d899f4efb4a6dcd1c98a380b49e4b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mromanello</media:title>
		</media:content>
	</item>
		<item>
		<title>XML and NLP: like Oil and Water? [pt. 1]</title>
		<link>http://c4tc.wordpress.com/2011/01/21/xml-and-nlp-pt1/</link>
		<comments>http://c4tc.wordpress.com/2011/01/21/xml-and-nlp-pt1/#comments</comments>
		<pubDate>Fri, 21 Jan 2011 11:47:14 +0000</pubDate>
		<dc:creator>Matteo Romanello</dc:creator>
				<category><![CDATA[coding]]></category>
		<category><![CDATA[Digital Classics]]></category>
		<category><![CDATA[NER]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://c4tc.wordpress.com/?p=87</guid>
		<description><![CDATA[XML is great, really. Most (*)ML formats are great, including old-fashioned SGML. However, have you ever tried to perform some NLP tasks on an XML-encoded text or, even worse, to do some automatic tagging on an existing (*)ML document? It&#8217;s anything but easy and straightforward. And this seems to prove the &#8220;inadequacy [...]]]></description>
			<content:encoded><![CDATA[<p>XML is great, really. Most (*)ML formats are great, including old-fashioned SGML. However, have you ever tried to perform some NLP tasks on an XML-encoded text or, even worse, to do some automatic tagging on an existing (*)ML document? It&#8217;s anything but easy and straightforward. And this seems to prove the &#8220;inadequacy of embedded markup for cultural heritage texts&#8221;, as <a title="D. Schmidt, The inadequacy of embedded markup for cultural heritage texts" href="http://llc.oxfordjournals.org/content/25/3/337.short" target="_blank">D. Schmidt</a> persuasively argued not long ago.</p>
<p>But it&#8217;s a lot of fun, and finding a technical solution is doable. This post shares problems, ideas and solutions about the technical side of doing NLP on (*)ML-encoded texts, and will be in two parts.</p>
<h2><strong>Materials</strong></h2>
<p>A little while ago I was given an SGML file (~12MB) to process. My idea was to try out on it a <a title="A Canonical Refences Extractor written in python" href="https://github.com/mromanello/CRefEx" target="_blank">Named Entity Recogniser</a> that I have been working on, which extracts standard references to ancient Classical (Greek and Latin) texts. My recogniser is written in Python and accepts as input a file encoded in the IOB format (a format used for the <a href="http://www.cnts.ua.ac.be/conll2003/ner/">CoNLL-2003</a> shared task on language-independent named entity recognition). In the IOB format instances are separated by <strong>blank lines</strong>. Each instance is tokenised and the resulting tokens are written one per line. Each line contains a number of whitespace-separated columns: in the example below the first contains the token itself, whereas the second contains the label (category) assigned to the token. A label of the form *-CRF indicates that the token is part of a Named Entity (B- marks its first token, I- the following ones); here CRF stands for <strong>C</strong>anonical <strong>R</strong>e<strong>F</strong>erence.</p>
<p>This is what an example instance looks like:</p>
<p><pre class="brush: plain;">
this	O
is	O
a	O
canonical	O
reference:	O
Hom.	B-CRF
Il.	I-CRF
1,	I-CRF
Hom.	B-CRF
Il.	I-CRF
1,	I-CRF
477;	I-CRF
24,	I-CRF
788;	I-CRF
</pre></p>
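<p>To make the format concrete, here is a minimal sketch of how such a file could be read back and how the labelled tokens could be grouped into references. It is not the code of the recogniser linked above; the function names are made up for illustration, and it assumes exactly two columns per line (token and label).</p>

```python
def read_iob(lines):
    """Yield instances; each instance is a list of (token, label) pairs."""
    instance = []
    for line in lines:
        line = line.strip()
        if not line:                      # a blank line closes the current instance
            if instance:
                yield instance
                instance = []
            continue
        token, label = line.split()[:2]   # assumption: token first, label second
        instance.append((token, label))
    if instance:                          # last instance, if no trailing blank line
        yield instance


def extract_entities(instance):
    """Group the B-/I- labelled tokens of one instance into entity strings."""
    entities, current = [], []
    for token, label in instance:
        if label.startswith("B-"):        # B- always opens a new entity
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif label.startswith("I-") and current:
            current.append(token)         # I- continues the open entity
        else:                             # O (or a stray I-) closes it
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


# A shortened variant of the example instance shown above.
sample = """this O
is O
a O
canonical O
reference: O
Hom. B-CRF
Il. I-CRF
1, I-CRF
477; I-CRF
""".splitlines()

for instance in read_iob(sample):
    print(extract_entities(instance))     # prints ['Hom. Il. 1, 477;']
```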
<p>This format is used both to store the training sets and as output of the recogniser. In other words, the recogniser takes as input an IOB-encoded file where each token is initially assigned the label O (Other) and outputs the same file but with the new labels properly assigned.</p>
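<p>The input side can be sketched in the same spirit: every token starts out with the label O and instances are separated by blank lines. The whitespace tokeniser and the sample strings below are simplifying assumptions made for this example.</p>

```python
def to_iob(instances, default_label="O"):
    """Serialise tokenised instances: one token per line, blank line between instances."""
    blocks = []
    for tokens in instances:
        blocks.append("\n".join(f"{token}\t{default_label}" for token in tokens))
    return "\n\n".join(blocks) + "\n"


# Hypothetical text units, split with a naive whitespace tokeniser.
text_units = ["cf. Hom. Il. 1, 477", "see Th. 1,89"]
instances = [unit.split() for unit in text_units]
print(to_iob(instances), end="")
```

<p>The recogniser then only has to replace the second column, which keeps input and output trivially comparable.</p>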
<p>Now, the main problem I was faced with is <strong>how to tag in the original SGML file those tokens that my recogniser had identified as being part of a named entity</strong>. In order to be able to do so, one needs to keep track of the token position within the XML file.</p>
<p>To sum up, these are the steps that I wanted to be able to perform:</p>
<ol>
<li>parse the XML and keep only the text content of some elements;</li>
<li>tokenise the text extracted from the XML (while keeping a reference to the token position within the file): the result will be a list of instances (the text content of given elements) where each instance is a list of tokens;</li>
<li>the list of instances is then processed by the Named Entity Recogniser which assigns each token one of the following labels [ O | B-CRF | I-CRF ];</li>
<li>the original XML is then re-processed: consecutive tokens that were previously labelled as B-CRF or I-CRF are wrapped in a new XML element;</li>
<li>the resulting new XML file (i.e. the original document plus the automatically tagged information) is written out.</li>
</ol>
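<p>The steps above can be sketched with Python&#8217;s standard <code>xml.etree.ElementTree</code> module. This is only an illustration under strong assumptions, not the solution described in the second part: it handles elements that contain text but no child elements, it uses a whitespace tokeniser, the labels are written by hand instead of coming from the recogniser, and the element name <code>ref</code> is made up.</p>

```python
import xml.etree.ElementTree as ET


def tokenise_with_offsets(text):
    """Whitespace tokeniser returning (token, start, end) character offsets."""
    tokens, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)    # search from the end of the previous token
        end = start + len(token)
        tokens.append((token, start, end))
        pos = end
    return tokens


def entity_spans(tokens, labels):
    """Merge consecutive B-/I- tokens into (start, end) character spans."""
    spans, current = [], None
    for (_, start, end), label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [start, end]
        elif label.startswith("I-") and current:
            current[1] = end              # extend the open span
        else:
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans


def wrap_spans(element, spans, tag="ref"):
    """Rewrite a text-only element so that each span becomes a child element."""
    text = element.text or ""
    cursor, previous = 0, None
    element.text = ""
    for start, end in spans:
        before = text[cursor:start]
        if previous is None:
            element.text = before         # text preceding the first new element
        else:
            previous.tail = before        # text between two new elements
        previous = ET.SubElement(element, tag)
        previous.text = text[start:end]
        cursor = end
    rest = text[cursor:]
    if previous is None:
        element.text = rest
    else:
        previous.tail = rest


para = ET.Element("p")
para.text = "See Hom. Il. 1, 477 for details."
tokens = tokenise_with_offsets(para.text)
# Labels as a recogniser might return them (hand-written for this example).
labels = ["O", "B-CRF", "I-CRF", "I-CRF", "I-CRF", "O", "O"]
wrap_spans(para, entity_spans(tokens, labels))
print(ET.tostring(para, encoding="unicode"))
```

<p>Keeping character offsets next to each token is what makes the round trip possible: the labels never touch the markup, they only say which stretch of the original text to wrap.</p>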
<p>[To be continued...]</p>
]]></content:encoded>
			<wfw:commentRss>http://c4tc.wordpress.com/2011/01/21/xml-and-nlp-pt1/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/da8d899f4efb4a6dcd1c98a380b49e4b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mromanello</media:title>
		</media:content>
	</item>
		<item>
		<title>A Simple Script to Import Unstructured Bibliographies into Zotero</title>
		<link>http://c4tc.wordpress.com/2010/08/30/structured-and-unstructured-parsing-bibliographies/</link>
		<comments>http://c4tc.wordpress.com/2010/08/30/structured-and-unstructured-parsing-bibliographies/#comments</comments>
		<pubDate>Mon, 30 Aug 2010 14:44:05 +0000</pubDate>
		<dc:creator>Matteo Romanello</dc:creator>
				<category><![CDATA[bibliographies]]></category>
		<category><![CDATA[hacking]]></category>
		<category><![CDATA[structured & unstructured]]></category>

		<guid isPermaLink="false">http://c4tc.wordpress.com/?p=11</guid>
		<description><![CDATA[After having received another bibliography in an unstructured format (.doc), I finally made up my mind to write a simple bibliographic script that allows me to import it into Zotero saving me quite a lot of manual editing. Basically this script groups different calls to single software components (ParsCit, bibutils, Saxon) into a single pipeline. [...]]]></description>
			<content:encoded><![CDATA[<p>After having received another bibliography in an <em>unstructured format</em> (.doc), I finally made up my mind to write a simple bibliographic script that allows me to import it into Zotero saving me quite a lot of manual editing.</p>
<p>Basically this script groups different calls to single software components (<a href="http://aye.comp.nus.edu.sg/parsCit/">ParsCit</a>, <a href="http://www.scripps.edu/~cdputnam/software/bibutils/">bibutils</a>, <a href="http://saxon.sourceforge.net/#F9.2HE">Saxon</a>) into a single pipeline.</p>
<p>The source code is hosted at <a href="http://github.com/mromanello/BiblioScript">GitHub</a> and is likely to be quite buggy (particularly the XSLT transformation from ParsCit&#8217;s XML into MODS has not been thoroughly tested yet). So feel free to <a href="http://help.github.com/forking/">fork</a> the repository and improve the code where needed.</p>
<p>In more detail what the script does is:</p>
<ol>
<li>takes as input a plain text bibliography with one entry per line;</li>
<li>parses the input using a ParsCit engine;</li>
<li>outputs an intermediate MODS encoding of the bibliography;</li>
<li>finally transforms the intermediate MODS into a BibTeX file;</li>
<li>your bibliography is now ready to be imported into Zotero!</li>
</ol>
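<p>The steps above can be sketched as three command lines chained together. This is not the script from the repository: the ParsCit flags, the Saxon jar name and the stylesheet name <code>parscit2mods.xsl</code> are assumptions made for illustration and should be checked against the versions you have installed. The function only builds the commands; nothing is executed unless <code>run_pipeline</code> is called.</p>

```python
import subprocess
from pathlib import Path


def build_pipeline(bibliography_txt, workdir, stylesheet="parscit2mods.xsl"):
    """Return the pipeline as (argv, stdout_target) pairs; nothing is run here."""
    workdir = Path(workdir)
    parscit_xml = workdir / "parscit.xml"
    mods_xml = workdir / "mods.xml"
    bibtex = workdir / "bibliography.bib"
    return [
        # 1. ParsCit: plain-text references in, XML on stdout (flags are an assumption)
        (["citeExtract.pl", "-m", "extract_citations", str(bibliography_txt)], parscit_xml),
        # 2. Saxon: XSLT from ParsCit's XML to MODS (jar and stylesheet names assumed)
        (["java", "-jar", "saxon9he.jar", f"-s:{parscit_xml}",
          f"-xsl:{stylesheet}", f"-o:{mods_xml}"], None),
        # 3. bibutils: MODS to BibTeX, written to stdout
        (["xml2bib", str(mods_xml)], bibtex),
    ]


def run_pipeline(steps):
    """Run each step in order, redirecting stdout to a file where a target is given."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w", encoding="utf-8") as out:
                subprocess.run(argv, check=True, stdout=out)


for argv, target in build_pipeline("refs.txt", "out"):
    print(" ".join(argv), "| stdout to:", target)
```

<p>With <code>check=True</code> the pipeline stops at the first failing tool, which is useful here because each step consumes the file written by the previous one.</p>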
<p>A big <strong>CAVEAT</strong> about the accuracy of the BibTeX output: since the parsing of the plain text input is done <strong>automatically</strong> by ParsCit, some bibliographic fields might turn out to be incorrect and thus <em>some</em> manual editing may be needed.</p>
<p>The result won&#8217;t be perfect, but at least I don&#8217;t have to input everything manually from scratch.</p>
]]></content:encoded>
			<wfw:commentRss>http://c4tc.wordpress.com/2010/08/30/structured-and-unstructured-parsing-bibliographies/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/da8d899f4efb4a6dcd1c98a380b49e4b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mromanello</media:title>
		</media:content>
	</item>
		<item>
		<title>(Very Asynchronous) Highlights from the &#8220;III incontro di Filologia Digitale&#8221; (Verona 3-5 marzo 2010)</title>
		<link>http://c4tc.wordpress.com/2010/08/29/very-asynchronous-highlights-from-the-iii-incontro-di-filologia-digitale-verona-3-5-marzo-2010/</link>
		<comments>http://c4tc.wordpress.com/2010/08/29/very-asynchronous-highlights-from-the-iii-incontro-di-filologia-digitale-verona-3-5-marzo-2010/#comments</comments>
		<pubDate>Sun, 29 Aug 2010 22:13:10 +0000</pubDate>
		<dc:creator>Matteo Romanello</dc:creator>
				<category><![CDATA[conferences]]></category>

		<guid isPermaLink="false">http://c4tc.wordpress.com/?p=3</guid>
		<description><![CDATA[The third edition of the &#8220;Incontro di Filologia Digitale&#8221; was held in Verona on 3-5 March 2010: a three-day meeting with more than 15 presentations, organized by Adele Cipolla, Paola Cotticelli and Roberto Rosselli del Turco. The asynchronous highlights presented here were selected according to my personal interests. For a complete overview please [...]]]></description>
			<content:encoded><![CDATA[<p>The third edition of the &#8220;Incontro di Filologia Digitale&#8221; was held in Verona on 3-5 March 2010: a three-day meeting with more than 15 presentations, organized by Adele Cipolla, Paola Cotticelli and Roberto Rosselli del Turco.</p>
<p>The asynchronous highlights presented here were selected according to my personal interests. For a complete overview please refer to the <a href="http://www.stoa.org/?p=1096">program</a> and the <a href="http://www.dllsc.univr.it/dol/main?ent=iniziativa&amp;id=2929">full list of presentations</a>.</p>
<p>Several presentations were related to epigraphy: <a href="http://www.dllsc.univr.it/documenti/Iniziativa/dall/dall018474.ppt">Anelli, Muscariello and Sarullo</a> talked about &#8220;The Digital Edition of Epigraphic Texts as Research Tool: the ILA Project&#8221;; <a href="http://www.dllsc.univr.it/documenti/Iniziativa/dall/dall843594.pdf">Farina</a> presented an &#8220;Electronic Analysis and Organization of the Syro-Turkic Inscriptions of China and Central Asia&#8221; and finally &#8230;</p>
<p>Barbera (handout not available) and <a href="http://www.dllsc.univr.it/documenti/Iniziativa/dall/dall938699.ppt">Tomatis</a> presented the progress of the <em>Corpus Taurinense</em> project, a corpus of texts written in 13th-century Italian. After Barbera&#8217;s brilliant introduction to the corpus, Tomatis focussed on the problem of disambiguating POS tagging.</p>
]]></content:encoded>
			<wfw:commentRss>http://c4tc.wordpress.com/2010/08/29/very-asynchronous-highlights-from-the-iii-incontro-di-filologia-digitale-verona-3-5-marzo-2010/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/da8d899f4efb4a6dcd1c98a380b49e4b?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">mromanello</media:title>
		</media:content>
	</item>
	</channel>
</rss>
