Skosifying an Archaeological Thesaurus

Background

I am particularly happy about this blog post as it relates to the work I have been doing for slightly over a year within DARIAH-DE, the German branch of the EU-funded DARIAH project. In this project I am currently working on a set of recommendations for interdisciplinary interoperability.

Interoperability, generally speaking, can be used just as a buzzword or can mean something closer to fully fledged Artificial Intelligence–from which, I believe, we are still quite far. Being very much aware of this, and of the several existing definitions of interoperability, within our project we try to keep a pragmatic approach. This blog post describes a use case we have been working on recently and shows how greater interoperability can be achieved by following some best practice recommendations concerning licenses, protocols and standards.

In a Nutshell

If you are wondering what this post is actually about–and probably whether you should read it or not–here is a short summary. I will describe the process of transforming a thesaurus encoded in Marc 21 into a SKOS thesaurus–that’s what is meant by skosification–in a way that does not involve (much) human interaction. The workflow relies upon an OAI-PMH interface, the Stellar Console and an AllegroGraph triple store where the resulting SKOS/RDF thesaurus is stored.

Sounds interesting? Keep reading!

Legacy data

The data used in this use case come from Zenon, the OPAC of the German Archaeological Institute, and specifically from Zenon’s thesaurus. This thesaurus is stored as Marc 21 XML and is made available via an open OAI-PMH interface (here you can find the end-point).

Such a thesaurus is an essential tool for browsing the content of the library catalog: each entry is assigned one or more subject terms that are drawn from the thesaurus. The image below shows the thesaurus visualized as a bibliography tree: Zenon users, and probably many archaeologists, consider this and similar information retrieval tools extremely important for their daily work.

OK. How can I get the raw data?

This is one of the typical interoperability bottlenecks. A classical scenario looks as follows:

  • Jane has some data
  • Bob wants to use Jane’s data
  • Bob [phones | writes an email to] Jane asking for
    • permission to use her data
    • the actual data
  • Jane sends Bob [the data | a link to download them]
  • the data change and Bob’s version is now out of date

In the case of Zenon’s thesaurus things look quite different, as all data are accessible via an OAI-PMH interface which allows one to download–by means of a few lines of code–the entire data collection, without any need for human-to-human interaction and in a way that can be repeated at any time, without bothering Jane by phone or email every time.

This latter aspect becomes even more important when data tend to change over time, as is the case with Zenon’s thesaurus. This is the main difference between negotiated interchange and interoperation, as Syd Bauman puts it here, and is also the reason why the OAI-PMH protocol is an essential piece of an interoperable architecture.

Downloading the thesaurus records as Marc 21 XML becomes as easy as running the following command from the console:

curl "http://opac.dainst.org/OAI?verb=ListRecords&metadataPrefix=marc21&set=DAI_THS"

Re-usable tools

However, this use case would never have been possible without the Stellar Console, a freely available and open source piece of software developed by Ceri Binding and Doug Tudhope in the framework of the AHRC-funded project “Semantic Technologies Enhancing Links and Linked Data for Archaeological Resources” (STELLAR).

I came across this tool last year at the CAA 2012 conference in Southampton where Ceri gave a paper and performed a live demo of the software. The key idea underlying the Console is to accept the simplest–or at least a rather simple–input format such as CSV in order to produce a more structured and semantic output such as SKOS/RDF or CIDOC-CRM/RDF by applying a set of (customizable) templates.

The Hacking Bit

My main task consisted of writing a short script–approximately a hundred lines of Python–to a) harvest the OAI-PMH repository and fetch the ~80k records of the thesaurus and b) produce a CSV output to be fed into the Stellar Console. I have put all the code, including the input, intermediate output and final output, into the skosifaurus repository on GitHub.
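
Continuing the Python 3 sketch from above, part b) could look roughly like the snippet below. The column layout and the Marc fields picked here (035 $a and 150 $a) are purely illustrative: the format actually expected by the Stellar Console and the real field choices are documented in the skosifaurus repository.

import csv

MARC_NS = "{http://www.loc.gov/MARC21/slim}"

def first_subfield(record, tag, code):
    """Return the first value of subfield `code` in datafield `tag`, or None."""
    for field in record.iter(MARC_NS + "datafield"):
        if field.get("tag") == tag:
            for sub in field.iter(MARC_NS + "subfield"):
                if sub.get("code") == code:
                    return sub.text
    return None

# harvest() and OAI come from the sketch in the previous section.
with open("thesaurus_temp.csv", "w", encoding="utf-8-sig", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "preferred_label"])        # illustrative columns only
    for record in harvest(OAI):
        rec_id = first_subfield(record, "035", "a")    # illustrative choice of fields
        label = first_subfield(record, "150", "a")
        if rec_id and label:
            writer.writerow([rec_id, label])

(Why "utf-8-sig" rather than plain "utf-8" is used here is explained in the encoding section further down.)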

Apart from harvesting the OAI-PMH end point and spitting out some CSV, the script also performs a mapping between Marc21 fields and SKOS classes and relationships–in the code repository you can also find my notes, in case you are interested in the gory details of this mapping.

In order to figure out the correct mapping between Marc and SKOS I went repeatedly to see the librarian at the DAI. Not only did this turn out to be extremely helpful, it was absolutely necessary, for at least two reasons: first, my poor knowledge of library standards; second, Marc21 lends itself to being used in slightly different ways. In this sense, Marc21 as a standard enables syntactic interoperability but only partly semantic interoperability: in other words, there is no guarantee that two thesauri both encoded in Marc21 will use precisely the same fields to encode the same kinds of information.
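
To give a flavour of what such a mapping looks like, here is a deliberately simplified, hypothetical excerpt expressed as a Python dictionary. These correspondences are illustrative only; the fields actually used for Zenon’s thesaurus are spelled out in the notes in the repository.

# Hypothetical excerpt of a Marc21-authority-to-SKOS mapping (illustrative only).
MARC_TO_SKOS = {
    ("150", "a"): "skos:prefLabel",   # topical term heading  -> preferred label
    ("450", "a"): "skos:altLabel",    # see-from tracing      -> alternative label
    ("550", "a"): "skos:related",     # see-also tracing      -> related concept
    ("680", "i"): "skos:scopeNote",   # public general note   -> scope note
}
# In Marc21 authority records a 550 field whose $w subfield is "g" marks a
# broader term, so a single tag can map to different SKOS properties depending
# on its subfields: one reason why the librarian's help was indispensable.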

What didn’t Work?

Well, things mostly worked smoothly. However, there was a small problem related to text encoding which puzzled me for some time. To understand the problem it is important to point out that the Python script was run on Mac OS whereas the Stellar Console ran on Windows, as it currently works only on that platform. At this point one might say: “but what’s the problem if you use Unicode?”.

Funnily enough, the problem lay precisely in the way the CSV file was read by the Stellar Console. In the first version of the script the lines that write the CSV file to disk looked like this:

file = codecs.open("thesaurus_temp.csv","w","utf-8")
file.write("\n".join(output))

This works in most cases. But if the file you are writing is to be processed in a Windows environment–for whatever reason you may want (or have) to do so–you should use the following code instead, just to be on the safe side:

file = codecs.open("thesaurus_temp.csv","w","utf-8-sig")
file.write("\n".join(output))

The reason, which is exquisitely technical, is that Microsoft uses a special sequence of bytes, a sort of Byte Order Mark (BOM), that is prepended to a UTF-8 encoded file to let the software understand in which format the file is encoded. Without that byte sequence the file won’t be opened correctly by some software (e.g. MS Excel). You can read more about this in section 7.8 of the documentation for the Python codecs library.
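
A quick way to see what is going on (a small sketch, assuming a Python 3 interpreter and write access to the current directory):

import codecs

# The signature Microsoft tools look for at the start of a UTF-8 file:
print(codecs.BOM_UTF8)          # b'\xef\xbb\xbf'

# "utf-8" writes the bare bytes, "utf-8-sig" prepends that signature:
with codecs.open("with_bom.csv", "w", "utf-8-sig") as f:
    f.write("Ägypten")

with open("with_bom.csv", "rb") as f:
    print(f.read())             # b'\xef\xbb\xbf\xc3\x84gypten'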

The Stellar Console is also affected by this issue: without this byte sequence a UTF-8 encoded file won’t be opened correctly in the Console, resulting in the content of the output file being garbled.


The SKOSified Thesaurus

To sum up the whole process:

  1. I ran a Python script (source) which harvests ~80,000 Marc21 XML records from DAI’s OPAC via its OAI-PMH interface (end-point here);
  2. the script then produces an intermediate CSV output (file) according to a Marc2SKOS mapping that I’ve defined (further details here);
  3. the intermediate CSV file is fed into the Stellar Console, which spits out an RDF serialization of the SKOS thesaurus (RDF/XML version, RDF/turtle version).
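
To give a concrete, if invented, idea of the shape of the resulting data, here is a small sketch that builds a single SKOS concept with the rdflib library and prints it as Turtle. The namespace, concept URIs and label are made up for illustration; the real output files are linked in the list above.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

THS = Namespace("http://example.org/zenon-thesaurus/")  # invented namespace

g = Graph()
g.bind("skos", SKOS)

concept = THS["12345"]
g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.prefLabel, Literal("Ägypten", lang="de")))
g.add((concept, SKOS.broader, THS["67890"]))

print(g.serialize(format="turtle"))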

To get a taste of the final result, below is an image showing what the SKOS thesaurus looks like when visualized within Gruff (Gruff is a client for the AllegroGraph triple store):

But if you are interested in further techy details on this topic, please stay tuned as I will be blogging about it in a follow-up post!
