dfr api – Computers for the Classics

This quick post is just to say how much the UK NGS saved my day today, and probably a good deal more than that.

For my research project I’m digging into the JSTOR archive via the Data for Research API. I soon realised how much scalability matters when trying to process all the data in JSTOR relating to scholarly papers in Classics: there are ~60k of them.

The workflow I decided on basically consists of retrieving the data from JSTOR, making it persistent via Django (with a MySQL database backend) and then processing the data iteratively. The automatic annotations on that data (mainly Named Entity Recognition) that I’ll be producing will be stored in the same Django DB.
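To make the idea of keeping documents and their annotations in one database a bit more concrete, here is a minimal sketch. It is not my actual Django code: it uses Python's built-in sqlite3 module as a stand-in for Django + MySQL, and the table names, column names and sample values are all made up for illustration.

```python
import sqlite3

# Stand-in for the Django + MySQL setup: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: one table for documents, one for annotations on them.
conn.executescript("""
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    jstor_id TEXT UNIQUE NOT NULL,
    title TEXT NOT NULL
);
CREATE TABLE annotation (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES document(id),
    entity_text TEXT NOT NULL,
    entity_type TEXT NOT NULL
);
""")


def store_document(jstor_id, title):
    """Persist one retrieved document and return its primary key."""
    cur = conn.execute(
        "INSERT INTO document (jstor_id, title) VALUES (?, ?)",
        (jstor_id, title),
    )
    return cur.lastrowid


def annotate(document_id, entity_text, entity_type):
    """Attach one named-entity annotation to a stored document."""
    conn.execute(
        "INSERT INTO annotation (document_id, entity_text, entity_type) "
        "VALUES (?, ?, ?)",
        (document_id, entity_text, entity_type),
    )


# Made-up sample data, only to show the two steps in sequence.
doc_id = store_document("hypothetical-id-1", "A made-up title")
annotate(doc_id, "Homer", "PERSON")

rows = conn.execute(
    "SELECT d.title, a.entity_text, a.entity_type "
    "FROM document d JOIN annotation a ON a.document_id = d.id"
).fetchall()
print(rows)
```

The point is simply that loading and annotating are separate passes over the same store, so the annotation step can be re-run without fetching anything from JSTOR again.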

After running the first batch to load my data into my Django application, the situation was as follows: 7k documents processed and a DB size of ~600MB. By the end of the data loading process the DB will have grown to approximately 6GB (just the data, without any annotation). And it’s at this stage that the cloud (or the grid) comes in handy.
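As a back-of-the-envelope check on that estimate, here is the extrapolation as a few lines of Python. It assumes the DB grows linearly with the number of documents and that the first 7k documents are representative, which may well not hold; the variable names are just for illustration.

```python
# Figures from the first loading batch (rounded, as quoted in the post).
docs_loaded = 7_000
size_loaded_mb = 600
total_docs = 60_000

# Assumption: size grows linearly with the number of documents.
mb_per_doc = size_loaded_mb / docs_loaded
projected_gb = mb_per_doc * total_docs / 1000  # decimal units: 1 GB = 1000 MB

print(f"{mb_per_doc * 1000:.0f} KB per document, "
      f"about {projected_gb:.1f} GB in total")
```

A straight linear projection gives a little over 5GB, so a figure of around 6GB leaves some headroom for indexes and for later documents being larger than the first batch.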

I run my process locally but the DB is remote, somewhere on the NGS grid (in my case it’s on the Manchester node). This is a great relief to me, and to my machine of course, in terms of disk space, speed in accessing the DB and system load. Whenever I need to, I can dump the DB and install it locally, in case I need to access it without an internet connection. Not to mention the fact that the batch processes that load the data could be run from the grid. Finally, to give public access to the data I’m using the same Django application, which pulls the data from the remote MySQL DB.
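For anyone wanting to try the same setup, pointing a local Django project at a remote MySQL server comes down to the DATABASES setting. The fragment below is only a sketch: the host name, database name and environment variable names are placeholders, not the real details of the Manchester node.

```python
import os

# Sketch of a Django settings fragment for a remote MySQL backend.
# Every value below is a placeholder, not the real NGS configuration.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "jstor_classics",      # hypothetical database name
        "USER": os.environ.get("DB_USER", "researcher"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": "db.example.org",      # hypothetical remote host
        "PORT": "3306",                # default MySQL port
    }
}
```

Switching to a local copy of the dumped DB for offline work would then only mean changing HOST to point at the local machine, with the rest of the application left untouched.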

Having free access to the national grid as a UK researcher is absolutely essential, even for someone like me who does not work in one of the fields known to benefit most from grid infrastructure. Even if digital, I’m nevertheless still a humanist.