Feet on the ground, DB on the cloud

2 Mar

This quick post is just to say how much the UK NGS (National Grid Service) saved my day today, and probably a lot more than that.

For my research project I’m digging into the JSTOR archive via the Data for Research API, and I soon realised how much scalability matters when trying to process all the JSTOR data on scholarly papers in Classics: there are ~60k of them.

The workflow I decided on basically consists of retrieving the data from JSTOR, making it persistent via Django (with a MySQL database backend) and then processing the data iteratively. The automatic annotation of the data (mainly Named Entity Recognition) that I’ll be producing will be stored in the same Django DB.

After running the first batch to load my data into my Django application, the situation was as follows: 7k documents processed and a DB size of ~600MB. By the end of the data loading process the DB will have grown to approximately 6GB (just the data, without any annotation). It’s at this stage that the cloud (or the grid) comes in handy.
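As a back-of-the-envelope check on that growth, here is a minimal sketch that scales the first batch linearly up to the full corpus. The function name is made up for illustration, and the assumption that every document takes about the same space is a simplification; a straight-line projection lands a little under the ~6GB quoted above, which is the same order of magnitude.

```python
def estimate_final_size_mb(docs_loaded, size_mb, total_docs):
    """Linearly extrapolate DB size from a partial load (assumes uniform document size)."""
    mb_per_doc = size_mb / docs_loaded
    return mb_per_doc * total_docs


# Figures from the first batch: 7k documents, ~600MB, ~60k documents in total.
estimate_mb = estimate_final_size_mb(docs_loaded=7_000, size_mb=600, total_docs=60_000)
print(f"Projected size: {estimate_mb / 1000:.1f} GB")
```

The useful part is less the exact number than the shape of the reasoning: the size grows with the number of documents, so a laptop-sized problem at 7k documents stops being one at 60k.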

I run my process locally, but the DB is remote, somewhere on the NGS grid (in my case on the Manchester node). This is of course a great relief to me and my machine in terms of disk space, DB access speed and system load. Whenever I need to, I can dump the DB and install it locally, in case I need to access it without an internet connection. Not to mention that the batch process that loads the data could itself be run from the grid. Finally, to give public access to the data I’m using the same Django application, which pulls the data from the remote MySQL DB.
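For anyone wondering what pointing a local Django application at a remote MySQL database looks like, here is a minimal sketch of the relevant part of a `settings.py`. The database name, host name and environment variable names are placeholders invented for illustration, not the actual NGS configuration; only the `DATABASES` structure and the `django.db.backends.mysql` engine string are standard Django.

```python
import os

# Sketch of a Django DATABASES setting for a remote MySQL server.
# All values below are hypothetical placeholders; credentials are read
# from the environment so they stay out of version control.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": os.environ.get("DB_NAME", "my_project_db"),
        "USER": os.environ.get("DB_USER", "my_user"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        # A remote host instead of localhost: the application runs here,
        # the data lives elsewhere.
        "HOST": os.environ.get("DB_HOST", "db.example.org"),
        "PORT": os.environ.get("DB_PORT", "3306"),
    }
}
```

Switching to a local dump for offline work then only means changing the host (for instance by setting the hypothetical `DB_HOST` variable to `localhost`); the rest of the application stays the same.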

Having free access to the national grid as a UK researcher is absolutely essential, even for someone like me who does not work in one of the fields known to benefit most from grid infrastructure. Even if digital, I’m still a humanist.

One Response to “Feet on the ground, DB on the cloud”

  1. raffazizzi March 3, 2011 at 11:36 am #

    Hi Matteo, didn’t know about NGS and it looks great. (Digital) humanists certainly need to be told more about these infrastructures mostly used by science-y fields. Thanks for sharing!
