Your Catalog in Linked Data
This is the documentation and supplementary resources for Your Catalog in Linked Data delivered at Code4Lib 2012. In part, this talk was inspired by the statement by the Library Linked Data Incubator Group that:
Fewer bibliographic datasets have been published as Linked Data than value vocabularies and element sets.
While those value vocabularies and element sets are useful independently, their great promise is in the potential to replace MARC with a general purpose, web enabled metadata framework (i.e. RDF). For this to happen, there's a need for a proliferation of bibliographic datasets. The proposition here is to set aside the need for complete solutions—or even specific use cases—to extract from MARC what can easily be extracted, and to get catalog datasets onto the web ready for linking. This much we can do with existing Free Software tools or with simple tricks. Some of those, I'm documenting here.
Goals
Since we are limiting ourselves to solutions ready at hand, it's helpful to enumerate our goals at the outset. This should help us understand what we can expect from our datasets and make life easier when "quick and dirty" gives way to clean-up time. Broadly conceived, our goal is to resolve the issue expressed by the Library Linked Data Incubator Group: to get bibliographic data onto the semantic web. More specifically, we want to:
- Establish URIs for records in the library catalog.
- Convert key MARC tags into RDF data.
- Expose the graph via SPARQL.
- Make URIs resolvable.
- Start linking to other datasets.
Establishing a Namespace
Linked Data expresses data points as resources using HTTP URIs. Good practice dictates that these URIs should resolve to a relevant representation of the resource. Many of the URIs in our datasets will exist in a local namespace; so we need a domain, subdomain, or path and a webserver which we can point it to. The URIs in our namespace should be at least as persistent as the dataset itself and, for our purposes, should probably be under the control of the authority responsible for the data. Simply put, your library should own the domain and plan to hold it. A clear domain policy is a good idea.
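To make the idea of a local namespace concrete, here is a minimal Python sketch that mints record URIs under a base. The base URI and dataset id are the ones configured for the conversion tool further down this page; the `entry/` path segment, the `record_uri` helper, and the sample identifier are assumptions for illustration, not the scheme any particular tool uses.

```python
from urllib.parse import quote

URI_BASE = "http://data.library.oregonstate.edu/"  # namespace owned by the library
DATASET_ID = "catalog"                             # one path segment per dataset


def record_uri(record_id: str) -> str:
    """Return a URI for one catalog record inside the local namespace.

    The 'entry/' segment is a hypothetical layout chosen for this sketch.
    """
    # Percent-encode everything, including '/', so an odd identifier
    # cannot change the path structure of the URI.
    safe_id = quote(record_id.strip(), safe="")
    return f"{URI_BASE.rstrip('/')}/{DATASET_ID}/entry/{safe_id}"


# "b1234567" is a made-up record identifier.
print(record_uri("b1234567"))
# prints http://data.library.oregonstate.edu/catalog/entry/b1234567
```

Keeping the minting rule in one small function makes it easier to keep URIs stable if the identifiers behind them ever need cleaning up.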
Sketching a Simple Data Model
Creating a data model for linked bibliographic data is likely to be a long, arduous task. Though our goal is to avoid that kind of thing (please let's not repeat RDA), we still need to be aware of what it is we're building. As inspiration, see the data model published by the British Library.
Notice how few literal values there are compared to the amount of data represented as URIs. URIs are linkable and can carry further description of their own; literals cannot. If a concept is complex enough to need that richness, give it a URI. Which URIs should be external and which should be in the local namespace is an interesting question. Existing practice seems to be that resources which the dataset uses primarily as subjects get local URIs.
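The difference between a literal and a URI is easy to see in serialized form. The following Python sketch writes two statements about one record as N-Triples: the title as a literal, the creator as a URI that can itself be described and linked. The Dublin Core Terms properties are real; the record and person URIs, the title, and the helper names are made up for illustration and are not taken from the British Library model.

```python
DCT = "http://purl.org/dc/terms/"


def literal(value: str) -> str:
    """Format a plain string as an N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def uri(value: str) -> str:
    """Format a URI as an N-Triples resource."""
    return f"<{value}>"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Join an already formatted object with subject and predicate URIs."""
    return f"{uri(subject)} {uri(predicate)} {obj} ."


# Hypothetical local URIs: the record and its author both live in our namespace.
record = "http://data.library.oregonstate.edu/catalog/entry/b1234567"
person = "http://data.library.oregonstate.edu/catalog/person/example-author"

statements = [
    triple(record, DCT + "title", literal("An Example Title")),  # dead end: just text
    triple(record, DCT + "creator", uri(person)),                # linkable resource
]
print("\n".join(statements))
```

Only the second statement gives other datasets something to point at, which is the reason to prefer URIs wherever a concept deserves its own description.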
MARC Conversion
There are a number of tools out there for converting MARC to RDF. The MARC->MODS->RDFizer is one; there's also a Python tool written by Ross Singer. A mixture of approaches is good: our goal is to squeeze as much value out of MARC as possible. Perhaps I'll add more documentation of other tools to this page later; for now (this is a 20 minute talk) I just want to look at the Perl tool released by COMET last year.
$ wget http://data.lib.cam.ac.uk/code/conversion_tool.tgz
$ tar -zxvf conversion_tool.tgz; cd conversion_tool
The tool needs to be configured to assign URIs in our namespace (instead of the one used by Cambridge). To do this, edit the $uriBase and $datasetId variables in the marc2rdf_batch.pl file:
#defaults
my $uriBase='http://data.library.oregonstate.edu/';
my $datasetId = 'catalog';
To some extent, the conversion process with this tool is configurable. bibliographic_bl.txt contains a list of MARC tags and subfields, the terms they should map to, and a 'data type' which tells the script how to process the data. Change the mappings to suit your needs and push your MARC through the script:
$ perl marc2rdf_batch.pl full.marc
This process will take some time. Processing about 1.4 million records took about 6 hours on my machine. The result you'll get is a very large file (11GB in my case) of triples represented in Turtle.
SPARQL Endpoint
There are a variety of easy options for creating a triplestore and SPARQL endpoint. I won't waste time analyzing them. Instead, I'll just say that I've opted for 4store as a plausible high volume, high performance (and free) solution. The major dependencies are Avahi (multicast DNS) and Rasqal (Redland query library). Install 4store with your package manager or install the dependencies and build from source:
$ git clone https://github.com/garlik/4store.git; cd 4store;
$ ./autogen.sh
$ ./configure --with-storage-path=/path/to/dir/with/suitable/space
$ make
$ sudo make install
With 4store installed, we'll create a triplestore and start the SPARQL endpoint. Here, 'catalog' is the name of the triplestore and the endpoint listens on port 8081:
$ 4s-backend-setup catalog
$ 4s-backend catalog
$ 4s-httpd -p 8081 catalog
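Once data has been loaded, the endpoint can be queried over plain HTTP. The Python sketch below builds a SPARQL protocol GET request for a small test query. The port comes from the command above; the `/sparql/` path, the XML results media type, and the helper name are assumptions, so check them against your own 4store installation before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint location: port 8081 from the 4s-httpd command, path assumed.
ENDPOINT = "http://localhost:8081/sparql/"

# A deliberately small query: ten arbitrary statements from the store.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL XML results format.
    return Request(url, headers={"Accept": "application/sparql-results+xml"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` sends it; a response containing ten bindings is a quick sign that both the import and the endpoint are working.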
We're now ready to load catalog data. Allowing for the possibility that the library will later use the same infrastructure to publish other datasets, we should import it as a graph with a URI in our selected namespace. To keep 4store's import process from running out of memory, I had to split the file into separate sections of about 30 million triples (lines) each. Depending on your hardware, you may need to break it up into smaller chunks.
$ head -30000000 full.marc_triples.nt > catalog1.nt
$ sed -i '1,30000000d' full.marc_triples.nt
$ head -30000000 full.marc_triples.nt > catalog2.nt [and so on...]
$ killall 4s-httpd
$ 4s-import catalog --model http://data.library.oregonstate.edu/catalog /path/to/catalog1.nt --format turtle
$ 4s-import catalog --model http://data.library.oregonstate.edu/catalog /path/to/catalog2.nt --format turtle [and so on...]
$ 4s-httpd -p 8081 catalog
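The head/sed splitting above can also be done in one pass without modifying the original file. This is a minimal Python sketch, assuming the file really is line-oriented (one complete triple per line, as the .nt name suggests); Turtle that uses prefixes or multi-line statements cannot safely be cut at arbitrary line boundaries. The function name and the catalogN.nt naming are illustrative choices that mirror the file names used above.

```python
from pathlib import Path
from typing import List


def split_lines(source: Path, out_dir: Path, lines_per_chunk: int,
                stem: str = "catalog") -> List[Path]:
    """Copy `source` into numbered chunk files of at most `lines_per_chunk` lines."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    out = None
    with source.open("r", encoding="utf-8") as src:
        # Stream line by line so the whole file never has to fit in memory.
        for count, line in enumerate(src):
            if count % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                target = out_dir / f"{stem}{len(chunks) + 1}.nt"
                chunks.append(target)
                out = target.open("w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return chunks
```

Calling `split_lines(Path("full.marc_triples.nt"), Path("chunks"), 30_000_000)` would produce chunks/catalog1.nt, chunks/catalog2.nt and so on, each of which can then be passed to 4s-import as shown above.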
URI Resolution
In keeping with linked data principles, URIs need to resolve to something. We need a human readable page and the ability to return various formats through HTTP content negotiation (the Accept header sent by the client). At its most basic, this requires some mod_rewrite acrobatics and a small script to generate appropriate output. If you want to implement a custom solution this way, some examples can be found at RDF and mod_rewrite. The faster option is to use the RDF site framework released by the COMET project. It is a stripped down (debranded) version of the software which runs their data site.
$ wget http://data.lib.cam.ac.uk/code/comet_site.zip
$ unzip comet_site.zip -d comet
The COMET site includes a PHP/MySQL based triplestore and endpoint using the ARC2 library. You can use that instead, simply following the instructions in the COMET README in place of the endpoint setup described above. We're not doing that today out of concern for the scalability of that store (the dataset I'm working with at Oregon State is already ~100 million triples). To replace the default store with the existing one, edit config.php so the last few lines read:
$store = ARC2::getRemoteStore(array('remote_store_endpoint' => 'http://data.library.oregonstate.edu/sparql'));
//if (!$store->isSetUp()) {
//$store->setUp(); /* create MySQL tables */
//}
Point your PHP enabled web server at the comet directory and your links should now be resolving.
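For anyone taking the custom route instead, the core of the "small script" is deciding which representation to return for a given Accept header. The Python sketch below shows one simplified way to make that choice. The list of supported formats, the file extensions, and the function name are assumptions; wildcards such as `*/*` are not matched and simply fall through to the HTML default, which a production implementation would handle more carefully.

```python
# Media types this hypothetical site can serve, mapped to a file extension
# that a rewrite rule or redirect could use.
SUPPORTED = {
    "text/html": "html",
    "application/rdf+xml": "rdf",
    "text/turtle": "ttl",
}


def choose_format(accept_header: str, default: str = "text/html") -> str:
    """Return the supported media type with the highest q-value.

    Ties are broken by the order in which types appear in the header.
    """
    candidates = []
    for position, part in enumerate(accept_header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        quality = 1.0  # q defaults to 1 when the client does not state it
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0  # unparsable q-value: treat as not acceptable
        if media_type in SUPPORTED and quality > 0:
            candidates.append((-quality, position, media_type))
    return min(candidates)[2] if candidates else default


print(choose_format("application/rdf+xml;q=0.9, text/turtle"))
# prints text/turtle
```

A common pattern is to answer a request for the resource URI with a 303 redirect to a document URI ending in the chosen extension, so browsers land on the HTML page and RDF clients get data.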
Link Up!
Some high value targets for linking:
- LOC subject headings & name authorities
- VIAF
- Freebase, DBpedia
- Existing bibliographic datasets (BNB, COMET)
- [What else?]
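At the data level, a link is just one more triple whose object is a URI in someone else's namespace. Here is a minimal Python sketch that turns a table of matches into owl:sameAs statements ready for import. The owl:sameAs property is real, but the local URI and the VIAF identifier are placeholders, and how the matches are found (string matching on headings, existing authority control numbers, and so on) is left open. For subject headings, a softer predicate such as skos:exactMatch may be a better fit than owl:sameAs.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as(local_uri: str, external_uri: str) -> str:
    """Return an N-Triples line stating that two URIs identify the same thing."""
    for candidate in (local_uri, external_uri):
        bad_chars = any(char in candidate for char in ' <>"')
        if not candidate.startswith(("http://", "https://")) or bad_chars:
            raise ValueError(f"not a usable HTTP URI: {candidate!r}")
    return f"<{local_uri}> <{OWL_SAME_AS}> <{external_uri}> ."


# Placeholder match table: local URI -> external URI. Both values are invented.
matches = {
    "http://data.library.oregonstate.edu/catalog/person/example-author":
        "http://viaf.org/viaf/0000000",
}

link_triples = [same_as(local, external) for local, external in matches.items()]
print("\n".join(link_triples))
```

Writing the links to their own file and loading them as a separate graph keeps them easy to regenerate when the matching improves.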
Next Steps
- Link to more datasets.
- Figure out holdings, figure out catalog updates.
- Build applications.
- Expand and improve the data model.
- Describe library, holdings and policies.
- Embed in library web pages as RDFa and/or schema.org microdata.