Stateless

Thoughts on RDFa & Microdata

In the wake of TAG's call for a task force on RDFa and HTML microdata, I want to make some quick points about where microdata falls short for scholarly publishing use cases. To some extent, this is in reference to the ongoing work surrounding Scholarly HTML and, more broadly, the goal of enabling researchers to publish on and for the web.

<span rel="cito:obtainsBackgroundFrom" resource="http://www.w3.org/TR/microdata/#values"/>

This document "obtains background from" the microdata specification's section on values, but the relationship doesn't merit direct reference in the text. An inline citation would be forced, and the format is informal enough that a full bibliography in the footer would be out of place. Unlike the first example, there is no existing document data to mark as a citation. Instead, I'm adding descriptive data about the document itself.

An argument could be made for leaving this kind of non-publication data out, but including it has some advantages:

  • CiTO-savvy readers and software can extract a full bibliography;
  • The author of the cited work can more accurately estimate its impact for tenure review, grants, etc.;
  • Citation data is encapsulated in the document should the article (or a derived work) need to be reviewed more formally (e.g., journal submission);
  • A richer RDF graph is created surrounding both resources.

By modeling citations on a deeper level than is practical in print, we open the door for new modes of scholarly communication. The microdata syntax, however, introduces limitations. If the citation target doesn't appear in the document as a text value or a hyperlink, it can't be a microdata value. A few workarounds come to mind (we could mark up a bibliography and suppress it with stylesheets, or build a hyperlink with no link text), but these hacks come at the expense of the document outline. They could also become very messy very quickly if applied to more sophisticated citation or footnote schemes.

Even relatively simple data can present the same problem: <span property="dc:creator" resource="http://id.achelo.us/tjohnson" /> <span property="dc:abstract" content="Discussion of the functional differences between RDFa and HTML microdata as related to the publication of scholarly works in HTML." />

Neither the abstract nor the creator is really meant for publication in this case. The context (the article sits on my website, alongside the relatively short full text) makes them an unnecessary distraction. Still, each could be crucial for a machine reader, or when the syndicatable <article> block finds itself removed to some other context.

It's worth noting that the problems I'm talking about aren't just grand linked-data issues, but apply even to simple internal applications. I generate RSS feeds and indexes from embedded metadata; I think that's a better way of doing it than the more standard approach of modeling all my documents in a relational database and generating both HTML & RSS from there. So I have a sliver of an application layer doing some pretty primitive operations, and microdata can't support even those to my satisfaction. It only gets worse when we start to think about subject headings, or discipline-specific markup. Embedding semantically rich research data, for example, would be entirely impractical.
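To give a rough idea of how primitive such an operation can be, here is a minimal Python sketch that pulls predicate/object pairs out of invisible RDFa-style spans like the ones quoted above. It is an illustration only, not the code behind this site: the class name `RDFaAttributeCollector` and the helper `extract_pairs` are hypothetical, and the sketch is nowhere near a conforming RDFa processor (it ignores prefixes, subjects, nesting, and literals taken from element text).

```python
from html.parser import HTMLParser


class RDFaAttributeCollector(HTMLParser):
    """Collect (predicate, object) pairs from RDFa-style attributes.

    Hypothetical helper for illustration. It only looks at attributes on a
    single element: `property` or `rel` for the predicate, and `resource`,
    `content`, or `href` for the object.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <span ... />.
        attributes = dict(attrs)
        predicate = attributes.get("property") or attributes.get("rel")
        obj = (
            attributes.get("resource")
            or attributes.get("content")
            or attributes.get("href")
        )
        if predicate and obj:
            self.pairs.append((predicate, obj))


def extract_pairs(html):
    """Return the list of (predicate, object) pairs found in `html`."""
    collector = RDFaAttributeCollector()
    collector.feed(html)
    collector.close()
    return collector.pairs


# The spans below are copied from this article; the <article> wrapper and
# the visible paragraph are assumed for the sake of the example.
SAMPLE = """
<article>
  <span rel="cito:obtainsBackgroundFrom" resource="http://www.w3.org/TR/microdata/#values"/>
  <span property="dc:creator" resource="http://id.achelo.us/tjohnson" />
  <span property="dc:abstract" content="Discussion of the functional differences between RDFa and HTML microdata as related to the publication of scholarly works in HTML." />
  <p>Visible text that readers actually see.</p>
</article>
"""

if __name__ == "__main__":
    for predicate, obj in extract_pairs(SAMPLE):
        print(predicate, "->", obj)
```

None of the three spans contributes anything to the rendered page, yet each yields a pair that a feed or index generator could use. A real pipeline would more likely hand the document to a full RDFa parser and work with the resulting graph.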

The point here isn't to slag off microdata. Like microformats (which share many of the same issues), microdata is a low-barrier, good-enough solution for a wide range of use cases. For scholarly publishing, however, it forces a choice between writing HTML structured for publication but with sporadic metadata, or providing rich markup in a sloppy document. That's not a choice I want to make.

I'm eager to see what, if anything, results from TAG's call for a clear path forward.