Monday, October 11, 2010

Thinking about citation

In digital scholarship, citation is the most fundamental form of ontology. Identifying an entity in a recognized form of citation allows scholars to agree that the object exists, while leaving room to disagree over how to represent it. Once we agree that there is an object I call "the Parthenon in Athens," and can unambiguously cite that object, we allow software to recognize that the same object might be represented in a GIS with spatial data, in a photo gallery with a collection of photographs, or in an architectural database with structured fields of textual data.

For digital scholars, it's hard to talk or even think about scholarly reference systems without getting bogged down in technology-specific tar pits. Take the W3C's Resource Description Framework as an example. Its triple model is brilliantly simple and powerful: whether you think of it as "subject-verb-object" or "object-directed link-object", it is general, extensible, and lends itself readily to both abstract graph models and real machine implementations. Its syntax is expressed in terms of URIs, which could equally well be URLs (real addresses in the "http" scheme), or URNs (abstract names in the "urn" scheme). (See the W3C's document "URI Clarification" or this discussion "Untangle URIs, URLs, and URNs" to sort out the acronyms.) But when was the last time you saw a scholarly project using RDF to describe relations among objects identified by abstract name, rather than by address? In an RDF graph, "http" is a good URI scheme for an application retrieving material on the internet, but a poor choice for a description of relations among persistent and immutable objects, a case where the "urn" scheme would be more appropriate.
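To make the name-versus-address distinction concrete, here is a minimal sketch in Python. It is not an RDF library, only plain tuples standing in for triples, and every identifier in it is a placeholder I made up for illustration: the "urn:example:" names and the example.org addresses do not refer to any real service or vocabulary.

```python
from urllib.parse import urlparse

# Each triple is (subject, predicate, object).
# All identifiers are hypothetical placeholders, not real vocabularies.
TRIPLES = [
    ("urn:example:parthenon", "urn:example:rel:depictedIn",
     "http://example.org/photos/42.jpg"),
    ("urn:example:parthenon", "urn:example:rel:locatedBy",
     "http://example.org/gis/feature/7"),
]


def is_abstract_name(uri):
    """Return True when the URI uses the 'urn' scheme (a name, not an address)."""
    return urlparse(uri).scheme.lower() == "urn"


def representations_of(subject, triples=TRIPLES):
    """Collect every object that some triple links from the given subject."""
    return [obj for subj, _pred, obj in triples if subj == subject]
```

The subject is cited by name, so the graph stays valid even if the photo gallery or the GIS moves to a new address; only the objects on the right-hand side would need updating.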

For canonically citable texts, work at the Center for Hellenic Studies has led to:

  1. identifying abstract properties of canonical texts
  2. developing a human readable and machine actionable notation for citation that expresses these properties (the CTS URN notation)
  3. defining a service for identifying and retrieving texts identified by this notation (the Canonical Text Services protocol)

I've written a bit about this in a dry article on "Digital infrastructure and the Homer Multitext Project" but here I would just note that this three-tiered hierarchy has been very useful both in trying to think about citation outside of any particular technological context, and for defining machine-actionable tests to evaluate implementations of text services. Only level [3] deals with specific technologies on the internet. If you don't want to interact with a Canonical Text Service, the CTS URN notation [2] can still be used by any application referring to canonically cited texts. If you don't like the CTS URN notation, you can still evaluate any alternative notation by seeing whether it implements the abstract properties of [1]. And if you are dissatisfied with the identification of those properties, you can start from scratch and redefine what you think a canonically citable text really is.
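As a rough illustration of how level [2] can be used without any service from level [3], the sketch below parses a citation string entirely offline. The field layout it assumes (a "urn:cts:" prefix, then a namespace, a work identifier, and an optional passage reference, separated by colons) is my simplification for this example, not a statement of the full notation, and the sample string in the usage note is illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    """Simplified view of a citation: the field names are assumptions."""
    namespace: str
    work: str
    passage: Optional[str]


def parse_cts_urn(text):
    """Split a colon-delimited citation into its parts without any network access."""
    parts = text.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError("not a CTS URN under the assumed layout: %r" % text)
    namespace, work = parts[2], parts[3]
    if not namespace or not work:
        raise ValueError("namespace and work must be non-empty: %r" % text)
    # The passage component is optional; an empty one is treated as absent.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, work, passage)
```

Under these assumptions, `parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1")` yields a value whose `passage` is `"1.1"`, and an "http" address is rejected with a `ValueError`. Nothing here touches the internet, which is the point: the notation is usable by any application on its own terms.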

I think this graded distinction of abstract property, reference notation, and application could be equally valuable in citing other kinds of material. In a following series of posts, I'll look successively at each level to analyze how we might cite uniquely identified objects.