Friday, April 11, 2008

Citation schemes: empty content elements considered harmful

Classicists have, by and large, relied on standard, logical citation schemes to cite works of ancient literature. In the scheme of the Functional Requirements for Bibliographic Records (or FRBR), we could say that classicists have cited notional works using references that could then be applied to any manifestation or expression of that work.

In the print world, this practice has made it possible for scholars to apply a reference to different printed editions or translations of a work. As the internet becomes our library, this practice can turn references into machine-actionable entry points to the library (whether the reference is automatically discovered, or manually cited by a scholar). It is therefore a vital prerequisite that digital editions encode standard, logical citation data such as the book/chapter/section divisions of Thucydides, or the book/line divisions of the Iliad.

The TEI Guidelines (as so often) offer more than one way to approach the problem. It is valid TEI to encode citation values as attributes on containing elements that define the logical structure of a document. Book/chapter/section in Thucydides might be represented by a successive hierarchy of TEI div elements, for example, or book/line in the Iliad by div elements containing l elements; the citation values could be placed in the @n attribute of each container.

Alternatively, since the earliest work of the TEI in the 1980s, the Guidelines have included empty elements (such as the milestone) that could be used to mark transitional points in a document. It is easy to find examples of scholarly texts using such empty elements to mark the beginning of a new unit like a chapter or section.

Arguably, there was little difference between these two approaches in SGML. In XML, however, scholars should avoid using empty elements to encode citation data.

A host of supporting and related technologies have developed around XML in its first decade. One of the most important is XPath, a notation for referring to parts of an XML document by the document's structure. Higher-level technologies such as XSLT, and implementations of the DOM in many programming languages, in turn support XPath expressions. The result is that programmers working in many environments can succinctly retrieve a unit like "book 2, chapter 5" of Thucydides with a simple XPath expression like

/TEI.2/text/body/div[@type='book' and @n='2']/div[@type='chapter' and @n='5']

Content between empty elements, on the other hand, cannot be addressed directly with XPath expressions.
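
To make the containing-element case concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. That module implements only a subset of XPath: it has no "and" operator, so the two conditions are written as chained predicates, and the path is relative to the root TEI.2 element. The sample document, its placeholder text, and the function name find_chapter are invented for illustration and do not come from any real edition.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature TEI document; the text content is placeholder only.
SAMPLE = """
<TEI.2>
  <text>
    <body>
      <div type="book" n="1">
        <div type="chapter" n="5"><p>placeholder for 1.5</p></div>
      </div>
      <div type="book" n="2">
        <div type="chapter" n="4"><p>placeholder for 2.4</p></div>
        <div type="chapter" n="5"><p>placeholder for 2.5</p></div>
      </div>
    </body>
  </text>
</TEI.2>
"""


def find_chapter(root, book, chapter):
    """Return the div element for the given book and chapter, or None."""
    # ElementTree's XPath subset lacks "and", so each condition is its own predicate.
    path = (
        "./text/body"
        f"/div[@type='book'][@n='{book}']"
        f"/div[@type='chapter'][@n='{chapter}']"
    )
    return root.find(path)


root = ET.fromstring(SAMPLE)
chapter = find_chapter(root, "2", "5")
chapter_text = "".join(chapter.itertext()).strip()
print(chapter_text)
```

The citation "2.5" translates mechanically into a path, and the result is a single element that can be handed on to whatever comes next.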

Placing citation data on empty elements cuts programmers off from a galaxy of technologies they can use when citation data is kept on containing elements. Empty citation elements should never be necessary if the citation scheme is in fact a logical hierarchy: if it is not, consider whether there is a problem either with your choice of citation scheme or with your design of the rest of the document's structure.
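
By way of contrast, here is a rough sketch of what retrieving a chapter can involve when chapters are marked only by empty milestone elements. No element corresponds to the chapter, so there is nothing for a path to select; instead the program walks the tree in document order and gathers text until the next chapter milestone. The fragment, its placeholder text, and the function names are assumptions made for illustration, and the code uses Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: chapters are marked only by empty milestone elements,
# and chapter 5 straddles a paragraph boundary.
SAMPLE = """
<div type="book" n="2">
  <p><milestone unit="chapter" n="4"/>placeholder for 2.4
  <milestone unit="chapter" n="5"/>placeholder for 2.5, first part</p>
  <p>placeholder for 2.5, second part
  <milestone unit="chapter" n="6"/>placeholder for 2.6</p>
</div>
"""


def _walk(elem, chapter, state, pieces):
    """Visit text in document order, switching collection on or off at each chapter milestone."""
    if elem.tag == "milestone" and elem.get("unit") == "chapter":
        state["on"] = elem.get("n") == chapter
    if state["on"] and elem.text:
        pieces.append(elem.text)
    for child in elem:
        _walk(child, chapter, state, pieces)
        # Tail text follows the child's end tag, so it is handled after the child.
        if state["on"] and child.tail:
            pieces.append(child.tail)


def text_of_chapter(book, chapter):
    """Return the whitespace-normalized text between one chapter milestone and the next."""
    pieces = []
    _walk(book, chapter, {"on": False}, pieces)
    return " ".join("".join(pieces).split())


book = ET.fromstring(SAMPLE)
print(text_of_chapter(book, "5"))
```

Even in this small case the result is a string stitched together from pieces of two paragraphs, not a node in the document, so none of the tools that operate on elements can be applied to "chapter 5" as such.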

Separation of concerns applies to document content, too

Twenty years ago, before the internet was open to the public, the print publishing industry was a leader in SGML document markup, and scholarly markup projects tended to think of "documents" as the content bound between a pair of covers. This heritage is clearly reflected in the TEI Guidelines' thorough inventory of elements to identify "front" and "back" material of documents, or a variety of groupings or collections of texts.

The major syntactic differences between XML and SGML (insistence on a single hierarchy of elements, each with an explicitly marked end) were introduced in part to adapt markup to the needs of a very different environment: a network of computers exchanging information dynamically. The already well-understood distinction between semantic markup and presentational markup certainly contributed to the articulation of "separation of concerns" in the design of network applications. Individuals with different skills could apply appropriate technologies to the different parts of a network application, so in creating an application to run in a web browser, programmers might write the controlling code in JavaScript, and design specialists define its appearance with CSS. In a network of semantically structured content, XML plays the vital roles of defining the data structure (explicitly via a schema or DTD, or implicitly in the case of well-formed XML), and of providing a format for data exchange. The question of what this XML should look like, the kind of question the TEI has considered since the 1980s, had to be rethought. Humanists might rephrase Sun Microsystems' famous slogan, "The network is the computer," as "The network is the library."

When applications can exchange structured content, it is straightforward to create compound documents. Asymmetrically, it can be more difficult to disaggregate a complex document into component parts, since an application then needs a more detailed knowledge of the internal structure of a necessarily more complex document. An application could easily juxtapose a document in the original language with a document in translation, or weave together a commentary with a text associated through a common citation system, for example, but disentangling interleaved translation or commentary from a complex document is more problematic.
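
As a small illustration of the "weaving" direction, the sketch below joins two independent documents on a shared book.line citation. Both fragments, their placeholder content, the type value "comment", and the helper by_citation are hypothetical; the only assumption that matters is that each document carries its citation values in @n attributes on containing elements. It uses Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical text document: one book with numbered lines.
TEXT = """
<div type="book" n="1">
  <l n="1">placeholder line 1.1</l>
  <l n="2">placeholder line 1.2</l>
  <l n="3">placeholder line 1.3</l>
</div>
"""

# Hypothetical commentary document, kept entirely separate from the text.
COMMENTARY = """
<div type="book" n="1">
  <div type="comment" n="1"><p>placeholder note on 1.1</p></div>
  <div type="comment" n="3"><p>placeholder note on 1.3</p></div>
</div>
"""


def by_citation(book, path):
    """Map 'book.unit' citation strings to the normalized text of matching elements."""
    book_n = book.get("n")
    return {
        f"{book_n}.{elem.get('n')}": " ".join("".join(elem.itertext()).split())
        for elem in book.findall(path)
    }


lines = by_citation(ET.fromstring(TEXT), "./l")
notes = by_citation(ET.fromstring(COMMENTARY), "./div[@type='comment']")

# Weave the two documents together: each line paired with its note, if any.
woven = [(ref, line, notes.get(ref)) for ref, line in lines.items()]
for ref, line, note in woven:
    print(ref, line, "|", note or "(no comment)")
```

Neither document needs to know anything about the other's internal layout beyond the citation values, which is what makes composition the easy direction.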

I've been thinking about this in designing a set of TEI documents to represent the multiple texts of the famous Venetus A manuscript of the Iliad. There are four distinct sets of scholia, in addition to the manuscript's text of the Iliad. I chose to treat each set as an independent document, and as I am now reaching the stage of putting together applications drawing on those documents, I am glad that I did: cleanly separated, discrete documents are making that job much easier than it otherwise would be.

I expect that I will never use the elaborate TEI mechanisms to document the relation of a transcribed document to graphic images. In keeping with the guiding principle of separate, discrete documents, I'm associating images of the manuscript with ranges of text through external indices: here, too, the standoff markup of a separate, simple (non-TEI) document is easy to marshal together with the TEI document of the transcribed text.
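
The format of such an index is not specified here, so the following is only a guess at the general shape: a plain-text file with one row per image, giving an image identifier and the first and last book.line reference it covers. The identifiers, the ranges, and the function names are all invented and say nothing about the actual layout of the manuscript.

```python
# Hypothetical index: image identifier, first line, last line (as book.line).
INDEX = """\
image-001 1.1 1.25
image-002 1.26 1.50
image-003 1.51 1.75
"""


def parse_ref(ref):
    """Turn 'book.line' into a tuple of ints so that references compare in order."""
    book, line = ref.split(".")
    return int(book), int(line)


def load_index(source):
    """Parse the index text into (image, first_ref, last_ref) entries."""
    entries = []
    for row in source.splitlines():
        image, first, last = row.split()
        entries.append((image, parse_ref(first), parse_ref(last)))
    return entries


def images_for(ref, entries):
    """Return the identifiers of all images whose range includes the reference."""
    target = parse_ref(ref)
    return [image for image, first, last in entries if first <= target <= last]


entries = load_index(INDEX)
print(images_for("1.30", entries))
```

Because the index speaks only in citation values, it can be maintained, replaced, or extended without touching the transcription at all.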

In many ways, TEI P5, with its support for XML namespaces, is nudging scholars towards this kind of document organization. But we need to push harder: it's time to move away from monolithic TEI replicas of print or even manuscript sources. In editing scholarly texts for use on the internet, let each logical component stand alone.

Coordinating separate documents in a networked library requires a common understanding of how to cite them. I'll follow up with a note on how editors of TEI texts should think about that part of their markup.