Wednesday, February 8, 2012

Digital scholarship must be technology-agnostic


As smart phones and tablets assume an ever-larger role in browsing the web, “responsive design” has become a hot topic among web designers. How far is it possible to design a single web site that can adapt its display depending on the characteristics of the reading device? Are there times when it’s simply necessary to maintain separate resources for phones vs. large-screen computers?

Designers of digital scholarship face even more demanding requirements. We know that we will replace our digital technologies, but it’s part of our responsibilities to preserve and transmit the scholarly record we work with. Our predecessors have not always set an ideal example for us. The work of Hellenistic scholars of the Iliad like Aristarchus of Samothrace was originally composed for papyrus scrolls. By the time of our earliest complete manuscripts of the Iliad, in the tenth and eleventh centuries, the standard form of “publication” was the codex, or manuscript book. In a large codex, the wide margins offered invitingly convenient space to annotate the Iliadic text with selected notes from earlier scholars, as we see in the famous Venetus A manuscript.

As a consequence, virtually all ancient scholarship on the Iliad ceased to be copied as separate texts, and is today known to us only from the snippets preserved in these marginal notes, or scholia. The convenience of this early “hypertext” technology led directly to the loss of important scholarly work.

This illustrates a fundamental and somewhat paradoxical principle that should guide all our work on digital scholarship: it must be technology-agnostic. Well-designed digital work will be machine-actionable, but will also be capable of expressing its content when moved to other media, even non-digital media.

One area where we must apply this principle rigorously is in our citation practice. It is tempting to yield to the convenience of using a URL to refer to on-line work: after all, with a URL we can immediately see some kind of response in a web browser.

But this convenience is as dangerous as the medieval scribes’ use of the margins of manuscripts for scholia. URLs are addresses: they will change or vanish; more fundamentally, the web that they point to will ultimately vanish (and, on a time scale that looks back to Aristarchus of Samothrace and other scholars of the library at Alexandria, it will certainly vanish sooner rather than later).

I’ve worked over the past several years with colleagues at the Center for Hellenic Studies to develop a URN notation for citing texts. (Some formal documentation is beginning to appear here.) URNs offer a formally specified notation for referring to some kind of resource, without reference to any particular technology. One of my favorite examples is the ISBN, which can be expressed with URN syntax. Many computer applications work with ISBNs: sales clerks in book stores read them with bar-code scanners, and you can search Amazon or bookfinder.com by ISBN, for example. But until a few years ago, I routinely filled out request forms at my college bookstore by hand-writing ISBNs on a paper form, and they functioned perfectly well in that analog environment.
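As a small illustration of how an identifier like the ISBN can serve both people and machines, here is a minimal Python sketch. It checks the standard ISBN-13 check digit and then writes the number in `urn:isbn:` form. The function names `isbn13_is_valid` and `isbn_urn` are made up for this example, and the sample ISBN is only a commonly used test value, not a book discussed in this post.

```python
def isbn13_is_valid(isbn: str) -> bool:
    """Return True if the string is a 13-digit ISBN with a correct check digit."""
    # Hyphens and spaces are only for human readers; drop them before checking.
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    # ISBN-13 weights alternate 1, 3, 1, 3, ... across all thirteen digits;
    # the weighted sum of a valid number is a multiple of 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def isbn_urn(isbn: str) -> str:
    """Express a valid ISBN-13 as a URN string (hypothetical helper)."""
    if not isbn13_is_valid(isbn):
        raise ValueError(f"not a valid ISBN-13: {isbn!r}")
    return "urn:isbn:" + isbn


print(isbn_urn("978-0-306-40615-7"))  # prints urn:isbn:978-0-306-40615-7
```

The same thirteen digits can be checked by software or copied by hand onto a paper form; nothing in the identifier depends on where it is written.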

The Canonical Text Service URN (or CTS URN), like an ISBN, is a formally specified machine-parseable reference, but at the same time a simple text string that can be read by human beings and used outside of a digital environment. I have successfully disseminated URNs using chalk on blackboards, and pen on the back of a napkin. But since a CTS URN is also machine-actionable, it can be passed in to a Canonical Text Service to retrieve cited passages of text. When our form of citation is not tied to a specific technology, we are free to imagine previously unforeseen re-uses of that material. Would it be handy if the printed copy of a book you want to carry with you were augmented with URNs represented as QR codes you could point your smart phone at to read a cited text? I don’t know, but it would not be difficult to implement. The QR code at the top of this blog entry represents the CTS URN

urn:cts:greekLit:tlg0012.tlg001:1.1
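To make "machine-parseable" a little more concrete, here is a minimal Python sketch that splits a URN like the one above into its colon-separated parts. It is a simplification under stated assumptions, not the formal specification: the class `CtsUrn`, the function `parse_cts_urn`, and the field names are labels chosen for this example, and only up to three dot-separated levels in the work component are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    namespace: str            # e.g. "greekLit"
    textgroup: str            # e.g. "tlg0012"
    work: Optional[str]       # e.g. "tlg001"
    version: Optional[str]    # a specific edition or manuscript, if cited
    passage: Optional[str]    # e.g. "1.1" or a range such as "1.1-1.20"


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN string into its components (simplified sketch)."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    levels = parts[3].split(".")
    if not 1 <= len(levels) <= 3 or not all(levels):
        raise ValueError(f"unsupported work component: {parts[3]!r}")
    # Pad the work hierarchy so missing levels come back as None.
    textgroup, work, version = (levels + [None, None])[:3]
    # An absent or empty fifth component means no passage is cited.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(parts[2], textgroup, work, version, passage)


print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1"))
```

Nothing here needs a network connection: the string carries its own structure, which is why it survives equally well on a blackboard.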

Here is a link passing the same URN to a Canonical Text Service.
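A link of that kind can be assembled mechanically from the URN. The Python sketch below shows one way to do it, assuming the service accepts a `request=GetPassage` query with a `urn` parameter. The base address `https://cts.example.org/api` is a placeholder invented for this example, not the address of any real service, and `get_passage_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def get_passage_url(base_url: str, urn: str) -> str:
    """Build a request URL that passes a CTS URN to a text service.

    The parameter names are an assumption about the service's interface;
    the URN itself is independent of whichever host answers the request.
    """
    # Keep ":" unescaped so the URN stays readable inside the URL.
    query = urlencode({"request": "GetPassage", "urn": urn}, safe=":")
    return f"{base_url}?{query}"


# Placeholder host: substitute the address of whatever service is current.
print(get_passage_url("https://cts.example.org/api",
                      "urn:cts:greekLit:tlg0012.tlg001:1.1"))
```

Only the first argument changes when a service moves to a new host; the citation passed as the second argument stays the same.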

5 comments:

Sebastian Heath said...

Hi Neel,

I think there are advantages to using identifiers that are both unique going forward and actionable now.

It is true that an address like http://nomisma.org/id/igch1544 has a specific meaning in today's technological environment. As you say, you can paste it into a browser and get to a web page.

I don't think that means that access to the definition of the concept "IGCH 1544" will go away if the DNS system changes. "http://nomisma.org/id/igch1544" can remain unique, even if access to the definition needs to be mediated through a new location (a new URL).

Taking your longer perspective, I think sequences of characters beginning "http://..." can be put out on the internet in such a way that they implement very long-term identification. That is, some 1000s of years from now, a future scholar may be able to find sufficient traces of certain "http://" identifiers to both figure out what they identified and find that content. And I do think that ability to recover is made more likely by the fact that the content behind "http://" identifiers can be copied without intervention of the "owner" of the identifier.

To summarize: only time will tell but I think we can point to "http://" identifiers as one route to long-term viability and technological independence.

Neel Smith said...

Agreed that "http:" identifiers derived from URL addresses could be used independently of their original intended function, and are here for a long time, but are they expressive enough? Taken out of their planned use as an address, they aren't parseable with any specified semantics.

Aren't we at risk of looking for nails we can bang our hammer on?

Esther said...

[In reply to Sebastian] The good news about using URN citations is that they are primarily implemented as URLs through the http:// structure. But what makes having a URN better than having a simple URL is that the URN is the consistent part of the URL - that is, whenever the text service is hosted somewhere else (which happens all the time in, say, the Homer Multitext Project) - it is still actionable and useful. One neat outcome recently discovered is that one can type a URN into Google and Google has been able to pull the appropriate text through whatever text service it finds.

As for the longevity of links/URLs - I read an article (http://worldcat.org/arcviewer/5/LEGAL/2011/06/15/H1308163631444/viewer/file2.php) that talks about the issues of preservation when relying on links/URLs to point to a particular source. I think that this is a very scary concept when the goal is to preserve content and preserve access to content.

So - my comment is, URLs provide transient access to information. But, one can have their cake and eat it too since the URLs to access these texts are built on this stable, long-term solution of a URN referring to a text.

Paolo said...

I find the CTS URN system very interesting, but there is one thing that puzzles me in the ontology of the text that lies behind it. The system follows the FRBR approach, where the "expressions", "manifestations" and "items" of a text emanate from the "notional text". This is the librarian's (and the scribe's) perspective. The philologist's perspective is quite opposite. For us the "text" originates from its sources, its witnesses (MSS, papyri, etc.), that is from the "items". The witnesses ("items") are the only reality of the text. In other words, the philologist's ontology of the text is (should be) document-based. This is, by the way, what makes the Homer Multitext project so interesting to me.

From this perspective, a controversial point lies in the passage from, say, urn:cts:greekLit:tlg0012.tlg001:1.1-1.20 to urn:cts:greekLit:tlg0012.tlg001.manuscriptX:1.1-1.20 and to urn:cts:greekLit:tlg0012.tlg001.manuscriptY:1.1-1.20. This assumes that the first 20 verses of the Iliad in our "notional text" called "Iliad" (that is in fact the positive result of a scholarly consensus and is not abstract in itself) are unproblematically mappable upon the first 20 verses in manuscript X and in manuscript Y. How does the CTS URN system manage a MS Y having a "spurious" 11th verse after the 10th "canonical" verse? This might work as a counterexample showing the difference between a notional text-based ontology and a document-based ontology of the text.

Assma said...

I can't tell you the number of hours I have spent dealing with the sigma insanity in Unicode. It is *utterly* mad. Greek had the misfortune of being an early entrant into the Unicode space, and presumably they learned from their mistakes when it came to dealing with Arabic.