Vitruvian design for scholarship in the humanities: CTS URN


Saturday, March 7, 2015

OHCO2 FTW

Underlying the CTS URN notation is the abstract model of textual structure abbreviated as OHCO2.

The generality of this model is nicely illustrated by recent implementations of the Canonical Text Services (CTS) protocol. The CTS protocol provides retrieval of texts by CTS URN: implementations linked from this page use XML tree structures, relational databases and directed graph stores to store and retrieve texts.

For an essential scholarly concept (identifying a citable passage of text), that's a powerful level of abstraction, one that permits scholars and developers to select the technologies best suited to the specific kind of work they want to pursue with a citable corpus.

Sunday, March 17, 2013

CTS is complete under OHCO2

My preceding post promised to compare experiences implementing the Canonical Text Services protocol with three equivalent data structures for text: trees (formatted in XML), tables, and graphs (expressed in RDF). Before turning to the first of these data structures, however, I should expand briefly on the comment in that post that, in developing the CTS protocol, "we relied heavily on the OHCO2 model." More precisely, I mean that we developed CTS so that it fully expresses the semantics of OHCO2: hence the title of the present post.

The CTS protocol uses CTS URNs to cite passages of texts. The semantics of CTS URNs by themselves give us two of the four OHCO2 properties, since a CTS URN specifies where in a citation hierarchy a passage of text is situated, and where in a hierarchy of versions a particular version is situated. A URN like urn:cts:greekLit:tlg0012.tlg001.msA:9.119, for example, refers to a passage in a version of the Iliad (the work tlg0012.tlg001) identified as msA (i.e., the Venetus A manuscript), and more specifically to a citable line (119) contained within a citable book (9).
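
The anatomy of such a URN can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names CtsUrn and parse_cts_urn are hypothetical, and passage ranges or other finer-grained forms of reference are not handled.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    """Hypothetical container for the parts of a CTS URN."""
    namespace: str
    text_group: str
    work: Optional[str]
    version: Optional[str]
    passage: list  # citation hierarchy, outermost level first


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its version hierarchy and citation hierarchy."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_part = parts[2], parts[3]
    passage_part = parts[4] if len(parts) == 5 else ""

    # Version hierarchy: text group, then work, then version (missing levels are None).
    work_levels = work_part.split(".")
    work_levels += [None] * (3 - len(work_levels))
    text_group, work, version = work_levels[:3]

    # Citation hierarchy: e.g. book, then line.
    passage = passage_part.split(".") if passage_part else []
    return CtsUrn(namespace, text_group, work, version, passage)


urn = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.msA:9.119")
print(urn.version, urn.passage)
```

The two hierarchies stay separate in the result: text group, work and version locate the passage among versions, while the passage list locates it within the citation scheme.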

The remaining two OHCO2 properties are provided by a pair of CTS requests. The GetPrevNext request places a passage within an ordered sequence; the GetPassage request, which returns the contents of the passage, supports a mixed content model.

After some initial experience developing applications built on CTS, Chris Blackwell suggested that it would be convenient for developers to have both GetPrevNext and GetPassage information available via a single request. We introduced the CTS GetPassagePlus request for just this purpose. His intuition is now gratifyingly justified by the observation that the GetPassagePlus request tells us everything about a cited passage of text that the OHCO2 model guarantees.
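
The relation among the three requests can be sketched with a toy in-memory version of a text. Everything here is an assumption made for illustration: the function names only echo the CTS request names, the dictionary returned is not the reply format of the protocol, and the line contents are placeholders rather than real text.

```python
# Hypothetical version of a text: citable nodes kept in document order.
LINES = [
    ("9.118", "placeholder text of line 9.118"),
    ("9.119", "placeholder text of line 9.119"),
    ("9.120", "placeholder text of line 9.120"),
]


def get_passage(nodes, ref):
    """Return the content of the node cited by ref."""
    for citation, content in nodes:
        if citation == ref:
            return content
    raise KeyError(ref)


def get_prev_next(nodes, ref):
    """Return the references just before and just after ref (None at either end)."""
    refs = [citation for citation, _ in nodes]
    i = refs.index(ref)
    prev_ref = refs[i - 1] if i > 0 else None
    next_ref = refs[i + 1] if i < len(refs) - 1 else None
    return prev_ref, next_ref


def get_passage_plus(nodes, ref):
    """Combine content and ordering information in a single reply."""
    prev_ref, next_ref = get_prev_next(nodes, ref)
    return {
        "ref": ref,
        "content": get_passage(nodes, ref),
        "prev": prev_ref,
        "next": next_ref,
    }


print(get_passage_plus(LINES, "9.119"))
```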

Sunday, March 10, 2013

Data structures for texts

My best scholarship that no one has ever read is probably the work I did with Gabe Weaver on the structure of citable texts. (I sense potential for a dinner-party game similar to “Humiliation” in David Lodge’s novel Changing Places…)
We proposed a model of citable text as an ordered hierarchy of citation objects (the “OHCO2” model). In OHCO2, every citable node has four defining properties:

every node belongs to a citation hierarchy
every node belongs to a FRBR-like version hierarchy
nodes belonging to the same version are ordered
nodes support a mixed content model

Two representations of a text that preserve these properties for every citable node are considered equivalent under OHCO2.
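
As a rough sketch of what that equivalence means in practice, the Python below reduces two differently shaped representations (a nested dictionary standing in for a tree, and a list of rows standing in for a table) to the same sequence of citable nodes. The class and function names are hypothetical, the content strings are placeholders, and the shapes are far simpler than real XML or a real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitableNode:
    citation: tuple  # place in the citation hierarchy, e.g. ("9", "119")
    version: tuple   # place in the version hierarchy
    sequence: int    # position among nodes of the same version
    content: str     # may itself contain markup (mixed content)


def nodes_from_tree(version, tree):
    """Flatten a nested {book: {line: content}} dict, in insertion order."""
    nodes = []
    for book, lines in tree.items():
        for line, content in lines.items():
            nodes.append(CitableNode((book, line), version, len(nodes), content))
    return nodes


def nodes_from_table(rows):
    """Read rows of (version, citation, sequence, content), sorted by sequence."""
    nodes = [
        CitableNode(tuple(cit.split(".")), tuple(ver.split(".")), seq, content)
        for ver, cit, seq, content in rows
    ]
    return sorted(nodes, key=lambda node: node.sequence)


VERSION = ("tlg0012", "tlg001", "msA")

tree = {"9": {"118": "<l>placeholder A</l>",
              "119": "<l>placeholder <num>B</num></l>"}}

# Rows are deliberately out of order: the sequence column carries the ordering.
table = [
    ("tlg0012.tlg001.msA", "9.119", 1, "<l>placeholder <num>B</num></l>"),
    ("tlg0012.tlg001.msA", "9.118", 0, "<l>placeholder A</l>"),
]

# Both representations yield the same nodes with the same four properties.
print(nodes_from_tree(VERSION, tree) == nodes_from_table(table))
```
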
As I worked with Gabe, Chris Blackwell and others on both the Canonical Text Services protocol (CTS) and the CTS URN notation, we relied heavily on the OHCO2 model. I have recently completed a new implementation of the CTS protocol — the third of three implementations I have written using three different technologies for working with three completely different representations of text. Since all of the representations are OHCO2 equivalent, we know that they preserve the semantics of citable text, and we can consider other criteria to compare the advantages and disadvantages of these formats for specific purposes. In a following series of posts, I want to highlight some of the pluses and minuses of the following OHCO2-equivalent formats for representing citable texts:

XML
tabular structures
RDF triples

I’ll tag this series with the label "text data structures".

Wednesday, February 8, 2012

Digital scholarship must be technology-agnostic

As smart phones and tablets assume an ever-larger role in browsing the web, “responsive design” has become a hot topic among web designers. How far is it possible to design a single web site that can adapt its display depending on the characteristics of the reading device? Are there times when it’s simply necessary to maintain separate resources for phones vs. large-screen computers?

Designers of digital scholarship face even more demanding requirements. We know that we will replace our digital technologies, but it’s part of our responsibilities to preserve and transmit the scholarly record we work with. Our predecessors have not always set an ideal example for us. The work of Hellenistic scholars of the Iliad like Aristarchus of Samothrace was originally composed for papyrus scrolls. By the time of our earliest complete manuscripts of the Iliad, in the tenth and eleventh centuries, the standard form of “publication” was the codex, or manuscript book. In a large codex, the wide margins offered invitingly convenient space to annotate the Iliadic text with selected notes from earlier scholars, as we see in the famous Venetus A manuscript.


As a consequence, virtually all ancient scholarship on the Iliad ceased to be copied as separate texts, and is today known to us only from the snippets preserved in these marginal notes, or scholia. The convenience of this early “hypertext” technology led directly to the loss of important scholarly work.

This illustrates a fundamental and somewhat paradoxical principle that should guide all our work on digital scholarship: it must be technology-agnostic. Well designed digital work will be machine-actionable, but will also be capable of expressing its content when moved to other media, even non-digital media.

One area where we must apply this principle rigorously is in our citation practice. It is tempting to yield to the convenience of using a URL to refer to on-line work: after all, with a URL we can immediately see some kind of response in a web browser.

But this convenience is as dangerous as the medieval scribes’ use of the margins of manuscripts for scholia. URLs are addresses: they will change or vanish; more fundamentally, the web that they point to will ultimately vanish (and, on a time scale that looks back to Aristarchus of Samothrace and other scholars of the library at Alexandria, it will certainly vanish sooner rather than later).

I’ve worked over the past several years with colleagues at the Center for Hellenic Studies to develop a URN notation for citing texts. (Some formal documentation is beginning to appear here.) URNs offer a formally specified notation for referring to some kind of resource, without reference to any particular technology. One of my favorite examples is the ISBN, which can be expressed with URN syntax. Many computer applications work with ISBNs: sales clerks in book stores read them with bar-code scanners, and you can search Amazon or bookfinder.com by ISBN, for example. But until a few years ago, I routinely filled out request forms at my college bookstore by hand-writing ISBNs on a paper form, and they functioned perfectly well in that analog environment.

The Canonical Text Service URN (or CTS URN), like an ISBN, is a formally specified machine-parseable reference, but at the same time a simple text string that can be read by human beings and used outside of a digital environment. I have successfully disseminated URNs using chalk on blackboards, and pen on the back of a napkin. But since a CTS URN is also machine-actionable, it can be passed to a Canonical Text Service to retrieve cited passages of text. When our form of citation is not tied to a specific technology, we are free to imagine previously unforeseen re-uses of that material. Would it be handy if the printed copy of a book you want to carry with you were augmented with URNs represented as QR codes you could point your smart phone at to read a cited text? I don’t know, but it would not be difficult to implement. The QR code at the top of this blog entry represents the CTS URN:

urn:cts:greekLit:tlg0012.tlg001:1.1

Here is a link passing the same URN to a Canonical Text Service.
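
Such a link could be assembled along the following lines. This is a sketch under stated assumptions: the base address cts.example.org is a made-up placeholder, not a real service, and the query parameters request and urn are assumed here for illustration.

```python
from urllib.parse import urlencode

# Hypothetical service address; substitute the address of a real CTS installation.
BASE_URL = "https://cts.example.org/api"


def build_request_url(urn: str, request: str = "GetPassage") -> str:
    """Build a URL that passes a CTS URN to a text service."""
    # safe=":" leaves the colons of the URN unescaped so the link stays readable.
    query = urlencode({"request": request, "urn": urn}, safe=":")
    return f"{BASE_URL}?{query}"


print(build_request_url("urn:cts:greekLit:tlg0012.tlg001:1.1"))
```

Note that only the URL depends on a particular server: the URN inside it remains the citation, and can be handed unchanged to any other service or written on a napkin.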
