Vitruvian design for scholarship in the humanities: CTS


Saturday, March 7, 2015

OHCO2 FTW

Underlying the CTS URN notation is the abstract model of textual structure abbreviated as OHCO2.

The generality of this model is nicely illustrated by recent implementations of the Canonical Text Services (CTS) protocol. The CTS protocol provides retrieval of texts by CTS URN: implementations linked from this page use XML tree structures, relational databases and directed graph stores to store and retrieve texts.

For an essential scholarly concept (identifying a citable passage of text), that's a powerful level of abstraction permitting scholars and developers to select technologies best suited to the specific kind of work they want to pursue with a citable corpus.

Sunday, March 30, 2014

Specs + tests for CTS

In February, Chris Blackwell and I published a first release candidate of version 5.0 of the CTS protocol specification. Today, we are releasing a second release candidate, in parallel with a suite of tests packaged with a servlet that can run the tests and format the resulting report as a web page.

We are currently working on a third release candidate taking account of all the helpful comments we have received so far on rc.1, and plan to continue coordinating releases of the CTS protocol specification with parallel test suites. We expect that rc.3 will be the last candidate version before a final CTS 5.0.

All our released work on the CITE architecture now belongs to a cite-architecture group on github. For a guide to our repositories, see the organization home page on github.

Wednesday, April 17, 2013

GUT

"Grand Unification Theory" may be a touch grandiose, but the underlying libraries used in the Homer Multitext project now generate RDF statements that fully express all three types of CITE-architecture information: textual archives, archives of data collections, and indices relating citable objects to other citable objects or to raw data. There will be lots of interesting connections to explore in the resulting unified graph of scholarly material.

In parallel with this, I've now implemented the CTS protocol, the CITE Collections Service protocol, and its extension with the CHS Image protocol in servlets drawing on a SPARQL endpoint, so creating a complete CITE environment can be reduced to:

- build all RDF (automatically), and import into a triple store
- drop the three servlets for CITE services into a servlet container
- install the iipsrv fastcgi for working with binary image data. This is the most troublesome step on many platforms, but happily iipsrv is now available as a package under debian.

Not bad. Chris Blackwell is preparing an image for the sub-$50 Raspberry Pi with these requirements preinstalled: a complete CITE Box roughly the size of an Altoids container.

As we review the schemas used in the services this month, we'll begin looking at defining a more permanent RDF vocabulary. I'm not sure at this point if we need to break out a generic CITE vocabulary distinct from a specific HMT vocabulary, or whether one ontology will suffice. We'll be looking at other projects' work: thanks to Joel Kalvesmaki for pointing to the useful list here.

Friday, April 12, 2013

Updating the CTS TextInventory schema

Scott Mcphee points out the absurdity of a Canonical Text Service (CTS) definition that uses CTS URNs for all retrieval requests, but doesn't include CTS URNs in the service's TextInventory. The historical explanation for the inconsistency is embarrassingly simple: the TextInventory schema predates the invention of CTS URNs, and has not been revisited since! That oversight is rectified with today's release of version 0.12.1 of the CITE schemas package.

Ultimately, we want to arrive at catalog entries with urn attributes that look like this:

<textgroup urn="urn:cts:greekLit:tlg0012">
  <groupname xml:lang="eng">Homeric poetry</groupname>
  <work urn="urn:cts:greekLit:tlg0012.tlg001" xml:lang="grc">
    <title xml:lang="eng">Iliad</title>
    <edition urn="urn:cts:greekLit:tlg0012.tlg001">
      <label xml:lang="eng">Allen (OCT 1931)</label>
    </edition>
  </work>
</textgroup>

With release 0.12.1, the urn attribute is now optional but strongly recommended, alongside the previous projid attribute. With release 0.13.0, the urn attribute will be required, and the projid attribute deprecated. With release 0.14.0, the projid attribute will be dropped.
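To see where an existing inventory stands against that release schedule, a short script can list the catalog entries that still lack a urn attribute (these would fail under 0.13.0) or still carry projid (these would fail under 0.14.0). This is only a sketch: it checks the three element names shown in the example above, ignores any XML namespace, and the projid values in the sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Element names taken from the example above; a real TextInventory
# may define further catalog levels.
CATALOG_ELEMENTS = {"textgroup", "work", "edition"}


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def audit_inventory(xml_text: str) -> dict:
    """Report catalog elements that lack urn or still carry projid."""
    report = {"missing_urn": [], "has_projid": []}
    for element in ET.fromstring(xml_text).iter():
        name = local_name(element.tag)
        if name not in CATALOG_ELEMENTS:
            continue
        # Label the entry by whichever identifier it carries.
        ident = element.get("urn") or element.get("projid") or "(no identifier)"
        if element.get("urn") is None:
            report["missing_urn"].append((name, ident))
        if element.get("projid") is not None:
            report["has_projid"].append((name, ident))
    return report


# Hypothetical sample: the projid values are invented for illustration.
SAMPLE = """
<textgroup urn="urn:cts:greekLit:tlg0012" projid="greekLit:tlg0012">
  <work projid="greekLit:tlg001"/>
</textgroup>
"""

if __name__ == "__main__":
    print(audit_inventory(SAMPLE))
```

An empty "missing_urn" list suggests the inventory is ready for 0.13.0; an empty "has_projid" list suggests it is ready for 0.14.0.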

So grab cite-0.12.1-schemas.zip from our nexus repository to get started with a modern TextInventory identifying texts by URN. You can manually download a zip bundle from the repository, or update your maven coordinates with groupId "edu.harvard.chs", artifactId "cite" and version "0.12.1".

[Updated: bumped version from 0.12.0 to 0.12.1 after adding trailing slash to dc namespace as requested by Bridget Almas]

Sunday, March 17, 2013

CTS is complete under OHCO2

My preceding post promised to compare experiences implementing the Canonical Text Services protocol with three equivalent data structures for text: trees (formatted in XML), tables, and graphs (expressed in RDF). Before turning to the first of these data structures, however, I should expand briefly on the comment in that post that, in developing the CTS protocol, "we relied heavily on the OHCO2 model." More precisely, I mean that we developed CTS so that it fully expresses the semantics of OHCO2: hence the title of the present post.

The CTS protocol uses CTS URNs to cite passages of texts. The semantics of CTS URNs by themselves give us two of the four OHCO2 properties, since a CTS URN specifies where in a citation hierarchy a passage of text is situated, and where in a hierarchy of versions a particular version is situated. A URN like urn:cts:greekLit:tlg0012.tlg001.msA:9.119 for example, refers to a passage set in a version of the Iliad (the work tlg0012.tlg001) identified as msA (i.e., the Venetus A manuscript), and refers to a citable line (119) contained within a citable book (9).
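The two hierarchies can be read straight off the URN string. The following is a minimal sketch in Python, not part of any CTS library: the class and function names are invented here, and it handles only the forms described in this post (no passage ranges, and a work hierarchy no deeper than the version level).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CtsUrn:
    namespace: str              # e.g. "greekLit"
    textgroup: str              # e.g. "tlg0012"
    work: Optional[str]         # e.g. "tlg001"
    version: Optional[str]      # e.g. "msA"
    passage: Tuple[str, ...]    # citation hierarchy, e.g. ("9", "119")


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its version hierarchy and citation hierarchy."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    work_levels = parts[3].split(".")
    if not namespace or not all(work_levels) or len(work_levels) > 3:
        raise ValueError(f"unsupported work hierarchy in: {urn!r}")
    # Pad so that a URN citing only a text group or a work still parses.
    textgroup, work, version = (work_levels + [None, None])[:3]
    has_passage = len(parts) == 5 and parts[4] != ""
    passage = tuple(parts[4].split(".")) if has_passage else ()
    return CtsUrn(namespace, textgroup, work, version, passage)


if __name__ == "__main__":
    print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.msA:9.119"))
```

For the example URN, the version hierarchy comes out as tlg0012, tlg001, msA, and the citation hierarchy as book 9, line 119.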

The remaining two OHCO2 properties are provided by a pair of CTS requests. The GetPrevNext request places a passage within an ordered sequence; the GetPassage request returning the contents of the passage supports a mixed content model.

After some initial experience developing applications built on CTS, Chris Blackwell suggested that it would be convenient for developers to have both GetPrevNext and GetPassage information available via a single request. We introduced the CTS GetPassagePlus request for just this purpose. His intuition is now gratifyingly justified by the observation that the GetPassagePlus request tells us everything about a cited passage of text that the OHCO2 model guarantees.
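The bookkeeping behind that observation can be written down in a few lines. In this sketch the property labels are shorthand for the four OHCO2 properties, the base URL is a placeholder, and the "request" and "urn" query parameter names are an assumption about how a given CTS installation expects to be called, so check them against the specification before relying on them.

```python
from urllib.parse import urlencode

OHCO2_PROPERTIES = {
    "citation hierarchy",
    "version hierarchy",
    "ordering",
    "mixed content",
}

# Supplied by the semantics of the CTS URN itself.
FROM_URN = {"citation hierarchy", "version hierarchy"}

# Supplied by the reply to each request, as described above.
FROM_REQUEST = {
    "GetPrevNext": {"ordering"},
    "GetPassage": {"mixed content"},
    "GetPassagePlus": {"ordering", "mixed content"},
}


def properties_covered(request: str) -> set:
    """OHCO2 properties known after one request for a given URN."""
    return FROM_URN | FROM_REQUEST[request]


def request_url(base_url: str, request: str, urn: str) -> str:
    """Build a query URL; the parameter names are assumed, not confirmed."""
    return base_url + "?" + urlencode({"request": request, "urn": urn})


if __name__ == "__main__":
    urn = "urn:cts:greekLit:tlg0012.tlg001.msA:9.119"
    for name in FROM_REQUEST:
        missing = sorted(OHCO2_PROPERTIES - properties_covered(name))
        print(name, "is missing:", missing)
    # http://example.org/cts is a placeholder, not a real service.
    print(request_url("http://example.org/cts", "GetPassagePlus", urn))
```

Only GetPassagePlus leaves nothing missing, which is the sense in which a single request tells us everything the model guarantees.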

Sunday, March 10, 2013

Data structures for texts

My best scholarship that no one has ever read is probably the work I did with Gabe Weaver on the structure of citable texts. (I sense potential for a dinner-party game similar to “Humiliation” in David Lodge’s novel Changing Places…)
We proposed a model of citable text as an ordered hierarchy of citation objects (the “OHCO2” model). In OHCO2, every citable node has four defining properties:

- every node belongs to a citation hierarchy
- every node belongs to a FRBR-like version hierarchy
- nodes belonging to the same version are ordered
- nodes support a mixed content model

Two representations of a text that preserve these properties for every citable node are considered equivalent under OHCO2.
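One way to make that equivalence concrete is to reduce each representation to the same ordered list of (URN, content) pairs and compare the results. The sketch below does this for a toy table and a toy tree. Both layouts are invented for illustration, and the content strings are placeholders rather than real text. The URN carries the two hierarchies, list order carries the sequence, and the content string stands in for whatever mixed content a node holds.

```python
from typing import Iterator, List, Tuple

VERSION = "urn:cts:greekLit:tlg0012.tlg001.msA"

# Tabular form: (sequence number, URN, content), deliberately unsorted.
TABLE = [
    (2, VERSION + ":1.2", "<l>placeholder for 1.2</l>"),
    (3, VERSION + ":2.1", "<l>placeholder for 2.1</l>"),
    (1, VERSION + ":1.1", "<l>placeholder for 1.1</l>"),
]

# Tree form: nesting gives the citation hierarchy, child order the sequence.
TREE = {
    "children": [
        {"ref": "1", "children": [
            {"ref": "1", "text": "<l>placeholder for 1.1</l>"},
            {"ref": "2", "text": "<l>placeholder for 1.2</l>"},
        ]},
        {"ref": "2", "children": [
            {"ref": "1", "text": "<l>placeholder for 2.1</l>"},
        ]},
    ]
}


def nodes_from_table(rows) -> List[Tuple[str, str]]:
    """Order rows by sequence number and keep (urn, content)."""
    return [(urn, text) for _, urn, text in sorted(rows)]


def nodes_from_tree(node, version_urn, path=()) -> Iterator[Tuple[str, str]]:
    """Walk the tree in document order, rebuilding each node's URN."""
    if "text" in node:
        yield (version_urn + ":" + ".".join(path), node["text"])
    for child in node.get("children", []):
        yield from nodes_from_tree(child, version_urn, path + (child["ref"],))


if __name__ == "__main__":
    same = nodes_from_table(TABLE) == list(nodes_from_tree(TREE, VERSION))
    print("equivalent under this check:", same)
```

The comparison is deliberately blind to how each format stores its data. It only asks whether the citable nodes, their order, and their contents agree.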
As I worked with Gabe, Chris Blackwell and others on both the Canonical Text Services protocol (CTS) and the CTS URN notation, we relied heavily on the OHCO2 model. I have recently completed a new implementation of the CTS protocol — the third of three implementations I have written using three different technologies for working with three completely different representations of text. Since all of the representations are OHCO2 equivalent, we know that they preserve the semantics of citable text, and we can consider other criteria to compare the advantages and disadvantages of these formats for specific purposes. In a following series of posts, I want to highlight some of the pluses and minuses of the following OHCO2-equivalent formats for representing citable texts:

- XML
- tabular structures
- RDF triples

I’ll tag this series with the label "text data structures".

Friday, February 3, 2012

Unplanned reuse

There’s really only one thing you can do with a book: read it. You can learn from it, cite it or feel that your life has been changed by it, but you can’t directly reuse it (well, apart from making it an accessory piece of furniture, but that doesn’t make use of the contents of the book). One of the distinctive differences of digital scholarship is that, if it is well designed, it can be used for purposes the original author may not have foreseen. The original author may even discover unintended reuse for digital work, as I did recently.

I had been working on an image service using a URN notation to retrieve and view images of the famous Archimedes Palimpsest. Using a URN like

urn:cite:hmt:chsimg.081v–088r_Arch03v_Sinar_pseudo_no-veil

the service lets you do things like

- Retrieve a binary image at a given size. This is bifolio 81v–88r at 50 pixels wide.
- Retrieve a region of interest. This extracts from the same image a region with a mathematical figure, the construction of Archimedes, Floating Bodies 1.proposition.1.
- Open a pannable/zoomable version of the image in a web browser, either with or without a highlighted region of interest. Try these two links to the same bifolio illustrated in the static images above:
  1. with no highlighted region
  2. including highlighting of the mathematical figure

For a course I taught in English translation, I put together a text service, allowing you to retrieve passages of text by canonical reference. With a URN like this

urn:cts:greekLit:tlg0552.tlg008.chs03:1.proposition.1

the service lets you retrieve archival XML source for a passage. This request gets the XML source for Archimedes, Floating Bodies, postulate 1 — not necessarily a thing of beauty to the casual reader of Archimedes. But it’s trivial to associate an XSLT stylesheet to format the archival XML for reading in a browser, so here is the same passage associated with stylesheet for easy reading.

At some point, the penny dropped, and I realized it would also be trivial to mash up the two services. When I started work on the image service, I had not imagined that the digital images of the Greek palimpsest would be of any interest to Greekless readers of Archimedes, but the mathematical figures in the manuscript are extremely important even if you’re reading Thomas Heath’s public-domain English translation.

A minor addition to the XSLT stylesheet uses the markup indicating the presence of canonically identified figures in Heath’s translation to embed references to the image service.

Try this view of book 1, proposition 1, where any reader (Greek scholar or not) now gets to follow the text in Heath’s translation together with images in the only surviving Greek manuscript of Floating Bodies. Images of regions are embedded in the text, and are linked to the zoomable view of the whole bifolio.

Monday, October 11, 2010

Thinking about citation

In digital scholarship, citation is the most fundamental form of ontology. Identifying an entity in a recognized form of citation allows scholars to agree that the object exists, while leaving room to disagree over how to represent it. Once we agree that there is an object I call "the Parthenon in Athens," and can unambiguously cite that object, we allow software to recognize that the same object might be represented in a GIS with spatial data, in a photo gallery with a collection of photographs, or in an architectural database with structured fields of textual data.

For digital scholars, it's hard to talk or even think about scholarly reference systems without getting bogged down in technology-specific tar pits. Take the W3C's Resource Description Framework as an example. Its triple model is brilliantly simple and powerful: whether you think of it as "subject-verb-object" or "object-directed link-object", it is general, extensible, and lends itself readily to both abstract graph models and real machine implementations. Its syntax is expressed in terms of URIs, which could equally well be URLs (real addresses in the "http" scheme), or URNs (abstract names in the "urn" scheme). (See the W3C's document "URI Clarification" or this discussion "Untangle URIs, URLs, and URNs" to sort out the acronyms.) But when was the last time you saw a scholarly project using RDF to describe relations among objects identified by abstract name, rather than by address? In an RDF graph, "http" is a good URI scheme for an application retrieving material on the internet, but a poor choice for a description of relations among persistent and immutable objects, a case where the "urn" scheme would be more appropriate.
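A small sketch may make the name/address distinction concrete. It records one statement about two objects identified by URN, using a passage URN and an image URN quoted in other posts on this blog, and keeps the addresses in a separate lookup table. The predicate URN and the example.org addresses are hypothetical, and plain Python tuples stand in for a real RDF library.

```python
PASSAGE = "urn:cts:greekLit:tlg0552.tlg008.chs03:1.proposition.1"
IMAGE = "urn:cite:hmt:chsimg.081v–088r_Arch03v_Sinar_pseudo_no-veil"

# Hypothetical predicate, invented for this illustration.
ILLUSTRATED_BY = "urn:example:verbs:illustratedBy"

# The graph speaks only of names: (subject, predicate, object).
GRAPH = {(PASSAGE, ILLUSTRATED_BY, IMAGE)}

# Addresses live in a separate table that can change at any time.
# Both URLs are placeholders, not real services.
RESOLVER = {IMAGE: "http://example.org/old-server/image-service"}


def objects(graph, subject, predicate):
    """Return every object related to subject by predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def relocate(resolver, urn, new_address):
    """Move an object to a new address without touching the graph."""
    updated = dict(resolver)
    updated[urn] = new_address
    return updated


if __name__ == "__main__":
    moved = relocate(RESOLVER, IMAGE, "http://example.org/new-server/images")
    for urn in objects(GRAPH, PASSAGE, ILLUSTRATED_BY):
        print(urn, "is now served from", moved[urn])
```

When the image service moves, only the lookup table changes. The statement relating the passage to the image stays valid because it never mentioned an address.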

For canonically citable texts, work at the Center for Hellenic Studies has led to:

1. identifying abstract properties of canonical texts
2. developing a human readable and machine actionable notation for citation that expresses these properties (the CTS URN notation)
3. defining a service for identifying and retrieving texts identified by this notation (the Canonical Text Services protocol)

I've written a bit about this in a dry article on "Digital infrastructure and the Homer Multitext Project" but here I would just note that this three-tiered hierarchy has been very useful both in trying to think about citation outside of any particular technological context, and for defining machine-actionable tests to evaluate implementations of text services. Only level [3] deals with specific technologies on the internet. If you don't want to interact with a Canonical Text Service, the CTS URN notation [2] can still be used by any application referring to canonically cited texts. If you don't like the CTS URN notation, you can still evaluate any alternative notation by seeing whether it implements the abstract properties of [1]. And if you are dissatisfied with the identification of those properties, you can start from scratch and redefine what you think a canonically citable text really is.

I think this graded distinction of abstract property, reference notation, and application could be equally valuable in citing other kinds of material. In a following series of posts, I'll look successively at each level to analyze how we might cite uniquely identified objects.
