Wednesday, April 17, 2013

GUT

"Grand Unification Theory" may be a touch grandiose, but the underlying libraries used in the Homer Muiltitext project now generate RDF statements that fully express all three types of CITE-architecture information:  textual archives, archives of data collections, and indices relating citable objects to other citable objects or to raw data.  There will be lots of interesting connections to explore in the resulting unified graph of scholarly material.

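As a rough illustration of what a "unified graph" buys us, here is a minimal Python sketch that holds one statement of each of the three types in memory and follows the connections between them. The `ex:` predicate names and the `urn:cite:example:` identifiers are hypothetical stand-ins, not the vocabulary or data the HMT libraries actually generate.

```python
from collections import defaultdict

# The CTS URN is the one quoted elsewhere on this blog; the CITE URNs and
# all "ex:" predicates are hypothetical, for illustration only.
PASSAGE = "urn:cts:greekLit:tlg0012.tlg001.msA:9.119"
VERSION = "urn:cts:greekLit:tlg0012.tlg001.msA"
FOLIO = "urn:cite:example:folios.f1"
COLLECTION = "urn:cite:example:folios"

TRIPLES = [
    (PASSAGE, "ex:belongsTo", VERSION),    # textual archive
    (FOLIO, "ex:memberOf", COLLECTION),    # data collection
    (PASSAGE, "ex:appearsOn", FOLIO),      # index relating two citable objects
]

def neighbors(triples):
    """Build an undirected adjacency map from subject/object pairs."""
    graph = defaultdict(set)
    for subject, _predicate, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return graph

def connected(graph, start):
    """Return every node reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

if __name__ == "__main__":
    for node in sorted(connected(neighbors(TRIPLES), COLLECTION)):
        print(node)
```

Starting from the collection, the traversal reaches the text passage only because the index statement links the two archives; in a real triple store the same question would be put to the SPARQL endpoint instead.
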
In parallel with this, I've now implemented the CTS protocol, the CITE Collections Service protocol, and its extension with the CHS Image protocol in servlets drawing on a SPARQL endpoint, so creating a complete CITE environment can be reduced to:

- build all RDF (automatically), and import into a triple store
- drop the three servlets for CITE services into a servlet container
- install the iipsrv FastCGI module for working with binary image data.  This is the most troublesome step on many platforms, but happily iipsrv is now available as a package under Debian.

Not bad.  Chris Blackwell is preparing an image for the sub-$50 Raspberry Pi with these requirements preinstalled:  a complete CITE Box roughly the size of an Altoids container.

As we review the schemas used in the services this month, we'll begin looking at defining a more permanent RDF vocabulary.  I'm not sure at this point if we need to break out a generic CITE vocabulary distinct from a specific HMT vocabulary, or whether one ontology will suffice.  We'll be looking at other projects' work:  thanks to Joel Kalvesmaki for pointing to the useful list here.





Sunday, April 14, 2013

CITE Collection Inventory

In parallel with Friday's update to the schema for CTS text inventories, CITE Collection inventories now include an optional urn attribute on the schema for Collections.  Bump your build system's dependency for the cite library up to 0.12.2 to include this change.

As with the CTS TextInventory, we plan to make the Collection inventory's urn attribute mandatory in 0.13, and will drop the parallel name attribute in 0.14.



Friday, April 12, 2013

Updating the CTS TextInventory schema

Scott Mcphee points out the absurdity of a Canonical Text Service (CTS) definition that uses CTS URNs for all retrieval requests, but doesn't include CTS URNs in the service's TextInventory.  The historical explanation for the inconsistency is embarrassingly simple:  the TextInventory schema predates the invention of CTS URNs, and has not been revisited since!  That oversight is rectified with today's release of version 0.12.1 of the CITE schemas package.

Ultimately, we want to arrive at catalog entries with urn attributes that look like this:


<textgroup urn="urn:cts:greekLit:tlg0012">
 <groupname xml:lang="eng">Homeric poetry</groupname>
 <work urn="urn:cts:greekLit:tlg0012.tlg001" xml:lang="grc">
  <title xml:lang="eng">Iliad</title>
  <edition urn="urn:cts:greekLit:tlg0012.tlg001">
   <label xml:lang="eng">Allen (OCT 1931)</label>
  </edition>
 </work>
</textgroup>


With release 0.12.1, the urn attribute is now optional but strongly recommended, alongside the previous projid attribute.  With release 0.13.0, the urn attribute will be required, and the projid attribute deprecated.  With release 0.14.0, the projid attribute will be dropped.
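
The release schedule above can be sketched as a small checker. This is a minimal Python illustration, not the schema itself: it assumes an un-namespaced fragment like the one shown above, treats only `textgroup`, `work` and `edition` as carrying the attributes, and the `projid` values in the sample are invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: only these elements carry urn/projid attributes.
CATALOG_ELEMENTS = ("textgroup", "work", "edition")

# Sample fragment; the projid values are hypothetical.
SAMPLE = """
<textgroup urn="urn:cts:greekLit:tlg0012" projid="greekLit:tlg0012">
 <groupname xml:lang="eng">Homeric poetry</groupname>
 <work projid="greekLit:tlg001" xml:lang="grc">
  <title xml:lang="eng">Iliad</title>
 </work>
</textgroup>
"""

def parse_release(text):
    """Turn '0.13.0' into (0, 13, 0) so releases compare numerically."""
    return tuple(int(part) for part in text.split("."))

def check_inventory(fragment, release):
    """List the ways a fragment falls short of the rules for a release."""
    rel = parse_release(release)
    problems = []
    for el in ET.fromstring(fragment).iter():
        if el.tag not in CATALOG_ELEMENTS:
            continue
        if "urn" not in el.attrib:
            if rel >= (0, 13, 0):
                problems.append(f"<{el.tag}> needs a urn attribute as of 0.13.0")
            else:
                problems.append(f"<{el.tag}> has no urn (optional but recommended)")
        if "projid" in el.attrib:
            if rel >= (0, 14, 0):
                problems.append(f"<{el.tag}> still has projid, dropped in 0.14.0")
            elif rel >= (0, 13, 0):
                problems.append(f"<{el.tag}> uses projid, deprecated in 0.13.0")
    return problems

if __name__ == "__main__":
    for release in ("0.12.1", "0.13.0", "0.14.0"):
        print(release, check_inventory(SAMPLE, release))
```

Running the checker against an inventory before bumping the dependency shows which entries would stop validating at each step.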

So grab cite-0.12.1-schemas.zip from our Nexus repository to get started with a modern TextInventory identifying texts by URN.  You can manually download a zip bundle from the repository, or update your Maven coordinates with groupId "edu.harvard.chs", artifactId "cite" and version "0.12.1".

[Updated:  bumped version from 0.12.0 to 0.12.1 after adding trailing slash to dc namespace as requested by Bridget Almas]




Thursday, April 11, 2013

How hard is it to imagine "popular scholarship"?

I heard an interesting talk yesterday at Clark University by Robert Anderson, former director of the British Museum, on "The British Museum and Library at the New Millennium":  wonderful anecdotes from the early history of the museum, and a compelling argument for the essential intellectual unity of what museums and libraries do.

The British Museum Great Court.
Photograph by Eric Pouhier,
licensed under cc-by-sa license.

Two details troubled me.  First, while the rare book library at Clark was filled, I saw only one student, and I probably fell well below the median age of the audience.  The talk was sponsored by the "Friends of the Goddard Library," but if this audience was representative, the library won't have too many friends in a few more years.

Second, both Anderson's talk and some of the discussion afterward made some curious assumptions about scholarship.  As the director at the time of the separation of the British Library from the Museum, and the opening of the fabulous facility at the new Euston Road location, Anderson offered insightful comments on the tensions of an institution committed both to free public access and to serving the needs of specialist scholars.  He brought up a problem familiar to anyone who has worked at the BL recently:  it's such a popular place that all the desks fill up early in the morning with students looking for a comfortable place to work (with free wifi and good coffee!), but who aren't necessarily taking advantage of any of the unique offerings of the British Library.  This can impose a real hardship on people working on projects that depend on BL material.  Two assumptions emerged in the discussion that struck me as odd:  that the results of scholarly research would only be of interest to a small circle of specialists; and that digital material should be openly viewable, but that scholarly research was being well served by a policy that allows free reuse of scholarly material only in print publications with a very limited print run.

Interior of the British Library.
Photograph by Maria Giulia Tolotti
licensed under cc-by-sa license.

Let's parse that logic a little more closely:  scholarly reuse of BL material is OK as long as not too many people care to read it;  and that's fine, because scholars' research is only of interest to a handful of other specialists, and expensive print media are an adequate way to meet this need.  (The host's introduction of Anderson referred light-heartedly, in what was evidently intended to be humor, to the fact that his most recent multi-volume publication costs hundreds of dollars.)

If we think the goal of scholarly research is to produce high-priced monographs of interest only to other specialists, is it really a surprise that the general reading public sees in the British Library a wonderful café?  If we think of "digital access" as a way of entertaining or at best informing a wide public, without inviting scholars to build upon the digital foundations of the BL's collections, is it any wonder that visitors to the BL are not drawn to the library's unique resources, but instead spend their time with the amazing hodgepodge of entertainment and information that populates the internet?

(Footnote:  I was able to include the photographs by Eric Pouhier and Maria Giulia Tolotti, without regard for how many people might view them, because both are available from Wikimedia Commons under the terms of a cc-by-sa license.)

Sunday, March 17, 2013

CTS is complete under OHCO2


My preceding post promised to compare experiences implementing the Canonical Text Services protocol with three equivalent data structures for text:  trees (formatted in XML), tables, and graphs (expressed in RDF).    Before turning to the first of these data structures, however, I should expand briefly on the comment in that post that, in developing the CTS protocol, "we relied heavily on the OHCO2 model."  More precisely, I mean that we developed CTS so that it fully expresses the semantics of OHCO2:  hence the title of the present post.

The CTS protocol uses CTS URNs to cite passages of texts.  The semantics of CTS URNs by themselves give us two of the four OHCO2 properties, since a CTS URN specifies where in a citation hierarchy a passage of text is situated, and where in a hierarchy of versions a particular version is situated.  A URN like urn:cts:greekLit:tlg0012.tlg001.msA:9.119 for example, refers to a passage set in a version of the Iliad (the work tlg0012.tlg001) identified as msA (i.e., the Venetus A manuscript), and refers to a citable line (119) contained within a citable book (9).
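
To show how much a CTS URN says on its own, here is a minimal Python sketch that splits one into its version hierarchy and its citation hierarchy. It is a simplification under stated assumptions: it handles only the shape of the example above and ignores other notation the URN syntax allows, such as passage ranges. The `CtsUrn` class and `parse_cts_urn` function are names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class CtsUrn:
    namespace: str
    textgroup: str
    work: Optional[str]        # None when the URN stops at the text group
    version: Optional[str]     # None when the URN stops at the notional work
    passage: Tuple[str, ...]   # citation hierarchy, outermost level first

def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a simple CTS URN into version and citation hierarchies."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    levels = parts[3].split(".")
    if not namespace or not all(levels) or len(levels) > 3:
        raise ValueError(f"bad work component in {urn!r}")
    # Pad so that missing work/version levels come out as None.
    textgroup, work, version = (levels + [None, None])[:3]
    passage = tuple(parts[4].split(".")) if len(parts) == 5 else ()
    if not all(passage):
        raise ValueError(f"bad passage component in {urn!r}")
    return CtsUrn(namespace, textgroup, work, version, passage)

if __name__ == "__main__":
    print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.msA:9.119"))
```

The two tuples of levels, one read from the work component and one from the passage component, correspond to the two hierarchies the URN encodes.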

The remaining two OHCO2 properties are provided by a pair of CTS requests.  The GetPrevNext request places a passage within an ordered sequence;  the GetPassage request, which returns the contents of the passage, supports a mixed content model.

After some initial experience developing applications built on CTS, Chris Blackwell suggested that it would be convenient for developers to have both GetPrevNext and GetPassage information available via a single request.  We introduced the CTS GetPassagePlus request for just this purpose.  His intuition is now gratifyingly justified by the observation that the GetPassagePlus request tells us everything about a cited passage of text that the OHCO2 model guarantees.
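
The relation among the three requests can be sketched in a few lines of Python over an in-memory list of citable nodes. This is only an illustration of the idea: the function names mirror the request names but are not the servlets' API, the dictionary reply is a stand-in for whatever the protocol actually returns, and the passage contents are placeholders rather than real text.

```python
# Ordered citable nodes of one version; contents are placeholders.
BASE = "urn:cts:greekLit:tlg0012.tlg001.msA:"
NODES = [(BASE + ref, f"(text of line {ref})") for ref in ("9.118", "9.119", "9.120")]
INDEX = {urn: position for position, (urn, _text) in enumerate(NODES)}

def get_passage(urn):
    """Contents of the cited passage."""
    return NODES[INDEX[urn]][1]

def get_prev_next(urn):
    """URNs of the preceding and following passages (None at either end)."""
    i = INDEX[urn]
    prev_urn = NODES[i - 1][0] if i > 0 else None
    next_urn = NODES[i + 1][0] if i + 1 < len(NODES) else None
    return prev_urn, next_urn

def get_passage_plus(urn):
    """Contents and ordering information in a single reply."""
    prev_urn, next_urn = get_prev_next(urn)
    return {"urn": urn, "passage": get_passage(urn),
            "prev": prev_urn, "next": next_urn}

if __name__ == "__main__":
    print(get_passage_plus(BASE + "9.119"))
```

The combined reply carries the URN (citation and version hierarchies), the neighbours (ordering) and the contents, which is why one request is enough to recover all four OHCO2 properties.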





Sunday, March 10, 2013

Data structures for texts

My best scholarship that no one has ever read is probably the work I did with Gabe Weaver on the structure of citable texts. (I sense potential for a dinner-party game similar to “Humiliation” in David Lodge’s novel Changing Places…)

We proposed a model of citable text as an ordered hierarchy of citation objects (the “OHCO2” model). In OHCO2, every citable node has four defining properties:

  • every node belongs to a citation hierarchy
  • every node belongs to a FRBR-like version hierarchy
  • nodes belonging to the same version are ordered
  • nodes support a mixed content model

Two representations of a text that preserve these properties for every citable node are considered equivalent under OHCO2.

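
One way to picture the four properties and the equivalence test is the following minimal Python sketch. It reduces a table and a nested tree to the same set of nodes and compares them. The `CitableNode` class, the helper names and the placeholder contents are all invented for this illustration, and markup is simply kept as a string to stand in for mixed content.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CitableNode:
    version: str     # place in the version hierarchy
    citation: tuple  # place in the citation hierarchy, e.g. ("9", "119")
    sequence: int    # position among nodes of the same version
    content: str     # mixed content, held here as a string with markup

def nodes_from_rows(rows):
    """Tabular form: one (version, reference, sequence, content) row per node."""
    return {CitableNode(version, tuple(ref.split(".")), seq, content)
            for version, ref, seq, content in rows}

def nodes_from_tree(version, tree):
    """Tree form: nested dicts whose leaves are node contents, in document order."""
    nodes = set()
    counter = 0
    def walk(subtree, path):
        nonlocal counter
        for key, value in subtree.items():
            if isinstance(value, dict):
                walk(value, path + (key,))
            else:
                nodes.add(CitableNode(version, path + (key,), counter, value))
                counter += 1
    walk(tree, ())
    return nodes

def ohco2_equivalent(a, b):
    """Equivalent when every node agrees on all four properties."""
    return a == b

VERSION = "tlg0012.tlg001.msA"
ROWS = [
    (VERSION, "9.118", 0, "(placeholder for line 118)"),
    (VERSION, "9.119", 1, "(placeholder with <emph>markup</emph>)"),
]
TREE = {"9": {"118": "(placeholder for line 118)",
              "119": "(placeholder with <emph>markup</emph>)"}}

if __name__ == "__main__":
    print(ohco2_equivalent(nodes_from_rows(ROWS), nodes_from_tree(VERSION, TREE)))
```

Swapping the order of the two lines in the tree changes their sequence numbers, so the comparison fails even though the citation values and contents are unchanged.
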
As I worked with Gabe, Chris Blackwell and others on both the Canonical Text Services protocol (CTS) and the CTS URN notation, we relied heavily on the OHCO2 model. I have recently completed a new implementation of the CTS protocol, the third of three implementations I have written using three different technologies for working with three completely different representations of text. Since all of the representations are OHCO2-equivalent, we know that they preserve the semantics of citable text, and we can consider other criteria to compare the advantages and disadvantages of these formats for specific purposes. In a following series of posts, I want to highlight some of the pluses and minuses of the following OHCO2-equivalent formats for representing citable texts:
  1. XML
  2. tabular structures
  3. RDF triples

I’ll tag this series with the label "text data structures".

Wednesday, February 27, 2013

The maturity of a discipline

If a scholarly community cannot identify what material it studies, it has not yet matured to a point where digital technology matters very much:  scholarly discussion requires being able to cite evidence.  I worked on the CITE architecture for scholarly reference in part because I come from a background in Classics where, by and large, we have a reasonable tradition of citing works by logical, canonical reference schemes.  (Of course there are exceptions, like that little corpus of Plato that we continue to cite, bizarrely, by physical pages in the sixteenth-century edition of Stephanus...)

The suggestion that scholars need to be able to identify and cite their evidence seems to me a pretty minimal measure of the maturity of a discipline, but if I hold up classicists' conventions as a positive example of canonical citation, people occasionally misunderstand this as an elitist attack on their field of study.  No:  I want to apply the same standard to subjects I work on.

For example, in the study of ancient science, we have not advanced much beyond the situation described almost 40 years ago by Neugebauer (never one to sugar coat his judgment of the state of scholarship) as follows:
For classical antiquity and the Middle Ages no systematic collection of mathematical or astronomical treatises exists.  No attempt has ever been made to compile basic collections comparable to the Loeb Classical Library or the Budé Collection, or Migne's Patrologia, the Monumenta Germaniae Historica, the Bonn Corpus of Byzantine historians, etc.  ... This fact alone suffices to show that the so-called 'History of Science' is still operating on an exceedingly primitive level.
A History of Ancient Mathematical Astronomy, 1975, p. 15
Can we leap straight to a digital corpus for ancient science, like a third-world country bypassing costly and slow expansion of landlines and immediately delivering phone service through cellular networks?