"Grand Unification Theory" may be a touch grandiose, but the underlying libraries used in the Homer Multitext project now generate RDF statements that fully express all three types of CITE-architecture information: textual archives, archives of data collections, and indices relating citable objects to other citable objects or to raw data. There will be lots of interesting connections to explore in the resulting unified graph of scholarly material.
In parallel with this, I've now implemented the CTS protocol, the CITE Collections Service protocol, and its extension with the CHS Image protocol in servlets drawing on a SPARQL endpoint, so creating a complete CITE environment can be reduced to:
- build all RDF (automatically), and import into a triple store
- drop the three servlets for CITE services into a servlet container
- install the iipsrv FastCGI module for working with binary image data. This is the most troublesome step on many platforms, but happily iipsrv is now available as a package under Debian.
Not bad. Chris Blackwell is preparing an image for the < $50 Raspberry Pi with these requirements preinstalled: a complete CITE Box roughly the size of an Altoids container.
As we review the schemas used in the services this month, we'll begin looking at defining a more permanent RDF vocabulary. I'm not sure at this point if we need to break out a generic CITE vocabulary distinct from a specific HMT vocabulary, or whether one ontology will suffice. We'll be looking at other projects' work: thanks to Joel Kalvesmaki for pointing to the useful list here.
Wednesday, April 17, 2013
GUT
Sunday, April 14, 2013
CITE Collection Inventory
In parallel with Friday's update to the schema for CTS text inventories, CITE Collection inventories now include an optional urn attribute on the schema for Collections. Bump your build system's dependency for the cite library up to 0.12.2 to include this change.
As with the CTS TextInventory, we plan to make the Collection inventory's urn attribute mandatory in 0.13, and will drop the parallel name attribute in 0.14.
Friday, April 12, 2013
Updating the CTS TextInventory schema
Scott Mcphee points out the absurdity of a Canonical Text Service (CTS) definition that uses CTS URNs for all retrieval requests, but doesn't include CTS URNs in the service's TextInventory. The historical explanation for the inconsistency is embarrassingly simple: the TextInventory schema predates the invention of CTS URNs, and has not been revisited since! That oversight is rectified with today's release of version 0.12.1 of the CITE schemas package.
Ultimately, we want to arrive at catalog entries with urn attributes that look like this:
<textgroup urn="urn:cts:greekLit:tlg0012">
<groupname xml:lang="eng">Homeric poetry</groupname>
<work urn="urn:cts:greekLit:tlg0012.tlg001" xml:lang="grc">
<title xml:lang="eng">Iliad</title>
<edition urn="urn:cts:greekLit:tlg0012.tlg001">
<label xml:lang="eng">Allen (OCT 1931)</label>
</edition>
</work>
</textgroup>
With release 0.12.1, the urn attribute is now optional but strongly recommended, alongside the previous projid attribute. With release 0.13.0, the urn attribute will be required, and the projid attribute deprecated. With release 0.14.0, the projid attribute will be dropped.
So grab cite-0.12.1-schemas.zip from our Nexus repository to get started with a modern TextInventory identifying texts by URN. You can manually download a zip bundle from the repository, or update your Maven coordinates with groupId "edu.harvard.chs", artifactId "cite" and version "0.12.1".
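During the transition, it may help to check which catalog entries in an existing inventory still lack a urn attribute. The following is only a minimal sketch: the sample fragment, its projid values, and the function name `missing_urns` are invented for illustration, and it ignores the XML namespace that a real TextInventory declares.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory fragment. Element names follow the catalog example
# in this post; the projid values are placeholders, not real identifiers.
SAMPLE = """
<TextInventory>
  <textgroup urn="urn:cts:greekLit:tlg0012" projid="greekLit:tlg0012">
    <work urn="urn:cts:greekLit:tlg0012.tlg001" projid="greekLit:tlg001">
      <edition projid="greekLit:example"/>
    </work>
  </textgroup>
</TextInventory>
"""

# Catalog elements that are expected to carry a urn attribute.
CATALOG_ELEMENTS = ("textgroup", "work", "edition")


def missing_urns(xml_text):
    """Return (tag, projid) pairs for catalog elements with no urn attribute."""
    root = ET.fromstring(xml_text)
    missing = []
    for elem in root.iter():
        if elem.tag in CATALOG_ELEMENTS and "urn" not in elem.attrib:
            missing.append((elem.tag, elem.get("projid")))
    return missing


if __name__ == "__main__":
    for tag, projid in missing_urns(SAMPLE):
        print(f"{tag} (projid={projid}) has no urn attribute")
```

Entries reported by a check like this are the ones that would stop validating once the urn attribute becomes required.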
[Updated: bumped version from 0.12.0 to 0.12.1 after adding trailing slash to dc namespace as requested by Bridget Almas]
Thursday, April 11, 2013
How hard is it to imagine "popular scholarship"?
I heard an interesting talk yesterday at Clark University by Robert Anderson, former director of the British Museum, on "The British Museum and Library at the New Millennium": wonderful anecdotes from the early history of the museum, and a compelling argument for the essential intellectual unity of what museums and libraries do.
The British Museum Great Court. Photograph by Eric Pouhier, licensed under cc-by-sa license.
Two details troubled me. First, while the rare book library at Clark was filled, I saw only one student, and I probably fell well below the median age of the audience. The talk was sponsored by the "Friends of the Goddard Library," but if this audience was representative, the library won't have too many friends in a few more years.
Second, both Anderson's talk and some of the discussion afterward made some curious assumptions about scholarship. As the director at the time of the separation of the British Library from the Museum, and the opening of the fabulous facility at the new Euston Road location, Anderson offered insightful comments on the tensions of an institution committed both to free public access and to serving the needs of specialist scholars. He brought up a problem familiar to anyone who has worked at the BL recently: it's such a popular place that all the desks fill up early in the morning with students looking for a comfortable place to work (with free wifi and good coffee!), but who aren't necessarily taking advantage of any of the unique offerings of the British Library. This can impose a real hardship on people working on projects that depend on BL material. Two assumptions emerged in the discussion that struck me as odd: that the results of scholarly research would only be of interest to a small circle of specialists; and that digital material should be openly viewable, but scholarly research was being well served by a policy that allows free reuse of scholarly material only in print publications with a very limited print run.
Interior of the British Library. Photograph by Maria Giulia Tolotti licensed under cc-by-sa license.
If we think the goal of scholarly research is to produce high-priced monographs of interest only to other specialists, is it really a surprise that the general reading public sees in the British Library a wonderful café? If we think of "digital access" as a way of entertaining or at best informing a wide public, without inviting scholars to build upon the digital foundations of the BL's collections, is it any wonder that visitors to the BL are not drawn to the library's unique resources, but instead spend their time with the amazing hodgepodge of entertainment and information that populates the internet?
(Footnote: I was able to include the photographs by Eric Pouhier and Maria Giulia Polotti, without regard for how many people might view them, because both are available from Wikimedia Commons under the terms of a cc-by-sa license.)
Sunday, March 17, 2013
CTS is complete under OHCO2
My preceding post promised to compare experiences implementing the Canonical Text Services protocol with three equivalent data structures for text: trees (formatted in XML), tables, and graphs (expressed in RDF). Before turning to the first of these data structures, however, I should expand briefly on the comment in that post that, in developing the CTS protocol, "we relied heavily on the OHCO2 model." More precisely, I mean that we developed CTS so that it fully expresses the semantics of OHCO2: hence the title of the present post.
The CTS protocol uses CTS URNs to cite passages of texts. The semantics of CTS URNs by themselves give us two of the four OHCO2 properties, since a CTS URN specifies where in a citation hierarchy a passage of text is situated, and where in a hierarchy of versions a particular version is situated. A URN like urn:cts:greekLit:tlg0012.tlg001.msA:9.119 for example, refers to a passage set in a version of the Iliad (the work tlg0012.tlg001) identified as msA (i.e., the Venetus A manuscript), and refers to a citable line (119) contained within a citable book (9).
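As an illustration of how much structure the URN itself carries, here is a minimal sketch that splits a CTS URN into its two hierarchies. The names `CtsUrn` and `parse_cts_urn` are hypothetical, not part of any CITE library, and the sketch deliberately ignores ranges and any finer-grained reference syntax.

```python
from typing import NamedTuple, Optional, Tuple


class CtsUrn(NamedTuple):
    namespace: str             # e.g. "greekLit"
    textgroup: str             # top of the version hierarchy
    work: Optional[str]        # notional work within the text group
    version: Optional[str]     # specific edition, translation or manuscript
    passage: Tuple[str, ...]   # citation hierarchy, outermost level first


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into version-hierarchy and citation-hierarchy parts."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace = parts[2]
    work_parts = parts[3].split(".")
    # This sketch accepts one to three levels: textgroup[.work[.version]].
    if not 1 <= len(work_parts) <= 3 or not all(work_parts):
        raise ValueError(f"unexpected work component: {parts[3]!r}")
    textgroup = work_parts[0]
    work = work_parts[1] if len(work_parts) > 1 else None
    version = work_parts[2] if len(work_parts) > 2 else None
    passage = tuple(parts[4].split(".")) if len(parts) == 5 and parts[4] else ()
    return CtsUrn(namespace, textgroup, work, version, passage)


if __name__ == "__main__":
    print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.msA:9.119"))
```

For the Venetus A example, the version hierarchy comes back as `tlg0012`, `tlg001`, `msA` and the citation hierarchy as book `9`, line `119`.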
The remaining two OHCO2 properties are provided by a pair of CTS requests. The GetPrevNext request places a passage within an ordered sequence; the GetPassage request, which returns the contents of the passage, supports a mixed content model.
After some initial experience developing applications built on CTS, Chris Blackwell suggested that it would be convenient for developers to have both GetPrevNext and GetPassage information available via a single request. We introduced the CTS GetPassagePlus request for just this purpose. His intuition is now gratifyingly justified by the observation that the GetPassagePlus request tells us everything about a cited passage of text that the OHCO2 model guarantees.
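The relationship among the three requests can be sketched with an in-memory stand-in for a text. This is only an illustration under stated assumptions: real CTS replies are XML documents served over HTTP, whereas the function names, the dictionary reply, and the placeholder URNs and contents below are all invented.

```python
from typing import Dict, Optional, Tuple

# Invented placeholder passages for one version of a text, in document order.
PASSAGES = [
    ("urn:cts:demo:group.work.ed:1.1", "<l>placeholder <persName>A</persName></l>"),
    ("urn:cts:demo:group.work.ed:1.2", "<l>placeholder B</l>"),
    ("urn:cts:demo:group.work.ed:2.1", "<l>placeholder C</l>"),
]
INDEX = {urn: i for i, (urn, _) in enumerate(PASSAGES)}


def get_passage(urn: str) -> str:
    """Stand-in for GetPassage: return the (mixed) content of the passage."""
    return PASSAGES[INDEX[urn]][1]


def get_prev_next(urn: str) -> Tuple[Optional[str], Optional[str]]:
    """Stand-in for GetPrevNext: URNs of the neighbours in document order."""
    i = INDEX[urn]
    prev_urn = PASSAGES[i - 1][0] if i > 0 else None
    next_urn = PASSAGES[i + 1][0] if i + 1 < len(PASSAGES) else None
    return prev_urn, next_urn


def get_passage_plus(urn: str) -> Dict[str, Optional[str]]:
    """Stand-in for GetPassagePlus: content and neighbours in one reply."""
    prev_urn, next_urn = get_prev_next(urn)
    return {"urn": urn, "content": get_passage(urn), "prev": prev_urn, "next": next_urn}


if __name__ == "__main__":
    print(get_passage_plus("urn:cts:demo:group.work.ed:1.2"))
```

The combined reply carries the URN (both hierarchies), the neighbours (ordering) and the content (mixed content model), which is the sense in which one request covers all four OHCO2 properties.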
Sunday, March 10, 2013
Data structures for texts
My best scholarship that no one has ever read is probably the work I did with Gabe Weaver on the structure of citable texts. (I sense potential for a dinner-party game similar to “Humiliation” in David Lodge’s novel Changing Places…)
We proposed a model of citable text as an ordered hierarchy of citation objects (the “OHCO2” model). In OHCO2, every citable node has four defining properties:
- every node belongs to a citation hierarchy
- every node belongs to a FRBR-like version hierarchy
- nodes belonging to the same version are ordered
- nodes support a mixed content model
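The four properties above can be sketched as a small record type. This is a hypothetical illustration rather than code from any of the CTS implementations: the class name, field names, and sample values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class CitableNode:
    citation: Tuple[str, ...]  # place in the citation hierarchy, e.g. ("9", "119")
    version: Tuple[str, ...]   # place in the version hierarchy (group, work, version)
    sequence: int              # position among nodes of the same version
    content: str               # mixed content, kept here as a markup string


def nodes_in_order(nodes: Iterable[CitableNode], version: Tuple[str, ...]) -> List[CitableNode]:
    """Nodes of one version, sorted into document order."""
    same_version = (n for n in nodes if n.version == version)
    return sorted(same_version, key=lambda n: n.sequence)


def is_within(node: CitableNode, container: Tuple[str, ...]) -> bool:
    """True if the node sits at or below `container` in the citation hierarchy."""
    return node.citation[: len(container)] == container


if __name__ == "__main__":
    ed = ("group", "work", "ed")  # placeholder version identifier
    nodes = [
        CitableNode(("1", "2"), ed, 2, "<l>placeholder B</l>"),
        CitableNode(("1", "1"), ed, 1, "<l>placeholder A</l>"),
        CitableNode(("2", "1"), ed, 3, "<l>placeholder C</l>"),
    ]
    for n in nodes_in_order(nodes, ed):
        print(".".join(n.citation), is_within(n, ("1",)))
```

Ordering is stored explicitly rather than derived from the citation values, since citation labels are not guaranteed to sort correctly on their own.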
As I worked with Gabe, Chris Blackwell and others on both the Canonical Text Services protocol (CTS) and the CTS URN notation, we relied heavily on the OHCO2 model. I have recently completed a new implementation of the CTS protocol — the third of three implementations I have written using three different technologies for working with three completely different representations of text. Since all of the representations are OHCO2 equivalent, we know that they preserve the semantics of citable text, and we can consider other criteria to compare the advantages and disadvantages of these formats for specific purposes. In a following series of posts, I want to highlight some of the pluses and minuses of the following OHCO2-equivalent formats for representing citable texts:
- XML
- tabular structures
- RDF triples
Wednesday, February 27, 2013
The maturity of a discipline
For classical antiquity and the Middle Ages no systematic collection of mathematical or astronomical treatises exists. No attempt has ever been made to compile basic collections comparable to the Loeb Classical Library or the Budé Collection, or Migne's Patrologia, the Monumenta Germaniae Historica, the Bonn Corpus of Byzantine historians, etc. ... This fact alone suffices to show that the so-called 'History of Science' is still operating on an exceedingly primitive level.
Otto Neugebauer, A History of Ancient Mathematical Astronomy, 1975, p. 15