Saturday, March 7, 2015


Underlying the CTS URN notation is the abstract model of textual structure abbreviated as OHCO2.

The generality of this model is nicely illustrated by recent implementations of the Canonical Text Services (CTS) protocol.  The CTS protocol provides retrieval of texts by CTS URN:  implementations linked from this page use XML tree structures, relational databases and directed graph stores to store and retrieve texts.

For an essential scholarly concept (identifying a citable passage of text), that's a powerful level of abstraction permitting scholars and developers to select technologies best suited to the specific kind of work they want to pursue with a citable corpus.
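As a rough illustration of how little the notation assumes about storage, here is a minimal Python sketch that splits a CTS URN into its parts. The class and function names are hypothetical, and the sample URN in the usage note (the Iliad, book 1, line 1, in the usual TLG numbering) is an illustrative assumption of mine rather than something cited in this post. The sketch ignores ranges and subreferences.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CtsUrn:
    """Hypothetical container for the parts of a CTS URN."""
    namespace: str                 # e.g. "greekLit"
    work_parts: Tuple[str, ...]    # text group, work, and optionally version
    passage: Optional[str]         # citation such as "1.1", or None if absent


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN on colons; a simplified sketch, not a validator."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work = parts[2], parts[3]
    # The passage component is optional; an empty one counts as absent.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, tuple(work.split(".")), passage)
```

Calling parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1") yields a work hierarchy and a passage reference, and nothing in either says whether the text behind it lives in XML, a relational database or a graph store.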

Saturday, October 25, 2014

Open license + an iPad mini

At the end of Open Access Week, I'd like to salute the library of Leiden University for living up to the goal of making open access the norm in scholarship.  If you work in its very pleasant setting, there are no restrictions on how you make use of out-of-copyright material.

When I visited Leiden earlier this year, I had an iPad mini with me, so I took a few quick snaps of Codex Vossianus Graecus 1, a set of maps (perhaps sixteenth century) to accompany Ptolemy's Geography.  Thanks to the library's policies, I can make images like this available as citable scholarly resources.

Leiden University, Codex Voss.Gr. 1: world map in Ptolemy's first projection

When the phone or tablet you happen to be carrying gives you photos rivaling or surpassing anything published in print, the technology is not much of a barrier.  When the default policy is that you can use your photographs as you see fit, neither is legal licensing.

(To see what's legible in a quick and poorly lit snap from an iPad mini, see this zoomable image of folios 2-3.)

Friday, July 4, 2014

Paleography matters in the Declaration of Independence: a CITE response

My colleague Tom Martin points me to this article in the New York Times, reporting that Danielle Allen at the Institute for Advanced Study in Princeton has questioned the National Archives’ transcription of a crucial phrase in the Declaration of Independence. Are Thomas Jefferson’s “self-evident truths” comprised of individual rights, or do they also include a governmental role “to secure these rights”? Your judgment could hang on whether or not you see a period followed by a long dash or simply a long dash in the original document.

I browsed the National Archives web site, and found that they offer two downloadable images, one a photograph of the original parchment, and another of the 1823 engraving by William Stone, both apparently in the public domain.

So I took a few minutes of my Fourth of July holiday to set up a CITE Image Service where you can browse and create citable references of the images. Here is the detail of the crucial passage in the photograph of the parchment:
Happiness followed by punctuation.
In the Image Collection I created this afternoon, this detail can be cited generically with this URN

and the URN can also be resolved to see the detail in context.

Contrast the Stone engraving:

1823 engraving
(citable as urn:cite:mid:natarchimgs.Declaration_Engrav_Pg1of1_AC@0.465,0.1919,0.076,0.0177, and viewable in context here)
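For anyone who wants to work with references like this in code, here is a minimal Python sketch that takes apart the URN above. It assumes that the four numbers after the @ are the left edge, top edge, width and height of the region, each expressed as a fraction of the image's dimensions; the class name, function name and the pixel dimensions in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CiteImageUrn:
    """Hypothetical container for a CITE image URN with an optional region."""
    namespace: str
    collection: str
    object_id: str
    roi: Optional[Tuple[float, float, float, float]]  # assumed x, y, w, h fractions

    def to_pixels(self, img_width: int, img_height: int) -> Tuple[int, int, int, int]:
        """Scale the fractional region to pixel units for a given image size."""
        if self.roi is None:
            return (0, 0, img_width, img_height)
        x, y, w, h = self.roi
        return (round(x * img_width), round(y * img_height),
                round(w * img_width), round(h * img_height))


def parse_cite_image_urn(urn: str) -> CiteImageUrn:
    """Split 'urn:cite:NAMESPACE:COLLECTION.OBJECT@x,y,w,h' into its parts."""
    base, _, region = urn.partition("@")
    parts = base.split(":")
    if len(parts) != 4 or parts[0] != "urn" or parts[1] != "cite":
        raise ValueError(f"not a CITE URN: {urn!r}")
    collection, _, object_id = parts[3].partition(".")
    roi = None
    if region:
        values = tuple(float(v) for v in region.split(","))
        if len(values) != 4:
            raise ValueError(f"expected four region values, got {region!r}")
        roi = values
    return CiteImageUrn(parts[2], collection, object_id, roi)
```

With the engraving's URN, to_pixels(1000, 1000) would give the rectangle to crop from a hypothetical 1000 by 1000 pixel copy of the image; because the region is stored as fractions, the same citation works for a copy at any resolution.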

With references like this, it would be easy to cite other examples in the document of periods and long dashes, much as participants at last week’s Homer Multitext seminar collated evidence to interpret features of the oldest extant manuscript of the Iliad.

Conclusions? The parchment of the Declaration is hard to read, but paleography is important, and the CITE architecture that was originally created for the Homer Multitext project can be applied to any sort of paleographic problem.

Saturday, May 10, 2014

More reasons to love markdown plus critic markup

Deadlines for senior projects mean that in addition to the interesting challenge of how to submit genuinely replicable digital scholarship to the library's institutional repository, it's time to generate pdfs so that the Graphic Arts Department can bind something for the library shelves.  The projects I'm advising are formatted in markdown extended to support citation by scholarly URN (what I'm calling "citedown").  We wanted to create markdown source that could be used with leanpub, beautiful docs, or pandoc, so the automated workflow has to handle some potentially complex issues resolving URNs, downloading local copies of embedded images and rewriting references to them, etc.

I had been using critic markup for editorial questions and copy editing, but with one eye on the calendar, I wanted to test the pdf workflow before we had a complete draft with all critic markup resolved.

To my surprise, when we used pandoc to lay out the text with a LaTeX book structure, it recognized the critic markup and formatted it in the resulting pdf!  Comments default, appropriately, to a screaming magenta that could have been taken from a 1990s GIS palette.  (Anyone who forgets to run their automated process to find and resolve critic markup will have a hard time missing these.)
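The "find" half of such an automated process can be quite small. Here is a minimal Python sketch, not our actual workflow, that reports any critic markup still left in a markdown string. It assumes the five standard Critic Markup delimiters (addition, deletion, substitution, comment, highlight); the function name is hypothetical.

```python
import re
from typing import List, Tuple

# One pattern per Critic Markup construct; DOTALL lets a span cross line breaks.
CRITIC_PATTERNS = {
    "addition": re.compile(r"\{\+\+(.*?)\+\+\}", re.DOTALL),
    "deletion": re.compile(r"\{--(.*?)--\}", re.DOTALL),
    "substitution": re.compile(r"\{~~(.*?)~>(.*?)~~\}", re.DOTALL),
    "comment": re.compile(r"\{>>(.*?)<<\}", re.DOTALL),
    "highlight": re.compile(r"\{==(.*?)==\}", re.DOTALL),
}


def find_critic_markup(text: str) -> List[Tuple[int, str, str]]:
    """Return (line number, kind, matched span) for each unresolved mark."""
    found = []
    for kind, pattern in CRITIC_PATTERNS.items():
        for match in pattern.finditer(text):
            # Line numbers are 1-based, counted from the start of the span.
            line_no = text.count("\n", 0, match.start()) + 1
            found.append((line_no, kind, match.group(0)))
    return sorted(found)
```

An empty result means the draft is clean; anything else is a list of places to look before the magenta reaches the bindery.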

Pandoc has always been a major reason to love markdown's simplicity.  Now it's one more reason to consider the combination of markdown plus critic markup.

Wednesday, April 30, 2014

Publishing digital scholarship

The Holy Cross Manuscripts, Inscriptions and Documents Club (HC MID) has extended its projects’ routine working practice to include a plan for publishing replicable digital scholarship. Generally, projects in HC MID generate three kinds of related material that need to be accounted for in a publication:
  • archival material: TEI-conformant editions of texts, and a variety of data sets in simple delimited text formats
  • analytical material: expository prose in markdown, using URNs to refer to all citable resources
  • source material for user-interfaces: interactive presentations of analytical and archival material as servlets.
All projects in the club already use git for version control in public repositories hosted on github, so it is straightforward to identify and retrieve a specific version of any repository, whether it hosts archival data sets, expository writing, or source material for servlets. In assembling a servlet for end users, the club’s projects use gradle as their build system. Typically, a build task verifies the contents of the archive with a series of automated tests, then generates an RDF graph of the entire project using the citemgr build system. Additional tasks can load the resulting RDF graph into an RDF server, and start up a servlet that knows how to talk to the RDF server. The entire sequence can be reduced to a single shell script; some projects have even put a boot script of this type on a cron job that rebuilds the project graph and servlet nightly. Every step in the trip from our github repositories to a running application can be fully automated.

The Holy Cross Libraries have recently begun hosting an institutional repository for digital scholarship. Because members of HC MID have presented their work at several international conferences and seminars this year, the library offered to include this work in the new institutional repository, but could not realistically plan to support separate running applications for every digital project that might ever be developed at the college. How can we publish to others fully functional replicas of our digital work through institutional repositories of this kind?

Our solution involves only one addition to our normal working routine. With the emergence of systems like Vagrant, it is simple to define a virtual machine configured with exactly the resources that a particular project requires. We create one further git repository, but it is the smallest of all, since in most cases it consists of little more than a Vagrantfile and a shell script to provision the virtual machine. Given that VirtualBox is freely available for essentially any host platform, we can reduce the problem of how to replicate our projects to:
  1. be sure you have Vagrant and VirtualBox installed
  2. download our virtual machine repository, and run vagrant up in its root directory
Our library can happily host versioned releases of these simple git repositories, and add the metadata, indexing and search services that library staff has expertise in. We change nothing in our regular work flow, and add a virtual machine definition when we are ready to publish a release of a project.
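The two steps are simple enough to wrap in a few lines. Here is a minimal Python sketch, assuming that the vagrant and VBoxManage executables on the PATH are a fair sign that Vagrant and VirtualBox are installed; the function names and the repo_dir parameter are hypothetical, and none of this is part of our published repositories.

```python
import shutil
import subprocess
from typing import Callable, List, Optional

# Executables taken as evidence that Vagrant and VirtualBox are installed.
REQUIRED_TOOLS = ["vagrant", "VBoxManage"]


def missing_tools(which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Step 1: list any required executable that cannot be found on the PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]


def bring_up(repo_dir: str) -> int:
    """Step 2: run 'vagrant up' in the root of a downloaded VM repository."""
    missing = missing_tools()
    if missing:
        raise RuntimeError("please install first: " + ", ".join(missing))
    # The exit status of 'vagrant up' is passed back to the caller.
    return subprocess.run(["vagrant", "up"], cwd=repo_dir).returncode
```

The lookup function is passed in as a parameter only so that the check can be tried out without either tool installed.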

As we gain practical experience with this approach to replicable publication, perhaps we will discover shortcomings we do not recognize yet, but as we approach the end of the spring term for 2014, the combination of github version control, automated build systems, virtual machines and institutional repositories seems to cover the complex requirements of publishing digital scholarship as effectively as anything I am familiar with. It cleanly isolates distinct concerns, and relies on generic solutions where they are available.

In any case, the design is not hypothetical. Three projects will publish release versions of their work after the spring semester ends at Holy Cross.


Tuesday, April 1, 2014

Get funding for your DH project?

Try these regular expressions on the embedded youtube video:

s/our company/our project/g
s/your company/IT staff/g

Sunday, March 30, 2014

Specs + tests for CTS

In February, Chris Blackwell and I released a first release candidate for version 5.0 of the CTS protocol specification. Today, we are releasing a second release candidate, in parallel with a suite of tests packaged with a servlet that can run the tests and format the resulting report in a web page.

We are currently working on a third release candidate taking account of all the helpful comments we have received so far on rc.1, and plan to continue coordinating releases of the CTS protocol specification with parallel test suites. We expect that rc.3 will be the last candidate version before a final CTS 5.0.

All our released work on the CITE architecture now belongs to a cite-architecture group on github. For a guide to our repositories, see the organization home page on github.