Wednesday, April 30, 2014

Publishing digital scholarship

The Holy Cross Manuscripts, Inscriptions and Documents Club (HC MID) has extended its projects’ routine working practice to include a plan for publishing replicable digital scholarship. Generally, projects in HC MID generate three kinds of related material that need to be accounted for in a publication:
  • archival material: TEI-conformant editions of texts, and a variety of data sets in simple delimited text formats
  • analytical material: expository prose in markdown, using URNs to refer to all citable resources
  • source material for user-interfaces: interactive presentations of analytical and archival material as servlets.
All projects in the club already use git for version control in public repositories hosted on github, so it is straightforward to identify and retrieve a specific version of any repository, whether it hosts archival data sets, expository writing, or source material for servlets. In assembling a servlet for end users, the club’s projects use gradle as their build system. Typically, a build task verifies the contents of the archive with a series of automated tests, then generates an RDF graph of the entire project using the citemgr build system. Additional tasks can load the resulting RDF graph into an RDF server, and start up a servlet that knows how to talk to the the RDF server. The entire sequence can be reduced to a single shell script; some projects have even put a boot script of this type on a cron job that rebuilds the project graph and servlet nightly. Every step in the trip from our github repositories to a running application can be fully automated.

The Holy Cross Libraries have recently begun hosting an institutional repository for digital scholarship. Because members of HC MID have presented their work at several international conferences and seminars this year, the library offered to include this work in the new institutional repository, but could not realistically plan to support separate running applications for every digital project that might ever be developed at the college. How can we publish to others fully functional replicas of our digital work through institutional repositories of this kind?

Our solution involves only one addition to our normal working routine. With the emergence of systems like Vagrant, it is simple to define a virtual machine configured with exactly the resources that a particular project requires. We create one further git repository, but it is the smallest of all, since in most cases it consists of little more than a Vagrant file and a shell script to provision the virtual machine. Given that Virtual Box is freely available for essentially any host platform, we can reduce the problem of how to replicate our projects to:
  1. be sure you have Vagrant and Virtual Box installed
  2. download our virtual machine repository, and run vagrant up in its root directory
Our library can happily host versioned releases of these simple git repositories, and add the metadata, indexing and search services that library staff has expertise in. We change nothing in our regular work flow, and add a virtual machine definition when we are ready to publish a release of a project.

As we gain practical experience with this approach to replicable publication, perhaps we will discover shortcomings we do not recognize yet, but as we approach the end of the spring term for 2014, the combination of github version control, automated build systems, virtual machines and institutional repositories seems to cover the complex requirements of publishing digital scholarship as effectively as anything I am familiar with. It cleanly isolates distinct concerns, and relies on generic solutions where they are available.

In any case, the design is not hypothetical. Three projects will publish release versions of their work after the spring semester ends at Holy Cross.


No comments: