Monday, January 27, 2014

Environments for collaborating

Cloud-based services have made collaborating on scholarly material so much easier in the last few years that it’s hard to remember how onerous it used to be. (Raise your hand if you have ever hosted your own shared version control system.)

GitHub is a prime example. In addition to version control, GitHub provides each repository with a wiki, issue tracking, and other services that you can use entirely through your web browser. Edit version-controlled files through the browser or in the comfort of your own computer’s OS, and push them back to a shared repository.

While github solves nearly every challenge of collaborating on static files or data, it does not directly address the question of how to share computational processes. How do we share with collaborators when the goal is not to show the results of a process, but to share the process itself? This, like so many technical challenges in humanities scholarship, is a problem we have in common with programmers who have to collaborate on writing code, and who have kindly provided us with the solution.

Virtual machines are half of the answer. Consumer-level hardware and VM software have reached the point where we can realistically say, “No matter what OS you’re actually using, we’ll just use a VM so we can all work on this project in Ubuntu 12.04.”

The other half of the answer is a system like Vagrant. Vagrant provides a way to specify the configuration of your virtual machine, and can work with many VM systems, including the freely available VirtualBox. The specification is expressed in a simple text file, ideal for sharing from your GitHub repository! So, starting from scratch, new collaborators can perfectly replicate the system you use in your project in these steps:
  1. Make sure Git is installed on their machines: http://git-scm.com/
  2. Install VirtualBox on their machines: https://www.virtualbox.org/
  3. Install Vagrant on their machines: http://www.vagrantup.com/
  4. Run this Vagrant command: vagrant gem install vagrant-vbguest
  5. Clone your project's Git repository, including a Vagrantfile specifying the configuration of your virtual machine (see the minimal example below)
At this point, they can begin any work session within your repository directory by running

vagrant up

to start the virtual machine. (The first boot will be slow as the virtual machine is downloaded and built; after that, it’s tolerable.)
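
For reference, the Vagrantfile itself can be very short. Here is a minimal sketch for the Ubuntu 12.04 setup described above; the box name and download URL are only illustrative, and any provisioning your project needs would go in this same file:

    # Vagrantfile -- minimal sketch; box name and URL are illustrative
    Vagrant.configure("2") do |config|
      # Ubuntu 12.04 ("precise") 64-bit base box
      config.vm.box     = "precise64"
      config.vm.box_url = "http://files.vagrantup.com/precise64.box"
      # The repository directory is shared with the VM at /vagrant by default;
      # this line just makes that explicit.
      config.vm.synced_folder ".", "/vagrant"
    end

Collaborators never need to touch this file to get started; it simply arrives with the rest of the repository when they clone it.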


What is perhaps most remarkable about this sequence is that it imposes only two prior technical requirements on new collaborators: they must be able to use a web browser to download and install VirtualBox and Vagrant (and Git, if they have not already done so); and they must be able to find a terminal or console where they can run a Vagrant command. If that’s too much to demand, maybe it’s time for them to reconsider whether they’re really interested in collaborating on a digital scholarly project.

Monday, January 20, 2014

Markdown everywhere

Think there's a little momentum behind markdown lately?

This article from Mashable is already half a year old, and lists seventy-eight (78!) tools for "writing and previewing markdown"! And its topic doesn't even extend to some of the very interesting services that use markdown, like Leanpub and Draft, or any of the numerous markdown-to-slideshow toolkits out there...

I'm convinced enough that I've just completed an initial version of a tool for working with markdown extended to allow citation by canonical URN values, and for converting the source to generic markdown that any of these tools can process. When I've polished the docs a little more, I'll post here with further notes on markdown and its increasing importance for scholarly work.
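
Purely as an illustration of the idea (this is not necessarily the tool's actual syntax), the source might use an ordinary markdown link whose target is a canonical URN, and leave it to a conversion step to resolve that URN to a concrete URL for whichever service will display the text:

    As [Iliad 1.1](urn:cts:greekLit:tlg0012.tlg001:1.1) reminds us, it all starts with the wrath of Achilles.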

Tuesday, January 7, 2014

Designing scholarly publications: some lessons we can take from programmers

Scholarly publication involves more than just making work accessible.  When scholars publish, they are contributing their work to the collective endeavor of the entire scholarly community.  In order for other scholars to inspect, critique, and build upon published scholarship, it must be appropriately:

  •  identified
  •  verified
  •  structured for reuse
  •  licensed for reuse

Scholars are fortunate that all of these requirements are shared by coders, who have consequently developed well-established practices and tools to satisfy each of them.  The infrastructure that programmers rely on is especially significant for digital scholarship because it has been designed for automated interaction.  For humanists, the shift from creating scholarship designed for manual processing to scholarship designed to be used through the mediation of software and hardware is often an enormous challenge.  My experience working with many collaborators on the Homer Multitext project (HMT) has convinced me that we can greatly accelerate the progress of our digital scholarly work by learning from decades of experience in software development.  (In follow-up posts, I'll illustrate some working examples that exploit the design of the HMT's digital publications.)

Identifying a digital publication

Any unit of publication must be clearly identified with a specific, fixed version.  For a print monograph, this might be an edition number, and possibly also a printing;  journals are normally identified by a date (year, quarter, or other cycle) and a volume or other serial count.  A library catalog might then resolve that reference to a storage location identified by a call number.  Digital publications likewise must be uniquely identified in a system that recognizes different editions or versions, and permits automated resolution of identifiers to a storage location.

Consider what you would typically do if you were writing a Java program, and wanted to use Saxon (a library for processing XSLT).  You can specify the library by its Maven coordinates, giving its "publisher," net.sf.saxon, the name of the "publication," saxon-dom, and an "edition," or version, number (e.g., 8.7).  You would then rely on an automated build process to retrieve a local copy from a repository that recognizes the identifier.  With repository management systems such as Nexus, scholars can use exactly the same system of Maven coordinates to make published material automatically retrievable.  The Homer Multitext project, for example, plans to use a Nexus repository to publish the project's archive of editorial work three times a year.  The publications belong to the group org.homermultitext, and will include a publication named hmtarchive;  versions will have names like 2014.1 (the first publication of the year 2014).
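
In a Maven build, for instance, the Saxon coordinates above translate directly into a dependency declaration in the project's pom.xml, and the build tool takes care of finding and downloading the library:

    <dependency>
      <groupId>net.sf.saxon</groupId>
      <artifactId>saxon-dom</artifactId>
      <version>8.7</version>
    </dependency>

A publication like org.homermultitext:hmtarchive, version 2014.1, would be retrievable in exactly the same way.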

Verifying a publication 

One of the distinguishing features of publication is that it has undergone some form of review.  The review process evaluates work which, in principle, ought to be replicable.  Review of digital scholarship should include not only manual evaluation but, where applicable, automated tests assessing the data.  All of us already apply an automated test whenever we run a spell checker over a text.  When we produce digital work with more complex structure than a simple stream of words, more extensive digital tests are called for, and they ought to be included as part of the publication.

One valuable idea we can take from programming practice is "test-driven development."  In test-driven development, the programmer specifies an automated test before beginning to write some section of a program, and works on the program until it passes the test.  Of course, in conventional scholarly work, we evaluate work in progress as we go along: we don't submit something for review that we have not thoroughly reviewed ourselves.  But applying a test-driven approach to the editorial work of the HMT has been an eye-opening experience.  Because it compels us to reckon with "minor" irregularities we might otherwise gloss over, it can expose assumptions needing more critical examination.  In the HMT, one test we apply after tokenizing our edited texts, for example, is a morphological analysis of all lexical tokens, on the assumption that failures will represent either Byzantine orthographic practices unrecognized by our parsing system, or errors in our edition.  We were surprised to discover a third explanation: a number of technical terms appearing in the scholia are in fact not in the standard Greek lexicon by Liddell and Scott, and therefore failed to parse.  When we retroactively applied our morphological tests to sections that had been edited before we adopted a full test-driven approach, we uncovered further examples that, as isolated cases, editors had not noticed.
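
The morphological analysis itself depends on the HMT's own infrastructure, but the pattern is easy to sketch in a self-contained way. The toy check below (emphatically not the project's real test) runs over every token, flags anything that is not written entirely in Greek script, and fails loudly; it is that "fail loudly on anything unexpected" behavior that surfaces the irregularities described above.

    import java.util.Arrays;
    import java.util.List;

    // A simplified, self-contained sketch of an automated editorial test
    // (not the HMT's actual morphological analysis).
    public class TokenCheck {
        public static void main(String[] args) {
            // In real use, the tokens would be read from the tokenized edition.
            List<String> tokens = Arrays.asList("μῆνιν", "ἄειδε", "θεά", "mh=nin");

            int failures = 0;
            for (String token : tokens) {
                // Require every character to belong to the Greek script.
                if (!token.matches("\\p{IsGreek}+")) {
                    System.err.println("FAILED: " + token);
                    failures++;
                }
            }
            if (failures > 0) {
                throw new AssertionError(failures + " token(s) failed the check");
            }
            System.out.println("All " + tokens.size() + " tokens passed.");
        }
    }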

Structuring for reuse

Complex computer programs are possible in part because reusable units of code encapsulate solutions to individual problems.  For example, I should never again have to spend my time writing a program to translate ancient Greek from one encoding system to another, because I can rely on Hugh Cayless' EpiDoc transcoding library.  Hugh's code has a clean interface: define the system you're translating from, the system you're translating to, and then the `getString` method hands you your result.
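
In code, that whole interaction is only a few lines. This sketch is written from memory, so the package name and the parser/converter labels ("BetaCode", "UnicodeC") may differ slightly in the version you install, but the shape of the interface is exactly as described:

    import edu.unc.epidoc.transcoder.TransCoder;

    public class BetaToUnicode {
        public static void main(String[] args) throws Exception {
            TransCoder tc = new TransCoder();
            tc.setParser("BetaCode");      // system you're translating from (name assumed)
            tc.setConverter("UnicodeC");   // system you're translating to (name assumed)
            // "mh=nin a)/eide qea/" is beta code for the opening words of the Iliad
            System.out.println(tc.getString("mh=nin a)/eide qea/"));
        }
    }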

One of the major challenges humanists need to address today is how to design APIs to digital scholarship.  What are the appropriate components or methods, and how should they be identified?  At a minimum, a scholarly publication should address this question in two ways:

  1. Citations of source material should be expressed in technology-independent but machine-actionable notation.  In the absence of an alternative that fully accomplishes this, the Homer Multitext project has developed a URN notation for texts (CTS URNs) and for discrete objects (CITE Object URNs), as well as an extension to CITE Object URNs for resolution-independent citation of regions of interest on an image.  (An example follows this list.)
  2. If sections of the publication itself can be cited, they too should be addressable with CTS URNs based on some logical unit (and not by accidental physical features such as page numbers).
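
To give one concrete picture of the notation (the citation here is only illustrative), a CTS URN names a text hierarchically by namespace, text group, work, and optionally a specific version, and then appends a logical passage reference. Something along the lines of

    urn:cts:greekLit:tlg0012.tlg001.msA:1.1

cites book 1, line 1 of the Iliad in a specific version (here, an edition of the Venetus A manuscript), while dropping the version component (urn:cts:greekLit:tlg0012.tlg001:1.1) cites the same passage in the abstract, independent of any particular edition. Nothing in either form depends on page numbers, file names, or the technology used to serve the text.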

Licensing for reuse

The act of publication alienates a work of scholarship from its author, putting it in a form that others can use, and contributes it to the scholarly community.  In addition to an appropriate technological design, scholarly publications must therefore be available under an appropriate license, one that allows at least non-commercial reuse.  For programmers, the leading such license is the GNU General Public License (GPL), and this is ideal for source code included in a scholarly publication;  for other kinds of digital data, the Creative Commons project has defined Attribution-ShareAlike licenses that achieve the same aim.

Highly trained attorneys around the world have contributed their time and expertise to developing these licenses, and in many instances tailoring them to the specific requirements of local legal systems, as well as translating them into a large number of languages.  The easiest part of designing your publication should be taking advantage of their work and applying one of these licenses to your work.