Tuesday, January 7, 2014

Designing scholarly publications: some lessons we can take from programmers

Scholarly publication involves more than just making work accessible.  When scholars publish, they are contributing their work to the collective endeavor of the entire scholarly community.  In order for other scholars to inspect, critique, and build upon published scholarship, it must be appropriately:

  •  identified
  •  verified
  •  structured for reuse
  •  licensed for reuse

Scholars are fortunate that all of these requirements are shared by coders, who have consequently developed well-established practices and tools to satisfy each of them.  The infrastructure that programmers rely on is especially significant for digital scholarship because it has been designed for automated interaction.  For humanists, the shift from creating scholarship designed for manual processing to scholarship designed to be used through the mediation of software and hardware is often an enormous challenge.  My experience working with many collaborators on the Homer Multitext project (HMT) has convinced me that we can greatly accelerate the progress of our digital scholarly work by learning from decades of experience in software development.  (In follow-up posts, I'll illustrate some working examples that exploit the design of the HMT's digital publications.)

Identifying a digital publication

Any unit of publication must be clearly identified with a specific, fixed version.  For a print monograph, this might be an edition number, and possibly also a printing;  journals are normally identified by a date (year, quarter, or other cycle) and a volume or other serial count.  A library catalog might then resolve that reference to a storage location identified by a call number.  Digital publications likewise must be uniquely identified in a system that recognizes different editions or versions, and permits automated resolution of identifiers to a storage location.

Consider what you would typically do if you were writing a Java program, and wanted to use Saxon (a library for processing XSLT).  You can specify the library by its Maven coordinates, giving its "publisher," net.sf.saxon, the name of the "publication," saxon-dom, and an "edition," or version, number (e.g., 8.7).  You would then rely on an automated build process to retrieve a local copy from a repository that recognizes the identifier.  With repository management systems such as Nexus, scholars can use exactly the same system of Maven coordinates to make published material automatically retrievable.  The Homer Multitext project, for example, plans to use a Nexus repository to publish the project's archive of editorial work three times a year.  The publications belong to the group org.homermultitext, and will include a publication named hmtarchive;  versions will have names like 2014.1 (the first publication of the year 2014).
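In a Maven build, both kinds of publication would be declared the same way.  The Saxon coordinates below are the ones given above;  the HMT coordinates follow the naming scheme just described, and should be read as illustrative until the project's repository is live:

```xml
<dependencies>
  <!-- Saxon XSLT processor, identified by its Maven coordinates -->
  <dependency>
    <groupId>net.sf.saxon</groupId>
    <artifactId>saxon-dom</artifactId>
    <version>8.7</version>
  </dependency>
  <!-- HMT archive publication, using exactly the same coordinate system -->
  <dependency>
    <groupId>org.homermultitext</groupId>
    <artifactId>hmtarchive</artifactId>
    <version>2014.1</version>
  </dependency>
</dependencies>
```

An automated build then resolves each coordinate triple (group, name, version) to a stored artifact, with no human intervention.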

Verifying a publication 

One of the distinguishing features of publication is that it has undergone some form of review.  The review process evaluates work which, in principle, ought to be replicable.  Review of digital scholarship should include not only manual evaluation, but, where applicable, automated tests assessing the data.  All of us already apply an automated test whenever we run a spell checker over a text.  When we produce digital work with more complex structure than simply a stream of words, more extensive digital tests are called for, and ought to be included as part of the publication.

One valuable idea we can take from programming practice is "test-driven development."  In test-driven development, the programmer specifies an automated test before beginning to write some section of a program, and works on the program until it passes the test.  Of course in conventional scholarly work, we evaluate work in progress as we go along: we don't submit something for review that we have not thoroughly reviewed ourselves.  But applying a test-driven approach to the editorial work of the HMT has been an eye-opening experience.  Because it compels us to reckon with "minor" irregularities we might otherwise gloss over, it can expose assumptions needing more critical examination.  In the HMT, one test we apply after tokenizing our edited texts, for example, is a morphological analysis of all lexical tokens, on the assumption that failures will represent either Byzantine orthographic practices unrecognized by our parsing system, or errors in our edition.  We were surprised to discover a third explanation:  a number of technical terms appearing in the scholia are in fact not in the standard Greek lexicon by Liddell and Scott, and therefore failed to parse.  When we retroactively applied our morphological tests to sections that had been edited before we adopted a full test-driven approach, we uncovered further examples that, as isolated cases, editors had not noticed.
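The shape of such a test is simple enough to sketch in a few lines of Java.  Everything here is hypothetical — the whitelist stands in for a real morphological analyzer, and the method names are my own — but the logic is the one described above:  every lexical token must either parse or be flagged for an editor's attention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MorphologyTest {
    // Stub standing in for a real morphological analyzer:
    // a tiny whitelist of forms our "parser" recognizes.
    static final Set<String> KNOWN_FORMS = Set.of("μῆνιν", "ἄειδε", "θεά");

    static boolean parses(String token) {
        return KNOWN_FORMS.contains(token);
    }

    // Collect every token that fails analysis, for editorial review.
    static List<String> unparsedTokens(List<String> tokens) {
        List<String> failures = new ArrayList<>();
        for (String t : tokens) {
            if (!parses(t)) {
                failures.add(t);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("μῆνιν", "ἄειδε", "θεά", "ὀστρακισμός");
        // Each failure must be examined: Byzantine orthography, an editing
        // error, or a form missing from the standard lexicon.
        System.out.println(unparsedTokens(tokens)); // prints [ὀστρακισμός]
    }
}
```

The interesting work, of course, is in interpreting the failures, which is exactly where the test forces a human judgment.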

Structuring for reuse

Complex computer programs are possible in part because reusable units of code encapsulate solutions to individual problems.   For example, I should never again have to spend my time writing a program to translate ancient Greek from one encoding system to another, because I can rely on Hugh Cayless' Epidoc transcoding library.  Hugh's code has a clean interface:  define the system you're translating from, the system you're translating to, and then the `getString` method hands you your result.
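That pattern is worth imitating.  The toy class below is not the Epidoc library (its real class and parser names should be taken from its own documentation);  it merely mimics the shape of such an interface:  name the system you are translating from, name the system you are translating to, and ask for the converted string.  Only a fragment of a Beta Code-to-Unicode mapping is implemented here, by way of illustration.

```java
import java.util.Map;

// A toy transcoder mimicking the shape of a clean conversion interface.
// This is NOT the Epidoc transcoder itself, just an illustration.
public class ToyTranscoder {
    private final String from;
    private final String to;

    // A tiny fragment of a Beta Code -> Unicode mapping.
    private static final Map<Character, Character> BETA_TO_UNICODE =
        Map.of('a', 'α', 'b', 'β', 'g', 'γ', 'd', 'δ', 'e', 'ε');

    public ToyTranscoder(String from, String to) {
        this.from = from;
        this.to = to;
    }

    // Hand back the input, converted from the source to the target system.
    public String getString(String input) {
        if (!from.equals("BetaCode") || !to.equals("Unicode")) {
            throw new UnsupportedOperationException("Only BetaCode -> Unicode here");
        }
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            sb.append(BETA_TO_UNICODE.getOrDefault(c, c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ToyTranscoder tc = new ToyTranscoder("BetaCode", "Unicode");
        System.out.println(tc.getString("gade")); // prints γαδε
    }
}
```

The caller never learns how the conversion is done;  the encapsulated solution can be improved, or replaced entirely, without disturbing anyone who reuses it.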

One of the major challenges humanists need to address today is how to design APIs to digital scholarship.  What are the appropriate components or methods, and how should they be identified?  At a minimum, a scholarly publication should address this question in two ways:

  1. Citations of source material should be expressed in technology-independent but machine-actionable notation.  In the absence of an alternative that fully accomplishes this, the Homer Multitext project has developed a URN notation for texts (CTS URNs) and for discrete objects (CITE Object URNs), as well as an extension to CITE Object URNs for resolution-independent citation of regions of interest on an image.
  2. If sections of the publication itself can be cited, they too should be addressable with CTS URNs based on some logical unit (and not by accidental physical features such as page numbers).
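To make the first point concrete:  a CTS URN such as urn:cts:greekLit:tlg0012.tlg001:1.1 identifies a passage (here, Iliad 1.1) purely logically — protocol, namespace, work hierarchy, passage reference — with no dependence on file formats or page layout.  A minimal sketch of splitting such a URN into its components follows;  the field names are my own, not part of a published API, and the sketch handles only URNs that include a passage component.

```java
public class CtsUrn {
    final String namespace;  // e.g. "greekLit"
    final String work;       // e.g. "tlg0012.tlg001" (textgroup.work)
    final String passage;    // e.g. "1.1" (a logical citation, not a page)

    CtsUrn(String urn) {
        // A CTS URN with a passage has five colon-separated parts:
        // urn : cts : namespace : work-hierarchy : passage
        String[] parts = urn.split(":");
        if (parts.length != 5 || !parts[0].equals("urn") || !parts[1].equals("cts")) {
            throw new IllegalArgumentException("Not a CTS URN with a passage: " + urn);
        }
        this.namespace = parts[2];
        this.work = parts[3];
        this.passage = parts[4];
    }

    public static void main(String[] args) {
        CtsUrn u = new CtsUrn("urn:cts:greekLit:tlg0012.tlg001:1.1");
        System.out.println(u.work + " @ " + u.passage); // prints tlg0012.tlg001 @ 1.1
    }
}
```

Because the notation is machine-actionable, software can resolve the same citation against any edition of the work that a repository holds.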

Licensing for reuse

The act of publication alienates a work of scholarship from its author in a form that others can use, and contributes it to the scholarly community.  In addition to an appropriate technological design, scholarly publications must therefore be available under an appropriate license that allows at least non-commercial reuse.  For programmers, the leading such license is the GNU General Public License (GPL), and this is ideal for source code included in a scholarly publication;  for other kinds of digital data, the Creative Commons project has defined Attribution-ShareAlike licenses that achieve the same aim.

Highly trained attorneys around the world have contributed their time and expertise to developing these licenses, and in many instances tailoring them to the specific requirements of local legal systems, as well as translating them into a large number of languages.  The easiest part of designing your publication should be taking advantage of their work and applying one of these licenses to your work.
