Saturday, October 25, 2014

Open license + an iPad mini

At the end of Open Access Week, I'd like to salute the library of Leiden University for living up to the goal of making open access the norm in scholarship.  If you work in its very pleasant setting, there are no restrictions on how you make use of out-of-copyright material.

When I visited Leiden earlier this year, I had an iPad mini with me, so I took a few quick snaps of Codex Vossianus Graecus 1, a set of maps (perhaps sixteenth century) to accompany Ptolemy's Geography.  Thanks to the library's policies, I can make images like this available as citable scholarly resources.

Leiden University, Codex Voss.Gr. 1: world map in Ptolemy's first projection

When the phone or tablet you happen to be carrying gives you photos rivaling or surpassing anything published in print, the technology is not much of a barrier.  When the default policy is that you can use your photographs as you see fit, neither is legal licensing.
 

(To see what's legible in a quick and poorly lit snap from an iPad mini,  see this zoomable image of folios 2-3.)


Friday, July 4, 2014

Paleography matters in the Declaration of Independence: a CITE response

My colleague Tom Martin points me to this article in the New York Times, reporting that Danielle Allen at the Institute for Advanced Study in Princeton has questioned the National Archives’ transcription of a crucial phrase in the Declaration of Independence. Are Thomas Jefferson’s “self-evident truths” comprised of individual rights, or do they also include a governmental role “to secure these rights”? Your judgment could hang on whether you see a period followed by a long dash or simply a long dash in the original document.

I browsed the National Archives web site, and found that they offer two downloadable images, one a photograph of the original parchment, and another of the 1823 engraving by William Stone, both apparently in the public domain.

So I took a few minutes of my Fourth of July holiday to set up a CITE Image Service where you can browse and create citable references of the images. Here is the detail of the crucial passage in the photograph of the parchment:
Happiness followed by punctuation.
In the Image Collection I created this afternoon, this detail can be cited generically with this URN

urn:cite:mid:natarchimgs.Declaration_Pg1of1_AC@0.472,0.1872,0.082,0.0213
and the URN can also be resolved to see the detail in context.
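To illustrate how a reference like this can be processed by software, here is a minimal Python sketch that splits an image URN into its parts and converts the region of interest to pixels. It assumes that the four numbers after the @ are left, top, width and height, expressed as fractions of the full image; the function names and the image dimensions in the usage note are invented for the example and are not part of any CITE library.

```python
from typing import NamedTuple, Optional, Tuple


class ImageCitation(NamedTuple):
    namespace: str
    collection: str
    image: str
    # Assumed order: left, top, width, height, as fractions of the image.
    roi: Optional[Tuple[float, ...]]


def parse_image_urn(urn: str) -> ImageCitation:
    """Split a CITE image URN into its parts (illustrative, not a validator)."""
    base, _, roi_text = urn.partition("@")
    prefix, scheme, namespace, obj = base.split(":", 3)
    if (prefix, scheme) != ("urn", "cite"):
        raise ValueError(f"not a CITE URN: {urn}")
    collection, _, image = obj.partition(".")
    roi = tuple(float(n) for n in roi_text.split(",")) if roi_text else None
    if roi is not None and len(roi) != 4:
        raise ValueError(f"expected four numbers in region of interest: {urn}")
    return ImageCitation(namespace, collection, image, roi)


def roi_to_pixels(roi, width, height):
    """Scale a fractional region to pixel values for an image of a given size."""
    left, top, w, h = roi
    return (round(left * width), round(top * height),
            round(w * width), round(h * height))
```

For a hypothetical image 1000 pixels wide and 10000 pixels high, roi_to_pixels applied to the region in the URN above would give a box starting at (472, 1872) that is 82 pixels wide and 213 high. Because the region is expressed as fractions, the same citation works for any resolution of the image.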

Contrast the Stone engraving:


1823 engraving
(citable as urn:cite:mid:natarchimgs.Declaration_Engrav_Pg1of1_AC@0.465,0.1919,0.076,0.0177, and viewable in context here)

With references like this, it would be easy to cite other examples in the document of periods and long dashes, much as participants at last week’s Homer Multitext seminar collated evidence to interpret features of the oldest extant manuscript of the Iliad.

Conclusions? The parchment of the Declaration is hard to read, but paleography is important, and the CITE architecture that was originally created for the Homer Multitext project can be applied to any sort of paleographic problem.

Saturday, May 10, 2014

More reasons to love markdown plus critic markup

Deadlines for senior projects mean that in addition to the interesting challenge of how to submit genuinely replicable digital scholarship to the library's institutional repository, it's time to generate pdfs so that the Graphic Arts Department can bind something for the library shelves.  The projects I'm advising are formatted in markdown extended to support citation by scholarly URN (what I'm calling "citedown").  We wanted to create markdown source that could be used with leanpub, beautiful docs, or pandoc, so the automated workflow has to handle some potentially complex issues resolving URNs, downloading local copies of embedded images and rewriting references to them, etc.

I had been using critic markup for editorial questions and copy editing, but with one eye on the calendar, I wanted to test the pdf workflow before we had a complete draft with all critic markup resolved.

To my surprise, when we used pandoc to lay out the text with a LaTeX book structure, it recognized the critic markup and formatted it in the resulting pdf!  Comments default, appropriately, to a screaming magenta that could have been taken from a 1990s GIS palette.  (Anyone who forgets to run their automated process to find and resolve critic markup will have a hard time missing these.)
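An automated check for leftover critic markup can be quite small. The following Python sketch is one possible version, not the process used in these projects: it looks for the five CriticMarkup constructs (addition, deletion, substitution, highlight, comment) and reports the line on which each one starts. The function name is invented for the example.

```python
import re

# The five CriticMarkup constructs, matched non-greedily so that
# several spans on one line are reported separately.
CRITIC_PATTERN = re.compile(
    r"\{\+\+.*?\+\+\}"   # {++ addition ++}
    r"|\{--.*?--\}"      # {-- deletion --}
    r"|\{~~.*?~~\}"      # {~~ old ~> new ~~}
    r"|\{==.*?==\}"      # {== highlight ==}
    r"|\{>>.*?<<\}",     # {>> comment <<}
    re.DOTALL,           # spans may run across line breaks
)


def unresolved_markup(text):
    """Return (line_number, span) for every CriticMarkup span in text."""
    found = []
    for match in CRITIC_PATTERN.finditer(text):
        line_number = text.count("\n", 0, match.start()) + 1
        found.append((line_number, match.group(0)))
    return found
```

Running unresolved_markup over each markdown source file before generating the pdf, and stopping the build if the result is not empty, would catch the magenta comments before they reach the binder.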

Pandoc has always been a major reason to love markdown's simplicity.  Now it's one more reason to consider the combination of markdown plus critic markup.


Wednesday, April 30, 2014

Publishing digital scholarship

The Holy Cross Manuscripts, Inscriptions and Documents Club (HC MID) has extended its projects’ routine working practice to include a plan for publishing replicable digital scholarship. Generally, projects in HC MID generate three kinds of related material that need to be accounted for in a publication:
  • archival material: TEI-conformant editions of texts, and a variety of data sets in simple delimited text formats
  • analytical material: expository prose in markdown, using URNs to refer to all citable resources
  • source material for user-interfaces: interactive presentations of analytical and archival material as servlets.
All projects in the club already use git for version control in public repositories hosted on github, so it is straightforward to identify and retrieve a specific version of any repository, whether it hosts archival data sets, expository writing, or source material for servlets. In assembling a servlet for end users, the club’s projects use gradle as their build system. Typically, a build task verifies the contents of the archive with a series of automated tests, then generates an RDF graph of the entire project using the citemgr build system. Additional tasks can load the resulting RDF graph into an RDF server, and start up a servlet that knows how to talk to the RDF server. The entire sequence can be reduced to a single shell script; some projects have even put a boot script of this type on a cron job that rebuilds the project graph and servlet nightly. Every step in the trip from our github repositories to a running application can be fully automated.

The Holy Cross Libraries have recently begun hosting an institutional repository for digital scholarship. Because members of HC MID have presented their work at several international conferences and seminars this year, the library offered to include this work in the new institutional repository, but could not realistically plan to support separate running applications for every digital project that might ever be developed at the college. How can we publish to others fully functional replicas of our digital work through institutional repositories of this kind?

Our solution involves only one addition to our normal working routine. With the emergence of systems like Vagrant, it is simple to define a virtual machine configured with exactly the resources that a particular project requires. We create one further git repository, but it is the smallest of all, since in most cases it consists of little more than a Vagrant file and a shell script to provision the virtual machine. Given that Virtual Box is freely available for essentially any host platform, we can reduce the problem of how to replicate our projects to:
  1. be sure you have Vagrant and Virtual Box installed
  2. download our virtual machine repository, and run vagrant up in its root directory
Our library can happily host versioned releases of these simple git repositories, and add the metadata, indexing and search services that library staff has expertise in. We change nothing in our regular work flow, and add a virtual machine definition when we are ready to publish a release of a project.

As we gain practical experience with this approach to replicable publication, perhaps we will discover shortcomings we do not recognize yet, but as we approach the end of the spring term for 2014, the combination of github version control, automated build systems, virtual machines and institutional repositories seems to cover the complex requirements of publishing digital scholarship as effectively as anything I am familiar with. It cleanly isolates distinct concerns, and relies on generic solutions where they are available.

In any case, the design is not hypothetical. Three projects will publish release versions of their work after the spring semester ends at Holy Cross.


Tuesday, April 1, 2014

Get funding for your DH project?

Try these regular expressions on the embedded youtube video:

s/our company/our project/g
s/your company/IT staff/g
s/expert/developer/g







Sunday, March 30, 2014

Specs + tests for CTS

In February, Chris Blackwell and I published a first release candidate of version 5.0 of the CTS protocol specification. Today, we are releasing a second release candidate, in parallel with a suite of tests packaged with a servlet that can run the tests and format the resulting report in a web page.

We are currently working on a third release candidate taking account of all the helpful comments we have received so far on rc.1, and plan to continue coordinating releases of the CTS protocol specification with parallel test suites. We expect that rc.3 will be the last candidate version before a final CTS 5.0.


All our released work on the CITE architecture now belongs to a cite-architecture group on github. For a guide to our repositories, see the organization home page on github.

Wednesday, March 26, 2014

Visualization from CITE URNs + d3 + hive maps

Many software packages make it relatively easy to create visualizations of complex networks of data, but often produce hairballs that tell us more about the visual layout algorithm than the structure of the network. Martin Krzywinski has proposed an alternative, called hive plots: lay out your nodes along a series of axes that you know have meaning in your network, and explore the network visually from there. Mike Bostock, predictably, has done gorgeous interactive work with Krzywinski’s idea in the d3 javascript library.
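The geometry behind a hive plot is simple enough to sketch in a few lines. The Python below is not d3 code and not Krzywinski's implementation; it is only a minimal illustration, under the assumption that axes are spaced evenly around a circle and that nodes are spread evenly along each axis in the order given. The function name and default radii are invented for the example.

```python
import math


def hive_layout(axes, inner_radius=1.0, outer_radius=10.0):
    """Place nodes along evenly spaced radial axes.

    axes is a list of lists of node ids, one list per axis, given in
    clockwise order starting from the top. Returns a dict mapping each
    node id to (x, y) coordinates, with y pointing up.
    """
    positions = {}
    for axis_index, nodes in enumerate(axes):
        # Angle measured clockwise from twelve o'clock.
        angle = 2 * math.pi * axis_index / len(axes)
        for rank, node in enumerate(nodes):
            # Spread the nodes evenly between the inner and outer radius.
            fraction = rank / (len(nodes) - 1) if len(nodes) > 1 else 0.0
            radius = inner_radius + fraction * (outer_radius - inner_radius)
            positions[node] = (radius * math.sin(angle),
                               radius * math.cos(angle))
    return positions
```

In a real plot, the order of nodes along an axis would normally come from some meaningful property of the data rather than from list order; edges are then drawn as curves between the computed positions.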

 hive map
I created my first hive plot this morning using d3. The screen shot above illustrates a project by Megan Whitacre (Holy Cross ’14) annotating a series of illustrated inscriptions for use in teaching introductory Latin. The five axes are (clockwise beginning from the blue axis at the top) 32 broad grammatical concepts, 71 narrower topics about the morphology of substantives, 55 topics about verbal morphology, 9 syntactic topics and, along the purple axis, 103 images. All of Megan’s annotations are expressed with CITE URNs; this makes it straightforward both to gather all references to the same image and to apply her regions of interest to highlight areas of the image. d3 practically begs for interactive displays, so you can highlight nodes or edges to see further information, or can click on image nodes to see the image with linked, highlighted areas for all references to the image.
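Gathering all references to the same image comes down to dropping the region of interest from each URN and grouping on what remains. Here is a rough Python sketch of that idea; the URNs and topic labels in the usage note are made up for illustration and do not come from Megan's data set.

```python
from collections import defaultdict


def group_by_image(annotations):
    """Group (topic, image URN) pairs by image, keeping each region.

    Each URN may end in '@' followed by a region of interest. The
    result maps the URN without its region to a list of
    (topic, region) pairs, where region is None if none was given.
    """
    grouped = defaultdict(list)
    for topic, urn in annotations:
        image, _, region = urn.partition("@")
        grouped[image].append((topic, region or None))
    return dict(grouped)
```

Given hypothetical annotations such as ("ablative", "urn:cite:example:inscriptions.img1@0.1,0.2,0.3,0.1") and ("perfect tense", "urn:cite:example:inscriptions.img1@0.5,0.5,0.2,0.1"), both entries end up under the single key urn:cite:example:inscriptions.img1, each with its own region ready to be drawn as a highlight.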



There is plenty of room for improvement. Selecting a node or edge really ought to select all direct connections to it as well, and hovering should use Megan’s rdf:Label values, instead of the raw CITE URN, to identify the node, to name two obvious desiderata. But as an initial effort put together between a second cup of coffee and a lunch break, it’s hard to be disappointed with it. It underscores for me that as our tools improve, it becomes more and more important to have properly structured and properly citable data.


(The screen shot is linked to a live version of the graph.)

Monday, March 3, 2014

bl.ocks rocks

http://bl.ocks.org/ occupies an interesting space in the overlap of coding and writing. It lets you simultaneously view the rendering of a source page and its source code, together with commentary in the form of a README. Each bl.ock is defined simply as a github gist that follows a naming convention: README.md for commentary (in markdown), and index.html for the source file to be both rendered and displayed in source view. This is extraordinarily powerful when index.html is a single-page web application, of the kind that D3 (http://d3js.org/) encourages you to build – and Mike Bostock, the main developer of D3, just happens to be the inventor of bl.ocks as well.

bl.ocks are a great way to pull away the curtain and illustrate how a particular analysis or visualization works, and studying other people’s bl.ocks can be a fast route to learning a new technique.
From work with Christine Bannan on the Phoros project, I’ve put up this bl.ock as we begin to map changing patterns of Athenian tribute over time:


Extant records of tribute payment


Links

  • Phoros project github repositories: http://phoros.github.io/
  • Phoros project test site: http://beta.hpcc.uh.edu/phoros/

Monday, January 27, 2014

Environments for collaborating

Cloud-based services have made collaborating on scholarly material so much easier in the last few years that it’s hard to remember how onerous it used to be. (Raise your hand if you have ever hosted your own shared version control system.)

github is a prime example. In addition to version control,  github provides each repository with a wiki, issue tracking and other services that you can use entirely through your web browser. Edit version-controlled files through the browser or in the comfort of your own computer’s OS, and push them back to a shared repository.

While github solves nearly every challenge of collaborating on static files or data, it does not directly address the question of how to share computational processes. How do we share with collaborators when the goal is not to show the results of a process, but to share the process itself? This, like so many technical challenges in humanities scholarship, is a problem we have in common with programmers who have to collaborate on writing code, and who have kindly provided us with the solution.

Virtual machines are half of the answer. Consumer-level hardware and VM software have reached the point where we can realistically say, “No matter what OS you’re actually using, we’ll just use a VM so we can all work on this project in Ubuntu 12.04.”

The other half of the answer is a system like vagrant. Vagrant provides a way to specify the configuration of your virtual machine, and can work with many VM systems, including the freely available VirtualBox. The specification is expressed in a simple text file — ideal for sharing from your github repository! So starting from scratch, new collaborators can perfectly replicate the system you run in your project in these steps:
  1. Make sure git is installed on their machines: http://git-scm.com/
  2. Install virtual box on their machines: https://www.virtualbox.org/
  3. Install vagrant on their machines: http://www.vagrantup.com/
  4. Run this vagrant command: vagrant plugin install vagrant-vbguest
  5. Clone your project's git repository including a Vagrantfile specifying the configuration of your virtual machine
At this point, they can begin any work session within your repository directory by running

vagrant up

to start the virtual machine. (The first boot will be slow as the virtual machine is downloaded and built; after that, it’s tolerable.)
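For collaborators who are unsure whether the installation steps above succeeded, a small preflight check can help. This Python sketch simply looks for the three programs on the search path. The executable names are assumptions (VirtualBox's command-line tool is usually called VBoxManage, and on some systems it is installed but not on the path), and the function name is invented for the example.

```python
import shutil

# Executable names are assumptions about a typical installation.
REQUIRED_TOOLS = ("git", "VBoxManage", "vagrant")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the search path."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Please install: " + ", ".join(missing))
    else:
        print("All prerequisites found; run 'vagrant up' in the repository.")
```

The which parameter exists only so that the check can be exercised without depending on what is installed on a particular machine.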


What is perhaps most remarkable about this sequence is that it imposes only two prior technical requirements on new collaborators: they must be able to use a web browser to download and install virtualbox and vagrant (and git, if they have not already done so); and they must be able to find a terminal or console where they can run a vagrant command. If that’s too much to demand, maybe it’s time for them to reconsider whether they’re really interested in collaborating on a digital scholarly project.

Monday, January 20, 2014

Markdown everywhere

Think there's a little momentum behind markdown lately?

This article from Mashable is already half a year old, and lists seventy-eight (78!) tools for "writing and previewing markdown"!  And its topic doesn't even extend to some of the very interesting services that use markdown, like leanpub and draft, or any of the numerous markdown-to-slideshow toolkits out there...

I'm convinced enough that I've just completed an initial version of a tool for working with markdown extended to allow citation using canonical URN values, and converting the source to generic markdown that any of these tools can process.  When I've polished the docs a little more, I'll post here with further notes on markdown and its increasing importance for scholarly work.

Tuesday, January 7, 2014

Designing scholarly publications: some lessons we can take from programmers

Scholarly publication involves more than just making work accessible.  When scholars publish, they are contributing their work to the collective endeavor of the entire scholarly community.  In order for other scholars to inspect, critique, and build upon published scholarship, it must be appropriately:

  •  identified
  •  verified
  •  structured for reuse
  •  licensed for reuse

Scholars are fortunate that all of these requirements are shared by coders, who have consequently developed well established practices and tools to satisfy each of them.  The infrastructure that programmers rely on is especially significant for digital scholarship because it has been designed for automated interaction.  For humanists, the shift from creating scholarship designed for manual processing to scholarship designed to be used through the mediation of software and hardware is often an enormous challenge.  My experience working with many collaborators on the Homer Multitext project (HMT) has convinced me that we can greatly accelerate the progress of our digital scholarly work by learning from decades of experience in software development.   (In follow-up posts, I'll illustrate some working examples that exploit the design of the HMT's digital publications.)

Identifying a digital publication

Any unit of publication must be clearly identified with a specific, fixed version.  For a print monograph, this might be an edition  number, and possibly also a printing;  journals are normally identified by a date (year, quarter, or other cycle) and a volume or other serial count.  A library catalog might then resolve that reference to a storage location identified by a call number.  Digital publications likewise must be uniquely identified in a system that recognizes different editions or versions, and permits automated resolution of identifiers to a storage location.

Consider what you would typically do if you were writing a Java program, and wanted to use Saxon (a library for processing XSLT).  You can specify the library by its Maven coordinates, giving its "publisher," net.sf.saxon, the name of the "publication," saxon-dom, and an "edition," or version, number (e.g., 8.7).  You would then rely on an automated build process to retrieve a local copy from a repository that recognizes the identifier.  With repository management systems such as Nexus, scholars can use exactly the same system of Maven coordinates to make published material automatically retrievable.  The Homer Multitext project, for example, plans to use a Nexus repository to publish the project's archive of editorial work three times a year.  The publications belong to the group org.homermultitext, and will include a publication named hmtarchive;  versions will have names like 2014.1 (the first publication of the year 2014).
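To make the idea of automated resolution concrete, the following Python sketch turns Maven coordinates into the relative path used by the standard Maven repository layout (group with dots replaced by slashes, then artifact, version, and a file named artifact-version). The function names are invented for the example, and the file extension used for the HMT archive in the usage note is an assumption, since the packaging format is not stated here.

```python
from typing import NamedTuple


class Coordinates(NamedTuple):
    group: str      # the "publisher"
    artifact: str   # the "publication"
    version: str    # the "edition"


def parse_coordinates(text):
    """Parse 'group:artifact:version' into its three parts."""
    group, artifact, version = text.split(":")
    return Coordinates(group, artifact, version)


def repository_path(coords, extension="jar"):
    """Relative path of the artifact in a standard Maven repository."""
    group_path = coords.group.replace(".", "/")
    filename = f"{coords.artifact}-{coords.version}.{extension}"
    return f"{group_path}/{coords.artifact}/{coords.version}/{filename}"
```

For the Saxon example, repository_path(parse_coordinates("net.sf.saxon:saxon-dom:8.7")) gives net/sf/saxon/saxon-dom/8.7/saxon-dom-8.7.jar. If the HMT archive were packaged as a zip file, org.homermultitext:hmtarchive:2014.1 would resolve the same way to org/homermultitext/hmtarchive/2014.1/hmtarchive-2014.1.zip.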

Verifying a publication 

One of the distinguishing features of publication is that it has undergone some form of review.  The review process evaluates work which, in principle, ought to be replicable.  Review of digital scholarship should include not only manual evaluation, but, where applicable, automated tests assessing the data.   All of us already apply an automated test whenever we run a spell checker over a text:  when we produce digital work with more complex structure than simply a stream of words, more extensive digital tests are called for, and ought to be included as part of the publication.

One valuable idea we can take from programming practice is "test-driven development."  In test-driven development, the programmer specifies an automated test before beginning to write some section of a program, and works on the program until it passes the test.  Of course in conventional scholarly work, we evaluate work in progress as we go along: we don't submit something for review that we have not thoroughly reviewed ourselves.  But applying a test-driven approach to the editorial work of the HMT has been an eye-opening experience.  Because it compels us to reckon with "minor" irregularities we might otherwise gloss over, it can expose assumptions needing more critical examination.  In the HMT, one test we apply after tokenizing our edited texts, for example, is  a morphological analysis of all lexical tokens, on the assumption that failures will represent either Byzantine orthographic practices unrecognized by our parsing system, or errors in our edition.  We were surprised to discover a third explanation:  a number of technical terms appearing in the scholia are not in fact in the standard Greek lexicon by Liddell and Scott, and therefore failed to parse.  When we retroactively applied our morphological tests to sections that had been edited before we adopted a full test-driven approach, we uncovered further examples that, as isolated cases, editors had not noticed.
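The shape of such a test can be sketched briefly in Python. This is not the HMT's test code: the parser here is a toy stand-in that only checks membership in a small set, the tokens are transliterated placeholders, and all of the names are invented for the example. The point is only that every token the parser rejects is collected and reported, so that none is silently passed over.

```python
def unparsed_tokens(tokens, can_parse):
    """Return the distinct tokens the parser rejects, in sorted order."""
    return sorted({token for token in tokens if not can_parse(token)})


def check_all_tokens_parse(tokens, can_parse):
    """Fail loudly, listing every token that needs an editor's attention."""
    failures = unparsed_tokens(tokens, can_parse)
    assert not failures, f"tokens needing review: {failures}"


# Toy stand-in for a morphological parser: a real one would analyze
# each form instead of looking it up in a fixed set.
KNOWN_FORMS = {"menin", "aeide", "thea"}


def toy_parser(token):
    return token in KNOWN_FORMS


# Hypothetical tokenized text; the last token is unknown to the parser.
SAMPLE_TOKENS = ["menin", "aeide", "thea", "unknownterm"]
```

Calling check_all_tokens_parse(SAMPLE_TOKENS, toy_parser) fails and names the unrecognized token. Each failure then has to be explained, whether as an orthographic variant, an editorial error, or a genuine gap in the lexicon.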

Structuring for reuse

Complex computer programs are possible in part because reusable units of code encapsulate solutions to individual problems.   For example, I should never again have to spend my time writing a program to translate ancient Greek from one encoding system to another, because I can rely on Hugh Cayless' Epidoc transcoding library.  Hugh's code has a clean interface:  define the system you're translating from, the system you're translating to, and then the `getString` method hands you your result.

One of the major challenges humanists need to address today is how to design APIs to digital scholarship.  What are the appropriate components or methods, and how should they be identified?  At a minimum, a scholarly publication should address this question in two ways:

  1. Citations of source material should be expressed in technology-independent but machine-actionable notation.  In the absence of an alternative that fully accomplishes this, the Homer Multitext project has developed a URN notation for texts (CTS URNs) and for discrete objects (CITE Object URNs), as well as an extension to CITE Object URNs for resolution-independent citation of regions of interest on an image.
  2. If sections of the publication itself can be cited, they too should be addressable with CTS URNs based on some logical unit (and not by accidental physical features such as page numbers).
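As an illustration of what "machine-actionable" means for a text citation, here is a minimal Python sketch that splits a CTS URN into a namespace, a work hierarchy and a passage reference. It is a simplification and not a full implementation of the CTS URN syntax: it ignores passage ranges and subreferences, and the class and function names are invented for the example.

```python
from typing import NamedTuple, Optional, Tuple


class CtsUrn(NamedTuple):
    namespace: str
    work: Tuple[str, ...]     # text group, work, and any version or exemplar
    passage: Optional[str]    # None when the URN cites a whole work


def parse_cts_urn(urn):
    """Split a CTS URN into its colon-delimited components."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[:2] != ["urn", "cts"]:
        raise ValueError(f"not a CTS URN: {urn}")
    namespace, work = parts[2], parts[3]
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, tuple(work.split(".")), passage)
```

Applied to urn:cts:greekLit:tlg0012.tlg001.msA:1.1, the function returns the namespace greekLit, the work hierarchy (tlg0012, tlg001, msA) and the passage 1.1. Nothing in the reference depends on a page number or a file format.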

Licensing for reuse

The act of publication alienates a work of scholarship from the author in a form that others can use, and contributes it to the scholarly community.  In addition to an appropriate technological design, scholarly publications must therefore be available under an appropriate license that must allow at least non-commercial reuse.  For programmers, the leading such license is the GNU General Public License (GPL), and this is ideal for source code included in scholarly publication;  for other kinds of digital data, the Creative Commons project has defined Attribution-ShareAlike licenses that achieve the same aim.

Highly trained attorneys around the world have contributed their time and expertise to developing these licenses, and in many instances tailoring them to the specific requirements of local legal systems, as well as translating them into a large number of languages.  The easiest part of designing your publication should be taking advantage of their work and applying one of these licenses to your work.