Tuesday, April 1, 2014

Get funding for your DH project?

Try these regular expressions on the embedded youtube video:

s/your company/IT staff/g
s/our company/our project/g
s/expert/developer/g
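
For anyone who would rather try the substitutions on a transcript than in their head, here is a minimal sketch in Python. One wrinkle: the pattern "our company" also occurs inside "your company", so the result of running the expressions naively depends on their order. The sketch sidesteps this with word boundaries (\b), which are an addition of mine and not part of the expressions above; the function name and the sample sentence are likewise only illustrative.

```python
import re

# The three substitutions from the post. The \b word boundaries are an
# assumption added here so that "our company" cannot match inside
# "your company", which makes the result independent of ordering.
SUBSTITUTIONS = [
    (r"\bour company\b", "our project"),
    (r"\byour company\b", "IT staff"),
    (r"expert", "developer"),
]


def apply_substitutions(text):
    """Apply each substitution globally, like the /g flag in sed."""
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    # Hypothetical sample sentence, not a quotation from the video.
    print(apply_substitutions("your company hired our company's expert"))
```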







Sunday, March 30, 2014

Specs + tests for CTS

In February, Chris Blackwell and I published a first release candidate of version 5.0 of the CTS protocol specification. Today, we are publishing a second release candidate, together with a suite of tests packaged with a servlet that runs the tests and formats the resulting report as a web page.

We are currently working on a third release candidate taking account of all the helpful comments we have received so far on rc.1, and plan to continue coordinating releases of the CTS protocol specification with parallel test suites. We expect that rc.3 will be the last candidate version before a final CTS 5.0.


All our released work on the CITE architecture now belongs to a cite-architecture group on github. For a guide to our repositories, see the organization home page on github.

Wednesday, March 26, 2014

Visualization from CITE URNs + d3 + hive maps

Many software packages make it relatively easy to create visualizations of complex networks of data, but often produce hairballs that tell us more about the visual layout algorithm than the structure of the network. Martin Krzywinski has proposed an alternative, called hive plots: lay out your nodes along a series of axes that you know have meaning in your network, and explore the network visually from there. Mike Bostock, predictably, has done gorgeous interactive work with Krzywinski’s idea in the d3 javascript library.
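
The layout idea itself is simple enough to sketch without any plotting library. The following Python is only an illustration of the geometry, not d3's API or Krzywinski's implementation: it assumes the axes are spread evenly around a circle, starting at the top and running clockwise, and that nodes are spaced evenly along each axis (a real hive plot would usually position nodes by some meaningful property). The function name and the radius defaults are hypothetical.

```python
import math


def hive_layout(axes, inner_radius=40.0, outer_radius=240.0):
    """Return {node: (x, y)} for nodes laid out on radial axes.

    `axes` is a list of lists: each inner list holds the nodes of one axis.
    Coordinates are screen coordinates (y grows downward), centered on 0,0.
    """
    positions = {}
    n_axes = len(axes)
    for axis_index, nodes in enumerate(axes):
        # Angle measured clockwise from the 12 o'clock position.
        angle = 2 * math.pi * axis_index / n_axes
        for i, node in enumerate(nodes):
            # Assumption: even spacing between the inner and outer radius.
            fraction = i / (len(nodes) - 1) if len(nodes) > 1 else 0.0
            radius = inner_radius + fraction * (outer_radius - inner_radius)
            x = radius * math.sin(angle)
            y = -radius * math.cos(angle)
            positions[node] = (x, y)
    return positions


if __name__ == "__main__":
    demo = hive_layout([["concept-1", "concept-2"], ["topic-1"], ["image-1"]])
    for node, (x, y) in demo.items():
        print(f"{node}: ({x:.1f}, {y:.1f})")
```

Edges are then just curves drawn between the computed positions of the two nodes they connect.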

 hive map
I created my first hive plot this morning using d3. The screen shot above illustrates a project by Megan Whitacre (Holy Cross ’14) annotating a series of illustrated inscriptions for use in teaching introductory Latin. The five axes are (clockwise beginning from the blue axis at the top) 32 broad grammatical concepts, 71 narrower topics about the morphology of substantives, 55 topics about verbal morphology, 9 syntactic topics and, along the purple axis, 103 images. All of Megan’s annotations are expressed with CITE URNs; this makes it straightforward both to gather all references to the same image and to apply her regions of interest to highlight areas of the image. d3 practically begs for interactive displays, so you can highlight nodes or edges to see further information, or click on image nodes to see the image with linked, highlighted areas for all references to the image.
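
To make "gather all references to the same image" concrete, here is a hedged sketch in Python. It assumes that an image URN carries its region of interest as a suffix of the form @x,y,w,h, with each value expressed as a fraction of the image's width or height; the URNs in the example are invented placeholders, and the function names are hypothetical rather than part of any CITE library.

```python
def split_roi(urn):
    """Split an image URN into (image_urn, roi), where roi is None or (x, y, w, h)."""
    image, sep, roi = urn.partition("@")
    if not sep:
        return image, None
    x, y, w, h = (float(value) for value in roi.split(","))
    return image, (x, y, w, h)


def group_by_image(urns):
    """Map each image URN to the list of regions of interest that cite it."""
    regions = {}
    for urn in urns:
        image, roi = split_roi(urn)
        rois = regions.setdefault(image, [])
        if roi is not None:
            rois.append(roi)
    return regions


def to_pixels(roi, width, height):
    """Scale a fractional region to pixel values for an image of a given size."""
    x, y, w, h = roi
    return (round(x * width), round(y * height), round(w * width), round(h * height))


if __name__ == "__main__":
    # Invented URNs for illustration only.
    citations = [
        "urn:cite:example:img.001@0.25,0.5,0.5,0.25",
        "urn:cite:example:img.001@0.5,0.5,0.25,0.25",
        "urn:cite:example:img.002",
    ]
    for image, rois in group_by_image(citations).items():
        print(image, [to_pixels(roi, 1000, 800) for roi in rois])
```

Because the regions are fractions rather than pixel counts, the same citation can highlight the same area on a thumbnail or on a full-resolution image.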



There is plenty of room for improvement. Selecting a node or edge really ought to select all direct connections to it as well, and hovering should identify the node by Megan’s rdf:Label values instead of the raw CITE URN, to name two obvious desiderata. But as an initial effort put together between a second cup of coffee and a lunch break, it’s hard to be disappointed with it. It underscores for me that as our tools improve, it becomes more and more important to have properly structured and properly citable data.


(The screen shot is linked to a live version of the graph.)

Monday, March 3, 2014

bl.ocks rocks

http://bl.ocks.org/ occupies an interesting space in the overlap of coding and writing. It lets you simultaneously view the rendering of a source page and its source code, together with commentary in the form of a README. Each bl.ock is defined simply as a github gist that follows a naming convention: README.md for the commentary (in markdown), and index.html for the source file to be both rendered and displayed in source view. This is extraordinarily powerful when index.html is a single-page web application, of the kind that D3 (http://d3js.org/) encourages you to build – and Mike Bostock, the main developer of D3, just happens to be the inventor of bl.ocks as well.

bl.ocks are a great way to pull away the curtain and illustrate how a particular analysis or visualization works, and studying other people’s bl.ocks can be a fast route to learning a new technique.
From work with Christine Bannan on the Phoros project, I’ve put up this bl.ock as we begin to map changing patterns of Athenian tribute over time:


Extant records of tribute payment


Links

  • Phoros project github repositories: http://phoros.github.io/
  • Phoros project test site: http://beta.hpcc.uh.edu/phoros/

Monday, January 27, 2014

Environments for collaborating

Cloud-based services have made collaborating on scholarly material so much easier in the last few years that it’s hard to remember how onerous it used to be. (Raise your hand if you have ever hosted your own shared version control system.)

github is a prime example. In addition to version control,  github provides each repository with a wiki, issue tracking and other services that you can use entirely through your web browser. Edit version-controlled files through the browser or in the comfort of your own computer’s OS, and push them back to a shared repository.

While github solves nearly every challenge of collaborating on static files or data, it does not directly address the question of how to share computational processes. How do we share with collaborators when the goal is not to show the results of a process, but to share the process itself? This, like so many technical challenges in humanities scholarship, is a problem we have in common with programmers who have to collaborate on writing code, and who have kindly provided us with the solution.

Virtual machines are half of the answer. Consumer-level hardware and VM software have reached the point where we can realistically say, “No matter what OS you’re actually using, we’ll just use a VM so we can all work on this project in Ubuntu 12.04.”

The other half of the answer is a system like vagrant. Vagrant provides a way to specify the configuration of your virtual machine, and can work with many VM systems, including the freely available VirtualBox. The specification is expressed in a simple text file — ideal for sharing from your github repository! So starting from scratch, new collaborators can perfectly replicate the system you run in your project in these steps:
  1. Make sure git is installed on their machines: http://git-scm.com/
  2. Install VirtualBox on their machines: https://www.virtualbox.org/
  3. Install vagrant on their machines: http://www.vagrantup.com/
  4. Run this vagrant command: vagrant gem install vagrant-vbguest
  5. Clone your project's git repository, which includes a Vagrantfile specifying the configuration of your virtual machine
At this point, they can begin any work session within your repository directory by running

vagrant up

to start the virtual machine. (The first boot will be slow as the virtual machine is downloaded and built; after that, it’s tolerable.)
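
If you want to give new collaborators a quick way to check that the prerequisites are in place before they run vagrant up, something like the following Python sketch would do. It only looks for three executables on the PATH: git, vagrant, and VBoxManage (VirtualBox's command-line tool). That last check is an assumption about how VirtualBox is installed, since on some systems VBoxManage is not added to the PATH by default, and the function and variable names are hypothetical.

```python
import shutil

# Executable name -> where to download the tool (URLs from the steps above).
REQUIRED_TOOLS = {
    "git": "http://git-scm.com/",
    "VBoxManage": "https://www.virtualbox.org/",
    "vagrant": "http://www.vagrantup.com/",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the subset of `tools` whose executables are not on the PATH.

    `which` is injectable so the check can be exercised without
    depending on what is installed on the current machine.
    """
    return {name: url for name, url in tools.items() if which(name) is None}


if __name__ == "__main__":
    missing = missing_tools()
    if not missing:
        print("All prerequisites found; try 'vagrant up' in the repository directory.")
    for name, url in missing.items():
        print(f"Could not find {name} on the PATH; see {url}")
```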


What is perhaps most remarkable about this sequence is that it imposes only two prior technical requirements on new collaborators: they must be able to use a web browser to download and install virtualbox and vagrant (and git, if they have not already done so); and they must be able to find a terminal or console where they can run a vagrant command. If that’s too much to demand, maybe it’s time for them to reconsider whether they’re really interested in collaborating on a digital scholarly project.

Monday, January 20, 2014

Markdown everywhere

Think there's a little momentum behind markdown lately?

This article from Mashable is already half a year old, and lists seventy-eight (78!) tools for "writing and previewing markdown"!  And its topic doesn't even extend to some of the very interesting services that use markdown, like leanpub and draft, or any of the numerous markdown-to-slideshow toolkits out there...

I'm convinced enough that I've just completed an initial version of a tool for working with markdown extended to allow citation using canonical URN values, and converting the source to generic markdown that any of these tools can process.  When I've polished the docs a little more, I'll post here with further notes on markdown and its increasing importance for scholarly work.

Tuesday, January 7, 2014

Designing scholarly publications: some lessons we can take from programmers

Scholarly publication involves more than just making work accessible.  When scholars publish, they are contributing their work to the collective endeavor of the entire scholarly community.  In order for other scholars to inspect, critique, and build upon published scholarship, it must be appropriately:

  •  identified
  •  verified
  •  structured for reuse
  •  licensed for reuse

Scholars are fortunate that all of these requirements are shared by coders, who have consequently developed well-established practices and tools to satisfy each of them.  The infrastructure that programmers rely on is especially significant for digital scholarship because it has been designed for automated interaction.  For humanists, the shift from creating scholarship designed for manual processing to scholarship designed to be used through the mediation of software and hardware is often an enormous challenge.  My experience working with many collaborators on the Homer Multitext project (HMT) has convinced me that we can greatly accelerate the progress of our digital scholarly work by learning from decades of experience in software development.   (In follow-up posts, I'll illustrate some working examples that exploit the design of the HMT's digital publications.)

Identifying a digital publication

Any unit of publication must be clearly identified with a specific, fixed version.  For a print monograph, this might be an edition  number, and possibly also a printing;  journals are normally identified by a date (year, quarter, or other cycle) and a volume or other serial count.  A library catalog might then resolve that reference to a storage location identified by a call number.  Digital publications likewise must be uniquely identified in a system that recognizes different editions or versions, and permits automated resolution of identifiers to a storage location.

Consider what you would typically do if you were writing a Java program, and wanted to use Saxon (a library for processing XSLT).  You can specify the library by its Maven coordinates, giving its "publisher," net.sf.saxon, the name of the "publication," saxon-dom, and an "edition," or version, number (e.g., 8.7).  You would then rely on an automated build process to retrieve a local copy from a repository that recognizes the identifier.  With repository management systems such as Nexus, scholars can use exactly the same system of Maven coordinates to make published material automatically retrievable.  The Homer Multitext project, for example, plans to use a Nexus repository to publish the project's archive of editorial work three times a year.  The publications belong to the group org.homermultitext, and will include a publication named hmtarchive;  versions will have names like 2014.1 (the first publication of the year 2014).
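
What makes Maven coordinates "automatically resolvable" is that a repository stores each artifact at a path derived mechanically from the coordinates.  The Python sketch below illustrates the standard Maven repository layout; the function name is hypothetical, and the "zip" packaging shown for the hmtarchive publication is purely an assumption for the sake of the example.

```python
def maven_path(group, artifact, version, packaging="jar"):
    """Build the repository-relative path for a set of Maven coordinates.

    Standard layout: the dots in the group become directory separators,
    followed by artifact/version/artifact-version.packaging.
    """
    group_path = group.replace(".", "/")
    return f"{group_path}/{artifact}/{version}/{artifact}-{version}.{packaging}"


if __name__ == "__main__":
    print(maven_path("net.sf.saxon", "saxon-dom", "8.7"))
    # The packaging type here is an assumption, not a statement about
    # how the HMT archive is actually packaged.
    print(maven_path("org.homermultitext", "hmtarchive", "2014.1", packaging="zip"))
```

Appending such a path to the base URL of a repository gives a build tool everything it needs to retrieve one specific, fixed version of a publication.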

Verifying a publication 

One of the distinguishing features of publication is that it has undergone some form of review.  The review process evaluates work which, in principle, ought to be replicable.  Review of digital scholarship should include not only manual evaluation, but, where applicable, automated tests assessing the data.   All of us already apply an automated test whenever we run a spell checker over a text:  when we produce digital work with more complex structure than simply a stream of words, more extensive digital tests are called for, and ought to be included as part of the publication.

One valuable idea we can take from programming practice is "test-driven development."  In test-driven development, the programmer specifies an automated test before beginning to write some section of a program, and works on the program until it passes the test.  Of course in conventional scholarly work, we evaluate work in progress as we go along: we don't submit something for review that we have not thoroughly reviewed ourselves.  But applying a test-driven approach to the editorial work of the HMT has been an eye-opening experience.  Because it compels us to reckon with "minor" irregularities we might otherwise gloss over, it can expose assumptions needing more critical examination.  In the HMT, one test we apply after tokenizing our edited texts, for example, is a morphological analysis of all lexical tokens, on the assumption that failures will represent either Byzantine orthographic practices unrecognized by our parsing system, or errors in our edition.  We were surprised to discover a third explanation:  a number of technical terms appearing in the scholia are not in fact in the standard Greek lexicon by Liddell and Scott, and therefore failed to parse.  When we retroactively applied our morphological tests to sections that had been edited before we adopted a full test-driven approach, we uncovered further examples that, as isolated cases, editors had not noticed.
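
The shape of such a test is easy to sketch, even though the real work lies in the morphological parser.  The Python below is not the HMT's test code: it assumes only that some parser can answer "does this token parse?", and it stands in a toy set of known forms for that parser.  The function name and the sample tokens are illustrative.

```python
def unparsed_tokens(tokens, can_parse):
    """Return {token: [positions]} for every token the parser rejects.

    `can_parse` is any callable returning True when a token receives
    at least one morphological analysis.
    """
    failures = {}
    for position, token in enumerate(tokens):
        if not can_parse(token):
            failures.setdefault(token, []).append(position)
    return failures


if __name__ == "__main__":
    # Toy stand-in for a morphological parser: a fixed set of known forms.
    known_forms = {"μῆνιν", "ἄειδε", "θεά"}
    # The unaccented last token is a made-up example of a form that fails.
    tokens = ["μῆνιν", "ἄειδε", "θεά", "θεα"]
    for token, positions in unparsed_tokens(tokens, known_forms.__contains__).items():
        print(f"{token} failed to parse at positions {positions}")
```

Each failure then calls for an editorial decision: is it an orthographic variant, an error in the edition, or a word missing from the lexicon?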

Structuring for reuse

Complex computer programs are possible in part because reusable units of code encapsulate solutions to individual problems.   For example, I should never again have to spend my time writing a program to translate ancient Greek from one encoding system to another, because I can rely on Hugh Cayless' Epidoc transcoding library.  Hugh's code has a clean interface:  define the system you're translating from, the system you're translating to, and then the `getString` method hands you your result.

One of the major challenges humanists need to address today is how to design APIs to digital scholarship.  What are the appropriate components or methods, and how should they be identified?  At a minimum, a scholarly publication should address this question in two ways:

  1. Citations of source material should be expressed in technology-independent but machine-actionable notation.  In the absence of an alternative that fully accomplishes this, the Homer Multitext project has developed a URN notation for texts (CTS URNs) and for discrete objects (CITE Object URNs), as well as an extension to CITE Object URNs for resolution-independent citation of regions of interest on an image.
  2. If sections of the publication itself can be cited, they too should be addressable with CTS URNs based on some logical unit (and not by accidental physical features such as page numbers).
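
To show what "machine-actionable" means in practice, here is a minimal Python sketch that splits a CTS URN into its components.  It assumes the colon-delimited form urn:cts:namespace:work:passage, with the work component itself divided by periods and the passage component optional; it is a simplified illustration, not a full implementation of the CTS URN specification, and the class and function names are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class CtsUrn(NamedTuple):
    namespace: str
    work: Tuple[str, ...]      # e.g. text group, work, version
    passage: Optional[str]     # None when the URN cites the whole work


def parse_cts_urn(urn):
    """Split a CTS URN into namespace, work hierarchy and passage."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    # A missing or empty fifth component means no passage is cited.
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(parts[2], tuple(parts[3].split(".")), passage)


if __name__ == "__main__":
    urn = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.msA:1.1")
    print(urn.namespace, urn.work, urn.passage)
```

Nothing in the notation refers to a file format, a server, or a page number, which is what lets the same citation be resolved by different software over time.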

Licensing for reuse

The act of publication alienates a work of scholarship from the author in a form that others can use, and contributes it to the scholarly community.  In addition to an appropriate technological design, scholarly publications must therefore be available under an appropriate license that allows at least non-commercial reuse.  For programmers, the leading such license is the GNU General Public License (GPL), and this is ideal for source code included in scholarly publication;  for other kinds of digital data, the Creative Commons project has defined Attribution-ShareAlike licenses that achieve the same aim.

Highly trained attorneys around the world have contributed their time and expertise to developing these licenses, and in many instances tailoring them to the specific requirements of local legal systems, as well as translating them into a large number of languages.  The easiest part of designing your publication should be taking advantage of their work and applying one of these licenses to your work.