Vitruvian design for scholarship in the humanities

OHCO2 FTW

2015-03-07T04:17:00.001-08:00

Underlying the CTS URN notation is the abstract model of textual structure abbreviated as OHCO2.

The generality of this model is nicely illustrated by recent implementations of the Canonical Text Services (CTS) protocol. The CTS protocol provides retrieval of texts by CTS URN: implementations linked from this page use XML tree structures, relational databases and directed graph stores to store and retrieve texts.
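As a rough illustration of what a CTS URN encodes, the Python sketch below splits one into its components. The example URN (a reference to Iliad 1.1 in the `greekLit` namespace) and the names `CtsUrn` and `parse_cts_urn` are illustrative choices of mine, not part of any CTS implementation, and the sketch does not handle passage ranges. The point is only that both the work and the passage are ordered hierarchies, which an XML tree, a relational table or a graph can model equally well.

```python
from typing import NamedTuple, Tuple


class CtsUrn(NamedTuple):
    namespace: str
    work: Tuple[str, ...]      # work hierarchy: text group, work, ...
    passage: Tuple[str, ...]   # citation hierarchy, e.g. book, line


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, work hierarchy and passage hierarchy."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work = parts[2], tuple(parts[3].split("."))
    # The passage component is optional: a URN may cite a whole work.
    passage = tuple(parts[4].split(".")) if len(parts) == 5 and parts[4] else ()
    return CtsUrn(namespace, work, passage)


iliad_1_1 = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1")
print(iliad_1_1.work, iliad_1_1.passage)
```

Nothing in the parsed result says how the text is stored, which is what leaves implementers free to choose their own back end.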

For an essential scholarly concept (identifying a citable passage of text), that's a powerful level of abstraction permitting scholars and developers to select technologies best suited to the specific kind of work they want to pursue with a citable corpus.

Open license + an iPad mini

2014-10-25T19:38:00.000-07:00

At the end of Open Access Week, I'd like to salute the library of Leiden University for living up to the goal of making open access the norm in scholarship. If you work in its very pleasant setting, there are no restrictions on how you make use of out-of-copyright material.

When I visited Leiden earlier this year, I had an iPad mini with me, so I took a few quick snaps of Codex Vossianus Graecus 1, a set of maps (perhaps sixteenth century) to accompany Ptolemy's Geography. Thanks to the library's policies, I can make images like this available as citable scholarly resources.

Leiden University, Codex Voss.Gr. 1: world map in Ptolemy's first projection

When the phone or tablet you happen to be carrying gives you photos rivaling or surpassing anything published in print, the technology is not much of a barrier. When the default policy is that you can use your photographs as you see fit, neither is legal licensing.

(To see what's legible in a quick and poorly lit snap from an iPad mini, see this zoomable image of folios 2-3.)

Paleography matters in the Declaration of Independence: a CITE response

2014-07-04T13:44:00.000-07:00

My colleague Tom Martin points me to this article in the New York Times, reporting that Danielle Allen at the Institute for Advanced Study in Princeton has questioned the National Archives’ transcription of a crucial phrase in the Declaration of Independence. Are Thomas Jefferson’s “self-evident truths” comprised of individual rights, or do they also include a governmental role “to secure these rights”? Your judgment could hang on whether or not you see a period followed by a long dash or simply a long dash in the original document.

I browsed the National Archives web site, and found that they offer two downloadable images, one a photograph of the original parchment, and another of the 1823 engraving by William Stone, both apparently in the public domain.

So I took a few minutes of my Fourth of July holiday to set up a CITE Image Service where you can browse and create citable references of the images. Here is the detail of the crucial passage in the photograph of the parchment:
In the Image Collection I created this afternoon, this detail can be cited generically with this URN

urn:cite:mid:natarchimgs.Declaration_Pg1of1_AC@0.472,0.1872,0.082,0.0213

and the URN can also be resolved to see the detail in context.

Contrast the Stone engraving:

1823 engraving

(citable as urn:cite:mid:natarchimgs.Declaration_Engrav_Pg1of1_AC@0.465,0.1919,0.076,0.0177, and viewable in context here)
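Both URNs above have the same shape: an image identifier, then `@`, then four numbers. The Python sketch below separates the two parts. It assumes that the four values are left, top, width and height expressed as fractions of the image's dimensions; that reading is my assumption, as are the function names and the pixel dimensions passed to `to_pixels`.

```python
from typing import NamedTuple, Optional, Tuple


class ImageCitation(NamedTuple):
    image_urn: str                                     # URN without the region
    roi: Optional[Tuple[float, float, float, float]]   # None for the whole image


def parse_image_urn(urn: str) -> ImageCitation:
    """Separate an image URN from its optional region of interest."""
    image_urn, sep, region = urn.partition("@")
    if not sep:
        return ImageCitation(image_urn, None)
    values = tuple(float(v) for v in region.split(","))
    if len(values) != 4 or not all(0.0 <= v <= 1.0 for v in values):
        raise ValueError(f"expected four values between 0 and 1: {region!r}")
    return ImageCitation(image_urn, values)


def to_pixels(roi, width: int, height: int):
    """Scale a fractional region to pixels (assumed order: x, y, w, h)."""
    x, y, w, h = roi
    return (round(x * width), round(y * height),
            round(w * width), round(h * height))


detail = parse_image_urn(
    "urn:cite:mid:natarchimgs.Declaration_Pg1of1_AC@0.472,0.1872,0.082,0.0213"
)
print(detail.image_urn)
# Hypothetical image size, chosen only to show the scaling step.
print(to_pixels(detail.roi, 10000, 12000))
```

Because the region is stored as fractions rather than pixels, the same citation stays valid for any rendering of the image, whatever its resolution.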

With references like this, it would be easy to cite other examples in the document of periods and long dashes, much as participants at last week’s Homer Multitext seminar collated evidence to interpret features of the oldest extant manuscript of the Iliad.

Conclusions? The parchment of the Declaration is hard to read, but paleography is important, and the CITE architecture that was originally created for the Homer Multitext project can be applied to any sort of paleographic problem.

More reasons to love markdown plus critic markup

2014-05-10T07:45:00.001-07:00

Deadlines for senior projects mean that in addition to the interesting challenge of how to submit genuinely replicable digital scholarship to the library's institutional repository, it's time to generate pdfs so that the Graphic Arts Department can bind something for the library shelves. The projects I'm advising are formatted in markdown extended to support citation by scholarly URN (what I'm calling "citedown"). We wanted to create markdown source that could be used with leanpub, beautiful docs, or pandoc, so the automated workflow has to handle some potentially complex issues resolving URNs, downloading local copies of embedded images and rewriting references to them, etc.

I had been using critic markup for editorial questions and copy editing, but with one eye on the calendar, I wanted to test the pdf workflow before we had a complete draft with all critic markup resolved.

To my surprise, when we used pandoc to lay out the text with a LaTeX book structure, it recognized the critic markup and formatted it in the resulting pdf! Comments default, appropriately, to a screaming magenta that could have been taken from a 1990s GIS palette. (Anyone who forgets to run their automated process to find and resolve critic markup will have a hard time missing these.)

Pandoc has always been a major reason to love markdown's simplicity. Now it's one more reason to consider the combination of markdown plus critic markup.

Publishing digital scholarship

2014-04-30T15:06:00.000-07:00

The Holy Cross Manuscripts, Inscriptions and Documents Club (HC MID) has extended its projects’ routine working practice to include a plan for publishing replicable digital scholarship. Generally, projects in HC MID generate three kinds of related material that need to be accounted for in a publication:

archival material: TEI-conformant editions of texts, and a variety of data sets in simple delimited text formats
analytical material: expository prose in markdown, using URNs to refer to all citable resources
source material for user-interfaces: interactive presentations of analytical and archival material as servlets.

All projects in the club already use git for version control in public repositories hosted on github, so it is straightforward to identify and retrieve a specific version of any repository, whether it hosts archival data sets, expository writing, or source material for servlets. In assembling a servlet for end users, the club’s projects use gradle as their build system. Typically, a build task verifies the contents of the archive with a series of automated tests, then generates an RDF graph of the entire project using the citemgr build system. Additional tasks can load the resulting RDF graph into an RDF server, and start up a servlet that knows how to talk to the RDF server. The entire sequence can be reduced to a single shell script; some projects have even put a boot script of this type on a cron job that rebuilds the project graph and servlet nightly. Every step in the trip from our github repositories to a running application can be fully automated.

The Holy Cross Libraries have recently begun hosting an institutional repository for digital scholarship. Because members of HC MID have presented their work at several international conferences and seminars this year, the library offered to include this work in the new institutional repository, but could not realistically plan to support separate running applications for every digital project that might ever be developed at the college. How can we publish to others fully functional replicas of our digital work through institutional repositories of this kind?

Our solution involves only one addition to our normal working routine. With the emergence of systems like Vagrant, it is simple to define a virtual machine configured with exactly the resources that a particular project requires. We create one further git repository, but it is the smallest of all, since in most cases it consists of little more than a Vagrant file and a shell script to provision the virtual machine. Given that Virtual Box is freely available for essentially any host platform, we can reduce the problem of how to replicate our projects to:

be sure you have Vagrant and Virtual Box installed
download our virtual machine repository, and run vagrant up in its root directory

Our library can happily host versioned releases of these simple git repositories, and add the metadata, indexing and search services that library staff has expertise in. We change nothing in our regular work flow, and add a virtual machine definition when we are ready to publish a release of a project.

As we gain practical experience with this approach to replicable publication, perhaps we will discover shortcomings we do not recognize yet, but as we approach the end of the spring term for 2014, the combination of github version control, automated build systems, virtual machines and institutional repositories seems to cover the complex requirements of publishing digital scholarship as effectively as anything I am familiar with. It cleanly isolates distinct concerns, and relies on generic solutions where they are available.

In any case, the design is not hypothetical. Three projects will publish release versions of their work after the spring semester ends at Holy Cross.

Get funding for your DH project?

2014-04-01T05:48:00.000-07:00

Try these regular expressions on the embedded youtube video:

s/our company/our project/g
s/your company/IT staff/g
s/expert/developer/g

Specs + tests for CTS

2014-03-30T13:42:00.000-07:00

In February, Chris Blackwell and I released a release candidate version of the CTS protocol specification, 5.0. Today, we are releasing a second release candidate, in parallel with a suite of tests packaged with a servlet that can run the tests and format the resulting report in a web page.

We are currently working on a third release candidate taking account of all the helpful comments we have received so far on rc.1, and plan to continue coordinating releases of the CTS protocol specification with parallel test suites. We expect that rc.3 will be the last candidate version before a final CTS 5.0.

All our released work on the CITE architecture now belongs to a cite-architecture group on github. For a guide to our repositories, see the organization home page on github.

Visualization from CITE URNs + d3 + hive maps

2014-03-26T11:34:00.002-07:00

Many software packages make it relatively easy to create visualizations of complex networks of data, but often produce hairballs that tell us more about the visual layout algorithm than the structure of the network. Martin Krzywinski has proposed an alternative, called hive plots: lay out your nodes along a series of axes that you know have meaning in your network, and explore the network visually from there. Mike Bostock, predictably, has done gorgeous interactive work with Krzywinski’s idea in the d3 javascript library.

I created my first hive plot this morning using d3. The screen shot above illustrates a project by Megan Whitacre (Holy Cross ’14) annotating a series of illustrated inscriptions for use in teaching introductory Latin. The five axes are (clockwise beginning from the blue axis at the top) 32 broad grammatical concepts, 71 narrower topics about the morphology of substantives, 55 topics about verbal morphology, 9 syntactic topics and, along the purple axis, 103 images. All of Megan’s annotations are expressed with CITE URNs; this makes it straightforward both to gather all references to the same image and to apply her region of interest to highlight areas of the image. d3 practically begs for interactive displays, so you can highlight nodes or edges to see further information, or can click on image nodes to see the image with linked, highlighted areas for all references to the image.
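To make the "gather all references to the same image" step concrete, here is a minimal Python sketch. The annotations and URNs are invented for the example (they are not Megan's data), and `group_by_image` is a hypothetical helper; the only thing it relies on is that the part of the URN before `@` identifies the image and the part after it identifies the region.

```python
from collections import defaultdict

# Hypothetical annotations: (grammatical topic, image URN with region of interest)
annotations = [
    ("ablative absolute", "urn:cite:example:inscriptions.img1@0.10,0.20,0.30,0.05"),
    ("perfect tense", "urn:cite:example:inscriptions.img1@0.10,0.40,0.25,0.05"),
    ("ablative absolute", "urn:cite:example:inscriptions.img2@0.05,0.15,0.50,0.10"),
]


def group_by_image(pairs):
    """Map each image URN to the (topic, region) pairs that cite it."""
    grouped = defaultdict(list)
    for topic, urn in pairs:
        image, _, region = urn.partition("@")
        grouped[image].append((topic, region))
    return dict(grouped)


by_image = group_by_image(annotations)
for image, refs in by_image.items():
    print(image, len(refs))
```

Each key in the result corresponds to one image node on the plot, and the list under it supplies both the edges to draw and the regions to highlight.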

There is plenty of room for improvement. Selecting a node or edge really ought to select all direct connections to it as well, and hovering should use Megan’s rdf:Label values, instead of the raw CITE URN, to identify the node, to name two obvious desiderata. But as an initial effort put together between second cup of coffee and lunch break, it’s hard to be disappointed with it. It underscores for me that as our tools improve, it becomes more and more important to have properly structured and properly citable data.

(The screen shot is linked to a live version of the graph.)

bl.ocks rocks

2014-03-03T05:19:00.000-08:00

http://bl.ocks.org/ occupies an interesting space in the overlap of coding and writing. It lets you simultaneously view the rendering of a source page and its source code, together with commentary in the form of a README. Each bl.ock is defined simply as a github gist that follows the naming convention README.md for commentary (in markdown), and index.html for the source file to be both rendered and displayed in source view. This is extraordinarily powerful when index.html is a single-page web application, of the kind that D3 (http://d3js.org/) encourages you to build, and Mike Bostock, the main developer of D3, just happens to be the inventor of bl.ocks as well.

bl.ocks are a great way to pull away the curtain and illustrate how a particular analysis or visualization works, and studying other people’s bl.ocks can be a fast route to learning a new technique.

From work with Christine Bannan on the Phoros project, I’ve put up this bl.ock as we begin to map changing patterns of Athenian tribute over time:

Environments for collaborating

2014-01-27T04:09:00.000-08:00

Cloud-based services have made collaborating on scholarly material so much easier in the last few years that it’s hard to remember how onerous it used to be. (Raise your hand if you have ever hosted your own shared version control system.)

github is a prime example. In addition to version control, github provides each repository with a wiki, issue tracking and other services that you can use entirely through your web browser. Edit version-controlled files through the browser or in the comfort of your own computer’s OS, and push them back to a shared repository.

While github solves nearly every challenge of collaborating on static files or data, it does not directly address the question of how to share computational processes. How do we share with collaborators when the goal is not to show the results of a process, but to share the process itself? This, like so many technical challenges in humanities scholarship, is a problem we have in common with programmers who have to collaborate on writing code, and who have kindly provided us with the solution.

Virtual machines are half of the answer. Consumer-level hardware and VM software have reached the point where we can realistically say, “No matter what OS you’re actually using, we’ll just use a VM so we can all work on this project in Ubuntu 12.04.”

The other half of the answer is a system like vagrant. Vagrant provides a way to specify the configuration of your virtual machine, and can work with many VM systems, including the freely available VirtualBox. The specification is expressed in a simple text file — ideal for sharing from your github repository! So starting from scratch, new collaborators can perfectly replicate the system you run in your project in these steps:

Make sure git is installed on their machines: http://git-scm.com/
Install virtual box on their machines: https://www.virtualbox.org/
Install vagrant on their machines: http://www.vagrantup.com/
Run this vagrant command: vagrant gem install vagrant-vbguest
Clone your project's git repository including a Vagrantfile specifying the configuration of your virtual machine

At this point, they can begin any work session within your repository directory by running

vagrant up

to start the virtual machine. (The first boot will be slow as the virtual machine is downloaded and built; after that, it’s tolerable.)
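A new collaborator could confirm the prerequisites before the first `vagrant up` with a short check like the Python sketch below. It assumes the three tools are reachable on the PATH under the command names `git`, `VBoxManage` (VirtualBox's command-line tool) and `vagrant`; on some systems VirtualBox installs without adding itself to the PATH, so a "missing" report is a prompt to look, not proof that the tool is absent. The helper name `missing_tools` is my own.

```python
import shutil

# Command names assumed to be on PATH after installation; VBoxManage is
# VirtualBox's command-line tool.
REQUIRED_TOOLS = ["git", "VBoxManage", "vagrant"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Please install: " + ", ".join(missing))
    else:
        print("Ready: run 'vagrant up' in the project directory.")
```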

What is perhaps most remarkable about this sequence is that it imposes only two prior technical requirements on new collaborators: they must be able to use a web browser to download and install virtualbox and vagrant (and git, if they have not already done so); and they must be able to find a terminal or console where they can run a vagrant command. If that’s too much to demand, maybe it’s time for them to reconsider whether they’re really interested in collaborating on a digital scholarly project.

Markdown everywhere

2014-01-20T13:14:00.000-08:00

Think there's a little momentum behind markdown lately?

This article from Mashable is already half a year old, and lists seventy-eight (78!) tools for "writing and previewing markdown"! And its topic doesn't even extend to some of the very interesting services that use markdown, like leanpub and draft, or any of the numerous markdown-to-slideshow toolkits out there...

I'm convinced enough that I've just completed an initial version of a tool for working with markdown extended to allow citation using canonical URN values, and converting the source to generic markdown that any of these tools can process. When I've polished the docs a little more, I'll post here with further notes on markdown and its increasing importance for scholarly work.

Designing scholarly publications: some lessons we can take from programmers

2014-01-07T13:56:00.002-08:00

Scholarly publication involves more than just making work accessible. When scholars publish, they are contributing their work to the collective endeavor of the entire scholarly community. In order for other scholars to inspect, critique, and build upon published scholarship, it must be appropriately:

identified
verified
structured for reuse
licensed for reuse

Scholars are fortunate that all of these requirements are shared by coders, who have consequently developed well established practices and tools to satisfy each of them. The infrastructure that programmers rely on is especially significant for digital scholarship because it has been designed for automated interaction. For humanists, the shift from creating scholarship designed for manual processing to scholarship designed to be used through the mediation of software and hardware is often an enormous challenge. My experience working with many collaborators on the Homer Multitext project (HMT) has convinced me that we can greatly accelerate the progress of our digital scholarly work by learning from decades of experience in software development. (In follow-up posts, I'll illustrate some working examples that exploit the design of the HMT's digital publications.)

Identifying a digital publication

Any unit of publication must be clearly identified with a specific, fixed version. For a print monograph, this might be an edition number, and possibly also a printing; journals are normally identified by a date (year, quarter, or other cycle) and a volume or other serial count. A library catalog might then resolve that reference to a storage location identified by a call number. Digital publications likewise must be uniquely identified in a system that recognizes different editions or versions, and permits automated resolution of identifiers to a storage location.

Consider what you would typically do if you were writing a Java program, and wanted to use Saxon (a library for processing XSLT). You can specify the library by its Maven coordinates, giving its "publisher," net.sf.saxon, the name of the "publication," saxon-dom, and an "edition," or version, number (e.g., 8.7). You would then rely on an automated build process to retrieve a local copy from a repository that recognizes the identifier. With repository management systems such as Nexus, scholars can use exactly the same system of Maven coordinates to make published material automatically retrievable. The Homer Multitext project, for example, plans to use a Nexus repository to publish the project's archive of editorial work three times a year. The publications belong to the group org.homermultitext, and will include a publication named hmtarchive; versions will have names like 2014.1 (the first publication of the year 2014).
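The "automated resolution of identifiers to a storage location" can be sketched in a few lines of Python. The function below follows the standard Maven repository layout, in which the group's dots become directory separators and the file is named artifact-version. The class and function names are mine, and the `zip` extension for the HMT archive is an assumption made only for the example.

```python
from typing import NamedTuple


class Coordinates(NamedTuple):
    group: str      # the "publisher"
    artifact: str   # the "publication"
    version: str    # the "edition"


def repository_path(c: Coordinates, extension: str = "jar") -> str:
    """Build the relative path used by the standard Maven repository layout."""
    group_dirs = c.group.replace(".", "/")
    filename = f"{c.artifact}-{c.version}.{extension}"
    return f"{group_dirs}/{c.artifact}/{c.version}/{filename}"


saxon = Coordinates("net.sf.saxon", "saxon-dom", "8.7")
hmt = Coordinates("org.homermultitext", "hmtarchive", "2014.1")

print(repository_path(saxon))
# The packaging of the HMT archive is assumed here, not documented above.
print(repository_path(hmt, extension="zip"))
```

Appending such a path to a repository's base URL is all a build tool needs to fetch one specific, fixed version of a publication.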

Verifying a publication

One of the distinguishing features of publication is that it has undergone some form of review. The review process evaluates work which, in principle, ought to be replicable. Review of digital scholarship should include not only manual evaluation, but, where applicable, automated tests assessing the data. All of us already apply an automated test whenever we run a spell checker over a text: when we produce digital work with more complex structure than simply a stream of words, more extensive digital tests are called for, and ought to be included as part of the publication.

One valuable idea we can take from programming practice is "test-driven development." In test-driven development, the programmer specifies an automated test before beginning to write some section of a program, and works on the program until it passes the test. Of course in conventional scholarly work, we evaluate work in progress as we go along: we don't submit something for review that we have not thoroughly reviewed ourselves. But applying a test-driven approach to the editorial work of the HMT has been an eye-opening experience. Because it compels us to reckon with "minor" irregularities we might otherwise gloss over, it can expose assumptions needing more critical examination. In the HMT, one test we apply after tokenizing our edited texts, for example, is a morphological analysis of all lexical tokens, on the assumption that failures will represent either Byzantine orthographic practices unrecognized by our parsing system, or errors in our edition. We were surprised to discover a third explanation: a number of technical terms appearing in the scholia are not in fact in the standard Greek lexicon by Liddell and Scott, and therefore failed to parse. When we retroactively applied our morphological tests to sections that had been edited before we adopted a full test-driven approach, we uncovered further examples that, as isolated cases, editors had not noticed.
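The shape of such a test can be sketched in Python. This is not the HMT's test suite: the "parser" below is a toy lookup in a three-word list invented for the example, standing in for a real morphological analyzer, and all function names are hypothetical. What it shows is the workflow, in which every token is checked and each failure is handed back to an editor to classify.

```python
import unicodedata

PUNCTUATION = ".,;·"


def normalize(token: str) -> str:
    """Strip surrounding punctuation and use one Unicode form for comparison."""
    return unicodedata.normalize("NFC", token.strip(PUNCTUATION))


# Toy stand-in for a morphological parser's vocabulary (invented word list).
KNOWN_FORMS = {normalize(w) for w in ("μῆνιν", "ἄειδε", "θεά")}


def parses(token: str) -> bool:
    """Pretend morphological analysis: succeed only for known forms."""
    return normalize(token) in KNOWN_FORMS


def unparsed_tokens(text: str):
    """Return every token that fails analysis, for an editor to review."""
    tokens = [normalize(t) for t in text.split()]
    return [t for t in tokens if t and not parses(t)]


failures = unparsed_tokens("μῆνιν ἄειδε, θεά, Πηληϊάδεω")
print(failures)
```

A failure here means only "look at this token": whether it turns out to be an unfamiliar spelling, an editorial slip or a gap in the lexicon is a scholarly judgment the test cannot make.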

Structuring for reuse

Complex computer programs are possible in part because reusable units of code encapsulate solutions to individual problems. For example, I should never again have to spend my time writing a program to translate ancient Greek from one encoding system to another, because I can rely on Hugh Cayless' Epidoc transcoding library. Hugh's code has a clean interface: define the system you're translating from, the system you're translating to, and then the `getString` method hands you your result.

One of the major challenges humanists need to address today is how to design APIs to digital scholarship. What are the appropriate components or methods, and how should they be identified? At a minimum, a scholarly publication should address this question in two ways:

Citations of source material should be expressed in technology-independent but machine-actionable notation. In the absence of an alternative that fully accomplishes this, the Homer Multitext project has developed a URN notation for texts (CTS URNs) and for discrete objects (CITE Object URNs), as well as an extension to CITE Object URNs for resolution-independent citation of regions of interest on an image.
If sections of the publication itself can be cited, they too should be addressable with CTS URNs based on some logical unit (and not by accidental physical features such as page numbers).

Licensing for reuse

The act of publication alienates a work of scholarship from the author in a form that others can use, and contributes it to the scholarly community. In addition to an appropriate technological design, scholarly publications must therefore be available under an appropriate license that must allow at least non-commercial reuse. For programmers, the leading such license is the GNU General Public License (GPL), and this is ideal for source code included in scholarly publication; for other kinds of digital data, the Creative Commons project has defined Attribution-ShareAlike licenses that achieve the same aim.

Highly trained attorneys around the world have contributed their time and expertise to developing these licenses, and in many instances tailoring them to the specific requirements of local legal systems, as well as translating them into a large number of languages. The easiest part of designing your publication should be taking advantage of their work and applying one of these licenses to your work.

The APA privatizes, too

2013-12-30T03:45:00.003-08:00

Unlike many of my colleagues and friends in Classics Departments around the US and abroad, I will not be travelling to Chicago this week for the annual meeting of the American Philological Association. The APA continues to accept donations to a recently completed capital campaign with the goal of supporting a digital "Center for Classics Research and Teaching." (See the description here.) The APA claims that its center will "make high quality information about the Classical World available in accessible formats to the largest possible audience by using technology in new and exciting ways," but has never clearly addressed the fact that, as proposed, the center will include material for APA members only.

Like Elsevier and some other distributors, in other words, the APA wants to control who can read scholarly work as part of its "business model." Like Elsevier, the APA leadership is doubtless sincere in its belief that its "business model" is paramount. But like Elsevier, the APA winds up in a Wonderland, where, with Humpty Dumpty, we can make words mean whatever we choose. The idea that closed-access material could be available to "the largest possible audience" is ludicrous. In 2012, over a billion IPv4 addresses were in use, and, while difficult to estimate, the number of individual internet users is certainly much higher. It must exceed the APA's membership by at least five orders of magnitude. (That is, the number of internet users is surely at least 100,000 times greater than the number of APA members.)

More simply, like Elsevier, the APA's plan privatizes scholarly work that should be published. In criticizing Elsevier's business practices, I argued that

Scholarly publication in a digital world means that a work is openly accessible for others to inspect, critique, and build upon, and we should insist that in reviews for tenure and promotion, only scholarly publications meeting this definition qualify as published work.

We should hold professional organizations to the same standard.

Unfortunately, just as these essential scholarly values are often ignored in reviews of individuals for tenure and promotion, they are often likewise neglected in evaluation of funding requests from educational institutions, federal programs and private philanthropic organizations. There is no quick or easy way to change these entrenched practices that directly oppose the basic working method of scholarship. But I have the choice not to become a member of (and support with my membership fees) an organization that is building a system of information apartheid.

If you are at the APA this week, try to get a clear answer to a yes/no question: will the APA's digital publications be openly accessible for others to inspect, critique, and build upon?

Elsevier does not publish: it privatizes

2013-12-19T05:44:00.000-08:00

If you were shocked that Elsevier has apparently issued a takedown notice to the University of Calgary, you should consider auditioning for Claude Rains' role in Casablanca. Elsevier has never hidden the fact that its business model depends on restricting access to scholarly work. Alicia Wise of Elsevier responds to the post linked above with this question:

the business model is based largely on paid access post-publication, and if freely accessible on a large scale what library will continue to subscribe?

The question may be sincerely intended, but its logic is straight from Alice in Wonderland: if Elsevier cannot profit by making scholarship publicly available — that is, by publishing it — then it must privatize the information, and sell access only to clients who cede to Elsevier control over who may read the scholarly work.

The intellectual roots of western scholarship reach back to ancient Greece, and the radical idea that scholarly understanding is not determined by political or social power. (This is exemplified in the famous story of Euclid telling his patron and monarch, Ptolemy, that "there is no royal road to geometry.") In our modern academic institutions, publication exposes scholarly work to public scrutiny, and serves in part to ensure that scholarly claims are not based on power over information.

Elsevier and others subvert this fundamental scholarly activity when they privatize scholarship, a simple fact that we obfuscate when, with an Orwellian twist of language, we call it "publication." It is true that scholars who freely hand over their work to privatizers make the system possible, but who can blame an untenured faculty member who will be rewarded for contributing to the dysfunction?

We should instead unambiguously reiterate that scholarly publication in a digital world means that a work is openly accessible for others to inspect, critique, and build upon, and we should insist that in reviews for tenure and promotion, only scholarly publications meeting this definition qualify as published work.

How quickly would Elsevier's pool of submissions dry up if enough universities adopted and enforced such a requirement for real scholarly publication?

What humanists do

2013-12-11T05:30:00.002-08:00

I recently stumbled across an interview with the very articulate Astronomer Royal Martin Rees that included this observation:

But the aim of science is to unify disparate ideas, so we don't need to remember them all. I mean we don't need to record the fall of every apple, because Newton told us they all fall the same way.

(The full transcript of the interview is here, under the arresting title "Cosmic Origami and What We Don't Know.")

I think that this remark really captures a quintessential difference between the natural sciences and the humanities. Humanists, too, unify disparate ideas, but we must record each unique phenomenon that we study. If we develop a unified view of oral poetry, for example, we will never conclude that "I'm familiar with the Iliad, so I don't have to remember the Odyssey," or "I've studied Greek poetry so I don't need to know about the Serbo-Croatian oral poetry that Parry and Lord recorded." We don't study apples. Recording and remembering are basic to scholarship in the humanities.

This has important implications for how we work in a digital world. We record and remember through citation, so before anything else we must develop a sound infrastructure for citation.

markdown + criticmarkup

2013-10-25T07:57:00.002-07:00

It's been a year since I last posted about markdown, and in that time, the number of interesting applications and services has continued to grow rapidly. (If you use markdown and haven't looked at leanpub yet, you owe it to yourself to take a peek!)

I've been looking recently at using criticmarkup together with markdown, and it seems really promising. Editorial on the iPad and both Multimarkdown Composer and Marked2 on OS X support displaying criticmarkup within your editor.

It would be nice to have equally convenient ways to automate accepting or rejecting suggested additions, deletions or changes, but the OS X system services in the criticmarkup site's toolkit did not work correctly when I installed them, so I'm gisting a couple of perl scripts that accept or reject criticmarkup in standard input.

gist to accept criticmarkup
gist to reject criticmarkup
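For readers who prefer Python, the same idea can be sketched as follows. This is an independent illustration, not a port of the perl gists: it handles only the three CriticMarkup constructs that propose changes (additions `{++ ++}`, deletions `{-- --}` and substitutions `{~~ ~> ~~}`) and leaves comments and highlights untouched. The function names and the sample sentence are mine.

```python
import re

ADDITION = re.compile(r"\{\+\+(.*?)\+\+\}", re.DOTALL)
DELETION = re.compile(r"\{--(.*?)--\}", re.DOTALL)
SUBSTITUTION = re.compile(r"\{~~(.*?)~>(.*?)~~\}", re.DOTALL)


def accept(text: str) -> str:
    """Apply every suggested change: keep additions, drop deletions."""
    text = ADDITION.sub(r"\1", text)
    text = DELETION.sub("", text)
    return SUBSTITUTION.sub(r"\2", text)


def reject(text: str) -> str:
    """Discard every suggested change: drop additions, keep deletions."""
    text = ADDITION.sub("", text)
    text = DELETION.sub(r"\1", text)
    return SUBSTITUTION.sub(r"\1", text)


draft = "The {~~quick~>swift~~} fox{++ quietly++} jumped{-- over--}."
print(accept(draft))   # The swift fox quietly jumped.
print(reject(draft))   # The quick fox jumped over.
```

To use it as a filter on standard input, as the gists do, one would wrap either function around `sys.stdin.read()`.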

It shouldn't be a big job to turn those into system services. I've never used OS X's Automator, but maybe that would make a good afternoon project...?

The instrumentalist fallacy and academic publication

2013-09-28T06:33:00.004-07:00

It is easy to confuse a tool with the task it is supposed to accomplish. This does not necessarily cause problems when the tool and task are perfectly matched, but easily leads to misapplication of the tool. A bicycle is a wonderful means of transportation, but should not be used to travel across bodies of water, for example. I call this confusion "the instrumentalist fallacy," and I deal with it daily in my digital scholarly work.

The academic review process has institutionalized the instrumentalist fallacy in some especially harmful ways. I read this account of a group of mathematicians who used a github repository to coauthor a book: anyone can clone their source, and offer improvements for the authors to pull into a subsequent version. One of the principal authors has an enlightening post about the process here.

Note the contrast that both the Wired interview and the author's blog post point out: the authors chose an open collaborative process because it resulted in better scholarship, but understood that they would receive less professional recognition or credit for it.

The instrumentalist fallacy equates the instrument (the traditional publishing process) with its goal, vetting the quality of scholarly work. Is it too radical to suggest that the way to assess the quality of a scholarly publication might be ... to read it?


What scholarship looks like

2013-08-19T08:12:00.000-07:00

The Leipzig "Open Philology" workshop reinforced a fact that I (re)learn constantly from my work advising Holy Cross' "Manuscripts, Inscriptions and Documents Club": that the most important changes brought about to scholarship by new technology are not technological, but intellectual and social.

It's not easy for someone of my generation to imagine how significant research in Classics can be collaborative, and can engage people of a wide range of ages (even people without university-level degrees, something my training conditions me to view as a heresy), but there's no mistaking it when you get to watch it happen. In Leipzig, the best example was "Team Croatia": five participants from Zagreb, led by their gifted teacher and scholar, Neven Jovanović (far right in the photo below).

A mediocre cell-phone snap shows what this kind of activity can look like: two computers, but one temporarily ignored as three pairs of eyes focus intently on the same screen. A single pair of hands is not enough to capture the action in real time: if this were a piano composition for four hands, this movement would be marked "presto".

If we're going to lay a digital foundation for classical studies, this is the kind of team that will make it happen.

Update: thanks to Neven for helping me correctly spell the names of Team Croatia: Juraj Ozmec, Željka Salopek, Jan Šipoš and Anamarija Žugić. (Pictured above with Neven: Anamarija Žugić and Juraj Ozmec).

Milk and honey in Leipzig

2013-08-19T04:49:00.003-07:00

I took part this month in the Leipzig "Open Philology Workshop" organized by Greg Crane. While I was only able to participate in some of the changing three-ring circus of events, I got a view onto the promised land. Out of the many highlights of the workshop, here are three that are individually significant, and, taken together, will have enormous consequences for classicists.

1. A billion words of Greek

I worked with a large team planning to digitize a billion words of Greek. Thanks in no small part to work by Bruce Robertson and Federico Boschetti improving OCR of polytonic Greek, we designed a detailed workflow automating many of the steps in moving from a physical volume in a library to an openly licensed, citable, digital edition.

We live in a very different world than just a few years ago. When the costs of digitization were extremely high, both private interests (like publishers) and academic projects (even projects with the sponsorship of professional organizations and funding from national agencies) successfully persuaded individuals and libraries to give up their scholarly freedom (along with, of course, exorbitant licensing fees) for access to proprietary data banks of texts. Without the same barriers of cost, we can now insist instead on digital corpora comprising the kinds of texts we should always have demanded: structured for scholarly citation, and licensed for scholarly reuse. At this point, whether the Billion Words project literally achieves its goal of digitizing 10^9 words of Greek over the next five years is immaterial: when the first digital edition comes out of that pipeline, we can begin to put behind us the historically brief but shameful aberration when we thought it was acceptable to trade away our freedom to read and share classical texts in exchange for more convenient access to ancient Greek for a privileged few.

2. Perseus lexical inventory and morphology services

Bridget Almas and Marie-Claire Beaulieu are extending the Perseus lexical inventory and morphological services to keep each in sync with the other as they are dynamically edited.

This is exceptionally important, and indeed urgent, precisely because of the Billion Words project. As the contents of its new digital editions can be automatically tested, we will be able to extend the lexicon when unattested material appears, and improve the morphological analyzer when it fails to recognize valid forms. Not only will the Billion Words project improve the lexical inventory and morphological analyzer: repeating automated testing of the Billion Words corpus with the iteratively updated inventory and analyzer will allow the Billion Words project to state with unprecedented clarity what levels of validation each work in its corpus has passed.
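The feedback loop described above can be sketched in a few lines of Python. This is only a toy under stated assumptions: a plain dictionary stands in for the morphological analyzer, a set stands in for the lexical inventory, and the function name validate and the three report categories are hypothetical, not part of any Perseus service.

```python
def validate(tokens, analyzer, lexicon):
    """Sort tokens by the first check each one fails, if any.

    analyzer: dict mapping a surface form to its lemma (hypothetical
              stand-in for a morphological service).
    lexicon:  set of known lemmata (hypothetical stand-in for the
              lexical inventory).
    """
    report = {"valid": [], "unanalyzed": [], "not_in_inventory": []}
    for token in tokens:
        lemma = analyzer.get(token)
        if lemma is None:
            report["unanalyzed"].append(token)        # analyzer needs work
        elif lemma not in lexicon:
            report["not_in_inventory"].append(token)  # inventory needs work
        else:
            report["valid"].append(token)
    return report


# Toy data: surface form -> lemma for three words of Iliad 1.1.
forms = {"μῆνιν": "μῆνις", "ἄειδε": "ἀείδω", "θεά": "θεά"}
tokens = list(forms)
lemmata = list(forms.values())

# Start with an incomplete analyzer and an incomplete inventory.
analyzer = {tokens[0]: lemmata[0], tokens[1]: lemmata[1]}
lexicon = {lemmata[0]}
first_pass = validate(tokens, analyzer, lexicon)

# An editor reviews the failures and extends both resources,
# then the same corpus is tested again.
lexicon.update(lemmata)
analyzer[tokens[2]] = lemmata[2]
second_pass = validate(tokens, analyzer, lexicon)

print(first_pass)
print(second_pass)
```

Each pass yields a statement of exactly which tokens have cleared which check, which is the kind of per-work validation report the paragraph above anticipates.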

3. A text citation tool

I was caught completely by surprise by Hugh Cayless' work on a javascript tool letting users select arbitrary pieces of (or even points in) a TEI document displayed in a web browser. While the CTS URN notation can easily express such arbitrary ranges of text, the challenges in building an interface highlighting spans of text that can cross multiple XML element boundaries and that might start and end in elements that do not constitute well-formed XML are so difficult that I would have said it was impossible to implement practically for real, complex texts.

Characteristically, Hugh showed a working implementation that was visually appealing, very responsive, and worked flawlessly on exceptionally complex passages from Servius' commentary on the Aeneid. So much for my scepticism. Equally characteristically, while Hugh's initial use case was a very limited application, he recognized the generality of the problem he had solved, and plans to fork the citation tool as a separate project that can express selections as CTS URNs. Chris Blackwell and I look forward to packaging Hugh's TEI Text Citation Tool along with Chris' Image Citation Tool as part of the standard suite of CITE services and utilities that we work with on the Homer Multitext project.
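To make concrete what expressing a selection as a CTS URN involves, here is a small Python sketch that splits a URN into its parts. It reflects my understanding of the notation (colon-separated namespace, work hierarchy and passage; a hyphen joining the two ends of a range; an "@" subreference picking out a string inside a citable node). The function name parse_cts_urn and the example URN are illustrative assumptions, and the sketch ignores refinements such as indexed subreferences.

```python
def parse_cts_urn(urn):
    """Split a CTS URN into namespace, work hierarchy and passage reference."""
    parts = urn.split(":", 4)
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    passage = parts[4] if len(parts) == 5 else ""

    # A range joins two passage references with a hyphen; each end may
    # carry an "@" subreference selecting text inside the citable node.
    pieces = passage.split("-") if passage else []
    ends = []
    for piece in pieces:
        node, _, subref = piece.partition("@")
        ends.append({"node": node, "subref": subref or None})

    return {
        "namespace": parts[2],
        "work": parts[3].split("."),  # textgroup, work, then any version
        "passage": ends,              # one entry for a point, two for a range
    }


# Illustrative URN: from a word in Iliad 1.1 to a word in Iliad 1.2.
example = "urn:cts:greekLit:tlg0012.tlg001:1.1@μῆνιν-1.2@οὐλομένην"
print(parse_cts_urn(example))
```

Nothing in the parsed result refers to XML elements or character offsets in a particular file, which is why a selection made in a browser can be recorded as a technology-independent citation.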

A whole greater than the sum of the parts

Bruce Robertson, Bridget Almas, and Hugh Cayless have long track records as three of the most talented contributors to the digital study of classics I have ever seen, so I suppose it is unsurprising that they would each, yet again, contribute something remarkable. What was different in Leipzig in August, 2013, was the synergy that their work illustrates. The internet can facilitate many kinds of collaboration, but nothing can fully replicate what happens when people sit in the same room, talk over coffee or dinner, and have unscheduled opportunities to follow up easily in further face to face conversations. While each of the three highlights I've chosen here deserves more discussion in future posts, consider their connections to each other: we can see the real beginnings of a vast digital corpus of Greek; the corpus is being automatically tested, and related to a citation-based inventory of Greek vocabulary, and to a morphological analyzer that can relate surface forms in the texts to lexical entities in the inventory; the moment the digital edition appears, a UI that runs in any web browser will let users cite any part of the corpus with technology-independent canonical citations.

Is there another discipline in the humanities that offers this kind of digital foundation in 2013? Perhaps, but I am not familiar with anything rivaling what I saw happening in Leipzig.

What's wrong with wikipedia

2013-05-26T11:31:00.001-07:00

The reason wikipedia, for all its usefulness, is absolutely wrong for scholarship in the humanities is not the fact that it's crowd-sourced. Contrary to what some people imagine, the problem is not the lack of a recognized editorial authority: to the contrary, the problem with wikipedia is precisely that its explicit editorial policy gets the authority of evidence in the humanities wrong.

I can't say it more succinctly than wikipedia itself does. I took the following screen grab today from the wikipedia article on "RDF Schema." If it's hard to read, here's a larger version. The text reads, "This article relies on references to primary sources. Please add references to secondary or tertiary sources."

This is not just slightly misdirected: it is 180 degrees off target. There is no way to misunderstand more completely the logic of an argument using evidence in the humanities.

Reading the Iliad in Worcester

2013-05-05T15:49:00.002-07:00

Friday was the next-to-last day of classes at Holy Cross. Driving home, I was thinking about how to respond in Monday's final meeting to some of the questions students in my intermediate Greek class have been raising. We have been reading the Iliad, most recently book 22. Perhaps they were conditioned to expect a simpler, Hollywood narrative, but many students were finding the complexity and ambiguity of the Iliad both more powerful and more challenging than they had expected. Several were troubled that when Achilles tells Hector, "Don't talk to me of 'agreements': lions and men don't make treaties; wolves and sheep don't have understandings" (22.261-22.263), he suggests that he and his hated enemy belong to different species. There is no possibility of human relation between the two of us, Achilles says, and the end will be bloodshed (22.264-22.267). But which of the two heroes does Achilles' simile really dehumanize?

When I crested the hill on Hammond Street, I was, unexpectedly, stuck in traffic. Main Street was completely blocked off, and a police detail was directing single lines of cars through the resulting jam. I didn't see any smoke, so I assumed it wasn't a fire, but it was obvious from the flashing blue lights and the line of TV "live-coverage vans" with their extended satellite dishes that something out of the ordinary had happened.

I only found out after I finally got home that the blockade was due to protestors outside Graham Putnam and Mahoney Funeral Parlors, the funeral home that has taken in the body of Tamerlan Tsarnaev. (For a brief profile of Peter Stefan, the remarkable director of Graham Putnam and Mahoney, see this column from the Worcester Telegram and Gazette.)

The angry crowd was protesting the idea of burying a mass killer.

So on Monday, we'll think about why the poem's final resolution arrives not in book 22 with the slaying of Hector, or in book 23 with the funeral games honoring Patroclus, but in 24.804:

ὣς οἵ γ᾽ ἀμφίεπον τάφον Ἕκτορος ἱπποδάμοιο.

So they saw to the burial of Hector, tamer of horses.

GUT

2013-04-17T07:25:00.001-07:00

"Grand Unification Theory" may be a touch grandiose, but the underlying libraries used in the Homer Multitext project now generate RDF statements that fully express all three types of CITE-architecture information: textual archives, archives of data collections, and indices relating citable objects to other citable objects or to raw data. There will be lots of interesting connections to explore in the resulting unified graph of scholarly material.

In parallel with this, I've now implemented the CTS protocol, the CITE Collections Service protocol, and its extension with the CHS Image protocol in servlets drawing on a SPARQL endpoint, so creating a complete CITE environment can be reduced to:

- build all RDF (automatically), and import into a triple store
- drop the three servlets for CITE services into a servlet container
- install the iipsrv fastcgi for working with binary image data. This is the most troublesome step on many platforms, but happily iipsrv is now available as a package under debian.

Not bad. Chris Blackwell is preparing an image for the < $50 raspberry pi with these requirements preinstalled: a complete CITE Box roughly the size of an Altoids container.

As we review the schemas used in the services this month, we'll begin looking at defining a more permanent RDF vocabulary. I'm not sure at this point if we need to break out a generic CITE vocabulary distinct from a specific HMT vocabulary, or whether one ontology will suffice. We'll be looking at other projects' work: thanks to Joel Kalvesmaki for pointing to the useful list here.
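To show why the choice of vocabulary matters, here is a toy Python sketch of index statements relating citable objects to each other. Everything named in it is an assumption made for illustration: the two predicates under example.org, the image and page identifiers, and the helper names triple and neighbors are all hypothetical, and none of it is the vocabulary or the output of the HMT libraries.

```python
def triple(subject, predicate, obj):
    """Format one statement as an N-Triples line (IRI subject and object)."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def neighbors(graph, urn):
    """List (predicate, other object) pairs for every statement touching urn."""
    found = []
    for subject, predicate, obj in graph:
        if subject == urn:
            found.append((predicate, obj))
        elif obj == urn:
            found.append((predicate, subject))
    return found


# Hypothetical predicates and identifiers, invented for this sketch.
ILLUSTRATED_BY = "http://example.org/cite/illustratedBy"
DEPICTS = "http://example.org/cite/depicts"
text_line = "urn:cts:greekLit:tlg0012.tlg001:1.1"
image = "urn:cite:example:images.img1"
page = "urn:cite:example:pages.page1"

# Two index statements: a text passage to an image, an image to a page.
graph = [
    (text_line, ILLUSTRATED_BY, image),
    (image, DEPICTS, page),
]

for statement in graph:
    print(triple(*statement))
print(neighbors(graph, image))
```

Because texts, collection objects and index entries all end up as statements of the same shape, a single query can walk from a line of text to an image to a manuscript page. The predicates are the only part that a generic CITE vocabulary or a specific HMT vocabulary would have to settle.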

CITE Collection Inventory

2013-04-14T10:54:00.000-07:00

In parallel with Friday's update to the schema for CTS text inventories, CITE Collection inventories now include an optional urn attribute on the schema for Collections. Bump your build system's dependency for the cite library up to 0.12.2 to include this change.

As with the CTS TextInventory, we plan to make the Collection inventory's urn attribute mandatory in 0.13, and will drop the parallel name attribute in 0.14.

Updating the CTS TextInventory schema

2013-04-12T07:57:00.000-07:00

Scott Mcphee points out the absurdity of a Canonical Text Service (CTS) definition that uses CTS URNs for all retrieval requests, but doesn't include CTS URNs in the service's TextInventory. The historical explanation for the inconsistency is embarrassingly simple: the TextInventory schema predates the invention of CTS URNs, and has not been revisited since! That oversight is rectified with today's release of version 0.12.1 of the CITE schemas package.

Ultimately, we want to arrive at catalog entries with urn attributes that look like this:

<textgroup urn="urn:cts:greekLit:tlg0012">
  <groupname xml:lang="eng">Homeric poetry</groupname>
  <work urn="urn:cts:greekLit:tlg0012.tlg001" xml:lang="grc">
    <title xml:lang="eng">Iliad</title>
    <edition urn="urn:cts:greekLit:tlg0012.tlg001">
      <label xml:lang="eng">Allen (OCT 1931)</label>
    </edition>
  </work>
</textgroup>

With release 0.12.1, the urn attribute is now optional but strongly recommended, alongside the previous projid attribute. With release 0.13.0, the urn attribute will be required, and the projid attribute deprecated. With release 0.14.0, the projid attribute will be dropped.
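If you want to find out ahead of 0.13.0 which of your catalog entries still lack a urn attribute, a check along the following lines may help. It is a sketch under assumptions, not part of the cite library: the function name elements_without_urn is my own, it looks only at textgroup, work and edition elements, and it assumes an inventory without an XML namespace, like the fragment above (a namespaced inventory would need qualified tag names).

```python
import xml.etree.ElementTree as ET

# Catalog elements expected to carry a urn attribute (assumed list).
CATALOG_ELEMENTS = ("textgroup", "work", "edition")


def elements_without_urn(inventory_xml):
    """Return the tag of each catalog element lacking a CTS URN."""
    root = ET.fromstring(inventory_xml)
    problems = []
    for element in root.iter():  # iter() visits the root as well
        if element.tag in CATALOG_ELEMENTS:
            urn = element.get("urn", "")
            if not urn.startswith("urn:cts:"):
                problems.append(element.tag)
    return problems


# Sample entry in which the edition has not yet been given a urn.
SAMPLE = """<textgroup urn="urn:cts:greekLit:tlg0012">
  <groupname xml:lang="eng">Homeric poetry</groupname>
  <work urn="urn:cts:greekLit:tlg0012.tlg001" xml:lang="grc">
    <title xml:lang="eng">Iliad</title>
    <edition>
      <label xml:lang="eng">Allen (OCT 1931)</label>
    </edition>
  </work>
</textgroup>"""

print(elements_without_urn(SAMPLE))
```

Under 0.12.1 a non-empty result is only a warning; once the attribute becomes required, the same list tells you which entries would fail validation.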

So grab cite-0.12.1-schemas.zip from our nexus repository to get started with a modern TextInventory identifying texts by URN. You can manually download a zip bundle from the repository, or update your maven coordinates with groupId "edu.harvard.chs", artifactId "cite" and version "0.12.1".

[Updated: bumped version from 0.12.0 to 0.12.1 after adding trailing slash to dc namespace as requested by Bridget Almas]

How hard is it to imagine "popular scholarship"?

2013-04-11T10:14:00.002-07:00

I heard an interesting talk yesterday at Clark University by Robert Anderson, former director of the British Museum, on "The British Museum and Library at the New Millennium": wonderful anecdotes from the early history of the museum, and a compelling argument for the essential intellectual unity of what museums and libraries do.

The British Museum Great Court.
Photograph by Eric Pouhier,
licensed under cc-by-sa license.

Two details troubled me. First, while the rare book library at Clark was filled, I saw only one student, and I probably fell well below the median age of the audience. The talk was sponsored by the "Friends of the Goddard Library," but if this audience was representative, the library won't have too many friends in a few more years.

Second, both Anderson's talk and some of the discussion afterward made some curious assumptions about scholarship. As the director at the time of the separation of the British Library from the Museum, and the opening of the fabulous facility at the new Euston Road location, Anderson offered insightful comments on the tensions of an institution committed both to free public access and to serving the needs of specialist scholars. He brought up a problem familiar to anyone who has worked at the BL recently: it's such a popular place that all the desks fill up early in the morning with students looking for a comfortable place to work (with free wifi and good coffee!), but who aren't necessarily taking advantage of any of the unique offerings of the British Library. This can impose a real hardship on people working on projects that depend on BL material. Two assumptions emerged in the discussion that struck me as odd: that the results of scholarly research would only be of interest to a small circle of specialists; and that digital material should be openly viewable, but scholarly research was being well served by a policy that allows free reuse of scholarly material only in print publications with a very limited print run.

Interior of the British Library.
Photograph by Maria Giulia Tolotti
licensed under cc-by-sa license.

Let's parse that logic a little more closely: scholarly reuse of BL material is OK as long as not too many people care to read it; and that's fine, because scholars' research is only of interest to a handful of other specialists, and expensive print media are an adequate way to meet this need. (The host's introduction of Anderson referred light-heartedly, in what was evidently intended to be humor, to the fact that his most recent multi-volume publication costs hundreds of dollars.)

If we think the goal of scholarly research is to produce high-priced monographs of interest only to other specialists, is it really a surprise that the general reading public sees in the British Library a wonderful café? If we think of "digital access" as a way of entertaining or at best informing a wide public, without inviting scholars to build upon the digital foundations of the BL's collections, is it any wonder that visitors to the BL are not drawn to the library's unique resources, but instead spend their time with the amazing hodge podge of entertainment and information that populates the internet?

(Footnote: I was able to include the photographs by Eric Pouhier and Maria Giulia Tolotti, without regard for how many people might view them, because both are available from wikimedia commons under the terms of a cc-by-sa license.)