Vitruvian design for scholarship in the humanities: 2013

Monday, December 30, 2013

The APA privatizes, too

Unlike many of my colleagues and friends in Classics Departments around the US and abroad, I will not be travelling to Chicago this week for the annual meeting of the American Philological Association. The APA continues to accept donations to a recently completed capital campaign with the goal of supporting a digital "Center for Classics Research and Teaching." (See the description here.) The APA claims that its center will "make high quality information about the Classical World available in accessible formats to the largest possible audience by using technology in new and exciting ways," but has never clearly addressed the fact that, as proposed, the center will include material for APA members only.

Like Elsevier and some other distributors, in other words, the APA wants to control who can read scholarly work as part of its "business model." Like Elsevier, the APA leadership is doubtless sincere in its belief that its "business model" is paramount. But like Elsevier, the APA winds up in a Wonderland, where, with Humpty Dumpty, we can make words mean whatever we choose. The idea that closed-access material could be available to "the largest possible audience" is ludicrous. In 2012, over a billion IPv4 addresses were in use, and, while it is difficult to estimate, the number of individual internet users is certainly much higher. It must exceed the APA's membership by at least five orders of magnitude. (That is, the number of internet users is surely at least 100,000 times the number of APA members.)

More simply, like Elsevier's business model, the APA's plan privatizes scholarly work that should be published. In criticizing Elsevier's business practices, I argued that

Scholarly publication in a digital world means that a work is openly accessible for others to inspect, critique, and build upon, and we should insist that in reviews for tenure and promotion, only scholarly publications meeting this definition qualify as published work.

We should hold professional organizations to the same standard.

Unfortunately, just as these essential scholarly values are often ignored in reviews of individuals for tenure and promotion, they are often likewise neglected in evaluation of funding requests from educational institutions, federal programs and private philanthropic organizations. There is no quick or easy way to change these entrenched practices that directly oppose the basic working method of scholarship. But I have the choice not to become a member of (and support with my membership fees) an organization that is building a system of information apartheid.

If you are at the APA this week, try to get a clear answer to a yes/no question: will the APA's digital publications be openly accessible for others to inspect, critique, and build upon?

Thursday, December 19, 2013

Elsevier does not publish: it privatizes

If you were shocked that Elsevier has apparently issued a takedown notice to the University of Calgary, you should consider auditioning for Claude Rains' role in Casablanca. Elsevier has never hidden the fact that its business model depends on restricting access to scholarly work. Alicia Wise of Elsevier responds to the post linked above with this question:

the business model is based largely on paid access post-publication, and if freely accessible on a large scale what library will continue to subscribe?

The question may be sincerely intended, but its logic is straight from Alice in Wonderland: if Elsevier cannot profit by making scholarship publicly available — that is, by publishing it — then it must privatize the information, and sell access only to clients who cede to Elsevier control over who may read the scholarly work.

The intellectual roots of western scholarship reach back to ancient Greece, and the radical idea that scholarly understanding is not determined by political or social power. (This is exemplified in the famous story of Euclid telling his patron and monarch, Ptolemy, that "there is no royal road to geometry.") In our modern academic institutions, publication exposes scholarly work to public scrutiny, and serves in part to ensure that scholarly claims are not based on power over information.

Elsevier and others subvert this fundamental scholarly activity when they privatize scholarship, a simple fact that we obfuscate when, with an Orwellian twist of language, we call it "publication." It is true that scholars who freely hand over their work to privatizers make the system possible, but who can blame an untenured faculty member who will be rewarded for contributing to the dysfunction?

We should instead unambiguously reiterate that scholarly publication in a digital world means that a work is openly accessible for others to inspect, critique, and build upon, and we should insist that in reviews for tenure and promotion, only scholarly publications meeting this definition qualify as published work.

How quickly would Elsevier's pool of submissions dry up if enough universities adopted and enforced such a requirement for real scholarly publication?

Wednesday, December 11, 2013

What humanists do

I recently stumbled across an interview with the very articulate Astronomer Royal Martin Rees that included this observation:

But the aim of science is to unify disparate ideas, so we don't need to remember them all. I mean we don't need to record the fall of every apple, because Newton told us they all fall the same way.

(The full transcript of the interview is here, under the arresting title "Cosmic Origami and What We Don't Know.")

I think that this remark really captures a quintessential difference between the natural sciences and the humanities. Humanists, too, unify disparate ideas, but we must record each unique phenomenon that we study. If we develop a unified view of oral poetry, for example, we will never conclude that "I'm familiar with the Iliad, so I don't have to remember the Odyssey," or "I've studied Greek poetry so I don't need to know about the Serbo-Croatian oral poetry that Parry and Lord recorded." We don't study apples. Recording and remembering are basic to scholarship in the humanities.

This has important implications for how we work in a digital world. We record and remember through citation, so before anything else we must develop a sound infrastructure for citation.

Friday, October 25, 2013

markdown + criticmarkup

It's been a year since I last posted about markdown, and in that time, the number of interesting applications and services has continued to grow rapidly. (If you use markdown and haven't looked at leanpub yet, you owe it to yourself to take a peek!)

I've been looking recently at using criticmarkup together with markdown, and it seems really promising. Editorial on the iPad and both Multimarkdown Composer and Marked2 on OS X support displaying criticmarkup within your editor.

It would be nice to have equally convenient ways to automate accepting or rejecting suggested additions, deletions or changes, but the OS X system services in the criticmarkup site's toolkit did not work correctly when I installed them, so I'm gisting a couple of perl scripts that accept or reject criticmarkup in standard input.

gist to accept criticmarkup
gist to reject criticmarkup

It shouldn't be a big job to turn those into system services. I've never used OS X's Automator, but maybe that would make a good afternoon project...?
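For anyone who wants to see the logic spelled out, here is a minimal sketch in Python (not Perl, and not the contents of the gists) of what accepting or rejecting criticmarkup amounts to. The function names `accept` and `reject` are my own, and the handling of comments and highlights is an assumption: this sketch drops comments and keeps highlighted text in both modes.

```python
import re

# CriticMarkup spans:
#   {++added++}  {--deleted--}  {~~old~>new~~}  {>>comment<<}  {==highlight==}
ADDITION = re.compile(r"\{\+\+(.*?)\+\+\}", re.DOTALL)
DELETION = re.compile(r"\{--(.*?)--\}", re.DOTALL)
SUBSTITUTION = re.compile(r"\{~~(.*?)~>(.*?)~~\}", re.DOTALL)
COMMENT = re.compile(r"\{>>(.*?)<<\}", re.DOTALL)
HIGHLIGHT = re.compile(r"\{==(.*?)==\}", re.DOTALL)


def _strip_annotations(text):
    """Assumed policy: drop comments, keep the text inside highlights."""
    text = COMMENT.sub("", text)
    return HIGHLIGHT.sub(r"\1", text)


def accept(text):
    """Keep additions, drop deletions, take the new side of substitutions."""
    text = ADDITION.sub(r"\1", text)
    text = DELETION.sub("", text)
    text = SUBSTITUTION.sub(r"\2", text)
    return _strip_annotations(text)


def reject(text):
    """Drop additions, keep deletions, take the old side of substitutions."""
    text = ADDITION.sub("", text)
    text = DELETION.sub(r"\1", text)
    text = SUBSTITUTION.sub(r"\1", text)
    return _strip_annotations(text)
```

Wrapping either function around the contents of standard input would give a filter of the same general shape as the scripts described above. The regular expressions do not attempt to handle nested or malformed markup.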

Saturday, September 28, 2013

The instrumentalist fallacy and academic publication

It is easy to confuse a tool with the task it is supposed to accomplish. This does not necessarily cause problems when the tool and task are perfectly matched, but easily leads to misapplication of the tool. A bicycle is a wonderful means of transportation, but should not be used to travel across bodies of water, for example. I call this confusion "the instrumentalist fallacy," and I deal with it daily in my digital scholarly work.

The academic review process has institutionalized the instrumentalist fallacy in some especially harmful ways. I read this account of a group of mathematicians who used a GitHub repository to coauthor a book: anyone can clone their source, and offer improvements for the authors to pull into a subsequent version. One of the principal authors has an enlightening post about the process here.

Note the contrast that both the Wired interview and the author's blog post point out: the authors chose an open collaborative process because it resulted in better scholarship, but understood that they would receive less professional recognition or credit for it.

The instrumentalist fallacy equates the instrument (the traditional publishing process) with its goal: vetting the quality of scholarly work. Is it too radical to suggest that the way to assess the quality of a scholarly publication might be ... to read it?


Monday, August 19, 2013

What scholarship looks like

The Leipzig "Open Philology" workshop reinforced a fact that I (re)learn constantly from my work advising Holy Cross' "Manuscripts, Inscriptions and Documents Club": that the most important changes brought about to scholarship by new technology are not technological, but intellectual and social.

It's not easy for someone of my generation to imagine how significant research in Classics can be collaborative, and can engage people of a wide range of ages (even people without university-level degrees, something my training conditions me to view as a heresy), but there's no mistaking it when you get to watch it happen. In Leipzig, the best example was "Team Croatia": five participants from Zagreb, led by their gifted teacher and scholar, Neven Jovanović (far right in the photo below).

A mediocre cell-phone snap shows what this kind of activity can look like: two computers, but one temporarily ignored as three pairs of eyes focus intently on the same screen. A single pair of hands is not enough to capture the action in real time: if this were a piano composition for four hands, this movement would be marked "presto".

If we're going to lay a digital foundation for classical studies, this is the kind of team that will make it happen.

Update: thanks to Neven for helping me correctly spell the names of Team Croatia: Juraj Ozmec, Željka Salopek, Jan Šipoš and Anamarija Žugić. (Pictured above with Neven: Anamarija Žugić and Juraj Ozmec).

Milk and honey in Leipzig

I took part this month in the Leipzig "Open Philology Workshop" organized by Greg Crane. While I was only able to participate in some of the changing three-ring circus of events, I got a view onto the promised land. Out of the many highlights of the workshop, here are three that are individually significant, and, taken together, will have enormous consequences for classicists.

1. A billion words of Greek

I worked with a large team planning to digitize a billion words of Greek. Thanks in no small part to work by Bruce Robertson and Federico Boschetti improving OCR of polytonic Greek, we designed a detailed workflow automating many of the steps in moving from a physical volume in a library to an openly licensed, citable, digital edition.

We live in a very different world than just a few years ago. When the costs of digitization were extremely high, both private interests (like publishers) and academic projects (even projects with the sponsorship of professional organizations and funding from national agencies) successfully persuaded individuals and libraries to give up their scholarly freedom (along with, of course, exorbitant licensing fees) for access to proprietary data banks of texts. Without the same barriers of cost, we can now insist instead on digital corpora comprising the kinds of texts we should always have demanded: structured for scholarly citation, and licensed for scholarly reuse. At this point, whether the Billion Words project literally achieves its goal of digitizing 10^9 words of Greek over the next five years is immaterial: when the first digital edition comes out of that pipeline, we can begin to put behind us the historically brief but shameful aberration when we thought it was acceptable to trade away our freedom to read and share classical texts in exchange for more convenient access to ancient Greek for a privileged few.

2. Perseus lexical inventory and morphology services

Bridget Almas and Marie-Claire Beaulieu are extending the Perseus lexical inventory and morphological services to keep each in sync with the other as they are dynamically edited.

This is exceptionally important, and indeed urgent, precisely because of the Billion Words project. As the contents of its new digital editions can be automatically tested, we will be able to extend the lexicon when unattested material appears, and improve the morphological analyzer when it fails to recognize valid forms. Not only will the Billion Words project improve the lexical inventory and morphological analyzer: repeating automated testing of the Billion Words corpus with the iteratively updated inventory and analyzer will allow the Billion Words project to state with unprecedented clarity what levels of validation each work in its corpus has passed.

3. A text citation tool

I was caught completely by surprise by Hugh Cayless' work on a javascript tool letting users select arbitrary pieces of (or even points in) a TEI document displayed in a web browser. While the CTS URN notation can easily express such arbitrary ranges of text, the challenges in building an interface highlighting spans of text that can cross multiple XML element boundaries and that might start and end in elements that do not constitute well-formed XML are so difficult that I would have said it was impossible to implement practically for real, complex texts.

Characteristically, Hugh showed a working implementation that was visually appealing, very responsive, and worked flawlessly on exceptionally complex passages from Servius' commentary on the Aeneid. So much for my scepticism. Equally characteristically, while Hugh's initial use case was a very limited application, he recognized the generality of the problem he had solved, and plans to fork the citation tool as a separate project that can express selections as CTS URNs. Chris Blackwell and I look forward to packaging Hugh's TEI Text Citation Tool along with Chris' Image Citation Tool as part of the standard suite of CITE services and utilities that we work with on the Homer Multitext project.

A whole greater than the sum of the parts

Bruce Robertson, Bridget Almas, and Hugh Cayless have long track records as three of the most talented contributors to the digital study of classics I have ever seen, so I suppose it is unsurprising that they would each, yet again, contribute something remarkable. What was different in Leipzig in August, 2013, was the synergy that their work illustrates. The internet can facilitate many kinds of collaboration, but nothing can fully replicate what happens when people sit in the same room, talk over coffee or dinner, and have unscheduled opportunities to follow up easily in further face to face conversations. While each of the three highlights I've chosen here deserves more discussion in future posts, consider their connections to each other: we can see the real beginnings of a vast digital corpus of Greek; the corpus is being automatically tested, and related to a citation-based inventory of Greek vocabulary, and to a morphological analyzer that can relate surface forms in the texts to lexical entities in the inventory; the moment the digital edition appears, a UI that runs in any web browser will let users cite any part of the corpus with technology-independent canonical citations.

Is there another discipline in the humanities that offers this kind of digital foundation in 2013? Perhaps, but I am not familiar with anything rivaling what I saw happening in Leipzig.

Sunday, May 26, 2013

What's wrong with wikipedia

The reason wikipedia, for all its usefulness, is absolutely wrong for scholarship in the humanities is not the fact that it's crowd-sourced. Contrary to what some people imagine, the problem is not the lack of a recognized editorial authority: to the contrary, the problem with wikipedia is precisely that its explicit editorial policy gets the authority of evidence in the humanities wrong.

I can't say it more succinctly than wikipedia itself does. I took the following screen grab today from the wikipedia article on "RDF Schema." If it's hard to read, here's a larger version. The text reads, "This article relies on references to primary sources. Please add references to secondary or tertiary sources."

This is not just slightly misdirected: it is 180 degrees off target. There is no way to misunderstand more completely the logic of an argument using evidence in the humanities.

Sunday, May 5, 2013

Reading the Iliad in Worcester

Friday was the next-to-last day of classes at Holy Cross. Driving home, I was thinking about how to respond in Monday's final meeting to some of the questions students in my intermediate Greek class have been raising. We have been reading the Iliad, most recently book 22. Perhaps they were conditioned to expect a simpler, Hollywood narrative, but many students were finding the complexity and ambiguity of the Iliad both more powerful and more challenging than they had expected. Several were troubled that when Achilles tells Hector, "Don't talk to me of 'agreements': lions and men don't make treaties; wolves and sheep don't have understandings" (22.261-22.263), he suggests that he and his hated enemy belong to different species. There is no possibility of human relation between the two of us, Achilles says, and the end will be bloodshed (22.264-22.267). But which of the two heroes does Achilles' simile really dehumanize?

When I crested the hill on Hammond Street, I was, unexpectedly, stuck in traffic. Main Street was completely blocked off, and a police detail was directing single lines of cars through the resulting jam. I didn't see any smoke, so I assumed it wasn't a fire, but it was obvious from the flashing blue lights and the line of TV "live-coverage vans" with their extended satellite dishes that something out of the ordinary had happened.

I only found out after I finally got home that the blockade was due to protestors outside Graham Putnam and Mahoney Funeral Parlors, the funeral home that has taken in the body of Tamerlan Tsarnaev. (For a brief profile of Peter Stefan, the remarkable director of Graham Putnam and Mahoney, see this column from the Worcester Telegram and Gazette.)

The angry crowd was protesting the idea of burying a mass killer.

So on Monday, we'll think about why the poem's final resolution arrives not in book 22 with the slaying of Hector, or in book 23 with the funeral games honoring Patroclus, but in 24.804:

ὣς οἵ γ᾽ ἀμφίεπον τάφον Ἕκτορος ἱπποδάμοιο.

So they saw to the burial of Hector, tamer of horses.

Wednesday, April 17, 2013

GUT

"Grand Unification Theory" may be a touch grandiose, but the underlying libraries used in the Homer Multitext project now generate RDF statements that fully express all three types of CITE-architecture information: textual archives, archives of data collections, and indices relating citable objects to other citable objects or to raw data. There will be lots of interesting connections to explore in the resulting unified graph of scholarly material.

In parallel with this, I've now implemented the CTS protocol, the CITE Collections Service protocol, and its extension with the CHS Image protocol in servlets drawing on a SPARQL endpoint, so creating a complete CITE environment can be reduced to:

- build all RDF (automatically), and import into a triple store
- drop the three servlets for CITE services into a servlet container
- install the iipsrv fastcgi for working with binary image data. This is the most troublesome step on many platforms, but happily iipsrv is now available as a package under debian.

Not bad. Chris Blackwell is preparing an image for the sub-$50 Raspberry Pi with these requirements preinstalled: a complete CITE Box roughly the size of an Altoids container.

As we review the schemas used in the services this month, we'll begin looking at defining a more permanent RDF vocabulary. I'm not sure at this point if we need to break out a generic CITE vocabulary distinct from a specific HMT vocabulary, or whether one ontology will suffice. We'll be looking at other projects' work: thanks to Joel Kalvesmaki for pointing to the useful list here.

Sunday, April 14, 2013

CITE Collection Inventory

In parallel with Friday's update to the schema for CTS text inventories, CITE Collection inventories now include an optional urn attribute on the schema for Collections. Bump your build system's dependency for the cite library up to 0.12.2 to include this change.

As with the CTS TextInventory, we plan to make the Collection inventory's urn attribute mandatory in 0.13, and will drop the parallel name attribute in 0.14.

Friday, April 12, 2013

Updating the CTS TextInventory schema

Scott Mcphee points out the absurdity of a Canonical Text Services (CTS) definition that uses CTS URNs for all retrieval requests, but doesn't include CTS URNs in the service's TextInventory. The historical explanation for the inconsistency is embarrassingly simple: the TextInventory schema predates the invention of CTS URNs, and has not been revisited since! That oversight is rectified with today's release of version 0.12.1 of the CITE schemas package.

Ultimately, we want to arrive at catalog entries with urn attributes that look like this:

<textgroup urn="urn:cts:greekLit:tlg0012">
  <groupname xml:lang="eng">Homeric poetry</groupname>
  <work urn="urn:cts:greekLit:tlg0012.tlg001" xml:lang="grc">
    <title xml:lang="eng">Iliad</title>
    <edition urn="urn:cts:greekLit:tlg0012.tlg001">
      <label xml:lang="eng">Allen (OCT 1931)</label>
    </edition>
  </work>
</textgroup>

With release 0.12.1, the urn attribute is now optional but strongly recommended, alongside the previous projid attribute. With release 0.13.0, the urn attribute will be required, and the projid attribute deprecated. With release 0.14.0, the projid attribute will be dropped.

So grab cite-0.12.1-schemas.zip from our Nexus repository to get started with a modern TextInventory identifying texts by URN. You can manually download a zip bundle from the repository, or update your Maven coordinates with groupId "edu.harvard.chs", artifactId "cite" and version "0.12.1".
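Before release 0.13.0 makes the urn attribute mandatory, it may be useful to check an existing inventory for entries that still lack one. The following is a rough sketch using Python's standard library, not part of the cite package. The function name `missing_urns`, the sample projid value, and the set of element names it inspects (including `translation`) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample fragment modeled on the catalog entry shown earlier in this post.
# The projid value on the edition is invented for the example.
SAMPLE = """
<textgroup urn="urn:cts:greekLit:tlg0012">
  <groupname xml:lang="eng">Homeric poetry</groupname>
  <work urn="urn:cts:greekLit:tlg0012.tlg001" xml:lang="grc">
    <title xml:lang="eng">Iliad</title>
    <edition projid="greekLit:example">
      <label xml:lang="eng">Allen (OCT 1931)</label>
    </edition>
  </work>
</textgroup>
"""

# Elements assumed to carry a urn attribute.
CATALOGED = {"textgroup", "work", "edition", "translation"}


def missing_urns(xml_text):
    """Return (element name, projid) pairs for entries with no urn."""
    root = ET.fromstring(xml_text.strip())
    missing = []
    for element in root.iter():
        # Drop any namespace prefix of the form {uri}name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in CATALOGED and "urn" not in element.attrib:
            missing.append((local_name, element.get("projid")))
    return missing
```

Calling `missing_urns(SAMPLE)` reports the single edition that is still identified only by projid. An empty list suggests the inventory is ready for the stricter schema, at least as far as this simple check can tell.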

[Updated: bumped version from 0.12.0 to 0.12.1 after adding trailing slash to dc namespace as requested by Bridget Almas]

Thursday, April 11, 2013

How hard is it to imagine "popular scholarship"?

I heard an interesting talk yesterday at Clark University by Robert Anderson, former director of the British Museum, on "The British Museum and Library at the New Millennium": wonderful anecdotes from the early history of the museum, and a compelling argument for the essential intellectual unity of what museums and libraries do.

The British Museum Great Court.
Photograph by Eric Pouhier,
licensed under cc-by-sa license.

Two details troubled me. First, while the rare book library at Clark was filled, I saw only one student, and I probably fell well below the median age of the audience. The talk was sponsored by the "Friends of the Goddard Library," but if this audience was representative, the library won't have too many friends in a few more years.

Second, both Anderson's talk and some of the discussion afterward made some curious assumptions about scholarship. As the director at the time of the separation of the British Library from the Museum, and the opening of the fabulous facility at the new Euston Road location, Anderson offered insightful comments on the tensions of an institution committed both to free public access and to serving the needs of specialist scholars. He brought up a problem familiar to anyone who has worked at the BL recently: it's such a popular place, that all the desks fill up early in the morning with students looking for a comfortable place to work (with free wifi and good coffee!), but who aren't necessarily taking advantage of any of the unique offerings of the British Library. This can impose a real hardship on people working on projects that depend on BL material. Two assumptions emerged in the discussion that struck me as odd: that the results of scholarly research would only be of interest to a small circle of specialists; and that digital material should be openly viewable, but scholarly research was being well served by a policy that allows free reuse of scholarly material only in print publications with a very limited print run.

Interior of the British Library.
Photograph by Maria Giulia Tolotti
licensed under cc-by-sa license.

Let's parse that logic a little more closely: scholarly reuse of BL material is OK as long as not too many people care to read it; and that's fine, because scholars' research is only of interest to a handful of other specialists, and expensive print media are an adequate way to meet this need. (The host's introduction of Anderson referred light-heartedly, in what was evidently intended to be humor, to the fact that his most recent multi-volume publication costs hundreds of dollars.)

If we think the goal of scholarly research is to produce high-priced monographs of interest only to other specialists, is it really a surprise that the general reading public sees in the British Library a wonderful café? If we think of "digital access" as a way of entertaining or at best informing a wide public, without inviting scholars to build upon the digital foundations of the BL's collections, is it any wonder that visitors to the BL are not drawn to the library's unique resources, but instead spend their time with the amazing hodge podge of entertainment and information that populates the internet?

(Footnote: I was able to include the photographs by Eric Pouhier and Maria Giulia Tolotti, without regard for how many people might view them, because both are available from wikimedia commons under the terms of a cc-by-sa license.)

Sunday, March 17, 2013

CTS is complete under OHCO2

My preceding post promised to compare experiences implementing the Canonical Text Services protocol with three equivalent data structures for text: trees (formatted in XML), tables, and graphs (expressed in RDF). Before turning to the first of these data structures, however, I should expand briefly on the comment in that post that, in developing the CTS protocol, "we relied heavily on the OHCO2 model." More precisely, I mean that we developed CTS so that it fully expresses the semantics of OHCO2: hence the title of the present post.

The CTS protocol uses CTS URNs to cite passages of texts. The semantics of CTS URNs by themselves give us two of the four OHCO2 properties, since a CTS URN specifies where in a citation hierarchy a passage of text is situated, and where in a hierarchy of versions a particular version is situated. A URN like urn:cts:greekLit:tlg0012.tlg001.msA:9.119 for example, refers to a passage set in a version of the Iliad (the work tlg0012.tlg001) identified as msA (i.e., the Venetus A manuscript), and refers to a citable line (119) contained within a citable book (9).
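To make the two hierarchies concrete, here is a minimal sketch of splitting a CTS URN into its parts. It is an illustration only, not code from any CTS implementation: the names `CtsUrn` and `parse_cts_urn` are hypothetical, and the sketch handles only a single passage reference with a work hierarchy of at most three levels (text group, work, version), as in the example above.

```python
from typing import NamedTuple, Optional, Tuple


class CtsUrn(NamedTuple):
    namespace: str
    textgroup: str
    work: Optional[str]
    version: Optional[str]
    passage: Tuple[str, ...]  # citation hierarchy, outermost level first


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into its version hierarchy and citation hierarchy."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_component = parts[2], parts[3]
    passage_component = parts[4] if len(parts) == 5 else ""

    # Version hierarchy: textgroup[.work[.version]]
    levels = work_component.split(".")
    if not 1 <= len(levels) <= 3 or not all(levels):
        raise ValueError(f"unsupported work component: {work_component!r}")
    textgroup = levels[0]
    work = levels[1] if len(levels) > 1 else None
    version = levels[2] if len(levels) > 2 else None

    # Citation hierarchy: e.g. book.line
    passage = tuple(passage_component.split(".")) if passage_component else ()
    return CtsUrn(namespace, textgroup, work, version, passage)
```

Applied to urn:cts:greekLit:tlg0012.tlg001.msA:9.119, the version hierarchy comes back as tlg0012, tlg001, msA, and the citation hierarchy as the pair ("9", "119"): book 9, line 119.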

The remaining two OHCO2 properties are provided by a pair of CTS requests. The GetPrevNext request places a passage within an ordered sequence; the GetPassage request returning the contents of the passage supports a mixed content model.

After some initial experience developing applications built on CTS, Chris Blackwell suggested that it would be convenient for developers to have both GetPrevNext and GetPassage information available via a single request. We introduced the CTS GetPassagePlus request for just this purpose. His intuition is now gratifyingly justified by the observation that the GetPassagePlus request tells us everything about a cited passage of text that the OHCO2 model guarantees.
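The relationship among the three requests can be sketched with a toy in-memory model. This is not the CTS protocol itself: the function names, the dictionary reply, and the placeholder text are all assumptions made for illustration. One version of a text is modeled as an ordered list of citable nodes.

```python
from typing import Dict, List, Optional, Tuple

# One version of a text: ordered (passage reference, content) pairs.
# The content strings are placeholders, not quotations from the Iliad.
VERSION: List[Tuple[str, str]] = [
    ("9.118", "text of line 118"),
    ("9.119", "text of line 119"),
    ("9.120", "text of line 120"),
]


def _position(nodes: List[Tuple[str, str]], ref: str) -> int:
    """Index of a citable node in the ordered sequence."""
    for index, (node_ref, _) in enumerate(nodes):
        if node_ref == ref:
            return index
    raise KeyError(ref)


def get_passage(nodes: List[Tuple[str, str]], ref: str) -> str:
    """Contents of the cited node."""
    return nodes[_position(nodes, ref)][1]


def get_prev_next(
    nodes: List[Tuple[str, str]], ref: str
) -> Tuple[Optional[str], Optional[str]]:
    """References of the neighboring nodes, or None at either end."""
    index = _position(nodes, ref)
    prev_ref = nodes[index - 1][0] if index > 0 else None
    next_ref = nodes[index + 1][0] if index < len(nodes) - 1 else None
    return prev_ref, next_ref


def get_passage_plus(
    nodes: List[Tuple[str, str]], ref: str
) -> Dict[str, Optional[str]]:
    """Both kinds of information in a single reply."""
    prev_ref, next_ref = get_prev_next(nodes, ref)
    return {
        "ref": ref,
        "text": get_passage(nodes, ref),
        "prev": prev_ref,
        "next": next_ref,
    }
```

In this toy model the reference supplies the place in the citation hierarchy, the list stands for a single version, and the combined reply adds ordering and content: the same four properties that OHCO2 guarantees.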

Sunday, March 10, 2013

Data structures for texts

My best scholarship that no one has ever read is probably the work I did with Gabe Weaver on the structure of citable texts. (I sense potential for a dinner-party game similar to “Humiliation” in David Lodge’s novel Changing Places…)

We proposed a model of citable text as an ordered hierarchy of citation objects (the “OHCO2” model). In OHCO2, every citable node has four defining properties:

- every node belongs to a citation hierarchy
- every node belongs to a FRBR-like version hierarchy
- nodes belonging to the same version are ordered
- nodes support a mixed content model

Two representations of a text that preserve these properties for every citable node are considered equivalent under OHCO2.

As I worked with Gabe, Chris Blackwell and others on both the Canonical Text Services protocol (CTS) and the CTS URN notation, we relied heavily on the OHCO2 model. I have recently completed a new implementation of the CTS protocol: the third of three implementations I have written using three different technologies for working with three completely different representations of text. Since all of the representations are OHCO2 equivalent, we know that they preserve the semantics of citable text, and we can consider other criteria to compare the advantages and disadvantages of these formats for specific purposes. In a following series of posts, I want to highlight some of the pluses and minuses of the following OHCO2-equivalent formats for representing citable texts:

- XML
- tabular structures
- RDF triples

I’ll tag this series with the label "text data structures".

Wednesday, February 27, 2013

The maturity of a discipline

If a scholarly community cannot identify what material it studies, it has not yet matured to a point where digital technology matters very much: scholarly discussion requires being able to cite evidence. I worked on the CITE architecture for scholarly reference in part because I come from a background in Classics where, by and large, we have a reasonable tradition of citing works by logical, canonical reference schemes. (Of course there are exceptions, like that little corpus of Plato that we continue to cite, bizarrely, by physical pages in the sixteenth century edition of Stephanus...)

The suggestion that scholars need to be able to identify and cite their evidence seems to me a pretty minimal measure of the maturity of a discipline, but if I hold up classicists' conventions as a positive example of canonical citation conventions, people occasionally misunderstand this as an elitist attack on their field of study. No: I want to apply the same standard to subjects I work on.

For example, in the study of ancient science, we have not advanced much beyond the situation described almost 40 years ago by Neugebauer (never one to sugarcoat his judgment of the state of scholarship) as follows:

For classical antiquity and the Middle Ages no systematic collection of mathematical or astronomical treatises exists. No attempt has ever been made to compile basic collections comparable to the Loeb Classical Library or the Budé Collection, or Migne's Patrologia, the Monumenta Germaniae Historica, the Bonn Corpus of Byzantine historians, etc. ... This fact alone suffices to show that the so-called 'History of Science' is still operating on an exceedingly primitive level.

A History of Ancient Mathematical Astronomy, 1975, p 15

Can we leap straight to a digital corpus for ancient science, like a third-world country bypassing costly and slow expansion of landlines and immediately delivering phone service through cellular networks?
