Monday, October 11, 2010

Thinking about citation

In digital scholarship, citation is the most fundamental form of ontology. Identifying an entity in a recognized form of citation allows scholars to agree that the object exists, while leaving room to disagree over how to represent it. Once we agree that there is an object I call "the Parthenon in Athens," and can unambiguously cite that object, we allow software to recognize that the same object might be represented in a GIS with spatial data, in a photo gallery with a collection of photographs, or in an architectural database with structured fields of textual data.

For digital scholars, it's hard to talk or even think about scholarly reference systems without getting bogged down in technology-specific tar pits. Take the W3C's Resource Description Framework as an example. Its triple model is brilliantly simple and powerful: whether you think of it as "subject-verb-object" or "object-directed link-object," it is general, extensible, and lends itself readily to both abstract graph models and real machine implementations. Its syntax is expressed in terms of URIs, which could equally well be URLs (real addresses in the "http" scheme) or URNs (abstract names in the "urn" scheme). (See the W3C's document "URI Clarification" or this discussion, "Untangle URIs, URLs, and URNs," to sort out the acronyms.) But when was the last time you saw a scholarly project using RDF to describe relations among objects identified by abstract name rather than by address? "http" is a good URI scheme for an application retrieving material on the internet, but in an RDF graph describing relations among persistent and immutable objects it is a poor choice; the "urn" scheme is more appropriate there.
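
As a minimal sketch of what that might look like (in Python with the rdflib library, using a made-up "urn:example" identifier and an illustrative Dublin Core title rather than anything from a real project), a triple can name its subject with a "urn" URI just as easily as with an "http" one:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS

    g = Graph()

    # Hypothetical URN naming an abstract, persistent object rather than a web address.
    parthenon = URIRef("urn:example:monuments:parthenon")

    # The predicate and literal here are illustrative only.
    g.add((parthenon, DCTERMS.title, Literal("The Parthenon in Athens")))

    # The graph describes the object without saying anything about how to retrieve it;
    # a GIS, a photo gallery, or an architectural database can each resolve the URN
    # to its own representation of the same object.
    print(g.serialize(format="turtle"))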

For canonically citable texts, work at the Center for Hellenic Studies has led to:

  1. identifying abstract properties of canonical texts
  2. developing a human readable and machine actionable notation for citation that expresses these properties (the CTS URN notation)
  3. defining a service for identifying and retrieving texts identified by this notation (the Canonical Text Services protocol)

I've written a bit about this in a dry article on "Digital infrastructure and the Homer Multitext Project," but here I would just note that this three-tiered hierarchy has been very useful both in trying to think about citation outside of any particular technological context, and in defining machine-actionable tests to evaluate implementations of text services. Only level [3] deals with specific technologies on the internet. If you don't want to interact with a Canonical Text Service, the CTS URN notation [2] can still be used by any application referring to canonically cited texts. If you don't like the CTS URN notation, you can still evaluate any alternative notation by seeing whether it implements the abstract properties of [1]. And if you are dissatisfied with the identification of those properties, you can start from scratch and redefine what you think a canonically citable text really is.
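
To make the tiers concrete: a CTS URN such as urn:cts:greekLit:tlg0012.tlg001:1.1 (Iliad, book 1, line 1) expresses the abstract properties of [1] in the notation of [2], and an application can work with it without ever contacting a service at level [3]. Here is a rough sketch of such parsing in Python; the field labels are my own, and the real notation has more structure (versions, exemplars, passage ranges) than this toy handles:

    # Toy parser for the colon-delimited top-level parts of a CTS URN.
    def parse_cts_urn(urn):
        parts = urn.split(":")
        if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cts":
            raise ValueError("not a CTS URN: " + urn)
        return {
            "namespace": parts[2],                            # e.g. "greekLit"
            "work": parts[3],                                 # e.g. "tlg0012.tlg001" (Homer, Iliad)
            "passage": parts[4] if len(parts) > 4 else None,  # e.g. "1.1" (book 1, line 1)
        }

    print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1"))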

I think this graded distinction of abstract property, reference notation, and application could be equally valuable in citing other kinds of material. In a series of follow-up posts, I'll look at each level in turn to analyze how we might cite uniquely identified objects.

Saturday, August 28, 2010

Is 2010 "the year of open data" in Classics?

Tim Berners-Lee, the creator of the World Wide Web, has called for "raw data now"; in a TED talk this spring, he showed examples of what can happen when people have access to openly licensed and freely reusable data sets.


The American Philological Association thinks the internet is a gated community. The lead story on the APA's website is the continuing effort to raise funds for a "portal" that will help members find resources available only to subscribers.

Compare Berners-Lee's talk (freely licensed, so I can legally embed it in this blog post) with the APA's video presentation of its campaign (available from the APA website in either QuickTime or Windows Media format). Which vision of sharing scientific and scholarly data do you see as the future of Classics?


Friday, August 27, 2010

What's the difference?

How do we compare two texts? For line-by-line comparison of electronic files, Unix systems have had a diff command since time immemorial. (A version had already been around long enough for Hunt and McIlroy to write an article about it in 1976, before I had ever laid eyes on a computer, and before any of my students were born.) For XML documents, XMLUnit is a Java library that can describe differences not only in the text nodes of two documents, but in their XML structure as well.

Both diff and XMLUnit describe differences between electronic documents in specific formats, and report on those differences in terms of that format (line-by-line structure from diff; XML structure from XMLUnit). Neither solution is adequate for the Homer Multitext project. Its collection of texts is not defined in terms of specific document formats: instead, texts are managed and manipulated by the more abstract references of canonical citation (e.g., Homeric papyri, and texts from Byzantine manuscripts). How would we compare two passages of text identified by canonical citations?

Traditional Homeric scholarship suggests an approach. Homerists refer to "vertical" vs. "horizontal" variation across versions of the Iliad. "Vertical variation" refers to entire lines that are present in one version but not another; "horizontal variation" refers to differences across versions in "the same" line (that is, a line that is canonically cited by the same reference). This can be generalized to any canonically citable text, if we think of vertical variation as differences not just in Iliadic lines, but in the citation units of any text. (By this way of thinking, manuscripts of the Gospel of John that include the longer ending would show "vertical" variation compared to manuscripts that stop after the shorter ending.) Horizontal variation would then be a description of the variation within a single citation unit (whether that is a line of the Iliad or a verse of John).

We could then reduce vertical difference to an operation on two ordered lists of citation values. Analogously, if we can tokenize the textual content of a citable passage, we could treat horizontal difference as an operation on two ordered lists of tokens. (Tokenizing texts has its own challenges, but I'll blog on that topic some other time.)
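
As a minimal sketch of that reduction (in Python, using the standard library's difflib as a stand-in for a real collation class, and invented toy data rather than actual manuscript readings), both kinds of variation become the same sequence-comparison operation:

    import difflib

    # Invented toy data: two "editions" as ordered lists of (citation, text) pairs.
    edition_a = [("1.1", "sing goddess the anger"), ("1.2", "of achilles")]
    edition_b = [("1.1", "sing o goddess the wrath"), ("1.3", "son of peleus")]

    # Vertical variation: compare the ordered lists of citation values.
    citations_a = [citation for citation, _ in edition_a]
    citations_b = [citation for citation, _ in edition_b]
    vertical = difflib.SequenceMatcher(None, citations_a, citations_b)
    print(vertical.get_opcodes())

    # Horizontal variation: for a citation present in both, compare the token lists.
    tokens_a = dict(edition_a)["1.1"].split()
    tokens_b = dict(edition_b)["1.1"].split()
    horizontal = difflib.SequenceMatcher(None, tokens_a, tokens_b)
    print(horizontal.get_opcodes())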

This is a happy result. First, we can acquire the data for a comparison with existing requests in the CTS protocol, either to determine citation values or to retrieve the textual content of a passage. Second, we can apply a single algorithm to both vertical/structural variation and horizontal/intra-node variation. Third, comparing two ordered sequences of tokens is one of the best-studied problems in computer science (since applications like comparing DNA sequences have attracted a lot more attention than differences in Homeric manuscripts). I googled up this site and had a basic collation class running in a few minutes (though a good programmer would manage it far more easily than my stumbling efforts).

This raises a far more interesting question: what exactly do we want to know about the vertical and horizontal differences between two passages? Some of the things we can determine (sketched in code after the list) include:
  • what tokens are unique to passage A?
  • what tokens are unique to passage B?
  • what is the longest common sequence of the two lists?
  • what is the complete ordered union of the two lists, and the status of each token in that union (A only, B only, in both, in both but in a different position)?
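
Here is a rough sketch of how those four questions could be answered for two token lists, again with Python's difflib and invented tokens. Note that difflib's matching blocks give a common in-order subsequence rather than a guaranteed longest common subsequence, and a token that appears in both lists but in different positions simply shows up as separate "A only" and "B only" entries:

    import difflib

    # Invented token lists standing in for the tokenized content of two cited passages.
    passage_a = "menin aeide thea peleiadeo achileos".split()
    passage_b = "menin aeide thea oulomenen achileos".split()

    matcher = difflib.SequenceMatcher(None, passage_a, passage_b)

    # Tokens appearing only in A, or only in B.
    only_a = [t for t in passage_a if t not in passage_b]
    only_b = [t for t in passage_b if t not in passage_a]

    # A common in-order sequence, assembled from the matching blocks.
    common = []
    for block in matcher.get_matching_blocks():
        common.extend(passage_a[block.a:block.a + block.size])

    # Ordered union of the two lists, with the status of each token.
    union = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            union.extend((t, "both") for t in passage_a[i1:i2])
        else:
            union.extend((t, "A only") for t in passage_a[i1:i2])
            union.extend((t, "B only") for t in passage_b[j1:j2])

    print(only_a, only_b, common, union, sep="\n")
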
I've blogged a very minimal example comparing two manuscripts of the Iliad. There's lots more to do.



Tuesday, August 24, 2010

Who does Classics? Where?

David Bamman's presentation was only one of several high points at last week's meeting at Tufts University on "Greek, Latin, Arabic." We heard from the Alpheios project about recent development of their language learning tools. I'm thrilled to be using Alpheios this fall both as a teacher of intermediate Latin and as a student of first-semester Arabic, but what continues to impress me most about the project is the thoughtfulness of its architecture. The lexica (such as Liddell-Scott-Jones for Greek and Lewis-Short for Latin) and linguistic information (very comprehensive morphological analyses, and, for some sets of texts, syntactic treebanks of the kind David Bamman's research uses) are cleanly organized as services that are accessible over the internet. If there is another project in the digital humanities that has grasped this fundamental architectural principle as clearly as the Alpheios project has, I'm not familiar with it.

Also in attendance was Google's Will Brockman, who was able to comment on the recent public release of scans of over 500 Greek and Latin texts. (Six copies from three different editions of Pomponius Mela! Can you do that in your home library?)

A dynamically constructed lexicon; network services exposing Greek and Latin lexical and linguistic information to the internet; a corpus of freely available texts — individually, these are major contributions to the study of Classics. Collectively, they really do lay the foundations for a radically altered discipline — and they exist today. If I weren't constantly hearing from fellow classicists that our discipline is in crisis, I would think that there has never been a better time to study Greek and Latin.

Oh, and who does this work? David Bamman is a senior researcher at the Perseus project, not a member of an academic department. The Alpheios project is independent of any academic institutional affiliation. Google — well, you've heard of Google.

I was reminded of Brockman's blog post when he first announced Google's free release of Greek and Latin texts in June. He cites three examples of studying ancient texts that excite him: reading a Latin text in the Perseus project's interactive edition; reading an article about Sophocles in English from the Suda On Line; and consulting the high-resolution photography of the Venetus A manuscript from the Homer Multitext project. His selection caught my eye, because I've been involved in all three projects, and know some of the back stories. None of the junior members of the original Perseus project were tenured at their original home institutions: all moved to other jobs, or left the field altogether. When an external review committee visited the University of Kentucky in the 1990s, after an extensive presentation about the Stoa prominently including the Suda On Line, a classicist asked the late Ross Scaife, "In what way does any of this constitute scholarship?" (A curious question about the first effort ever to translate into any language the rich and complex text of the Suda.) The Homer Multitext project has not faced such overt hostility, but it is interesting to note that it originates from the Center for Hellenic Studies (a branch of Harvard University in Washington, D.C., independent of the Department of the Classics), and that the two editors and two project architects all hold academic positions at institutions that do not grant PhDs in Classics. (Speaking for myself, I couldn't be happier about that.)

Connect the dots however you like. I draw two conclusions: first, that the study of classics is far too important to leave to classicists; and second, that the study of Greek and Latin is still exciting enough to attract brilliant contributions from committed scholars who are not shackled with a title like "Professor of Classics." In 2010, I'm starting to envy my students, and wish I had a few more decades to continue this work.

Monday, August 23, 2010

Greek, Latin, Arabic

Last week, I attended a meeting on "Greek, Latin, Arabic" at Tufts University. Some of the most stimulating discussion focused on how it's possible to exploit more or less limited corpora of structured digital texts to find valuable information in a larger, less structured morass (think: the internet). Lots of interesting research worth blogging about (although you won't be likely to hear about any of it if you go to an APA convention), but I want to comment briefly on David Bamman's presentation, because I think his work is as significant as any research I've seen in classics in the past 30 years.

For some time, Bamman has been pursuing interesting work in two distinct areas: (1) automatic alignment of texts in different languages, and (2) using dependency treebanks to represent the syntax of Greek and Latin. I've followed his progress for a couple of years, but last week was the first occasion when I've begun to realize how he can weave these two strands of work together. (That's only a comment on my own obtuseness.) If you want to jump over the methods, and go straight to an astonishing result, follow this link.

It's a dynamically induced lexicon. Stop, and reread that sentence. It's a dynamically induced lexicon.

I've read this (preprint of an) article about projecting markup across translations a couple of times. That's not really enough, and I'll probably reread it tomorrow, but let me reduce one result to this summary: at the level of individual words, Bamman achieves about a 70% success rate aligning roughly five million words of Greek with seven million words of English. That in itself is fairly astonishing, but Bamman also leverages his work building treebanks to model the syntax of Greek and Latin texts. Crossing the syntactic models from his treebanking with information gleaned by aligning versions of texts in different languages, Bamman builds a dynamic lexicon that can take a Greek term and trace how translations in a language like English render that term in different sets of texts, including recognizing the syntactic constructions in which the term appears; or, conversely, he can take an English term, give the most closely corresponding Greek terms, and from there again lead you through the history of the Greek term, as glossed or explicated (automatically) by its English translations.

In his presentation, as in the written publications I have seen to date, Bamman's work simultaneously shows the general implications of his research for computational linguistics, and how the Latin and Greek case studies he has chosen are distinctive. That tension between generality and specificity is, I think, often the hallmark of really great scholarship — a category David Bamman's work clearly falls into, in my view. If there are any lovers of classical languages or literature who still doubt whether they are computational linguists, Bamman should persuade you that in 2010 we are all computational linguists.