Friday, August 27, 2010

What's the difference?

How do we compare two texts? For line-by-line comparison of electronic files, Unix systems have had a diff command since time immemorial. (A version had already been around long enough for Hunt and McIlroy to write an article about it in 1976, before I had ever laid eyes on a computer, and before any of my students were born.) For XML documents, XMLUnit is a Java library that can describe differences not only in the text nodes of two documents, but in their XML structure as well.

Both diff and XMLUnit describe differences between electronic documents in specific formats, and report on differences in terms of that format (line-by-line structure from diff; XML structure from XMLUnit). Neither solution is adequate for the Homer Multitext project. Its collection of texts is not defined in terms of specific document formats: instead, texts are managed and manipulated by the more abstract references of canonical citation (e.g., Homeric papyri, and texts from Byzantine manuscripts). How would we compare two passages of texts identified by canonical citations?

Traditional Homeric scholarship suggests an approach. Homerists refer to "vertical" vs. "horizontal" variation across versions of the Iliad. "Vertical variation" refers to entire lines that are present in one version but not another; "horizontal variation" refers to differences across versions in "the same" line (that is, a line that is canonically cited by the same reference). This can be generalized to any canonically citable text, if we think of vertical variation as the differences not just of Iliadic lines, but of the citation units of any text. (By this way of thinking, manuscripts of the Gospel of John that include the longer ending would show "vertical" variation compared to manuscripts that stop after the shorter ending.) Horizontal variation then would be a description of the variation within a single citation unit (whether that is a line of the Iliad or verse of John).

We could then reduce vertical difference to an operation on two ordered lists of citation values. Analogously, if we can tokenize the textual content of a citable passage, we could treat horizontal difference as an operation on two ordered lists of tokens. (Tokenizing texts has its own challenges, but I'll blog on that topic some other time.)
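
To make the idea concrete, here is a minimal sketch in Python, using the standard library's difflib rather than the collation class mentioned below. The function name `sequence_diff`, the citation values, and the English tokens are all invented for illustration; they are not data from any manuscript. The point is only that the same function handles both cases, because both inputs are ordered lists.

```python
from difflib import SequenceMatcher


def sequence_diff(a, b):
    """Return (tag, items_from_a, items_from_b) for each region where
    the two ordered lists differ. Tags are difflib's own:
    'delete', 'insert', or 'replace'."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# "Vertical" comparison: two ordered lists of citation values
# (hypothetical values; version B lacks the unit cited as 1.3).
refs_a = ["1.1", "1.2", "1.3", "1.4"]
refs_b = ["1.1", "1.2", "1.4"]
vertical = sequence_diff(refs_a, refs_b)

# "Horizontal" comparison: two ordered lists of tokens from "the same"
# citation unit (invented English tokens, split naively on whitespace).
tokens_a = "sing goddess the wrath of Achilles".split()
tokens_b = "sing goddess the anger of Achilles".split()
horizontal = sequence_diff(tokens_a, tokens_b)

print(vertical)    # [('delete', ['1.3'], [])]
print(horizontal)  # [('replace', ['wrath'], ['anger'])]
```

Splitting on whitespace stands in for real tokenization here, which is a harder problem than this sketch admits.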

This is a happy result. First, we can acquire the data for a comparison with existing requests in the CTS protocol, either to determine citation values or to find the textual content of a passage. Second, we can apply a single algorithm to both vertical/structural variation and horizontal/intra-node variation. Third, comparing two ordered sequences of tokens is one of the best-studied problems in computer science (since applications like comparing DNA sequences have attracted a lot more attention than differences in Homeric manuscripts). I googled up this site and had a basic collation class running in a few minutes (though a good programmer would have managed it far more easily than I did with my stumbling efforts).

This raises the far more interesting question: what exactly do we want to know about the vertical and horizontal differences between two passages? Some of the things we can determine include:
  • what tokens are unique to passage A?
  • what tokens are unique to passage B?
  • what is the longest common subsequence of the two lists?
  • what is the complete ordered union of the two lists, and the status of each token in the list (A only, B only, in both, in both but in a different position)?
I've blogged a very minimal example comparing two manuscripts of the Iliad. There's lots more to do.
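
As a rough illustration of how the four questions listed above fit together, the sketch below builds a longest-common-subsequence table for two lists and walks it once to produce the ordered union with a status for each token. This is a textbook dynamic-programming approach, not necessarily what the collation class I found does. The names `lcs_table` and `collate`, the status labels, and the sample tokens are my own assumptions. One simplification to flag loudly: "different position" is decided by simple membership in the other list, so repeated tokens are not handled carefully.

```python
def lcs_table(a, b):
    """table[i][j] = length of the longest common subsequence
    of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def collate(a, b):
    """Return the ordered union of a and b as (token, status) pairs."""
    table = lcs_table(a, b)
    in_a, in_b = set(a), set(b)

    def status_a(tok):
        # Assumption: a token outside the common subsequence that also
        # occurs somewhere in the other list counts as displaced.
        return "both, different position" if tok in in_b else "A only"

    def status_b(tok):
        return "both, different position" if tok in in_a else "B only"

    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            merged.append((a[i], "both"))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            merged.append((a[i], status_a(a[i])))
            i += 1
        else:
            merged.append((b[j], status_b(b[j])))
            j += 1
    merged.extend((tok, status_a(tok)) for tok in a[i:])
    merged.extend((tok, status_b(tok)) for tok in b[j:])
    return merged


# Invented tokens standing in for two versions of one passage.
passage_a = ["alpha", "beta", "gamma", "delta"]
passage_b = ["alpha", "gamma", "beta", "epsilon"]

union = collate(passage_a, passage_b)
unique_to_a = [tok for tok, status in union if status == "A only"]
unique_to_b = [tok for tok, status in union if status == "B only"]
common = [tok for tok, status in union if status == "both"]

print(unique_to_a)  # ['delta']
print(unique_to_b)  # ['epsilon']
print(common)       # ['alpha', 'gamma']
```

Because the inputs are just ordered lists, the same call works whether they hold citation values (vertical variation) or tokens within one citation unit (horizontal variation).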


filologanoga said...

If we could tokenize the textual content of a citable passage, could we then also compare the same passage in two languages (Latin and Greek, Greek and English)?

Neel Smith said...

One of the beautiful features of the Canonical Text Services protocol is that we can directly do "vertical differences" (differences in citation values) across versions, including translations.

For aligning and comparing the textual contents of works in different languages, see David Bamman's stunning work. He's able to align individual words in English and Greek at something like a 70% success rate, I believe.

(I blogged that briefly here.)