How do we compare two texts? For line-by-line comparison of electronic files, Unix systems have had a diff command since time immemorial. (A version had already been around long enough for Hunt and McIlroy to write an article about it in 1976, before I had ever laid eyes on a computer, and before any of my students were born.) For XML documents, XMLUnit is a Java library that can describe differences not only in the text nodes of two documents, but in their XML structure as well. Both diff and XMLUnit describe differences between electronic documents in specific formats, and report on differences in terms of those formats (line-by-line structure from diff; XML structure from XMLUnit). Neither solution is adequate for the Homer Multitext project. Its collection of texts is not defined in terms of specific document formats: instead, texts are managed and manipulated by the more abstract references of canonical citation.
Traditional Homeric scholarship suggests an approach. Homerists refer to "vertical" vs. "horizontal" variation across versions of the Iliad. "Vertical variation" refers to entire lines that are present in one version but not another; "horizontal variation" refers to differences across versions in "the same" line (that is, a line that is canonically cited by the same reference). This can be generalized to any canonically citable text, if we think of vertical variation as the differences not just of Iliadic lines, but of the citation units of any text. (By this way of thinking, manuscripts of the Gospel of John that include the longer ending would show "vertical" variation compared to manuscripts that stop after the shorter ending.) Horizontal variation then would be a description of the variation within a single citation unit (whether that is a line of the Iliad or verse of John).
We could then reduce vertical difference to an operation on two ordered lists of citation values. Analogously, if we can tokenize the textual content of a citable passage, we could treat horizontal difference as an operation on two ordered lists of tokens. (Tokenizing texts has its own challenges, but I'll blog on that topic some other time.)
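To make the reduction concrete, here is a minimal sketch in Python (not the code I actually wrote) that runs the same comparison over a list of citation values and over a list of tokens. It leans on the standard library's difflib.SequenceMatcher; the citation values, the sample line, and the whitespace tokenizer are all invented for illustration, and the helper name differences is hypothetical.

```python
from difflib import SequenceMatcher


def differences(a, b):
    """Return the non-matching stretches between two ordered lists.

    Each result is (tag, items_from_a, items_from_b), where tag is one of
    difflib's opcode names: "delete", "insert" or "replace".
    """
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Vertical variation: ordered citation values for two versions
# (made-up references, standing in for the result of a citation request).
version_a = ["1.1", "1.2", "1.3", "1.4"]
version_b = ["1.1", "1.2", "1.4"]
vertical = differences(version_a, version_b)

# Horizontal variation: tokens within "the same" citation unit.
# Splitting on whitespace is a placeholder for real tokenization.
line_a = "sing goddess the wrath of Achilles".split()
line_b = "sing goddess the anger of Achilles".split()
horizontal = differences(line_a, line_b)

print(vertical)
print(horizontal)
```

The point of the sketch is only that one function serves both cases: the caller decides whether the list items are citation values or tokens.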
This is a happy result. First, we can acquire the data for a comparison with existing requests in the CTS protocol, to determine citation values, or to find textual content of a passage. Second, we can apply a single algorithm to both vertical/structural variation, and horizontal/intra-node variation. Third, comparing two ordered sequences of tokens is one of the best-studied problems in computer science (since applications like comparing DNA sequences have attracted a lot more attention than differences in Homeric manuscripts). I googled up this site and had a basic collation class running in a few minutes (but a good programmer would be able to do this far more easily than my stumbling efforts).
This raises the far more interesting question: what exactly do we want to know about the vertical and horizontal differences between two passages? Some of the things we can determine include:
- what tokens are unique to passage A?
- what tokens are unique to passage B?
- what is the longest common subsequence of the two lists?
- what is the complete ordered union of the two lists, and the status of each token in the list (A only, B only, in both, in both but in a different position)?
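
All four questions can be answered from one pass over a longest-common-subsequence table. What follows is a rough sketch in Python under my own assumptions, not the collation class mentioned above: the function names, the status labels, and the sample tokens are hypothetical. The "different position" status is decided by a crude membership test, so a token that appears in both lists but is not aligned shows up twice in the union (once at its position in A, once at its position in B), and repeated tokens may be over-reported as moved.

```python
def lcs_table(a, b):
    """table[i][j] holds the length of the LCS of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def collate(a, b):
    """Return the ordered union of a and b as (token, status) pairs."""
    table = lcs_table(a, b)
    in_a, in_b = set(a), set(b)
    moved = "both, different position"

    def status_a(token):
        return moved if token in in_b else "A only"

    def status_b(token):
        return moved if token in in_a else "B only"

    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Aligned token: part of the longest common subsequence.
            merged.append((a[i], "both"))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            # Skipping a[i] loses nothing from the LCS, so emit it from A.
            merged.append((a[i], status_a(a[i])))
            i += 1
        else:
            merged.append((b[j], status_b(b[j])))
            j += 1
    # Whatever remains in either list is unaligned.
    merged.extend((t, status_a(t)) for t in a[i:])
    merged.extend((t, status_b(t)) for t in b[j:])
    return merged


# Invented token lists for two versions of one citation unit.
passage_a = ["sing", "goddess", "the", "wrath", "of", "Achilles"]
passage_b = ["sing", "the", "wrath", "of", "Peleus", "goddess"]

merged = collate(passage_a, passage_b)
only_a = [t for t, s in merged if s == "A only"]
only_b = [t for t, s in merged if s == "B only"]
common = [t for t, s in merged if s == "both"]

for token, status in merged:
    print(f"{token}\t{status}")
```

With these sample lists, only_a holds "Achilles", only_b holds "Peleus", common holds the four aligned tokens, and "goddess" is reported as present in both but in a different position. Because the function only assumes ordered lists of comparable items, the same code would work for citation values in the vertical case.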