Hugh Cayless's transcoding transformer library (available from the Epidoc project's SourceForge site) is indispensable for anyone working with ancient Greek texts in Java or Groovy. How reliable is it?
I decided to test it against two significant lists of unique Greek strings. For each list, I converted every TLG beta code word to Unicode (UTF-8), then converted the resulting Unicode string back to beta code, and compared that result to the original. (For an overview of the TLG's beta code conventions, see this guide.)
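The round-trip check described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `RoundTripCheck`, the method names, and the two toy converters are hypothetical stand-ins that handle five unaccented letters. They are not the transcoder's API; in a real run you would pass in functions that call the transcoder instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical harness: names and toy converters are illustrative only.
class RoundTripCheck {

    // Toy alphabet: beta code A B G D E <-> alpha beta gamma delta epsilon.
    static final String BETA = "ABGDE";
    static final String GREEK = "\u03b1\u03b2\u03b3\u03b4\u03b5";

    // Stand-in for beta code -> Unicode; accepts either letter case.
    static String toyToUnicode(String beta) {
        StringBuilder out = new StringBuilder();
        for (char c : beta.toCharArray()) {
            int i = BETA.indexOf(Character.toUpperCase(c));
            out.append(i >= 0 ? GREEK.charAt(i) : c);
        }
        return out.toString();
    }

    // Stand-in for Unicode -> beta code; always emits upper-case beta code.
    static String toyToBetaCode(String greek) {
        StringBuilder out = new StringBuilder();
        for (char c : greek.toCharArray()) {
            int i = GREEK.indexOf(c);
            out.append(i >= 0 ? BETA.charAt(i) : c);
        }
        return out.toString();
    }

    // Returns every word whose round trip does not reproduce the original.
    static List<String> failures(List<String> words,
                                 UnaryOperator<String> toUnicode,
                                 UnaryOperator<String> toBetaCode) {
        List<String> failed = new ArrayList<>();
        for (String word : words) {
            String roundTripped = toBetaCode.apply(toUnicode.apply(word));
            if (!roundTripped.equals(word)) {
                failed.add(word);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("ABG", "DE", "abg");
        List<String> failed = failures(words,
                RoundTripCheck::toyToUnicode, RoundTripCheck::toyToBetaCode);
        System.out.println(failed.size() + " of " + words.size()
                + " failed: " + failed);
    }
}
```

In this toy run the lower-case entry is reported as a failure because the return trip normalizes it to upper case. A reported mismatch can therefore mean that the input, not the converter, departs from the expected form, so each failure still has to be inspected by hand.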
The first list was composed of 858715 words excluding proper names. The transcoder round-tripped to its starting point in 858709 cases. Six failures doesn't sound bad (a success rate above 99.999%). But look more closely: in five of the six failures, the TLG entry in fact breaks the TLG's own encoding rules about the order of accents, breathings and iota subscripts. The transcoder correctly follows the rules, so its conversion back to beta code actually corrects a data entry error in the TLG! The sixth case is a sequence found only in a papyrus fragment. The beta code series should represent an omicron with rough breathing and circumflex, an accentuation that is not possible in Greek, since omicron is always short and a circumflex can stand only on a long vowel or diphthong.
The second word list I tried was composed of proper names, including the tricky sequences beta code introduces in its conventions for capitalization. Out of 53167 capitalized words, the transcoder round-tripped perfectly in all but one: again, an error in the TLG data entry that the transcoder corrected!
That's a total of 911882 unique strings. (That's going way beyond carefully chosen unit tests!) Remarkably, the transcoder had a 100% success rate on correctly formed words.
Wednesday, August 6, 2008
Epidoc transcoding transformer bats 1.000
Labels: Epidoc, Greek, tools, transcoding transformer
3 comments:
Hi Neel,
We've just started using Hugh's transcoder here in Chicago as well. We requested two tweaks, one to deal with non-capitalized beta code as in use by Perseus, and one to use the 'tonos' Unicode characters rather than the deprecated 'oxia' characters where tonos characters exist. I think that in both cases, Hugh got back to us in under ten minutes with the re-write! A public thank-you, therefore.
Truly impressive and very reassuring.
So while I can appreciate highly theorized concerns about the preparation needed to appreciate Plato "properly", the Anglo-Saxon empiricist in me looks at these Google search results and still wonders: just who exactly owns Plato?