Hugh Cayless's transcoding transformer library (available from the Epidoc project's SourceForge site here) is indispensable for anyone working with ancient Greek texts in Java or Groovy. How reliable is it?
I decided to test it against two large lists of unique Greek strings. For each list, I converted each TLG beta code word to UTF-8, converted the resulting UTF-8 back to beta code, and compared that result to the original. (For an overview of the TLG's beta code conventions, see this guide.)
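The round-trip check described above can be sketched in Java roughly as follows. This is a minimal sketch, not the code used for the test: `RoundTripCheck`, `failures`, `toyToUnicode` and `toyToBetaCode` are all hypothetical names, and the two toy converters (five letters, no diacritics) merely stand in for the real transcoder, whose API is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

final class RoundTripCheck {

    // Toy five-letter table; an assumption for illustration, NOT the transcoder's mapping.
    private static final String BETA = "ABGDE";
    private static final String GREEK = "\u03b1\u03b2\u03b3\u03b4\u03b5"; // alpha..epsilon

    /** Returns every word whose round trip does not reproduce the original string. */
    static List<String> failures(List<String> words,
                                 UnaryOperator<String> toUnicode,
                                 UnaryOperator<String> toBetaCode) {
        List<String> failed = new ArrayList<>();
        for (String word : words) {
            String roundTripped = toBetaCode.apply(toUnicode.apply(word));
            if (!roundTripped.equals(word)) {
                failed.add(word);
            }
        }
        return failed;
    }

    /** Toy stand-in: accepts upper- or lowercase beta code letters. */
    static String toyToUnicode(String beta) {
        StringBuilder sb = new StringBuilder();
        for (char c : beta.toCharArray()) {
            int i = BETA.indexOf(Character.toUpperCase(c));
            sb.append(i >= 0 ? GREEK.charAt(i) : c);
        }
        return sb.toString();
    }

    /** Toy stand-in: always emits uppercase beta code letters. */
    static String toyToBetaCode(String greek) {
        StringBuilder sb = new StringBuilder();
        for (char c : greek.toCharArray()) {
            int i = GREEK.indexOf(c);
            sb.append(i >= 0 ? BETA.charAt(i) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("ABG", "DE", "abg");
        // "abg" is reported: the toy converter normalizes it to "ABG" on the way back.
        System.out.println(failures(words,
                RoundTripCheck::toyToUnicode, RoundTripCheck::toyToBetaCode));
    }
}
```

Passing the converters in as functions keeps the comparison loop independent of any particular library. A reported word is not necessarily a converter bug: it only means the output differs from the input, which can also happen when the converter normalizes a non-canonical input.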
The first list consisted of 858715 words, excluding proper names. The transcoder round-tripped to its starting point in 858709 cases. Six failures doesn't sound bad (a success rate above 99.999%). But look more closely: in five of the six failures, the TLG entry itself breaks the TLG's encoding rules for the order of accents, breathings and iota subscripts, while the transcoder correctly follows them. Its conversion back to beta code therefore corrects a data entry error in the TLG! The sixth case is a sequence found only in a papyrus fragment. The beta code series
should represent an omicron with rough breathing and circumflex – an accentuation that is not possible in Greek.
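To separate genuine round-trip failures from entries that merely list their diacritics in a non-canonical order, one could normalize the order before comparing. The sketch below is hypothetical throughout (`BetaDiacriticOrder`, `rank` and `normalize` are illustrative names). It assumes the canonical order after a letter is breathing (`)` or `(`), then accent (`/`, `\` or `=`), then iota subscript (`|`), and it ignores every other beta code diacritic.

```java
final class BetaDiacriticOrder {

    // Assumed canonical rank: breathing first, accent second, iota subscript last.
    private static int rank(char c) {
        switch (c) {
            case ')': case '(':
                return 0;
            case '/': case '\\': case '=':
                return 1;
            case '|':
                return 2;
            default:
                return -1; // not one of the diacritics handled in this sketch
        }
    }

    /** Reorders each run of consecutive diacritics into the assumed canonical order. */
    static String normalize(String beta) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < beta.length()) {
            if (rank(beta.charAt(i)) < 0) {
                out.append(beta.charAt(i++));
                continue;
            }
            int start = i;
            while (i < beta.length() && rank(beta.charAt(i)) >= 0) {
                i++;
            }
            char[] run = beta.substring(start, i).toCharArray();
            // Stable insertion sort by rank; runs are at most a few characters long.
            for (int a = 1; a < run.length; a++) {
                char key = run[a];
                int b = a - 1;
                while (b >= 0 && rank(run[b]) > rank(key)) {
                    run[b + 1] = run[b];
                    b--;
                }
                run[b + 1] = key;
            }
            out.append(run);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("W|(=")); // prints W(=|
    }
}
```

If a word fails the raw comparison but `normalize(original)` equals the round-tripped result, the mismatch comes from the ordering of the input rather than from the converter.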
The second list consisted of proper names, which exercise the tricky sequences that beta code's capitalization conventions introduce. Out of 53167 capitalized words, the transcoder round-tripped perfectly in all but one, and that one was again a TLG data entry error that the transcoder corrected!
That's a total of 911882 unique strings, far beyond what carefully chosen unit tests would cover. Remarkably, the transcoder had a 100% success rate on correctly formed words.
Wednesday, August 6, 2008
Epidoc transcoding transformer bats 1.000
Labels: Epidoc, Greek, tools, transcoding transformer
Comments:
Hi Neel,
We've just started using Hugh's transcoder here in Chicago as well. We requested two tweaks, one to deal with non-capitalized beta code as in use by Perseus, and one to use the 'tonos' Unicode characters rather than the deprecated 'oxia' characters where tonos characters exist. I think that in both cases, Hugh got back to us in under ten minutes with the re-write! A public thank-you, therefore.
Truly impressive and very reassuring.