Saturday, February 4, 2012

Ancient Greek is broken

It is 2012, and it is not possible to edit an original document from archaic or classical Greece digitally.

The inscriptions recording the construction of the Parthenon cannot be edited digitally; the Athenian Tribute Lists reflecting the annual payments members of the Delian League made to Athens in the fifth century B.C.E. cannot be edited digitally; votive offerings to Apollo at Delphi, dipinti on classical Greek pottery, graffiti scratched by Greek mercenaries on the colossal statues at Abu Simbel in Egypt — none can be edited digitally.

We are prevented from fully and accurately editing archaic and classical Greek by inadequate or erroneous technical standards defining the representation of languages, writing systems, and digital character encodings. Unlike Claude Rains’ famously feigned reaction in Casablanca, I am genuinely shocked that most of the standards keeping us from editing classical Greek have been adopted unmodified from recommendations by professional classicists. (Think about that the next time you want to evaluate the state of digital scholarship in the humanities.)

Each of these three shortcomings is worth discussing separately, so I plan to post more detailed comments on them individually, but here is a brief summary of the problem.

1. Language

A text must identify what languages its content represents. We do that with International Organization for Standardization (ISO) codes for languages. The registration authority for the ongoing work to develop a comprehensive set of three-letter language codes (ISO 639-3) is SIL International.

While some language codes are organized in families (so that software can recognize related dialects or languages and process their contents appropriately), archaic and classical Greek are lumped under the single code grc. (This is at least an improvement on the earlier ISO 639-2 list of codes, in which Mycenaean Greek written in Linear B could not be distinguished from classical Greek!)

We tell students reading Plato that the text is in the Attic dialect, and would not ask them to consider interpretations that are only possible in other dialects. The string τό, for example, might be a form of the relative pronoun in Ionic Greek, but in Attic it can only be the definite article (“the”).

We should treat our software equally kindly, by explicitly encoding the dialectal variant of ancient Greek used in a text.
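Until dialect subtags are registered, one interim mechanism already exists: the private-use extension of BCP 47 language tags. A minimal sketch, with subtag names that are entirely hypothetical and registered nowhere:

```python
# Hypothetical private-use language tags for Greek dialects, built on the
# registered subtag grc with BCP 47's "-x-" private-use extension.
ATTIC = "grc-x-attic"  # classical Greek, Attic dialect (subtag invented here)
IONIC = "grc-x-ionic"  # classical Greek, Ionic dialect (subtag invented here)

def primary_language(tag: str) -> str:
    """Return the primary language subtag of a BCP 47 tag."""
    return tag.split("-", 1)[0]

# Software that knows nothing of our private subtags still recognizes
# both texts as ancient Greek.
assert primary_language(ATTIC) == primary_language(IONIC) == "grc"
```

Dialect-aware software could act on the private subtags, while everything else would fall back to plain grc.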

2. Writing system

If we are editing an ancient Greek document, we must identify the document’s writing system, since archaic and classical Greek city-states used a variety of distinct alphabets. In 403 B.C.E., the Athenians voted to adopt as their official writing system the alphabet used in Ionia, replacing the Attic alphabet they had used up to that time. The language spoken in Athens did not change, but the writing system did.

The Ionian alphabet is the direct ancestor of the modern Greek alphabet. In this alphabet, the letter epsilon represents a short vowel contrasting with a long vowel represented by the letter eta. In the old Attic alphabet, on the other hand, those two sounds were both represented by the single letter epsilon, while a glyph essentially identical in appearance to the Ionic eta represented a consonant, pronounced like a modern English H (or like the “rough breathing” in modern printing of ancient Attic). Any reader (or any computer program) that tries to interpret a text written in the old Attic alphabet as though it were written in the modern, Ionic alphabet will fail spectacularly, even though the language is unchanged.
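To see how badly the interpretation diverges, consider a toy sketch (the phonemic values are deliberately simplified, and the glossing function is invented for illustration):

```python
# One letter sequence read under two alphabets. In the old Attic alphabet,
# Η is the consonant /h/; in the Ionic alphabet, it is the long vowel eta.
# (Phonemic values simplified for illustration.)
OLD_ATTIC = {"Ε": "e/ē", "Η": "h", "Ο": "o"}
IONIC = {"Ε": "e", "Η": "ē", "Ο": "o"}

def gloss(text: str, alphabet: dict) -> str:
    return " ".join(alphabet.get(ch, ch) for ch in text)

# ΗΟ: the old Attic spelling of the masculine definite article ho, "the".
print(gloss("ΗΟ", OLD_ATTIC))  # "h o"  -> ho, "the"
print(gloss("ΗΟ", IONIC))      # "ē o"  -> ēo, nonsense
```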

ISO standard 15924 defines codes to identify the writing system of a text. The current version includes no way to distinguish archaic and classical Greek alphabets from the alphabet of modern printed texts.
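ISO 15924 does at least reserve a range of codes (Qaaa through Qabx) for private use, so an interim convention is possible here too. A sketch, with an assignment invented purely for this example:

```python
# A hypothetical private-use script code from ISO 15924's reserved range
# Qaaa..Qabx; the assignment below is our own convention, registered nowhere.
OLD_ATTIC_SCRIPT = "Qaaa"  # pre-403 Athenian alphabet (invented assignment)

# BCP 47 lets a script subtag follow the language subtag, so a full tag
# for, say, a pre-403 Athenian inscription might look like this:
tag = f"grc-{OLD_ATTIC_SCRIPT}-x-attic"
print(tag)  # grc-Qaaa-x-attic
```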

3. Digital character set

Once we have identified the language and the writing system of our text, we have to record its contents. The Unicode Consortium defines the standard that is by far the most comprehensive and widely supported digital character set today.

Of the sections of the Unicode specification that I have looked at closely, few are as misconceived as the ancient Greek section. I’ll save a fuller catalog of its problems for a separate post, but a brief contrast with the clean design of the Arabic section of Unicode illustrates the point.

In Arabic, a single letter might have distinct forms when written separately, initially, medially or finally. A free-standing letter kaf ك looks quite different from the first letter of the word “book”

كتاب

for example. Software following the Unicode specification can represent all instances of kaf with the same code point: the different letter forms are treated as presentational variants depending on the position of the letter in relation to other letters.

Now search the Unicode character charts for the term “sigma.” We find two distinct upper-case sigmas, and no fewer than three lower-case sigmas, with a lunate form and a terminal sigma each given distinct code points.
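The tally is easy to check; the code points and character names below come straight from the Unicode character database:

```python
import unicodedata

# The five encoded sigmas: Σ and Ϲ (upper case); σ, ς, and ϲ (lower case).
for ch in "\u03A3\u03F9\u03C3\u03C2\u03F2":
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch)}")
```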

While medial and terminal sigma are, like the different forms of Arabic kaf, contextually determined variant glyphs, lunate sigma is simply a font choice used by editors who do not wish to distinguish a final form of sigma from other forms (often because they are editing fragmentary texts like papyri, where it might be difficult to decide where word breaks occur in a handful of isolated letters). In all cases, an editor should be able to encode a simple sigma, and searching or parsing of the digital text would then work on any form of sigma. Publishers who preferred the papyrologists’ lunate form of the letter could use a font with that glyph for sigma; publishers preferring the two traditional print forms could use a font with a variant glyph for terminal sigma.

Because lunate sigma is wrongly defined as a distinct character, however, you now have to check manually for lunate forms of sigma alongside the other forms whenever you want to parse or search a text encoded in Unicode Greek. Do you want to do that? Do you want to rely on the authors of your software remembering to do that?
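Here, for instance, is the kind of folding step (a sketch; the function is my own invention) that every search or parsing routine for Unicode Greek is forced to carry around:

```python
# Fold every encoded sigma variant onto the plain letters Σ/σ before
# comparing, searching, or parsing.
SIGMA_FOLD = str.maketrans({
    "\u03C2": "\u03C3",  # ς final sigma           -> σ
    "\u03F2": "\u03C3",  # ϲ lunate sigma          -> σ
    "\u03F9": "\u03A3",  # Ϲ capital lunate sigma  -> Σ
})

def fold_sigmas(text: str) -> str:
    return text.translate(SIGMA_FOLD)

# The same word spelled with final sigma and with lunate sigma is
# byte-identical only after folding.
assert "λόγος" != "λόγοϲ"
assert fold_sigmas("λόγος") == fold_sigmas("λόγοϲ")
```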

Solutions?

International standards processes are slow. While it’s reasonable for standards bodies like ISO to rely on the recommendations of professional organizations with expertise in a specific domain, in a field like classics this can be problematic. The American Philological Association is a professional organization often thought to represent the field of classics, but its role in recommendations to standards bodies like the Unicode Consortium, and its complete absence from discussions like the ongoing revision of international language codes, suggest that, because of what I’ve called the recursive arithmetic of tenure, it institutionalizes conventional wisdom and obsolete assumptions, and helps sustain cargo-cult scholarship.

But in recent months we’ve seen example after example of traditional institutions overtaken by motivated groups using the internet to organize. Can we form enough of an on-line community to move better standards through ISO and the Unicode Consortium, in alliance with or independent of existing professional groups?


4 comments:

Gabriel Bodard said...

A couple of years ago we started trying to come up with a solution to your problems 1 and 2 [i.e. dialects and alphabets] within the Markup list community. This effort petered out in the usual way (and I don't know if I can find our documentation now), but in essence what we were aiming for was a two-pronged movement:

1. come up with a simple but usable scheme of Greek dialects and alphabets/scripts, and propose sub-codes for them in a form that would likely be palatable to ISO and SIL in the long-term, with a view to proposing these as additions to the canon.

2. express these dialects/scripts using the existing method for user-defined sub-tags in ISO/SIL in the meantime, so that we have a mechanism useful to scholars immediately.

It sounds like you are saying we should resurrect this effort now. Should we start the circular emails again?

Unknown said...

I can't *tell* you the number of hours I have spent dealing with the sigma insanity in Unicode. It is *utterly* mad. Greek had the misfortune of being an early entrant into the Unicode space, and presumably they learned from their mistakes when it came to dealing with Arabic.

I'd love it if some work could be done on retroactively fixing this. They did it in deprecating vowels with oxia in favor of tonos (so έ is \u03AD, not \u1F73). Deprecating ς in favor of a multiform σ would be a bigger deal, but surely achievable, and lunate sigma, as you say, should simply be a question of the font.

+1 to Gabby on fixing the language codes. The infrastructure is in place for doing it, and I don't see any reason subtags for Greek dialects shouldn't go in.

Neel Smith said...

Yes, I recall the Markup list discussions. How do we go further now?

Like Hugh, +1 to starting with language codes. How do we organize enough of a community to push this through *now*? I'm not a great social activist. How can we generate the energy to make this happen?

Unknown said...

Well, as with pushing for any change, awareness and education are paramount! So, thank you Neel for posting about this.

I just discovered that the Unicode Consortium posts their updates and releases on Twitter. Perhaps we can see where social media can spark some discussion? Worth a try!
