Monday, August 23, 2010

Greek, Latin, Arabic

Last week, I attended a meeting on "Greek, Latin, Arabic" at Tufts University. Some of the most stimulating discussion focused on how it's possible to exploit more or less limited corpora of structured digital texts to find valuable information in a larger, less structured morass (think: the internet). Lots of interesting research worth blogging about (although you won't be likely to hear about any of it if you go to an APA convention), but I want to comment briefly on David Bamman's presentation, because I think his work is as significant as any research I've seen in classics in the past 30 years.

For some time, Bamman has been pursuing interesting work in two distinct areas: (1) automatic alignment of texts in different languages, and (2) using dependency treebanks to represent the syntax of Greek and Latin. I've followed his progress for a couple of years, but last week was the first time I began to realize how he can weave these two strands of work together. (That's only a comment on my own obtuseness.) If you want to skip the methods and go straight to an astonishing result, follow this link.

It's a dynamically induced lexicon. Stop, and reread that sentence. It's a dynamically induced lexicon.

I've read this (preprint of an) article about projecting markup across translations a couple of times. That's not really enough, and I'll probably reread it tomorrow, but let me reduce one result to this summary: at the level of individual words, Bamman achieves about a 70% success rate aligning roughly five million words of Greek with seven million words of English. That in itself is fairly astonishing, but Bamman also leverages his work building treebanks to model the syntax of Greek and Latin texts. Crossing the syntactic models from his treebanking with information gleaned by aligning versions of texts in different languages, Bamman builds a dynamic lexicon. It can take a Greek term and trace how translations in a language like English render that term in different sets of texts, including recognizing the syntactic constructions in which the term appears. Conversely, it can take an English term, give the most closely corresponding Greek terms, and from there, again lead you through the history of the Greek term, as glossed or explicated (automatically) by its English translations.
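To make the idea of an induced lexicon a little more concrete, here is a toy sketch in Python. It is not Bamman's method: it assumes the hard part (word alignment) has already been done and simply counts which English words are aligned to each Greek lemma, in both directions. The sample pairs and the function names `induce_lexicon` and `senses` are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (Greek lemma, English word) pairs, standing in
# for the output of a word aligner run over Greek texts and translations.
ALIGNED_PAIRS = [
    ("λόγος", "word"),
    ("λόγος", "speech"),
    ("λόγος", "word"),
    ("λόγος", "reason"),
    ("ἀρχή", "beginning"),
    ("ἀρχή", "rule"),
    ("ἀρχή", "beginning"),
]


def induce_lexicon(pairs):
    """Count, for each source term, how often each target term is aligned to it."""
    lexicon = defaultdict(Counter)
    for source, target in pairs:
        lexicon[source][target] += 1
    return lexicon


def senses(lexicon, term):
    """Return (translation, relative frequency) pairs, most frequent first."""
    counts = lexicon.get(term, Counter())
    total = sum(counts.values())
    return [(target, n / total) for target, n in counts.most_common()]


# Greek -> English, and the reverse lookup built from the same pairs.
greek_to_english = induce_lexicon(ALIGNED_PAIRS)
english_to_greek = induce_lexicon((en, gr) for gr, en in ALIGNED_PAIRS)

if __name__ == "__main__":
    print(senses(greek_to_english, "λόγος"))
    print(senses(english_to_greek, "beginning"))
```

A real system would also condition these counts on the corpus or author and on the syntactic construction, which is where the treebank data comes in; the sketch leaves all of that out.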

In his presentation, as in the written publications I have seen to date, Bamman's work simultaneously shows the general implications of his research for computational linguistics, and how the Latin and Greek case studies he has chosen are distinctive. That tension between generality and specificity is, I think, often the hallmark of really great scholarship, a category David Bamman's work clearly falls into, in my view. If there are any lovers of classical languages or literature who still doubt whether they are computational linguists, Bamman should persuade you that in 2010 we are all computational linguists.





