Thursday, March 26, 2009

MIT faculty mandate open access to publication

Blogged here by Hal Abelson, chair of the committee that composed the resolution: the MIT faculty have voted unanimously to adopt a resolution that includes these phrases:

[E]ach Faculty member grants to MIT a nonexclusive, irrevocable, paid-up, worldwide license to exercise any and all rights under copyright relating to each of his or her scholarly articles, in any medium

and
The policy is to take effect immediately

Think that's clear enough for any of the hair-splitting legalists out there?

I believed there was a time when humanists set the moral direction for the academy. We're lucky that our scientists' pursuit of truth seems to be generating enough of a draft to pull us along.

Wednesday, November 26, 2008

The vocabulary of ancient Greek

What is the vocabulary of ancient Greek? That is, what set of words, or lexical entities, actually occur in our extant texts?

The First Thousand Years of Greek project (announced here) aims to simplify posing such straightforward questions, but we need more than online texts to talk unambiguously about words. One essential piece of infrastructure is an inventory of uniquely identified lexical entities in Greek. In print publications, lexical entities have traditionally been identified by a word's lemma form. While lemmata are valuable labels, they are potentially ambiguous. Instead, basic principles of information design dictate that arbitrary identifiers guaranteed to be unique should be associated with lemma strings, so that references to a lexical entity can be unambiguously machine processed (using the identifier), and remain intelligible to human readers (using the labelling lemma string).

The Perseus project has given classicists two monumental resources that must be coordinated with an inventory of lexical entities: the digital LSJ lexicon of Greek, and the Morpheus morphological parsing system that can associate surface forms of words with a lemma. Taken together with the invaluable list Peter Heslin has created by running Perseus' morphological parser over the word list of the TLG project's E disk, they suggest an obvious starting point for an inventory of Greek lexical entities: compare these two resources.

The digital LSJ has already been provided with unique identifiers for each entry, and each entry includes a lemma string. Perseus' morphological analyses identify entities by lemma. Where there is a one-to-one mapping between the parser's lemma and the LSJ lemma (normalized so that LSJ's markings of long and short vowels are removed), we can fairly assume that they represent the same entity, and could simply adopt the LSJ identifier to refer to the more general notion of the lexical entity — an unambiguous reference that could be associated with an entry in the lexicon, with morphological analyses, or with any other information.
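
The matching step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers (n1, n2, n3) and the sample lemmata are invented, and I assume that LSJ's length markings appear as combining macrons and breves once the string is decomposed.

```python
import unicodedata
from collections import defaultdict

# Combining macron and combining breve (assumed encoding of LSJ's length marks).
LENGTH_MARKS = {"\u0304", "\u0306"}

def normalize_lemma(lemma):
    """Strip vowel-length marks, leaving accents and breathings intact."""
    decomposed = unicodedata.normalize("NFD", lemma)
    stripped = "".join(ch for ch in decomposed if ch not in LENGTH_MARKS)
    return unicodedata.normalize("NFC", stripped)

def adopt_lsj_ids(lsj_entries, parser_lemmata):
    """Reuse an LSJ id for a parser lemma only when the match is one-to-one."""
    ids_by_lemma = defaultdict(list)
    for entry_id, lemma in lsj_entries:
        ids_by_lemma[normalize_lemma(lemma)].append(entry_id)
    adopted, unresolved = {}, []
    for lemma in parser_lemmata:
        ids = ids_by_lemma.get(unicodedata.normalize("NFC", lemma), [])
        if len(ids) == 1:
            adopted[lemma] = ids[0]
        else:
            unresolved.append(lemma)  # no match, or an ambiguous one
    return adopted, unresolved

# Hypothetical sample data, for illustration only.
lsj_entries = [("n1", "λύ\u0304ω"), ("n2", "λόγος"), ("n3", "φημί")]
parser_lemmata = ["λύω", "λόγος", "καταλύω"]
adopted, unresolved = adopt_lsj_ids(lsj_entries, parser_lemmata)
```

Anything left in `unresolved` falls into one of the problem categories and needs closer inspection.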

While this simple (and easily automated) task takes care of the vast majority of the vocabulary in both the LSJ and in the parser's output, there are several categories of problematic cases. They include:


  • entities where LSJ's orthography differs from the parser's orthography. This is actually a large group with several subcategories, some of which can probably be reliably resolved automatically. For example, LSJ and Morpheus sometimes disagree on whether the lemma form of a verb should be active or middle/passive voice: a careful script could accommodate that kind of variation, but human intervention would be necessary when LSJ and Morpheus use alternate forms of the lemma.

  • entities that appear in the parser's list of lemmata, but not in LSJ. This occurs frequently with compound verbs that are not given separate articles in LSJ. In these cases, since there is no LSJ identifier to reuse, we would, obviously, need to create new identifiers for those entities not in LSJ.

  • "ghost entities." For reasons that are not clear to me, LSJ routinely lists verbal adjectives in -τέον as distinct entities, unconnected to the verb from which they are formed. (E.g., the adjective λυτέον is a distinct entry, unrelated to the verb λύω.) Whatever the reasoning, in a digital environment, this is the wrong taxonomy: the morphological analysis should allow applications to distinguish verbal adjectives from other forms deriving from the same verbal root, while the identifier for the lexical entity should recognize verbal adjectives and conjugated forms of a verb alike as forms of the same entity. Mapping these LSJ and Morpheus lemmata to the correct verbal lemmata will be a relatively straightforward task, but again will need human supervision for some common cases (e.g., δοτέον < δίδωμι).

  • entities in LSJ but not in the list of lemmata generated by running the parser over the TLG E word list. Presumably, these result from the contributors to LSJ covering texts that are beyond the scope of the TLG E disk's corpus. As a basic principle, we should make absolutely explicit what digital corpus of texts an inventory of lexical entities is based on. Since our first pass is working from Heslin's analysis of the TLG E corpus, we should not enter these LSJ IDs into our inventory — at least, not yet. As the inventory is checked against further texts, new vocabulary may appear, and at that time new candidates for addition to the inventory will need to be checked in both LSJ and Morpheus.
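
A first pass at sorting lemmata into these work queues could be as simple as comparing sets. The sketch below is an assumption-laden illustration, not a finished tool: the function name and sample lists are hypothetical, and it does not attempt the orthographic reconciliation described in the first category.

```python
import unicodedata

def nfc(text):
    return unicodedata.normalize("NFC", text)

VERBAL_ADJECTIVE_SUFFIX = nfc("τέον")

def triage(parser_lemmata, lsj_lemmata):
    """Sort lemmata into rough work queues using set comparison."""
    parser_set = {nfc(l) for l in parser_lemmata}
    lsj_set = {nfc(l) for l in lsj_lemmata}
    # Verbal adjectives in -τέον: candidates to fold into their verb.
    ghosts = {l for l in parser_set | lsj_set if l.endswith(VERBAL_ADJECTIVE_SUFFIX)}
    return {
        "shared": sorted((parser_set & lsj_set) - ghosts),
        "parser_only": sorted(parser_set - lsj_set - ghosts),  # need new identifiers
        "lsj_only": sorted(lsj_set - parser_set - ghosts),     # not in the corpus yet
        "verbal_adjectives": sorted(ghosts),                   # need human review
    }

# Hypothetical sample lists, for illustration only.
queues = triage(
    parser_lemmata=["λύω", "καταλύω", "δίδωμι"],
    lsj_lemmata=["λύω", "λυτέον", "δίδωμι", "δοτέον"],
)
```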


That is a substantial, but I think manageable, list of tasks. One easy way to begin would be to limit the scope of coverage further, and rather than beginning from the entire TLG E word list, start with a word list created from a specified corpus of texts. As lemmatized word indices for the First Thousand Years of Greek are released, we will guarantee that all surface forms of a word are resolved to a uniquely identified lexical entity.

Thursday, October 30, 2008

Beyond text

If you are interested in the architecture of scholarly resources, run, don't walk, to Gabe Weaver's new sourceforge site, episteme. The nascent site (opened to coincide with the public release of "digital product" from the Archimedes Palimpsest project) documents his work representing and manipulating information encoded as mathematical diagrams.

There's already a lot to think about here, but one intriguing aspect is that entities in figures are referred to with identifiers that can be coordinated with canonical references to passages of textual content from the same document. (Short-term consequence for me personally — urgent need to re-think my presentation for the "Text and Graphics" panel at next week's TEI meeting in London. Ouch.)

Oh, and if you just want to enjoy some beautiful drawings, there's an Easter egg with a larger display of the image above — a collage of figures from book 1 of Archimedes' treatise On Floating Bodies. You can see it here.

(Updated Oct. 31: Episteme now includes interactive eye candy, too.)

Wednesday, August 6, 2008

Epidoc transcoding transformer bats 1.000

Hugh Cayless's transcoding transformer library (available from the Epidoc project's sourceforge site here) is indispensable for anyone working with ancient Greek texts in java or groovy. How reliable is it?

I decided to test it against two significant lists of unique Greek strings. For each list, I converted the TLG's beta code word to UTF-8, then converted the resulting UTF-8 back to beta code, and compared that result to the original. (For an overview of the TLG's beta code conventions, see this guide.)
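
The shape of the test is easy to sketch. The Python below is not the transcoder library's API (that library is written for Java); it is a generic round-trip harness with a toy stand-in converter that I made up, covering only a handful of unaccented letters.

```python
def round_trip_failures(words, to_unicode, to_beta):
    """Return (original, converted, back) for every word that fails to round-trip."""
    failures = []
    for word in words:
        converted = to_unicode(word)
        back = to_beta(converted)
        if back != word:
            failures.append((word, converted, back))
    return failures

# Toy stand-in for a real transcoder: a few unaccented letters only.
BETA_TO_UNICODE = {"l": "λ", "o": "ο", "g": "γ", "u": "υ", "w": "ω"}
UNICODE_TO_BETA = {greek: beta for beta, greek in BETA_TO_UNICODE.items()}

def toy_to_unicode(beta):
    # The toy converter accepts either case, so it normalizes on the way in.
    return "".join(BETA_TO_UNICODE[ch] for ch in beta.lower())

def toy_to_beta(text):
    return "".join(UNICODE_TO_BETA[ch] for ch in text)

failures = round_trip_failures(["luw", "logou", "LUW"], toy_to_unicode, toy_to_beta)
```

In this toy run the only "failure" is the uppercase input, which comes back in normalized lowercase: a mismatch that reflects the input rather than a fault in the converter.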

The first list was composed of 858715 words excluding proper names. The transcoder round-tripped to its starting point in 858709 cases. Six failures doesn't sound bad (a 99.999% success rate). But look more closely: in five of the six failures, the TLG entry in fact breaks the TLG's own encoding rules about the order of accents, breathings and iota subscripts. The transcoder correctly follows the rules, so its conversion back to beta code actually corrects a data entry error in the TLG! The sixth case is a sequence found only in a papyrus fragment: the beta code series o(= would represent an omicron with rough breathing and circumflex, an accentuation that is not possible in Greek.

The second word list I tried was composed of proper names, including the tricky sequences beta code introduces in its conventions for capitalization. Out of 53167 capitalized words, the transcoder round tripped perfectly in all but one – again, an error in the TLG data entry that the transcoder corrected!

That's a total of 911882 unique strings. (That's going way beyond carefully chosen unit tests!) Remarkably, the transcoder had a 100% success rate in correctly formed words.
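
For anyone who wants to check the arithmetic, here is the tally in Python, using only the counts reported above (the variable names are mine):

```python
common_total, common_ok = 858715, 858709   # words excluding proper names
names_total, names_failed = 53167, 1       # capitalized words

total = common_total + names_total
mismatches = (common_total - common_ok) + names_failed
raw_rate = (total - mismatches) / total

# Every mismatch traces back to malformed input in the word lists,
# so none of them counts against the transcoder on well-formed words.
malformed_inputs = 5 + 1 + 1
failures_on_well_formed = mismatches - malformed_inputs

print(total, mismatches, f"{raw_rate:.5%}", failures_on_well_formed)
```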

Thursday, July 10, 2008

Half empty or half full?

I frequently assert that classicists, along with biblical scholars, share the distinction of using logical citation schemes to refer to the works they study. This practice is important, since it means that references can apply to any version of a work, in print or digital form. (Briefly, in an earlier post.)

I have made this claim so often that I decided it would be a good idea to find out if it were true.

The TLG offers the largest corpus of ancient Greek, so one way to evaluate how classicists cite their works would be simply to count and summarize the citation schemes used in the TLG. Sadly, although this would have been possible until 2000 when the TLG distributed data to its licensees, there is in 2008 no way around the preconceived query interface of the TLG web site. (The fact that such a simple question as "what citation schemes are used?" is now out of reach illustrates the catastrophic consequences for classical studies of the TLG's decision to reverse its decades-old policy of distributing data, in favor of selling access to predetermined user interfaces.)

As in an earlier post estimating the size of the surviving Greek corpus by period, however, we can still use the 2000 version of the TLG Canon distributed on the TLG E disk to get an impression of classicists' citation practice.

As in that post, we'll want to limit ourselves to works transmitted by manuscript copying. I'll take the simplest approach possible: count the number of "works" that use each citation scheme. I won't attempt to normalize in any way the definition of a work: the five-line Homeric Hymn to the Dioscuri is one work, as is the entire Iliad. With that caveat in mind, let's look at the results.

The TLG E canon includes 3810 works transmitted by manuscript and having defined citation schemes. (Note that the Canon includes works not in the E disk; 584 of these works did not yet have a defined citation scheme at the time of the E disk's publication, so I exclude them from our results.) These 3810 works are represented by an astonishing 194 distinct citation schemes!

As we might expect, however, the distribution of these schemes is very uneven: 104 citation schemes are used for a single work; only 16 citation schemes are used for more than 13 works. Let's look more closely at these top 16 citation schemes, which cover 3426 (90%) of the works surveyed.

Citation scheme                  Number  Type
volume/page/line                   1014  physical
section/line                        710  logical
page/line                           517  physical
line                                348  logical
chapter/section/line                334  logical
stephanus page/section/line         114  physical
book/chapter/section/line            75  logical
jebb page/line                       54  physical
book/section/line                    49  logical
bekker page/line                     44  physical
kuehn volume/page/line               39  physical
harduin page/section/line            32  physical
epistle/section/line                 25  logical
chapter/line                         25  logical
epistle/line                         23  logical
scholion/line                        23  logical
Total physical schemes             1814  (53%)
Total logical schemes              1612  (47%)
Grand total                        3426

The overall results are not encouraging. Logical schemes account for only 47% of the 3426 works. The remaining 53% are cited by schemes that refer instead to physical artifacts like book pages (in this table, every scheme that includes a page component). It's small consolation that the numbers are a worst-case scenario: some works may be cited by both logical and physical reference; where the TLG uses a logical reference, we can be sure that a logical scheme exists, but where the TLG uses a physical reference system, we can't always exclude the possibility that an alternative logical scheme is available. For example, the 44 works cited by Bekker page are, of course, the Aristotelian corpus: many of these have alternative citation schemes by chapter or section.
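
The split between physical and logical schemes can be recomputed from the table. In this Python sketch the counts are copied from the table above; the rule that a scheme is "physical" when its name mentions a page is my shortcut, and it happens to work for these 16 schemes only.

```python
top_schemes = {
    "volume/page/line": 1014,
    "section/line": 710,
    "page/line": 517,
    "line": 348,
    "chapter/section/line": 334,
    "stephanus page/section/line": 114,
    "book/chapter/section/line": 75,
    "jebb page/line": 54,
    "book/section/line": 49,
    "bekker page/line": 44,
    "kuehn volume/page/line": 39,
    "harduin page/section/line": 32,
    "epistle/section/line": 25,
    "chapter/line": 25,
    "epistle/line": 23,
    "scholion/line": 23,
}

def is_physical(scheme):
    # Assumption valid for this table only: physical schemes name a page.
    return "page" in scheme

physical = sum(n for scheme, n in top_schemes.items() if is_physical(scheme))
logical = sum(n for scheme, n in top_schemes.items() if not is_physical(scheme))
total = physical + logical

print(f"physical: {physical} ({physical / total:.0%})")
print(f"logical:  {logical} ({logical / total:.0%})")
```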

If we break the numbers down further by the chronological period of the original text, however, the picture changes. With the notable exception of Plato, where Stephanus' great edition became the standard for citation, citation by logical scheme is much more prevalent in works of the classical period. The following table breaks out from the previous listing works dating before about 300 BC.

Citation schemes in works of classical date
Citation scheme                  Number  Type
section/line                        229  logical
line                                 98  logical
bekker page/line                     43  physical
stephanus page/section/line          38  physical
chapter/section/line                 20  logical
volume/page/line                     18  physical
page/line                            16  physical
book/chapter/section/line            11  logical
fable/line                            9  logical
book/line                             5  logical
ode/line                              4  logical
book/section/line                     4  logical
tetralogy/section/line                3  logical
demonstratio/line                     3  logical
epistle/section/line                  3  logical
book/demonstratio/line                2  logical
thevenot page/line                    2  physical
epistle/line                          2  logical
idyll/line                            1  logical
page+column/line                      1  physical
sententia/line                        1  logical
lexical entry/line                    1  logical
proverb/line                          1  logical
folio/line                            1  physical
fable/version/line                    1  logical
exordium/section/line                 1  logical
usener page/line                      1  physical
Total physical schemes              120  (23%)
Total logical schemes               399  (77%)
Grand total                         519


The 519 works are cited in 27 different citation schemes. We could think of that as an "average density" of about 19-20 works per citation scheme, essentially the same as for the overall corpus (194 schemes for 3810 works is also a density of about 19-20 works per citation scheme). But in this listing, only 23% (120) of the classical works use physical reference systems. The corpora of Plato and Aristotle constitute the bulk of this material (81 works); apart from the two great philosophical corpora, only 39 works of the classical period are cited in the TLG by physical reference system – about 8%.
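
The figures in that paragraph follow directly from the counts already given; a quick check in Python (variable names are mine):

```python
classical_works, classical_schemes = 519, 27
all_works, all_schemes = 3810, 194
physical_classical, plato_and_aristotle = 120, 81

density_classical = classical_works / classical_schemes   # works per scheme
density_overall = all_works / all_schemes
other_physical = physical_classical - plato_and_aristotle
share_physical = physical_classical / classical_works
share_other_physical = other_physical / classical_works

print(f"{density_classical:.1f} vs {density_overall:.1f} works per scheme")
print(f"{share_physical:.0%} physical, {share_other_physical:.1%} outside Plato and Aristotle")
```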

It's probably the height of political incorrectness to suggest that the most traditional canon of work has been the object of better quality scholarly study (although it's plausible enough that more scholarship should produce better results), but by the single, one-dimensional yardstick of how a work is cited, editors of classical texts have done a far better job capturing the logical structure of their texts than have editors of ancient Greek overall.

So for classicists interested in creating a digital corpus of Greek, the "news" is mixed. Roughly half the works in the TLG E Canon already depend on logical reference systems, so we already have a good standard in place for many of our texts. The classical period is in markedly better shape.

Friday, April 11, 2008

Citation schemes: empty content elements considered harmful

Classicists have, by and large, relied on standard, logical citation schemes to cite works of ancient literature. In the scheme of the Functional Requirements for Bibliographic Records (or FRBR), we could say that classicists have cited notional works using references that could then be applied to any manifestation or expression of that work.

In the print world, this practice has made it possible for scholars to apply a reference to different printed editions or translations of a work. As the internet becomes our library, this practice can turn references into machine-actionable entry points to the library (whether the reference is automatically discovered, or manually cited by a scholar). It is therefore a vital prerequisite that digital editions encode standard, logical citation data such as the book/chapter/section divisions of Thucydides, or the book/line divisions of the Iliad.

The TEI Guidelines (as so often) offer more than one way to approach the problem. It is valid TEI to encode citation values as attributes on containing elements that define the logical structure of a document. Book/chapter/section in Thucydides might be represented by a successive hierarchy of TEI div elements, for example, or book/line in the Iliad by div elements containing l elements; the citation values could be placed in the @n attribute of each container.
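
To make the container approach concrete, here is a small Python sketch using the standard library's ElementTree. The sample document is a made-up skeleton with placeholder text, not a real edition, and the `citations` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up skeleton: placeholder text inside book/chapter/section containers.
SAMPLE = """<TEI.2><text><body>
<div type="book" n="2">
  <div type="chapter" n="5">
    <div type="section" n="1"><p>Placeholder for section one.</p></div>
    <div type="section" n="2"><p>Placeholder for section two.</p></div>
  </div>
</div>
</body></text></TEI.2>"""

def citations(element, prefix=()):
    """Yield dotted citation values for the innermost div containers."""
    for div in element.findall("div"):
        ref = prefix + (div.get("n"),)
        if div.find("div") is None:
            yield ".".join(ref)          # innermost container: a citable unit
        else:
            yield from citations(div, ref)

body = ET.fromstring(SAMPLE).find("text/body")
refs = list(citations(body))
```

Because each citation value sits on the element that contains the cited text, the full reference falls out of the document hierarchy with no extra bookkeeping.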

Alternatively, since the earliest work of the TEI in the 1980s, the Guidelines have included empty elements (such as the milestone) that could be used to mark transitional points in a document. It is easy to find examples of scholarly texts using such empty elements to mark the beginning of a new unit like a chapter or section.

Arguably, there was little difference between these two approaches in SGML. In XML, however, scholars should avoid using empty elements to encode citation data.

A host of supporting and related technologies have developed around XML in its first decade. One of the most important is XPath, a notation for referring to parts of an XML document by the document's structure. Higher-level technologies such as XSLT or implementations of the DOM model in many programming languages in turn support XPath expressions. The result is that programmers working in many environments can succinctly retrieve a unit like "book 2, chapter 5" of Thucydides with a simple XPath expression like

/TEI.2/text/body/div[@type='book' and @n='2']/div[@type='chapter' and @n='5']

Content between empty elements, on the other hand, cannot be addressed directly with XPath expressions.
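
Here is roughly what such a retrieval looks like in Python with the standard library's ElementTree, run against a made-up skeleton document. ElementTree implements only a subset of XPath, so in this sketch the `and` is replaced by chained predicates and the path is written relative to the root element; a full XPath engine would accept the expression exactly as written above.

```python
import xml.etree.ElementTree as ET

# Made-up skeleton with placeholder text, not a real edition.
SAMPLE = """<TEI.2><text><body>
<div type="book" n="2">
  <div type="chapter" n="4">
    <div type="section" n="1"><p>Placeholder for 2.4.1.</p></div>
  </div>
  <div type="chapter" n="5">
    <div type="section" n="1"><p>Placeholder for 2.5.1.</p></div>
    <div type="section" n="2"><p>Placeholder for 2.5.2.</p></div>
  </div>
</div>
</body></text></TEI.2>"""

root = ET.fromstring(SAMPLE)
chapter = root.find(
    "text/body/div[@type='book'][@n='2']/div[@type='chapter'][@n='5']"
)
sections = [div.get("n") for div in chapter.findall("div[@type='section']")]
chapter_text = " ".join("".join(chapter.itertext()).split())
```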

Placing citation data on empty elements cuts programmers off from a galaxy of technologies they can use when citation data is kept on containing elements. Empty citation elements should never be necessary if the citation scheme is in fact a logical hierarchy: if it is not, consider whether there is a problem either with your choice of citation scheme or with your design of the rest of the document's structure.
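
For contrast, here is a sketch of what a programmer faces when the same citation data sits on empty elements. The sample markup and helper functions are hypothetical; the point is that the text of "chapter 5" has to be gathered by walking the document in order and watching for the next milestone, instead of selecting one container.

```python
import xml.etree.ElementTree as ET

# Made-up sample: chapter boundaries marked only by empty milestone elements.
MILESTONED = (
    '<body><p><milestone unit="chapter" n="4"/>Text of chapter four. '
    '<milestone unit="chapter" n="5"/>Text of chapter five, '
    '<hi>with emphasis</hi> inside. '
    '<milestone unit="chapter" n="6"/>Text of chapter six.</p></body>'
)

def events(elem):
    """Yield element starts and text chunks in document order."""
    yield ("start", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from events(child)
        if child.tail:
            yield ("text", child.tail)

def text_of_unit(root, unit, n):
    """Collect text from the matching milestone up to the next one of that unit."""
    collecting, pieces = False, []
    for kind, value in events(root):
        if kind == "start" and value.tag == "milestone" and value.get("unit") == unit:
            if collecting:
                break                     # next milestone ends the unit
            collecting = value.get("n") == n
        elif kind == "text" and collecting:
            pieces.append(value)
    return "".join(pieces)

chapter_five = text_of_unit(ET.fromstring(MILESTONED), "chapter", "5")
```

Even this toy version needs custom traversal code, and it returns a bare string where the container approach returns a node that other tools can keep working with.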

Separation of concerns applies to document content, too

Twenty years ago, before the internet was open to the public, the print publishing industry was a leader in SGML document markup, and scholarly markup projects tended to think of "documents" as the content bound between a pair of covers. This heritage is clearly reflected in the TEI Guidelines' thorough inventory of elements to identify "front" and "back" material of documents, or a variety of groupings or collections of texts.

The major syntactic differences between XML and SGML (insistence on a single hierarchy of elements, each with an explicitly marked end) were introduced in part to adapt markup to the needs of a very different environment: a network of computers exchanging information dynamically. The already well-understood distinction between semantic markup and presentational markup certainly contributed to the articulation of "separation of concerns" in the design of network applications. Individuals with different skills could apply appropriate technologies to the different parts of a network application, so in creating an application to run in a web browser, programmers might write the controlling code in javascript, and design specialists define its appearance with CSS. In a network of semantically structured content, XML plays the vital roles of defining the data structure (explicitly via a schema or DTD, or implicitly in the case of well-formed XML), and of providing a format for data exchange. The question of what this XML should look like, the kind of question the TEI has considered since the 1980s, had to be rethought. Humanists might rephrase Sun Microsystems' famous slogan, "The network is the computer," as "The network is the library."

When applications can exchange structured content, it is straightforward to create compound documents. Asymmetrically, it can be more difficult to disaggregate a complex document into component parts, since an application then needs a more detailed knowledge of the internal structure of a necessarily more complex document. An application could easily juxtapose a document in original language with a document in translation, or weave together a commentary with a text associated through a common citation system, for example, but disentangling interleaved translation or commentary from a complex document is more problematic.
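
The easy direction can be sketched in a few lines of Python. The two "documents" here are stand-ins I made up: plain mappings from citation values to placeholder passages, as an application might build them after reading two separate files.

```python
# Hypothetical stand-ins for two independent documents keyed by citation.
original = {
    "1.1": "Original-language text of 1.1",
    "1.2": "Original-language text of 1.2",
    "1.3": "Original-language text of 1.3",
}
translation = {
    "1.1": "Translation of 1.1",
    "1.2": "Translation of 1.2",
}

def juxtapose(text, other):
    """Pair each passage with its counterpart, or None where none exists."""
    return [(ref, passage, other.get(ref)) for ref, passage in text.items()]

for ref, source, target in juxtapose(original, translation):
    print(ref, "|", source, "|", target or "(no translation)")
```

The shared citation values do all the work: neither document needs to know anything about the other's internal structure.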

I've been thinking about this in designing a set of TEI documents to represent the multiple texts of the famous Venetus A manuscript of the Iliad. There are four distinct sets of scholia, in addition to the manuscript's text of the Iliad. I chose to treat each set as an independent document, and as I am now reaching the stage of putting together applications drawing on those documents, I am glad that I did: cleanly separated, discrete documents are making that job much easier than it otherwise would be.

I expect that I will never use the elaborate TEI mechanisms to document the relation of a transcribed document to graphic images. In keeping with the guiding principle of separate, discrete documents, I'm associating images of the manuscript with ranges of text through external indices: here, too, the standoff markup of a separate, simple (non-TEI) document is easy to marshal together with the TEI document of the transcribed text.
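
An external index of this kind can be very simple. The Python sketch below is purely illustrative: the image names and line ranges are invented and do not describe the real foliation of the manuscript, and the index format is my assumption.

```python
from typing import NamedTuple, Tuple

class ImageRange(NamedTuple):
    image: str
    first: Tuple[int, int]   # (book, line) where the image's text begins
    last: Tuple[int, int]    # (book, line) where it ends

# Invented entries for illustration; not the manuscript's actual foliation.
INDEX = [
    ImageRange("image-001.jpg", (1, 1), (1, 25)),
    ImageRange("image-002.jpg", (1, 26), (1, 50)),
]

def images_for(book, line, index=INDEX):
    """Return the images whose range of text includes the given Iliad line."""
    ref = (book, line)
    return [entry.image for entry in index if entry.first <= ref <= entry.last]

hits = images_for(1, 30)
```

Because the index cites the text by book and line, it can be maintained, replaced, or extended without touching the transcription.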

In many ways, TEI P5, with its support for XML namespaces, is nudging scholars towards this kind of document organization. But we need to push harder: it's time to move away from monolithic TEI replicas of print or even manuscript sources. In editing scholarly texts for use on the internet, let each logical component stand alone.

Coordinating separate documents in a networked library requires a common understanding of how to cite them. I'll follow up with a note on how editors of TEI texts should think about that part of their markup.