Wednesday, March 5, 2008

The first thousand years of Greek

How much Greek survives from the classical period? From the Hellenistic period?

Those questions could not be answered quantitatively when I was an undergraduate. A very precise answer might still be difficult if we wanted to include inscriptions and papyri, but if we limit ourselves to ancient Greek transmitted to us by manuscript copying, we can very quickly get a pretty satisfactory answer for the first thousand years or so of ancient Greek using the Canon from the TLG E disk.

The data in the Canon can be systematically manipulated using the Diogenes perl library. For each work in the TLG, the Canon contains three fields of information that are of special interest for this question: one indicates the method of transmission; another contains the word count of the TLG's online text; and a third contains a date description. The method of transmission is important because the TLG includes "works" that are known only through testimonia or citation ("fragments," as classicists misleadingly call them), whereas we want to estimate how much Greek actually exists. (We don't care about geographic "fragments" of Hipparchus that are really passages of Strabo. To give an idea of how much of the TLG is made up of this doubling of content: the TLG E disk contains roughly 75-76 million words, and over 4 million of them, roughly 5% of the whole disk, are quoted "fragments" or testimonia!)

While it would be possible to write perl code to query the TLG Canon directly via the Diogenes API, most people would probably find it easier to transform the contents of the Canon into some format where they can use standard technologies. (I have created both a hierarchical XML version of the Canon, and a normalized relational database version; possible topics for another blog entry perhaps.)

The word counts are integer values; the methods of transmission are indicated by a controlled vocabulary: manuscript transmission is either 'Cod' or 'cod'. The only challenge is parsing the Canon's quasi-regular strings describing dates, but there are only a little over 100 unique strings, so scripting a little text munging in your favorite language that supports regular expressions is pretty straightforward.
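As a rough illustration of the filtering step, here is a minimal Python sketch. It assumes the Canon has already been exported to simple records; the field names ("title", "transmission", "word_count"), the sample works, and the "other" transmission value are all made up for the example and are not the Canon's actual schema. Only the 'Cod' and 'cod' values come from the description above.

```python
# Hypothetical records: field names and sample values are illustrative only,
# not the real layout of the TLG Canon.
works = [
    {"title": "Work A", "transmission": "Cod", "word_count": 12000},
    {"title": "Work B", "transmission": "other", "word_count": 800},
    {"title": "Work C", "transmission": "cod", "word_count": 4500},
]

# The two spellings that mark manuscript transmission, as described above.
MANUSCRIPT_CODES = {"Cod", "cod"}


def manuscript_words(records):
    """Sum the word counts of works transmitted by manuscript copying."""
    return sum(
        r["word_count"] for r in records if r["transmission"] in MANUSCRIPT_CODES
    )


total = manuscript_words(works)
print(total)
```

Works with any other transmission value (quoted "fragments", testimonia and so on) simply drop out of the sum.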

The Canon's dates are precise only to the century, so I interpret all dates as ranges. A date of "first century AD" could be interpreted as a range of 1-100 AD, and a date of "third or second century BC" could be interpreted as a range of 299-100 BC, for example.
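The range interpretation can be sketched in Python along the following lines. This is only an illustration under assumptions: it parses English phrases like the two examples just given, whereas the Canon's own date strings are formatted differently and would need their own patterns. The function names and the convention of storing BC years as negative integers are mine. The BC arithmetic is chosen to reproduce the 299-100 BC example.

```python
import re

# Ordinal words the sketch understands; extend as needed.
ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4,
            "fifth": 5, "sixth": 6, "seventh": 7, "eighth": 8}

# Matches "X century AD" or "X or Y century BC" (illustrative phrasing only).
PATTERN = re.compile(
    r"^(?P<a>\w+)(?:\s+or\s+(?P<b>\w+))?\s+century\s+(?P<era>BC|AD)$"
)


def century_range(n, era):
    """Return (start, end) for the n-th century; BC years are negative."""
    if era == "AD":
        return (100 * (n - 1) + 1, 100 * n)
    # Matches the example above: the third century BC starts at 299 BC.
    # min(..., -1) keeps the first century BC from ending at a year 0.
    return (-(100 * n - 1), min(-100 * (n - 1), -1))


def parse_date(text):
    """Turn a century description into a (start, end) range, or None."""
    m = PATTERN.match(text.strip())
    if m is None:
        return None  # e.g. "INCERTUM" or any unrecognized string
    names = [m.group("a")] + ([m.group("b")] if m.group("b") else [])
    try:
        ranges = [century_range(ORDINALS[w.lower()], m.group("era")) for w in names]
    except KeyError:
        return None  # unknown ordinal word
    return (min(s for s, _ in ranges), max(e for _, e in ranges))


print(parse_date("first century AD"))
print(parse_date("third or second century BC"))
```

With only a little over 100 distinct strings to handle, a lookup table built by hand would work just as well as regular expressions.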

At this point, it's time to let the computer do the counting. Here are some results to consider: through 300 AD, the TLG contains over 20 million words, but their chronological distribution is very uneven:


For works dated after or equal to...  ... but before  Number of words  Running total
Earliest Greek writing                500 BC                  384,528        384,528
500 BC                                400 BC                2,251,766      2,636,294
400 BC                                300 BC                1,762,944      4,399,238
300 BC                                200 BC                  921,255      5,320,493
200 BC                                100 BC                  178,655      5,499,148
100 BC                                1 AD                  1,745,320      7,244,468
1 AD                                  200 AD                7,583,759     14,828,227
200 AD                                300 AD                5,373,095     20,201,322
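For anyone who wants to check the arithmetic, the running totals can be re-derived from the per-period counts with a few lines of Python. The counts below are copied from the table; the variable names are just for illustration.

```python
from itertools import accumulate

# (dated after or equal to, but before, number of words), from the table above.
period_counts = [
    ("Earliest Greek writing", "500 BC", 384528),
    ("500 BC", "400 BC", 2251766),
    ("400 BC", "300 BC", 1762944),
    ("300 BC", "200 BC", 921255),
    ("200 BC", "100 BC", 178655),
    ("100 BC", "1 AD", 1745320),
    ("1 AD", "200 AD", 7583759),
    ("200 AD", "300 AD", 5373095),
]

# Cumulative sum of the word counts, one entry per period.
running = list(accumulate(count for _, _, count in period_counts))

for (start, end, count), total in zip(period_counts, running):
    print(f"{start:<22} {end:<7} {count:>10,} {total:>11,}")
```

The last running total is the "over 20 million words" figure quoted above.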


Caveats


Roughly 10% of the contents of the TLG E corpus (7,680,878 words) have dates given as "INCERTUM" or "VARIA"; these are completely omitted from the counts.

We can't really know how Greek is distributed beyond the coverage of the TLG E Canon, because the TLG project no longer makes the Canon available except through its "one-size-fits-all" interface (or, for the questions raised here, "one size fits none"). This is all the more troubling since the TLG's online corpus is now a third again as large as it was in 2000, when the E disk was prepared (by the estimate of the TLG website, 99 million words vs. 76 million words for the E disk).