Vitruvian design for scholarship in the humanities: March 2008

How much Greek survives from the classical period? From the Hellenistic period?

Those questions were impossible to answer quantitatively when I was an undergraduate. It still might be difficult to get a very precise answer if we wanted to consider inscriptions and papyri, but if we limit ourselves to ancient Greek transmitted to us by manuscript copying, we can very quickly get a pretty satisfactory answer for the first thousand years or so of ancient Greek using the Canon from the TLG E disk.

The data in the Canon can be systematically manipulated using the Diogenes Perl library. For each work in the TLG, the Canon contains three fields of special interest for this question: one indicates the method of transmission; another contains the word count of the TLG's online text; and a third contains a date description. The method of transmission is important because the TLG includes "works" that are known only through testimonia or citation ("fragments," as classicists misleadingly call them), whereas we want to estimate how much Greek actually exists. (We don't care about geographic "fragments" of Hipparchus that are really passages of Strabo. To get an idea of how much of the TLG is made up of this doubling of content: the TLG E disk contains roughly 75-76 million words, and over 4 million of them, roughly 5% of the whole disk, are quoted "fragments" or testimonia!)

While it would be possible to write Perl code to query the TLG Canon directly via the Diogenes API, most people would probably find it easier to transform the contents of the Canon into a format they can work with using standard technologies. (I have created both a hierarchical XML version of the Canon and a normalized relational database version; possible topics for another blog entry, perhaps.)

The word counts are integer values; the methods of transmission are indicated by a controlled vocabulary: manuscript transmission is either 'Cod' or 'cod'. The only challenge is parsing the Canon's quasi-regular strings describing dates, but there are only a little over 100 unique strings, so scripting a little text munging in your favorite language that supports regular expressions is pretty straightforward.
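To make the manuscript-only filter concrete, here is a minimal sketch in Python with SQLite. The table name `works`, its columns, and the three sample rows are all invented for illustration; they are not the schema of my relational version of the Canon, and the value 'Other' is only a placeholder for a non-manuscript code. The two values taken from the Canon are 'Cod' and 'cod'.

```python
import sqlite3

# Hypothetical schema: one row per work, with the three fields discussed above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE works ("
    "work_id TEXT PRIMARY KEY, transmission TEXT, "
    "word_count INTEGER, date_raw TEXT)"
)

# Invented sample rows, only so that the query has something to count.
conn.executemany(
    "INSERT INTO works VALUES (?, ?, ?, ?)",
    [
        ("w1", "Cod", 1000, "first century AD"),
        ("w2", "cod", 250, "third or second century BC"),
        ("w3", "Other", 400, "INCERTUM"),  # placeholder for a non-manuscript code
    ],
)


def manuscript_word_count(connection):
    """Sum the word counts of works transmitted by manuscript copying."""
    row = connection.execute(
        "SELECT COALESCE(SUM(word_count), 0) FROM works "
        "WHERE transmission IN ('Cod', 'cod')"
    ).fetchone()
    return row[0]


total = manuscript_word_count(conn)
print(total)
```

With the sample rows, only the first two works are counted; the third is excluded because its transmission code is neither 'Cod' nor 'cod'.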

The Canon's dates are to a precision of a century, so I interpret all dates as ranges. A date of "first century AD" could be interpreted as a range of 1-100 AD, and a date of "third or second century BC" could be interpreted as a range of 299-100 BC, for example.
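Here is a rough sketch of that kind of text munging in Python. It parses the descriptive phrases used in the examples above, not the Canon's actual date strings, which are formatted differently and would need their own patterns. The function names and the use of negative numbers for BC years are my assumptions for this sketch; the arithmetic is chosen simply so that the two examples above come out as stated.

```python
import re

# Ordinal words accepted by this sketch (an assumption; extend as needed).
ORDINALS = {
    "first": 1, "second": 2, "third": 3, "fourth": 4,
    "fifth": 5, "sixth": 6, "seventh": 7, "eighth": 8,
}

# Matches "first century AD" or "third or second century BC".
PATTERN = re.compile(
    r"^(?P<a>\w+)(?:\s+or\s+(?P<b>\w+))?\s+century\s+(?P<era>BC|AD)$",
    re.IGNORECASE,
)


def century_range(n, era):
    """Return (start, end) for the nth century; BC years are negative."""
    if era == "AD":
        return (100 * (n - 1) + 1, 100 * n)
    # Convention chosen so that "third or second century BC" gives 299-100.
    return (-(100 * n - 1), -100 * (n - 1))


def parse_date(text):
    """Parse a date phrase into a (start, end) range, or None if unparsable."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    era = match.group("era").upper()
    names = [match.group("a")]
    if match.group("b"):
        names.append(match.group("b"))
    try:
        ranges = [century_range(ORDINALS[name.lower()], era) for name in names]
    except KeyError:
        return None  # an ordinal word this sketch does not know
    return (min(r[0] for r in ranges), max(r[1] for r in ranges))
```

With these definitions, `parse_date("first century AD")` returns `(1, 100)` and `parse_date("third or second century BC")` returns `(-299, -100)`. Strings it cannot handle, such as "INCERTUM", come back as `None` so they can be set aside rather than miscounted.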

At this point, it's time to let the computer do the counting. Here are some results to consider: through 300 AD, the TLG contains over 20 million words, but their chronological distribution is very uneven:

For works dated after or equal to...	... but before	Number of words	Running total
Earliest Greek writing	500 BC	384528	384528
500 BC	400 BC	2251766	2636294
400 BC	300 BC	1762944	4399238
300 BC	200 BC	921255	5320493
200 BC	100 BC	178655	5499148
100 BC	1 AD	1745320	7244468
1 AD	200 AD	7583759	14828227
200 AD	300 AD	5373095	20201322
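The running total column is just a cumulative sum of the per-period counts, which is easy to check. A short Python sketch, with the periods and word counts copied from the table above (the variable and function names are mine):

```python
from itertools import accumulate

# (dated after or equal to, but before, number of words), copied from the table.
ROWS = [
    ("Earliest Greek writing", "500 BC", 384528),
    ("500 BC", "400 BC", 2251766),
    ("400 BC", "300 BC", 1762944),
    ("300 BC", "200 BC", 921255),
    ("200 BC", "100 BC", 178655),
    ("100 BC", "1 AD", 1745320),
    ("1 AD", "200 AD", 7583759),
    ("200 AD", "300 AD", 5373095),
]


def running_totals(rows):
    """Cumulative sum of the word counts, in table order."""
    return list(accumulate(count for _, _, count in rows))


totals = running_totals(ROWS)
for (start, end, count), total in zip(ROWS, totals):
    print(f"{start:>22} to {end:<7} {count:>10,} {total:>11,}")
```

The last value in `totals` is 20201322, matching the final running total in the table.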

Caveats

Roughly 10% of the contents of the TLG E corpus (7680878 words) have dates given as "INCERTUM" or "VARIA": these are completely omitted from the counts. We can't really know how Greek is distributed beyond the period of the TLG E Canon's coverage, because the TLG project no longer makes the Canon available, except through its "one-size-fits-all" interface (or, for answering the questions raised here, "one size fits none"). This is all the more troubling since the TLG's online corpus is now nearly a third again as large as it was in 2000, when the E disk was prepared (by the estimate of the TLG website, 99 million words vs. 76 million words for the E disk).

Wednesday, March 5, 2008

The first thousand years of Greek
