Vitruvian design for scholarship in the humanities

Wednesday, February 8, 2012

Digital scholarship must be technology-agnostic

As smart phones and tablets assume an ever-larger role in browsing the web, “responsive design” has become a hot topic among web designers. How far is it possible to design a single web site that can adapt its display depending on the characteristics of the reading device? Are there times when it’s simply necessary to maintain separate resources for phones vs. large-screen computers?

Designers of digital scholarship face even more demanding requirements. We know that we will replace our digital technologies, but it’s part of our responsibilities to preserve and transmit the scholarly record we work with. Our predecessors have not always set an ideal example for us. The work of Hellenistic scholars of the Iliad like Aristarchus of Samothrace was originally composed for papyrus scrolls. By the time of our earliest complete manuscripts of the Iliad, in the tenth and eleventh centuries, the standard form of “publication” was the codex, or manuscript book. In a large codex, the wide margins offered invitingly convenient space to annotate the Iliadic text with selected notes from earlier scholars, as we see in the famous Venetus A manuscript.

(See interactive version)

As a consequence, virtually all ancient scholarship on the Iliad ceased to be copied as separate texts, and is today known to us only from the snippets preserved in these marginal notes, or scholia. The convenience of this early “hypertext” technology led directly to the loss of important scholarly work.

This illustrates a fundamental and somewhat paradoxical principle that should guide all our work on digital scholarship: it must be technology-agnostic. Well designed digital work will be machine-actionable, but will also be capable of expressing its content when moved to other media, even non-digital media.

One area where we must apply this principle rigorously is in our citation practice. It is tempting to yield to the convenience of using a URL to refer to on-line work: after all, with a URL we can immediately see some kind of response in a web browser.

But this convenience is as dangerous as the medieval scribes’ use of the margins of manuscripts for scholia. URLs are addresses: they will change or vanish; more fundamentally, the web that they point to will ultimately vanish (and, on a time scale that looks back to Aristarchus of Samothrace and other scholars of the library at Alexandria, it will certainly vanish sooner rather than later).

I’ve worked over the past several years with colleagues at the Center for Hellenic Studies to develop a URN notation for citing texts. (Some formal documentation is beginning to appear here.) URNs offer a formally specified notation for referring to some kind of resource, without reference to any particular technology. One of my favorite examples is the ISBN, which can be expressed with URN syntax. Many computer applications work with ISBNs: sales clerks in book stores read them with bar-code scanners, and you can search Amazon or bookfinder.com by ISBN, for example. But until a few years ago, I routinely filled out request forms at my college bookstore by hand-writing ISBNs on a paper form, and they functioned perfectly well in that analog environment.

The Canonical Text Service URN (or CTS URN), like an ISBN, is a formally specified machine-parseable reference, but at the same time a simple text string that can be read by human beings and used outside of a digital environment. I have successfully disseminated URNs using chalk on blackboards, and pen on the back of a napkin. But since a CTS URN is also machine-actionable, it can be passed to a Canonical Text Service to retrieve cited passages of text. When our form of citation is not tied to a specific technology, we are free to imagine previously unforeseen re-uses of that material. Would it be handy if the printed copy of a book you want to carry with you were augmented with URNs represented as QR codes you could point your smart phone at to read a cited text? I don’t know, but it would not be difficult to implement. The QR code at the top of this blog entry represents the CTS URN

urn:cts:greekLit:tlg0012.tlg001:1.1

Here is a link passing the same URN to a Canonical Text Service.
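To make "machine-parseable" concrete, here is a minimal sketch of splitting a CTS URN into its parts. It assumes only what the example above suggests: colon-separated fields, with a dot-separated work identifier and a dot-separated passage reference. The class and function names are hypothetical, and this is not the formal CTS URN specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CtsUrn:
    namespace: str           # e.g. "greekLit"
    work: str                # e.g. "tlg0012.tlg001"
    passage: Optional[str]   # e.g. "1.1"; None when no passage is given


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN on colons (an assumption based on the example URN)."""
    parts = urn.strip().split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace=parts[2], work=parts[3], passage=passage)


parsed = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1")
print(parsed.namespace)           # namespace of the cited collection
print(parsed.work.split("."))     # identifiers within the work component
print(parsed.passage.split("."))  # levels of the passage reference
```

The same string a person can copy from a blackboard is all the input the parser needs, which is the point of keeping the citation independent of any one technology.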

Saturday, February 4, 2012

Ancient Greek is broken

It is 2012, and it is not possible to edit an original document from archaic or classical Greece digitally.

The inscriptions recording the construction of the Parthenon cannot be edited digitally; the Athenian Tribute Lists reflecting the annual payments members of the Delian League made to Athens in the fifth century B.C.E. cannot be edited digitally; votive offerings to Apollo at Delphi, dipinti on classical Greek pottery, graffiti scratched by Greek mercenaries on the colossal statues at Abu Simbel in Egypt — none can be edited digitally.

We are prevented from fully and accurately editing archaic and classical Greek by inadequate or erroneous technical standards defining the representation of languages, writing systems and digital character encodings. Unlike Claude Rains’ famously pretended reaction in Casablanca, I am genuinely shocked that most of the standards keeping us from editing classical Greek have been adopted unmodified from recommendations by professional classicists. (Think about that the next time you want to evaluate the state of digital scholarship in the humanities.)

Each of these three shortcomings is worth discussing separately, so I plan to post more detailed comments on them individually, but here is a brief summary of the problem.

1. Language

A text must identify what languages its content represents. We do that with the language codes of the International Organization for Standardization (ISO). The registration authority for the ongoing work to develop a comprehensive set of three-letter codes for languages is SIL International.

While some language codes are organized in families (so that software can recognize related dialects or languages and process their contents appropriately), archaic and classical Greek are lumped under a single grc code. (This at least is an improvement on the previous ISO 639-2 list of codes, where Mycenaean Greek written in Linear B could not be distinguished from classical Greek!)

We tell students reading Plato that the text is in the Attic dialect, and would not ask them to consider interpretations that are only possible in other dialects. The string τό, for example, might be a form of the relative pronoun in Ionic Greek, but in Attic it can only be the definite article (“the”).

We should treat our software equally kindly, by explicitly encoding the dialectal variant of ancient Greek used in a text.
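A small sketch of why the dialect label matters to software: the same string yields different candidate analyses depending on the dialect it is tagged with. The dialect labels "attic" and "ionic" and the two-entry lexicon are hypothetical illustrations, not registered language codes or a real morphological database.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize so that visually identical accented forms compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical mini-lexicon keyed by (dialect label, normalized form).
ANALYSES = {
    ("attic", nfc("τό")): ["definite article"],
    ("ionic", nfc("τό")): ["definite article", "relative pronoun"],
}


def possible_analyses(form: str, dialect: str) -> list:
    """Return the analyses allowed for a form in the given dialect."""
    return ANALYSES.get((dialect, nfc(form)), [])


print(possible_analyses("τό", "attic"))
print(possible_analyses("τό", "ionic"))
```

With only a single grc label available, a parser has no principled way to rule out the second Ionic reading when it processes Plato.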

2. Writing system

If we are editing an ancient Greek document, we must identify the document’s writing system, since archaic and classical Greek city states used a variety of distinct alphabets. In 403 B.C.E., the Athenians voted to adopt as their official writing system the alphabet used in Ionia, replacing the Attic alphabet they had used up to that time. The language spoken in Athens did not change, but the writing system did.

The Ionian alphabet is the direct ancestor of the modern Greek alphabet. In this alphabet, the letter epsilon represents a short vowel that is contrasted with a long vowel represented by the letter eta. In the classic Attic alphabet, on the other hand, the two sounds that were distinct in the Ionian alphabet were represented by the single letter epsilon. A glyph essentially identical in appearance to the Ionic eta instead represented a consonant, pronounced like a modern English H (or like the “rough breathing” in modern writing of ancient Attic). Any reader (or any computer program) that tries to interpret a text written in the “old Attic” alphabet as though it were written in the modern, Ionic alphabet will fail spectacularly, even though the language is unchanged.

ISO standard 15924 defines codes to identify the writing system of a text. The current version includes no way to distinguish archaic and classical Greek alphabets from the alphabet of modern printed texts.
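The contrast between the two alphabets can be sketched as a lookup from glyph to sound that depends on the writing system. The labels "old-attic" and "ionic" are hypothetical (as noted above, ISO 15924 offers no codes for them), and the table covers only the two glyphs discussed here.

```python
# Glyphs are written as escapes to avoid confusion with Latin look-alikes:
# "\u0395" is GREEK CAPITAL LETTER EPSILON, "\u0397" is GREEK CAPITAL LETTER ETA.
GLYPH_VALUES = {
    "old-attic": {
        "\u0395": ["short e", "long e"],  # one letter for both vowels
        "\u0397": ["h"],                  # a consonant, like English H
    },
    "ionic": {
        "\u0395": ["short e"],
        "\u0397": ["long e"],
    },
}


def read_glyph(glyph: str, alphabet: str) -> list:
    """Return the sounds a glyph can stand for in the given alphabet."""
    return GLYPH_VALUES[alphabet][glyph]


for alphabet in ("old-attic", "ionic"):
    print(alphabet, read_glyph("\u0397", alphabet))
```

A program that is not told which alphabet applies has to guess, and the guess changes a consonant into a vowel.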

3. Digital character set

Once we have identified the language and the writing system of our text, we have to record its contents. The Unicode consortium defines the standard that is by far the most comprehensive and widely supported digital character set today.

Of the sections of the Unicode specification that I have looked at closely, few are as misconceived as the ancient Greek section. I’ll save a fuller catalog of its problems for a separate post, but can briefly contrast one example of the clean design of the Arabic section of Unicode.

In Arabic, a single letter might have distinct forms when written separately, initially, medially or finally. A free-standing letter kaf ك looks quite different from the first letter of the word “book”

كتاب

for example. Software following the Unicode specification can represent all instances of kaf with the same code point: the different letter forms are treated as presentational variants depending on the position of the letter in relation to other letters.

Now use this tool to search the Unicode specification for the term “sigma”. We have two distinct upper-case sigmas, and no fewer than three lower-case sigmas, with a lunate form and a terminal sigma being given distinct code points.

While medial and terminal sigma are, like the different forms of Arabic kaf, contextually determined variant glyphs, lunate sigma is simply a font choice used by editors who do not wish to distinguish a final form of sigma from other forms (often because they are editing fragmentary texts like papyri where it might be difficult to decide where word breaks occur in a handful of isolated letters). In all cases, an editor should be able to encode a simple sigma, and searching or parsing of the digital text would work on any form of sigma, while publishers who preferred the papyrologists’ lunate form of the letter could use a font with that glyph for sigma; publishers preferring a text with the two traditional print forms could use a font with a variant form of terminal sigma.

Because of the false definition of lunate sigma as a distinct character, however, you now have to check manually for lunate forms of sigma versus other forms of sigma if you want to parse or search a text encoded in Unicode Greek. Do you want to do that? Do you want to rely on the authors of your software having to do that?
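As a rough illustration of the extra work this creates, the sketch below folds the variant sigmas onto plain sigma before comparing strings. The code points are those assigned by Unicode (U+03C3 sigma, U+03C2 final sigma, U+03F2 lunate sigma, U+03A3 capital sigma, U+03F9 capital lunate sigma); the function names are hypothetical, and this is one possible workaround rather than a standard procedure.

```python
# Map every variant sigma onto the plain letter of the same case.
SIGMA_FOLD = str.maketrans({
    "\u03c2": "\u03c3",  # final sigma         -> small sigma
    "\u03f2": "\u03c3",  # lunate sigma        -> small sigma
    "\u03f9": "\u03a3",  # capital lunate sigma -> capital sigma
})


def fold_sigma(text: str) -> str:
    """Replace final and lunate sigmas with plain sigma."""
    return text.translate(SIGMA_FOLD)


def find_ignoring_sigma_form(needle: str, haystack: str) -> int:
    """Like str.find, but treats all forms of sigma as the same letter."""
    return fold_sigma(haystack).find(fold_sigma(needle))


print_form = "\u03bb\u03cc\u03b3\u03bf\u03c2"    # logos with final sigma
papyrus_form = "\u03bb\u03cc\u03b3\u03bf\u03f2"  # logos with lunate sigma

print(print_form == papyrus_form)                          # raw comparison
print(fold_sigma(print_form) == fold_sigma(papyrus_form))  # folded comparison
```

Because each replacement is one character for one character, positions returned by the folded search still line up with the original text.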

Solutions?

International standards processes are slow. While it’s reasonable for standards bodies like ISO to rely on the recommendations of professional organizations with expertise in a specific domain, in a field like classics this can be problematic. The American Philological Association is a professional organization often thought to represent the field of classics, but its role in recommendations to international standards bodies like the Unicode consortium, and its complete absence from discussions like the ongoing revision of international language codes, suggest that, because of what I’ve called the recursive arithmetic of tenure, it institutionalizes conventional wisdom and obsolete assumptions, and helps sustain cargo-cult scholarship.

But in recent months we’ve seen example after example of traditional institutions that have been overtaken by motivated groups using the internet to organize. Can we form enough of an on-line community to move better standards through ISO and the Unicode Consortium, in alliance with or independent from existing professional groups?

Friday, February 3, 2012

Unplanned reuse

There’s really only one thing you can do with a book: read it. You can learn from it, cite it or feel that your life has been changed by it, but you can’t directly reuse it (well, apart from making it an accessory piece of furniture, but that doesn’t make use of the contents of the book). One of the distinctive differences of digital scholarship is that, if it is well designed, it can be used for purposes the original author may not have foreseen. The original author may even discover unintended reuse for digital work, as I did recently.

I had been working on an image service using a URN notation to retrieve and view images of the famous Archimedes Palimpsest. Using a URN like

urn:cite:hmt:chsimg.081v–088r_Arch03v_Sinar_pseudo_no-veil

the service lets you do things like

Retrieve a binary image at a given size. This is bifolio 81v–88r at 50 pixels wide.
Retrieve a region of interest. This extracts from the same image a region with a mathematical figure, the construction of Archimedes, Floating Bodies 1.proposition.1.
Open a pannable/zoomable version of the image in a web browser, either with or without a highlighted region of interest. Try these two links to the same bifolio illustrated in the static images above:
1. with no highlighted region
2. including highlighting of the mathematical figure

For a course I taught in English translation, I put together a text service, allowing you to retrieve passages of text by canonical reference. With a URN like this

urn:cts:greekLit:tlg0552.tlg008.chs03:1.proposition.1

the service lets you retrieve archival XML source for a passage. This request gets the XML source for Archimedes, Floating Bodies, postulate 1, which is not necessarily a thing of beauty to the casual reader of Archimedes. But it’s trivial to associate an XSLT stylesheet to format the archival XML for reading in a browser, so here is the same passage associated with a stylesheet for easy reading.

At some point, the penny dropped, and I realized it would also be trivial to mash up the two services. When I started work on the image service, I had not imagined that the digital images of the Greek palimpsest would be of any interest to Greekless readers of Archimedes, but the mathematical figures in the manuscript are extremely important even if you’re reading Thomas Heath’s public-domain English translation.

A minor addition to the XSLT stylesheet uses the markup indicating the presence of canonically identified figures in Heath’s translation to embed references to the image service.

Try this view of book 1, proposition 1, where any reader (Greek scholar or not) now gets to follow the text in Heath’s translation together with images in the only surviving Greek manuscript of Floating Bodies. Images of regions are embedded in the text, and are linked to the zoomable view of the whole bifolio.

Tuesday, January 31, 2012

A checklist for writers

Technology both shapes and reflects our values. What do we value in scholarly writing, and how well do our technological choices match those values?

I look for software that supports four necessary or possible qualities of good scholarly writing:

expository writing should be explicit and unambiguous
the writing process is iterative: good writing only comes from rewriting
academic writing in the natural sciences is often collaborative; this is becoming less rare in the humanities (although not necessarily in the cargo-cult humanities)
born-digital writing should be reusable

In a digital environment, to write explicitly and unambiguously means more than choosing our words well: it also means expressing the structure and contents of our writing explicitly and unambiguously. Our writing should embody the fundamental principle of separating concerns in our digital work: our first goal is to express our ideas clearly, not to exercise our typesetting skills, so we need a format that can explicitly and unambiguously express structure. We might choose an XML-based semantic markup system, or some semantically classed “markdown” system such as Markdown or Textile. What we should not choose is a “word processor.” Even if you can approximate a semantic structure using a carefully chosen set of “styles” (a tell-tale term!), you will be planting your semantic hints in a thick forest of code focused on the particulars of displaying your text visually. Note that it’s perfectly possible to express this irrelevant information using XML formats like OpenDocument. Our question is not “is this an XML format?” but “does this format express the semantics of my document?”

In considering how to support the remaining items in our list, we should look for examples beyond the humanities, since expository prose is not the only form of writing that shares these qualities. In particular, each is characteristic of good composition in computer programming, and computer programmers routinely use software that directly takes account of each of these qualities.

Programmers use version control systems to work with the entire history of a document to update, restore or compare versions. Version control systems also simplify collaboration, and allow multiple contributors to work simultaneously on a document. Changes can be silently integrated and shared; if two authors simultaneously make conflicting changes, version control systems can recognize that, and offer authors options to reconcile conflicts manually. There are many good, freely available version control systems. One reason that humanists are less familiar with them than they should be is that version control systems work best with textual data: the binary formats that word processors produce are a major obstacle to integrating our writing in version control, but once we have adopted a text-based semantic format, that obstacle vanishes, and we have a writer’s desktop that lets us write iteratively and collaboratively.
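To see why plain text suits this kind of tooling, here is a minimal sketch that compares two drafts line by line with Python's standard difflib module. It is not a version control system, only an illustration of the line-based comparison such systems depend on; the draft sentences are invented for the example.

```python
import difflib

# Two invented drafts of the same short passage, one sentence per line.
draft_1 = [
    "The scholia preserve fragments of ancient scholarship.",
    "URLs are convenient.",
]
draft_2 = [
    "The scholia preserve fragments of ancient scholarship.",
    "URLs are convenient but fragile.",
]

# unified_diff reports only what changed between the two versions.
diff = list(
    difflib.unified_diff(
        draft_1, draft_2, fromfile="draft-1", tofile="draft-2", lineterm=""
    )
)
print("\n".join(diff))
```

The same comparison run on two binary word-processor files would show only that the files differ, not what the author changed.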

Programmers also provide a model for reusing our writing. Units of code are often packaged in libraries that other programs use. Programmers working on large projects manage the potentially complex interrelations and dependencies of different libraries and programs using build systems. We are not yet accustomed to thinking about automating the reuse of our writing, but there is no technical obstacle to doing so. We could use build systems to assemble chapters into a book, incorporate common navigational headers into all the pages on a web site, or automatically update an index if one section of a text changes, to name just a few obvious examples.
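The core idea of a build system can be sketched in a few lines: rebuild a target only when one of its sources is newer. The example below assembles chapter files into a book on that rule. The file names and function names are hypothetical, and a real build tool does far more (dependency graphs, parallel builds); this only shows the principle.

```python
import os
import tempfile
from pathlib import Path
from typing import Sequence


def needs_rebuild(target: Path, sources: Sequence[Path]) -> bool:
    """True if the target is missing or older than any of its sources."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(src.stat().st_mtime > target_mtime for src in sources)


def build_book(chapters: Sequence[Path], book: Path) -> bool:
    """Concatenate chapters into the book; return True if work was done."""
    if not needs_rebuild(book, chapters):
        return False
    text = "\n\n".join(ch.read_text(encoding="utf-8") for ch in chapters)
    book.write_text(text, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    chapters = [root / "ch1.txt", root / "ch2.txt"]
    chapters[0].write_text("Chapter one.", encoding="utf-8")
    chapters[1].write_text("Chapter two.", encoding="utf-8")
    book = root / "book.txt"

    first = build_book(chapters, book)   # book is missing, so it is built
    second = build_book(chapters, book)  # nothing changed, so it is skipped

    # Simulate a later edit to chapter two by moving its timestamp forward.
    chapters[1].write_text("Chapter two, revised.", encoding="utf-8")
    later = book.stat().st_mtime + 10
    os.utime(chapters[1], (later, later))

    third = build_book(chapters, book)   # a source is newer, so it rebuilds
    final_text = book.read_text(encoding="utf-8")

print(first, second, third)
```

The same rule extends naturally to the other cases mentioned above, such as regenerating an index only when a section it covers has changed.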

So our checklist of required tools for writers includes:

an editor that works comfortably with semantically structured text
a version control system
a build system

I plan to add a series of posts with the tag writing to look at how we can work with tools like these to write more effectively in a digital setting. Meanwhile, take the checklist to your college or university IT department, and ask what specific software they support for semantic editors, version control systems and build systems. I would love to learn of an academic institution that is not just pressing commercial word processing software on its students and faculty, but I don’t know of one.

Monday, January 30, 2012

The recursive arithmetic of tenure

The long career path from college student to a tenured academic job is designed to be conservative. A student in the humanities who discovers a passion for an academic subject in his or her first year of college can expect that four years of college will be followed by, say, six years of graduate school that not only provide training in a discipline, but initiate the student in its culture. The (increasingly rare) PhD who then immediately walks into a tenure-track job typically faces seven years of scrutiny before a tenure decision. Newly tenured professors have proven that their work meets the professional standards of their colleagues — seventeen years after entering college.

Like many professors, I hope that a college education is a formative experience in the lives of my students. Imagine that the newly tenured professor was inspired, seventeen years ago, by an exciting teacher and scholar. That person of course would have climbed the rungs of the same professional ladder, so the youngest tenured professor who could have inspired today’s youngest tenured professor might in turn have first been inspired as a new college student … 34 years ago.

In 1977, the late Steve Jobs was just starting a company he had formed the previous year to sell the computers he and Steve Wozniak were building in his father’s garage.

We’re trying to cross an ocean by standing at the shore and waiting for continental drift to carry us to the other side.

The humanities-that-must-not-be-named

I’m not thrilled with the term “digital humanities.” When people refer to the “humanities,” I think I know what they mean: those disciplines that are concerned with human activity and everything it produces, and take as their task both to preserve and transmit that culture on the one hand, and to understand and interpret it on the other. But what is the sense of qualifying that noun with the adjective “digital”?

In the twenty-first century, the phrase can’t really stand in opposition to an implied “analog humanities”: no such thing exists. (When was the last time anyone submitted a hand-written or manually typed manuscript to be edited with grease pencil before being manually typeset with hot lead?) “Digital humanities” refers instead to scholarship in the humanities that consciously takes account of the fact that we all work digitally now.

What troubles me is that our use of the marked term “digital humanities” implies that the unmarked term, “humanities,” is being used to refer to scholarship that does not reflect on the media we all work in (a usage that is sadly accurate in the academy today). I am particularly disturbed because I would like to imagine that an education in the humanities encourages the kind of critical self-awareness that would enable us to think more meaningfully about our relation to the environment we live and work in, including our technological environment and the ways it is interwoven with our institutions and values.

By using “digital humanities,” we’re allowing the term “humanities” to stand for an uncritical scholarly practice that is at odds with the goals of a humanistic education.

I can understand why there is not a spontaneous groundswell of support in academic departments around the world for a term meaning “work that unthinkingly perpetuates obsolete forms of scholarly practice,” or “scholarship that is oblivious to the media we use today,” but rather than accept without reservation the marginalizing label “digital humanities,” I’ll offer my own suggestion. We could extend Richard Feynman’s “cargo-cult science” to “cargo-cult scholarship” more generally, and refer to the “cargo-cult humanities.”

Sunday, January 29, 2012

“Digital natives”

I recently attended a workshop at my home institution where I heard teachers confidently assert that today’s students are so adept at technological tasks that we can rely on them to help their older teachers develop important technological skills.

Really?

For more than 15 years, I’ve introduced Classics students at Holy Cross to XML markup. To build on any prior experience they might have, I routinely begin by asking who has ever peeked behind a web page to view its HTML source. Fifteen years ago, I would usually find anywhere from a quarter to a half of the students would say yes. Today, if I ask a group of 20–25 students, I will get one or two “yes” answers.

I do not know if my students were telling me the truth fifteen years ago (or today), but that doesn’t much matter for my present point. Fifteen years ago, far more students either had seen HTML or felt some kind of pressure to pretend that they had.

What does it mean? I suspect that the “digital natives” I teach have indeed grown up so familiar with information technology that they are more oblivious to it than their elders. I worry that they are also incurious, or at least need to learn to be curious about it.

My personal experience makes up only a limited sample, of students in Classics at a small liberal-arts college, but the trend among those students is very clear. Unless someone can show me better evidence, I’ll remain very sceptical about a priori assertions concerning the skills that “digital natives” will confer on their teachers.

Note: new tags

I’m using a couple of new tags on this post: “sceptical” (for obvious reasons), and “yam” [yet another meeting] to help me find posts responding to ideas I’ve gathered from yammering at meetings. I hope to post soon on a couple of additional “sceptical” topics, and several “yam” topics (since January is a big month for meetings in the academic world).