Tuesday, February 19, 2008

Scholarly markup in XML's second decade

XML is now ten years old. (For those interested in an insider's view of how that all happened, Tim Bray has republished XML People.) For scholarly projects involving semantically structured texts, it is practically a given that they will rely on XML.

But in actual practice, texts produced by current projects often don't look very different from scholarship based on SGML in the 1980s. In the next postings on this blog, I want to discuss three suggestions based on my experience with XML over the last decade, and how it contrasts with my experience of SGML in the preceding decade. In each case, I'll focus on how to follow these suggestions using the Text Encoding Initiative's guidelines.


  1. Separation of concerns applies to document content, too. (Now here.)

  2. Citation schemes: empty content elements considered harmful. (Now here.)

  3. What's the diff? Rethinking the critical apparatus.



Stay tuned.

Wednesday, January 9, 2008

Looking for an honest man — on Linux PPC

Peter Heslin's Diogenes 3.1 is extremely cleanly designed, and ultra portable. The server functionality is written in perl, and the new user interface is a XUL application. One result is that Heslin can provide simple binary installations for Mac OS X, various Windows operating systems, and Linux on x86 architecture.

This design also makes it easy to install and run Diogenes on any operating system with perl and a XUL application environment. Using Ubuntu Linux 7.04 on a PPC system, for example, after you download and install the Linux version of Diogenes, you can run Diogenes at least three different ways:


1) use xulrunner to run the graphic interface

If xulrunner is not already installed on your system, use Synaptic or apt-get to install it. (xulrunner is in the Development section of the Ubuntu universe repository.) You can now start Diogenes from a terminal with the command
   xulrunner /usr/local/diogenes/application.ini


Better still, edit the properties for the Diogenes menu item that was created by the Diogenes installer. In the Launcher Properties, enter the command to start xulrunner as illustrated here. Now you can run Diogenes from the menu selection.

2) use Firefox 3 to run the graphic interface

Version 3 of Firefox includes a full XUL environment that can run external XUL programs like Diogenes. Beta version 2 of FF3 was released in December; when a stable release version appears, look for it to show up as an upgrade to Firefox in your Ubuntu repository. When Firefox 3 is installed on your system, you may alternatively start Diogenes with the command
   firefox -app /usr/local/diogenes/application.ini

As with option 1, you can edit the Diogenes menu item to run this command.
Technically inclined users who are eager to play with the beta version can download source code for the beta release, and follow the very clear instructions here to install it. All the prerequisites are standard libraries that are available in Ubuntu repositories.

3) Browse and search texts from the command line

The command line user program (named dio) works just as it does on any other Linux. Run dio with no arguments to see its various options.


The importance of this flexibility is not that it opens up Diogenes to a vast number of Greek scholars using Linux PPC, Solaris, or some other particular operating system. Its importance is rather that it keeps Diogenes open to any platform meeting its simple requirements — including future platforms.

Diogenes on your XO laptop, iPhone, or other device, anyone?

Thursday, December 27, 2007

Open access to federally funded research

Linked from Slashdot today: tucked into the appropriations act just signed by President Bush, a requirement that the NIH must provide online access to research it has funded. This is a tremendous precedent, the first time that the US federal government has made open online access a condition of receiving federal funding for research.

The NIH is the focus, not the NEH, in part because people understand that medical research matters (as the respective budgets of the NIH and NEH also show). But the NIH was also in the Congressional spotlight because of the sustained advocacy of leading scientists, such as the open letter to Congress signed by 25 Nobel laureates in 2004 and by 26 Nobel winners in 2007.

Meanwhile, the American Philological Association, the professional organization that purports to represent classical studies, has inaugurated a multimillion dollar fundraising campaign to establish a "Digital Portal" centered on subscription-based access to a bibliography of print publications.

fuimus Troes, fuit Ilium

Zoom!

In the Ur-web of the early 1990s, images came in fixed sizes. You might get a thumbnail-sized image, a smaller version or a larger version, but generally what appeared in your browser was a full, one-to-one view of a distinct image as it was delivered to you from a Web server.

Today, it's increasingly common for server- and client-side applications to manipulate what is, at least notionally, a single image that a user can navigate through. Google defined the current state of the art in browser-based image navigation when it introduced Google Maps in 2005. Its clever use of AJAX to load adjacent tiles at appropriate scales creates the illusion of continuous navigation of the whole earth.

The same technology can be applied to any image. At University College, London, the Centre for Advanced Spatial Analysis has developed "The Google Maps Image Cutter," an application to generate from any digital image the image tiles required by a Google maps-style web application.
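To give a sense of what such a tile cutter has to produce, here is a minimal sketch of the arithmetic behind a tile pyramid. It assumes 256-pixel square tiles and a halving of resolution at each step down in zoom; the function names are my own illustration and are not taken from the CASA application, which may lay out or pad its tiles differently.

```python
import math

TILE_SIZE = 256  # assumed tile edge in pixels


def max_zoom(width, height, tile_size=TILE_SIZE):
    """Lowest zoom level whose tile grid can hold the image at full resolution."""
    zoom = 0
    # At zoom z the grid is 2**z tiles on a side.
    while tile_size * (2 ** zoom) < max(width, height):
        zoom += 1
    return zoom


def tiles_at_zoom(width, height, zoom, tile_size=TILE_SIZE):
    """(columns, rows) of tiles needed to cover the image at a given zoom level."""
    top = max_zoom(width, height, tile_size)
    scale = 2 ** (top - zoom)  # downsampling factor relative to full resolution
    scaled_w = math.ceil(width / scale)
    scaled_h = math.ceil(height / scale)
    return math.ceil(scaled_w / tile_size), math.ceil(scaled_h / tile_size)


if __name__ == "__main__":
    w, h = 6000, 4000  # a hypothetical high-resolution photograph
    for z in range(max_zoom(w, h) + 1):
        print(z, tiles_at_zoom(w, h, z))
```

The point of the sketch is only that the number of tiles grows roughly fourfold with each zoom level, which is why the client loads just the handful of tiles adjacent to the current view.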

A couple of projects I'm working on apply this technique to browse images that cannot be displayed in full detail in a single view because of their high resolution or awkward shape. The Center for Hellenic Studies' Homer Multitext Project has Google-mapped high-resolution photographs of Iliadic manuscripts. I've recently Google-mapped drawings and photographs of several dozen inscriptions in the Lycian language.

This is an easily implemented and effective way to let users explore an image. It comes at the cost of one tiny little white lie: we have to pretend to Google that the coordinate space of our rectangular image works like a Mercator projection of a spheroid (the earth).
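The white lie can be made concrete with a short sketch. It uses the standard spherical Mercator formulas with a 256-unit world square at zoom 0, and it assumes (my choice for illustration, not necessarily what either project does) that the image is anchored at the top-left corner of that square and scaled so that its longer side spans the full 256 units. The function names are hypothetical.

```python
import math

WORLD_SIZE = 256.0  # side of the zoom-0 world square, in world units


def pixel_to_latlng(px, py, image_width, image_height):
    """Pretend an image pixel is a point on the earth (spherical Mercator)."""
    # Assumption: longest image side spans the whole world square.
    scale = WORLD_SIZE / max(image_width, image_height)
    wx, wy = px * scale, py * scale
    lng = wx / WORLD_SIZE * 360.0 - 180.0
    lat = math.degrees(
        math.atan(math.sinh(math.pi * (1.0 - 2.0 * wy / WORLD_SIZE)))
    )
    return lat, lng


def latlng_to_pixel(lat, lng, image_width, image_height):
    """Inverse mapping: recover the image pixel from the pretend coordinates."""
    scale = max(image_width, image_height) / WORLD_SIZE
    wx = (lng + 180.0) / 360.0 * WORLD_SIZE
    sin_lat = math.sin(math.radians(lat))
    wy = (0.5 - math.log((1.0 + sin_lat) / (1.0 - sin_lat)) / (4.0 * math.pi)) * WORLD_SIZE
    return wx * scale, wy * scale


if __name__ == "__main__":
    # The centre of a square image lands on latitude 0, longitude 0.
    print(pixel_to_latlng(500, 500, 1000, 1000))
```

Note that equal vertical steps in the image do not correspond to equal steps in latitude, which is one reason a "latitude and longitude" is a poor way to cite a region of a manuscript photograph.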

This is innocent enough, if we recognize what we're doing, but it should provoke more serious reflection about how we use images and cite them in scholarly work. We need to define recognizable ways of referring to parts of an image independently of the state of a user's panning and zooming. I'll post more on that topic before long. For now, enjoy the pictures.

Tuesday, December 11, 2007

Vingt ans après

Tonight, several of the Perseus project's original musketeers are gathering to observe the twentieth anniversary of the grant proposal to the Annenberg Foundation that jump-started the project. I'm sure that gray hair, sagging waistlines and altered career paths will prompt private reflections, but here's the fact that grabs me now: the Perseus project is older than three quarters of the undergraduates I teach.

My current students were still toddlers when the first public version of Perseus was released on CD. I doubt any of them have heard of, much less remember, Apple's HyperCard; it will be hard for them to imagine how exciting it was when a hypertext system first became available on personal computers.

They were learning to read or just beginning elementary school when Perseus made its astonishingly rapid transition to a Web delivery system. They probably are unaware that the internet was not always open to commercial use, and have little experience that would help them appreciate the importance of design decisions early in the history of Perseus. Can they grasp how the choice of SGML for markup of texts made it possible to generate both HyperCard stacks and Web pages from a single source?

Now they are in college, and the Perseus project has open-sourced both its code and key data including all its ancient texts (as I observed on Thanksgiving). Will they understand how this opens up to them unprecedented opportunities to build on the work of their predecessors, or have we conditioned them to see themselves only as passive consumers?

Are we raising up a new generation to join in the hard work ahead of us? All for one, and one for all!

Saturday, November 24, 2007

Thanksgiving

If you study ancient Greek, you can be thankful in 2007. This fall, two of our discipline's most important scholarly instruments have gone through extraordinary metamorphoses. First, Peter Heslin released version 3 of Diogenes (http://www.dur.ac.uk/p.j.heslin/Software/Diogenes/); then this month, the Perseus project (http://www.perseus.tufts.edu/hopper) announced that source code and text data are being made available under open licenses.

Diogenes now directly integrates automated morphological analyses of ancient Greek from the Perseus project's morphological parser. The Perseus project's new open licenses guarantee that Peter Heslin will not be the last scholar to draw on the rich resources created at Perseus over the past two decades.

Perhaps these developments would be unremarkable in disciplines where contributions through collaborative work and critical assessment of evidence are valued more highly than career advancement. In the humanities, they stand out against a bleak landscape of subscription services and other forms of restrictions on access to scholarly work.

Taken together, Diogenes and Perseus illustrate the kind of cross-pollination that is possible when reuse of digital scholarly works is not outlawed. If enough classicists notice, we may have more good Thanksgivings ahead of us.

Thursday, November 1, 2007

Remembering Ted Brunner

This summer I read the Washington Post's lengthy obituary of Ted Brunner. Few classical scholars are made the subject of so many column inches in a national paper, so I was surprised this fall to discover that none of my Classics students knew who Ted Brunner was. The same quite serious majors who recognized the authors of eminently forgettable footnotes on Greek or Latin texts had apparently never heard of the director of one of the later twentieth century's most influential digital projects in the humanities. We classicists really have a lot of teaching to undo.

I leave it to others who knew Ted better than I to eulogize or analyze him. I offer only two observations from first-hand experience.

First, he remained always relentlessly focussed on data. The TLG was not about producing software: if you wanted software, Ted's attitude was that you should write your own. (Dinosaurs like myself will recall how far he could take this position. In the early years of the TLG, the project's at best arcane, in many ways bizarre data formats were almost aggressively undocumented: you got a nine-track tape, and if you wanted to understand the data, you were welcome to reverse-engineer the format as best you could.) In his own way, Ted Brunner was an early advocate of separation of concerns, and his view has been validated by the range of software developed over the past two decades for using the TLG's data. Most recently, Peter Heslin's release of version 3.x of Diogenes is a stunning piece of work (and deserves far more recognition than it has received). It integrates the TLG data with output from the Perseus project's morphological parser, a piece of software that in turn would probably never have been developed if the TLG had not existed. What a pity that since Ted's retirement the TLG has turned its back on this principle, and permits access to material digitized since 2000 only through its own, one-size-fits-all web interface.

Second, however sharply he could react to people he saw as threatening the TLG's work, he was extremely generous with his time to anyone interested in the TLG, no matter how unimportant. When I was a very lowly graduate student at Berkeley, I had a chance to visit the TLG project at Irvine, and Ted set aside an entire morning to give me a personal tour and answer my questions. (I am sure that I am not the only visitor to the TLG to come away with a vivid memory of Ted starting the standard pre-recorded TLG slide show and proudly pointing out that the narrator's incredible bass voice was none other than the voice of Tony the Tiger.)

So two small points: he focused on his data, and was generous to people who could not obviously or immediately help him.

I hope someone could remember as much about me after reading my obituary.

The Feast of All Saints, 2007.