Sunday, August 26, 2012

Mark[up|down]


People sure hate the pointy brackets.  I've been writing markup since version 2 of  the Text Encoding Initiative Guidelines in SGML in the 1980s, and as easy as modern tools like oXygen make it today, even I'm not crazy about it.  Over the past year or so, I've looked at every "markdown" alternative to markup that I could find:  to name a few, textile, markdown, multimarkdown, reStructuredText, and more wiki languages than I can list.

All of them seem to have a similar history: somebody wanted a quicker and easier way to express HTML, cooked up a tool to convert some simple "markdown" conventions to HTML, and realized, "Hey, this is useful!" As a result, all of the markdown languages share two main drawbacks. The first is that they generally express only the semantics of HTML, or a subset of HTML. That rules them out for any writing or editing that requires richer semantics such as an XML language could supply. The second, more fundamental drawback is a long-recognized limitation: they're not specified. It should be obvious that "Whatever my converter tool handles" is NOT a specification, but that's the state of most of the markdown schemes. (See these comments from more than a decade ago about reStructuredText!)

For those cases where I really just want a quick and easy way to bang out HTML-like content, John Gruber's markdown seems to offer the best compromise. First, if you really need some particular piece of HTML beyond what markdown offers, you can just embed it in your text (although at the price of reintroducing some pointy brackets). But what really persuades me is the pegdown processor. Pegdown uses parboiled's "parsing expression grammars" (PEGs), so it comes closer to a separately specified definition of the language than a code library full of regular expressions emitting some kind of converted text. Pegdown will give you an abstract syntax tree for your markdown content, which makes me feel much more confident using markdown from code I write.

Add to that the ever-growing number of editors and other tools that support markdown in all kinds of contexts, and I'm converted. So was the text of this post: from markdown to HTML.

Tuesday, July 31, 2012

In a small discipline, proxy repositories

Software builds on other software. With a build system like gradle, once you declare how your code depends on other code, the build system checks your declaration against the repositories you have listed and downloads the appropriate packages as they are needed. If you are coding in a JVM language, you can find an enormous proportion of the libraries you might want on maven central, either directly or via a proxy.

But if you routinely work with ancient Greek, or in any similarly specialized domain, the situation is different. Hugh Cayless' epidoc transcoder package is indispensable for my routine work, for example, but for a few minutes yesterday, the one repository where it's regularly hosted was down. I was paralyzed.

The solution is as easy as it is obvious:  smaller communities, like those interested in ancient Greek, need to ensure that the collections of material they depend on are proxied and available from multiple repositories.

I'm using Nexus to host material developed for the Homer Multitext project, and yesterday configured it to proxy dev.papyri.info/maven, where the epidoc transcoder is housed.  The unified front to all the material hosted and proxied there is http://beta.hpcc.uh.edu/nexus/content/groups/public/.

Nexus is a "lazy" proxy: it acquires a local copy of a proxied package only when that package is actually requested. One way to guarantee that your favorite proxying site has all the packages you want is a minimal build that declares dependencies on everything you might want and then simply lists their names. The example below is a gradle build that does just this. The repository URL and version strings for packages are kept in a separate properties file, but this example is otherwise complete: running the showAll task will force the proxy server to retrieve any packages it does not already have locally stored.

repositories {
    maven {
        url "${repositoryUrl}"
    }
}
configurations {
    classics
}
dependencies {
    classics group: 'edu.harvard.chs', name : 'cite', version: "${citeVersion}"
    classics group: 'edu.harvard.chs', name : 'greekutils', version: "${greekUtilsVersion}"
    classics group: 'edu.holycross.shot', name : 'hocuspocus', version: "${hocusPocusVersion}"
    classics group : 'edu.unc.epidoc', name: 'transcoder', version : "${transcoderVersion}"
}
task showAll  {
    description = "Downloads and shows a list of useful code libraries for classicists."
    doLast {
        println "Downloaded artifacts (including transitive dependencies):"
        configurations.classics.files.each { file ->
            println file.name
        }
    }
}
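
For completeness, here is a rough sketch of what the separate properties file might look like, assuming it is a gradle.properties file sitting next to the build script (gradle reads that file automatically and exposes each key as a project property). The key names are the ones the build above refers to, and the repository URL is the Nexus group mentioned earlier. The version strings are only placeholders, not the real version numbers of these packages.

```properties
# gradle.properties (assumed file name and location: same directory as the build script)

# Proxying repository to exercise: the Nexus group URL given above.
repositoryUrl=http://beta.hpcc.uh.edu/nexus/content/groups/public/

# Placeholder version strings: substitute the versions you actually want proxied.
citeVersion=0.0.0
greekUtilsVersion=0.0.0
hocusPocusVersion=0.0.0
transcoderVersion=0.0.0
```

With both files in place, running `gradle showAll` from that directory should resolve the classics configuration through the proxy and print the name of each downloaded file.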








Monday, July 9, 2012

"Abolish the journals"

I'm appearing on a panel next spring on the subject of "publishing" at the Classical Association of the Middle West and South. Would it be too much to suggest that Walter Olson's critique of law reviews applies equally well to academic journals in the humanities?

Olson quotes Harold Havighurst:

"Whereas most periodicals are published primarily in order that they may be read, the law reviews are published primarily in order that they may be written."

Sounds pretty much like the academic journals I'm familiar with in classics.

(H/T: groklaw news picks for the link to Olson's blog.)

Thursday, July 5, 2012

CC licenses for photography of manuscripts

If you're interested in manuscripts of Greek and Latin texts, this week saw a seismic shift in the scholarly landscape.  The e-codices project, which has been putting high-quality digital images of manuscripts in Switzerland on the web for several years, has now standardized on a Creative Commons license for all of its images.


In this decision, they are following the lead of a growing number of projects and institutions. I greatly admire the similar work Will Noel has done at the Walters Art Gallery, where high-resolution photography of more than 250 manuscripts is online, available under a CC license.

Photographed manuscripts now in e-codices number more than 900.  Like the Digital Walters Art Gallery, manuscript photography in e-codices is accompanied by a scholarly catalog entry.

The digital archivists are doing their job.  Now the only question is whether we can find the scholars of Greek, Latin and other languages to read these beautifully documented texts.  

Sunday, June 3, 2012

Who owns Plato?

I attended the workshop "édition des textes et recherche interdisciplinaire" at the École Normale Supérieure last week.  As I mentioned in a preceding post,  I'd been thinking about Eben Moglen's talk "Innovation under Austerity," and since I expected that introducing Moglen's argument might be a bit provocative for the traditional audience I expected at the ENS, I cleverly thought I would win them over, or at least delay their criticism, by paraphrasing one of Moglen's memorable soundbites:  "No one owns Plato."

Not so clever.  Apparently, when you gather in the august Salle des Actes at ENS, you can meet people who believe they do own Plato, and don't care to share with others who fall short of their standards, thank you very much.

(In the foreground, keynote speaker Gregory Crane, director of the Perseus Project, defensively photographs the photographer; partially masked by the screen are the plaques on the walls of the Salle bearing the names of such distinguished scholars in many fields as Louis Pasteur and Fustel de Coulanges.)

Just for fun, I googled the phrase "plato download":  as the screen grab illustrates, google estimated something over 17 million hits for that phrase, including texts in Greek and translation in a variety of languages, podcasts and ebooks (as well as downloads of software packages named after the son of Ariston).  I also found the Wikipedia article on Ruhollah Khomeini noting that Khomeini considered Plato's views "in the field of divinity" to be "grave and solid".   (Since some of the would-be owners of Plato also object to Wikipedia, I can pass along its reference to Kashful-Asrar, p. 33 as the source of that assertion.)

So while I can appreciate highly theorized concerns about the preparation needed to appreciate Plato "properly", the Anglo-Saxon empiricist in me looks at these Google search results and still wonders — just who exactly owns Plato?

Let them hack: Eben Moglen on "disintermediation"

If you have not yet heard Eben Moglen's talk from last week's "Freedom to Connect" conference, with the title "Innovation under Austerity," it's worth listening to every minute of this audio recording including the Q&A session.

I had it on my iPod as I travelled to a conference to show off work students at Holy Cross have done over the past year for the Homer Multitext project, and was struck by how much of Moglen's main thesis is applicable to digital scholarship. He almost implied that innovation naturally happens under conditions of austerity; he unambiguously argued that the best way to promote innovation is to let young people hack on real problems, and get out of the way.

In the Homer Multitext project, we're learning how to let young people hack on real problems reading unpublished or incompletely published manuscripts. This is not an easy lesson to grasp if your traditional training, like mine, has conditioned you to believe that this kind of work is granted only to the most senior and experienced scholars who have earned the privilege of access to real problems. "Disintermediation," to use the jargon-term quoted by Moglen, may not look appealing to those of us, like museum curators or professors, who have been doing the mediating between young people and real research problems in the humanities. But in the audio recording of Moglen's talk, I think I can hear a little of the excitement I feel every single day I work with and learn from my 18- to 21-year-old colleagues on the Homer Multitext project.

pull; update


Les Arènes de Lutèce (the Roman arena of Paris) is not much of an archaeological site, but it's a lovely French park, surprisingly peaceful despite its location in the bustling 5e arrondissement.   A group of eight or ten men and women, mostly of a certain age, is silently practicing Tai Chi behind me;  opposite us, French school children are clambering over every visible surface and cheerfully pushing, shouting and generally attempting to terrorize each other.  This is not Worcester, Massachusetts.

When I last sat here to soak in the sun more than 20 years ago, the scene was visually and aurally identical, but today I have on my lap a computer that weighs less than a kilo, connected to the internet because public parks in Paris give you two hours of free wifi. The seven busy researchers in the St. Isidore research lab at Holy Cross all use mercurial for version control of their work, so I've run hg pull; hg update, and have seen every change they've committed in the day or so since I last had time to look.

Juxtaposing geographic distance with the immediacy of electronic contact may seem like a pretty tired cliché in 2012, but working step-by-step through the progress of a team thousands of kilometers away makes me realize how little we've thought about a fundamental question:  how do we make our research reproducible?   Version control systems like mercurial or git are one important part of the technological puzzle, but they don't by themselves tell us how to organize our material or working practice so that others can easily replicate our work as fully automatically as possible.

I'm introducing a new tag "RR" for the theme of "reproducible research" since I think that is arguably the biggest overarching challenge of architectural design in digital scholarship today.