Current Research


Much work is on-going since this page was last updated!

Of particular note:

- the Perseus Catalog, a foundational resource that provides the basis for all of our research, created and maintained here at Tufts;

- the Perseids Project (note the spelling!), a collaborative editing platform for source documents on which users can create micropublications consisting of transcriptions, translations, linguistic annotations and commentaries of and on a variety of ancient source documents;

- work under the Humboldt Chair of Digital Humanities at the University of Leipzig that includes the second release of The Ancient Greek and Latin Dependency Treebank, a project that began at Tufts and has continued at Leipzig.

For more information on these projects, see:

Research in 2008/09

The following projects cluster around a number of themes:

Enabling undergraduate research: Nothing in our view offers more benefits to classics in particular and the humanities in general than our ability to make it the norm for our students to contribute early and often in tangible ways, large and small, to the field. We can in this way make good on our promise to produce active citizens who expect to contribute to their world. The print infrastructure of classics had in the twentieth century grown so mature and so cumbersome that even the most advanced undergraduates in the most demanding programs could not expect that they would, as a matter of course, conduct meaningful research or contribute in any tangible, if small, way to the field. As we build a new digital infrastructure for Classics in particular and for the Humanities in general, the situation is now completely different. We can now provide our students with opportunities to begin contributing in small but tangible ways at a very early stage, to disseminate those digital contributions far more widely than any print publications, to allow many contributions to be reused in novel ways to support additional new scholarship and to put those contributions in an infrastructure designed to preserve them along with scientific data sets on which human civilization depends.

The section on research opportunities suggests specific opportunities for students and classes but we encourage students and faculty to suggest ways to contribute to these active research projects or in any other way.

At least three factors allow us to rethink the possibilities for undergraduate research.
- First, students have more and better access to primary sources than was ever available in print. We already have the tools whereby we can carry much further the visions behind publications such as the Loeb Classical Library and Budé Editions, providing a range of background and translation support. Scholars can now also include full citations for the primary sources behind their statements, knowing that electronic publications do not have the mechanical space limitations of print and that even primary sources previously available only in research libraries are or will soon be available to the world and will contain links to basic background information, such as Thomas Martin's Historical Overview in Perseus and Christopher Blackwell's Demos.
- Second, we simply cannot do all the work that needs doing if we only rely upon professional scholars and automated systems - we need to enlist our students in this task. The projects outlined here offer a wide range of tasks well within the range of supervised students with various levels of Greek and Latin. Projects such as the Center for Hellenic Studies Homer Multitext have students transcribing scholia and readings from the 10th century Venetus A that have never found their way into print and that are now visible to anyone who downloads the newly created high resolution scans of the manuscript. Student-centered and driven annotations can provide a new generation of commentaries that address the actual problems that readers confront as they struggle with linguistically or culturally challenging texts. Students or classes might systematically review and revise the entries for people and places and realia in digitized versions of the old Smith's encyclopedias.
- Third, wholly new scholarly instruments are now becoming available that open up new avenues of research. We have studied Greek and Latin for millennia but Treebanks (see below) allow us to place our ideas about Greek and Latin lexicography, linguistics and style on a quantifiable and explicit foundation. We urge students to adopt particular authors or works, adding new syntactic data to the larger treebanks and then using that data to conduct original research. We have, for example, already published a Treebank that includes Sallust's Catiline among other samples of Latin. Students could begin now comparing Sallust with the other samples while taking on the task of adding the Jugurtha and fragments.
2500 years later in 2010: the world that Marathon made: Some of the efforts outlined below are inherently broad, others much more focused on particular texts or problems. As a general theme for the coming year, however, we have chosen to focus our efforts on the Battle of Marathon in particular and the world that it helped create in general. Our ultimate goal is to prepare for a conference to commemorate Marathon 2500 years afterwards, in the late summer of 2010. We thus will, where possible, focus collection development on resources that allow us to better address this topic. The topic is, however, a very broad one and includes not only all of the conventional classical Greek period but major elements of Roman history. The topic also invites participation from scholars in contemporary Iran and raises the general topic of classical studies and its ancient ties with not only the geographic Middle East but Islamic scholarship as well.
A comprehensive, open source, fourth-generation library of Greek and Latin editions: The first digital texts contained transcription with markup representing the page-layout of the print source (e.g., representing that a word is in italics). A second generation of collections (of which the primary sources in Perseus provide one example) began to add semantic markup (e.g., representing that a word is in italics because it is a Latin quotation). Collections such as the Making of America and JSTOR then demonstrated a third, much larger generation of collections where readers search text automatically generated by Optical Character Recognition (OCR) software and basic library cataloging data, and then view scanned page images of the source. Where second generation collections extend the scope of first generation collections by adding semantic markup to carefully transcribed text, third-generation collections reverse production philosophies, emphasizing automation and scalability over the artisanal techniques of first and second-generation collections.

Third-generation collections focus on the quality of the page images and associated meta-data and assume that the automatically generated text will improve with each new generation of OCR software. In the 1970s and 1980s, when the first- and second-generation collections emerged, scanning and storage technology made libraries of scanned page images impractical - the several hundred megabytes of transcribed Greek from the Thesaurus Linguae Graecae (TLG) required special disk drives that cost tens of thousands of dollars. Texts had to be transcribed well enough to stand on their own and readers would have to rely upon printed copies to identify errors and to find the textual notes, introductions, indices, appendices and other scholarly apparatus. At least one major library has informally reported that it would not accept first- or second-generation collections unless they came with digital page images aligned to the transcribed texts.

Fourth-generation collections integrate not only carefully transcribed text and the original page images but also other forms of annotation (e.g., morphological and syntactic analysis, indices of people and places, markup for the particular sense of particular words in context).

The first fourth-generation texts became available at least as early as 2006, when Perseus aligned manually produced, TEI-compliant editions to page images that it scanned in-house. In 2007, Perseus tested the Open Content Alliance (OCA) workflow, in which scholars can pay to scan selected books from OCA partner libraries, and as a result a number of scholarly materials, including not only Greek and Latin but Syriac, Sanskrit and Old Norse, have become available for download from the OCA. In 2008-09, Perseus is creating a fourth-generation collection that includes:
- Expanded TEI-compliant XML transcriptions of Greek and Latin primary sources within Perseus.
- An open source collection of image-books representing at least one (and where possible more than one) edition of every classical Greek and Latin author within the OCA.
- Cataloging data in XML MODS and MADS format that is modeled after the Functional Requirements for Bibliographic Records (FRBR) to represent multiple editions, translations, commentaries, indices and other scholarly data. This catalogue is designed to provide the detail now offered by discipline-specific checklists of single-editions (such as the Greek works and authors in the printed Liddell Scott Jones Lexicon and the on-line TLG Canon) within an extensible, standards-compliant library infrastructure.
- Metadata to support access by book/chapter/section/verse or other conventional scholarly citations under the Canonical Text Services (CTS) Protocol. This metadata would make it possible to generate from a textual citation a dynamic link into electronic page images and/or XML-transcriptions.
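The citation-to-link step described in the last item can be sketched as follows. This is a minimal illustration only: it borrows the general `urn:cts:namespace:work:passage` layout, and the namespace `example`, the work identifier, the page index and the image file name are all hypothetical placeholders, not actual Perseus or CTS data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    work: str     # work identifier (hypothetical)
    passage: str  # dot-separated reference, e.g. "1.86" for book 1, chapter 86


# Hypothetical index from (work, passage) to a scanned page image.
PAGE_INDEX = {
    ("thucydides.history", "1.86"): "thuc_vol1_page_0112.jpg",
}


def to_urn(citation: Citation, namespace: str = "example") -> str:
    """Serialize a citation in a CTS-style URN layout."""
    return f"urn:cts:{namespace}:{citation.work}:{citation.passage}"


def parse_urn(urn: str) -> Citation:
    """Recover work and passage from a URN produced by to_urn."""
    parts = urn.split(":")
    if len(parts) != 5 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS-style URN: {urn!r}")
    return Citation(work=parts[3], passage=parts[4])


def resolve(citation: Citation) -> Optional[str]:
    """Return the page image for a citation, or None if it is not indexed."""
    return PAGE_INDEX.get((citation.work, citation.passage))
```

The same lookup could point at an XML transcription instead of a page image; only the values stored in the index would change.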
Focused collections on selected Greek and Latin authors: To complement the general collection development and scalable services we are choosing a small number of authors on which to focus particular attention. For these authors, we will collect more editions and associated publications (especially commentaries, indices, specialized lexica), with targeted creation of TEI-compliant XML transcriptions. We will focus upon Herodotus, Aeschylus and Thucydides to illustrate classical Greece and the world that Marathon made. We also have a major commitment to Homer that reflects work already begun at Perseus and collaborations with projects such as the Homer Multitext Project of Harvard's Center for Hellenic Studies. On the Roman side, we will concentrate on Sallust and Propertius, whose corpora are small enough for close study and for which we can, for example, provide comprehensive Treebanks (see below), and on Livy and Cicero, whose corpora are large enough to demand automated methods.
Scalable methods to identify, transcribe and automatically tag Greek and Latin: These services include not only optimized OCR but algorithms that compare the OCR output from different editions of the same work to distinguish text from headers, textual notes and marginalia, and to distinguish OCR errors in the text from intentional editorial variations. The immediate goal is to create a searchable collection of Greek and Latin that provides better scholarly recall than the manually produced collections on which scholars have traditionally relied: about 8% of the unique Greek and Latin words on a given page from any standard edition appear only in the textual notes (in series such as the Loeb Classical Library, which traditionally restrict readings to a minimum, this figure remains c. 4%). Curated collections that contain perfect transcriptions but only the reconstructed text can therefore deliver only 92-96% of the words that the editor chose to print. OCR-generated text can already deliver 98-99% of the words from printed Greek and thus immediately provide better recall than perfect transcriptions. The result returned to the reader is, in addition, an image of the full printed edition.
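The recall comparison above is simple arithmetic, worked through here with the figures quoted in the text. The function names are illustrative, and the sketch assumes that the OCR figure applies to the whole printed page, notes included.

```python
def curated_recall_percent(notes_only_percent: int) -> int:
    """Percent of printed unique words reachable when only the reconstructed
    text is transcribed (perfectly) and the textual notes are left out."""
    return 100 - notes_only_percent


def ocr_has_better_recall(ocr_percent: int, notes_only_percent: int) -> bool:
    """True when OCR of the full page recovers more of the printed words
    than a perfect transcription of the reconstructed text alone."""
    return ocr_percent > curated_recall_percent(notes_only_percent)


# Figures from the text: 8% of unique words appear only in the notes of a
# standard edition, about 4% in a series with minimal notes; OCR reaches 98-99%.
standard_edition = curated_recall_percent(8)   # 92
minimal_notes = curated_recall_percent(4)      # 96
```

Even the lower OCR figure of 98% exceeds the 96% ceiling of a note-free transcription, which is the point the paragraph makes.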
Fragmentary authors: Humanists have been working with digital texts for a generation but we have in these first decades focused our efforts upon the large body of texts that survive more or less intact. Most of the works written in antiquity are, however, lost - less than 10% of the works of Aeschylus, Euripides and Sophocles, for example, survive. Most classical authors exist, therefore, in a fragmentary state. In some cases, these texts are scraps of papyrus that survived in the sands of Egypt and are literally fragments. In most cases, however, our surviving fragments are, in fact, passages where surviving authors quote, summarize or simply allude to authors and works that have not survived. Print editions of fragmentary authors typically print excerpts about a fragmentary author along with various categories of scholarly apparatus (the editor's commentary, a translation, variant readings etc.). In a digital world, such fragmentary editions should contain dynamic links that point to editions of the quoting source. The comprehensive collection of Greek and Latin source texts, with scanned page images and searchable OCR-generated text for all, and carefully transcribed TEI-compliant XML for some, gives us the foundation on which we can build the dynamic, hypertextual editions of fragmentary authors.

In 2008-09 we will begin work on the Greek fragmentary historians, using Müller's Fragmenta Historicorum Graecorum as a starting point. The output of this work will be both an initial edition of Greek fragmentary historians and the methods by which we represent pointers into source works and associated scholarly annotation. We will create a broad first pass at a comprehensive database of fragments for all Greek authors, but we will focus particular attention on those authors most relevant to the theme of the world that Marathon made.
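One possible shape for such pointers is sketched here. The field names, the three kinds of fragment and the sample identifiers are assumptions made for illustration; the project's actual representation is not specified in this text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Fragment:
    lost_author: str     # author whose work does not survive
    number: int          # fragment number within an edition
    kind: str            # "quotation", "summary" or "allusion"
    source_work: str     # identifier of the surviving, quoting work
    source_passage: str  # citation within the quoting work


def by_source(fragments: List[Fragment]) -> Dict[Tuple[str, str], List[Fragment]]:
    """Group fragments by the passage of the surviving work that preserves them,
    so that a reader of the quoting source can see what it quotes."""
    index = defaultdict(list)
    for fragment in fragments:
        index[(fragment.source_work, fragment.source_passage)].append(fragment)
    return dict(index)


# Placeholder records; none of these identifiers refers to a real edition.
SAMPLE = [
    Fragment("lost-author-a", 1, "quotation", "quoting-work-1", "2.14"),
    Fragment("lost-author-b", 7, "allusion", "quoting-work-1", "2.14"),
    Fragment("lost-author-a", 2, "summary", "quoting-work-2", "5.3"),
]
```

Because each record points into the quoting source rather than copying its text, the fragment edition and the source edition can be corrected independently.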
From human-readable information to machine actionable knowledge: If a lexicon includes an entry such as "insula, -ae, f.," students of Latin can recognize this as a statement that there is a first declension feminine Latin noun with stem insul- and endings that yield nominative singular insula, genitive plural insularum etc. A machine can generate and recognize forms of this noun but it needs the information about stems and endings in a format that it can process. Commentaries contain information about particular passages that machines can likewise use, if we can represent the commentary entries in a format that machines can recognize. Encyclopedias contain many statements about birth and death dates, offices held ("X consul in Y"), kinship (e.g., X son of Y), and other propositional statements. While we cannot carefully transcribe every book about classics from our print libraries, a relatively constrained number of reference books contain a large body of information that could, if converted into a machine actionable format, drive a range of services. Every funded project on which we are working depends upon the conversion of some part of the print infrastructure into such machine actionable knowledge bases. We are therefore preparing to convert a range of such print resources into structured, machine-actionable form, including lexica, grammars, commentaries, editions of surviving texts and editions of fragmentary authors.
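The lexicon example can be made concrete with a short sketch that turns the entry "insula, -ae, f." into a stem, generates forms and recognizes them. It is a simplification under stated assumptions: it handles only first declension nouns, ignores vowel length, and omits the vocative and locative; the function names are illustrative.

```python
from typing import Dict, List, Tuple

Slot = Tuple[str, str]  # (case, number)

# First declension endings, vowel length ignored.
ENDINGS: Dict[Slot, str] = {
    ("nominative", "singular"): "a",
    ("genitive", "singular"): "ae",
    ("dative", "singular"): "ae",
    ("accusative", "singular"): "am",
    ("ablative", "singular"): "a",
    ("nominative", "plural"): "ae",
    ("genitive", "plural"): "arum",
    ("dative", "plural"): "is",
    ("accusative", "plural"): "as",
    ("ablative", "plural"): "is",
}


def parse_entry(entry: str) -> Tuple[str, str]:
    """Turn a print-style entry such as 'insula, -ae, f.' into (stem, gender)."""
    parts = [p.strip() for p in entry.strip().rstrip(".").split(",")]
    if len(parts) != 3:
        raise ValueError(f"unexpected entry format: {entry!r}")
    headword, genitive, gender = parts
    if not (headword.endswith("a") and genitive == "-ae"):
        raise ValueError("only first declension entries are handled in this sketch")
    return headword[:-1], gender


def generate(stem: str) -> Dict[Slot, str]:
    """Produce every form of the noun from its stem."""
    return {slot: stem + ending for slot, ending in ENDINGS.items()}


def analyze(form: str, stem: str) -> List[Slot]:
    """List every (case, number) reading that matches a given form."""
    return sorted(slot for slot, f in generate(stem).items() if f == form)
```

Note that `analyze("insulae", "insul")` returns three readings; resolving that ambiguity in context is exactly the kind of task for which the syntactic annotation discussed later is needed.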
Born-digital knowledge bases: While print reference works contain a great deal of information that can be converted into machine actionable form, they cannot provide all of the data that we need to drive some of the services that are most promising for humanists.
- First, information available in print format does not always lend itself to automatic extraction - in the general case, the automatic analysis of full text is an unsolved problem. An encyclopedia or dictionary entry may contain propositional statements that automated systems could use but that we cannot extract from the text. Critical editions contain a wealth of statements about how one version of a text differs from various others but these print annotations are hard for automatic systems to decode.
- Second, our printed reference works leave out information that their authors collected and which automated systems need. The authors of lexica, for example, often have space to print only a selection of the passages that they have sorted into distinct word senses - their sorted slips of paper or file cards contained the wealth of training examples on which machine learning thrives but these are lost or available only as archival materials.
- Third, some categories of information do not have exact print antecedents. Classical philologists can see from the emerging field of corpus linguistics a wide range of annotations relevant to their work. These range from basic categories such as co-reference (e.g., determining whether hic, "this person," refers back to Caesar or Antony in a particular passage) to more broadly interpretive categories such as labeling expressions about time and events (e.g., languages such as TimeML and Bruce Robertson's Historical Event Markup and Linking language). Even as we build up treebanks with core syntactic data we need to explore other categories of linguistic markup.
The Classical Greek and Latin Treebank Projects: Syntactic annotations record information about the relationships between the words: e.g., orationes in a given sentence is the object of dicit ("s/he speaks, says") and has the modifier ZZZ. Such annotations organize the words in a sentence into tree-like structures and can be collected into linguistic databases conventionally called Treebanks. These Treebanks can let us see phenomena such as the changing subjects and objects that a given verb takes over time, sentence structure (e.g., subject-verb-object vs. subject-object-verb), and the individual style of particular authors, genres and periods. Automated systems can analyze more than 90% of English sentences but these systems do so by learning from pre-existing Treebanks with a million or more words. For complex, stylistically idiosyncratic and relatively small classical texts, manual annotation would be necessary in any case, but such manual annotation allows us then to place our understanding of these texts on a fundamentally new, more explicit foundation.
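A dependency annotation of this kind can be pictured as a list of tokens, each pointing to its head. The sketch below is illustrative only: the three-word sentence is invented, and the relation labels SBJ, OBJ and PRED are assumed here rather than taken from the published annotation guidelines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    id: int        # position in the sentence, starting at 1
    form: str      # the word as it appears in the text
    head: int      # id of the governing word; 0 marks the root
    relation: str  # syntactic relation to the head (labels assumed)


# Invented example: "consul orationes dicit".
SENTENCE = [
    Token(1, "consul", 3, "SBJ"),
    Token(2, "orationes", 3, "OBJ"),
    Token(3, "dicit", 0, "PRED"),
]


def dependents(sentence: List[Token], head_id: int,
               relation: Optional[str] = None) -> List[Token]:
    """Tokens governed by head_id, optionally filtered by relation."""
    return [t for t in sentence
            if t.head == head_id and (relation is None or t.relation == relation)]


def word_order(sentence: List[Token]) -> Optional[str]:
    """Return a pattern such as 'SOV' or 'SVO', or None if the sentence
    lacks a root verb, a subject or an object."""
    root = next((t for t in sentence if t.head == 0), None)
    if root is None:
        return None
    subjects = dependents(sentence, root.id, "SBJ")
    objects = dependents(sentence, root.id, "OBJ")
    if not subjects or not objects:
        return None
    order = sorted([("S", subjects[0].id), ("O", objects[0].id), ("V", root.id)],
                   key=lambda pair: pair[1])
    return "".join(label for label, _ in order)
```

Counting the patterns that `word_order` returns across many annotated sentences is one simple way to compare sentence structure between authors or periods.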

In August 2008, we published the latest version of the Latin Treebank, which now includes more than 50,000 words. At the same time, we began work on what will be a 1,000,000 word Treebank for classical Greek.
Text/Data-mining and the Automated production of new knowledge: Once we have converted even the simplest print resources into machine actionable knowledge, we can use that knowledge to generate new knowledge. Consider the examples of traditional print indices and translations. Conversion of print indices involves, at the simplest level, identifying the headwords and citations. This amount of structure allows machines to see that there are, for example, six figures named Alexander in a given corpus and a list of passages where each separate Alexander appears. A named entity identification system can use machine learning algorithms to analyze the context in which the different Alexanders appear to predict the most likely Alexander to which other passages refer. Likewise, if we add basic citations to an English translation (i.e., this passage of English corresponds to the Greek in Thucydides, Book 1, chapter 86), then we can identify words and phrases in the English translation that correspond to the Greek: e.g., Latin orationes corresponds to the English word "speeches" in one passage but to "prayers" in another. We can then use machine learning algorithms to predict in passages where there is no English translation whether orationes more likely corresponds to "speeches" or "prayers." We can also begin to use these lower level conclusions (e.g., Antonius in passage X designates the famous Marc Antony the Triumvir who appears also in Shakespeare, orationes in passage Y corresponds to "speeches") to identify more patterns that indicate people, places, word meanings and other topics (e.g., what other people and places appear in conjunction with Antony? What other words in Latin and Greek correspond to the English word "prayer" in various periods and genres?) 
At this point, we have moved from patterns that human beings have already labeled (e.g., passage X describes Marc Antony while passage Y describes another particular Antonius), to inferences that human beings make hundreds or thousands of times a day but do not have time to record (e.g., when readers automatically distinguish references to Alexandria, Egypt, vs. Alexandria, VA), to patterns that no human being would see by simply reading through a text (e.g., a survey of the Latin and Greek terms corresponding to "prayer" that appear in texts containing hundreds of millions of words and written over two millennia).
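The step from labeled passages to predictions for unlabeled ones can be illustrated with a toy example for orationes. The technique here is a bare co-occurrence count, standing in for whatever machine learning algorithm is actually used, and the training contexts are invented for illustration rather than drawn from any aligned translation.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Invented training data: the English rendering of "orationes" in an aligned
# passage, paired with other Latin words that occur in the same passage.
TRAINING: List[Tuple[str, List[str]]] = [
    ("speeches", ["senatus", "consul", "forum"]),
    ("speeches", ["forum", "populus"]),
    ("prayers", ["deus", "templum"]),
    ("prayers", ["deus", "sacerdos", "ara"]),
]


def train(examples: Iterable[Tuple[str, List[str]]]) -> Dict[str, Counter]:
    """Count how often each context word co-occurs with each sense."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sense, context in examples:
        counts[sense].update(context)
    return counts


def predict(counts: Dict[str, Counter], context: List[str]) -> Optional[str]:
    """Pick the sense whose training contexts share the most words with the
    new passage; return None when nothing overlaps."""
    scores = {sense: sum(seen[w] for w in context) for sense, seen in counts.items()}
    best = max(scores, key=lambda sense: scores[sense])
    return best if scores[best] > 0 else None


MODEL = train(TRAINING)
```

The same pattern applies to the named entity case: replace the senses with the candidate Alexanders and the training data with passages drawn from a converted print index.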
Adapting linguistic and cultural information for particular readers: Once we begin assembling large bodies of information, we need methods to provide individual users with information adapted to their general backgrounds and their immediate purposes. There are two ways in which to adapt large bodies of information. Personalization compares the behavior of a given user against that of previous users to suggest actions of interest (e.g., people who bought book X also bought books Y and Z). Early experiments showed that similar techniques were applicable for readers of Greek and Latin: once readers ask about four words from a particular passage, we can predict two thirds of the other words about which they will have questions.

Our work in 2008-09 focuses primarily upon customized reading support. Customization follows directions from the user (e.g., a user created profile that requests all new information about Pericles' Funeral Oration). Our work focuses upon customized vocabulary profiles in which we have digitized the vocabularies from textbooks of Greek, Latin and Arabic. We want to be able to answer two basic questions: first, we want readers to be able to identify words that they have not yet encountered in a given chunk of text and then to rank the unseen words according to various criteria of significance; second, we want to be able to find passages that best match the existing vocabulary of a particular reader.
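Both questions can be sketched with a reader's vocabulary held as a set of dictionary forms. The sketch assumes that passages have already been reduced to dictionary forms, and it ranks unseen words by frequency within the passage, which is only one of the possible criteria of significance; all names and sample data are illustrative.

```python
from collections import Counter
from typing import Dict, List, Set


def unseen_words(passage: List[str], known: Set[str]) -> List[str]:
    """Words the reader has not yet met, most frequent in the passage first
    (ties broken alphabetically)."""
    counts = Counter(w for w in passage if w not in known)
    return [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def coverage(passage: List[str], known: Set[str]) -> float:
    """Fraction of running words in the passage that the reader already knows."""
    if not passage:
        return 0.0
    return sum(1 for w in passage if w in known) / len(passage)


def best_passage(passages: Dict[str, List[str]], known: Set[str]) -> str:
    """Name of the passage that best matches the reader's vocabulary."""
    return max(passages, key=lambda name: coverage(passages[name], known))


# Illustrative data: a reader who knows two words from a textbook list.
KNOWN = {"arma", "cano"}
PASSAGES = {
    "a": ["arma", "cano", "arma"],
    "b": ["arma", "vir", "que", "cano", "vir"],
}
```

A vocabulary profile digitized from a textbook would simply supply the `known` set, chapter by chapter.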
The Scaife Digital Library: Named after the late Ross Scaife, the Scaife Digital Library is being developed as a distributed collection and a method whereby humanists from around the world can automatically aggregate their content. The Scaife Digital Library contains durable objects that (1) have received peer review, (2) are in sustainable formats such as the EpiDoc TEI stylesheet, (3) have a long-term home such as an institutional repository separate from the producer of the object, and (4) are available under open licensing for third-party redistribution and/or further development.

All of the TEI-compliant XML texts already available for download from the Perseus Digital Library satisfy the conditions 1, 2, and 4. Placing these and other objects within the Tufts Digital Library will satisfy the third condition. We plan therefore to move as many Perseus objects as possible into the Tufts Digital Library, with a particular focus upon newly scanned image books and existing commentaries, lexica, encyclopedias and other materials not yet released under an open source license. Our goal at this stage is to provide basic identifiers that will allow users to retrieve these objects from the Tufts Digital Library.
Institutional Repositories for Advanced Humanities Content: The Scaife Digital Library addresses the problem of long-term preservation for particular objects but we need services as well with which to use the objects. Libraries have successfully maintained the products of intellectual labor for generations and have begun designing institutional repositories that can maintain digital content. These institutional repositories are, however, generally prepared to support very simple digital objects such as images and lightly structured journal articles. We are thus preparing to develop for one major institutional repository system, Fedora, the data models needed to support the more complex objects with which students of the three classical languages regularly work. These include the ability to extract reference articles (e.g., the entry on Alexander the Great in encyclopedia A), dictionary entries and particular word senses from machine readable dictionaries (e.g., word sense II.2.a for word X), and the text associated with canonical text citations (e.g., the Greek text and English translation for section 2, chapter 86, book 1 of Thucydides). To do this, we are starting to adapt the Perseus Digital Library system to work with Fedora as a backend system. The goal of this effort is ultimately to release a version of the Perseus Digital Library system that institutions can download as a turn-key solution for scholarly collections.
Grid-Enabled Open Services: The Perseus infrastructure has depended upon a traditional architecture where we apply programs stored on local servers to locally stored texts and other data. We are working with colleagues at Imperial College London to begin building a distributed architecture that works with services and collections from multiple sources. Such an architecture is designed to allow scholars and projects to create their own configurations, perhaps substituting one morphological analyzer for another or adding new modules for particular text mining or visualization functions. Such an architecture also allows us to tap into much greater computational resources, drawing upon services driven by grid computing and/or the services from internet giants such as Google.

Research Themes

Enabling undergraduate research
2500 years later in 2010
A comprehensive, open source, fourth-generation library of Greek and Latin editions
Focused collections on selected Greek and Latin authors
Scalable methods to identify, transcribe and automatically tag Greek and Latin
Fragmentary authors
From human-readable information to machine actionable knowledge
Born-digital knowledge bases
The Classical Greek and Latin Treebank Projects
Text/Data-mining and the Automated production of new knowledge
Adapting linguistic and cultural information for particular readers
The Scaife Digital Library
Institutional Repositories for Advanced Humanities Content
Grid-Enabled Open Services