Since we submitted our first pre-proposal for the Perseus Project in September 1985, we have received generous support from many sources. These include major support from the Annenberg/CPB Projects (which invested $2.5 million, allowing the project to begin planning and developing collections on classical Greece in 1987) and the Digital Library Initiative Phase 2 (which provided $2.8 million in 1998 and allowed us to explore the issues of digital libraries for the humanities in general). The National Endowment for the Humanities, the National Science Foundation, the Institute for Museum and Library Services, the Fund for the Improvement of Postsecondary Education, the Department of Education, the Mellon Foundation, and the National Endowment for the Arts have all provided generous support. We also gratefully thank the private individuals who have supported our research over the years. For an overview of the research that we conducted, see here.

For an overview of the research that we are currently pursuing, see here. The following list gives the currently active grants that support that research, in the order in which they were funded.

  • A Reading Environment for Arabic: Department of Education: (2006: $432,000): This grant has allowed us first to extend the reading support infrastructure already available for Greek and Latin to Arabic as well, thus creating a much larger potential community to support the same underlying infrastructure. At the same time, this grant is also allowing us to improve our ability to provide customized vocabulary support: given a reader familiar with vocabulary through, for example, unit six of volume 2 of the Al-Kitaab Arabic textbook, what words in a given chunk of Arabic are new? What chunks of Arabic would best match that reader's current vocabulary? As an initial sample of Arabic texts, we will adapt a version of Arabic Wikipedia. This project involves the publication of an Arabic-enabled version of the Perseus Digital Library system and integration of that system with a Fedora Institutional Repository back end. A system that identifies Arabic, Greek and Latin on an arbitrary web page and then generates links to dictionaries in all three languages is available for download.

  • Scalable Named Entity Services for Classical Studies: National Endowment for the Humanities and the Institute for Museum and Library Services: (2007: $349,939): This extends work done on named entity analysis of 19th century American historical documents to publications about classical studies. The project has four main goals. First, we will produce a thesaurus of the most common proper names from Greco-Roman antiquity in Greek, Latin, English, French, German and Italian. Second, we will publish more fully tagged versions of Smith's dictionaries of Geography and of Biography, providing basic information for 20,000 people and 10,000 places. Third, we will aggregate as many print indices as possible into a single digital database of people and places in particular passages. Fourth, we will apply named entity analysis to a testbed of materials about Greco-Roman antiquity.

  • The Dynamic Lexicon: The National Endowment for the Humanities: (2008: $284,999). This project involves creating new reference works for Greek and Latin from a large collection of texts and structured knowledge sources (such as treebanks) within the cyberinfrastructure of a digital library. Built on the technologies of parallel text analysis (including word sense induction and disambiguation) and automatic syntactic parsing, these reference works will allow us to present the possible senses for any Greek or Latin word while also providing syntactic information and statistical data about its use in any collection of texts or any subset of that collection. We can show, for example, not simply how oratio is used in all of Latin literature, but how it is used within the works of Cicero alone (where it means "oration" or more generally the power of oratory) or the works of Jerome (where it means "prayer"), including quantified measures of its syntactic usage. These methods will also let users search a text not only by word form, but also by word sense, syntactic subcategorization and selectional preference.

  • Cybereditions: The Mellon Foundation: (2008: $471,000). This project explores the problem of mining high-value data for a demanding scholarly audience from the image books in emerging large digital collections. Our work begins where the services currently under development by libraries and by Internet giants such as Google end - we seek ways to bridge the gap between those general services and the services that we will need so that cyberinfrastructure can support scholars working with textual materials.

    We are building a workflow that leads from page image to actionable data. Humanists need access to the earliest phases of processing - we need to be able to define the page layouts of editions and commentaries and to recognize languages such as classical Greek for which general-purpose optical character recognition (OCR) engines provide little support. An application programming interface (API) that provides access to searching or other services does little good if the crucial data has already been lost in those earlier phases.

    This project will result in three basic deliverables. First, we will produce a testbed of image books with editions, commentaries and translations of the major classical authors that survive from antiquity, often in multiple editions. We will make this testbed freely available as a part of the Open Content Alliance (OCA). Second, we will provide documentation and evaluate methods for each stage of the workflow. Third, we will provide the code and data sets that we produce under a Creative Commons license. Data sets in this case will include the textual data that we have been able to extract, with automatically added markup. This markup will include automatically suggested corrections as well as original OCR output (allowing for flexible searching).

  • PhiloGrid: The National Endowment for the Humanities and the Joint Information Systems Committee: (2008: $240,000 shared evenly with Imperial College London). PhiloGrid, a collaboration of the Perseus Digital Library at Tufts University in the United States and the Internet Centre at Imperial College London in the UK, proposes to create an expandable, Grid-enabled, web service-driven virtual research environment for Greco-Roman antiquity based initially upon open-source texts and services from the Perseus Digital Library. First, we will add to the Perseus DL Greek historians whose works survive only in fragmentary form. This task goes beyond simple data entry: we will create the first major digital collection of fragmentary authors designed from the start to interact with multiple source editions. Second, we will create a repository of philological data about the Greco-Roman world seeded with twenty years' worth of Perseus materials. The objects that we create will include not only books but also every labeled object within each logical document. Third, we will convert the workflow that has evolved over the past ten years to process textual materials in Perseus into a grid-enabled workflow based on web services that can be applied to and customized for many collections. Although this project will concentrate upon the classics collections in the Perseus DL, the new workflows will also process non-classical Perseus content, and will thus from the start demonstrate their generality.

  • The Ancient Greek Treebank Project: Alpheios Project: (2008: $865,290). This project will enable us to create a treebank - a large collection of syntactically parsed sentences - for ca. one million words of Ancient Greek texts. Treebanks are fundamental datasets that provide not only reading support for students of Classical texts (for example, noting the subject of the sentence and which adjectives modify which nouns), but also the basic quantitative data on which to build larger linguistic and general philological arguments (see our call for *research opportunities*). The majority of the texts will consist of Homer, the tragedians and Plato, with selections from several other Classical authors as well. This work complements our ongoing work on creating a Latin treebank.

  • Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR: (2009). To provide effective analysis and search for scholars and the general public, and to handle the diversity and scale of collections of scanned books, this project focuses on improvements in seven interlocking technologies: improved OCR accuracy through word spotting, creating probabilistic models using joint distributions of features, and building topic-specific language models across documents; structural metadata extraction, to mine headers, chapters, tables of contents, and indices; linguistic analysis and information extraction, to perform syntactic analysis and entity extraction on noisy OCR output; inferred document relational structure, to mine citations, quotations, translations, and paraphrases; latent topic modeling through time, to improve language modeling for OCR and retrieval, and to track the spread of ideas across periods and genres; query expansion for relevance models, to improve relevance in information retrieval by offline pre-processing of document comparisons; and interfaces for exploratory data analysis, to provide users of the document collection with efficient tools to update complex models of important entities, events, topics, and linguistic features.
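
The customized vocabulary support described under the Reading Environment for Arabic grant asks two questions: which words in a chunk are new to a given reader, and which chunks best match that reader's current vocabulary. The following is a minimal sketch of one way to pose those questions, not the Perseus implementation. The function names (`tokenize`, `new_words`, `coverage`, `rank_chunks`) are hypothetical, the tokenizer is a naive stand-in for real morphological analysis, and the sample uses Latin word forms only to keep the example easy to read.

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: a naive word tokenizer on lowercased surface forms.
    # A real system would map each form to its dictionary entry first.
    return re.findall(r"\w+", text.lower())


def new_words(chunk: str, known: set[str]) -> list[str]:
    # Words in the chunk that the reader has not yet met,
    # in order of first appearance and without repeats.
    seen: set[str] = set()
    result: list[str] = []
    for token in tokenize(chunk):
        if token not in known and token not in seen:
            seen.add(token)
            result.append(token)
    return result


def coverage(chunk: str, known: set[str]) -> float:
    # Fraction of running words in the chunk that the reader already knows.
    tokens = tokenize(chunk)
    if not tokens:
        return 0.0
    return sum(token in known for token in tokens) / len(tokens)


def rank_chunks(chunks: list[str], known: set[str]) -> list[str]:
    # Chunks ordered from best to worst match for the reader's vocabulary.
    return sorted(chunks, key=lambda c: coverage(c, known), reverse=True)


# Hypothetical reader vocabulary, e.g. the words taught through some unit.
known_vocabulary = {"arma", "cano"}
print(new_words("Arma virumque cano, Troiae qui", known_vocabulary))
print(rank_chunks(["Troiae qui", "arma cano"], known_vocabulary))
```

Ranking by simple coverage is only one possible choice; a reading environment might instead prefer chunks with a small but nonzero share of new words, so that each passage introduces some new vocabulary.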