Mining a Million Scanned Books: Linguistic and Structure Analysis, Fast Expanded Search, and Improved OCR

A collaborative project with UMass Amherst and the Internet Archive
National Science Foundation Award Number: IIS-0910165
Data-intensive Computing

Gregory Crane, PI
Perseus Project/Classics Department
134C Eaton Hall
Tufts University
Medford, MA 02155

David Bamman
Perseus Project
134C Eaton Hall
Tufts University
Medford, MA 02155

Project Summary

The Center for Intelligent Information Retrieval at UMass Amherst, the Perseus Digital Library Project at Tufts, and the Internet Archive are investigating large-scale information extraction and retrieval technologies for digitized book collections.

To provide effective analysis and search for scholars and the general public, and to handle the diversity and scale of these collections, this project focuses on improvements in seven interlocking technologies:

  • improved OCR accuracy through word spotting, creating probabilistic models using joint distributions of features, and building topic-specific language models across documents;
  • structural metadata extraction, to mine headers, chapters, tables of contents, and indices;
  • linguistic analysis and information extraction, to perform syntactic analysis and entity extraction on noisy OCR output;
  • inferred document relational structure, to mine citations, quotations, translations, and paraphrases;
  • latent topic modeling through time, to improve language modeling for OCR and retrieval, and to track the spread of ideas across periods and genres;
  • query expansion for relevance models, to improve relevance in information retrieval by offline pre-processing of document comparisons; and
  • interfaces for exploratory data analysis, to provide users of the document collection with efficient tools to update complex models of important entities, events, topics, and linguistic features.

When applied across large corpora, these technologies reinforce each other: improved topic modeling enables more targeted language models for OCR; extracting structural metadata improves citation analysis; and entity extraction improves topic modeling and query expansion.

The testbed for this project is the growing corpus of over one million open-access books from the Internet Archive.

Undergraduate students involved in this project:

  • John Frederick Owen
  • Erin Shanahan

Publications:

  • David Bamman, Alison Babeu, Gregory Crane. Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection. In Proceedings of the 10th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2010), pages 11-20, Australia: ACM Digital Library, 2010-06. (Full text)
  • Gregory Crane. Give us editors! Re-inventing the edition and re-thinking the humanities. In Online Humanities Scholarship: The Shape of Things to Come, University of Virginia: Mellon Foundation, 2010-03. (Full text)