Optical Character Recognition of 19th Century Polytonic Greek Texts

Results of A Preliminary Survey

Bruce Robertson, Dept. of Classics, Mount Allison University

2012-01-19

Abstract

This is a quantitative overview of a strategy for performing optical character recognition on text images comprising ancient Greek. We produced 22 different classifiers to conduct OCR on 19th-century ancient Greek texts from around the world. For each classifier, we processed 10 page images from 158 books. The output was scored for its 'Greekness' on phonetic and lexical grounds, and summarized in a table. In the majority of cases, the output of each text's highest-scoring classifier is of sufficient quality to be useful in further research and image-fronted search engines. There is a good correlation between the best classifier or group of classifiers and the publisher and publication date. This confirms the usefulness of our approach, and will simplify OCR of occasional Greek words in other texts by the same publishers. Better line-segmentation strategies will provide the greatest single improvement in this process.

Introduction

To date, polytonic, or ancient, Greek has not joined the revolution in reading and textual study brought about by libraries' participation in large-scale optical character recognition (OCR) projects like Google Books. There are good reasons for this. Ancient Greek comprises vowel accents and breathing marks that can easily confound an OCR engine, and over the years a great variety of font faces have been used to represent the language.

Greek ought not to be left behind in this, not only because many books were published primarily in Greek, but also, perhaps more importantly, because books in modern Western languages have, since the invention of the printing press, drawn on ancient Greek as an intellectual heritage. They quote Plato, Galen, Aeschylus and the Church Fathers to explore modern ideas. If OCR processes render these quotations as indecipherable misreadings, this particular web of meaning, tracing across languages and time, remains inaccessible. Conversely, even with so-called 'dirty' OCR output, such quotations could be automatically traced back to their source, and, in some cases, translations given to the Greekless reader.

Our 2009 Digging into Data project, Towards Dynamic Variorum Editions, aimed to make progress in mining these texts, extracting variant editions, citations, quotations, people and places from primary and secondary source texts in Classics. For ancient Greek, the first step in this process was to get large-scale OCR working on these primary and secondary source texts. Commercial OCR engines proved financially beyond our reach (as they insisted on charging per CPU), but in early 2011 a more suitable open-source framework came to our attention, and during the Spring and Summer of 2011 a team of undergraduates was assembled to work on Greek OCR.

A Large-Scale Survey

Preliminary results were sufficiently encouraging that we began to explore improvements in the algorithms underlying the OCR, and we produced an image-fronted text search web application based on the OCR output of about five books. However, it became clear that a quantitative and visual overview of the quality of OCR output was needed: first, to determine whether a best classifier can be assigned to each text; second, to predict which classifier would suit texts with occasional Greek; and, finally, to progress reliably in improving the process. This survey provides that overview.

Method

From the curated collection of about 500 Greek and Roman texts provided in high resolution by Google Books, we chose 158 by hand as being primarily written in polytonic Greek. Using the Gamera framework, a team of four undergraduate students generated 22 OCR classifiers for the different font families represented in these texts and in later, 20th century, Greek text series.

Our starting point was the Greek OCR application provided by Dalitz and Brandt, based on the Gamera document analysis framework. We altered it to include glyph splitting and grouping strategies and SQL output, and we tweaked its word-space generation.

(Classifiers and source code are available under GPL licences.)

Ten pages were randomly selected from each book and processed by Gamera using each of the classifiers. (In all, therefore, 34,760 pages of output were produced.) The results of each book-classifier pair were stripped of any recognized Latin text, and only the Greek output was then evaluated using Federico Boschetti's contextless Greek analyzer, which provides a score for the 'Greekness' of a text. (It is important to omit Latin text because a classifier ill-suited to a text might erroneously recognize Latin letters as Greek, and some of this output would falsely increase that classifier's score.)
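The size of the survey and the Latin-stripping step can be illustrated with a minimal sketch. It is not the code used in the survey: the function name `strip_latin`, the token-level filtering rule and the sample string are assumptions made only for illustration.

```python
import unicodedata

def strip_latin(text):
    """Drop every whitespace-separated token containing a Latin-script letter.

    Assumption: filtering whole tokens is one plausible reading of
    'stripped of any recognized Latin text'; the survey's own rule may differ.
    """
    kept = []
    for token in text.split():
        has_latin = any(
            ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")
            for ch in token
        )
        if not has_latin:
            kept.append(token)
    return " ".join(kept)

# Survey size as described in the text: books x sampled pages x classifiers.
BOOKS, PAGES_PER_BOOK, CLASSIFIERS = 158, 10, 22
total_pages = BOOKS * PAGES_PER_BOOK * CLASSIFIERS

# Hypothetical line of mixed OCR output.
sample = "τὸν λόγον ed. Bekker καὶ τὴν ἀρχήν"
greek_only = strip_latin(sample)
```

Here `total_pages` reproduces the figure of 34,760 given above, and `greek_only` keeps only the five Greek tokens of the sample, which is the kind of text that would then be passed to the Greek analyzer.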

This is a breadth-first survey of our OCR approach. No preprocessing was performed on the images, nor was the output 'cleaned' (except to convert certain characters to forms that are expected by Boschetti's program). Beyond the addition of splitting and grouping strategies, basic Gamera functions were used in standard OCR processes like line-splitting. Thus this survey is meant as a baseline from which further improvements can be measured across a wide variety of 19th century Greek texts.

Results

The results are summarized in a table. For each processed book it lists: bibliographic information provided by Google Books; the name of the book's highest-scoring classifier; that classifier's score (from 0 to 1); the mean score of all the classifiers; and a 'Z-score', the number of standard deviations by which the highest-scoring classifier scored above the mean, which is meant as a rough metric of excellence. It also provides a very small line chart (or 'sparkline') to summarize the classifiers' scores. The table can be sorted according to a column's values with a click on the column header.
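The per-book figures described here can be computed as in the following sketch. The function name `summarize` and the example scores are hypothetical, and the use of the population standard deviation is an assumption, since the survey does not state which form was used.

```python
from statistics import mean, pstdev

def summarize(scores):
    """Return (best classifier, its score, mean score, Z-score) for one book.

    `scores` maps classifier name -> 'Greekness' score between 0 and 1.
    Assumption: population standard deviation (pstdev) is used.
    """
    best_name = max(scores, key=scores.get)
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    # Guard against identical scores, where the Z-score is undefined.
    z = (scores[best_name] - mu) / sigma if sigma > 0 else 0.0
    return best_name, scores[best_name], mu, z

# Hypothetical scores for a single book (real books have 22 classifiers).
example = {"Oxford": 0.9, "New_Teubner": 0.5, "Littre": 0.4}
best, best_score, mean_score, z_score = summarize(example)
```

For these invented scores the best classifier is 'Oxford', the mean is 0.6, and the Z-score is roughly 1.39.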

A link on each line of the table leads to a more detailed page for each book. This page includes: a bar chart of classifier scores; thumbnails of the ten sampled pages with bounding boxes around the words as segmented by Gamera; and the output (including Latin glyphs) of the two highest-scoring classifiers. Clicking a page thumbnail opens a readable image of the page, which can be compared to the OCR output.

Observations

There is a good correlation between the best-scoring classifier and the publisher or publisher/date-of-publication. For instance, those texts published 'E typographeo academico' were best classified with the Oxford classifier. Teubner texts from the mid-19th century onward usually had the 'New_Teubner' classifier as their best scoring one; 'Early_Teubner' scored well with German texts from the early 19th century. The Littre classifier succeeded with French publishing houses like J. B. Baillière, Didot and Dumont. This is not always the case: some 19th century Teubner texts were published in a sans-serif font.

This is an important result: first, it is a sanity check for our approach to the problem, confirming that by targeting specific font families we can improve our overall results; secondly, it will simplify the automated OCR of texts outside this collection, especially those in which Greek appears only occasionally.

Low mean classifier scores and low best-classifier Z-scores both indicate that the line segmentation algorithm failed, producing out-of-sequence letters and nonsense output. The 1820 edition of Orion of Thebes' Etymologicon illustrates this: most of its 2nd through 10th sample pages are smudged by bounding boxes that wrongly bound words vertically around the page. This issue is the one that most seriously affects the survey's overall results.

There is much to be gained from combining the results of two or more high-scoring classifiers. ('High-scoring' here might be defined as having a Z-score over 2.0.) For instance, in processing Cramer (1841), the Oxford and Etymologicum classifiers could both contribute to a more accurate output, even though the former is clearly superior.
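Selecting the candidates for such a combination might look like the sketch below. It only picks out the classifiers whose Z-score exceeds the threshold; how their outputs would then be merged is left open, as it is in this survey. The function name `high_scoring` and all of the scores are hypothetical, and the population standard deviation is again an assumption.

```python
from statistics import mean, pstdev

def high_scoring(scores, threshold=2.0):
    """List classifiers whose Z-score exceeds `threshold`, best first.

    `scores` maps classifier name -> score for one book.
    Assumption: Z-scores are taken against the population standard deviation.
    """
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        return []  # all classifiers scored identically; none stands out
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if (scores[name] - mu) / sigma > threshold]

# Invented scores for one book: two strong classifiers and 20 weak ones,
# giving 22 classifiers in total as in the survey.
scores = {"Oxford": 0.90, "Etymologicum": 0.80}
scores.update({f"other_{i}": 0.20 for i in range(20)})
candidates = high_scoring(scores)
```

With these invented numbers both 'Oxford' and 'Etymologicum' clear the 2.0 threshold, so both would be retained as inputs to a combined output.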

Future Work

One (possibly, two) undergraduates will work with me on this project through the Summer of 2012, and, based on these data, we will undertake the following tasks and improvements:

During my sabbatical in 2012/13, I anticipate undertaking the following projects with these data:

Acknowledgments

This work was funded by a SSHRC grant under the Digging into Data Challenge. The following Mount Allison Classics undergraduates produced the classifiers used in this project: Halcyon Avrill, Chelsea Green and Alexander Kirby. Emily Wilson, a Mount Allison Computing Science student now in graduate studies at UNB, experimented with the Gamera page segmentation algorithms, work that provides a clear path forward for improving that aspect of the project.

Various people involved in the Dynamic Variorum Editions DiD Challenge project inspired and guided this work, especially my fellow awardees, Greg Crane of Tufts University, and John Darlington and Brian Fuchs of University College, London. The expertise and code of Federico Boschetti, now at the Institute of Computational Linguistics (ILC) in Pisa, were clearly invaluable.

AceNET provided the high-performance computing for this survey. AceNET's Sergiy Khan gave kind technical support.

I'm grateful to Google Books for making these texts and the related metadata easily accessible. This project is quite clearly powered by Google.

Feedback

Comments, suggestions and observations should be sent to Bruce Robertson at brobertson@mta.ca. (Please use the Google Book code to refer to specific texts.)