Announcing a Greek Word Collocation Tool

September 9, 1999


The Perseus Project is pleased to announce the addition of a Greek word collocation tool to the Perseus digital library. This tool allows users to see the words that are likely to appear within five words of each other in Perseus Greek texts.

Collocation information of this sort can reveal common patterns of language usage. For example, in English, collocation data shows that the mutual information score for the words 'strong' and 'tea' is much higher than the score for 'powerful' and 'tea'. This suggests that it is much more common to speak of 'strong tea' than 'powerful tea'. Collocation data can also provide a quick overview of the sense in which an author uses a word. For example, if the most common collocates of the word 'bank' in a collection of texts were words such as 'water', 'shade', or 'cool', we would know that the author was probably writing about rivers rather than financial institutions.
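The announcement does not state the formula behind its scores. One common definition in corpus linguistics, offered here as an assumption rather than as the Perseus method, is the pointwise mutual information of two words x and y:

```latex
I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
```

Here P(x) and P(y) are the relative frequencies of each word in the corpus, and P(x, y) is the relative frequency with which the two occur near each other. Under this definition the score is positive when a pair co-occurs more often than independent occurrence would predict, which is the sense in which 'strong tea' would outscore 'powerful tea'.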

Collocation data is similarly informative for Greek texts. Just as in English, commonly used word pairs have a high mutual information score; the score for the Greek words agathos and kalos, for example, is quite high. Collocation data can also be used to determine the semantic range of a word in a Greek text. For example, the most common collocates of the word thuo are, as one might expect, the implements, objects, and personnel associated with sacrifice.

This collocation information is integrated with the Greek lexicon in Perseus. To see the collocation information for a word, simply look up that word in the Greek lexicon. A table will appear at the head of each dictionary entry showing the most common collocates for that word. You can view the dictionary entry for any collocate by clicking on it. A complete list of collocations and mutual information scores is available by following the links at the bottom of the table. The complete collocation table is also linked to a word search tool, allowing for quick study of the passages in which two words co-occur.

See, for example, the dictionary entries for kalos and thuo.

If you select an author in the Greek dictionary lookup tool, or if you link into the electronic lexicon while reading a Greek text in Perseus, an expanded version of the collocation table will appear in the dictionary entry, showing the most common collocations for that word in various corpora within Perseus. See, for example, the collocation data for the word kalos in Plato.

These lists of commonly co-occurring words are created by calculating a mutual information score for every Greek word pair in the Perseus corpus. A mutual information score is used in place of raw pair frequencies because the most frequent word pairs in any collection of texts are combinations involving the definite article and other function words. The higher the mutual information score, the more likely the two words are to appear together. The maximum mutual information score varies with the size of the corpus being considered. Thus, a mutual information score of 80 suggests a strong association in Greek prose, but only a medium association in the whole Perseus corpus. While we are currently investigating ways to gauge the significance of the results in different corpora more precisely, for now the scores should be used as a guide for locating potentially interesting word pairs.
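The general approach can be sketched in Python as follows. This is a minimal illustration, not the Perseus implementation: it assumes the text has already been split into a list of dictionary forms, it counts pairs that fall within five tokens of each other, and it scores them with log-base-2 pointwise mutual information. The function names `collocation_scores` and `top_collocates` and the parameters `window` and `min_pair_count` are hypothetical. The announcement cites scores as high as 80, so the Perseus scale evidently differs, and the values produced here are not comparable to those in the Perseus tables.

```python
import math
from collections import Counter


def collocation_scores(tokens, window=5, min_pair_count=1):
    """Score every pair of distinct words found within `window` tokens of each other.

    Returns a dict mapping an alphabetically sorted (word, word) tuple to a
    pointwise mutual information score. The scoring formula is an assumption.
    """
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter()

    # Count each co-occurrence once by looking only at the tokens that follow.
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pair_counts[tuple(sorted((left, right)))] += 1

    scores = {}
    for (a, b), joint in pair_counts.items():
        if joint < min_pair_count:
            continue  # optionally ignore pairs seen too rarely to trust
        p_joint = joint / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        scores[(a, b)] = math.log2(p_joint / (p_a * p_b))
    return scores


def top_collocates(scores, word, k=10):
    """Return up to `k` (collocate, score) tuples for `word`, highest score first."""
    related = [
        (b if a == word else a, score)
        for (a, b), score in scores.items()
        if word in (a, b)
    ]
    return sorted(related, key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy English data, used only to show the call pattern.
    toy = "the strong tea the powerful engine the strong tea the old man".split()
    for collocate, score in top_collocates(collocation_scores(toy, window=1), "tea"):
        print(collocate, round(score, 3))
```

Dividing by the individual word frequencies is what keeps the definite article and other function words from dominating the list: a word that occurs everywhere has a high P(x), which lowers the score of every pair it takes part in. Raising `min_pair_count` is one simple guard against the inflated scores that very rare pairs can receive in a small corpus.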

Although it appears that the same technique would yield interesting information about Latin texts, the Perseus Latin corpus is not large enough to produce statistically significant results. For this reason, similar collocation tables will not be available for Latin texts at this time.

This tool was created by Jeff Rydberg-Cox with the support of the National Endowment for the Humanities Division of Preservation and Access. We welcome any comments or suggestions about this tool at webmaster@perseus.tufts.edu.