Perseus Vocabulary Tool Help

The Perseus Vocabulary Tool is designed to allow users to explore the vocabulary of the non-English texts in the Perseus Digital Library. Using the Vocabulary Tool you can select a set of documents or document sections and then view a list of all of the words that appear in that selection.


Setting Up Your List

There are two ways to access the Vocab Tool depending on what you are looking for:

Subsection of a Text: If you want the vocab for a particular section of a text, say Book 1 of the Iliad, then you would view that section on the text page, and click load in the Vocabulary Tool box. By default, this box will show you the top 50% of words in that section, sorted by weighted frequency. The box will provide a link to further refine how the vocab is sorted and displayed, as described below.

Full Text(s): If you want to view the vocab for a full text or multiple texts, you will need to access the Vocabulary Tool page. Here, you are presented with a selection box that shows all of the works in a language in a Perseus collection (see the figure above). Choose the documents for your vocabulary list in this box. Macintosh users can select more than one work by holding the Command key as they click, and Windows users can do the same by holding the Control key. You may select as many works as you like, though your results may be slow to load if you choose too many texts.

Sort Order: It is possible to select several ways to sort your list. Different sort orders are useful for different tasks.

  • An alphabetical listing will allow you to generate a traditional word list that you can use to help you study a text.
  • A list ordered by either weighted or maximum frequency will allow you to generate a list of the most common words in a text. This tells you which words are most important to learn in order to read the text effectively.
  • A list ordered by the key word score will provide an initial guide to the distinctive words in your selection of texts. Words with a high 'key word' score appear relatively often in your document selection but appear relatively infrequently in other documents in the collection in the Perseus Digital Library. The five or ten words with the highest key word scores are frequently the names of the important people, places, and concepts in your selection of works. See below for more information and detailed examples.

List Length: The tool also allows you to select the percentage of the words in a document that you want to include in your list. As with the sort orders, the different percentages are useful for different purposes. The vast majority of words in any text appear only once. If you are looking for a list that contains only the essential vocabulary for your selected texts, pick a lower percentage. If you want a comprehensive list, pick a higher percentage or the "all words" option. Selecting an alphabetical listing of words works best when displaying all words in your selection.

Output Formats: The vocabulary tool provides two different ways to format your output. You can choose a table that will provide attractive output in a web browser or an XML file that you can import into other software programs. Note that some browsers have problems displaying very large tables; if you are requesting a very long list, the XML file may work better.

The defaults for these features are to sort by weighted frequency and to display the top 50%. For a typical text this gives a list of 100 to 300 distinct words.
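
One plausible reading of the percentage options, and it is an assumption rather than documented behaviour, is that "top 50%" means the most frequent words that together account for 50% of the running words in the selection. The Python sketch below illustrates that reading only; the function name top_percent and the sample counts are hypothetical.

```python
from collections import Counter

def top_percent(counts, percent):
    """Return the most frequent words that together cover `percent` of all tokens.

    Assumption: 'top N%' is a cumulative-coverage cutoff; the tool's real rule may differ.
    """
    total = sum(counts.values())
    target = total * percent / 100.0
    selected, covered = [], 0
    # Walk words from most to least frequent until the target coverage is reached.
    for word, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered >= target:
            break
        selected.append(word)
        covered += n
    return selected

# Hypothetical word counts for a 100-word selection.
counts = Counter({"kai": 40, "de": 30, "men": 15, "logos": 10, "hapax": 5})
print(top_percent(counts, 50))
```

Under this reading, a small number of very common words is enough to reach a 50% cutoff, while words that appear only once are added only as the percentage approaches 100.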

Viewing the Results

After you make your selection, the system will calculate a custom vocabulary list for your documents.

If you selected works through the Vocabulary Tool page, several numbers will appear at the top of your vocabulary list to help you understand general characteristics about the vocabulary of your selection.

  • The number of words in your selection.
  • The number of unique words in your selection, or the number of words that will appear on your list if you select the "All Words" option.
  • A vocabulary density score, which is the ratio of the number of words in the document to the number of unique words in the document.

These three numbers are intended to help you understand the level of vocabulary complexity in your selection. A work with more complex vocabulary will have more unique words while a work with simpler vocabulary will have fewer unique words. The vocabulary density ratio provides a normalized measure of this same information. If the vocabulary density ratio is small, the vocabulary is more complex; as the number increases, the text becomes easier. Another way to think about this ratio is that it expresses the average number of words that you will encounter between one new word and the next.

Compare the word counts and vocabulary density scores for Aeschylus' Oresteia (a name for the trilogy of Agamemnon, Eumenides, Libation Bearers) and Xenophon's Anabasis. The Oresteia contains 19,707 words and 4,486 unique words with a vocabulary density score of 4.393. This means that, on average, about one out of every four words that a reader encounters will be new. On the other hand, Xenophon's Anabasis contains 57,183 words with 4,007 unique words, for a vocabulary density score of 14.271. The higher vocabulary density score suggests a much simpler vocabulary; on average only one in every fourteen words will be new. In fact, the Anabasis is almost three times as long as the Oresteia, but it contains fewer unique words (about nine-tenths as many).

Similarly, Livy's History, books 1-10, is 159,186 words long but contains only 7,446 unique words, so its vocabulary density is 21.379. Virgil's Aeneid, less than half as long (63,719 words), uses almost as many different words (6,677 of them), giving it a vocabulary density score of only 9.543. In other words, while Livy's vocabulary is larger than Virgil's, new words do not appear as frequently.
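
The vocabulary density scores quoted above can be checked with a few lines of Python. This is only an arithmetic sketch using the word counts given in the text; the function name vocabulary_density is chosen here for illustration.

```python
def vocabulary_density(total_words, unique_words):
    """Ratio of running words to unique words, as described above."""
    return total_words / unique_words

# (total words, unique words) as quoted in the text above
works = {
    "Aeschylus, Oresteia": (19707, 4486),
    "Xenophon, Anabasis": (57183, 4007),
    "Livy, History 1-10": (159186, 7446),
    "Virgil, Aeneid": (63719, 6677),
}

for title, (total, unique) in works.items():
    print(f"{title}: {vocabulary_density(total, unique):.3f}")
```

Rounded to three decimal places, the results agree with the scores given above: 4.393, 14.271, 21.379, and 9.543.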

The Vocabulary List: The vocabulary list will appear along with a series of numbers to give you information about each word in the context of your list. The actual contents of your list will vary based on the way that you customized the list and the sort order that you requested.

  • Count: The row number is supplied to help you keep your place in the table. The count appears on every tenth row.
  • Word: The words in your vocabulary list are linked to the Word Study Tool, from which you can get a short definition, a link to the full lexicon entry, and frequency information for this word in the corpus as a whole.
  • Minimum, Maximum, and Weighted Frequencies: These numbers give you a sense of how common a word is in a text. A more detailed description of the three different ways that we count words is provided below. The maximum frequency, as on other pages in Perseus, links to a search for that word.
  • Definition: The short definition provided is automatically extracted from various lexica in the Perseus collections. This definition is the one listed first in the dictionary entry for each word. Thus, the definition provided for words with multiple senses may not be entirely correct for the works that you have selected. If you would like to see the complete definition, you can look up the full definition in the dictionary using the Word Study Tool.
  • Lexicon Entries: This provides you with links to the entries in our various lexica for the particular word.
  • Key Term Score: As noted above, words with a high key term score appear relatively often in your selection of documents and relatively infrequently in the collection as a whole. These words are an automatically extracted variety of keyword that provides an initial guide to important people, places, and concepts in your selection. Frequently appearing words that provide less guidance about the contents of your selection will have a low key term score, and the least important words will have a score of 0.
    Note: The quality of the key words will vary based on the size and similarity of the works that you select. As with any automatic knowledge discovery procedure, these scores might provide an interesting guide to further exploration but they might not produce interesting or useful results for your selection of texts.
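
This page does not give the formula behind the key term score. Purely as an illustration of the idea (frequent in the selection, rare elsewhere, never below 0), the Python sketch below uses a simple log-ratio weighting. The formula, the function name key_word_score, and the sample counts are all assumptions and should not be read as the calculation Perseus performs.

```python
import math

def key_word_score(sel_count, sel_total, rest_count, rest_total):
    """Illustrative score: high when a word is frequent in the selection and rare elsewhere.

    Assumption: this log-ratio weighting is NOT the formula Perseus uses.
    """
    if sel_count == 0:
        return 0.0
    sel_rate = sel_count / sel_total
    # Add-one smoothing so words absent from the rest of the collection do not divide by zero.
    rest_rate = (rest_count + 1) / (rest_total + 1)
    # Words that are no more common in the selection than elsewhere bottom out at 0.
    return max(0.0, sel_rate * math.log(sel_rate / rest_rate))

# Hypothetical counts: a proper name concentrated in the selection vs. a common particle.
name_score = key_word_score(50, 10_000, 5, 1_000_000)
particle_score = key_word_score(500, 10_000, 60_000, 1_000_000)
print(name_score > particle_score)
```

In this sketch the proper name outscores the particle even though the particle occurs ten times as often in the selection, because the particle is just as common in the rest of the collection.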

Refining Your Word List: To the right of your vocabulary list, you will find controls to refine your sort order, change the number of words that your list contains, or, if you chose full texts, the option to select new works by language.

Things You Can Do with the Vocabulary Tool

The Vocabulary Tool is very versatile and it can be used in several ways to help you read a text in the Perseus Digital Library.

  • A Comprehensive Vocabulary List for a Work: If you want a comprehensive vocabulary list that you can consult as you read and review a text, you should select the text that you are trying to read in the select box. Use the alphabetical sort order and show all words for the list size. This will produce a comprehensive list of words in alphabetical order that you can annotate and consult easily as you are reading a text.
  • A List of Essential Words for an Author: If you want to improve your mastery of a particular Greek or Latin author, you should select all of the works by that author in the select box. Select weighted frequency as your sort order and top 40% or top 50% as your list size option. This will provide you with a list of 'essential words' that you should memorize to maximize your understanding of that author.
  • A List of Basic Words for Intermediate-Level Reading: If you are an intermediate-level student, beginning to read unadapted texts, select five or six texts of interest in the select box. Select weighted frequency as your sort order and top 50% or top 60% as your list size option. This will give you a sense of the most important words in the language; when you are familiar with these words, you can begin reading, confident that you will know half to two-thirds of the words on a typical page.
  • A List of Essential Words for a Comprehensive Greek or Latin Exam: If you are an advanced student preparing for comprehensive exams, select a large list of authors that are appropriate for the requirements of your exam in the language box. Select weighted frequency as your sort order and top 70% or top 80% as your list size option. This will provide you with a list of important words to help you prepare for your exam.
  • A List of Key Words for a Text: If you want a quick overview of the potentially important words and concepts in a text, select the text that interests you with a sort order of key word score and a list size of top 10%. This will provide a short list of potentially important words to be aware of as you read the text.

  • Word Frequency Tool (Greek or Latin): If you are searching for occurrences of specific Greek or Latin words, you may use the Word Frequency Tool. There are several options for displaying results. Sort Authors Alphabetically is the default option. Sort Authors by Type of Literature will sort results according to types such as comedy, history, tragedy, etc. Sort Authors by Date will list authors from the earliest to the latest, based on the best evidence we have for each author. Words in Author will sort results from the author with the most words in Perseus to the author with the fewest. Maximum Instances will sort results starting with the author who has the most possible instances; Minimum Instances reverses this list and starts with the fewest. Maximum Frequency/10K will sort the results from the highest relative frequency to the lowest; Minimum Frequency/10K reverses this list and begins with the lowest relative frequency.
    Why are there Maximum and Minimum Frequencies? Although Perseus can disambiguate the vast majority of Greek and Latin words, there are some forms which may be derived from more than one lexicon entry. (E.g. "flies" may be an instance of the verb "to fly" or the noun "fly", so Perseus would include it in the count for both words. On the other hand, there's no doubt that "sneezed" is a form of "to sneeze".) In cases where the maximum instances differ from the minimum, the maximum is the count of all possible occurrences of a given lemma, and the minimum is the count of the occurrences which the computer has disambiguated. So, all ambiguous forms are included in the maximum count and excluded from the minimum. This is also true of the relative frequency calculations.
    What is a Weighted Frequency? A weighted frequency indicates whether the actual frequency count for a word (if it could be determined) would be closer to the minimum or the maximum frequency score. The weighted frequency is determined by assigning a weight to each inflected form based on the number of possible dictionary forms from which the inflected form could be derived. For example, an unambiguous word would have a weight of 1, a word that could be derived from two dictionary headwords would receive a weight of 1/2, a word that could be from three different headwords is given a weight of 1/3, etc. The weighted frequency is calculated as the sum of the weights for each inflected form that appears in a text. If the weighted score is equal to the average of the minimum and maximum score, you know that the word is entirely ambiguous in all of its forms. On the other hand, if the minimum, maximum, and weighted scores are all the same, you know the word is entirely unambiguous in all of its forms. As the weighted score approaches the maximum score, it becomes more likely that the maximum count is closer to the actual count; the actual count would be greater than the weighted score and less than or equal to the maximum.
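
The three counts described above can be sketched in Python. The token list below is hypothetical and reuses the English "flies" and "sneezed" illustration; the function name frequencies and the data layout are assumptions made for this example, not a description of how Perseus stores its data.

```python
from fractions import Fraction

# Hypothetical data: each token in a text paired with the lemmas it could derive from.
tokens = [
    ("flies", ["fly (verb)", "fly (noun)"]),
    ("flew", ["fly (verb)"]),
    ("flies", ["fly (verb)", "fly (noun)"]),
    ("sneezed", ["sneeze"]),
]

def frequencies(tokens, lemma):
    """Return (minimum, maximum, weighted) counts for one lemma."""
    minimum = maximum = 0
    weighted = Fraction(0)
    for _form, candidates in tokens:
        if lemma not in candidates:
            continue
        maximum += 1                              # every possible occurrence
        if len(candidates) == 1:
            minimum += 1                          # only unambiguous occurrences
        weighted += Fraction(1, len(candidates))  # weight 1, 1/2, 1/3, ...
    return minimum, maximum, weighted

for lemma in ("fly (verb)", "fly (noun)", "sneeze"):
    lo, hi, w = frequencies(tokens, lemma)
    print(f"{lemma}: min={lo} max={hi} weighted={float(w)}")
```

For "fly (verb)" this gives a minimum of 1, a maximum of 3, and a weighted frequency of 2. For "fly (noun)", where every form is ambiguous between two headwords, the weighted frequency of 1 is the average of the minimum (0) and the maximum (2).
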
    Why use relative frequencies? Relative frequencies are based on occurrences of a given word per 10,000 words. For instance, in the case of the Greek verb pempô, Plutarch uses this verb 146 times, which is unimpressive compared with Xenophon's maximum of 350 times. Yet the corpus of Plutarch on-line in Perseus is about 107,000 words, compared with Xenophon's 312,000. So, the relative frequency in Plutarch is 13.67 at its maximum, compared with Xenophon's maximum of 11.21. When making comparisons between authors, it is most useful to know the relative frequency for a given word rather than the word count itself, since the sizes of the corpora vary.
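
The per-10,000 calculation can be reproduced as follows. This sketch uses the rounded corpus sizes quoted above, so its results differ slightly from the 13.67 and 11.21 reported from the exact corpus sizes; the function name per_10k is chosen here for illustration.

```python
def per_10k(count, corpus_size):
    """Occurrences of a word per 10,000 words of text."""
    return count * 10_000 / corpus_size

# Maximum counts for pempô and approximate corpus sizes, as quoted in the text.
plutarch = per_10k(146, 107_000)
xenophon = per_10k(350, 312_000)
print(f"Plutarch: {plutarch:.2f}, Xenophon: {xenophon:.2f}")
```

Although Xenophon has more than twice as many raw occurrences, Plutarch's relative frequency comes out higher, which is the point of normalizing by corpus size.
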
    revised 22 Feb, LMC